Alma EU00 Instance RCA: April 17 and May 4, 11, 13, 2016
Confidential Information, Disclaimer and Trade Marks
Introduction
This document serves as a Root Cause Analysis for the Alma service interruptions experienced by Ex Libris customers on April 17 and May 4, 11, and 13, 2016.
The goal of this document is to share our findings regarding these events, specify the root cause, and outline the actions taken to resolve the downtime, as well as the preventive measures Ex Libris is taking to avoid similar cases in the future.
Event Timeline
Service interruptions were experienced by Ex Libris customers served by the Alma EU00 instance at the Europe Data Center during the following hours:
April 17, 2016, from 11:32 AM until 11:58 AM Central European Time (CET).
May 4, 2016, from 5:50 PM until 5:55 PM Central European Time (CET).
May 11, 2016, from 8:29 AM until 8:52 AM Central European Time (CET).
May 13, 2016, from 2:31 PM until 2:39 PM Central European Time (CET).
During each event, the service was unavailable for the environment.
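For reference, the length of each interruption can be derived from the start and end times listed above. The following is a minimal sketch in Python. The names OUTAGES and outage_minutes are illustrative only, the times are taken as stated in the timeline, and the script is not part of any Ex Libris tooling.

```python
from datetime import datetime

# Outage windows copied from the timeline (local times as stated, 24-hour form).
OUTAGES = [
    ("2016-04-17 11:32", "2016-04-17 11:58"),
    ("2016-05-04 17:50", "2016-05-04 17:55"),
    ("2016-05-11 08:29", "2016-05-11 08:52"),
    ("2016-05-13 14:31", "2016-05-13 14:39"),
]

FMT = "%Y-%m-%d %H:%M"


def outage_minutes(start, end):
    """Return the length of one outage window in whole minutes."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


# Per-event durations and the combined downtime across all four events.
durations = [outage_minutes(start, end) for start, end in OUTAGES]
total_minutes = sum(durations)

print(durations, total_minutes)
```

Under these assumptions the four interruptions lasted 26, 5, 23, and 8 minutes, for a combined 62 minutes.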
Root Cause Analysis
Ex Libris Engineers investigated these events to determine the root cause, with the following results:
- The Database suffered a short network disconnect. The disconnection triggered a failover to the redundant database, which failed to start.
Technical Action Items and Preventive Measures
Ex Libris has taken the following actions and preventive measures to avoid such an occurrence in the future:
- Ex Libris Engineers are working closely with the Database vendor experts to identify the problematic trigger for the disconnection. Several configuration change options are being looked at and will be tested in the lab before being implemented in production.
- Following the later events and in consultation with the Database vendor experts, Ex Libris Engineers have upgraded the hardware of the server on which the Database resides.
- On the recommendation of the Database vendor experts, a configuration change has been made to increase the system's tolerance of network disconnects.
Customer Communication
Ex Libris is committed to providing customers with prompt and ongoing updates during Cloud events. Updates on service interruptions appear in the system status portal at this address: http://status.exlibrisgroup.com/