Alma Instances Amsterdam Data Center - RCA - January 28, 2020
Introduction
This document serves as a Root Cause Analysis for the service interruption experienced by Ex Libris customers.
The goal of this document is to share our findings regarding the event, identify the root cause, outline the actions taken to resolve the downtime, and describe the preventive measures Ex Libris is taking to avoid similar cases in the future.
Event Timeline
Service slowness was experienced by Ex Libris customers served by the Alma instances at the Amsterdam Data Center during the following hours:
January 28th, 2020 from 05:00 AM until 05:50 AM Amsterdam time.
During the event, the service was unavailable for the affected environments.
Root Cause Analysis
Ex Libris Engineers investigated this event to determine the root cause with the following results:
An issue in our load balancer (LB) software caused an unplanned service restart. Once the LB came back online, some of its network configurations were missing, causing downtime for several environments.
As an immediate action, Ex Libris network engineers failed over to the other LB in the cluster in order to bring the environments back up.
Technical Action Items and Preventive Measures
Ex Libris has taken the following actions and preventive measures to avoid such an occurrence in the future:
- Ex Libris network engineers failed over to the other LB in the cluster to bring the environments back up.
- The LB configurations were re-applied and failback was performed.
- Ex Libris engineers worked with the LB vendor to implement a workaround until a software release that fixes this bug is available.
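The failure mode described above, an LB returning from a restart with part of its network configuration missing, is the kind of condition a post-restart baseline comparison could detect. The following is a minimal sketch only, not the vendor workaround or Ex Libris tooling: the ConfigEntry model, the entry kinds, and the function names are all hypothetical, and it assumes the expected and running configurations have already been exported as simple lists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigEntry:
    """One network configuration item on a load balancer (hypothetical model)."""
    kind: str  # illustrative category, e.g. "vlan" or "route"
    name: str  # illustrative identifier of the item


def find_missing_entries(expected, running):
    """Return baseline entries that are absent from the running configuration."""
    running_set = set(running)
    missing = [entry for entry in expected if entry not in running_set]
    # Sort for stable, readable reporting.
    return sorted(missing, key=lambda entry: (entry.kind, entry.name))


def should_fail_over(expected, running):
    """Flag the LB as unfit to serve traffic if any baseline entry is missing."""
    return len(find_missing_entries(expected, running)) > 0


if __name__ == "__main__":
    # Made-up example data: the running config lost its route after a restart.
    baseline = [ConfigEntry("vlan", "vlan-100"), ConfigEntry("route", "default")]
    after_restart = [ConfigEntry("vlan", "vlan-100")]
    for entry in find_missing_entries(baseline, after_restart):
        print(f"missing {entry.kind}: {entry.name}")
    print("fail over:", should_fail_over(baseline, after_restart))
```

A check of this kind would run after any LB restart and before the LB is returned to service, so that missing configuration triggers a failover or a re-apply rather than customer-facing downtime.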
Customer Communication
Ex Libris is committed to providing customers with prompt and ongoing updates during Cloud events. Updates on service interruptions appear in the system status portal at this address: http://status.exlibrisgroup.com/
These updates are automatically sent as emails to registered customers.