Alma APAC API Gateway RCA April 4, 2016
Confidential Information, Disclaimer and Trade Marks
Introduction
This document serves as a Root Cause Analysis for the Alma API Gateway service interruption experienced by Ex Libris customers on April 4, 2016. The goal of this document is to share our findings regarding the event, identify the root cause, outline the actions taken to resolve the downtime, and describe the preventive measures Ex Libris is taking to avoid similar cases in the future.
Event Timeline
Ex Libris customers using the Alma API gateway experienced service interruptions during the following periods:
April 4, 2016, from 7:30 AM until 10:30 AM Singapore time.
April 4, 2016, from 2:40 PM until 3:00 PM Singapore time.
During these periods, the Alma API gateway responded inconsistently and was unavailable at times.
Root Cause Analysis
Ex Libris engineers investigated this event to determine the root cause, with the following results:
During a planned activity in the Sunday maintenance window to improve the storage infrastructure of the API gateway in the APAC data center, a human error introduced a configuration error in the synchronization process between the two redundant databases that serve the API gateway. This configuration error caused temporary database synchronization issues, which made the application unavailable at times.
Technical Action Items and Preventive Measures
Ex Libris has taken the following actions and preventive measures to avoid such an occurrence in the future:
- To resolve the issue immediately, the service identified as problematic was shut down, allowing the redundant application service to take effect and the application to become available.
- After the event, a manual synchronization was triggered, which caused the second, shorter outage. Once the synchronization completed, the secondary application service became available again and regular service was fully restored.
- The configuration that caused the original event is being investigated to identify the cause of the fault.
- Monitoring is being investigated to allow faster and more accurate identification of a problematic situation such as the one that occurred in this case.
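As an illustration only, the kind of monitoring described in the last item could take the form of a periodic check that compares the state of the two redundant databases and raises an alert when they drift apart. The sketch below is a hypothetical example and not the Ex Libris implementation: the ReplicaStatus fields, the node names, the sequence counter, and the max_lag threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReplicaStatus:
    """Hypothetical snapshot of one database node's synchronization state."""
    name: str
    reachable: bool
    last_applied_seq: int  # assumed: a monotonically increasing change counter


def check_sync(primary: ReplicaStatus, secondary: ReplicaStatus,
               max_lag: int = 100) -> List[str]:
    """Return alert messages; an empty list means the pair looks healthy.

    max_lag is an assumed threshold, measured in applied changes.
    """
    alerts = []
    # Report any node that cannot be reached at all.
    for node in (primary, secondary):
        if not node.reachable:
            alerts.append(f"{node.name}: unreachable")
    # Compare positions only when both nodes answered.
    if primary.reachable and secondary.reachable:
        lag = abs(primary.last_applied_seq - secondary.last_applied_seq)
        if lag > max_lag:
            alerts.append(f"sync lag {lag} exceeds threshold {max_lag}")
    return alerts


if __name__ == "__main__":
    healthy = check_sync(ReplicaStatus("db-a", True, 1000),
                         ReplicaStatus("db-b", True, 990))
    drifted = check_sync(ReplicaStatus("db-a", True, 1000),
                         ReplicaStatus("db-b", True, 500))
    print(healthy)
    print(drifted)
```

A check of this kind, run on a schedule after each maintenance activity, would be one way to surface a synchronization problem before it affects the application.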
Customer Communication
Ex Libris is committed to providing customers with prompt and ongoing updates during Cloud events. Updates on service interruptions appear in the system status portal at this address: http://status.exlibrisgroup.com/