CampusM AP01 - RCA – February 27, March 5, March 6, 2018
Confidential Information, Disclaimer and Trade Marks
Introduction
This document serves as a Root Cause Analysis for the campusM service interruption experienced by Ex Libris customers on February 27th, March 5th & March 6th, 2018.
The goal of this document is to share our findings regarding the events, present the root cause, outline the actions taken to resolve the downtime, and describe the preventive measures Ex Libris is taking to avoid similar cases in the future.
Event Timeline
Service interruption was experienced by Ex Libris customers served by the campusM AP01 instance at the APAC Data Centre during the following hours:
February 27th, 2018 from 7:31 AM until 9:29 AM Singapore time zone – Service Disruption
March 5th, 2018 from 9:45 AM until 10:17 AM Singapore time zone – Service Disruption
March 6th, 2018 from 8:10 AM until 11:10 AM Singapore time zone – Performance Issues
Root Cause Analysis, Technical Action Items and Preventive Measures
Ex Libris engineers investigated these events to determine the root cause, with the following results:
During the first event on February 27th, the Ex Libris engineers discovered the root cause for the service disruption was a combination of two issues:
- Firewall issue
- Pileup on the thread pool causing its exhaustion
Both issues caused slowness in outgoing traffic that led to performance degradation and eventually complete system disruption.
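As a rough illustration of how slow outgoing traffic can exhaust a fixed-size thread pool, the following minimal sketch simulates requests that each hold one thread for the duration of an outbound call. The function name `simulate_pool` and all numbers are hypothetical; this is not the campusM implementation, only a model of the general mechanism.

```python
from typing import List


def simulate_pool(pool_size: int, arrivals_per_tick: int,
                  hold_ticks: int, ticks: int) -> List[int]:
    """Return the number of requests left waiting for a thread after each tick.

    Hypothetical model: every request occupies one thread for `hold_ticks`
    ticks, which stands in for the latency of its outgoing call.
    """
    busy: List[int] = []   # remaining ticks for each in-flight request
    queued = 0             # requests waiting for a free thread
    history: List[int] = []
    for _ in range(ticks):
        # Age in-flight requests and release the threads of finished ones.
        busy = [t - 1 for t in busy if t > 1]
        queued += arrivals_per_tick
        # Start as many waiting requests as there are free threads.
        started = min(pool_size - len(busy), queued)
        busy.extend([hold_ticks] * started)
        queued -= started
        history.append(queued)
    return history


# Same pool and same load; only the outbound latency differs.
healthy = simulate_pool(pool_size=10, arrivals_per_tick=2, hold_ticks=3, ticks=20)
slow = simulate_pool(pool_size=10, arrivals_per_tick=2, hold_ticks=10, ticks=20)
```

In this model the backlog stays at zero while outbound calls are fast, and grows without bound once each call holds its thread long enough that the arrival rate multiplied by the hold time exceeds the pool size.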
Following this event, Ex Libris implemented the following:
- Disabling the problematic configuration in the firewall that initiated the bug
- Removing the problematic configuration across all firewall devices
- Tripling the size of the application thread pool
The second event, on March 5th, showed that the thread pool change had not completely resolved the issue; it had only delayed the pileup of hanging threads. Ex Libris manually removed the hanging threads to restore the system to normal service.
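The reasoning behind "only delayed" can be shown with simple arithmetic. If threads hang at a steady rate and are never released, a larger pool extends the time until exhaustion but does not prevent it. The sketch below assumes a constant hang rate; the function name and the figures are hypothetical and are not taken from the incident data.

```python
import math


def minutes_until_exhaustion(pool_size: int, hung_threads_per_minute: float) -> float:
    """Time until every thread in the pool is hanging, assuming a constant
    hang rate and no recovery of hung threads (both are assumptions)."""
    if hung_threads_per_minute <= 0:
        return math.inf  # nothing hangs, so the pool is never exhausted
    return pool_size / hung_threads_per_minute


original = minutes_until_exhaustion(pool_size=100, hung_threads_per_minute=2)
tripled = minutes_until_exhaustion(pool_size=300, hung_threads_per_minute=2)
```

Under these assumptions, tripling the pool triples the time to exhaustion, but the outcome is the same unless hanging threads are also removed.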
Following this event, Ex Libris implemented the following:
- Adding a safety system that identifies and removes hanging threads
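A safety system of this kind can be sketched as a watchdog that records when each task starts and reaps any task that has been running longer than a threshold. The class `AgeWatchdog`, its method names, and the 30-second threshold are hypothetical and chosen for illustration; the actual Ex Libris mechanism is not described in this document. "Removal" is modelled here as dropping the task from the registry.

```python
import time
from typing import Callable, Dict, List


class AgeWatchdog:
    """Hypothetical watchdog: treats any task older than max_age_s as hanging."""

    def __init__(self, max_age_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_age_s = max_age_s
        self.clock = clock                      # injectable for testing
        self._started: Dict[str, float] = {}    # task id -> start time

    def task_started(self, task_id: str) -> None:
        self._started[task_id] = self.clock()

    def task_finished(self, task_id: str) -> None:
        self._started.pop(task_id, None)

    def reap(self) -> List[str]:
        """Remove and return the ids of tasks running longer than max_age_s."""
        now = self.clock()
        stale = [tid for tid, t0 in self._started.items()
                 if now - t0 > self.max_age_s]
        for tid in stale:
            del self._started[tid]
        return stale
```

A purely age-based rule like this one is simple, but it cannot tell a hung task from one that is legitimately slow, so the threshold has to be chosen with care.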
During the third event on March 6th, there was no complete service disruption (system down), but the application still suffered performance degradation. The Ex Libris engineers determined that the newly introduced safety system was, in some cases, generating false-positive alerts, which caused the removal of long-running sessions that were not actually hanging but simply taking longer to complete.
Following this event, Ex Libris implemented the following:
- Fine-tuning the safety system to eliminate false-positive alerts and triggers, so that it identifies and removes only hanging sessions
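One common way to separate hanging sessions from long-running ones is to judge a session by how long it has gone without making progress, rather than by its total age. The sketch below illustrates that idea only; this document does not state how the safety system was actually tuned, and the class `ProgressWatchdog`, the heartbeat mechanism, and the threshold are all assumptions.

```python
import time
from typing import Callable, Dict, List


class ProgressWatchdog:
    """Hypothetical watchdog: a session is hanging only if it has reported
    no progress for longer than max_idle_s, regardless of its total age."""

    def __init__(self, max_idle_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_idle_s = max_idle_s
        self.clock = clock                        # injectable for testing
        self._last_seen: Dict[str, float] = {}    # session id -> last heartbeat

    def heartbeat(self, session_id: str) -> None:
        """Called by a session whenever it starts or makes progress."""
        self._last_seen[session_id] = self.clock()

    def finished(self, session_id: str) -> None:
        self._last_seen.pop(session_id, None)

    def hanging(self) -> List[str]:
        """Return the ids of sessions that have been silent for too long."""
        now = self.clock()
        return [sid for sid, seen in self._last_seen.items()
                if now - seen > self.max_idle_s]
```

With this rule, a session that runs for several minutes but keeps reporting progress is left alone, while a session that goes silent is flagged once the idle threshold passes.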
Customer Communication
Ex Libris is committed to providing customers with prompt and ongoing updates during Cloud events. Updates on service interruptions appear in the system status portal at this address: http://status.exlibrisgroup.com/
These updates are automatically sent as emails to registered customers.
Publication History
Date | Publication History |
---|---|
March 8, 2018 | Initial Publication |