
    CampusM AP01 - RCA – February 27, March 5, March 6, 2018

    Confidential Information, Disclaimer and Trade Marks

    Introduction

    This document serves as a Root Cause Analysis for the campusM service interruption experienced by Ex Libris customers on February 27th, March 5th & March 6th, 2018.

    The goal of this document is to share our findings regarding the event, present the root cause, outline the actions taken to resolve it, and describe the preventive measures Ex Libris is taking to avoid similar cases in the future.

    Event Timeline

    Service interruption was experienced by Ex Libris customers served by the campusM AP01 instance at the APAC Data Centre during the following hours:

    February 27th, 2018 from 7:31 AM until 9:29 AM Singapore time zone – Service Disruption
    March 5th, 2018 from 9:45 AM until 10:17 AM Singapore time zone – Service Disruption
    March 6th, 2018 from 8:10 AM until 11:10 AM Singapore time zone – Performance Issues

    Root Cause Analysis, Technical Action Items and Preventive Measures

    Ex Libris engineers investigated this event to determine the root cause, with the following results:

    During the first event on February 27th, the Ex Libris engineers discovered the root cause for the service disruption was a combination of two issues:

    • A firewall issue
    • A pileup of threads in the application thread pool, which exhausted the pool

    Both issues slowed outgoing traffic, which led to performance degradation and eventually to a complete service disruption.
    Following this event, Ex Libris implemented the following:

    • Disabling the problematic configuration in the firewall that initiated the bug
    • Removing the problematic configuration across all firewall devices
    • Tripling the size of the application thread pool
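
    Why slower outgoing traffic can exhaust a thread pool, and why a larger pool may only postpone that, can be illustrated with Little's law: the average number of busy threads equals the request arrival rate multiplied by the time each request holds a thread. The sketch below is illustrative only. The function names and every number in it are hypothetical and are not taken from the campusM deployment.

```python
def threads_in_use(requests_per_second: int, millis_per_request: int) -> int:
    """Little's law: busy threads = arrival rate x time each request holds a thread."""
    # Ceiling division in integer arithmetic, to avoid float rounding.
    return -(-(requests_per_second * millis_per_request) // 1000)


def pool_is_exhausted(pool_size: int, requests_per_second: int, millis_per_request: int) -> bool:
    """True when steady-state demand exceeds the number of threads available."""
    return threads_in_use(requests_per_second, millis_per_request) > pool_size


# Hypothetical figures, chosen only to make the arithmetic easy to follow.
RATE = 50            # requests per second
NORMAL_MS = 200      # outgoing call latency when healthy
DEGRADED_MS = 2000   # outgoing call latency when slowed down
POOL = 20
TRIPLED_POOL = 3 * POOL

print(threads_in_use(RATE, NORMAL_MS))                     # 10
print(pool_is_exhausted(POOL, RATE, NORMAL_MS))            # False
print(pool_is_exhausted(POOL, RATE, DEGRADED_MS))          # True
print(pool_is_exhausted(TRIPLED_POOL, RATE, DEGRADED_MS))  # True
```

    In this toy model the tripled pool absorbs a slowdown of up to six times the normal latency, but not the ten-fold slowdown shown, and a thread that never returns cannot be absorbed by any finite pool.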

    The second event on March 5th showed that the thread pool change had not completely resolved the issue; it had only delayed the pileup of hanging threads. Ex Libris manually removed the hanging threads to restore the system to normal service.
    Following this event, Ex Libris implemented the following:

    • Adding a safety system that identifies and removes hanging threads
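
    The report does not describe how the safety system works internally. As a minimal sketch of the general idea, assuming an age-based rule, a watchdog pass can compare each in-flight request's start time against a threshold and drop those that exceed it. The names `InFlightRequest`, `find_hanging`, `sweep` and the 30-second threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class InFlightRequest:
    request_id: str
    started_at: float  # seconds, read from any monotonic clock


def find_hanging(requests, now, max_age_seconds):
    """Return the requests that have been running longer than max_age_seconds."""
    return [r for r in requests if now - r.started_at > max_age_seconds]


def sweep(requests, now, max_age_seconds):
    """One watchdog pass: split in-flight requests into (kept, removed)."""
    removed = find_hanging(requests, now, max_age_seconds)
    removed_ids = {r.request_id for r in removed}
    kept = [r for r in requests if r.request_id not in removed_ids]
    return kept, removed


# Hypothetical snapshot taken at t = 160 s with a 30 s threshold (an assumption).
in_flight = [
    InFlightRequest("a", started_at=100.0),  # running for 60 s
    InFlightRequest("b", started_at=158.0),  # running for 2 s
]
kept, removed = sweep(in_flight, now=160.0, max_age_seconds=30.0)
print([r.request_id for r in removed])  # ['a']
print([r.request_id for r in kept])     # ['b']
```

    A rule based purely on age is simple, but it cannot tell a stuck request from one that is legitimately slow, so the threshold has to be chosen with care.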

    During the third event on March 6th, there was no complete service disruption (system down), but the application still suffered performance degradation. The Ex Libris engineers determined that the safety system introduced after the second event produced false-positive alerts in some cases, causing the removal of long-running sessions that were not actually hanging but simply taking longer to complete.
    Following this event, Ex Libris implemented the following:

    • Fine-tuning the safety system to eliminate false-positive alerts and triggers, so that it identifies and removes only hanging sessions
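
    How the safety system was tuned is not stated in this report. One common way to separate a slow session from a hanging one, shown here purely as an assumption, is to track when each session last made progress and to flag only sessions that have been idle too long, regardless of their total age. The names `Session`, `is_hanging`, `select_for_removal` and the 60-second idle limit are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Session:
    session_id: str
    started_at: float        # seconds, when the session began
    last_progress_at: float  # seconds, last time the session did useful work


def is_hanging(session, now, max_idle_seconds):
    """A session is hanging if it has made no progress for max_idle_seconds."""
    return now - session.last_progress_at > max_idle_seconds


def select_for_removal(sessions, now, max_idle_seconds):
    """Return the ids of sessions that are idle, ignoring how old they are."""
    return [s.session_id for s in sessions if is_hanging(s, now, max_idle_seconds)]


# Hypothetical snapshot taken at t = 300 s with a 60 s idle limit (an assumption).
sessions = [
    Session("slow", started_at=0.0, last_progress_at=295.0),     # old but active
    Session("stuck", started_at=200.0, last_progress_at=200.0),  # idle for 100 s
]
print(select_for_removal(sessions, now=300.0, max_idle_seconds=60.0))  # ['stuck']
```

    In this example the "slow" session is 300 seconds old and would be removed by an age-only rule, yet it is kept because it reported progress 5 seconds ago.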

     

    Customer Communication 

     

    Ex Libris is committed to providing customers with prompt and ongoing updates during Cloud events. Updates on service interruptions are posted on the system status portal at this address: http://status.exlibrisgroup.com/

    These updates are automatically sent as emails to registered customers.

    Publication History 

     

    Date             Publication History
    March 8, 2018    Initial Publication