
    CampusM EU01 - RCA - October 1st & 2nd, 2018

    Introduction

    This document serves as a Root Cause Analysis for the intermittent service interruption at the European Data Center experienced by Ex Libris customers on October 1st and October 2nd, 2018.

    The goal of this document is to share our findings regarding the event, specify the root cause, outline the actions taken to resolve the downtime event, and describe the preventive measures Ex Libris is taking to avoid similar cases in the future.

    Event Timeline

    A service interruption was experienced by Ex Libris campusM customers served by the European Data Center during the following hours:

    October 1st, 2018 from 9:45 AM until 6:27 PM Amsterdam time

    October 2nd, 2018 from 9:45 AM until 2:15 PM Amsterdam time

     

    During the event, intermittent performance issues were encountered on the campusM platform. The intermittent service interruption on October 1st will be counted as full system downtime in the Ex Libris availability reports.

     

    The following timeline provides a detailed description of the events, including the Ex Libris staff involved and the ongoing measures taken to mitigate the issues:

     

    October 1st, 2018

    • The event was first observed on October 1st at 08:00am CET. A “war room” was created, including engineers from our cloud and product teams as well as Ex Libris executive management members, including the Corporate VP for Technology and Platform, the Chief Operating Officer, and the VP of Cloud Engineering. It is important to note that the “war room” was active throughout the day and was also convened the following morning at 7:45am CET, as well as on October 3rd at 7:45am CET.
    • The immediate symptom was that the number of Apache threads in our cloud environment had started to increase toward the environment limit. This increase caused partial service outages, as some of the incoming requests hung.
    • It became evident that the Apache threads were slowing down because their access to the database was much slower than usual. However, the load on the database was not significant, nor were any long-running SQL statements observed.
    • Our network engineers identified an increase in packet loss, which shifted the focus to the network as the source of the problem. Unfortunately, this turned out to be a false lead and only further delayed mitigation of the root cause.
    • During the following three hours, a clear pattern of degradation was seen toward the end of each hour, and it was assumed to be the result of increased product usage causing a scalability issue.
    • At 12:00pm CET the cloud team started to scale up all our resources, including CPU, RAM, and the number of threads, to alleviate any environment resource scalability issues. This was done gradually to keep the service running in parallel.
    • At 2:00pm CET the resource ramp-up process was completed, but it had only a minimal effect on the service.
    • Based on the activities taken so far, as well as the recurring pattern, we concluded that this was not a standard scalability issue and added further engineers from the R&D team to our “war room”.
    • At 3:00pm CET a team of engineers focused on the slowness of the threads and found it to be related to the time it took to obtain a connection from the connection pool. This was consistent with the earlier evidence of database slowness as seen from the application, even though the database itself was not reaching any load capacity.
    • At this point we decided to increase the connection pool size dramatically; since it was close to the end of the day, the impact of this change could not be properly analyzed (see the illustrative sketch after this list).
    • To ensure the event would continue to be investigated, we proactively scheduled a “war room” meeting for October 2nd at 7:45am CET, prior to the start of the day.
    • It was also decided that if similar issues appeared, we would disable live tiles to alleviate the situation while we analyzed further.
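
    The RCA does not name the connection pool implementation or the exact parameters that were changed, so the following Java sketch is purely illustrative. It assumes a hypothetical Apache Commons DBCP2 BasicDataSource with placeholder values and shows the kind of sizing parameters involved: when the pool cannot hand out connections quickly enough, each request thread blocks while waiting for a connection, which matches the symptom of Apache threads piling up to the environment limit.

        // Hypothetical illustration only: the RCA does not disclose the actual pool
        // implementation or values. An Apache Commons DBCP2 BasicDataSource is assumed.
        import org.apache.commons.dbcp2.BasicDataSource;

        public class PoolSizingSketch {
            static BasicDataSource buildPool() {
                BasicDataSource ds = new BasicDataSource();
                ds.setUrl("jdbc:oracle:thin:@//db.example.internal:1521/campusm"); // placeholder URL
                ds.setUsername("app_user");   // placeholder credentials
                ds.setPassword("changeit");
                // If the pool cannot supply connections fast enough, each request thread
                // blocks in getConnection() for up to maxWaitMillis; those blocked threads
                // are what accumulate toward the Apache thread limit described above.
                ds.setMaxTotal(200);          // maximum open connections (placeholder value)
                ds.setMaxWaitMillis(10_000);  // how long a borrower waits for a free connection
                return ds;
            }
        }

    As the events of October 2nd showed, increasing the pool size alone (the equivalent of raising maxTotal here) did not resolve the problem, because pool exhaustion was not the actual bottleneck.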

     

    October 2nd, 2018

    • At 7:45am CET our engineers identified that the increase of the connection pool size did not have the desired effect. To ensure product stability and to allow students to keep using the product, it was decided to disable live tiles, which prevented further denial of service.
    • In parallel, our engineers continued to analyze the connection pool behavior and identified that the problem was not the size of the pool, but rather that the “keep alive idle” threshold was too low (see the illustrative sketch after this list). At 08:45am CET it was decided to change the setting on 2 nodes in order to analyze the impact of the change and observe how they behaved.
    • At 9:15am CET it was evident that the 2 nodes showed markedly better results, and our engineers started to align the rest of the campusM cloud environment to those definitions.
    • At 10:30am CET the alignment was completed on all other nodes, resolving the issue.
    • At 12:00pm CET, following several hours of further analysis, the live tiles were enabled again.
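
    The exact setting referred to above as the “keep alive idle” threshold is not specified in the RCA, so the sketch below is again a hypothetical illustration using a DBCP2-style pool and placeholder values. It shows the kind of idle-connection settings involved: if the idle threshold is too low, healthy idle connections are evicted aggressively and the pool must repeatedly re-open database connections, so obtaining a connection becomes slow even though the database itself carries little load.

        // Hypothetical illustration only; the actual property names and values used
        // in the campusM environment were not disclosed in this RCA.
        import org.apache.commons.dbcp2.BasicDataSource;

        public class IdleKeepAliveSketch {
            static void applyIdleKeepAliveSettings(BasicDataSource ds) {
                ds.setTimeBetweenEvictionRunsMillis(30_000);  // how often the idle-eviction thread runs
                ds.setMinEvictableIdleTimeMillis(300_000);    // raise the idle "keep alive" threshold so healthy
                                                              // idle connections are not torn down prematurely
                ds.setMinIdle(20);                            // keep a warm set of idle connections available
                ds.setTestWhileIdle(true);                    // validate idle connections instead of discarding them
                ds.setValidationQuery("SELECT 1 FROM DUAL");  // placeholder validation query
            }
        }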

     

    In addition, later in the afternoon there were two database failovers of 4 minutes each, which were a symptom of the connection pool changes. To fix this, our engineers made a further small adjustment to the settings.

    Root Cause Analysis

    Ex Libris engineers investigated this event to determine its root cause, with the following results:

    After a thorough investigation by Ex Libris experts, the root cause was found to be the configuration of the connection pool, in particular the keep-alive idle threshold setting.

     

    Technical Action Items and Preventive Measures

    Ex Libris has taken the following actions and preventive measures to avoid such an occurrence in the future:

    • Ex Libris engineers implemented the changes in settings across the campusM cloud environment
    • Ex Libris has commissioned an external expert to continue analyzing the environment and to explore means to further monitor and better configure the environment.

    Customer Communication

    Ex Libris is committed to providing customers with prompt and ongoing updates during cloud events. Updates on service interruptions appear in the system status portal at this address: http://status.exlibrisgroup.com/

    These updates are automatically sent as emails to registered customers.

    Publication History

    Date             Publication History
    October 9, 2018  Initial Publication