Summon - RCA - May 29, 2018
Introduction
This document serves as a Root Cause Analysis for the Summon service interruption experienced by Ex Libris customers on May 29, 2018.
The goal of this document is to share our findings regarding the event, specify the root cause, outline the actions taken to resolve the downtime, and describe the preventive measures Ex Libris is taking to avoid similar cases in the future.
Event Timeline
A service interruption was experienced by Ex Libris customers served by the Summon instance at the North America Data Center during the following hours:
May 29th, 2018 from 09:30 AM until 12:00 PM Chicago time.
During the event, the linking option from the link resolver did not work.
Root Cause Analysis
Ex Libris engineers investigated this event to determine the root cause, with the following results:
The May Summon release (May 10, 2018) included additional calls from the Summon Services API to the OBI Beacon server to log click events.
This new software increases the number of calls made from Summon to the Beacon to log usage activity.
Statistics and logging show that we encountered a period where the calls to the Beacon became slow (or unresponsive). The Summon software continued to make additional calls to the Beacon while previous calls had not yet completed. Eventually, the Summon servers ran out of available network sockets. Although these calls to the Beacon were performed on background threads, exhausting the available sockets affected other required network functionality and, eventually, end-user functionality. This resulted in slowness and errors when users tried to link to record full text.
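The failure mode described above, where new background calls keep starting while earlier ones are still pending, can be illustrated with a cap on in-flight calls. The following is a minimal sketch in Python and not the Summon implementation: the class name `BoundedEventLogger`, the `send` callable, and the `max_in_flight` parameter are all hypothetical. It assumes that dropping a click event is preferable to opening another connection when the logging server is slow.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class BoundedEventLogger:
    """Caps the number of in-flight logging calls and drops events beyond the cap."""

    def __init__(self, send, max_in_flight=4):
        # `send` is a hypothetical callable that performs the network call.
        self._send = send
        self._slots = threading.BoundedSemaphore(max_in_flight)
        self._pool = ThreadPoolExecutor(max_workers=max_in_flight)
        self._lock = threading.Lock()
        self.dropped = 0

    def log(self, event):
        # Non-blocking acquire: if every slot is busy, drop the event
        # instead of starting another call (and using another socket).
        if not self._slots.acquire(blocking=False):
            with self._lock:
                self.dropped += 1
            return False
        self._pool.submit(self._run, event)
        return True

    def _run(self, event):
        try:
            self._send(event)
        except Exception:
            pass  # a logging failure should never reach the end-user path
        finally:
            self._slots.release()  # free the slot once the call has finished

    def close(self):
        self._pool.shutdown(wait=True)
```

With a cap like this, a slow logging server costs at most `max_in_flight` connections, and the `dropped` counter gives a signal that can be monitored. In practice the `send` callable would also need its own network timeout so that a slot is eventually released.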
Technical Action Items and Preventive Measures
Ex Libris has taken the following actions and preventive measures to avoid such an occurrence in the future:
- The software was rolled back to a previous version to alleviate the issues for our clients while we research and test.
- R&D are currently enhancing the Summon software to handle this situation in a way that protects Summon end users, and researching how to prevent the slowness seen on the Summon/Beacon connection during this period, with a particular focus on the new Summon data center in DC01.
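One common way to protect end users from a slow dependency is a circuit breaker, which stops calling the dependency after repeated failures and retries only after a cool-down. The sketch below is an illustration under stated assumptions, not a description of the change R&D are making: `CircuitBreaker`, `log_click`, `failure_threshold`, and `reset_after` are hypothetical names, and the code is not thread-safe as written.

```python
import time


class CircuitBreaker:
    """Stops calls to a dependency after repeated failures (hypothetical sketch)."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds to wait before a trial call
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def allow(self):
        if self._opened_at is None:
            return True
        # After the cool-down, let a trial call through (half-open state).
        return self._clock() - self._opened_at >= self.reset_after

    def record_success(self):
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()  # open (or re-open) the circuit


def log_click(breaker, send, event):
    """Send an event unless the breaker is open; never raise to the caller."""
    if not breaker.allow():
        return False  # skipped: no network call is made
    try:
        send(event)
    except Exception:
        breaker.record_failure()
        return False
    breaker.record_success()
    return True
```

While the circuit is open, `log_click` returns immediately without touching the network, so a slow logging server cannot consume sockets needed for end-user requests.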
Customer Communication
Ex Libris is committed to providing customers with prompt and ongoing updates during Cloud events. Updates on service interruptions appear in the system status portal at this address: http://status.exlibrisgroup.com/
These updates are automatically sent as emails to registered customers.
Publication History
Date | Publication History |
---|---|
May 29, 2018 | Initial Publication |