CN01 - RCA - March 9 - April 20, 2025
Introduction
This document serves as a Root Cause Analysis for the service interruption experienced by Ex Libris customers on Alma CN01.
The goal of this document is to share our findings regarding the event, identify the root cause, and outline the actions taken to resolve the downtime, as well as the preventive measures Ex Libris is taking to avoid similar cases in the future.
Affected Products
HEP CN01
Event Timeline
Ex Libris customers served by the Alma CN01 instance at the Chinese Data Center experienced a service interruption during the following period:
March 9, 2025 until April 21, 2025, 01:30 AM Beijing time.
During the event, services in the data center experienced intermittent performance issues.
Root Cause Analysis
Ex Libris engineers investigated this event to determine the root cause, with the following results:
On March 9, 2025, one of the services of the SFX product hosted in DC81 became unavailable to users in China. Performance degradation was observed across multiple applications within the data center, including Alma. Investigation revealed that the SFX service was generating an unusually high volume of outbound traffic.
To prevent further disruption to other critical applications, including HEP CN01, the affected SFX service was intentionally shut down on April 9, 2025.
The incident was caused by customizations made to the SFX service, which led to excessive network usage and overloaded the DC81 network. This may have been due to unintended use or misuse of the customized features.
We implemented a change to optimize the server responses, significantly reducing the amount of data sent over the network. This improvement has been validated and is permanent, ensuring better performance without any operational impact.
Technical Action Items and Preventive Measures
Ex Libris has taken the following actions and preventive measures to avoid such an occurrence in the future:
- Implement stricter code review and validation processes to ensure that any customizations do not negatively impact network performance.
- Bandwidth Monitoring: Enhance bandwidth monitoring to detect and address unusual traffic patterns promptly.
- Access Controls: Limit third-party access to the SFX knowledgebase to prevent unauthorized usage.
- The local China office will provide customers with guidelines and best practices for customizing the SFX service to avoid unintended consequences.
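As an illustration of the bandwidth monitoring measure above, the following is a minimal sketch of how unusual outbound traffic could be flagged against a rolling baseline. It is not the monitoring Ex Libris operates: the function name, the window size, the threshold, and the sample values are all hypothetical and chosen only to show the idea.

```python
from collections import deque
from statistics import mean, stdev


def find_traffic_spikes(samples_mbps, window=5, threshold=3.0):
    """Return indices of samples that exceed a rolling baseline.

    Hypothetical helper: a sample is flagged when it is more than
    `threshold` spreads above the mean of the last `window` normal samples.
    """
    history = deque(maxlen=window)  # most recent non-spike samples
    spikes = []
    for i, value in enumerate(samples_mbps):
        if len(history) == window:
            baseline = mean(history)
            # Floor the spread at 5% of the baseline so that a perfectly
            # flat history does not flag ordinary small fluctuations.
            spread = max(stdev(history), 0.05 * baseline)
            if value > baseline + threshold * spread:
                spikes.append(i)
                continue  # keep the spike out of the baseline
        history.append(value)
    return spikes


# Assumed sample data: outbound traffic in Mbps, one reading per interval.
outbound = [40, 42, 41, 39, 40, 43, 400, 410, 41]
print(find_traffic_spikes(outbound))  # prints [6, 7]
```

Keeping flagged samples out of the baseline is a deliberate choice in this sketch: a sustained surge would otherwise raise the baseline and stop being reported after a few intervals.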
Customer Communication
Ex Libris is committed to providing customers with prompt and ongoing updates during Cloud events. Updates on service interruptions appear in the system status portal at this address: http://status.exlibrisgroup.com/
These updates are automatically sent as emails to registered customers.

