North America Data Center RCA March 5, 2018
Confidential Information, Disclaimer and Trade Marks
Introduction
This document serves as a Root Cause Analysis for the North America performance issues experienced by Ex Libris customers on March 5, 2018.
The goal of this document is to share our findings regarding the event, specify the root cause analysis, outline actions to be taken to solve the downtime event, as well as preventive measures Ex Libris is taking to avoid similar cases in future.
Event Timeline
Periodic service disruption were experienced by Ex Libris customers on the North America Data center during the following hours:
March 5th, 2018 from 12:25 PM until 12:43 PM US Central Time.
During the event, all services were unavailable for the environment.
Root Cause Analysis
Ex Libris Engineers investigated this event to determine the root cause analysis with the following results:
An incorrect network connection between the primary and redundant backbone switches, performed by mistake by our internet service provider (ISP), caused a network loop and service interruption for all inbound sessions. There was no data impact.
Technical Action Items and Preventive Measures
The Ex Libris ISP provider reported on the following corrective actions that were taken
- Data Center managers discussed with field technicians the importance of carefully reading tickets to identify wiring diagrams and follow them.
- The ISP provider’s written process was updated with a step to validate server ports before loading configurations on the backbone
Customer Communication
ExLibris is committed to providing customers with prompt and ongoing updates during Cloud events. Ongoing and prompt updates on service interruptions appear in the system status portal at this address: http://status.exlibrisgroup.com/