
    COO Blog

     

     

    Yair Amsterdam

    SVP and GM

    Customer Experience and Operations

    ProQuest


    Published on: 23 April 2018

     

    Dear Colleagues,

    As SVP and GM responsible for ProQuest Customer Experience and Operations, I would like to apologize for the service interruptions in the Primo APAC01 and Alma AP01 environments that occurred between March 26th and April 16th, 2018. The three service interruptions detailed below are unrelated to one another; each resulted from its own unfortunate chain of events.

    It is our mission to provide excellent uptime, stability, and communication. Regrettably, a sequence of highly unusual usage patterns in these environments resulted in these service interruptions.

    All of our teams worked throughout these service interruptions, as part of our 24x7 operations, until full service was restored.

    We have launched our internal Root Cause Analysis and Lessons Learned processes and have posted the summary documents on the Ex Libris Knowledge Center. The documents can be found at: Primo > Product_Materials > RCA_Reports > APAC

    While we are proud of our overall cloud uptime and stability, I assure you that we will do everything we can to learn from these events and continue to improve our services and communication.  

    Below, I have included a high-level summary of what happened and additional technical details to answer any questions that you may have.

    If you have additional questions, please do not hesitate to contact me at Yair.Amsterdam@proquest.com.

    Sincerely,

    Yair Amsterdam,

    Senior VP and GM,

    Customer Experience and Operations,

    ProQuest

    Summary

    Primo APAC01 – Disruption dates of March 26th, April 3rd, and April 6th:

    An investigation by the Ex Libris engineering team determined that the issue was caused by a combination of high peak usage and specific search queries that consume large amounts of memory. As a result of the high memory consumption, search threads became stuck and were not handled properly, which made the entire system unavailable.

    This issue was the root cause of the events of March 26th, April 3rd, and April 6th.

    Because of the complex combination of triggers involved, it took Ex Libris engineering significant time to pinpoint the exact root cause, even though additional monitoring and debugging tools were added after the first event.

    We took a number of steps to prevent this type of failure going forward:

    • The memory allocation of both the physical machine and the search engine Java Virtual Machine (JVM) was adjusted to handle similarly unusual peak usage patterns.
    • We adjusted our proactive monitoring of resource usage to account for such high usage levels, so that alerts are raised early enough to prevent a recurrence of this kind of outage (see the sketch after this list).
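
    To illustrate the kind of proactive resource monitoring described in the second item above, here is a minimal sketch of an in-JVM heap-usage check built on the standard java.lang.management API. This is not Ex Libris' actual monitoring tooling; the 85% alert threshold and 30-second polling interval are assumed values chosen purely for illustration.

        import java.lang.management.ManagementFactory;
        import java.lang.management.MemoryMXBean;
        import java.lang.management.MemoryUsage;

        // Minimal heap-usage watchdog sketch. The threshold and polling interval
        // are illustrative assumptions, not Ex Libris production values.
        public class HeapUsageWatchdog {

            private static final double ALERT_THRESHOLD = 0.85;  // assumed: alert at 85% of max heap
            private static final long POLL_INTERVAL_MS = 30_000; // assumed: check every 30 seconds

            public static void main(String[] args) throws InterruptedException {
                MemoryMXBean memoryBean = ManagementFactory.getMemoryMXBean();
                while (true) {
                    MemoryUsage heap = memoryBean.getHeapMemoryUsage();
                    // getMax() assumes the JVM was started with an explicit maximum heap (-Xmx).
                    double usedRatio = (double) heap.getUsed() / heap.getMax();
                    if (usedRatio >= ALERT_THRESHOLD) {
                        // In a real deployment this would page the on-call team rather than print.
                        System.err.printf("ALERT: heap at %.0f%% of max (%d MB used)%n",
                                usedRatio * 100, heap.getUsed() / (1024 * 1024));
                    }
                    Thread.sleep(POLL_INTERVAL_MS);
                }
            }
        }

    A check of this kind, running alongside external infrastructure monitoring, is one way to surface sustained high memory consumption before search threads start to pile up.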

     

    Why did it take so long to restore service? 

    The failure impacted all of the search engine instances, and intervention was required to restart all of them and bring the service back online.

    What are you doing to improve customer communications?

    As part of our learning organization methodology, we are looking at improving both our communication and our transparency. Here are the conclusions from our internal lessons learned review and from customer feedback:

      1. The Ex Libris Cloud Services team needs to post more frequent updates on our system status page to reflect the effort being invested in restoring services.
      2. Ex Libris needs to set clearer customer expectations regarding our communication frequency and methods. To this end, we will share our "Service Disruption Communication Policy" with customers in the coming quarter. The policy will be published on the Customer Knowledge Center.

    Primo APAC01 – Disruption dates of April 9th and April 16th:

    An investigation by the Ex Libris engineering team determined that the issue was caused by a high load on the Oracle database, which led to a severe service interruption.

    The high load on the Oracle database was caused by a frequently executed, un-optimized application SQL query, combined with the increased size of the underlying Oracle table, which led to a change in the query execution plan.

    After the first event, we enabled advanced Oracle monitoring, which highlighted the problematic query.

    The Ex Libris DBA team added an index to the table, which eliminated the full table scan and reduced the load on the Oracle database.
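
    For readers less familiar with Oracle tuning, the sketch below illustrates, in general terms, how adding an index removes the need for a full table scan. It is not the actual Primo fix: the connection details, table name (SEARCH_LOG), and column names are hypothetical examples, and the real schema and index definition are not published here.

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.ResultSet;
        import java.sql.Statement;

        // Illustrative sketch only: connection string, table and column names
        // are hypothetical and do not reflect the real Primo schema.
        public class AddIndexSketch {
            public static void main(String[] args) throws Exception {
                try (Connection conn = DriverManager.getConnection(
                        "jdbc:oracle:thin:@//dbhost:1521/PRIMODB", "app_user", "secret");
                     Statement stmt = conn.createStatement()) {

                    // Index the column used in the frequent query's WHERE clause so that
                    // Oracle no longer needs a full table scan as the table grows.
                    stmt.execute("CREATE INDEX search_log_event_date_idx "
                               + "ON search_log (event_date)");

                    // Check that the optimizer now uses the index instead of a
                    // TABLE ACCESS FULL step in the execution plan.
                    stmt.execute("EXPLAIN PLAN FOR "
                               + "SELECT id, query_text FROM search_log "
                               + "WHERE event_date > SYSDATE - 1");
                    try (ResultSet rs = stmt.executeQuery(
                            "SELECT plan_table_output FROM TABLE(DBMS_XPLAN.DISPLAY())")) {
                        while (rs.next()) {
                            System.out.println(rs.getString(1));
                        }
                    }
                }
            }
        }

    The general idea is the same regardless of the specific table: once the filtered column is indexed and the optimizer uses it, the execution plan switches from a full table scan to an index scan, which is what reduces the database load.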

    This issue was the root cause of the events of April 9th and April 16th.

    We took a number of steps to prevent this type of failure going forward:

    • Performed an audit of the Oracle databases to make sure they are optimized and properly indexed.
    • Added the relevant index as part of the out-of-the-box Primo configuration.
    • Improved the cleanup mechanism for the relevant Oracle tables.

     

    Alma AP01 – Disruption date of April 16th:

    A severe load on the database server was caused by a defect in a frequently used SQL query. This resulted in a general unavailability of the entire system.

    Once the problematic SQL query was identified, database statistics were re-created to restore normal system behavior.
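
    As a rough illustration of what re-creating database statistics involves, the sketch below shows an Oracle DBMS_STATS call issued over JDBC. The connection details and the table name are hypothetical, and the actual procedure followed by the Ex Libris DBA team may differ.

        import java.sql.CallableStatement;
        import java.sql.Connection;
        import java.sql.DriverManager;

        // Illustrative sketch: regather optimizer statistics for one table so the
        // Oracle optimizer returns to a sensible execution plan. Connection details
        // and the table name are hypothetical.
        public class RegatherStatisticsSketch {
            public static void main(String[] args) throws Exception {
                try (Connection conn = DriverManager.getConnection(
                        "jdbc:oracle:thin:@//dbhost:1521/ALMADB", "app_user", "secret");
                     CallableStatement call = conn.prepareCall(
                         "BEGIN DBMS_STATS.GATHER_TABLE_STATS("
                         + "ownname => USER, "
                         + "tabname => 'FULFILLMENT_REQUESTS', " // hypothetical table name
                         + "cascade => TRUE); END;")) {          // also refresh index statistics
                    call.execute();
                }
            }
        }

    With fresh statistics in place, the optimizer stops relying on stale row-count estimates and returns to a normal execution plan, restoring normal system behavior until the application-level fix described below is deployed.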

    An application fix that addresses the root cause of this defect has been applied.


     

     
