
    Indexing jobs should stop when critical errors occur

    • Article Type: General
    • Product: Aleph
    • Product Version: 16.02

    Description:
    The system administration enhancements group requested that indexing jobs stop as soon as a critical error occurs.

    Resolution:
    Failure detection for the batch indexing jobs was added in rep_ver 004942, described below. It has been decided that doing the same for the ue_01 process would need to be an enhancement request.

    rep_ver 004942
    Description: Job monitoring and failure detection in batches with parallel processing
    General: A job failure detection mechanism has been added. Whenever a failure is detected:

    - The cycles table entry currently being processed is changed to "!", and all unhandled ("-") entries become "*".
    - The job is suspended: once a "!" entry is detected, handling of all cycles table entries currently being processed is gradually terminated.
    - The relevant structured error message is written to the file <cycles_table_name>.err (e.g. p_manage_05.err) in the data scratch directory.

    The failure detection mechanism has been incorporated into the following batch processes:
    p_manage_01
    p_manage_02
    p_manage_05
    p_manage_07
    p_manage_12
    p_manage_16
    p_manage_17
    p_manage_27
    p_manage_32
    p_manage_35
    p_manage_102
    p_manage_103
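
    The status changes described above can be illustrated with a minimal sketch. It is not ALEPH code: the cycles table is modelled as a plain list of one-character status markers, the function names are hypothetical, and "+" is used only as a placeholder for entries that were already handled (the source does not name that marker).

```python
UNHANDLED = "-"             # entry not yet picked up by any process
FAILED = "!"                # entry on which the failure was detected
UNHANDLED_AT_FAILURE = "*"  # entry that was still unhandled when the failure occurred


def mark_failure(statuses, failed_index):
    """Return a new status list reflecting a failure at failed_index."""
    updated = []
    for index, status in enumerate(statuses):
        if index == failed_index:
            updated.append(FAILED)
        elif status == UNHANDLED:
            updated.append(UNHANDLED_AT_FAILURE)
        else:
            updated.append(status)  # any other marker is left untouched
    return updated


def should_stop(statuses):
    """A parallel process stops taking new entries once any entry is '!'."""
    return FAILED in statuses


# "+" is a hypothetical marker for entries that were already handled.
table = ["+", "+", "-", "-"]
after_failure = mark_failure(table, failed_index=1)
```

    In this model, each parallel process would call should_stop before taking its next entry, which corresponds to the gradual termination described above.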

    The original error messages can be viewed in the log files. Here is an example of such an error file:

    FAILURE Wed Jan 29 12:03:29 IST 2003
    ==================
    step 1 Cycle: 4
    Sort Failure
    -rw-rw-r-- 1 aleph exlibris 1163499 Jan 29 12:03 /exlibris/aleph/a16_1/usm01/scratch/manage_05_1_4
    Job Suspended !!!
    To identify this error in your log files use the command:
    grep "FAILURE Wed Jan 29 12:03:29 IST 2003" $data_scratch/*.log
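
    The lookup suggested by the error file can also be scripted. The following is a rough sketch, not part of ALEPH: it assumes the failure header always has the form shown in the example ("FAILURE" followed by a date-style timestamp), and the function names and the scratch directory argument are hypothetical.

```python
import glob
import os
import re

# Matches an error-entry header such as
# "FAILURE Wed Jan 29 12:03:29 IST 2003" (format assumed from the example).
FAILURE_HEADER = re.compile(
    r"FAILURE \w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \w+ \d{4}"
)


def failure_headers(err_text):
    """Return each distinct failure header in the .err text, in order.

    The header appears twice per entry (once at the top, once inside the
    suggested grep command), so duplicates are dropped.
    """
    seen = []
    for header in FAILURE_HEADER.findall(err_text):
        if header not in seen:
            seen.append(header)
    return seen


def logs_containing(header, scratch_dir):
    """Return the sorted *.log files in scratch_dir that contain header."""
    matches = []
    for path in sorted(glob.glob(os.path.join(scratch_dir, "*.log"))):
        with open(path, errors="replace") as handle:
            if header in handle.read():
                matches.append(path)
    return matches
```

    A typical use would be to read p_manage_05.err, pass its text to failure_headers, and then call logs_containing for each header with the library's data scratch directory.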

    If the library was locked at the beginning of the job, it remains locked when the job is suspended.

    The main checks cover the following issues:

    - Failure in record retrieval or writing to an output file by an ALEPH program.
    - File sorting failure for various reasons (e.g. disk space problem)
    - Failure in the Oracle load. The following are checked:
    - - The number of records successfully loaded into the database relative to the number of records in the sequential file:
    - - - Equal - No errors
    - - - Less than 99% - Error and job suspension
    - - - Greater than 99% (but not equal) - Warning message without job suspension
    - - No fatal Oracle errors were detected during the load
    - - All table indexes were updated and are stable after the load
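
    The record-count rule in the Oracle load check can be written out as a small sketch. The function name and return values are hypothetical. The source does not say how a load of exactly 99% is treated, so this sketch assumes it produces a warning; counts outside the range 0 to the expected number are rejected, which is also an assumption.

```python
def classify_load(loaded, expected):
    """Classify an Oracle load by comparing loaded and expected record counts.

    Returns "ok", "warning" (job continues) or "error" (job is suspended).
    """
    if expected <= 0:
        raise ValueError("expected must be a positive record count")
    if loaded < 0 or loaded > expected:
        raise ValueError("loaded must be between 0 and expected")
    if loaded == expected:
        return "ok"
    # Integer comparison avoids floating-point rounding: loaded/expected < 0.99
    if loaded * 100 < expected * 99:
        return "error"
    # Assumption: exactly 99% is treated like "greater than 99%".
    return "warning"
```

    For example, with 1000 records in the sequential file, 995 loaded records give a warning and 980 give an error.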

    Keeping files for later analysis: To keep the files generated by each step (such as retrieval files or sequential files for loading), whether or not a failure occurred, define the following environment variable before running the job:

    setenv keep_tmp_files Y

    These files can later be used for detailed analysis of the job, especially after a failure.
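
    When a job is started from a wrapper script rather than from an interactive csh session, the same variable can be passed through the child process environment. This is only a sketch under assumptions: the helper names are hypothetical, the demo command is a stand-in that just prints the variable, and the actual command used to launch an ALEPH job is not shown here.

```python
import os
import subprocess
import sys


def env_keeping_tmp_files(base_env=None):
    """Return a copy of the environment with keep_tmp_files set to Y."""
    env = dict(os.environ if base_env is None else base_env)
    env["keep_tmp_files"] = "Y"
    return env


def run_with_kept_files(command):
    """Run command (a list of arguments) with keep_tmp_files=Y in its environment."""
    return subprocess.run(
        command,
        env=env_keeping_tmp_files(),
        capture_output=True,
        text=True,
        check=False,
    )


# Stand-in for a real job command: prints the variable as the child sees it.
DEMO_COMMAND = [
    sys.executable,
    "-c",
    "import os; print(os.environ['keep_tmp_files'])",
]
```

    The copy of the environment means the setting applies only to the launched job and does not change the environment of the calling script.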

    Additional Information

    indexing, jobs, critical errors


    • Article last edited: 10/8/2013