Indexing jobs should stop when critical errors occur
- Article Type: General
- Product: Aleph
- Product Version: 16.02
Description:
The system administration enhancements group requested that indexing jobs stop when a critical error occurs.
Resolution:
The rep_ver below adds failure detection to the batch indexing jobs. It has been decided that adding the same behavior to the ue_01 process would require an enhancement request.
rep_ver 004942
Description: Job monitoring and failure detection in batches with parallel processing
General: A job failure detection mechanism has been added. Whenever a failure is detected:
- The cycles table entry currently being processed is changed to "!", and all unhandled ("-") entries are changed to "*".
- The job is suspended: once a "!" entry is detected, handling of all cycles table entries currently being processed is gradually terminated.
- The relevant structured error message is written to the file <cycles_table_name>.err (e.g. p_manage_05.err) in the data scratch directory.
The failure detection mechanism has been incorporated into the following batch processes:
p_manage_01
p_manage_02
p_manage_05
p_manage_07
p_manage_12
p_manage_16
p_manage_17
p_manage_27
p_manage_32
p_manage_35
p_manage_102
p_manage_103
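The status changes described above can be pictured with a small toy model. This is only a sketch: the real layout of the cycles table is not described in this article, and the "+" (finished) and ">" (in progress) markers, as well as the function names, are hypothetical and used purely for illustration. Only "-", "!" and "*" come from the text.

```python
# Toy model only: the real cycles table layout is not described in this article.
# Markers taken from the text: "-" unhandled, "!" failed, "*" unhandled at failure time.
# "+" (finished) and ">" (in progress) are hypothetical markers for illustration.

def mark_failure(statuses, failed_cycle):
    """Return a new status list after a failure in the entry at index failed_cycle."""
    updated = []
    for index, status in enumerate(statuses):
        if index == failed_cycle:
            updated.append("!")      # the entry being processed when the failure occurred
        elif status == "-":
            updated.append("*")      # unhandled entries will no longer be picked up
        else:
            updated.append(status)   # other entries keep their marker
    return updated


def is_suspended(statuses):
    """In this model a job counts as suspended once any entry is marked "!"."""
    return "!" in statuses


cycles = ["+", "+", ">", ">", "-", "-"]
after = mark_failure(cycles, failed_cycle=3)
print(after)                # ['+', '+', '>', '!', '*', '*']
print(is_suspended(after))  # True
```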
The original error messages can be viewed in the log files. Here is an example of such an error file:
FAILURE Wed Jan 29 12:03:29 IST 2003
==================
step 1 Cycle: 4
Sort Failure
-rw-rw-r-- 1 aleph exlibris 1163499 Jan 29 12:03 /exlibris/aleph/a16_1/usm01/scratch/manage_05_1_4
Job Suspended !!!
To identify this error in your log files use the command:
grep "FAILURE Wed Jan 29 12:03:29 IST 2003" $data_scratch/*.log
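The lookup suggested in the error file can also be scripted. The following is a minimal sketch, assuming the FAILURE marker always has the date layout shown in the example; the function names are hypothetical, and the search is a plain substring match equivalent to the grep command above.

```python
import glob
import os
import re

# Matches the marker as it appears in the example, e.g.
# "FAILURE Wed Jan 29 12:03:29 IST 2003". Other date layouts are not handled.
FAILURE_RE = re.compile(r"FAILURE \w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \w+ \d{4}")


def failure_markers(err_text):
    """Return every FAILURE marker found in the text of a .err file."""
    return FAILURE_RE.findall(err_text)


def find_in_logs(marker, scratch_dir):
    """Return (log file, line number) pairs for lines that contain the marker."""
    hits = []
    for path in sorted(glob.glob(os.path.join(scratch_dir, "*.log"))):
        with open(path, errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if marker in line:
                    hits.append((path, number))
    return hits


print(failure_markers("FAILURE Wed Jan 29 12:03:29 IST 2003 =================="))
# ['FAILURE Wed Jan 29 12:03:29 IST 2003']
```

To use it against a real job, read the text of the .err file, pass it to failure_markers, and call find_in_logs with each marker and the path of the data scratch directory.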
If a library was locked at the beginning of the job, it remains locked if the job is suspended.
The main checks cover the following issues:
- Failure in record retrieval or writing to an output file by an ALEPH program.
- File sorting failure for various reasons (e.g. disk space problem)
- Failure in Oracle load. The following are checked:
- - The number of records successfully loaded to the database in relation to the number of records in the sequential file:
- - - Equal - No errors
- - - Less than 99% - Error and job suspension
- - - Greater than 99% but not equal - Warning message without job suspension
- - No fatal Oracle errors were detected during the load
- - All table indexes were updated and are stable after the load
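The record-count rule in the list above can be written out as a small function. This is a sketch only, not the check ALEPH itself performs: the function name and return values are hypothetical, and because the text does not say what happens at exactly 99%, that case is treated as a warning here as an assumption.

```python
def classify_load(loaded, expected):
    """Classify a load by the share of sequential-file records that were loaded.

    Thresholds follow the article; a ratio of exactly 99% is not covered by
    the text and is treated as a warning here (an assumption).
    """
    if expected <= 0 or loaded < 0 or loaded > expected:
        raise ValueError("expected must be positive and 0 <= loaded <= expected")
    if loaded == expected:
        return "ok"          # equal: no errors
    if loaded * 100 < expected * 99:
        return "error"       # less than 99% loaded: error and job suspension
    return "warning"         # at least 99% but not all: warning, job continues


print(classify_load(1000, 1000))  # ok
print(classify_load(995, 1000))   # warning
print(classify_load(980, 1000))   # error
```

Integer arithmetic is used for the comparison so that the result near the threshold does not depend on floating-point rounding.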
Keeping files for later analysis: It is possible to keep the files generated by each step (such as retrieval files or sequential files for loading), regardless of whether there has been a failure or not, by defining the following environment variable before running the job:
setenv keep_tmp_files Y
These files may be later used for detailed analysis of the job, especially in cases of failure.
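If a job is started from a wrapper script rather than an interactive shell, the same variable can be set for the child process only. The sketch below assumes the job is launched as an ordinary command; the function name is hypothetical, and the command itself is left to the caller because this article does not describe how a given job is started.

```python
import os
import subprocess


def run_keeping_tmp_files(command):
    """Run command with keep_tmp_files=Y in its environment; return its exit code."""
    env = dict(os.environ)        # copy, so the caller's own environment is unchanged
    env["keep_tmp_files"] = "Y"   # same variable as `setenv keep_tmp_files Y`
    return subprocess.run(command, env=env).returncode
```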
Additional Information
indexing, jobs, critical errors
- Article last edited: 10/8/2013