"File locked" error; job being started before preceding job has finished
- Article Type: General
- Product: Aleph
- Product Version: 20, 21, 22, 23
Description:
Certain p_cir_51 and p_cir_10 runs which should produce output do not. Some of the $alephe_scratch logs have this error at the end:
I/O error : file 'TP1'
error code: 9/065 (ANS74), pc=0, call=1, seg=0
65 File locked
Looking at the timestamps of the files in $alephe_scratch, we also see that, although the job_list entries have "Y" in column 4 (indicating that they should be queued), in certain cases a job is being started before the preceding job for the same library (ABC50) has completed.
For instance, we see this:
abc50_p_cir_10.09209.dllaw_circ
Sat Oct 7 00:01:10 2006
42890 END READING AT 00:01:23
abc50_p_cir_10.09210.dlsxt_circ
Sat Oct 7 00:01:14 2006
I/O error : file 'TP1'
error code: 9/065 (ANS74), pc=0, call=1, seg=0
65 File locked
abc50_p_cir_10.09211.dlwkk_circ
Sat Oct 7 00:01:17 2006
I/O error : file 'TP1'
error code: 9/065 (ANS74), pc=0, call=1, seg=0
65 File locked
We see that abc50_p_cir_10.09209.dllaw_circ doesn't complete until 00:01:23, but abc50_p_cir_10.09210.dlsxt_circ starts at 00:01:14 and abc50_p_cir_10.09211.dlwkk_circ at 00:01:17; that is, they start before abc50_p_cir_10.09209.dllaw_circ has completed.
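To find the affected logs quickly, the scratch directory can be searched for the error text. The following is a minimal sketch, assuming the logs sit directly in $alephe_scratch and contain the line "65 File locked" exactly as shown above; the function name list_locked_logs is hypothetical and not part of Aleph.

```shell
# list_locked_logs DIR: print the names of the files in DIR that contain
# the "65 File locked" line. (Hypothetical helper; not an Aleph utility.)
list_locked_logs() {
    dir=$1
    # -l prints only the matching file names;
    # -s suppresses messages about unreadable files or subdirectories.
    grep -ls '65 File locked' "$dir"/*
}

# Usage: list_locked_logs "$alephe_scratch"
```

The start times of the listed logs can then be compared with the end time of the preceding job, as in the example above.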
Resolution:
There are two situations:
1. Where the que batch (lib_batch) process was started as root
It seems that this problem was caused by multiple lib_batch processes running because Aleph was started as the "root" user (rather than as "aleph").
We saw this in util c/1:
root 19228 1 0 05:10:29 ? 1:26 /exlibris/aleph/a16_1/aleph/exe/rts32 ue_11_a ABC50.a16_1
root 17433 1 0 05:04:51 ? 7:45 /exlibris/aleph/a16_1/aleph/exe/rts32 ue_06_a ABC50.a16_1
root 15807 1 0 05:03:27 ? 0:00 /exlibris/aleph/a16_1/aleph/exe/lib_batch ABC50
The following command lists the Aleph processes on your system that are owned by root (because Aleph was started as root) but should not be:
ps -ef | grep root | grep aleph/exe
We suggest that you:
(1) run $alephe_root/aleph_shutdown;
(2) run the above ps command again to verify that these root-owned Aleph processes have been stopped;
(3) change the ownership of the que_batch_lock file in each library's $data_files (from root to aleph);
(4) run $alephe_root/aleph_startup .
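Note that "grep root" matches any ps line that contains the string "root" anywhere, not only lines for processes owned by root. A slightly stricter check for step (2) might look like the sketch below, which assumes the usual "ps -ef" layout with the owner in the first column; the function name root_aleph_procs is hypothetical.

```shell
# root_aleph_procs: read `ps -ef` output on stdin and print only the lines
# whose owner (first column) is root and whose command contains aleph/exe.
# (Hypothetical helper; not an Aleph utility.)
root_aleph_procs() {
    awk '$1 == "root" && /aleph\/exe/'
}

# Usage: ps -ef | root_aleph_procs
```

If this prints nothing after aleph_shutdown, no root-owned Aleph processes remain.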
2. Where the $aleph_files/que_batch_lock file was deleted and the system has let a second que_batch (lib_batch) be started. (When que_batch_lock is present, the system does *not* permit a second lib_batch to be started: you get the message: "lib_batch is already running".)
Check in util c/1 to see if there is more than one lib_batch process running for the same Aleph instance. (There should *not* be.)
Also check with "ps -ef |grep":
ps -ef | grep a22_1 | grep lib_batch | grep XXX50
<where "a22_1" is this specific (v22) aleph instance and "XXX50" is your adm library>
For example:
aleph@aleph-bib(a22_2) XXX50> ps -ef | grep a22_2 | grep lib_batch | grep XXX50
aleph 8497 1 0 2017 ? 00:12:33 /exlibris/aleph/a22_2/aleph/exe/lib_batch XXX50
aleph 16498 1 0 Aug11 ? 00:00:00 /exlibris/aleph/a22_2/aleph/exe/lib_batch XXX50
If so, kill both of them:
aleph@aleph-bib(a22_2) XXX50> kill -9 8497 16498
Then use util c/2 to restart the que_batch (lib_batch) process.
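The duplicate check above can also be scripted. The following is a minimal sketch, assuming the "ps -ef" layout and the /exlibris/aleph/<instance>/aleph/exe/lib_batch path shown in the example output; the function name lib_batch_pids is hypothetical. It only reports PIDs and does not kill anything.

```shell
# lib_batch_pids INSTANCE LIBRARY: read `ps -ef` output on stdin and print
# the PID (second column) of each lib_batch process for that instance and
# library. (Hypothetical helper; not an Aleph utility.)
lib_batch_pids() {
    awk -v inst="$1" -v lib="$2" '
        # Match the instance-specific executable path, and require the
        # library code to be the last field on the line.
        index($0, "/" inst "/aleph/exe/lib_batch") && $NF == lib { print $2 }'
}

# Usage: ps -ef | lib_batch_pids a22_2 XXX50
```

More than one PID in the output means that more than one lib_batch is running for that library.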
- Article last edited: 13-Aug-2018