"File locked" error; job being started before preceding job has finished

Article Type: General
Product: Aleph
Product Version: 20, 21, 22, 23

Description:
Certain p_cir_51 and p_cir_10 runs that should produce output do not. Some of the logs in $alephe_scratch end with this error:

I/O error : file 'TP1'
error code: 9/065 (ANS74), pc=0, call=1, seg=0
65 File locked

Looking at the timestamps of the files in $alephe_scratch, we also see that in certain cases a job is started before the preceding job for the same library (ABC50) has completed, even though the job_list entries have "Y" in column 4, indicating that they should be queued.

For instance, we see this:
abc50_p_cir_10.09209.dllaw_circ
Sat Oct 7 00:01:10 2006
42890 END READING AT 00:01:23

abc50_p_cir_10.09210.dlsxt_circ
Sat Oct 7 00:01:14 2006

I/O error : file 'TP1'
error code: 9/065 (ANS74), pc=0, call=1, seg=0
65 File locked

abc50_p_cir_10.09211.dlwkk_circ
Sat Oct 7 00:01:17 2006

I/O error : file 'TP1'
error code: 9/065 (ANS74), pc=0, call=1, seg=0
65 File locked

We see that abc50_p_cir_10.09209.dllaw_circ doesn't complete until 00:01:23, but abc50_p_cir_10.09210.dlsxt_circ starts at 00:01:14 and abc50_p_cir_10.09211.dlwkk_circ at 00:01:17; that is, they start before abc50_p_cir_10.09209.dllaw_circ has completed.
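
To find the affected runs quickly, the logs can be searched for the error text. The following is a minimal sketch, not an Ex Libris utility: the function name find_locked_logs is hypothetical, and it assumes the logs are plain-text files directly under the directory passed in (for example $alephe_scratch).

```shell
# Hypothetical helper: print the names of the files in a directory
# that contain the "65 File locked" run-time error line.
find_locked_logs() {
    dir="$1"
    # -l prints only the names of matching files; errors for
    # subdirectories or unreadable files are discarded.
    grep -l "65 File locked" "$dir"/* 2>/dev/null
}

# Usage (assumes $alephe_scratch is set in the Aleph environment):
#   find_locked_logs "$alephe_scratch"
```

The timestamps of the files it lists can then be compared with those of the preceding job for the same library, as in the example above.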

Resolution:

There are two situations:

1. Where the que batch (lib_batch) process was started as root

In this case the problem appears to have been caused by multiple lib_batch processes running because Aleph was started as the "root" user rather than as "aleph".

We saw this in util c/1:
root 19228 1 0 05:10:29 ? 1:26 /exlibris/aleph/a16_1/aleph/exe/rts32 ue_11_a ABC50.a16_1
root 17433 1 0 05:04:51 ? 7:45 /exlibris/aleph/a16_1/aleph/exe/rts32 ue_06_a ABC50.a16_1
root 15807 1 0 05:03:27 ? 0:00 /exlibris/aleph/a16_1/aleph/exe/lib_batch ABC50

The following command lists the Aleph processes on your system that are owned by root (because Aleph was started as root) and should not be:
ps -ef | grep root | grep aleph/exe
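
Note that "grep root" matches the string "root" anywhere in the line, not only in the owner column, so the grep command itself or a path containing "root" can also appear in the result. As a sketch of a stricter filter (the function name root_aleph_procs is hypothetical, and it assumes the usual "ps -ef" layout with the owner in the first column):

```shell
# Sketch: read `ps -ef` output on stdin and print only the lines whose
# owner column (field 1) is exactly "root" and which contain "aleph/exe".
root_aleph_procs() {
    awk '$1 == "root" && /aleph\/exe/'
}

# Usage:
#   ps -ef | root_aleph_procs
```

If this prints nothing, no root-owned Aleph executables are running.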

We suggest that you:

(1) run $alephe_root/aleph_shutdown;

(2) rerun the above command to verify that these root-owned Aleph processes have been stopped;

(3) change the ownership of the que_batch_lock file in each library's $data_files (from root to aleph);

(4) run $alephe_root/aleph_startup .
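
For step (3), the current owner of the lock file can be checked before and after the change. This is a minimal sketch only: lock_owner is a hypothetical helper name, and it assumes that $data_files is set for the library in question and that "ls -ld" prints the owner in the third column.

```shell
# Sketch: print the owner of a file (field 3 of `ls -ld` output).
lock_owner() {
    ls -ld "$1" | awk '{print $3}'
}

# Usage, in each library's environment (assumes $data_files is set):
#   lock_owner "$data_files/que_batch_lock"
# If this prints "root", change the owner (chown normally has to be
# run as root):
#   chown aleph "$data_files/que_batch_lock"
```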

2. Where the $aleph_files/que_batch_lock file was deleted and the system has allowed a second que_batch (lib_batch) to be started. (When que_batch_lock is present, the system does *not* permit a second lib_batch to be started; you get the message "lib_batch is already running".)

Check in util c/1 to see if there is more than one lib_batch process running for the same Aleph instance. (There should *not* be.)

Also check with "ps -ef | grep":

ps -ef | grep a22_1 | grep lib_batch | grep XXX50
<where "a22_1" is this specific (v22) aleph instance and "XXX50" is your adm library>

For example:

aleph@aleph-bib(a22_2) XXX50> ps -ef | grep a22_2 | grep lib_batch | grep XXX50
aleph 8497 1 0 2017 ? 00:12:33 /exlibris/aleph/a22_2/aleph/exe/lib_batch XXX50
aleph 16498 1 0 Aug11 ? 00:00:00 /exlibris/aleph/a22_2/aleph/exe/lib_batch XXX50

If so, kill both of them:

aleph@aleph-bib(a22_2) XXX50> kill -9 8497 16498

Then run util c/2 to restart the batch que (lib_batch) process.
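
The duplicate check above can also be scripted. The following is a sketch under stated assumptions, not an Aleph utility: count_lib_batch is a hypothetical name, and it assumes that each lib_batch process appears in "ps -ef" with the executable path as the second-to-last field and the library code as the last field, as in the sample output above.

```shell
# Sketch: read `ps -ef` output on stdin and count the lib_batch
# processes for the library code given as the first argument.
count_lib_batch() {
    awk -v lib="$1" '
        # Second-to-last field must be a path ending in /lib_batch,
        # last field must be the library code.
        NF >= 2 && $(NF-1) ~ /\/lib_batch$/ && $NF == lib { n++ }
        END { print n + 0 }
    '
}

# Usage:
#   ps -ef | grep a22_2 | count_lib_batch XXX50
```

A result greater than 1 corresponds to the duplicate lib_batch situation described in this section. Because the match requires a path ending in "/lib_batch", the grep commands in the pipeline are not counted.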

Article last edited: 13-Aug-2018