Batch queue won't shut down; lib_batch_log at 2 Gig; "lib_batch already running"
- Article Type: General
- Product: Aleph
- Product Version: 20
Description:
Since a v20 instance was installed on the server with our existing v17 instance, we have been having the following (batch queue) problems:
(1) aleph_shutdown does not complete. The log shows that it hangs while trying to stop the batch queue for the abc30 library.
(2) The abc30 UTIL-C-1 shows that some job is always running repeatedly;
(3) After killing this job and stopping the queue, when we try starting the abc30 batch queue (UTIL-C-2), there is no error message but it doesn't actually start either.
(4) If I scroll up from the screen where I ran UTIL-C-2, I find a message (which apparently flashed by very quickly):
/exlibris/aleph/u18_1/abc30/files/lib_batch_log
Filesize limit exceeded
"ls -lrt" in the ./abc30/files/ directory shows that the lib_batch_log is at 2 Gig (2147483647 bytes).
(5) When I connected to v17 and tried to start the abc50 batch queue (UTIL-C-2), I got the message "lib_batch already running for ABC50" -- even though UTIL-C-1 did *not* show any lib_batch running.
>> ps -ef | grep lib_batch | grep ABC50
showed that the only lib_batch ABC50 process running was the v20 process.
Thinking it might be confused, I killed this v20 process.
When I then did UTIL-C-2 for the v17 abc50, the (v17) lib_batch process started.
Now when I do
>> ps -ef | grep lib_batch | grep ABC50
I see the v17 lib_batch process but (of course) no v20.
When I do UTIL-C-2 in v20 to try to start the lib_batch it says "lib_batch already running for ABC50" -- even though the only ABC50 lib_batch running is the v17.
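The size reported in item (4), 2147483647 bytes, is 2^31 - 1, the largest value a signed 32-bit file offset can hold. That is consistent with the "Filesize limit exceeded" message. A minimal sketch for checking whether a log has reached that size follows. The function name log_at_limit and its optional second argument are hypothetical helpers for illustration, not part of Aleph.

```shell
#!/bin/sh
# Largest size a signed 32-bit file offset can represent (2^31 - 1 bytes).
LIMIT_32BIT=2147483647

# log_at_limit FILE [LIMIT]
# Hypothetical helper: prints AT_LIMIT, OK, or MISSING for FILE.
# LIMIT defaults to the 32-bit boundary; the override exists only for testing.
log_at_limit() {
    file=$1
    limit=${2:-$LIMIT_32BIT}
    [ -f "$file" ] || { echo "MISSING $file"; return 2; }
    # Arithmetic expansion strips any padding that wc may print.
    size=$(( $(wc -c < "$file") ))
    if [ "$size" -ge "$limit" ]; then
        echo "AT_LIMIT $file ($size bytes)"
        return 0
    fi
    echo "OK $file ($size bytes)"
    return 1
}
```

For example, `log_at_limit ./abc30/files/lib_batch_log` could be run from the library's parent directory. Whether a full log alone prevents the queue from starting is not established here; the check only confirms the file size.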
Resolution:
The problem was that the v20 scratch and files directories were symlinks to the v17 instance:
lrwxrwxrwx 1 aleph exlibris 27 Jun 15 13:52 scratch -> /wrkspc/u17_1/abc50/scratch/
lrwxrwxrwx 1 aleph exlibris 25 Jun 15 13:52 files -> /wrkspc/u17_1/abc50/files/
Thus, there was an unhealthy connection between these two instances: they shared the same scratch and files directories.
For instance, when the v20 lib_batch starts, it writes a que_batch_lock file to the /wrkspc/u17_1/abc50/files/ directory. The v17 UTIL-C-2 then finds this que_batch_lock and concludes that lib_batch is already running.
The solution was to remove these symlinks.
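Before removing anything, it may help to confirm which library directories are affected. The following is a minimal sketch, assuming a POSIX shell and a system that provides readlink. The function name check_lib_links is a hypothetical helper, not an Aleph utility.

```shell
#!/bin/sh
# check_lib_links LIB_DIR
# Hypothetical helper: reports whether the files and scratch
# subdirectories of a library directory are symlinks.
check_lib_links() {
    lib_dir=$1
    for sub in files scratch; do
        path="$lib_dir/$sub"
        if [ -L "$path" ]; then
            # A symlink: show where it points so a cross-instance
            # target (e.g. a v17 path under a v20 library) stands out.
            echo "SYMLINK $path -> $(readlink "$path")"
        elif [ -d "$path" ]; then
            echo "OK $path"
        else
            echo "MISSING $path"
        fi
    done
}
```

It would be called with the v20 library directory as its argument, for example `check_lib_links /path/to/v20/abc50`, where the path is a placeholder. Any SYMLINK line whose target lies in another instance's tree indicates the shared-directory condition described above.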
- Article last edited: 10/8/2013