/exlibris at 98% used: z52.comp.seq / clear_vir01
- Product: Aleph
- Product Version: 20, 21, 22, 23
- Relevant for Installation Type: Multi-Tenant Direct, Dedicated-Direct, Local, TotalCare
Description:
The /exlibris filesystem is at 98% used -- and growing. What is causing this? How can it be stopped?
Resolution:
[Note: For more common causes of the /exlibris filesystem filling up, see the article "Our /exlibris file system keeps filling up. How can we tell what's causing this?"]
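As a first step, the reported usage can be confirmed with a plain "df" check (a generic sketch, not an Aleph-specific command; output format varies by platform):

```shell
# Confirm how full the filesystem is.
df -h /exlibris
# Illustrative output only (not from the actual server):
# Filesystem      Size  Used Avail Use% Mounted on
# /dev/sdb1       500G  490G   10G  98% /exlibris
```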
Using the "find . -size +1000000 -print" command (described in the article "/exlibris filesystem 100% full; locating files to delete **MASTER RECORD**"), we located the 193 GB z52.comp.seq file:
-rw-r--r-- 1 aleph exlibris 193G Dec 22 01:29 /exlibris/aleph/a23_1/vir01/files/z52.comp.seq
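As a self-contained illustration of the -size predicate: its default unit is 512-byte blocks, so "+1000000" matches files larger than roughly 512 MB. The temp directory and sparse file below are invented stand-ins for /exlibris:

```shell
# Demo: locate files larger than ~512 MB with find's -size predicate.
tmpdir=$(mktemp -d)
truncate -s 1G "$tmpdir/z52.comp.seq"   # sparse 1 GB file, uses no real disk space
find "$tmpdir" -size +1000000 -print    # prints the path of the big file
rm -rf "$tmpdir"
```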
The following process was writing to this file:
# fuser /exlibris/aleph/a23_1/vir01/files/z52.comp.seq
/exlibris/aleph/a23_1/vir01/files/z52.comp.seq: 3827
# ps -p 3827 -f -ww
UID PID PPID C STIME TTY TIME CMD
aleph 3827 3654 99 Dec21 ? 20:29:22 /exlibris/aleph/a23_1/aleph/exe/compress_seq_file /exlibris/aleph/a23_1/vir01/scratch/z52.length /exlibris/aleph/a23_1/vir01/files/z52.seq /exlibris/aleph/a23_1/vir01/files/z52.comp.seq
This process had the following parent process:
aleph 3654 3643 0 Dec21 ? 00:00:00 /bin/csh -f /exlibris/aleph/a23_1/aleph/proc/p_file_11 VIR01,T,z52,/exlibris/aleph/a23_1/vir01/files/z52.seq,/exlibris/aleph/a23_1/vir01/files/z52.comp.seq,VIR01
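The parent lookup above can be reproduced generically by reading the PPID column from "ps". A sketch, using the current shell as a stand-in for the stuck process's PID (3827 in this case):

```shell
# Sketch: given a PID, look up its parent's full command line.
pid=$$                                      # the current shell stands in for PID 3827
ppid=$(ps -o ppid= -p "$pid" | tr -d ' ')   # PPID of the child process
ps -p "$ppid" -f -ww                        # parent's full, unwrapped command line
```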
The clear_vir01 proc calls p_file_04 to load the vir01 z52 table. p_file_04, in turn, calls p_file_04_1, which calls aleph_load, which calls p_file_11.
The ./z52.comp.seq file is normally small, but the program appears to have gotten stuck and was writing to this file repeatedly.
clear_vir01, released by the job daemon at 5:00 AM each day, normally runs for about 1 minute. In this case, however, the vir01_clear_vir01.00401 run, started at 5:00 on Dec 21, continued running (and writing to the vir01 z52) until 01:36 on Dec 22.
The command "ps -ef | grep vir01" was used to locate the various processes generated by clear_vir01, and "kill -9 xxxx yyyy zzzz ..." was then issued to kill all of them.
A second "ps -ef | grep vir01" confirmed that all of the processes had been killed.
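The find-and-kill pattern can be sketched generically. A disposable background "sleep" stands in here for the stuck vir01 processes; the bracketed pattern "[s]leep" keeps the grep/awk process from matching itself:

```shell
# Demo: find processes by pattern and kill them.
sleep 600 &                                       # stand-in for a stuck process
pids=$(ps -ef | awk '/[s]leep 600/ {print $2}')   # PID column; [s] avoids self-match
kill -9 $pids
ps -ef | grep '[s]leep 600'                       # prints nothing once they are gone
```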
Note: killing these processes also caused the super-big ./vir01/files/z52.comp.seq to be removed.
The log shows this:
...
SQL*Plus: Release 12.1.0.2.0 Production on Thu Dec 21 05:00:14 2017
...
vir01@ALEPH23> > > > > 2 0
0
NO COLUMNS FOUND FOR TABLE
No errors.
1* begin f1_Z52; end;
2 20
20
9
1
30
1
No errors.
...
in_fl: /exlibris/aleph/a23_1/vir01/files/z52.seq out_fl: /exlibris/aleph/a23_1/vir01/files/z52.comp.seq
Terminated compress_seq_file /exlibris/aleph/a23_1/vir01/scratch/z52.length /exlibris/aleph/a23_1/vir01/files/z52.seq /exlibris/aleph/a23_1/vir01/files/z52.comp.seq
SQL*Plus: Release 12.1.0.2.0 Production on Fri Dec 22 01:36:14 2017
<end log excerpt>
The "Terminated ..." line indicates the point at which the compress_seq_file process (3827) was killed.
It's unclear what caused this compress_seq_file process to write repeatedly to the ./z52.comp.seq file, but the clear_vir01 released 3.5 hours later (at 5:00) ran normally and successfully.
Additional Information for PACSCL
You need to log in as "exlibris" (with the standard password) and then do "sudo su - aleph". (With any PACSCL log-in you will get "Permission denied" when trying to do the "kill".) Note that PACSCL has two Aleph instances on their server: a23_1 (Prod) and a23_2 (Test). This problem can occur with either; that is, the super-big file can be either ./a23_1/vir01/files/z52.comp.seq or ./a23_2/vir01/files/z52.comp.seq. The commands you enter to solve the problem are the same regardless of which instance it is.
- Article last edited: 22-Dec-2017