Job started by job daemon at midnight every Sunday morning brings system down
- Article Type: General
- Product: Aleph
- Product Version: 18.01
Every week at midnight on Sunday morning, our system goes down. All of the server processes, daemons, etc., running on the machine stop.
Looking at the $alephe_scratch/jobd log, we see an entry like the following, with a blank line where the job name should be:
Current time: Sunday 30 March 2008 00:00:00
We also see the following entry in the $alephe_scratch directory with the same timestamp:
0 ,1, 00:00:00
That is, a 0-length file named ",1," with a timestamp of 00:00:00.
A segmentation fault occurs in connection with this event and produces a core dump.
This problem was caused by a malformed entry in job_list. While a correctly-formatted entry might look like the following:
W3 12:22:00 Y VIR01 clear_vir01 VIR01
this entry looked like this:
The ",1," was being interpreted as the job name and, since no day or time was included, the system was defaulting to the first second of the first minute of the first hour of the first day of the week, that is, 00:00:00 on Sunday.
Removing the malformed line from job_list, then running util e/15/2 to stop the job daemon, followed by util e/15/1 to reactivate it, corrected the problem.
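A check along the following lines could help spot such entries before the job daemon is restarted. This is only a minimal sketch, not an Aleph utility: the function name find_suspect_entries is hypothetical, and the field layout (a day token, then an HH:MM:SS time, then further fields) is assumed from the single well-formed example above. Treating lines that begin with "!" as comments is likewise an assumption.

```python
import re

# Assumed layout, inferred only from the one well-formed example in this article:
#   <day token> <HH:MM:SS> <further fields...>
TIME_RE = re.compile(r"^([01]\d|2[0-3]):[0-5]\d:[0-5]\d$")


def find_suspect_entries(lines):
    """Return (line_number, text) pairs for entries lacking a day token and a time."""
    suspects = []
    for number, raw in enumerate(lines, start=1):
        text = raw.strip()
        # Assumption: blank lines and lines starting with "!" are not job entries.
        if not text or text.startswith("!"):
            continue
        fields = text.split()
        # A usable entry needs at least a day token, a time, and something to run.
        if len(fields) < 3 or not TIME_RE.match(fields[1]):
            suspects.append((number, text))
    return suspects


# Illustrative input: the second line is a stand-in for a malformed entry,
# not the actual line from the affected job_list.
sample = [
    "W3 12:22:00 Y VIR01 clear_vir01 VIR01",
    ",1,",
]
for number, text in find_suspect_entries(sample):
    print(f"line {number}: {text}")
```

To check a real file, the lines of job_list could be read with open() and passed to the same function. Any line it reports should be reviewed by hand rather than deleted automatically, since the format rules assumed here may be stricter or looser than what the job daemon actually accepts.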
- Article last edited: 10/8/2013