Aleph Down yesterday

Article Type: General
Product: Aleph
Product Version: 18.01

Description:
Yesterday at approximately 5:09 pm the Aleph servers stopped. Apache stayed up, however, so the cause was not the OS or the system as a whole. I was able to bring all of the servers back up easily using util w. There is no explanation in the logs as to what could have caused this.

Resolution:
We see in the $LOGDIR directory that all of the servers were stopped at 17:09:

www_server_4500.log.1306.1709:

2012-06-13 17:09:58 06 [000] [log] Wrote 213 bytes
Server received stop command
Server killing child pid: 1174
Server killing child pid: 446
...

pc_server_6505.log.1306.1709:

2012-06-13 17:09:58 06 [000] [log] Wrote 213 bytes
Server received stop command
Server killing child pid: 25393
Server killing child pid: 3721
...

Etc.

The stoppage finished with the last server at around 17:10:01. The libraries' batch queues and daemons were *not* stopped, with the exception of the job daemon.
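
The same check can be repeated across every server log with a short script. The following is a minimal sketch, assuming the log layout shown in the excerpts above: a timestamped line followed by an untimestamped "Server received stop command" line. The function names `find_stop_events` and `scan_log_dir`, and the `*_server_*.log*` file-name pattern, are illustrative assumptions and not part of Aleph.

```python
import re
from pathlib import Path

# Timestamp at the start of a log line, e.g. "2012-06-13 17:09:58 06 [000] [log] ..."
TIMESTAMP_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\b")
STOP_MARKER = "Server received stop command"


def find_stop_events(lines):
    """Return the last timestamp seen before each stop-command line.

    The stop line itself carries no timestamp in the excerpts, so the
    nearest preceding timestamp is used as an approximation (None if
    no timestamp has been seen yet).
    """
    events = []
    last_seen = None
    for line in lines:
        match = TIMESTAMP_RE.match(line)
        if match:
            last_seen = match.group(1)
        elif STOP_MARKER in line:
            events.append(last_seen)
    return events


def scan_log_dir(log_dir, pattern="*_server_*.log*"):
    """Map each matching log file name to its list of stop timestamps.

    The glob pattern is an assumption based on the file names quoted
    in this article; adjust it to the local naming scheme.
    """
    results = {}
    for path in sorted(Path(log_dir).glob(pattern)):
        if not path.is_file():
            continue
        with path.open(errors="replace") as handle:
            events = find_stop_events(handle)
        if events:
            results[path.name] = events
    return results
```

Pointing `scan_log_dir` at the directory named by $LOGDIR would list, per log file, the approximate time of each stop command, which makes it easy to see whether all servers stopped within the same few seconds.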

Because the batch queues and daemons kept running, it's clear that this server stoppage was not the result of running aleph_shutdown or util w/5 (which can toggle SYSTEM down), but of something else.

The logs show that the servers were immediately restarted (starting at 17:10:02). For example, pc_server_6505.log.1306.1730:

Previous server killed (pid 20927)
2012-06-13 17:10:02 48 [000] [log] Message length 133
...
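
To confirm how quickly each server came back, the restart marker can be paired with the first timestamp that follows it. This is a minimal sketch, assuming the layout of the excerpt above (a "Previous server killed (pid N)" line followed by a timestamped line); `find_restarts` and `seconds_between` are hypothetical helper names, not Aleph utilities.

```python
import re
from datetime import datetime

TIMESTAMP_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\b")
RESTART_RE = re.compile(r"^Previous server killed \(pid (\d+)\)")


def find_restarts(lines):
    """Return (previous_pid, first_timestamp_after) pairs, one per restart.

    A restart marker that is never followed by a timestamped line
    is silently dropped.
    """
    restarts = []
    pending_pid = None
    for line in lines:
        restart = RESTART_RE.match(line)
        if restart:
            pending_pid = int(restart.group(1))
            continue
        stamp = TIMESTAMP_RE.match(line)
        if stamp and pending_pid is not None:
            restarts.append((pending_pid, stamp.group(1)))
            pending_pid = None
    return restarts


def seconds_between(start, end):
    """Elapsed seconds between two 'YYYY-MM-DD HH:MM:SS' timestamps."""
    fmt = "%Y-%m-%d %H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds()
```

Applied to the pc_server_6505 excerpts, the last timestamp before the stop (17:09:58) and the first timestamp after the restart (17:10:02) are 4 seconds apart, which is consistent with an immediate restart. Both timestamps are only approximations of the actual stop and start moments.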

util w/2/1 (Stop all servers) and the command "server_monitor -ks" both stop all the servers, but there is no parallel command to *start* all the servers.

The restarts seen in the logs may simply reflect your restarting each server individually by hand. But running aleph_startup without a preceding aleph_shutdown could also give this result: the servers and the job daemon would be restarted, while the other queues and daemons, which were already running, would not be.

Article last edited: 10/8/2013