Aleph Down yesterday
- Article Type: General
- Product: Aleph
- Product Version: 18.01
Description:
Yesterday at approximately 5:09 pm the Aleph servers stopped, but Apache stayed up, so the cause was not the OS or the system as a whole. I was able to bring all of the servers back up easily using util w. The logs give no explanation of what could have caused this.
Resolution:
We see in the $LOGDIR directory that all of the servers were stopped at 17:09:
www_server_4500.log.1306.1709:
2012-06-13 17:09:58 06 [000] [log] Wrote 213 bytes
Server received stop command
Server killing child pid: 1174
Server killing child pid: 446
...
pc_server_6505.log.1306.1709:
2012-06-13 17:09:58 06 [000] [log] Wrote 213 bytes
Server received stop command
Server killing child pid: 25393
Server killing child pid: 3721
...
Etc.
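As an illustration of how the stop time can be read from these logs, here is a minimal Python sketch. It assumes the log lines look exactly like the excerpts above (a timestamped line, then "Server received stop command", then "Server killing child pid: N" lines). The function name find_stop_events and the SAMPLE text are hypothetical and are not part of Aleph.

```python
import re
from datetime import datetime

# Assumed line layouts, taken from the excerpts above.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\b")
CHILD_PID = re.compile(r"^Server killing child pid:\s*(\d+)")

SAMPLE = """\
2012-06-13 17:09:58 06 [000] [log] Wrote 213 bytes
Server received stop command
Server killing child pid: 1174
Server killing child pid: 446
"""

def find_stop_events(log_text):
    """Return one dict per stop command: last timestamp seen before it, and killed child pids."""
    events = []
    last_seen = None   # most recent timestamp found in the log
    current = None     # stop event currently collecting child pids
    for raw in log_text.splitlines():
        line = raw.strip()
        match = TIMESTAMP.match(line)
        if match:
            last_seen = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            current = None  # a new timestamped line closes any open stop event
            continue
        if line == "Server received stop command":
            current = {"stopped_at": last_seen, "child_pids": []}
            events.append(current)
            continue
        match = CHILD_PID.match(line)
        if match and current is not None:
            current["child_pids"].append(int(match.group(1)))
    return events

if __name__ == "__main__":
    for event in find_stop_events(SAMPLE):
        print(event["stopped_at"], event["child_pids"])
```

Running the same function over each server log in $LOGDIR would show whether all servers received the stop command within the same few seconds, as they did here.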
The last server stopped at around 17:10:01. The libraries' batch queues and daemons were *not* stopped, with the exception of the job daemon.
Thus, it's clear that this server stoppage was not the result of running aleph_shutdown or util w/5 (which can toggle SYSTEM down), but of something else.
The logs show that the servers were restarted immediately (starting at 17:10:02). For example, pc_server_6505.log.1306.1730:
Previous server killed (pid 20927)
2012-06-13 17:10:02 48 [000] [log] Message length 133
...
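The gap between stop and restart can be checked with a short Python sketch. It assumes a restarted server's log begins as in the excerpt above, with a "Previous server killed (pid N)" line followed by timestamped lines. The function name first_restart and the SAMPLE_RESTART text are hypothetical.

```python
import re
from datetime import datetime

# Assumed line layouts, taken from the excerpt above.
PREVIOUS_KILLED = re.compile(r"^Previous server killed \(pid (\d+)\)")
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\b")

SAMPLE_RESTART = """\
Previous server killed (pid 20927)
2012-06-13 17:10:02 48 [000] [log] Message length 133
"""

def first_restart(log_text):
    """Return (previous pid, first timestamp after the 'Previous server killed' line), or None."""
    previous_pid = None
    for raw in log_text.splitlines():
        line = raw.strip()
        match = PREVIOUS_KILLED.match(line)
        if match:
            previous_pid = int(match.group(1))
            continue
        match = TIMESTAMP.match(line)
        if match and previous_pid is not None:
            started = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            return previous_pid, started
    return None

if __name__ == "__main__":
    pid, restarted_at = first_restart(SAMPLE_RESTART)
    stopped_at = datetime(2012, 6, 13, 17, 9, 58)  # stop time from the logs quoted earlier
    print(pid, (restarted_at - stopped_at).total_seconds())
```

With the timestamps quoted in this article, the restart follows the stop command by only a few seconds.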
util w/2/1 (Stop all servers) and the command "server_monitor -ks" both stop all the servers, but there is no parallel command to *start* all the servers.
The restarts we see may simply be the result of your restarting each server individually by hand. But running aleph_startup without a preceding aleph_shutdown could also give this result: the servers and the job daemon would be restarted, but the other queues and daemons, which were already running, would not be.
- Article last edited: 10/8/2013