- Article Type: General
- Product: Aleph
- Product Version: 19
We ran p_manage_01 to reindex words in our bib library via a parallel process and, while the process appears to have completed successfully, an analysis of the logs shows that a problem was encountered during the p_manage_01_e step. After the error was logged, the process continued with indexing the z97 and z98 tables. However, there were no records in the z98 table and a p_manage_01_e.err file was created, containing this message:
FAILURE Sun Nov 9 13:29:30 EST 2008 ==================
b_manage_01_4: /exlibris/aleph/u19_1/pab01/scratch/manage_01_6 build failure
Job Suspended !!!
Because all previous steps appeared to have completed successfully, I restarted the p_manage_01_e process today, following the documented restart instructions, and it failed with the same error.
Can you tell me what would cause this process to not complete? I'm not sure if space for log files is a concern; we have 32+ GB available in the $data_scratch directory for the parallel library in case that is relevant.
The "b_manage_01_4: .../manage_01_6 build failure" message is issued by the $aleph_proc/p_manage_01_e proc when the b_manage_01_4.c routine that it calls returns a non-zero return code.
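As a rough illustration of that pattern only (this is not the actual p_manage_01_e proc, which is a shell proc on the Aleph server), the following Python sketch runs a step and produces a message of the same shape when the return code is not zero. The function name run_step and its parameters are hypothetical.

```python
import subprocess
import sys


def run_step(command, label, target):
    """Run one step; return a failure message if its return code is
    non-zero, otherwise None.

    The message shape mirrors the one quoted in this article. The
    function itself is a hypothetical illustration, not Aleph code.
    """
    result = subprocess.run(command)
    if result.returncode != 0:
        return f"{label}: {target} build failure"
    return None


if __name__ == "__main__":
    # A stand-in command that exits with return code 3.
    failing = [sys.executable, "-c", "import sys; sys.exit(3)"]
    print(run_step(failing, "b_manage_01_4", "/tmp/manage_01_6"))
```

Note that a message built this way records only that the return code was non-zero; it carries no information about why the called routine failed.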
The log which you show has this message:
start - handling manage_01_4.49
"49" is the $dd_WORD4.
I understand that you did at least two p_manage_01_e runs. Do both of them have the "start - handling manage_01_4.49" preceding the failure ... or is it a different number than "49"?
I am trying to see whether the problem is occurring with exactly the same $data_scratch/manage_01_4.nn records -- or different ones.
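One way to make that comparison is to scan each run's log for the last "start - handling manage_01_4.nn" line that precedes the failure. The Python sketch below does this on log text; the two patterns are assumptions based only on the log lines quoted in this article, and the function names are hypothetical.

```python
import re

# Assumed patterns, taken from the two log lines quoted in this article.
START_RE = re.compile(r"start - handling manage_01_4\.(\d+)")
FAILURE_RE = re.compile(r"build failure")


def segment_before_failure(log_text):
    """Return the manage_01_4.nn number last seen before the first
    'build failure' line, or None if no failure (or no start line
    before it) is found."""
    last_seen = None
    for line in log_text.splitlines():
        match = START_RE.search(line)
        if match:
            last_seen = match.group(1)
        elif FAILURE_RE.search(line):
            return last_seen
    return None


def same_segment(first_log, second_log):
    """True only if both logs failed after handling the same number."""
    first = segment_before_failure(first_log)
    second = segment_before_failure(second_log)
    return first is not None and first == second
```

To use it, read the log of each p_manage_01_e run into a string and pass both strings to same_segment. A result of True would suggest the failure is tied to the same $data_scratch/manage_01_4.nn records in both runs.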
Also, what does the p_manage_01_e.cycles file show?
Note: KB 16384-8753 is also a "build failure", but in a different step of p_manage_01_e -- and, I believe, with a different cause.
[From site:] A second restart of p_manage_01_e worked. We do understand that the process stopped due to an error being returned by a sub-routine, but what I was trying to get at, which is still unknown, is what was causing the return code from the b_manage_01_4.c routine to not be 0. None of the logs gave an indication of the root cause of the problem, as far as I could tell. At this point, it is not worth pursuing, nor is it possible to determine the answer this long after the fact, but I will let you know if we encounter the issue again in production.
See Article 000037057, manage-01 fails with "manage_01_6 build failure", for a specific cause of this error message which has been observed.
- Article last edited: 5/1/2014