Z39_server bottleneck between Aleph and OCLC
- Product: Aleph
- Product Version: 20, 21, 22, 23
- Relevant for Installation Type: Dedicated-Direct, Direct, Local, Total Care
Description
We are seeing queries submitted from WorldCat Local (WCL), which arrive directly on port 9991, time out with no OPAC data received. When this happens, the query never appears in the z39.50 log at all, so the problem appears to be on the incoming side. The server is fine when activity is low (i.e. Saturday and Sunday), but the problem resumed today, Monday, under normal weekday load.
Our IT staff report seeing a few blocked requests on the firewall:
Sep 8 15:16:31 libprod1 ipmon[454]: [ID 702911 local0.warning] 15:16:31.327347 bge219000 @11016:2 b 132.174.100.234,58625 -> 132.216.30.61,9991 PR tcp len 20 40 -AR IN
Sep 10 00:20:08 libprod1 ipmon[454]: [ID 702911 local0.warning] 00:20:08.666420 bge219000 @11016:2 b 132.174.100.234,20196 -> 132.216.30.61,9991 PR tcp len 20 40 -AR IN
Sep 10 00:23:08 libprod1 ipmon[454]: [ID 702911 local0.warning] 00:23:08.413673 bge219000 @11016:2 b 132.174.100.234,20394 -> 132.216.30.61,9991 PR tcp len 20 40 -AR IN
Sep 10 00:24:13 libprod1 ipmon[454]: [ID 702911 local0.warning] 00:24:13.192987 bge219000 @11016:2 b 132.174.100.234,44390 -> 132.216.30.61,9991 PR tcp len 20 40 -AR IN
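The symptom above (clients time out while the server application never logs the request) is exactly what a full TCP accept queue produces: completed connections pile up in the kernel, and once the queue is full, further connection attempts are dropped before the application ever sees them. The following minimal Python sketch (not from the article; port and loop counts are arbitrary) illustrates this by listening with a tiny backlog and never calling accept():

```python
# Illustration of a full TCP accept queue: the listener never calls accept(),
# so completed connections queue in the kernel. Once the backlog is full,
# further connection attempts are dropped and the clients time out, while
# the server application never "sees" them - analogous to queries that
# never reach the z39.50 log.
import socket

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)  # tiny backlog, analogous to a small tcp_conn_req_max_q
port = srv.getsockname()[1]

succeeded, timed_out = 0, 0
clients = []
for _ in range(8):
    c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    c.settimeout(0.5)
    try:
        c.connect(("127.0.0.1", port))
        succeeded += 1
        clients.append(c)
    except OSError:  # connect times out once the backlog is full
        c.close()
        timed_out += 1

print(succeeded, timed_out)  # only the first few connects succeed

for c in clients:
    c.close()
srv.close()
```

Raising the backlog limit (on Solaris, `tcp_conn_req_max_q`, as in the Resolution below) lets more pending connections queue instead of being dropped.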
Resolution
Our Unix administrator raised the Solaris TCP listen-backlog parameter, tcp_conn_req_max_q, from its default of 128 to 4096 (checked beforehand with: zlogin blink ndd -get /dev/tcp tcp_conn_req_max_q).
We believe this OS parameter was the cause of the congestion: when the queue of pending connections fills, the kernel drops further incoming connection attempts before they reach the z39_server process, which matches the symptom of queries never appearing in the z39.50 log.
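For reference, the commands involved would look roughly like the following (a sketch for Solaris 10; the zone name "blink" is taken from the article, and the value 4096 matches the "4K" the admin set):

```shell
# Check the current accept-queue limit inside the "blink" zone (was 128)
zlogin blink ndd -get /dev/tcp tcp_conn_req_max_q

# Raise it to 4096; takes effect immediately
zlogin blink ndd -set /dev/tcp tcp_conn_req_max_q 4096
```

Note that ndd settings do not persist across a reboot; to make the change permanent it must be reapplied from a startup script.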
Additional Information
We also asked OCLC to raise the maximum number of sockets to 100, but it is unclear whether this contributed to resolving the problem.
The z39_server logs showed thousands of child processes being started and killed ("Server killing child pid: nnnnn"). These messages are still present even though performance is now much better and CPU usage is much lower, so by themselves they do not appear to indicate a problem.
- Article last edited: 1-Sep-2017

