- Product: Aleph
- Product Version: 20, 21, 22, 23
- Relevant for Installation Type: Dedicated-Direct, Direct, Local, Total Care
The system is extremely slow. "top" shows CPU %us (user time) at only 2-5%, but %wa (I/O wait) is in the 10-25% range. (It's normally in the 1-3% range.) An "rpciod/16" process is consistently the highest-CPU process (at 27%). rpciod is a kernel thread related to NFS (Network File System).
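The %wa figure reported by "top" can also be derived directly from the kernel counters in /proc/stat on Linux. The following is a minimal sketch, not part of Aleph or Oracle; the function names `parse_cpu_line` and `iowait_percent` and the 5-second sampling interval are assumptions chosen for illustration.

```python
import time


def parse_cpu_line(stat_text):
    """Return the aggregate CPU counters from /proc/stat text as a list of ints."""
    for line in stat_text.splitlines():
        fields = line.split()
        # The aggregate line is labelled exactly "cpu"; per-core lines are "cpu0", "cpu1", ...
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two /proc/stat snapshots."""
    old = parse_cpu_line(before)
    new = parse_cpu_line(after)
    if len(old) < 5 or len(new) < 5:
        raise ValueError("expected at least user, nice, system, idle, iowait")
    deltas = [n - o for n, o in zip(new, old)]
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice.
    # Only the first eight are summed; guest time is already counted in user/nice.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


if __name__ == "__main__":
    # Hypothetical usage: sample twice, 5 seconds apart, and print the iowait share.
    with open("/proc/stat") as f:
        first = f.read()
    time.sleep(5)
    with open("/proc/stat") as f:
        second = f.read()
    print("iowait: %.1f%%" % iowait_percent(first, second))
```

Running a check like this periodically would make it easier to notice when iowait drifts above its normal 1-3% range.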
A long-running SQL query triggered the problem. (See Additional Information below.) Restarting Aleph and Oracle corrected the problem.
(Note: After the restart, the rpciod/16 process continued to exist and run, but its %CPU fell from 27% to 1%.)
An SQL query triggered the problem. Even though the query was only a SELECT, Oracle uses the UNDO tablespace to give it a stable (read-consistent) view of the data while the nested query runs. The query was reading from the Z00P table, which is the largest table in the database and is continually updated by ue_21. The higher iowait percentages were therefore caused by Oracle keeping track of these changes to the Z00P table in the UNDO tablespace for the duration of the query.
The documentation also states that killing the SQL doesn't really help when the system is in trouble as it was here: Oracle needs to put everything back to a stable point in time, and it takes Oracle longer to roll back the UNDO tablespace than it took to get to that point.
The same query on Test ran for approximately 11 hours before the VPN connection was lost and the SQL was killed. It then took the system 17 hours to get back to normal. During this query the iowait sat at around 5%, and this was on Test, where not much else is happening.
So, while it's easy to say "don't run that query," this still points to an I/O issue on the system. The I/O throughput to the disks that contain the .dbf files for the Oracle database appears to be poor.
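One rough way to check the suspicion about disk throughput is to time a synced sequential write on the filesystem in question. The sketch below is only an illustration under stated assumptions: the function name `write_throughput_mb_s`, the 64 MB default size, and the target directory are all hypothetical, and a dedicated benchmarking tool would give more reliable numbers. It adds I/O load of its own, so it is better suited to a quiet period or a Test server.

```python
import os
import tempfile
import time


def write_throughput_mb_s(directory, size_mb=64, block_kb=1024):
    """Write size_mb of zeros to a temp file in directory and return MB/s.

    The file is fsync'ed before the clock stops, so the figure reflects data
    reaching the storage layer rather than only the page cache.
    """
    block = b"\0" * (block_kb * 1024)
    blocks = (size_mb * 1024) // block_kb
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.monotonic()
        with os.fdopen(fd, "wb") as f:
            for _ in range(blocks):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())
        elapsed = time.monotonic() - start
    finally:
        # Always remove the scratch file, even if the write failed.
        os.remove(path)
    written_mb = blocks * block_kb / 1024
    return written_mb / max(elapsed, 1e-9)


if __name__ == "__main__":
    # Hypothetical usage: replace "." with a scratch directory on the same
    # volume as the .dbf files.
    print("%.1f MB/s" % write_throughput_mb_s("."))
```

Comparing the result against the same measurement on a local disk, or on another server, gives a first indication of whether the storage path (for example, an NFS mount) is the bottleneck.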
- Article last edited: 28-Feb-2016