%Interrupt Time is the share of time the processor spends servicing interrupts from the hardware, in other words the time spent talking to the hardware installed in the system. It is usually about 0-1% on a normal box and about 3-5% on a fairly busy box.
In our case we were seeing 20% or more.
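To make those numbers concrete, here is a minimal sketch of how a reading like this can be derived from two samples of cumulative CPU time and then bucketed. The `CpuSample` container, the function names and the exact cut-offs are my own assumptions for illustration (the ranges quoted above leave a gap between 1% and 3%, which this sketch simply treats as "busy"). It is not how Performance Monitor computes the counter internally.

```python
from typing import NamedTuple


class CpuSample(NamedTuple):
    """Hypothetical snapshot of cumulative CPU time, in seconds."""
    total: float      # all CPU time accounted for so far (busy + idle)
    interrupt: float  # the part of `total` spent servicing interrupts


def interrupt_percent(before: CpuSample, after: CpuSample) -> float:
    """Percentage of the elapsed CPU time that went to interrupts."""
    elapsed = after.total - before.total
    if elapsed <= 0:
        # No time passed between samples, so nothing can be reported.
        return 0.0
    return 100.0 * (after.interrupt - before.interrupt) / elapsed


def classify(percent: float) -> str:
    """Bucket a reading using the rough ranges quoted in this post."""
    if percent <= 1.0:
        return "normal"   # about 0-1% on a normal box
    if percent <= 5.0:
        return "busy"     # about 3-5% on a fairly busy box
    return "high"         # well above that, worth investigating
```

With made-up samples such as `CpuSample(100.0, 1.0)` and `CpuSample(110.0, 3.0)`, two of the ten elapsed seconds went to interrupts, which is the kind of reading we were seeing on the problem box.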
Implications of high %Interrupt Time
If %Interrupt Time is high, it usually implies that some piece of hardware is really busy. More often than not this is caused by faulty hardware, and in rarer cases by software that puts a lot of load on the hardware, such as a faulty device driver.
When faulty hardware is the cause, the device stops working properly and the OS has to talk to it far more than usual. In fact, nearly every page on the web about this symptom points to some faulty hardware or driver (usually newly installed) that needs to be identified and removed before the system starts behaving properly again.
Most other times disks are the cause: they are prone to hardware errors because of their moving parts or incorrect connections. When this happens, the system falls back to a less efficient transfer mode (typically PIO) instead of Ultra DMA or DMA. More details can be found here.
Our problem specifics
But the issue I was trying to debug was on a system that behaved properly whenever the offending process (Cisco CUOM) was stopped. No new hardware had been installed, so driver issues were ruled out. I did some file copies and browsing to rule out hardware issues. %Interrupt Time did not peak in these tests, so we had to conclude there were no issues with the disk or network equipment.
The issue peaked only when our software was running. Was our software stressing the hardware in some unforeseen manner, or was the software just plain inefficient?
In either case, we had to look at what was executing inside the kernel to decide where the OS was spending its time. There are a couple of ways to do this (which I will detail in my next post). They should give us the call stack (i.e. the sequence of function calls, traced through the modules involved) and from there the code where the system spends most of its time.
Verify software configuration
Trying to look into the kernel innards is fine if you have the stomach for it. But prudence calls for checking what software is actually on the system, to get some first clues before trying anything as involved as kernel debugging. My first task therefore was to check what was installed and running on the system. A simple glance at the Task Manager showed me processes named ~9.exe and svchosty.exe (y has two dots over it).
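That kind of eyeballing can be sketched as a small heuristic. The code below is only an illustration under my own assumptions: the three rules, the `KNOWN_SYSTEM_NAMES` set and the function name are hypothetical, and a clean result proves nothing about a machine. The third rule covers the whitespace-padded copies of system names that come up later in this post.

```python
# Hypothetical short list of legitimate Windows process names to compare against.
KNOWN_SYSTEM_NAMES = {"svchost.exe", "explorer.exe", "winlogon.exe"}


def suspicious_reasons(name: str) -> list:
    """Return the reasons (possibly none) why a process name looks like junk."""
    reasons = []

    # Rule 1: names beginning with a tilde, such as ~9.exe.
    if name.startswith("~"):
        reasons.append("starts with ~")

    # Rule 2: any non-ASCII character, such as a y with two dots over it.
    if not name.isascii():
        reasons.append("non-ASCII character")

    # Rule 3: a known system name with extra whitespace slipped in.
    collapsed = "".join(name.split())
    if collapsed != name and collapsed.lower() in KNOWN_SYSTEM_NAMES:
        reasons.append("padded system name")

    return reasons


def flag_processes(names):
    """Map each suspicious name to its reasons; clean names are left out."""
    flagged = {}
    for name in names:
        reasons = suspicious_reasons(name)
        if reasons:
            flagged[name] = reasons
    return flagged
```

Feeding it a list of names copied from the Task Manager, for example `flag_processes(["svchost.exe", "~9.exe"])`, returns only the entries worth a second look.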
The first thing to do when you see junk process names is to check whether anti-virus software is present on the system. No surprises there: none was present. After I downloaded and installed Norton, its scan window closed / vanished as soon as it was opened. This made me suspect the worst, and true enough a brief search on the internet confirmed that this behavior is caused by a worm called w32/Deborm.worm.gen (in the McAfee world) / W32.HLLW.Nebiwo (in the Norton AV world). It spreads through file shares and is one of the most common viruses out there. Any unprotected systems in big networks soon get infected with this worm.
Anti – Anti Virus
This virus soon brings other malware onto the machine, including backdoors and, check this out, ANTI anti-virus products :). This was why the Norton windows kept closing. The only cure then is to clean out the system by hand: kill all the processes whose names start with ~, along with “Explorer .exe” and “Winlogon .exe ”, and perform all the steps recommended in this advisory.
- The Anti AntiVirus trojan was Trojan.KillAV
- One of the backdoor trojans was Backdoor.SdBot
- One of the backdoor trojans was Backdoor.Litmus
Folks assume that once the virus that caused the initial infection is rooted out, the machine is ready for use again. However, the trojan's advisory on the Norton web site reads: “If the Trojan was run and a hacker executed files on the computer, it may be difficult to determine exactly what was done, even after the Trojan was removed.”
We are now in the process of changing the enterprise-wide credentials of everyone who had used the system, and are also contemplating a possible re-installation of the entire system. Now you know how serious a virus infection can be 🙂 and how important an enterprise-wide anti-virus solution is to your organization.
Happy debugging !!!!