Last week I was working on a system that would stop responding to the user after many hours of running. All we knew was that the CPU was still running software, and we only knew that because RAM was still being accessed. The other thing we knew was that the customer was not happy, and we needed to learn more, fast.
To make it more interesting, the problem was occurring on systems that we could not work on with traditional tools. These systems were not sitting in a cozy office; heck, we didn’t even have electricity available to plug in workstations or equipment.
So I had two short-term goals: prove that the CPU was still running software, and identify the thread that was using all of the CPU cycles.
To solve the first goal, I wrote a simple thread to toggle an LED once a second. I set the thread priority to 10, after first proving that there were not any threads running even close to that priority outside of the kernel. This thread looked like this:
static DWORD WINAPI HeartBeatThread(LPVOID p)
{
    BOOL Toggle = 0;
    // Run above every non-kernel thread in the system
    CeSetThreadPriority( GetCurrentThread(), 10 );
    Sleep( 5000 );
    while( 1 )
    {
        Sleep( 1000 );
        // Flip the LED state once a second
        if( Toggle )
        {
            Toggle = 0;
            TurnOnLED();
        }
        else
        {
            Toggle = 1;
            TurnOffLED();
        }
    }
    return 0;
}
TurnOnLED() and TurnOffLED() are placeholders for this example. How they are implemented depends on your device, so I will leave that to you.
Then I set my sights on goal number 2: determine which thread is out of control and using the CPU cycles. For this, I created another thread that gets the current tick count and writes it to a global variable. This thread runs at the default thread priority. I set the thread to get the current tick count every second. It looks like this:
DWORD LowPriorityTicks = 0;
static DWORD WINAPI LowPriorityThread(LPVOID p)
{
    while( 1 )
    {
        Sleep( 1000 );
        LowPriorityTicks = GetTickCount();
    }
    return 0;
}
Now all I need to do is monitor for this low priority thread to stop running. I expected that it would stop running when the out of control thread started using all of the CPU cycles. To monitor for the problem, I added the following code to the heart beat thread:
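// Declared at the top of HeartBeatThread(), before the while loop
DWORD Count = 1;
BOOL ShowProc = TRUE;
DWORD Priority;

// Added inside the while loop, so it runs once a second
if(LowPriorityTicks && ( GetTickCount() - LowPriorityTicks > 60000 ) )
{
    // Output the process and thread list only once per occurrence
    if( ShowProc )
    {
        ShowRunningProcesses(TH32CS_SNAPTHREAD);
        ShowProc = FALSE;
    }
    // A lower number is a higher priority, so subtracting one raises it
    Priority = CeGetThreadPriority( hLowPriorityThread );
    RETAILMSG( 1, (TEXT("LowPriorityThread Priority %d setting %d\n"), Priority, Priority - 1 ));
    CeSetThreadPriority( hLowPriorityThread, Priority - 1 );
}
else
    ShowProc = TRUE;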
This code checks to see if the low priority thread has updated LowPriorityTicks in the last 60 seconds. If LowPriorityTicks has not been updated, then assume that there is a system problem and do the following:
· Output the process and thread information. For regular readers of this blog, the function ShowRunningProcesses() might look familiar. I provided this function in Windows CE: Using ToolHelpAPI to get more information about running processes, and I used it as is, without any changes, for this test. The variable ShowProc is used to control the output of the process and thread information. I only want to output that information once when the problem occurs, and then only output it again if the problem clears and then reoccurs.
· Raise the thread priority of the low priority thread by one each time the heart beat thread runs until the low priority thread runs again and updates the LowPriorityTicks. On Windows CE a lower number is a higher priority, which is why the code subtracts one.
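With this, once the low priority thread starts running again, it must be at the same priority as the out-of-control thread, because threads at the same priority take turns on the CPU. The last priority reported by the RETAILMSG output is therefore the priority of the thread that I am looking for.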
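One detail of the 60 second check that is easy to miss: the tick count is a 32-bit millisecond counter that eventually wraps back to zero, and the subtraction still gives the right answer because it is done with unsigned arithmetic. Here is a minimal, portable sketch of that comparison. The helper name is_stalled() and its parameters are mine for illustration, they are not part of the code that ran on the device.

```c
#include <stdint.h>

/* Hypothetical helper (not in the original code): returns 1 when more than
   threshold_ms milliseconds have passed since last_ticks was recorded.
   Unsigned subtraction wraps modulo 2^32, so the elapsed time is still
   correct when the counter has rolled over between the two readings. */
static int is_stalled(uint32_t now, uint32_t last_ticks, uint32_t threshold_ms)
{
    /* Zero means the low priority thread has not reported yet,
       which mirrors the LowPriorityTicks && ... test above. */
    if (last_ticks == 0)
        return 0;

    return (uint32_t)(now - last_ticks) > threshold_ms;
}
```

For example, is_stalled(0x0000F000, 0xFFFFF000, 60000) returns 1: the counter wrapped between the two readings, but the difference still works out to 65536 ms, which is more than the 60000 ms threshold.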
To test that this code works, I then started a thread that goes out of control after a minute and a half. This thread runs at priority 75.
static DWORD WINAPI BusyThread(LPVOID p)
{
    volatile DWORD Count = 0;
    CeSetThreadPriority( GetCurrentThread(), 75 );
    Sleep( 90000 );
    while( 1 )
    {
        Count++;
    }
    return 0;
}
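To get a feel for how long the search takes with this test thread, here is a small portable sketch that counts the one-step priority raises needed to reach a given priority. I am assuming that the low priority thread starts at 251, which is what the default (normal) thread priority maps to on Windows CE; check that on your own system. The helper name steps_to_match() is hypothetical.

```c
/* Hypothetical helper (not in the original code): counts how many one-step
   raises it takes for a thread at start_priority to reach busy_priority.
   A smaller number is a higher priority on Windows CE, so each raise
   subtracts one. Returns 0 if the thread is already at or above it. */
static unsigned steps_to_match(unsigned start_priority, unsigned busy_priority)
{
    unsigned steps = 0;

    while (start_priority > busy_priority) {
        start_priority--;   /* same effect as CeSetThreadPriority( h, Priority - 1 ) */
        steps++;
    }
    return steps;
}
```

With the assumed start of 251 and the test thread at 75, steps_to_match(251, 75) returns 176. The heart beat thread makes one raise per second, so the low priority thread should start running again roughly three minutes after the 60 second threshold is crossed.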
The good news is that this worked well in the field. With it, we identified the out-of-control thread, and we think that we have a solution to the problem.
Copyright © 2009 – Bruce Eitman
All Rights Reserved