Here is the interesting problem of the week... finally figured it out!
All of a sudden this Monday, our clients started getting "Service Unavailable" errors. I looked deeper into it, and it turned out the Application Pool on IIS6 for the ASP .Net app had crashed. So I restarted it, and everything started working again.
But then it started happening intermittently, about twice a day. Big problem.
The event log on the server showed:
Application pool 'XXX' is being automatically disabled due to a series of failures in the process(es) serving that application pool.
This "crash" was preceeded by a few warnings showing:
A process serving application pool 'XXX' suffered a fatal communication error with the World Wide Web Publishing Service. The process id was '4280'. The data field contains the error number.
Data:
0000: 8007006d
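As an aside, that data field is an HRESULT, and 0x8007xxxx values are just Win32 error codes wrapped in the low 16 bits. If I'm reading this one right, 0x6d is Win32 error 109, a broken pipe, which at least fits the "fatal communication error" wording. Here's a throwaway C# sketch to decode it; nothing in it is specific to the app, it's just bit masking plus a message lookup:

    using System;
    using System.ComponentModel;

    class DecodeEventData
    {
        static void Main()
        {
            // The hex value from the event's data field.
            uint hresult = 0x8007006d;

            // Facility 7 HRESULTs (0x8007xxxx) wrap a plain Win32 error
            // code in the low 16 bits.
            int win32Code = (int)(hresult & 0xFFFF);

            // Win32Exception looks up the system message for that code.
            // For 0x8007006d that's error 109 (ERROR_BROKEN_PIPE).
            Console.WriteLine("Win32 error {0}: {1}",
                win32Code, new Win32Exception(win32Code).Message);
        }
    }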
So googling the errors at first, I came across MSDN documents and blogs talking about installing debuggers, getting stack traces and thread dumps, and analyzing them for errors in the code. Although that would probably be helpful in some cases, it didn't help me. Nothing wrong with my code.
Anyway, a couple of days passed by and I installed Microsoft's debug analyzer. By the way, two good articles on this are:
http://blog.whitesites.com/Debugging-Faulting-Application-w3wp-exe-Crashes__634424707278896484_blog.htm
http://support.microsoft.com/kb/919789
Didn't find anything.
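(If you do want to go the dump route, the usual recipe is to grab a crash dump of the worker process with ADPlus from the Debugging Tools for Windows and open it in WinDbg; the output folder below is just an example:

    adplus -crash -pn w3wp.exe -o c:\dumps

then load the resulting dump in WinDbg and run !analyze -v to get a first guess at the faulting stack.)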
Finally today, looking at it again, we came across a setting called "Enable rapid-fail protection" in the Application Pool properties. It was enabled and set to 5 failures. Looking back at the event logs, I realized there was a pattern: every crash had exactly 5 of those warnings right before it. Always exactly 5, each one like the example above, where a worker process errored out and its process id was logged. There had been similar warnings in the log before, but never more than 3 of them back to back. The underlying errors seem to be unhandled exceptions in the ASP .Net app, a lot of them coming from the framework itself: sometimes timeouts, sometimes file-not-found, and other things.
So basically we were telling the server: if you get 5 of these failures within a span of 5 minutes, just shut the application pool down. It's now increased to 10, so that gives us some breathing room. I don't want to turn it off completely, because if some legit issue causes widespread process crashes, the server would get overloaded and the whole machine would come down. Right now, rapid-fail protection prevents that by shutting down the application pool if a bunch of worker processes fail back to back. Aha, it all makes sense now... almost!
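For reference, the same knobs live in the IIS6 metabase, so the change can also be scripted instead of clicked through the UI. If memory serves, the relevant properties are RapidFailProtectionMaxCrashes and RapidFailProtectionInterval (in minutes), set per application pool with adsutil.vbs; the pool name here just matches the 'XXX' from the event log:

    cscript %SystemDrive%\Inetpub\AdminScripts\adsutil.vbs SET W3SVC/AppPools/XXX/RapidFailProtectionMaxCrashes 10
    cscript %SystemDrive%\Inetpub\AdminScripts\adsutil.vbs GET W3SVC/AppPools/XXX/RapidFailProtectionInterval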
I already have global exception handling with logging turned on in the ASP .Net app, so the next step is to figure out why these errors are crashing the worker process instead of just being handled and dying off peacefully. But at least they're not causing the dreaded Service Unavailable error anymore. Yay!
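For anyone wondering what "global exception handling" means here: in ASP .Net it's typically an Application_Error hook in Global.asax, something along these lines (LogError is a made-up stand-in for whatever logging you actually use):

    // Global.asax.cs -- rough sketch of a global error hook, not the app's actual code.
    using System;
    using System.Web;

    public class Global : HttpApplication
    {
        protected void Application_Error(object sender, EventArgs e)
        {
            // Last unhandled exception raised during the current request.
            Exception ex = Server.GetLastError();
            if (ex == null)
                return;

            LogError(ex);

            // Mark the error as handled so the request dies off peacefully
            // instead of bubbling up any further.
            Server.ClearError();
        }

        // Hypothetical stand-in; in reality this would be log4net, the event log, etc.
        private void LogError(Exception ex)
        {
            System.Diagnostics.Trace.TraceError(ex.ToString());
        }
    }

One theory I still need to rule out: as far as I know, Application_Error only sees exceptions thrown on the request thread, and since .NET 2.0 an unhandled exception on a background or thread-pool thread takes the whole worker process down regardless, which would explain crashes that never show up in my own logs.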