Over the past several weeks I’ve found myself having to play the role of server administrator more and more. Being a developer I’ve had some exposure to server admin work in the past, but I certainly don’t consider myself an expert in the area. Unfortunately there was no one else to do it, so theses tasks started to come my way. Just before Christmas I was driving down to south Florida to visit family when I saw an urgent message from work stating that people were having issues logging in to our systems.
For some background, the environment I had been working with is a collection of servers running various pieces of the Microsoft stack. There are SharePoint sites/sub-sites, SQL Server Reporting Services installations, custom ASP .NET applications and some basic static web content. Most of these resources are loosely related at best and span multiple server farms and domain names. The only thing they all do share in common is that they all rely on Active Directory Federation Services (ADFS) to handle authentication and authorization to restricted data and resources. ADFS provides a Single Sign On of sorts, allowing users to authenticate once time per browser session and navigate among all of the reports, web applications, and SharePoint sites they might need to access without having to re-login constantly. The very first time you attempt to access one of the restricted resources you’re bounced to an ADFS proxy server which captures your user name and password. Upon successfully authenticating you’re sent back on your way to the URL that you had originally requested. If you then need to jump over to another application on another machine your authentication credentials move with you and you’re not prompted to login again. When this all works, it works very well. When it doesn’t, however, you’re pretty much dead in the water.
On the occasion I’m writing about now, “dead in the water” was a pretty good way to describe the situation. Users were requesting a restricted resource, being directed to login, entering valid credentials, and then getting the ASP .NET Yellow Screen of Death (it at least was the vague YSOD, not the one that includes the full stack trace). As I was soon to discover, there are a number of moving parts behind the scenes in this login process, which means several potential failure points. At an overly simplified and extremely high level, it looks something like this (if you’re at all familiar with ASP .NET Forms Authentication this might look similar):
- User requests the home page of application XYZ
- IIS on the server hosting the XYZ app determines the web application that needs to handle the request.
- The web application in question has the ADFS Web Agent installed and configured. The agent intercepts this request to see if the proper Single Sign On authentication cookies are present.
- The web agent doesn’t find the needed cookies, and redirects the user to the ADFS proxy server (the URL of which is set in the web agent configuration on each machine).
- The user’s browser is automatically directed to the ADFS proxy server’s login page with several query string parameters, one of which is the URL of the XYZ application. This will allow the proxy to send you back to the originally requested URL after authentication.
- User enters login credentials on the ADFS proxy page. The proxy makes a few web service calls to an ADFS server which in turn talks with an Active Directory domain controller to perform the actual authentication. If successfully authenticated, a SAML token is created by the ADFS server and returned to the proxy. The proxy then constructs an authentication cookie that will allow you to access the various restricted resources that have a Single Sign On relationship.
- Once the auth cookie has been set, the proxy redirects the user back to originally requested resource, the resource sees this auth cookie and access is granted.
With the issue that our setup was experiencing, the error appeared to be occurring somewhere between steps 6 and 7. Immediately after providing credentials, everything was blowing up. We hadn’t made any configuration changes, Windows updates, or deployed any new code anywhere in weeks, so the typical “let’s try rolling the latest change back” wasn’t an option.
I decided to have a look at the Event Log on the ADFS server which showed nothing out of the ordinary. The Event Log on the ADFS Proxy server, however, was showing an error indicating that multiple authentication requests for the same user were coming in too rapid a succession. Not terribly helpful.
I was fortunate to have a test environment configured in a very similar way that as not experiencing the issue, so I decided to fire up Fiddler so I could create a step by step view of what should happen in this process and then compare it with what was happening in our production environment. Fiddler let me compare the HTTP requests and cookies between the two environments and confirmed that behavior was working correctly.
I finally decided to poke around some on one of the web servers that hosts the most widely used application in the environment to see what I could turn up. It turns out that the ADFS Web Agent does some logging to text files. Following these logs I could see a request come in, no cookies being found, and then a subsequent request coming in several seconds later (i.e. after the user had been directed to the ADFS proxy server and had authenticated) and that request failing with a message indicating that the authentication token had expired. In that case, the user was being sent straight back to the ADFS proxy server which was seeing that multiple requests for the same user were coming rapid fire so it blows up. I imagine that this is some kind of throttling mechanism to help prevent denial of service and/or brute force password cracking attempts.
So this explained why the ADFS proxy server was displaying an error and not allowing people in, but it didn’t explain why the web application server was reporting a freshly generated authentication cookie as being “expired”.
I mulled this over for a bit before having the “aha!” moment and going to compare the system clocks among the various servers involved in this process. Sure enough, I could see a difference of just over 5 minutes between the ADFS server and the web application servers. As it turns out, 5 minutes is the allowable threshold with ADFS. This makes sense, as you wouldn’t want some of your more enterprising users sniffing around in their local cookie store, grabbing the data from the authentication cookie and then being able to just use that same cookie value several weeks later after they’ve been let go from the organization. The cookies themselves expire as soon as the browser window closes, but you have to also include some kind of expiring value within the data of the cookie itself to prevent re-use. My best guess is that the authentication cookie value (which appears as just a long string of garbled text) contains encrypted data, one piece of which is UTC timestamp of when the cookie was initially created. The Web Agent that consumes that cookie must compare the timestamp to system time locally. I can only guess that UTC is used to accommodate time zone differences.
I had always known that it was important for servers in an enterprise to have system clocks in-sync, but this was the first time I had been bitten by it. Here’s what had happened to the best of my understanding:
The environment where these servers lived has two domain controllers. Some of the servers (as it happened the majority of the web application and SharePoint servers) were using Domain Controller B as their primary time source. Some of the servers (as it happened the ADFS server) was using Domain Controller A. Domain Controller A is configured to pull time from an external time server that’s synced with the atomic clock and Domain Controller B is supposed to sync its time with Domain Controller A. If this is all working correctly, this highly accurate time from the atomic clock will trickle down to all of the servers in the environment and they’ll stay in sync. For whatever reason, Domain Controller B had stopped syncing with Domain Controller A, and their system clocks slowly started drifting further and further apart. Eventually they reached that magic 5 minute time difference threshold and everything broke.
So what’s the moral? Obviously system time on servers is important for a wide variety of reasons, but that’s not the most important thing I learned out of this experience. If anything, I learned to think outside the box a bit when troubleshooting issues like this. I initially assumed that since the error screen was coming up on the ADFS proxy server this MUST be an ADFS configuration issue. I never bothered checking on the ADFS web agent on the application servers since it seemed like users were never even getting that far in the process.
Oh, and I definitely sharpened by HTTP debugging skills with Fiddler. If you do web development and aren’t using Fiddler or some comparable HTTP tracing utility you really should be :-)