Recently we had a problem with our IIS-hosted WCF services, which listen on the Windows Azure Service Bus Relay. The symptoms were as follows:
- Only the production environment was affected
- All other environments were fine
- Both servers hosting the services went down at approximately the same time
When we tried to restart them, we got a certificate trust verification error like the one below.
WebHost failed to process a request.
Sender Information: System.ServiceModel.ServiceHostingEnvironment+HostingManager/35320229
System.ServiceModel.ServiceActivationException: The service '/MyIISApplication/MyWCFService.svc' cannot be activated due to an exception during compilation.
The exception message is: The X.509 certificate CN=servicebus.windows.net, OU=WindowsAzure, O=Microsoft, L=Redmond, S=WA, C=US chain building failed.
The certificate that was used has a trust chain that cannot be verified. Replace the certificate or change the certificateValidationMode.
The revocation function was unable to check revocation because the revocation server was offline.
. ---> System.ServiceModel.Security.SecurityNegotiationException: The X.509 certificate CN=servicebus.windows.net, OU=WindowsAzure, O=Microsoft, L=Redmond, S=WA, C=US chain building failed. The certificate that was used has a trust chain that cannot be verified. Replace the certificate or change the certificateValidationMode. The revocation function was unable to check revocation because the revocation server was offline.
---> System.IdentityModel.Tokens.SecurityTokenValidationException: The X.509 certificate CN=servicebus.windows.net, OU=WindowsAzure, O=Microsoft, L=Redmond, S=WA, C=US chain building failed. The certificate that was used has a trust chain that cannot be verified. Replace the certificate or change the certificateValidationMode. The revocation function was unable to check revocation because the revocation server was offline.
We had come across this previously in the test and development environments but it only ever happened very occasionally. Normally we had been able to clean the credential cache or restart the app pools and it had always just worked. We had also reviewed some of the other articles online about similar errors and possible fixes but none of them had ever seemed to work. Since the problem didn't really affect us in test/dev and always went away easily it had never been given too much airtime.
This week we had a bigger issue: the production service had been running fine for months but suddenly stopped, and none of the old workarounds made any difference. To keep this article short we'll skip some of the diagnostic steps we took while troubleshooting. The short version is that we changed the user account running the app pool on one server and that server started working. On the other server the same steps didn't work.
We had restored the production service, but we could not get it working with the expected configuration and were still getting the above error on one server.
At this point we engaged Microsoft support through our Azure support agreement. While working with one of their engineers we found, using netmon and the CAPI 2.0 logging available via Event Viewer, that some of the certificates could not be verified and that errors were being logged. This corresponded with entries in our proxy server logs showing some URLs being blocked. The blocked URLs were:
Our current configuration is as follows:
- We have two servers which are listeners for Azure Service Bus Relay
- Our firewall allows outbound connections from the two servers to the Azure datacentre over ports 80, 443 and 9350-9354
- We configure the proxy server for access to our Azure Service Bus namespace's ACS endpoint
- We configure the proxy server for access to a couple of other URLs which seem to be required. We used the ones from the general guidance available online and also looked for any others which might be required during our early stages.
Our proxy configuration at the time was as follows:
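The firewall rule in the list above can be checked from each listener server with a small script. This is a minimal sketch rather than anything we ran during the incident: it only tests whether a TCP connection can be opened on each port, the function name check_ports is our own for illustration, and the host you pass in (for example your own namespace's Service Bus host name) is an assumption you need to supply.

```python
import socket

# Ports the firewall rule described above allows outbound to the Azure datacentre.
RELAY_PORTS = [80, 443] + list(range(9350, 9355))


def check_ports(host, ports, timeout=3.0):
    """Return {port: True/False} for whether a TCP connection could be opened."""
    results = {}
    for port in ports:
        try:
            # create_connection resolves the host and attempts a TCP handshake.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # Covers refused connections, timeouts and DNS failures alike.
            results[port] = False
    return results
```

Calling check_ports("<your namespace host>", RELAY_PORTS) on each server gives a quick view of which ports are reachable. A successful TCP connection does not prove the relay will work, and this check bypasses the proxy entirely, so it says nothing about proxy rules.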
- <IP address of the on-premise servers>
- <My namespace>-sb.accesscontrol.windows.net
From the CAPI log and the netmon trace we could see that there were issues accessing these certificate related resources which we assume would be updates to certs or revocation lists. We were seeing things like:
- HTTP 403 Forbidden error code
- Proxy returning error 'X-Squid-Error: ERR_ACCESS_DENIED 0'. In other words, the proxy was not allowing traffic to the URLs above.
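The two symptoms above (an HTTP 403 together with an X-Squid-Error header reporting ERR_ACCESS_DENIED) can be tested for directly by requesting a URL through the proxy. The following is a rough sketch under some assumptions: the URL being probed is plain http, the proxy address is whatever your environment uses, and the function names is_proxy_denial and probe are hypothetical names chosen for this example.

```python
import urllib.error
import urllib.request


def is_proxy_denial(status, headers):
    """True when a response looks like the proxy itself refusing the request."""
    squid_error = headers.get("X-Squid-Error", "")
    return status == 403 and squid_error.startswith("ERR_ACCESS_DENIED")


def probe(url, proxy_url, timeout=10.0):
    """Fetch a plain-http url through proxy_url; return (status, denied_by_proxy)."""
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy_url})
    )
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.status, is_proxy_denial(resp.status, resp.headers)
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx; the proxy's headers are still available.
        return err.code, is_proxy_denial(err.code, err.headers)
```

Running probe against each certificate related URL seen in the CAPI log, from the listener servers, would show whether the proxy or the remote server is the one rejecting the request.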
In addition to the configuration above, our WCF service, which has been in production for a while, uses the 1.6 SDK. That version has since been superseded by several releases. The service hadn't changed for a while, but it hadn't needed to.
Based on the support call, your experience with this error could differ slightly depending on which version of the SDK you are using, as outlined below.
Service Bus SDK 1.8 or Above
You should no longer get this issue because the SDK no longer checks for certificate revocation.
Service Bus SDK 1.7
This can be worked around by using the following snippet in the configuration file.
<add key="Microsoft.ServiceBus.X509RevocationMode" value="NoCheck"/>
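Since a missing or misplaced key is easy to overlook, a deployment check can confirm the setting is present. This sketch assumes the snippet lives in the appSettings section of the service's configuration file, which is what the add key/value form suggests; the function name revocation_mode and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

KEY = "Microsoft.ServiceBus.X509RevocationMode"


def revocation_mode(config_xml):
    """Return the configured value for KEY, or None if the key is absent."""
    root = ET.fromstring(config_xml)
    # Assumes the setting sits under <configuration><appSettings>.
    for add in root.findall("./appSettings/add"):
        if add.get("key") == KEY:
            return add.get("value")
    return None


# Hypothetical, cut-down configuration file used only to exercise the check.
SAMPLE = """<configuration>
  <appSettings>
    <add key="Microsoft.ServiceBus.X509RevocationMode" value="NoCheck"/>
  </appSettings>
</configuration>"""

print(revocation_mode(SAMPLE))  # NoCheck
```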
You should probably still consider looking into your proxy server to check what is being blocked.
Service Bus SDK 1.6
In our case this was related to blocked addresses on our proxy server. We modified the settings on our Squid proxy to allow the following.
<My Namespace Goes here>-sb.accesscontrol.windows.net
There are a couple of lessons we can take away from this.
- We need to set something up to report blocked addresses from our proxy server for this kind of situation. This had been working fine for ages, then this week the network retrieval of certificate information was blocked, and we need to know if this ever happens again before it affects service. When our solution is configured correctly we don't expect it to use any URLs that would be blocked, since they should all relate to the solution, but it seems we were not aware of all of them.
- We need to agree a standard for updating the SDK. This component hadn't changed for months, yet it is already 5 versions behind the latest SDK, which is now 2.1.
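As a starting point for the first lesson, the proxy's access log can be scanned for denied requests coming from the listener servers. The sketch below assumes Squid's default native access.log layout (timestamp, elapsed, client, result/status, bytes, method, URL, ...). The sample lines, IP addresses and host names are invented for illustration and are not taken from our logs.

```python
from collections import Counter
from urllib.parse import urlsplit


def denied_hosts(log_lines, client_ips):
    """Count TCP_DENIED entries per destination host for the given clients."""
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if len(fields) < 7:
            continue  # not a native-format access.log line
        client, result, method, url = fields[2], fields[3], fields[5], fields[6]
        if client not in client_ips or not result.startswith("TCP_DENIED"):
            continue
        if method == "CONNECT":
            host = url.rsplit(":", 1)[0]  # CONNECT entries log host:port
        else:
            host = urlsplit(url).hostname or url
        counts[host] += 1
    return counts


# Invented sample entries; real logs will differ.
SAMPLE_LOG = [
    "1375700000.101 2 10.0.0.11 TCP_DENIED/403 3651 GET http://crl.example.test/root.crl - NONE/- text/html",
    "1375700001.202 150 10.0.0.11 TCP_MISS/200 1200 CONNECT mynamespace-sb.accesscontrol.windows.net:443 - DIRECT/203.0.113.5 -",
    "1375700002.303 1 10.0.0.12 TCP_DENIED/403 3651 GET http://crl.example.test/root.crl - NONE/- text/html",
    "1375700003.404 1 10.0.0.99 TCP_DENIED/403 3651 GET http://other.example.test/ - NONE/- text/html",
]

for host, count in denied_hosts(SAMPLE_LOG, {"10.0.0.11", "10.0.0.12"}).items():
    print(host, count)  # crl.example.test 2
```

Run on a schedule against the real log and wired to an alert, something like this would have flagged the blocked certificate URLs before the services failed to restart.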
To find out more about using the CAPI 2.0 logging, refer to http://www.entrust.net/knowledge-base/technote.cfm?tn=8165