How much Azure kit do you need to serve 2,000 requests per second?

I met with a prospective client looking to move an existing API to Azure, and they had an interesting problem.

Part of their API is supporting over-the-air (OTA) auto-update, so an app calls home, finds out if it has the latest version and if not downloads it from a blob. At peak times, they need to handle 2,000 requests per second for a sustained period of 2-3 hours.

That doesn't sound like a lot of scale really, but they need *all* requests to complete successfully - they can't permit any dropped requests or connection timeouts, as those would leave users staring at an hourglass, and not very happy.

In the current implementation, that runs as a synchronous API call, and they wanted me to whiteboard some alternative options. The API returns a JSON resource describing the current app version - version number, blob URL for the latest version, and MD5 checksum for the blob, something like this:

{
  "version":"2014-04-01_32764532",
  "blobPath":"https://blob.windowsazure.net/xyz/devices/ota/2014/04/2014-04-01_32764532.zip",
  "blobChecksum":"4e107d9d372cc6826bd81d4542a429d3",
  "isOptional":false
}

If that's all there is to it then the obvious answer is to serve that as static content from a CDN, but I guess there's a bit more involved. As a minimum they'll want to log when the API gets accessed, so they can collect usage stats and drive future planning. But that logging process isn't critical, whereas retrieving the update version is.

So how much Azure kit do you need to support 2,000 concurrent requests? I suggested some options, and afterwards put together a quick demo to see how much kit you need for four different approaches, moving from synchronous RPC-style to CQRS.

I tested these out using Blitz, configured to send 2,000 concurrent requests with a 30 second timeout. It's not a full-on performance test - I just wanted to get an idea of how the approaches compared - so the test runs are short (30 seconds), and I ran each test just twice: first to warm up the environment and second to get the results. If I were investing more time in this, I'd be running repeated tests for much longer periods.

The Blitz rush was set up to send in a constant load of 2,000 concurrent requests, and the output tells me how many hits per second were served. To meet the requirement of supporting 2,000 requests per second, I want all requests to return within a second so we're not building a backlog - so the aim is a solution where Blitz reports 2,000+ hits per second.

Option 1 – Synchronous End-to-End

 


Simulating the current solution, using an Azure Web Role for the REST API, with multiple instances behind the load balancer, and SQL Azure for persistence. The client sends an HTTP GET request to the API, the API logs access synchronously to SQL and returns the current OTA payload when the logging completes.
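
As a rough sketch, the handler for this shape looks something like the code below - assuming an ASP.NET Web API controller and an AccessLog table; all the names here are my own placeholders, not the client's actual code:

using System.Configuration;
using System.Data.SqlClient;
using System.Web.Http;

// Placeholder type matching the JSON payload above
public class OtaVersion
{
    public string Version { get; set; }
    public string BlobPath { get; set; }
    public string BlobChecksum { get; set; }
    public bool IsOptional { get; set; }
}

public class OtaController : ApiController
{
    private static readonly string ConnectionString =
        ConfigurationManager.ConnectionStrings["AccessLog"].ConnectionString;

    // GET /api/ota?deviceId=xyz - log synchronously, then return the payload
    public OtaVersion Get(string deviceId)
    {
        using (var connection = new SqlConnection(ConnectionString))
        using (var command = new SqlCommand(
            "INSERT INTO AccessLog (DeviceId, AccessedAt) VALUES (@deviceId, GETUTCDATE())",
            connection))
        {
            command.Parameters.AddWithValue("@deviceId", deviceId);
            connection.Open();
            command.ExecuteNonQuery(); // the response is blocked until the insert commits
        }
        return GetCurrentVersion();
    }

    private static OtaVersion GetCurrentVersion()
    {
        // in reality this would come from config or cache
        return new OtaVersion
        {
            Version = "2014-04-01_32764532",
            BlobPath = "https://blob.windowsazure.net/xyz/devices/ota/2014/04/2014-04-01_32764532.zip",
            BlobChecksum = "4e107d9d372cc6826bd81d4542a429d3",
            IsOptional = false
        };
    }
}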

1.1 - Running with 5 medium instances (2 cores and 2.5GB RAM).

Without autoscale, this would cost around 250GBP per month.

Results
All 200 responses - no errors
Response time - avg 4.3 sec, min 0.5, max 11.1
Requests processed – 9,601
349 hits/sec

1.2 - Scaling up to 10 medium instances.

Without autoscale this would cost around 570GBP/month.

Results
97% 200 responses; 492x 500 server errors
Response time - avg 1.0 sec, min 0.5, max 1.5
Requests processed – 15,236
554 hits/sec


With more kit we see better performance, but the improvement is not linear - 200% scale only yields 160% performance. We can't reliably extrapolate from that to find how many instances we'd need to handle our load, but we can comfortably say that 20 instances is probably not going to give us even half of what we need. And we're getting 3% error responses when we up the scale, which is not acceptable, so this architecture isn't going to do it for us.

Option 2 – Synchronous Request-Response, Asynchronous Access Logging


Still using an Azure Web Role for the REST API and SQL Azure for persistence. The client sends an HTTP GET request to the API, then the API starts a task to log access to SQL asynchronously and returns the OTA payload without waiting for logging to complete.
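
The only change from the option 1 sketch is moving the insert onto a background task - again with my placeholder names, and this needs System.Threading.Tasks:

// Same placeholder controller as the option 1 sketch - only the logging changes
public OtaVersion Get(string deviceId)
{
    // Kick off the SQL insert on a thread-pool thread and return immediately;
    // if the task faults the log entry is lost, which is acceptable here
    Task.Run(() => LogAccess(deviceId));
    return GetCurrentVersion();
}

private static void LogAccess(string deviceId)
{
    using (var connection = new SqlConnection(ConnectionString))
    using (var command = new SqlCommand(
        "INSERT INTO AccessLog (DeviceId, AccessedAt) VALUES (@deviceId, GETUTCDATE())",
        connection))
    {
        command.Parameters.AddWithValue("@deviceId", deviceId);
        connection.Open();
        command.ExecuteNonQuery();
    }
}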

2.1 - 5 medium instances.

Results
All 200 responses - no errors
Response time - avg 2.1 sec, min 0.5, max 2.9
Requests processed – 14,352
521 hits/sec

2.2 - 10 medium instances

Results
All 200 responses - no errors
Response time - avg 1.7 sec, min 0.4, max 3.6
Requests processed – 15,051
547 hits/sec


This option starts promisingly - with 5 instances, it runs almost as well as option 1.2 with 10 instances, but the scaling isn't good. Doubling the number of instances only gives a tiny improvement in performance. The 2.2 stats are a bit suspicious, so I'd like to look at those again, but we aren't seeing a massive improvement with this approach.

When we create tasks to do the access logging, they'll be running as threads in the IIS process, which uses a thread-per-request model to serve incoming traffic. We're separating the compute load, but it's within the same process so we'll have competition for threads.

Option 3 – Synchronous Request-Response + Fire-and-Forget Messaging


Still using an Azure Web Role for the REST API, but pushing the access logs out to a separate component - sending a message to an Azure Service Bus Queue. The client sends an HTTP GET request to the API, the API puts a message on the queue for the access log to be persisted and returns the OTA payload without waiting for the message to be processed.
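
Sketching that with the Service Bus SDK of the day (Microsoft.ServiceBus.Messaging) - the queue name, config key and message shape are my assumptions:

using System;
using System.Configuration;
using Microsoft.ServiceBus.Messaging;

public static class QueueAccessLogger
{
    // QueueClient is thread-safe and meant to be shared across requests
    private static readonly QueueClient Client =
        QueueClient.CreateFromConnectionString(
            ConfigurationManager.AppSettings["ServiceBus.ConnectionString"], "accesslog");

    public static void LogAccess(string deviceId)
    {
        var message = new BrokeredMessage();
        message.Properties["DeviceId"] = deviceId;
        message.Properties["AccessedAt"] = DateTime.UtcNow;

        // Fire-and-forget: the HTTP response doesn't wait for the send to complete,
        // and a failed send just means a lost log entry
        Client.SendAsync(message);
    }
}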

This isn't strictly a fair comparison - in my tests I didn't have any Worker Role message handlers polling the queue and writing to SQL. But that's intended to be an offline process, and I know that polling the queue for reads shouldn't significantly impact the performance of the API writing messages.

3.1 - 5 medium instances.

Results 
All 200 responses - no errors
Response time - avg 1.3 sec, min 0.9, max 3.2
Requests processed – 24,200
880 hits/sec

3.2 - 10 medium instances.

Results 
All 200 responses - no errors
Response time - avg 1.6 sec, min 0.5, max 3.2
Requests processed – 16,405
597 hits/sec


The first run looks good here, as I would expect - posting to a queue is going to support better concurrency than inserting to a SQL database, as SQL needs to manage contention and locking. I was expecting 3.2 to get close to the requirement, not quite linear scaling but maybe up to 1,500 requests per second.

As it happened, 3.2 is very suspicious. I'd say it's an anomaly, but both the 3.2 runs were about the same, so maybe it was something environmental happening at the time of the runs. I'd be very surprised if adding more boxes posting to queues caused the performance to drop, so this is definitely one I'd investigate further with more time.

Option 4 – Client-driven CQRS


No API at all. Current OTA payload stored as JSON in a blob on Azure storage. Client gets current OTA from static content, then posts a message to the Service Bus queue to log access. The post to the queue's REST API can be asynchronous, and the client doesn't care when/if it gets processed.

So the client owns the process, and uses HTTP for both parts - querying one resource and sending an access log command to a second resource.
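
From the client's side the whole flow is two HTTP calls, roughly like the sketch below - the URLs are placeholders, and generating the Service Bus SAS token is omitted:

using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

public class OtaClient
{
    private static readonly HttpClient Http = new HttpClient();

    public static async Task<string> CheckForUpdateAsync(string deviceId)
    {
        // Query: fetch the current OTA payload straight from public blob storage
        var payload = await Http.GetStringAsync(
            "https://xyz.blob.core.windows.net/ota/current.json");

        // Command: log access via the Service Bus queue's REST endpoint -
        // fire-and-forget, the client doesn't wait on or care about the outcome
        var request = new HttpRequestMessage(HttpMethod.Post,
            "https://xyz.servicebus.windows.net/accesslog/messages");
        request.Headers.Authorization = new AuthenticationHeaderValue(
            "SharedAccessSignature", "sr=..."); // SAS token generation omitted
        request.Content = new StringContent("{\"deviceId\":\"" + deviceId + "\"}");
        var ignored = Http.SendAsync(request); // deliberately not awaited

        return payload;
    }
}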

You can't test the Service Bus REST API through Blitz - there's no way to configure it so Blitz knows the resource is yours, that you'll be paying the bill, and that you're not using them for a DDoS attack. So this test was just for the first part, to see if blob storage can support the load. As the client would be posting the message asynchronously without caring about the response, it's worth seeing how the query performs.

Results 
All 200 responses - no errors
Response time - avg 0.1 sec, min 0.07, max 0.6
Requests processed – 58,000
2,109 hits/sec


So that's pretty impressive. Blob storage can meet the query requirements without any special configuration - I just uploaded the JSON file to a public-blob container and pointed Blitz at the address.

In this test, to serve 58,000 requests we shipped about 50MB of data from Azure in 30 seconds. That equates to about 6GB/hour, so running this level of load 24/7 would cost around 185GBP/month. There's no need to configure autoscale, as you only pay for the amount of data transferred.
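
Checking the arithmetic: 50MB in 30 seconds is 100MB a minute, or 6GB an hour, and running 24/7 that's about 6 x 24 x 30 = 4,320GB a month - so the 185GBP figure implies roughly 0.04GBP per GB of egress (a rate backed out from those numbers, not a quoted price).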

Summary

So the final option is a nice, clear winner. It needs a bit more work from the client, but it's still straightforward and it gives the client control over which parts of the process it thinks are important.

Without knowing the full requirements, there may be more that needs to be done in the API, so option 3 - where the client makes a synchronous query call to the API and the API sends a command message to a queue - may be worth pursuing.

A hybrid of 2 and 3 could be possible as well - the API using an internal in-memory queue, with a dedicated thread relaying messages to the Service Bus queue (sketched below). That way you wouldn't have thread contention in the IIS process and you'd be able to serve more HTTP requests, with the risk that if you lose the process you lose any pending logs.
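
A minimal sketch of that relay, built on a bounded in-memory buffer and the same placeholder queue as the option 3 sketch - my own construction, not something from the client's system:

using System.Collections.Concurrent;
using System.Configuration;
using System.Threading.Tasks;
using Microsoft.ServiceBus.Messaging;

public static class BufferedAccessLogger
{
    // Bounded buffer: Add blocks if the relay falls 10,000 messages behind
    private static readonly BlockingCollection<string> Buffer =
        new BlockingCollection<string>(boundedCapacity: 10000);

    static BufferedAccessLogger()
    {
        // One dedicated relay thread - request threads never touch Service Bus
        Task.Factory.StartNew(RelayLoop, TaskCreationOptions.LongRunning);
    }

    public static void LogAccess(string deviceId)
    {
        Buffer.Add(deviceId); // cheap in-memory enqueue on the request thread
    }

    private static void RelayLoop()
    {
        var client = QueueClient.CreateFromConnectionString(
            ConfigurationManager.AppSettings["ServiceBus.ConnectionString"], "accesslog");

        // Anything still in the buffer when the process dies is lost -
        // that's the trade-off described above
        foreach (var deviceId in Buffer.GetConsumingEnumerable())
        {
            var message = new BrokeredMessage();
            message.Properties["DeviceId"] = deviceId;
            client.Send(message);
        }
    }
}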

So how much Azure kit do you need to serve 2,000 requests per second? In this case: just one blob.

And all the architecture and infrastructure that sits behind Azure Storage, of course.

Posted on Thursday, April 10, 2014 1:06 PM | Architecture, Azure


Comments on this post: How much Azure kit do you need to serve 2,000 requests per second?

# re: How much Azure kit do you need to serve 2,000 requests per second?
Have you thought about using mobile services to write directly to SQL Azure or to table storage? Or how about directly from the app itself?
Left by Dariel on Apr 10, 2014 3:47 PM

# re: How much Azure kit do you need to serve 2,000 requests per second?
It would be nice to see a GitHub repository with the projects, to check them out and run our own tests.

J
Left by Jorge on Apr 11, 2014 10:35 AM

# re: How much Azure kit do you need to serve 2,000 requests per second?
Dariel - nice idea on mobile services, one to follow up on.

Jorge - GitHub repository is coming soon (although there's not much to it...)
Left by Elton on Apr 11, 2014 10:37 PM
