Akka Test Patterns for Java

Jan 02, 2014

NOTE: These patterns assume you have a working understanding of Akka and Akka's JavaTestKit

Testing Actors In Complete Isolation

Akka is all about building hierarchies of actors to represent a system. I've found this is a great way to decompose, design, and, if needed, distribute a system. Your system becomes a hierarchy of actors, where parent actors create and manage their child actors, and child actors either carry out some unit of work or themselves become parent actors (or both).

But this can be a bit of a pain when it comes to writing tests for your actors. For example, say we have an actor, actorA, which in turn creates one or more child actors. I could have actorA create its child actors as part of its instantiation. This is attractive, especially if I know the exact child actors it's going to create.

But how do I go about testing actorA in complete isolation from its parent or child actors? My test needs to create an instance of actorA without having to create its parent actors or any of its child actors. If I create an ActorRef of actorA, it's going to automatically create all its child actors along with it. Not what I want. Not only that, but as my unit test sends messages to actorA, it's probably going to send messages to its children, which will in turn do some bit of work as a response to that message. Definitely not what I want (at least in an isolated test).

There are two standard ways around this. The first is to pass Props instances for the desired child actors into the parent actor's constructor. The second is to pass a known object as a message to the parent actor, which the parent actor uses to create a child actor. I use a simple ChildArgs class for this that has two fields: a Props for actor creation and a String for the actor's name. Either method makes unit testing easier, since if you want to spin up actorA for an isolated test, you just don't pass in any Props and no child actors get created. I tend to use the constructor method, because it makes it easier to define your actor hierarchy in Spring.

Testing a parent actor's child actor received a message

Testing an actor in isolation is a good goal, but often an actor will send a message to its child as a reaction to an incoming message. So how do we test this behavior if we don't want the actor under test to spin up its child actors? Use a ForwarderActor.

The nice thing about having parent actors create their children via passed in Props is that the parent can now create ANY actor as their child, not just the expected one. All you have to do is pass in a Props instance, and the actor will create a child.

Now, in order to capture and test messages that an actor should send to its child, you configure a Props that creates a ForwarderActor, and pass the ActorRef of a JavaTestKit probe into the ForwarderActor's constructor. Then you pass this Props to the constructor of the actor under test. The actor under test gets the Props instance, creates its child actor from it, and any message the actor under test sends to its child gets forwarded to, and captured by, the JavaTestKit instance.

For example:

new JavaTestKit(system) {{

  //create a probe instance to capture all messages passed to child
  final JavaTestKit childProbe = new JavaTestKit(system);
  //create Props to create fake child actor
  final Props childProps = Props.create(ForwarderActor.class, childProbe.getRef());
  //create Props for the actor being tested
  final Props workerProps = Props.create(SomeWorkerActor.class, childProps);
  //create the actor being tested
  final ActorRef actor = system.actorOf(workerProps, "someName");
  //send actor being tested a message
  actor.tell("do something", this.getRef());
  //test to see if child probe received a message in response to its parent getting "do something"
  //(a non-zero timeout gives the asynchronous message time to arrive)
  childProbe.expectMsgEquals(FiniteDuration.create(3, TimeUnit.SECONDS), "something happened");
}};

This allows us to assert that when we send a message to the actor being tested, it in turn sends a specific message to its child actor. We don't care that the child actor wasn't the one the parent was expecting. This way we can write tests that only exercise a single actor, yet we can also test that the messages it was supposed to send out to other actors were actually sent.

Testing an actor via its actor path

Here is another actor scenario that's hard to test in full isolation. What if the code being tested needs to reference an actor via its full Akka path, but you don't want to create all the actor's parents up the hierarchy?

For example, let's say we have a worker actor, whose parent is a supervisor actor, whose parent is a top level system manager actor (this is a totally made up scenario just to demonstrate the problem). The code under test references the worker actor via its path, "/user/system/supervisor/worker". Now, in order to call system.actorSelection("/user/system/supervisor/worker") the test needs to make sure there is a "system" actor, which has a child "supervisor" actor, which has a child "worker" actor.

To do this I use a ChildCreationActor. This is a very straightforward actor that only expects one type of message: a ChildArgs instance, which has a Props field for actor creation and a String field for the actor name. Using this actor we can create the necessary hierarchy of parent actors required to run the test.

new JavaTestKit(system) {{

    //create top level system actor
    ActorRef sysActor = system.actorOf(Props.create(ChildCreationActor.class), "system");

    //tell system actor to create second level supervisor actor
    sysActor.tell(
            new ChildCreationActor.ChildArgs(Props.create(ChildCreationActor.class), "supervisor"),
            this.getRef());

    //tell supervisor actor to create third level worker actor
    //(assumes the supervisor already exists by the time this message is sent)
    system.actorSelection("/user/system/supervisor").tell(
            new ChildCreationActor.ChildArgs(Props.create(SomeWorkerActor.class), "worker"),
            this.getRef());

    //now we can test the worker actor via its akka path
    system.actorSelection("/user/system/supervisor/worker").tell("stuff", this.getRef());
}};

We now have a way to write tests that interact with a worker actor via its Akka path, in complete isolation from the worker's parent actors.

Testing functions that send messages to an actor

One last pattern that I run into a lot when testing Akka systems is writing tests for components that send messages to an actor and expect some return message at some point, especially if the function under test uses Patterns.ask(). In order to test this function in isolation, we don't want the actually intended actor receiving the message, because this could have unintended downstream effects. But we do require the actor to send a reply message so our unit test will pass.

To handle this situation I use a ParrotActor, which is basically a mock actor. It's an actor that you pre-configure with expected input messages and the appropriate response message. Combine this with the ChildCreationActor and your test can create an actor with the appropriate path name, which will send the correct response to the input message your test invokes.

//test setup...create the ParrotActor with the actor name/path the test expects
final Props someActorProps = Props.create(ParrotActor.class);
final ActorRef someActor = system.actorOf(someActorProps, "someActor");
//register a response message with the parrot actor
someActor.tell(
        new ParrotActor.Squawk(String.class, "response message"),
        ActorRef.noSender());

//test code
ActorSelection actor = system.actorSelection("/user/someActor");
final Future<Object> futureRsp = Patterns.ask(actor, "input message", Timeout.apply(3, TimeUnit.SECONDS));
//use the future to get the response message

Currently the ParrotActor uses the class type of the input message to figure out what response message to send. I always send custom classes as actor messages instead of strings, but it would be easy to extend the ParrotActor to register individual string instances as messages and responses if that is what is required.

RIP DryadLinq, or long live LINQ-to-Hadoop?

Dec 12, 2011
About a month ago I read some very sad news. Microsoft announced that they were killing off the DryadLinq (or LINQ-to-HPC) project in favor of Hadoop.

I was one of the first users of DryadLinq outside of Microsoft, back when it was a pre-alpha project inside Microsoft Research. My company had a running HPC cluster and my boss convinced me to install DryadLinq on it to see what I could make it do. I worked with it for a year, and being a big LINQ and PLINQ fan, really enjoyed how easy it was to write non-cryptic code and get it to run in parallel across a cluster of machines. Fast forward two years, after spending a year working with Hadoop, and in my opinion DryadLinq beats Hadoop hands down. The key to DryadLinq's goodness was that you wrote the same algorithm code you would normally write, no matter if it's executing on one core, multiple cores, or multiple machines; it's just LINQ (as long as you keep in mind to pass all data into your lambda functions, and share nothing externally). You don't have to retrain your mind to think about algorithms in a different paradigm, like grouping and sorting key value pairs. DryadLinq dramatically sped up the development iteration cycle of designing algorithms compared to Hadoop, because you write your algorithm in the same code style to run across many machines as you would if it was just running on one machine.

The main item that DryadLinq lacked was any kind of DFS. DryadLinq sucked in that respect, and let's face it, without a good DFS, a distributed processing framework isn't all that functional. You basically had to fake a DFS by manually partitioning your data across all machines and generating an INI file that described how the data was partitioned. The DryadLinq runtime would read the INI file and use that as the basis for a DFS. But you had to write a fair amount of code if you wanted to automate the process of distributing your data.

But, even though I'm bummed that Microsoft killed off DryadLinq, I do have a glimmer of hope and here's why. First, last month Microsoft announced its support for coming up with a Windows compatible distribution of Hadoop. So Microsoft is committing to getting Hadoop running on Windows. Second, in Hadoop 0.23 one of the things they did was a full rewrite of the distributed execution model. In Next Gen Hadoop (Yarn), MapReduce has been written as an implementation on top of an abstract distributed job execution framework. Another implementation besides MapReduce that it'll support is a DAG (directed acyclic graph http://en.wikipedia.org/wiki/Directed_acyclic_graph) of jobs. Basically, a graph of nodes (jobs) where no node V can link to another node such that following the edges of the graph would loop back to V again.

Now put all that together and here is that glimmer of hope: the core coolness around DryadLinq was that it was a framework for analyzing a .Net LINQ expression tree and generating a DAG structure for it that executed on Windows HPC. At runtime, LINQ code is represented as an expression tree that gets interpreted. DryadLinq analyzed the expression tree and figured out how to segment the sequence of lambda functions into a DAG structure. It then generated .dlls that contained all the functions needed to execute the DAG, shipped them over to HPC, and then started the DAG execution.

So basically Microsoft already has a kick-butt model for distributed processing of LINQ code. Add on top that you can use .Net with Hadoop streaming, and Hadoop comes with an industry tested DFS, the thing DryadLinq sorely missed, and I can potentially see where Microsoft is going with this.

Now the real question is, is Microsoft that forward thinking? Or are they doing what they do best: making a knee-jerk reaction to the market?

MongoDb MapReduce Bug: Part 2

May 25, 2011
In my previous post I mentioned a possible bug with MongoDb's MapReduce. Well, I played with it a bit further. I went ahead and added a string field to each document to hold the numeric Twitter post id as a string value. Then I changed my map function to use this string value instead of the numeric id. This time the duplicate post query worked perfectly. Its results showed that there were no duplicate posts at all.

So, what is the problem? Is it a bug in MongoDb's MapReduce implementation? I doubt it. Is it a numeric bug with JavaScript? Could be. Does my code suck? Totally possible. In looking at my MapReduce query that uses a long as part of the key, I can't see anything that would cause the false positive results. So my bet goes with a numeric issue in JavaScript.

MongoDb MapReduce Bug?

May 23, 2011
I wanted to run a simple sanity check to make sure I didn't have any duplicate Twitter posts in my database. I didn't think I had any, but you never can be too sure, so I whipped up a simple MapReduce query to check. Right now I'm storing Twitter posts in MongoDb using a simple document schema.


Each document has two fields, a category field that holds what search was used to find the post, and a post field to store the returned Twitter post. I wanted to make sure that I didn't have any duplicate posts for any given category (though it is allowed to have duplicate posts across categories).

My map function created a key by concatenating the category and the post.post_id fields together, and returned this key with the value 1. Then my reduce function just counts how many values have this key...Easy Peasy.
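The map and reduce steps described above can be sketched in plain Java, without MongoDb. This is only an illustration of the counting logic: the `Post` class, its field names, and the `|` separator in the key are assumptions made for the example, not the stored schema or the actual map and reduce functions.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class DuplicateCheck {
    // Hypothetical stand-in for a stored document: search category plus Twitter post id.
    static final class Post {
        final String category;
        final long postId;

        Post(String category, long postId) {
            this.category = category;
            this.postId = postId;
        }
    }

    // "Map" builds the composite key; "reduce" counts how many values share that key.
    static Map<String, Long> countByKey(List<Post> posts) {
        return posts.stream().collect(Collectors.groupingBy(
                p -> p.category + "|" + p.postId,
                LinkedHashMap::new,
                Collectors.counting()));
    }

    // Keep only the keys that were seen more than once, i.e. the duplicates.
    static Map<String, Long> duplicates(List<Post> posts) {
        Map<String, Long> dups = new LinkedHashMap<>();
        countByKey(posts).forEach((key, count) -> {
            if (count > 1) {
                dups.put(key, count);
            }
        });
        return dups;
    }

    public static void main(String[] args) {
        List<Post> posts = Arrays.asList(
                new Post("Red Wine", 101L),
                new Post("Red Wine", 102L),
                new Post("White Wine", 101L),  // same id in another category: allowed
                new Post("Red Wine", 101L));   // a true duplicate
        System.out.println(duplicates(posts));
    }
}
```

Because the key is built from the exact `long` id, two posts only count as duplicates when both the category and the id match.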

When I ran the MapReduce job on MongoDb, the results said that out of 270,000 posts, I had 5 duplicated posts. I took the returned key which has the category and post id and ran a MongoDb query to search for them...and it returned one document. Hmmm...odd. I did the same test on the other 4 reported duplicates and got the same result. Only one document returned for each category and post id query. Really odd.

Ok, to troubleshoot this further I changed my map function to pass back both the value 1 and the document object as its value, and then changed the reduce function to return not just the sum of all dups, but the actual document objects that made up the dup.

This is where it really got odd. When I ran the MapReduce query again, I got the same results: 5 duplicate documents as expected. Then I printed out the two documents that made up each duplicate pair, and what did I see? They had unique category / post_id combinations!

Red Wine | 16201436807299072
Red Wine | 16201436807299073

For all five duplicate pairs that MongoDb returned, each pair had the same category, but the post ids were off by 1.

I'm not sure what to think about this. It seems like it might be some kind of numeric handling or overflow error in JavaScript as a result of Twitter ids being so large. Should I change all my ids into strings when I store them in MongoDb? Maybe that would solve the issue. Also, since all my analytics are going to be MapReduce based, I'm not sure if I can trust the results or not.

The next step in troubleshooting this is to add a string field to each post to hold the string value of each id, then re-run the duplicate check to see if I get the same results.
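One way to check the numeric hypothesis is to look at the two ids printed above as 64-bit floating point values. JavaScript numbers are IEEE 754 doubles, which can only represent every integer exactly up to 2^53. The sketch below assumes the ids were handled as JavaScript numbers inside the map function (not confirmed here) and uses a Java `double` cast to mimic that.

```java
final class TwitterIdPrecision {
    // Casting a long to double mimics holding the id in a JavaScript number.
    static double asJavaScriptNumber(long id) {
        return (double) id;
    }

    // True when two different long ids turn into the same double value.
    static boolean collide(long a, long b) {
        return a != b && asJavaScriptNumber(a) == asJavaScriptNumber(b);
    }

    public static void main(String[] args) {
        long first = 16201436807299072L;
        long second = 16201436807299073L;
        System.out.println("largest exactly countable integer (2^53): " + (1L << 53));
        System.out.println("ids collide as doubles: " + collide(first, second));
        System.out.println("ids equal as strings: "
                + Long.toString(first).equals(Long.toString(second)));
    }
}
```

Both ids are larger than 2^53, where doubles can only step in increments of 2, so the odd id rounds onto its even neighbor. Keys built from string ids stay distinct.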

Schema Design in Schema-less Datastores

May 17, 2011
I'm still getting used to this whole schema-less, document oriented database thing. I've been trying to figure out what document structure to store my Twitter data in, and I keep flipping back and forth. I've been writing software in the land of relational databases for 15 years, and consider myself pretty good at designing database schemas for enterprise class solutions. But when it comes to this whole schema-less document store thing, I'm not sure what the best approach is.

With this Twitter data I'm analyzing, I really only have one chunk of data: a Twitter search result json document. This data could be broken into two parts: fields about the post, and fields about who sent the post (there are also fields about who the post is sent to, but in this scenario I don't really care about those). So I have a few choices on how to store this data.

1) One MongoDb collection to store each individual Twitter post. Seems obvious, but then user data is duplicated.

2) Two MongoDb collections, one for the Twitter posts, and one for the users who sent them. Seems too RDBMS to me.

3) One MongoDb collection to store the users, where each user has a sub-document that stores all the posts for that user. No duplication, no RDBMS'isms.

My initial effort went down the all too familiar RDBMS route...setting up two collections: one for the posts, and one for the users. But as soon as I started playing with a few analytic aggregate queries I immediately ran into a wall. There is no join operation across collections in MongoDb. Queries in MongoDb work against one collection, and one collection only (as far as I know). Now this structure doesn't leave me totally hosed. The posts collection does have a user id field. I could do all aggregate queries using this user id, return the results, then as I iterate through the results, look up the users in the users collection with this id.

But what about my other two options? The software engineer in me just doesn't like option 1. It's the whole duplication of data thing that just goes against the grain of all that I hold near and dear to my heart when it comes to managing data.
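The workaround for the missing join, aggregating on the user id first and then looking up each user afterwards, can be sketched with in-memory collections. The maps below are stand-ins for the posts and users collections, and the names and sample data are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ClientSideJoin {
    // Aggregate step: count posts per user id (stands in for a query on the posts collection).
    static Map<Long, Integer> postsPerUser(List<Long> postUserIds) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (Long userId : postUserIds) {
            counts.merge(userId, 1, Integer::sum);
        }
        return counts;
    }

    // Lookup step: resolve each user id against the users collection, modeled here as a map.
    static Map<String, Integer> joinWithUsers(Map<Long, Integer> counts, Map<Long, String> users) {
        Map<String, Integer> byName = new TreeMap<>();
        for (Map.Entry<Long, Integer> entry : counts.entrySet()) {
            String name = users.getOrDefault(entry.getKey(), "unknown user " + entry.getKey());
            byName.merge(name, entry.getValue(), Integer::sum);
        }
        return byName;
    }

    public static void main(String[] args) {
        Map<Long, String> users = new HashMap<>();
        users.put(1L, "alice");
        users.put(2L, "bob");
        Map<Long, Integer> counts = postsPerUser(Arrays.asList(1L, 2L, 1L, 3L));
        System.out.println(joinWithUsers(counts, users));
    }
}
```

The cost of this approach is one extra lookup per distinct user in the aggregate result, which is the trade-off to weigh against embedding posts inside user documents.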

So that leaves option 3. I think I'm going to run all my data collection in parallel for a while using option 2 and 3, write some analytic queries against both, and see which one works better.

I do think it's kind of funny that I spend so much time trying to figure out what schema to use to represent my data in a schema-less data store.

Twitter Search Results and GMT

May 17, 2011
When using the Twitter Search API, the returned posts contain a "created_at" date time stamp with the time set to GMT time. This becomes an issue when using pymongo to store the datetime object in MongoDB. When a datetime object gets stored in MongoDb, mongo (or the pymongo library) updates the datetime object to reflect the current timezone.

So in my case, since I live in Seattle, the datetime values get offset by 8 hours. For example, a Twitter post with a timestamp of "12/21/2010 06:07:56" gets stored in MongoDB with a value of "12/20/2010 22:07:56 -0800", which is really the same instant. It's saying it's 22:07 plus 8 hours to get back to GMT.

This behavior is good to keep in mind when debugging your analytics, especially when grouping by date.

I'm not sure if the pymongo library is doing this or if it's MongoDb. But it can make a difference. Consider if the server running your Python script is in New York and the server hosting MongoDb is in San Francisco. Then things could really get fun when debugging the issue.
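To see why grouping by date is affected, here is a small sketch using the timestamp from the example above. It uses Java's `java.time` rather than pymongo, and the `America/Los_Angeles` zone id is an assumption standing in for "Seattle time".

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

final class CreatedAtZones {
    // Interpret a timestamp as GMT/UTC and express the same instant in another zone.
    static ZonedDateTime toZone(LocalDateTime gmt, String zoneId) {
        return gmt.atZone(ZoneOffset.UTC).withZoneSameInstant(ZoneId.of(zoneId));
    }

    public static void main(String[] args) {
        LocalDateTime createdAt = LocalDateTime.of(2010, 12, 21, 6, 7, 56);
        ZonedDateTime local = toZone(createdAt, "America/Los_Angeles");
        System.out.println("GMT:   " + createdAt);
        System.out.println("local: " + local);
        // Same instant, but a group-by-date would put it on a different day.
        System.out.println("GMT date:   " + createdAt.toLocalDate());
        System.out.println("local date: " + local.toLocalDate());
    }
}
```

The instant is unchanged, but the calendar date moves from December 21 to December 20, so a daily aggregate shifts depending on which representation is grouped.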

Twitter and Rate Limiting

May 04, 2011
In my quest to dig into how people use social media to communicate about wine I ran into a snag. One of the things I really wanted to do was collect geographic data about each post. When you do a twitter search, a post can contain latitude and longitude data if the source supports GPS, but so far almost none of the posts about wine actually have this data.

So the next best thing is to figure out where the user lives. A Twitter user profile has a location field. It's free text unfortunately, but it's better than nothing. So for every post returned by a search, I would look up in MongoDb whether I had stored the user profile yet. If not, I would call the Twitter API to get the user profile, pull out the location and time zone of the user, and store it back in Mongo. This worked great about 150 times. Then I ran into a fun little thing called rate limiting. Apparently Twitter limits any application to 150 API calls per hour (the search API has a different threshold which I haven't hit yet). Once you go over this threshold, you get blacklisted for the next 8 or so hours.

So, considering I have about 50,000 users, at 150 lookups per hour, it'll only take me around 14 days to catch up and have location data for all my users. Of course people don't stop posting on twitter, so in 14 days, I'll probably have 50,000 additional users. At some point this curve will flatten off, but not for a while.
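The back-of-the-envelope estimate above works out as follows. The sketch only restates the arithmetic with the numbers from this post (50,000 users, 150 calls per hour) and ignores any blacklist time.

```java
final class RateLimitEstimate {
    // Hours needed when each user costs one API call.
    static double hoursToFetch(int users, int callsPerHour) {
        return (double) users / callsPerHour;
    }

    // Whole days needed, rounding any partial day up.
    static long daysToFetch(int users, int callsPerHour) {
        return (long) Math.ceil(hoursToFetch(users, callsPerHour) / 24.0);
    }

    public static void main(String[] args) {
        System.out.printf("hours: %.1f%n", hoursToFetch(50_000, 150));
        System.out.println("days:  " + daysToFetch(50_000, 150));
    }
}
```

That is roughly 333 hours of continuous lookups, which rounds up to the 14 days quoted above.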

From an analysis point of view, this location information would be really interesting specifically with people who post to twitter with geographic data. If I could find these users, who post often about wine, and each post contains lat/lng information, I could see how people communicate about wine in relation to where they are. For example, someone who lives in San Francisco might never talk about wine, except for trips out to Napa Valley, at which point they post a lot about wineries they visit.

Twitter ints, longs, strings...oh my

May 04, 2011
Ok, lesson learned. I now know why Twitter data gives all ids in both numeric and string format. When I first saw this, I thought "what a waste of data, repeating all numeric fields as both strings and numerics. That's stupid". Then I started noticing something odd. In my Python script, my Twitter ids weren't matching up to what I had in MongoDB.

Turns out Python (simplejson to be exact) takes all numerics from the Twitter json document and turns them into ints. But if the number is too big, it overflows back to a negative number, or worse, a smaller positive number. So, note to self: explicitly take all numeric data from Twitter from the string fields, and manually turn them into longs.
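The "note to self" can be illustrated in Java, where the difference between a 32-bit int and a 64-bit long is explicit. This is a sketch of the general idea only, not of what simplejson or pymongo does internally, and the sample id is simply a Twitter-sized number.

```java
final class TwitterIds {
    // Parse the string form of an id into a 64-bit long.
    static long parseId(String idStr) {
        return Long.parseLong(idStr);
    }

    // True when the id would survive being stored in a 32-bit int.
    static boolean fitsInInt(long id) {
        return id >= Integer.MIN_VALUE && id <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        long id = parseId("16201436807299072");
        System.out.println("as long:      " + id);
        System.out.println("fits in int:  " + fitsInInt(id));
        // Narrowing keeps only the low 32 bits, so the value silently changes.
        System.out.println("forced to int: " + (int) id);
    }
}
```

Reading the id from the string field and parsing it into a 64-bit type avoids the silent truncation.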

First experience with Twitter Search API and MongoDB

May 02, 2011
I've been playing with a new side project; wine data analytics mined from Twitter, and stored in MongoDb.

I took a first stab at writing a Python celery service to search for wine related twitter posts and dump them in MongoDb. First lesson learned is that a ton of people post about the word 'Wine' on Sunday afternoons. I wrote the service to pull 100 posts, then search again using the lowest post id as the max id to return. After storing 11K posts in MongoDb I looked at the earliest date and realized that it was just a few hours worth of data (vs what I assumed would be several days worth).
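The paging loop described above (fetch a page, then search again below the lowest id seen) can be sketched against a fake in-memory search. The `Search` interface and `inMemory` helper are hypothetical stand-ins for the Twitter Search API. The sketch also assumes the max id bound is inclusive, so it steps to `lowest - 1` to avoid fetching the same post twice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class MaxIdPaging {
    // Hypothetical search backend: up to pageSize ids that are <= maxId, largest first.
    interface Search {
        List<Long> page(long maxId, int pageSize);
    }

    // Keep asking for older pages until the backend runs dry or maxPages is reached.
    static List<Long> fetchAll(Search search, int pageSize, int maxPages) {
        List<Long> all = new ArrayList<>();
        long maxId = Long.MAX_VALUE;
        for (int i = 0; i < maxPages; i++) {
            List<Long> page = search.page(maxId, pageSize);
            if (page.isEmpty()) {
                break;
            }
            all.addAll(page);
            // Step just below the lowest id seen so it is not returned again.
            maxId = Collections.min(page) - 1;
        }
        return all;
    }

    // In-memory stand-in for the search API, for demonstration only.
    static Search inMemory(List<Long> ids) {
        List<Long> sorted = new ArrayList<>(ids);
        sorted.sort(Collections.reverseOrder());
        return (maxId, pageSize) -> {
            List<Long> out = new ArrayList<>();
            for (long id : sorted) {
                if (id <= maxId && out.size() < pageSize) {
                    out.add(id);
                }
            }
            return out;
        };
    }

    public static void main(String[] args) {
        Search search = inMemory(Arrays.asList(5L, 3L, 9L, 1L, 7L));
        System.out.println(fetchAll(search, 2, 10));
    }
}
```

The `maxPages` cap is there so a very broad search term cannot keep the loop running indefinitely.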

Ok...so 'Wine' is just way too broad. So I changed my list of search terms to NOT include just the single word 'Wine'. Depending on how all this data stores, and how much disk space it takes, I might have to go with more of a streaming analytics approach, where I just read the data, add it to the aggregations, and let it go. We'll see.

During all this, it was good to play with MongoDb. I have an instance running on my MacBook. Managing it via the terminal was ok, but I found a good Mac based UI tool called MongoHub that works pretty well. There is also a JNLP browser based tool called MongoBrowser. It's ok for very simple things when you don't have a lot of data, but it's not very functional.

Why is my laptop so sluggish? Or Damn You Facebook and Twitter! Or All Hail Chrome!

Jun 07, 2010

In the past three weeks, I've noticed that my laptop (dual core 2.1GHz, 2Gb RAM) has become amazingly sluggish. I only use it for communications and data lookup workflows, so the slowness was tolerable. But today I finally got fed up with the suckyness and decided to get to the root of the problem (I do have strong performance roots after all).

It actually didn't take all that long to figure it out. About a year ago I converted to Google Chrome (away from Firefox). One of the great tools Chrome has is a "Task Manager" tool that gives you Windows Task Manager like details for all the tabs open in the browser (Shift + Esc). Since every tab runs in its own process, it's easy from Task Manager (either Windows or Chrome) to identify and kill a single performance offending tab. This is unlike IE, where you only get aggregate data about all open tabs.

Anyway, I digress. Today my laptop sucked. Windows Task Manager told me that I had two memory hogging Chrome tabs, but couldn't tell me which web pages those tabs were showing. Enter Chrome Task Manager, which tells you the page title, along with CPU, memory and network utilization of each tab.

Enter my amazement. Turns out Facebook was using just shy of half a Gb of RAM. Half a Gigabyte! That's 512 Megabytes! 524,288 Kilobytes! 536,870,912 Bytes! Or 4,294,967,296 Bits! In other words, that's a frackin boat load of memory.

Now consider that Facebook is running on pretty much 96.3% (statistics based on absolutely nothing) of every household desktop, laptop, netbook, and mobile device in America. That is pretty horrific!

And I wasn't playing any Facebook games like FarmWars or MafiaVille.  I just had my normal, default home page up showing me who just had breakfast, or just got finished with their morning run.

I'm sorry...let me say that again...HALF A GIG OF RAM!  That is just unforgivable.

I can just see my mom calling me up: 
Mom: "John...I think I need a new computer.  Mine is really slow these days"
John: "What do you have running?"
Mom: "Oh, just Facebook"
John: "Ok, close Facebook and tell me how fast your computer feels"
Mom: "Well...I don't know how fast it is.  All I do is use Facebook"
John: "Ok Mom, I'll send you a new computer by Tuesday"

Oh yea...and the other offending web page?  It was Twitter, using a quarter of a Gigabyte.

God I love social networks!

Win7 is not a tablet OS, no matter what the boys in Redmond think.

Jun 01, 2010

Despite what execs at Microsoft think, Windows 7 is NOT a tablet OS. Just because you can install some software (or OS) on a device doesn't mean that device is meant to run that software. This seems to be the step that the non-engineer execs at Microsoft have not understood.

In order to seamlessly work with a device, the software needs to be designed with that device in mind. That has been the problem with the Windows PDA platform, the Windows Mobile platform, and now with trying to force fit Windows 7 on a tablet. It's just not designed for that style of interaction.

Windows is designed to be interacted with via a mouse and keyboard. In fact, it is brilliant at that. But it is NOT designed to be interacted with by your fingers. And that is why the Windows tablet failed 10 years ago, and why it will fail today. It's not the hardware's fault like Microsoft claimed 10 years ago. It's the User Interaction design that failed.

And this is why the iPhone and Android OS's work wonderfully on a tablet. The user interaction was designed for small screens, navigated by big fat fingers. I love these OS's and how I interact with them. And when I play with a touch screen Windows 7 device, I feel like I'm playing with a brittle wannabe. And it's not the hardware's fault. The touchscreen is very responsive. I actually like the hardware. But the OS and the software are just not designed to be interacted with, with my big fat fingers.

In order to be successful, Microsoft needs to start from scratch, and build a platform AND SOFTWARE specifically for use by fingers. That's why everyone was so excited when they thought Microsoft was going to release the Courier tablet. Because it looked like a totally different platform. Something that might actually work. But Windows 7...I hate to burst your bubble, but you are not a touch platform.

Create Pivot collections much faster than DeepZoomTools CollectionCreator class

Apr 23, 2010

I've been playing with Microsoft Live Labs Pivot to create a hierarchy of collections all linked together to allow someone to explore a hierarchy of data visually. The problem has been the generation time of the entire hierarchy. I end up creating 500 - 600 collections total and it takes hours and hours using the CollectionCreator class that comes with the DeepZoomTools. 

So digging around I found a way to make the actual DeepZoom collection creation wicked fast. Don't use the CollectionCreator!

Turns out Pivot doesn't actually use the image pyramid generated by the CollectionCreator. Or if it does, it's only when you open a new collection and it shows all the images zooming in. But once the zoom in is complete, Pivot uses the individual DeepZoom images. What Pivot does need is the xml generated by the CollectionCreator, which is in a very simple format.

So what I did was manually generate the xml for the collection image pyramid, then create the folder structure required (one folder per level of the pyramid), and put a single pixel png file in each folder.

Now, I can create the required files and folders for 500 collections in about 10 seconds. Sweet!

Now you still have to use the ImageCreator to create a DeepZoom image for each image in the collection and that still takes some time, but at least the total processing time is way better.

Creating SparseImages for Pivot

Apr 14, 2010

Learning how to programmatically make collections for Microsoft Live Labs Pivot has been a pretty interesting ride. There are very few examples out there, and the folks at MS Live Labs are often slow on any feedback.  But that is what Reflector is for, right?

Well, I was creating these InfoCard images (similar to the Car images in the "New Cars" sample collection that MS created for Pivot), and wanted to put a Tag Cloud into the info card. The problem was the size of the tag cloud might vary in order for all the tags to fit into the tag cloud (often being bigger than the info card itself). This was because of the varying word lengths and calculated font sizes.

So, to fix this, I made the tag cloud its own separate image from the info card.  Then, I would create a sparse image out of the two images, where the tag cloud fit into a small section of the info card.  This would allow the user to see the info card, but then zoom into the tag cloud and see all the tags at a normal resolution.  Kind'a cool.

But...I couldn't find one code example (not one!) of how to create a sparse image. There is one page on the SeaDragon site (http://www.seadragon.com/developer/creating-content/deep-zoom-tools/) that goes over the API for creating images and collections, and it sparsely covers how to create a sparse image, but unless you are familiar with the API already, the documentation doesn't help very much.

The key is the Image.ViewportWidth and Image.ViewportOrigin properties of the image that is getting superimposed on the main image. I'll walk through the code below. I've set up a couple of Point structs to represent the parent and sub image sizes, as well as where on the parent I want to position the sub image. Next, create the parent image. This is pretty straightforward. Then I create the sub image. Then I calculate several ratios: the height to width ratio of the sub image, the width ratio of the sub image to the parent image, the height ratio of the sub image to the parent image, then the X and Y coordinates on the parent image where I want the sub image to be placed, represented as a ratio of the position to the parent image size.

After all these ratios have been calculated, I use them to calculate the Image.ViewportWidth and Image.ViewportOrigin values, then pass the image objects into the SparseImageCreator and call Create.

The key thing that was really missing from the API documentation page is that when setting up your sub images, everything is expressed as a ratio in relation to the main parent image. If I had known this, it would have saved me a lot of trial and error time. And how did I figure this out? Reflector of course! There is a tool called Deep Zoom Composer that came from MS Live Labs which can create a sparse image. I just dug around the tool's code until I found the method that creates sparse images. But seriously...look at the API documentation from the SeaDragon site and look at the code below and tell me if the documentation would have helped you at all. I don't think so!


public static void WriteDeepZoomSparseImage(string mainImagePath, string subImagePath, string destination)
{
    Point parentImageSize = new Point(720, 420);
    Point subImageSize = new Point(490, 310);
    Point subImageLocation = new Point(196, 17);
    List<Image> images = new List<Image>();

    //create main image
    Image mainImage = new Image(mainImagePath);
    mainImage.Size = parentImageSize;
    images.Add(mainImage);    // the list passed to Create has to contain the parent image

    //create sub image
    Image subImage = new Image(subImagePath);
    double hwRatio = subImageSize.X / subImageSize.Y;           // width / height ratio of the sub image
    double nodeWidth = subImageSize.X / parentImageSize.X;      // sub image width to parent image width ratio
    double nodeHeight = subImageSize.Y / parentImageSize.Y;     // sub image height to parent image height ratio
    double nodeX = subImageLocation.X / parentImageSize.X;      // x coordinate position on parent / width of parent
    double nodeY = subImageLocation.Y / parentImageSize.Y;      // y coordinate position on parent / height of parent

    subImage.ViewportWidth = (nodeWidth < double.Epsilon) ? 1.0 : (1.0 / nodeWidth);

    subImage.ViewportOrigin = new Point(
        (nodeWidth < double.Epsilon) ? -1.0 : (-nodeX / nodeWidth),
        (nodeHeight < double.Epsilon) ? -1.0 : ((-nodeY / nodeHeight) / hwRatio));
    images.Add(subImage);     // ...and the sub image

    //create sparse image
    SparseImageCreator creator = new SparseImageCreator();
    creator.Create(images, destination);
}
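
To make the ratio arithmetic concrete without needing the Deep Zoom libraries, here is a minimal sketch that runs the same calculation on plain doubles, using the sizes from the code above.  The SparseMath class and ComputeViewport method are names made up for this illustration; they are not part of the Deep Zoom API, and the double.Epsilon guards are left out for brevity.

```csharp
using System;

// Hypothetical helper (not part of the Deep Zoom API): the same ratio math on plain doubles.
public static class SparseMath
{
    public static (double ViewportWidth, double OriginX, double OriginY) ComputeViewport(
        double parentW, double parentH,
        double subW, double subH,
        double subX, double subY)
    {
        double aspect = subW / subH;         // width / height of the sub image
        double nodeWidth = subW / parentW;   // fraction of the parent's width the sub image covers
        double nodeHeight = subH / parentH;  // fraction of the parent's height the sub image covers
        double nodeX = subX / parentW;       // left edge as a fraction of the parent's width
        double nodeY = subY / parentH;       // top edge as a fraction of the parent's height

        double viewportWidth = 1.0 / nodeWidth;
        double originX = -nodeX / nodeWidth;
        double originY = (-nodeY / nodeHeight) / aspect;
        return (viewportWidth, originX, originY);
    }

    public static void Main()
    {
        var v = ComputeViewport(720, 420, 490, 310, 196, 17);
        Console.WriteLine(v.ViewportWidth + " " + v.OriginX + " " + v.OriginY);
    }
}
```

With these numbers the viewport width works out to 720/490 (about 1.47), the X origin to -196/490 = -0.4, and the Y origin to -17/490 (about -0.035).  The parent's size cancels out of both origin values, so the offset ends up measured in sub image widths.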

Seeking questions about creating Microsoft Live Labs Pivot collections

One Comment | Apr 13, 2010

I've spent the past 3 weeks working a lot with Pivot from Microsoft Live Labs (http://getpivot.com/).  Pivot is a tool that allows you to visually explore data. It's an interesting take on visual data mining.

Anyway, I've been writing a lot of code that creates a hierarchy of Pivot collections, where one item in the collection drills down into an entirely new collection.

The dev community around Pivot is still very young, so there isn't much tribal knowledge built up yet.  I've spent a lot of time trying to get things to work through trial and error, as well as digging around in Reflector.  But I've finally got a framework built for programmatically creating DeepZoom images, Pivot collections, Sparse Images, etc.

If anyone has any questions, or suggestions on a post topic, leave a comment and I'll try and answer your question. 

Did Microsoft designers get their butts kicked 3 years ago?

2 Comments | Apr 13, 2010
This is something I've been wondering about for about a year now.  Microsoft has a history of creating very useful products, with lots of useful features.  But useful does not mean usable.  A lot of the stuff coming out of Redmond over the past 10 years doesn't really seem to have been well thought out from a user design point of view.  Lots of extra steps, lots of popup windows...very little innovative thinking going on about the user experience of these products.

But about a year ago I started seeing changes in the new products coming out of Microsoft.  Windows 7 is a good example of a big change.  They really got their asses handed to them on Vista, so they had to make a change.  But it looks like this change in philosophy has bled over to other areas.  The new Office (2010) lineup has a lot of changes in it to make it way more usable. 

Given that big changes like this take about 3 years to go from start to actually shipping product, I'm curious what happened internally at Microsoft that really drove this change in product design.  I think that Microsoft got so focused on just adding new functionality for so long, they forgot about the little things that can really make or break a product.  Office 2010 is full of these little things that make it much nicer to use.  I just hope it's not too late for them.

Change in Job Title and Responsibilities

Add Comment | Apr 13, 2010

I've spent the past 7 years focused primarily on code and database performance.  It's an area that I have a passion for, as well as a propensity.  But what I've found is that it's very hard to change the culture of a development environment.  You can teach performance, you can encourage performance, you might see a slight shift in how devs think about performance.  But without full management backing and support you won't get long lasting changes in the development culture.  And in the end, you are back to being the "Perf Guy", fixing performance design flaws, after the fact, one by one by one.

Which is why last year I asked my boss to change my title and responsibilities to more naturally align with the team I was working for.  So now I'm a Computing Research Engineer (vague, I know), researching in the field of Big Data analytics and visualization.

I've found this change revitalizing and a lot of fun.  And given the nature of Big Data (it's, um…big) the performance aspects are ever present.

MDbg: a managed wrapper around ICorDebug!

Add Comment | Oct 13, 2008

Recently a performance bug came my way.  A highly multithreaded application, which can run for hours depending on the amount of data it's processing, was observed ramping all its CPUs up to 100% utilization while the amount of data processed per second dropped down to nothing.

Ok, no big deal here.  I've most likely got a state where all threads are stuck in a tight loop (most likely the same loop), and each thread is waiting on the other to set a flag that will allow them to exit the loop.    Your basic deadlock issue.  Pretty easy to fix, if I can reproduce the problem on my dev machine and use the debugger to tell me what the offending function is.

The problem is that I wasn’t able to reproduce it.  Crap.

Ok…on to step two…

Looks like I'll have to find or build a tool that can give me the call stack of all the threads in a managed app.  I started out trying to use System.Diagnostics.StackTrace and StackFrame.  From a System.Threading.Thread instance I can create a StackTrace object and see what function the thread is in.  But I can't get a list of all System.Threading.Thread objects in an app.  I have access to the Process.Threads collection, but that gives me a list of System.Diagnostics.ProcessThread objects, not System.Threading.Thread.  Shoot…that’s not going to work.
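
Here is a small sketch of that dead end, written for illustration rather than taken from the original tool hunt.  It lists the process's OS threads through Process.Threads, then takes a StackTrace of the only thread it can easily reach: the one running the code.  The ThreadListing class name is made up.

```csharp
using System;
using System.Diagnostics;

public static class ThreadListing
{
    public static void Main()
    {
        // Process.Threads exposes OS-level threads as ProcessThread objects.
        // They carry ids and scheduling info, but no managed call stack.
        foreach (ProcessThread pt in Process.GetCurrentProcess().Threads)
        {
            Console.WriteLine("OS thread id: " + pt.Id);
        }

        // StackTrace does walk managed frames, but this constructor only
        // covers the thread that is executing this line.
        StackTrace trace = new StackTrace();
        foreach (StackFrame frame in trace.GetFrames())
        {
            Console.WriteLine("  at " + frame.GetMethod()?.Name);
        }
    }
}
```

The two lists never meet: nothing here maps a ProcessThread back to a managed thread, which is the gap a debugger API has to fill.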

Ok, next step is to look at creating a really light weight debugger, from .Net's ICorDebug api, to basically break into the app and dump out the call stack of all the managed threads.  I found a couple examples and it didn’t look too bad, but the only issue is that ICorDebug is a COM API.  So I'd have to do all that fun C++ COM stuff…Ick.  And I need the tool yesterday.

After digging around a bit more I found out that the Visual Studio debugger team wrote a very nice managed wrapper around ICorDebug, called MDbg.  Sweet!

There is a bunch of info about it here.

After digging a bit further, I found that someone wrote a handy little tool called Managed Stack Explorer.  Oh geez!  The gods are smiling at me!  That’s exactly what I need.

This little tool shows all managed apps running on your server.  When you pick an app, it shows all threads in the process.  When you click the thread, it shows you the call stack for that thread.  Simple and nice.

With this tool, I was able to find the offending non-threadsafe function in about 5 minutes.  Fixed, done, yipee.

But this post isn't about someone's tool, or my bug fixing adventures.  No, it's about coming across one of the most useful APIs I've seen in a long time!  A simple and well designed .Net wrapper around ICorDebug, giving .Net developers full access to the CLR debugger.  I'm very excited about the idea of a managed wrapper around ICorDebug.  There are so many diagnostic tools that could be created with this.  I'm looking forward to digging around in the API!

Many core processors and parallel processing

Add Comment | Oct 06, 2008
Although most of the topics I've written about are pretty random, I'll try to focus in on a much narrower (yet incredibly broad) topic: multi core vs many core processing, parallel processing, and the paradigm shift that we software engineers are on the leading edge of having to face.

To put it in short, Intel, AMD, and other hardware manufacturers are telling anyone who listens that programmers need to change the way they think about designing end-user software.  End-user software needs to take advantage of multiple cores.  And this doesn't mean spinning up a background thread to do some compute intensive request, so that our UI remains responsive.  It means designing all compute intensive algorithms to scale to multiple processors.

Intel goes on to say that designing for 2, 4, or 8 processors is way too short sighted.  We need to design our software to scale out to N processors, where N could be 16, 64, or 512.
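
As a rough illustration of what "scale out to N" can look like in code, here is a sketch using Parallel.For from the Task Parallel Library, which shipped in .NET after this post was written.  The ParallelSum class and the sum-of-squares workload are stand-ins chosen for the example.  The point is that the code never mentions a core count; the runtime decides how many workers to use.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class ParallelSum
{
    // Sums the squares of 0..count-1, letting the runtime spread the work
    // over however many cores the machine has.
    public static long SumOfSquares(int count)
    {
        long total = 0;
        Parallel.For<long>(0, count,
            () => 0L,                                           // each worker starts its own subtotal
            (i, loopState, subtotal) => subtotal + (long)i * i, // no shared state touched per item
            subtotal => Interlocked.Add(ref total, subtotal));  // merge once per worker
        return total;
    }

    public static void Main()
    {
        Console.WriteLine("cores: " + Environment.ProcessorCount);
        Console.WriteLine(SumOfSquares(1000000));
    }
}
```

Keeping a per-worker subtotal and merging once at the end is the design choice that matters here: a single shared counter updated on every iteration would serialize the cores on that one variable.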

Coding Horror has a great post from last year that demonstrates how well common end user software takes advantage of multi core processors.  The results are sad to say the least.

We can no longer just expect our software to get faster with the next chip release by Intel or AMD.  What is worse, our software will most likely run slower on newer desktop and mobile chips.

The trend in processor manufacturing is to have slower, cooler, more efficient individual cores, and to pack more and more of them on a single chip.  This means that end user software that only uses 1 or 2 threads will actually run slower on newer processors.

This can be seen with Intel's new quad core mobile processor: QX9300.  It has 4 cores, but each one runs at only 2.53 GHz.  This is an amazing chip, but only for software that is actually designed to run across multiple cores.

To boil it down to a simplified problem statement: Software outlives hardware, and hardware ain't getting any faster.  (more on that later)

Complexity and Usability

Add Comment | Jul 22, 2008

Generally I'm not one to write a post that does nothing but highlight someone else's blog post...BUT...this one was important enough (IMHO) that I decided to break my own rule. 

Are you building a Leatherman or a Samurai sword?  (stupid linker isn't working)


As programmers we always want to write new functionality...neat, new, COOL functionality.  That's just what we do, and we love it. 

But it's hard to keep in mind what our added functionality does to user efficiency.  No matter what we think our job is all about, it's really about making the lives of our users easier and more efficient.  That’s it…done…it's that simple.

This is easy to understand when writing a UI application.  If a new feature causes the user to perform 5 extra steps with the UI, but those 5 extra steps only give a small return on efficiency (so small it wasn’t worth the time to perform the 5 new steps), then drop the feature, it's not worth it.  If the feature is complex or confusing, and will cause the user to misuse it or skip it altogether, then drop the feature, it's not worth it.

Where this becomes harder to evaluate is in writing an SDK API.  Like the above post states, we all want to write the ultimate architecture.  The one that can do anything and everything.  But "anything and everything" can quickly become a directionless mess, where you have several hundred classes with no obvious direction on how to weave them together into the next "Wonder Bread".  What you end up with is a big mess that your users (other developers) will most likely just pass off as too complex, and then go look for a simpler API.

The last part of the above post states it perfectly. 

"You end up with a million features, which makes it very time-consuming to build, and even when it's done, the number of different gizmos on your Leatherman scare off potential users. You need to have a strong connection to your actual customers, and be hearing about exactly what they need to do. Then you need to design around that, ruthlessly jettisoning anything that distracts from them achieving their goals."

Creating an instance of a generic parameter is slooooo: part deux

Add Comment | Jun 13, 2008
For grins I looked at my code that calls:

T tmp = new T();

in Reflector, to see if it could shed any light on T instance creation badness.  Well, it turns out that the C# compiler spits out code to call Activator.CreateInstance:

T tmp = Activator.CreateInstance<T>();

I kind of get why the C# compiler does this, because it doesn't know what T is at compile time.  But at run time the JIT compiler DOES know.  I'm surprised that the C# team didn't build in the smarts for the JIT to emit code that explicitly calls the default constructor of whatever type T is.

Creating an instance of a generic parameter is slooooo

One Comment | Jun 13, 2008
I recently needed to change how an array lookup worked to make it more efficient, and decided to use the List<T>.BinarySearch to do the lookup.  The class that contained this lookup had a generic parameter, and was constrained like so:

public class SortedNameList<T> where T : class, INameValueItem, new()

where the T of List<T> was the same as the class generic parameter.

In order to do the BinarySearch, List<T> required an input of type T to search against.  Since I only had the value of the property that will be compared against (an int), I needed to create a new temp instance of T, set the value, and then pass it into BinarySearch().
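
The setup described above might look roughly like the sketch below.  The real INameValueItem interface isn't shown in this post, so its single int Value property, the NamedItem class, the ValueComparer class, and the Add and Find methods are all assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;

// Assumed shape of the interface; the real one is not shown in the post.
public interface INameValueItem
{
    int Value { get; set; }
}

// Hypothetical item type used only for this example.
public class NamedItem : INameValueItem
{
    public int Value { get; set; }
    public string Name { get; set; }
}

// Orders items by the int property that the lookup compares against.
public class ValueComparer<T> : IComparer<T> where T : INameValueItem
{
    public int Compare(T x, T y)
    {
        return x.Value.CompareTo(y.Value);
    }
}

public class SortedNameList<T> where T : class, INameValueItem, new()
{
    private readonly List<T> items = new List<T>();
    private readonly ValueComparer<T> comparer = new ValueComparer<T>();

    // Keeps the list sorted so BinarySearch stays valid.
    public void Add(T item)
    {
        int index = items.BinarySearch(item, comparer);
        items.Insert(index < 0 ? ~index : index, item);
    }

    public T Find(int value)
    {
        T probe = new T();      // the temp instance needed just to carry the search value
        probe.Value = value;
        int index = items.BinarySearch(probe, comparer);
        return index >= 0 ? items[index] : null;
    }
}
```

Note that every call to Find creates one throwaway T.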

My unit tests passed, all the functionality was good, and I was happy.  Then I ran my app under a profiler to see how much faster my fancy BinarySearch was.

To my surprise, the time spent doing the binary search calls was almost exactly the same as a linear lookup (over 1.2 million searches)!  What the heck?  I know that creating a new temp object each lookup isn't very efficient, but it shouldn't make that much of a difference.

So after looking a bit deeper and doing some more performance tests, I found out that creating a new instance of a generic ("T tmp = new T()") is sloooooo.  How slow?  How about 30X slower!  WOW...I had no idea!

And it's not that it takes the CLR some time to figure out how to create a new T, where most of the time is spent on the first instance and the rest speed up.  Nope, the duration to create a new T is consistent, from the first instance to the millionth instance.

Good to know...don't do that in a high volume area.
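
One commonly used way around "new T()" in a hot path is to build a constructor delegate once per type and reuse it.  The sketch below does that with a compiled expression tree.  The Factory and FactoryDemo names are made up, and I haven't measured this against the 30X figure above, so treat it as something to profile rather than a guaranteed fix.

```csharp
using System;
using System.Linq.Expressions;

// Hypothetical helper: caches one constructor delegate per closed generic type.
public static class Factory<T> where T : new()
{
    // Compiled once, the first time Factory<T> is touched for a given T.
    public static readonly Func<T> Create =
        Expression.Lambda<Func<T>>(Expression.New(typeof(T))).Compile();
}

public static class FactoryDemo
{
    public static T[] MakeMany<T>(int count) where T : new()
    {
        Func<T> create = Factory<T>.Create;   // fetch the cached delegate outside the loop
        T[] result = new T[count];
        for (int i = 0; i < count; i++)
        {
            result[i] = create();             // plain delegate call per instance
        }
        return result;
    }
}
```

Another option, when the caller already knows the concrete type, is simply to pass a Func<T> into the class and skip the new() constraint altogether.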

No more null checking on your IEnumerables before you iterate over them

5 Comments | May 29, 2008
I get a bit sick of checking for null on my IEnumerable objects before doing a foreach over them.  In my opinion the CLR should check if the list is null, and if it is, just exit out of the foreach iteration as if there were no items in it.

Well, I was goofing around with Extension Methods a bit and figured out how to get this kind of functionality (sort of).

Now unfortunately Extension Methods can't override an existing method on a type, so I can't just create a new GetEnumerator extension method (well, actually I can make one, but it won't get called).  But I can create a new method that returns IEnumerable, and just call the foreach on it.

So in order to do this, first add this class to your code:

public static class MyExtensionMethods
{
    public static IEnumerable<T> Enum<T>(this IEnumerable<T> input)
    {
        if (input != null)
        {
            foreach (var t in input)
                yield return t;
        }
        yield break;
    }
}

Now, anything that implements IEnumerable<T> will have the Enum method.  Then all you have to do is call foreach on someClass.Enum(), even if someClass is null.  Below is an example of how this works.

static void Main(string[] args)
{
    List<string> names = new List<string>()
        {"john", "kim", "jean", "brent"};

    //iterate names using stock enumerator
    foreach (string name in names)
        Console.WriteLine(name);

    //iterate names using extension method
    foreach (string name in names.Enum())
        Console.WriteLine(name);

    names = null;

    //oh man!  I have to check for null...I hate that
    if (names != null)
        foreach (string name in names)
            Console.WriteLine(name);

    //Yea!  I dont have to check for null anymore!
    foreach (string name in names.Enum())
        Console.WriteLine(name);
}

The extension method uses the "yield return" and "yield break" iterator syntax to let the foreach spin over the IEnumerable if it's not null.  If it is null, "yield break" ends the sequence, so IEnumerator.MoveNext returns false, which tells the foreach that there are no more items in the list and it should break out of the loop.

So, no more null checks!

A reader commented that this could be optimized by using the static method Enumerable.Empty<T>.  This would save an object instance from being created by the yield return functionality.  The new and improved Extension Method is as follows:

public static IEnumerable<T> Enum<T>(this IEnumerable<T> input)
{
    // Enumerable.Empty<T> lives in System.Linq
    return input ?? Enumerable.Empty<T>();
}
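
For completeness, here is the Enumerable.Empty<T> version as a small self-contained sketch with a usage example.  The NullSafeExtensions and NullSafeDemo class names and the CountNames helper are made up for the illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NullSafeExtensions
{
    public static IEnumerable<T> Enum<T>(this IEnumerable<T> input)
    {
        // A non-null input is handed back untouched; null becomes a shared empty sequence.
        return input ?? Enumerable.Empty<T>();
    }
}

public static class NullSafeDemo
{
    // Hypothetical helper: counts items without a null check at the call site.
    public static int CountNames(List<string> names)
    {
        int count = 0;
        foreach (string name in names.Enum())
        {
            count++;
        }
        return count;
    }

    public static void Main()
    {
        Console.WriteLine(CountNames(new List<string> { "john", "kim" }));
        Console.WriteLine(CountNames(null));
    }
}
```

A side effect worth knowing: for a non-null list this version returns the very same object, so no wrapper iterator sits between the foreach and the list, whereas the yield-based version always allocates one.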


29 Comments | Apr 16, 2008
I recently profiled a sproc that makes heavy use of the TSQL SUBSTRING function (hundreds of thousands of times) to see how it performs on a SQL 2005 database compared to a SQL 2000 database.  Much to my surprise the SQL 2005 database performed worse...dramatically worse than SQL 2000.

After much researching it turns out the problem is that the column the text was stored in was an NTEXT, but SQL 2005 has deprecated NTEXT in favor of NVARCHAR(MAX).  Now, you'd think that string functions on NTEXT would have the same performance on 2005 as they did on 2000, but that's not the case.

Ok, so NTEXT is old badness, and NVARCHAR(MAX) is new goodness.  Then the next logical step would be to convert the column to be a NVARCHAR(MAX) data type, but here lies a little but very important gotcha.

By default NTEXT stores the text value in the LOB structure and the table structure just holds a pointer to the location in the LOB where the text lives. 

Conversely, the default setting for NVARCHAR(MAX) is to store its text value in the table structure, unless the text is over 8,000 bytes, at which point it behaves like an NTEXT: it stores the text value in the LOB and stores a pointer to the text in the table.

So, just to recap, the default settings for NTEXT and NVARCHAR(MAX) are completely opposite.

Now, what do you think will happen when you execute an ALTER COLUMN on a NTEXT column that changes the data type to a NVARCHAR(MAX)?  Where do you think the data will be stored?  In the LOB structure or the table structure?

Well, let's walk through an example.  First create a table with one NTEXT column:

CREATE TABLE [dbo].[testTable](
    [testText] [ntext] NULL
)

Next, put 20 rows in the table by running this insert 20 times:

INSERT INTO testTable SELECT 'hmmm...i wonder if this will work'

Then run a select query with STATISTICS IO turned on:

SET STATISTICS IO ON
SELECT * FROM testTable

Now, looking at the IO stats, we see there was only 1 logical read, but 60 LOB logical reads.  This is pretty much as expected as NTEXT stores its text value in the LOB not the table:

Table 'testTable'. Scan count 1, logical reads 1, physical reads 0, read-ahead reads 0, lob logical reads 60, lob physical reads 0, lob read-ahead reads 0.

Now, let's alter the table to be an NVARCHAR(MAX):

ALTER TABLE testTable ALTER COLUMN testText NVARCHAR(MAX) NULL

Now when we run the select query again with STATISTICS IO on we still get a lot of LOB reads (though fewer than we did with NTEXT).  So it's obvious that when SQL Server did the alter table, it didn't use the default NVARCHAR(MAX) setting of text in row, but kept the text in the LOB and still uses pointer lookups to get the text out of the LOB.

Table 'testTable'. Scan count 1, logical reads 1, physical reads 0, read-ahead reads 0, lob logical reads 40, lob physical reads 0, lob read-ahead reads 0.

This is not as expected and can be devastating for performance if you don't catch it, since NVARCHAR(MAX) with text not in row actually performs WORSE than NTEXT when doing SUBSTRING calls.

So how do we fix this problem?  It's actually fairly easy.  After running your alter table, run an update statement setting the column value to itself, like so:

UPDATE testTable SET testText = testText

SQL Server moves the text from the LOB structure to the table (if less than 8,000 bytes).  So when we run the select again with STATISTICS IO on we get 0 LOB reads.

Table 'testTable'. Scan count 1, logical reads 1, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

YEA!  This is what we want.

Now, just for grins, what do you think happens if we change the NVARCHAR(MAX) back to NTEXT?  Well it turns out that SQL Server moves the text back to the LOB structure.  Completely backwards from what it did when converting NTEXT to NVARCHAR(MAX).

Easily write Reflection.Emit code

Add Comment | Mar 28, 2008
I was looking at Reflector addins the other day and ran across one that would be an amazing time saver.

It's an addin that generates the Reflection.Emit code!

Anyone who has ever spent any time with the Reflection.Emit namespace should immediately realize how wonderful this tool has the potential to be (as long as the generated code is of good quality of course).

Also, the way it integrates with Reflector is pretty slick.  It adds a "Reflection.Emit" choice in the list of languages you want Reflector to display the code in.  Then, in the left pane, when you click a module, class, method, property, whatever, it displays in the right pane the Reflection.Emit code that you would have to write to generate the thing you clicked on.

Simple...and amazing!

I recently spent 6 days writing Reflection.Emit code to generate two fairly complex methods.  2 days each for writing the code, and 1 day each for debugging it and making it actually work.  I probably could have cut that down to 1 to 2 days using this tool.
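
For anyone who hasn't written Reflection.Emit code by hand, here is a tiny sketch of what it involves: a method that adds two ints, built one IL instruction at a time.  This is my own minimal example (the EmitDemo and BuildAdder names are made up), not output from the addin.

```csharp
using System;
using System.Reflection.Emit;

public static class EmitDemo
{
    // Builds the equivalent of: static int Add(int a, int b) { return a + b; }
    public static Func<int, int, int> BuildAdder()
    {
        DynamicMethod method = new DynamicMethod(
            "Add", typeof(int), new[] { typeof(int), typeof(int) });

        ILGenerator il = method.GetILGenerator();
        il.Emit(OpCodes.Ldarg_0);   // push the first argument
        il.Emit(OpCodes.Ldarg_1);   // push the second argument
        il.Emit(OpCodes.Add);       // pop both, push their sum
        il.Emit(OpCodes.Ret);       // return the value on top of the stack

        return (Func<int, int, int>)method.CreateDelegate(typeof(Func<int, int, int>));
    }

    public static void Main()
    {
        Func<int, int, int> add = BuildAdder();
        Console.WriteLine(add(2, 3));
    }
}
```

Four instructions for a one-line method; scale that up to locals, branches, and method calls, and it's easy to see where the days go.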

I haven't yet compared the addin's generated Reflection.Emit code to the code I've written manually to validate its quality, but just playing around with it, the generated code looks pretty good.

It can be found here:

The Instrumentation Model

Add Comment | Mar 26, 2008

I've spent a lot of time lately thinking about instrumentation and how to integrate it into software projects.


As a performance engineer I tend to think about instrumentation from the point of view of someone who wants to record the details of what a system is doing, and then dig through the data and use it to figure out what is wrong.


But as I’ve been talking to people over the past few months about instrumentation, I’ve come to realize that instrumentation means different things to different people.  Some people think of instrumentation as a high level, light weight set of metrics that are easy to consume, understand, and use to extrapolate performance deltas; a management point of view.  Other people, like me, think of it as recording low level details of what’s going on in the call stacks and SQL engine; a troubleshooter's point of view. And then others think it's somewhere in between; everyone else.


Well, I think everyone is correct.  There are different levels of instrumentation that are useful at different points in validating system health.  There should be easy to consume and understand metrics to validate day to day health checks, there is medium level detail instrumentation that is used to figure out where a problem is, but takes a bit more effort to analyze.  And if that isn’t enough to find and fix the problem, there is the dump everything to file model that gives you all the data you need to understand what is going on in the system, but requires internal knowledge of the system and time to analyze the data.  Also, each level builds upon the other, so there is as little duplicated effort as possible.


So I’ve tried to create an instrumentation model to demonstrate these different levels, the questions each level tries to answer, and when you move on to the next level.


The first level will provide you with the most early bang for your buck, and it’s an easy way to tell if you have a problem, with as little dev effort as possible.  Then as you get the high level metrics in, you can start building in the mid level metrics, and so on.  The main thing is to not try and build the entire instrumentation framework up front before you put anything in.  Start putting high level metrics in early and use them in your automated testing.
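
As a very rough sketch of how levels like these could be expressed in code, here is one possible shape: every measurement is tagged with a level, and only measurements at or below the enabled level are recorded.  The MetricLevel enum, the Instrumentation class, and the Measure method are all hypothetical names; none of this comes from an actual framework.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical levels, from cheapest and most summarized to most detailed.
public enum MetricLevel { High = 1, Medium = 2, Detailed = 3 }

public class Instrumentation
{
    private readonly MetricLevel enabled;
    private readonly List<string> records = new List<string>();

    public Instrumentation(MetricLevel enabled)
    {
        this.enabled = enabled;
    }

    public IReadOnlyList<string> Records
    {
        get { return records; }
    }

    // Runs the work and, if the level is enabled, records how long it took.
    public T Measure<T>(MetricLevel level, string name, Func<T> work)
    {
        if (level > enabled)
        {
            return work();   // more detailed than the current setting: run it, record nothing
        }

        Stopwatch timer = Stopwatch.StartNew();
        try
        {
            return work();
        }
        finally
        {
            timer.Stop();
            records.Add(name + ": " + timer.ElapsedMilliseconds + " ms");
        }
    }
}
```

Starting with only MetricLevel.High calls in place matches the advice above: the cheap metrics go in first, and the medium and detailed calls can be added later without changing how the early ones are recorded.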

Instrumentation Model Image