About a month ago I read some very sad news: Microsoft announced that it was killing off the DryadLinq (or LINQ-to-HPC) project in favor of Hadoop.
I was one of the first users of DryadLinq outside of Microsoft, back when it was a pre-alpha project inside Microsoft Research. My company had a running HPC cluster and my boss convinced me to install DryadLinq on it to see what I could make it do. I worked with it for a year and, being a big LINQ and PLINQ fan, really enjoyed how easy it was to write non-cryptic code and get it to run in parallel across a cluster of machines. Fast forward two years: after spending a year working with Hadoop, in my opinion DryadLinq beats Hadoop hands down. The key to DryadLinq's goodness was that you wrote the same algorithm code you would normally write, no matter if it's executing on one core, multiple cores, or multiple machines; it's just LINQ (as long as you keep in mind to pass all data into your lambda functions, and share nothing externally). You don't have to retrain your mind to think about algorithms in a different paradigm, like grouping and sorting key value pairs. DryadLinq dramatically sped up the development iteration cycle of designing algorithms compared to Hadoop, because you write your algorithm in the same code style to run across many machines as you would if it was just running on one machine.
The main item that DryadLinq lacked was any kind of distributed file system (DFS). DryadLinq sucked in that respect, and let's face it, without a good DFS, a distributed processing framework isn't all that functional. You basically had to fake a DFS by manually partitioning your data across all machines and generating an INI file that described how the data was partitioned. The DryadLinq runtime would read the INI file and use that as the basis for a DFS. But you had to write a fair amount of code if you wanted to automate the process of distributing your data.
But, even though I'm bummed that Microsoft killed off DryadLinq, I do have a glimmer of hope and here's why. First, last month Microsoft announced its support for coming up with a Windows compatible distribution of Hadoop. So Microsoft is committing to getting Hadoop running on Windows. Second, one of the things done in Hadoop 0.23 was a full rewrite of the distributed execution model. In Next Gen Hadoop (Yarn), MapReduce has been written as an implementation on top of an abstract distributed job execution framework. Another implementation besides MapReduce that it'll support is a DAG (directed acyclic graph http://en.wikipedia.org/wiki/Directed_acyclic_graph) of jobs. Basically, a graph of nodes (jobs) with directed edges where, starting from any node V, following the edges can never loop back to V again.
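To make the DAG idea concrete, here is a minimal Python sketch (not anything from Hadoop or DryadLinq) that orders a small graph of jobs so every job runs after the jobs it depends on, and refuses a graph that loops back on itself. The job names are made up for illustration.

```python
from collections import deque

def topological_order(edges):
    """Return jobs in an order that respects every edge, or raise if there is a cycle.

    `edges` maps each job to the list of jobs that consume its output.
    """
    # Count incoming edges for every job mentioned anywhere in the graph.
    indegree = {job: 0 for job in edges}
    for targets in edges.values():
        for target in targets:
            indegree[target] = indegree.get(target, 0) + 1

    # Jobs with no incoming edges can start right away.
    ready = deque(sorted(job for job, count in indegree.items() if count == 0))
    order = []
    while ready:
        job = ready.popleft()
        order.append(job)
        for target in edges.get(job, []):
            indegree[target] -= 1
            if indegree[target] == 0:
                ready.append(target)

    # Any job left unscheduled sits on a cycle.
    if len(order) != len(indegree):
        raise ValueError("graph has a cycle, so it is not a DAG")
    return order

# Hypothetical four-stage pipeline.
jobs = {"read": ["filter"], "filter": ["group"], "group": ["write"], "write": []}
print(topological_order(jobs))  # ['read', 'filter', 'group', 'write']
```

A graph such as `{"a": ["b"], "b": ["a"]}` raises `ValueError`, which is exactly the "loops back to V" case that a DAG rules out.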
Now put all that together and here is that glimmer of hope: the core coolness around DryadLinq was that it was a framework for analyzing a .Net LINQ expression tree and generating a DAG structure for it that executed on Windows HPC. At runtime, LINQ code is represented as an expression tree that gets interpreted. DryadLinq analyzed the expression tree and figured out how to segment the sequence of lambda functions into a DAG structure. It then generated .dlls that contained all the functions needed to execute the DAG, shipped them over to HPC, and started the DAG execution.
So basically Microsoft already has a kick-butt model for distributed processing of LINQ code. Add on top that you can use .Net with Hadoop streaming, and that Hadoop comes with an industry-tested DFS, the thing DryadLinq sorely missed, and I can potentially see where Microsoft is going with this.
Now the real question is, is Microsoft that forward thinking? Or are they doing what they do best: making a knee-jerk reaction to the market?
I recently needed to migrate all the data from a Cassandra cluster on EC2 into a Cassandra cluster that was behind our private firewall. Not only that, but the ring sizes of the source and destination clusters were different.
I kicked around some crazy stupid ideas for a while, when someone pointed out that Cassandra 0.8.1 shipped with a new tool called sstableloader (angels start singing here...)
sstableloader is a tool that basically reads a folder full of Cassandra Keyspace data and index files and bulk loads their data into a destination cluster. It can only do this one Keyspace at a time.
After playing with the tool, and working around some gotchas, I finally figured out the process for pulling off a cluster to cluster data migration. So I thought I'd write a little tutorial to share this process with others. The tutorial assumes your source Cassandra cluster is outside your firewall, and sstableloader doesn't have direct read access to the data folder on the source cluster.
Also, if any of the following steps are stupid or just straight wrong, please leave a comment and I'll update accordingly.
Collect cassandra data files from existing cluster:
On each node in the source cassandra ring, you'll need to collect all the data and index files (*Data.db and *Index.db) for the Keyspaces you want to migrate. The data files for a Keyspace are located in a folder named for the Keyspace (by default) under the "/var/lib/cassandra/data/" folder. On an EC2 server you probably changed this (or maybe should have changed this) to a folder under /mnt, since most of the disk space is on /mnt.
Before you package up the data and index files, you'll want to flush the Cassandra memtables to SSTables using nodetool (the flush command). This will make sure the SSTables are up to date with all the data written to the Cassandra cluster. You will also want to kick off a data compaction with nodetool (the compact command) to minimize the volume of data you are going to be copying over to the destination network.
To package all data and index files for a Keyspace on a single node into a compressed tarball, run the following command, making sure to change the KeyspaceName to the Keyspace you want to collect, and the NodeNumber to the cassandra node you are working on:
find /mnt/cassandra/var/lib/cassandra/data/<KeyspaceName> -type f \( -name '*Data.db' -o -name '*Index.db' \) -print0 | xargs -0 tar -czvf <KeyspaceName><NodeNumber>.tar.gz
This creates a tarball for the Keyspace with only the data and index files in it. Run this for each Keyspace, on each node in the source cassandra cluster. (Note: cassandra data files compress nicely to ~25% of their original size.)
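If you would rather script the packaging step, here is a rough Python equivalent of the find/tar command above. It is only a sketch: the function name and arguments are made up, and unlike the shell command it stores file names relative to the Keyspace folder so the tarball unpacks flat.

```python
import fnmatch
import os
import tarfile

def package_keyspace(data_dir, keyspace, node_number, out_dir="."):
    """Tar up only the *Data.db and *Index.db files for one Keyspace on one node."""
    keyspace_dir = os.path.join(data_dir, keyspace)
    # Same naming scheme as the shell command: <KeyspaceName><NodeNumber>.tar.gz
    archive_path = os.path.join(out_dir, "%s%s.tar.gz" % (keyspace, node_number))
    with tarfile.open(archive_path, "w:gz") as archive:
        for root, _dirs, files in os.walk(keyspace_dir):
            for name in sorted(files):
                if fnmatch.fnmatch(name, "*Data.db") or fnmatch.fnmatch(name, "*Index.db"):
                    full_path = os.path.join(root, name)
                    # Store paths relative to the Keyspace folder (a choice of this sketch).
                    archive.add(full_path, arcname=os.path.relpath(full_path, keyspace_dir))
    return archive_path
```

Calling `package_keyspace("/mnt/cassandra/var/lib/cassandra/data", "MyKeyspace", 1, "/tmp")` would write `/tmp/MyKeyspace1.tar.gz`; the Keyspace name and output folder here are placeholders. Write the tarball somewhere outside the Keyspace folder so the walk never sees the archive itself.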
Once all cassandra data is packaged into tarball files, SFTP each tarball (one per Keyspace/per Cassandra node) down to the destination network. This should take ~forever.
Setup sstableloader
Since sstableloader uses gossip to communicate with the destination ring, it is going to read the listen_address and storage_port values from the cassandra.yaml file and use this ip-address and port to communicate with the destination Cassandra ring. This means if you want to run sstableloader on the same machine as a running Cassandra instance you'll get the following error because Cassandra is already using this ip-address and port to communicate with the other nodes in the ring:
org.apache.cassandra.config.ConfigurationException: /127.0.0.1:7000 is in use by another process. Change listen_address:storage_port in cassandra.yaml to values that do not conflict with other services
To get around this you'll have to create a new loopback ip-address for sstableloader to use. Running the following command (this is the Mac OS X / BSD ifconfig syntax) will create a new loopback alias on ip-address 127.0.0.2:
sudo ifconfig lo0 alias 127.0.0.2
Note: after you are finished using sstableloader and want to remove the new loopback alias run this command:
sudo ifconfig lo0 -alias 127.0.0.2
Also, since sstableloader reads the "../conf/cassandra.yaml" file to figure out what ip-address and port to use, you'll have to make a copy of the Cassandra install folder so you can change the yaml file without affecting the running Cassandra instance. So make a copy of the cassandra install folder, and rename it to something like apache-cassandra-0.8.1-sstableloader. Then open the conf/cassandra.yaml file in an editor and change the listen_address to 127.0.0.2.
sstableloader should now be fully configured to run from the apache-cassandra-0.8.1-sstableloader folder.
Since sstableloader will be dumping A LOT of data into the destination Cassandra cluster, you probably want to disable data file compaction in your destination cluster during this process. This will speed up the import process and consume WAY less disk space on each destination cassandra node.
Run sstableloader
When sstableloader runs, it uses the name of the folder that has the source data and index files as the Keyspace to write to in the destination ring. So for each Keyspace in your source Cassandra cluster, create a folder with the exact name of that Keyspace. Unpack one of your Keyspace tarballs into this folder (make sure it's the correct Keyspace). Since different nodes from the source cluster probably use the same file names for the data and index files, you'll have to do this one tarball at a time to ensure you don't overwrite a file from a different node. Or you could write a utility to add a node number prefix to each file; that way you could unpack all tarball files into one folder, and run sstableloader only once for that Keyspace.
Once your data and index files are unpacked in a folder that has the same name as the destination Keyspace, run the following command to kick off sstableloader (NOTE: make sure you run sstableloader from the copied cassandra install folder you created earlier so it will use the correct ip-address for gossip):
bin/sstableloader /some/path/to/the/<KeyspaceName>
Once this is finished, delete the data and index files in the Keyspace folder and unpack the next tarball into the folder and repeat the process until the Keyspace has all data loaded into it. Then repeat this process for the other Keyspaces.
Once this is complete, you'll want to re-enable compaction on the destination Cassandra cluster and manually kick off a compaction to get rid of any duplicate data (because you imported all data from every node in the source cluster, and each piece of data was replicated 3 times).
In my previous post I mentioned a possible bug with MongoDb's MapReduce. Well, I played with it a bit further. I went ahead and added a string field to each document to hold the numeric Twitter post id as a string value. Then I changed my map function to use this string value instead of the numeric id. This time the duplicate post query worked perfectly. Its results showed that there were no duplicate posts at all.
So, what is the problem? Is it a bug in MongoDb's MapReduce implementation? I doubt it. Is it a numeric bug with JavaScript? Could be. Does my code suck? Totally possible. In looking at my MapReduce query that uses a long as part of the key, I can't see anything that would cause the false positive results. So my bet goes with a numeric issue in JavaScript.
I wanted to run a simple sanity check to make sure I didn't have any duplicate twitter posts in my database. I didn't think I had any, but you never can be too sure, so I whipped up a simple MapReduce query to check. Right now I'm storing twitter posts in MongoDb using the document schema shown below:
{
    category
    post
    {
        post_id
        created_date
        from_user
        from_user_id
        geo
        iso_lang
        post_text
    }
}
Each document has two fields, a category field that holds what search was used to find the post, and a post field to store the returned Twitter post. I wanted to make sure that I didn't have any duplicate posts for any given category (though it is allowed to have duplicate posts across categories).
My map function created a key by concatenating the category and the post.post_id fields together, and returned this key with the value 1. Then my reduce function just counts how many values have this key...Easy Peasy.
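For illustration, the same map and reduce logic can be mimicked in plain Python, outside of MongoDb. This is a sketch of the idea only, not the JavaScript that ran on the server; the function names and the "|" separator are made up here.

```python
from collections import defaultdict

def map_post(doc):
    # Key is category + post id; the value 1 lets the reducer count occurrences.
    key = "%s|%s" % (doc["category"], doc["post"]["post_id"])
    return key, 1

def reduce_counts(values):
    # Count how many documents were emitted under the same key.
    return sum(values)

def find_duplicates(docs):
    grouped = defaultdict(list)
    for doc in docs:
        key, value = map_post(doc)
        grouped[key].append(value)
    counts = {key: reduce_counts(values) for key, values in grouped.items()}
    # Anything counted more than once is a duplicate within its category.
    return {key: count for key, count in counts.items() if count > 1}

docs = [
    {"category": "Red Wine", "post": {"post_id": 1}},
    {"category": "Red Wine", "post": {"post_id": 1}},
    {"category": "White Wine", "post": {"post_id": 1}},  # same post, other category: allowed
]
print(find_duplicates(docs))  # {'Red Wine|1': 2}
```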
When I ran the MapReduce job on MongoDb, the results said that out of 270,000 posts, I had 5 duplicated posts. I took the returned key which has the category and post id and ran a MongoDb query to search for them...and it returned one document. Hmmm...odd. I did the same test on the other 4 reported duplicates and got the same result. Only one document returned for each category and post id query. Really odd.
Ok, to troubleshoot this further I changed my map function to pass back both the value 1 and the document object as its value, and then changed the reduce function to return not just the sum of all dups, but the actual document objects that made up the dup.
This is where it really got odd. When I ran the MapReduce query again, I got the same results: 5 duplicate documents as expected. Then I printed out the two documents that made up each duplicate pair, and what did I see? They had unique category / post_id combinations!
Red Wine | 16201436807299072
Red Wine | 16201436807299073
For all five duplicate pairs that MongoDb returned, each pair had the same category, but the post ids were off by 1.
I'm not sure what to think about this. It seems like it might be some kind of numeric handling or overflow error in JavaScript as a result of Twitter ids being so large. Should I change all my ids into strings when I store them in MongoDb? Maybe that would solve the issue. Also, since all my analytics are going to be MapReduce based, I'm not sure if I can trust the results or not.
The next step in troubleshooting this is to add a string field to each post to hold the string value of each id. Then re-run the duplicate check to see if I get the same results.
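One way to see how two ids that differ by 1 could collapse into the same key: JavaScript numbers are IEEE 754 doubles, which can only represent every integer exactly up to 2^53. Python's float is the same format, so the effect can be reproduced with the two ids above. Whether the ids really passed through a double inside the MapReduce job is an assumption here, not something verified against MongoDb.

```python
a = 16201436807299072
b = 16201436807299073

print(a == b)                # False: as integers the ids differ
print(float(a) == float(b))  # True: as doubles they collapse to one value
print(int(float(b)))         # 16201436807299072
print(2 ** 53)               # 9007199254740992, above this not every integer is exact
print(str(a) == str(b))      # False: string keys keep the ids distinct
```

Both ids are larger than 2^53, where doubles can only step in increments of 2, so the odd id gets rounded onto its even neighbour.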
I'm still getting used to this whole schema-less, document oriented database thing. I've been trying to figure out what document structure to store my Twitter data in, and I keep flipping back and forth. I've been writing software in the land of relational databases for 15 years, and consider myself pretty good at designing database schemas for enterprise class solutions. But when it comes to this whole schema-less document store thing, I'm not sure what the best approach is.
With this Twitter data I'm analyzing, I really only have one chunk of data: a Twitter search result json document. Its data can be broken into two parts: fields about the post, and fields about who sent the post (there are also fields about who the post is sent to, but in this scenario I don't really care about those). So I have a few choices on how to store this data.
1) One MongoDb collection to store each individual Twitter post. Seems obvious, but then user data is duplicated.
2) Two MongoDb collections, one for the Twitter posts, and one for the users who sent them. Seems too RDBMS to me.
3) One MongoDb collection to store the users, where each user has a sub-document that stores all the posts for that user. No duplication, no RDBMS'isms.
My initial effort went down the all too familiar RDBMS route...setting up two collections: one for the posts, and one for the users. But as soon as I started playing with a few analytic aggregate queries I immediately ran into a wall. There is no join operation across collections in MongoDb. Queries in MongoDb work against one collection, and one collection only (as far as I know). Now this structure doesn't leave me totally hosed. The posts collection does have a user id field. I could do all aggregate queries using this user id, return the results, then as I iterate through the results, look up the users in the users collection with this id.
But what about my other two options? The software engineer in me just doesn't like option 1. It's the whole duplication of data thing that just goes against the grain of all that I hold near and dear to my heart when it comes to managing data.
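The "aggregate on posts, then look up users" workaround is essentially an application-side join. A minimal sketch, using plain Python lists in place of the two collections; `from_user_id` comes from the post schema, while `user_id` and `name` are made-up field names for the users collection.

```python
from collections import Counter

def posts_per_user(posts, users):
    # Step 1: aggregate on the posts alone, keyed by user id.
    counts = Counter(post["from_user_id"] for post in posts)
    # Step 2: index users by id so each lookup is a dictionary hit.
    users_by_id = {user["user_id"]: user for user in users}
    # Step 3: stitch the two together while iterating the aggregate results.
    return [
        {"user": users_by_id.get(user_id, {}).get("name"), "posts": count}
        for user_id, count in counts.most_common()
    ]

posts = [{"from_user_id": 7}, {"from_user_id": 7}, {"from_user_id": 9}]
users = [{"user_id": 7, "name": "alice"}, {"user_id": 9, "name": "bob"}]
print(posts_per_user(posts, users))
# [{'user': 'alice', 'posts': 2}, {'user': 'bob', 'posts': 1}]
```

Against a real database, step 2 would be one query per user (or one batched query for all the ids in the result), which is the extra round trip a join would have saved.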
So that leaves option 3. I think I'm going to run all my data collection in parallel for a while using option 2 and 3, write some analytic queries against both, and see which one works better.
I do think it's kind of funny that I spend so much time trying to figure out what schema to use to represent my data in a schema-less data store.
When using the Twitter Search API, the returned posts contain a "created_at" date time stamp with the time set to GMT time. This becomes an issue when using pymongo to store the datetime object in MongoDB. When a datetime object gets stored in MongoDb, mongo (or the pymongo library) updates the datetime object to reflect the current timezone.
So in my case, since I live in Seattle, the datetime values get offset by 8 hours. For example, a Twitter post with a timestamp of "12/21/2010 06:07:56" gets stored in MongoDB with a value of "12/20/2010 22:07:56 -0800", which is really the same instant. It's saying the local time is 22:07, which is 8 hours behind GMT.
This behavior is good to keep in mind when debugging your analytics, especially when grouping by date.
I'm not sure if the pymongo library is doing this or if it's MongoDb. But it can make a difference. Consider if the server running your python script is in New York and the server hosting MongoDb is in San Francisco. Then things could really get fun when debugging the issue.
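The example timestamp above can be reproduced with nothing but the Python standard library. This sketch only shows the arithmetic of the offset; it does not touch pymongo or MongoDb, and it uses a fixed -08:00 offset rather than a real time zone database.

```python
from datetime import datetime, timedelta, timezone

created_at_utc = datetime(2010, 12, 21, 6, 7, 56, tzinfo=timezone.utc)
pacific_standard = timezone(timedelta(hours=-8))  # fixed offset, ignores daylight saving

local = created_at_utc.astimezone(pacific_standard)
print(local.isoformat())        # 2010-12-20T22:07:56-08:00
print(local == created_at_utc)  # True: same instant, different wall clock
print(local.date(), created_at_utc.date())  # 2010-12-20 2010-12-21
```

The last line is the part that bites when grouping by date: the same post lands on December 20 or December 21 depending on which representation you group on.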
In my quest to dig into how people use social media to communicate about wine I ran into a snag. One of the things I really wanted to do was collect geographic data about each post. When you do a twitter search, a post can contain latitude and longitude data if the source supports GPS, but so far almost none of the posts about wine actually have this data.
So the next best thing is to figure out where the user lives. A twitter user profile has a location field. It's free text unfortunately, but it's better than nothing. So for every post returned by a search query, I would check in mongodb whether I had stored the user profile yet. If not, I would call the twitter api to go get the user profile, then pull out the location and time zone of the user, and store it back in mongo. This worked great about 150 times. Then I ran into a fun little thing called rate limiting. Apparently twitter limits any application to 150 api calls per hour (the search api has a different threshold which I haven't hit yet). Once you go over this threshold, you get blacklisted for the next 8 or so hours.
So, considering I have about 50,000 users, at 150 lookups per hour, it'll only take me around 14 days to catch up and have location data for all my users. Of course people don't stop posting on twitter, so in 14 days, I'll probably have 50,000 additional users. At some point this curve will flatten off, but not for a while.
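The 14 day estimate is just this arithmetic, assuming the lookups run around the clock right at the limit:

```python
users = 50000
calls_per_hour = 150

hours = users / calls_per_hour
days = hours / 24
print(round(hours, 1), round(days, 1))  # 333.3 13.9
```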
From an analysis point of view, this location information would be really interesting specifically with people who post to twitter with geographic data. If I could find these users, who post often about wine, and each post contains lat/lng information, I could see how people communicate about wine in relation to where they are. For example, someone who lives in San Francisco might never talk about wine, except for trips out to Napa Valley, at which point they post a lot about wineries they visit.
Ok, lesson learned. I now know why twitter data gives all ids in both numeric and string format. When I first saw this, I thought "what a waste of data, repeating all numeric fields as both strings and numerics. That's stupid". Then I started noticing something odd. In my python script, my twitter ids weren't matching up to what I had in MongoDB.
Turns out Python (simplejson to be exact) takes all numerics from the Twitter json document and turns them into ints. But if the number is too big, it overflows back to a negative number, or worse, a smaller positive number. So, note to self: explicitly take all numeric data from twitter from the string fields, and manually turn them into longs.
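A minimal sketch of that "note to self", using the standard library json module rather than simplejson and a made-up two-field document. Twitter's string twin of `id` is `id_str`; the helper name is hypothetical. In current Python 3 large integers parse without loss anyway, so the sketch is mostly about making the string field the single source of truth.

```python
import json

raw = '{"id": 16201436807299073, "id_str": "16201436807299073"}'
post = json.loads(raw)

def post_id(post):
    # Read the string field and convert it ourselves, so the value never
    # depends on how a JSON parser or a database driver handles big numbers.
    return int(post["id_str"])

print(post_id(post))  # 16201436807299073
```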
I've been playing with a new side project: wine data analytics mined from Twitter, and stored in MongoDb.
I took a first stab at writing a Python celery service to search for wine related twitter posts and dump them in MongoDb. First lesson learned is that a ton of people post about the word 'Wine' on Sunday afternoons. I wrote the service to pull 100 posts, then search again using the lowest post id as the max id to return. After storing 11K posts in MongoDb I looked at the earliest date and realized that it was just a few hours' worth of data (vs what I assumed would be several days' worth).
Ok...so 'Wine' is just way too broad. So I changed my list of search terms to NOT include just the single word 'Wine'. Depending on how all this data gets stored, and how much disk space it takes, I might have to go with more of a streaming analytics approach, where I just read the data, add it to the aggregations, and let it go. We'll see.
During all this, it was good to play with MongoDb. I have an instance running on my macbook. Managing it via the terminal was ok, but I found a good mac based UI tool called MongoHub that works pretty well. There is also a JNLP browser based tool called MongoBrowser. It's ok for very simple things when you don't have a lot of data, but it's not very functional.
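The paging loop described above looks roughly like this. It is a sketch only: `search` is a stand-in for whatever function calls the Twitter Search API, `fake_search` exists purely so the example runs, and subtracting 1 from the lowest id assumes that the max id parameter is inclusive.

```python
def collect_posts(search, query, page_size=100, max_pages=5):
    """Page backwards through search results using the lowest id seen so far."""
    collected = []
    max_id = None
    for _ in range(max_pages):
        page = search(query, count=page_size, max_id=max_id)
        if not page:
            break
        collected.extend(page)
        lowest = min(post["id"] for post in page)
        # Assumes max_id is inclusive; subtracting 1 avoids refetching the same post.
        max_id = lowest - 1
    return collected

def fake_search(query, count, max_id=None):
    # Stand-in for the real API call: ids 1..250, newest first.
    ids = [i for i in range(250, 0, -1) if max_id is None or i <= max_id]
    return [{"id": i, "text": query} for i in ids[:count]]

posts = collect_posts(fake_search, "pinot noir")
print(len(posts), posts[0]["id"], posts[-1]["id"])  # 250 250 1
```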
In the past three weeks, I've noticed that my laptop (dual core 2.1GHz, 2 GB RAM) has become amazingly sluggish. I only use it for communications and data lookup workflows, so the slowness was tolerable. But today I finally got fed up with the suckyness and decided to get to the root of the problem (I do have strong performance roots after all).
It actually didn't take all that long to figure it out. About a year ago I converted to Google Chrome (away from FireFox). One of the great tools Chrome has is a "Task Manager" tool that gives you Windows Task Manager like details for all the tabs open in the browser (Shift + Esc). Since every tab runs in its own process, it's easy from either Task Manager (Windows or Chrome) to identify and kill a single performance offending tab. This is unlike IE, where you only get aggregate data about all tabs open.
Anyway, I digress. Today my laptop sucked. Windows Task Manager told me that I had two memory hogging Chrome tabs, but couldn't tell me which web page those tabs are showing. Enter Chrome Task Manager which tells you the page title, along with CPU, memory and network utilization of each tab.
Enter my amazement. Turns out Facebook was using just shy of half a GB of RAM. Half a Gigabyte! That's 512 Megabytes! 524,288 Kilobytes! 536,870,912 Bytes! Or 4,294,967,296 Bits! In other words, that's a frackin boat load of memory.
Now consider that Facebook is running on pretty much 96.3% (statistics based on absolutely nothing) of every household desktop, laptop, netbook, and mobile device in America. That is pretty horrific!
And I wasn't playing any Facebook games like FarmWars or MafiaVille. I just had my normal, default home page up showing me who just had breakfast, or just got finished with their morning run.
I'm sorry...let me say that again...HALF A GIG OF RAM! That is just unforgivable.
I can just see my mom calling me up:
Mom: "John...I think I need a new computer. Mine is really slow these days"
John: "What do you have running?"
Mom: "Oh, just Facebook"
John: "Ok, close Facebook and tell me how fast your computer feels"
Mom: "Well...I don't know how fast it is. All I do is use Facebook"
John: "Ok Mom, I'll send you a new computer by Tuesday"
Oh yea...and the other offending web page? It was Twitter, using a quarter of a Gigabyte.
God I love social networks!
Despite what execs at Microsoft think, Windows 7 is NOT a tablet OS. Just because you can install some software (or OS) on a device, doesn't mean that device is meant to run that software. This seems to be the step that the non-engineer execs at Microsoft have not understood.
In order to seamlessly work with a device, the software needs to be designed with that device in mind. That has been the problem with the Windows PDA platform, the Windows Mobile platform, and now with trying to force fit Windows 7 on a tablet. It's just not designed for that style of interaction.
Windows is designed to be interacted with via a mouse and keyboard. In fact, it is brilliant at that. But it is NOT designed to be interacted with by your fingers. And that is why the Windows tablet failed 10 years ago, and why it will fail today. It's not the hardware's fault like Microsoft claimed 10 years ago. It's the User Interaction design that failed.
And this is why the iPhone and Android OS's work wonderfully on a tablet. The user interaction was designed for small screens, navigated by big fat fingers. I love these OS's and how I interact with them. And when I play with a touch screen Windows 7 device, I feel like I'm playing with a brittle wannabe. And it's not the hardware's fault. The touchscreen is very responsive. I actually like the hardware. But the OS and the software are just not designed to be interacted with, with my big fat fingers.
In order to be successful, Microsoft needs to start from scratch, and build a platform AND SOFTWARE specifically for use by fingers. That's why everyone was so excited when they thought Microsoft was going to release the Courier tablet. Because it looked like a totally different platform. Something that might actually work. But Windows 7...I hate to burst your bubble, but you are not a touch platform.
I've been playing with Microsoft Live Labs Pivot to create a hierarchy of collections all linked together to allow someone to explore a hierarchy of data visually. The problem has been the generation time of the entire hierarchy. I end up creating 500 - 600 collections total and it takes hours and hours using the CollectionCreator class that comes with the DeepZoomTools.
So digging around I found a way to make the actual DeepZoom collection creation wicked fast. Don't use the CollectionCreator!
Turns out Pivot doesn't actually use the image pyramid generated by the CollectionCreator. Or if it does, it's only when you open a new collection and it shows all the images zooming in. But once the zoom in is complete, Pivot uses the individual DeepZoom images. What Pivot does need is the xml generated by the CollectionCreator, which is in a very simple format.
So what I did was manually generate the xml for the collection image pyramid, and then create the folder structure required (one folder per level of the pyramid), and put a single pixel png file in each folder.
Now, I can create the required files and folders for 500 collections in about 10 seconds. Sweet!
Now you still have to use the ImageCreator to create a DeepZoom image for each image in the collection and that still takes some time, but at least the total processing time is way better.
Learning how to programmatically make collections for Microsoft Live Labs Pivot has been a pretty interesting ride. There are very few examples out there, and the folks at MS Live Labs are often slow on any feedback. But that is what Reflector is for, right?
Well, I was creating these InfoCard images (similar to the Car images in the "New Cars" sample collection that MS created for Pivot), and wanted to put a Tag Cloud into the info card. The problem was that the size of the tag cloud might vary in order for all the tags to fit into it (often being bigger than the info card itself). This was because of the varying word lengths and calculated font sizes.
So, to fix this, I made the tag cloud its own separate image from the info card. Then, I would create a sparse image out of the two images, where the tag cloud fit into a small section of the info card. This would allow the user to see the info card, but then zoom into the tag cloud and see all the tags at a normal resolution. Kind'a cool.
But...I couldn't find one code example (not one!) of how to create a sparse image. There is one page on the SeaDragon site (http://www.seadragon.com/developer/creating-content/deep-zoom-tools/) that goes over the API for creating images and collections, and it sparsely goes over how to create a sparse image, but unless you are familiar with the API already, the documentation doesn't help very much.
The key is the Image.ViewportWidth and Image.ViewportOrigin properties of the image that is getting superimposed on the main image. I'll walk through the code below. I've set up a couple of Point structs to represent the parent and sub image sizes, as well as where on the parent I want to position the sub image. Next, create the parent image. This is pretty straightforward. Then I create the sub image. Then I calculate several ratios: the width to height ratio of the sub image, the width ratio of the sub image to the parent image, the height ratio of the sub image to the parent image, then the X and Y coordinates on the parent image where I want the sub image to be placed, represented as a ratio of the position to the parent image size.
After all these ratios have been calculated, I use them to calculate the Image.ViewportWidth and Image.ViewportOrigin values, then pass the image objects into the SparseImageCreator and call Create.
The key thing that was really missing from the API documentation page is that when setting up your sub images, everything is expressed as a ratio in relation to the main parent image. If I had known this, it would have saved me a lot of trial and error time. And how did I figure this out? Reflector of course! There is a tool called Deep Zoom Composer that came from MS Live Labs which can create a sparse image. I just dug around the tool's code until I found the method that creates sparse images. But seriously...look at the API documentation from the SeaDragon site and look at the code below and tell me if the documentation would have helped you at all. I don't think so!
// Requires a reference to DeepZoomTools.dll (Image and SparseImageCreator live there).
// Note: Point must have double X/Y for the ratios below; with an integer Point the divisions would truncate.
public static void WriteDeepZoomSparseImage(string mainImagePath, string subImagePath, string destination)
{
    // pixel sizes of the parent and sub images, and where the sub image sits on the parent
    Point parentImageSize = new Point(720, 420);
    Point subImageSize = new Point(490, 310);
    Point subImageLocation = new Point(196, 17);
    List<Image> images = new List<Image>();

    // create main image
    Image mainImage = new Image(mainImagePath);
    mainImage.Size = parentImageSize;
    images.Add(mainImage);

    // create sub image
    Image subImage = new Image(subImagePath);
    double hwRatio = subImageSize.X / subImageSize.Y;          // width to height ratio of the tag cloud
    double nodeWidth = subImageSize.X / parentImageSize.X;     // sub image width to parent image width ratio
    double nodeHeight = subImageSize.Y / parentImageSize.Y;    // sub image height to parent image height ratio
    double nodeX = subImageLocation.X / parentImageSize.X;     // x coordinate position on parent / width of parent
    double nodeY = subImageLocation.Y / parentImageSize.Y;     // y coordinate position on parent / height of parent

    // everything is expressed relative to the sub image's width; guard against a zero-sized sub image
    subImage.ViewportWidth = (nodeWidth < double.Epsilon) ? 1.0 : (1.0 / nodeWidth);
    subImage.ViewportOrigin = new Point(
        (nodeWidth < double.Epsilon) ? -1.0 : (-nodeX / nodeWidth),
        (nodeHeight < double.Epsilon) ? -1.0 : ((-nodeY / nodeHeight) / hwRatio));
    images.Add(subImage);

    // create sparse image
    SparseImageCreator creator = new SparseImageCreator();
    creator.Create(images, destination);
}
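As a sanity check on the ratios, here is the same arithmetic as a small Python sketch using the sizes from the code above. Algebraically the three expressions reduce to the parent width, the x position and the y position each divided by the sub image width; that simplification is worked out from the code, not taken from the API documentation. The function name is made up.

```python
def sparse_viewport(parent_size, sub_size, sub_location):
    """Return (viewport_width, (origin_x, origin_y)) for a sub image on a parent."""
    parent_w, _parent_h = parent_size
    sub_w, _sub_h = sub_size
    x, y = sub_location
    # 1 / (sub_w / parent_w)
    viewport_width = parent_w / sub_w
    # (-x/parent_w) / (sub_w/parent_w), and the y term after dividing by width/height
    origin = (-x / sub_w, -y / sub_w)
    return viewport_width, origin

width, origin = sparse_viewport((720, 420), (490, 310), (196, 17))
print(round(width, 4), round(origin[0], 4), round(origin[1], 4))  # 1.4694 -0.4 -0.0347
```

So both origin coordinates end up measured in units of the sub image's width, which is why the height ratio has to be divided back out by the width to height ratio.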
I've spent the past 3 weeks working a lot with Pivot from Microsoft Live Labs (http://getpivot.com/). Pivot is a tool that allows you to visually explore data. It's an interesting take on visual data mining.
Anyway, I've been writing a lot of code that creates a hierarchy of Pivot collections, where one item in the collection drills down into an entirely new collection.
The dev community around Pivot is still very young, so there isn't much tribal knowledge built up yet. I've spent a lot of time trying to get things to work through trial and error, as well as digging around in Reflector. But I've finally got a framework built for programmatically creating DeepZoom images, Pivot collections, Sparse Images, etc.
If anyone has any questions, or suggestions on a post topic, leave a comment and I'll try and answer your question.
This is something I've been wondering about for about a year now. Microsoft has a history of creating very useful products, with lots of useful features. But useful does not mean usable. A lot of stuff coming out of Redmond the past 10 years doesn't really seem to have been well thought out from a user design point of view. Lots of extra steps, lots of popup windows...very little innovative thinking going on about the user experience of these products.
But about a year ago I started seeing changes in the new products coming out of Microsoft. Windows 7 is a good example of a big change. They really got their asses handed to them on Vista, so they had to make a change. But it looks like this change in philosophy has bled over to other areas. The new Office (2010) lineup has a lot of changes in it to make it way more usable.
Given that big changes like this take about 3 years to go from start to actually shipping product, I'm curious what happened internally at Microsoft that really drove this change in product design. I think that Microsoft got so focused on just adding new functionality for so long that they forgot about the little things that can really make or break a product. Office 2010 is full of these little things that make it much nicer to use. I just hope it's not too late for them.
I've spent the past 7 years focused primarily on code and database performance. It's an area that I have a passion for, as well as a propensity. But what I've found is that it's very hard to change the culture of a development environment. You can teach performance, you can encourage performance, and you might see a slight shift in how devs think about performance. But without full management backing and support you won't get long lasting changes in the development culture. And in the end, you are back to being the "Perf Guy", fixing performance design flaws, after the fact, one by one by one.
Which is why last year I asked my boss to change my title and responsibilities to more naturally align with the team I was working for. So now I'm a Computing Research Engineer (vague, I know), researching in the field of Big Data analytics and visualization.
I've found this change revitalizing and a lot of fun. And given the nature of Big Data (it's, um…big) the performance aspects are ever present.
Recently a performance bug came my way. A highly multithreaded application, which can run for hours depending on the amount of data it's processing, was observed with all its CPUs ramping up to 100% utilization while the amount of data processed per second dropped down to nothing.
Ok, no big deal here. I've most likely got a state where all threads are stuck in a tight loop (most likely the same loop), and each thread is waiting on the other to set a flag that will allow them to exit the loop. Your basic deadlock issue. Pretty easy to fix, if I can reproduce the problem on my dev machine and use the debugger to tell me what the offending function is.
The problem is that I wasn’t able to reproduce it. Crap.
Ok…on to step two…
Looks like I'll have to find or build a tool that can give me the call stack of all the threads in a managed app. I started out trying to use System.Diagnostics.StackTrace and StackFrame. From a System.Threading.Thread instance I can create a StackTrace object and see what function the thread is in. But I can't get a list of all the System.Threading.Thread objects in an app. I have access to the Process.Threads collection, but that gives me a list of System.Diagnostics.ProcessThread objects, not System.Threading.Thread. Shoot…that’s not going to work.
Ok, the next step is to look at creating a really lightweight debugger, using .NET's ICorDebug API, to basically break into the app and dump out the call stack of all the managed threads. I found a couple of examples and it didn’t look too bad, but the only issue is that ICorDebug is a COM API. So I'd have to do all that fun C++ COM stuff…Ick. And I need the tool yesterday.
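The mismatch described above can be seen in a few lines. This is only a minimal sketch: the ThreadInspection class and its method names are made up for illustration, and it assumes nothing beyond the standard System.Diagnostics types. A StackTrace is easy to get for the thread you are already running on, while Process.Threads only hands back OS-level ProcessThread objects, which carry ids and timings but no managed call stack.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper, for illustration only.
public static class ThreadInspection
{
    // Call stack of the calling thread: this part is easy.
    public static string CurrentThreadStack()
    {
        return new StackTrace().ToString();
    }

    // OS-level view of every thread in this process: ids, but no managed call stacks.
    public static int[] OsThreadIds()
    {
        using (Process process = Process.GetCurrentProcess())
        {
            ProcessThreadCollection threads = process.Threads;
            int[] ids = new int[threads.Count];
            for (int i = 0; i < threads.Count; i++)
            {
                ids[i] = threads[i].Id;
            }
            return ids;
        }
    }
}
```

Nothing in the second method gets you from a ProcessThread back to a managed thread or its stack, which is why a debugger API ends up being needed.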
After digging around a bit more I found out that the Visual Studio debugger team wrote a very nice managed wrapper around ICorDebug, called MDbg. Sweet!
There is a bunch of info about it here.
After digging a bit further, I found that someone wrote a handy little tool called Managed Stack Explorer. Oh geez! The gods are smiling at me! That’s exactly what I need.
This little tool shows all managed apps running on your server. When you pick an app, it shows all threads in the process. When you click the thread, it shows you the call stack for that thread. Simple and nice.
With this tool, I was able to find the offending non-threadsafe function in about 5 minutes. Fixed, done, yippee.
But this post isn't about someone's tool, or about my bug fixing adventures. No, it's about coming across one of the most useful APIs I've seen in a long time! A simple and well designed .NET wrapper around ICorDebug, giving .NET developers full access to the CLR debugger. I'm very excited about the idea of a managed wrapper around ICorDebug. There are so many diagnostic tools that could be created with this. I'm looking forward to digging around in the API!
Although most of the topics I've written about are pretty random, I'll try to focus in on a much more narrow (yet incredibly broad) topic: multi-core vs many-core processing, parallel processing, and the paradigm shift that we software engineers are on the leading edge of having to face.
In short, Intel, AMD, and other hardware manufacturers are telling anyone who listens that programmers need to change the way they think about designing end-user software. End-user software needs to take advantage of multiple cores. And this doesn't mean spinning up a background thread to do some compute intensive request, so that our UI remains responsive. It means designing all compute intensive algorithms to scale to multiple processors.
Intel goes on to say that designing for 2, 4, or 8 processors is way too short sighted. We need to design our software to scale out to N processors, where N could be 16, 64, or 512.
Coding Horror has a great post from last year that demonstrates how well common end-user software takes advantage of multi-core processors. The results are sad, to say the least.
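To make "scale out to N processors" a bit more concrete, here is a minimal sketch of a compute loop that never names a thread count. It assumes the Task Parallel Library (Parallel.For in .NET 4 or later), which is my choice for the illustration and not something the vendors prescribe; the ScalableSum class is hypothetical. The runtime decides how many workers to use, each worker keeps a private subtotal, and the shared total is only touched once per worker.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical example class, for illustration only.
public static class ScalableSum
{
    // Sums the squares of the inputs using however many cores the runtime decides to use.
    public static long SumOfSquares(int[] values)
    {
        long total = 0;
        Parallel.For(
            0, values.Length,
            () => 0L,                                            // each worker starts with its own subtotal
            (i, loopState, subtotal) => subtotal + (long)values[i] * values[i],
            subtotal => Interlocked.Add(ref total, subtotal));   // merge once per worker, not once per item
        return total;
    }
}
```

The same code runs unchanged on 2 cores or 64; the design point is that nothing in the algorithm is shared between workers except the final merge.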
We can no longer just expect our software to get faster with the next chip release by Intel or AMD. What is worse, our software will most likely run slower on newer desktop and mobile chips.
The trend in processor manufacturing is to have slower, cooler, more efficient individual cores, and to pack more and more of them on a single chip. This means that end-user software that only uses 1 or 2 threads will actually run slower on newer processors.
This can be seen with Intel's new quad core mobile processor: QX9300. It has 4 cores, but each one runs at only 2.53 GHz. This is an amazing chip, but only for software that is actually designed to run across multiple cores.
To boil it down to a simplified problem statement: Software outlives hardware, and hardware ain't getting any faster. (more on that later)
Generally I'm not one to write a post that does nothing but highlight someone else's blog post...BUT...this one was important enough (IMHO) that I decided to break my own rule.
Are you building a Leatherman or a Samurai sword? (stupid linker isn't working)
http://petewarden.typepad.com/searchbrowser/2008/07/are-you-buildin.html
As programmers we always want to write new functionality...neat, new, COOL functionality. That's just what we do, and we love it.
But it's hard to keep in mind what our added functionality does to user efficiency. No matter what we think our job is all about, it's really about making the lives of our users easier and more efficient. That’s it…done…it's that simple.
This is easy to understand when writing a UI application. If a new feature causes the user to perform 5 extra steps with the UI, but those 5 extra steps only give a small return on efficiency (so small it wasn’t worth the time to perform the 5 new steps), then drop the feature, it's not worth it. If the feature is complex or confusing, and will cause the user to misuse it or skip it altogether, then drop the feature, it's not worth it.
Where this becomes harder to evaluate is in writing an SDK API. Like the above post states, we all want to write the ultimate architecture. The one that can do anything and everything. But "anything and everything" can quickly become a directionless mess, where you have several hundred classes with no obvious direction on how to weave them together into the next "Wonder Bread". What you end up with is a big mess that your users (other developers) will most likely just pass off as too complex, and then go look for a simpler API.
The last part of the above post states it perfectly.
"You end up with a million features, which makes it very time-consuming to build, and even when it's done, the number of different gizmos on your Leatherman scare off potential users. You need to have a strong connection to your actual customers, and be hearing about exactly what they need to do. Then you need to design around that, ruthlessly jettisoning anything that distracts from them achieving their goals."
For grins I looked at my code that calls:
T tmp = new T();
in Reflector, to see if it could shed any light on the T instance creation badness. Well, it turns out that the C# compiler spits out code to call Activator.CreateInstance:
T tmp = Activator.CreateInstance<T>();
I kind of get why the C# compiler does this, because it doesn't know what T is at compile time. But at run time the JIT compiler DOES know. I'm surprised that the C# team didn't build in the smarts to JIT code to explicitly call the default constructor of whatever type T is.
I recently needed to change how an array lookup worked to make it more efficient, and decided to use the List<T>.BinarySearch to do the lookup. The class that contained this lookup had a generic parameter, and was constrained like so:
public class SortedNameList<T> where T : class, INameValueItem, new()
{...}
where the T of List<T> was the same as the class generic parameter.
In order to do the BinarySearch, List<T> required an input of type T to search against. Since I only had the value of the property that will be compared against (an int), I needed to create a new temp instance of T, set the value, and then pass it into BinarySearch().
My unit tests passed, all the functionality was good, and I was happy. Then I ran my app under a profiler to see how much faster my fancy BinarySearch was.
To my surprise, the time spent doing the binary search calls was almost exactly the same as a linear lookup (over 1.2 million searches)! What the heck? I know that creating a new temp object each lookup isn't very efficient, but it shouldn't make that much of a difference.
So after looking a bit deeper and doing some more performance tests, I found out that creating a new instance of a generic ("T tmp = new T()") is sloooooow. How slow? How about 30X slower! WOW...I had no idea!
And it's not that it takes the CLR some time to figure out how to create a new T, where most of the time is spent on the first instance and the rest speed up. Nope, the duration to create a new T is consistent, from the first instance to the millionth instance.
Good to know...don't do that in a high volume area.
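One way to sidestep the Activator.CreateInstance call in a hot path is to build a constructor delegate once per type and reuse it. The sketch below is an assumption on my part, not what the original code did: FastNew<T> is a hypothetical helper, it relies on expression trees (System.Linq.Expressions, .NET 3.5 or later), and I have not benchmarked it here, so measure before trusting it.

```csharp
using System;
using System.Linq.Expressions;

// Hypothetical helper: compiles "new T()" into a delegate the first time T is used,
// then hands back the same delegate on every later call.
public static class FastNew<T> where T : new()
{
    public static readonly Func<T> Create =
        Expression.Lambda<Func<T>>(Expression.New(typeof(T))).Compile();
}
```

Inside the generic class the call site would become "T tmp = FastNew<T>.Create();". For the BinarySearch case specifically, keeping a single reusable probe instance and just resetting its value before each search would avoid the allocation entirely.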
I get a bit sick of checking for null on my IEnumerable objects before doing a foreach over them. In my opinion the CLR should check if the list is null, and if it is, just exit out of the foreach iteration as if there were no items in it.
Well, I was goofing around with Extension Methods a bit and figured out how to get this kind of functionality (sort of).
Now unfortunately Extension Methods can't override an existing method on a type, so I can't just create a new GetEnumerator extension method (well, actually I can make one, but it won't get called). But I can create a new method that returns IEnumerable, and just call the foreach on it.
So in order to do this, first add this class to your code:
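using System.Collections.Generic;

public static class MyExtensionMethods
{
    // Returns the input sequence, or an empty sequence when the input is null.
    public static IEnumerable<T> Enum<T>(this IEnumerable<T> input)
    {
        if (input != null)
        {
            foreach (var t in input)
            {
                yield return t;
            }
        }
        else
        {
            yield break;
        }
    }
}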
Now, anything that implements IEnumerable<T> will have the Enum method. Then all you have to do is call foreach on someClass.Enum(), even if someClass is null. Below is an example of how this works.
static void Main(string[] args)
{
    List<string> names = new List<string>()
        {"john", "kim", "jean", "brent"};

    //iterate names using stock enumerator
    foreach (string name in names)
        Console.WriteLine(name);

    //iterate names using extension method
    foreach (string name in names.Enum())
        Console.WriteLine(name);

    names = null;

    //oh man! I have to check for null...I hate that
    if (names != null)
        foreach (string name in names)
            Console.WriteLine(name);

    //Yea! I don't have to check for null anymore!
    foreach (string name in names.Enum())
        Console.WriteLine(name);
}
The extension method uses the "yield return" and "yield break" iterator syntax to let the foreach either spin over the IEnumerable if it's not null or, if it is null, hit "yield break", which makes IEnumerator.MoveNext return false. That tells the foreach that there are no more items in the list, so it should break out of the loop.
So, no more null checks!
<Update>
A reader commented that this could be optimized by using the static method Enumerable.Empty<T>. This would save an object instance from being created by the yield return functionality. The new and improved Extension Method is as follows:
// Requires: using System.Linq;
public static IEnumerable<T> Enum<T>(this IEnumerable<T> input)
{
    return input ?? Enumerable.Empty<T>();
}
I recently profiled a sproc that makes heavy use of the TSQL SUBSTRING function (hundreds of thousands of times) to see how it performs on a SQL 2005 database compared to a SQL 2000 database. Much to my surprise the SQL 2005 database performed worse...dramatically worse than SQL 2000.
After much researching it turns out the problem is that the column the text was stored in was an NTEXT, but SQL 2005 has deprecated NTEXT in favor of NVARCHAR(MAX). Now, you'd think that string functions on NTEXT would have the same performance on 2005 as they did on 2000, but that's not the case.
Ok, so NTEXT is old badness, and NVARCHAR(MAX) is new goodness. Then the next logical step would be to convert the column to be a NVARCHAR(MAX) data type, but here lies a little but very important gotcha.
By default NTEXT stores the text value in the LOB structure and the table structure just holds a pointer to the location in the LOB where the text lives.
Conversely, the default setting for NVARCHAR(MAX) is to store its text value in the table structure, unless the text is over 8,000 bytes, at which point it behaves like an NTEXT: it stores the text value in the LOB and stores a pointer to the text in the table.
So, just to recap, the default settings for NTEXT and NVARCHAR(MAX) are completely opposite.
Now, what do you think will happen when you execute an ALTER COLUMN on a NTEXT column that changes the data type to a NVARCHAR(MAX)? Where do you think the data will be stored? In the LOB structure or the table structure?
Well, let's walk through an example. First create a table with one NTEXT column:
CREATE TABLE [dbo].[testTable](
[testText] [ntext] NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
Next, put 20 rows in the table by running this insert 20 times (in SSMS or sqlcmd, following it with GO 20 does this):
INSERT INTO testTable SELECT 'hmmm...i wonder if this will work'
Then run a select query with STATISTICS IO turned on:
SET STATISTICS IO ON
SELECT * FROM testTable
SET STATISTICS IO OFF
Now, looking at the IO stats, we see there was only 1 logical read, but 60 LOB logical reads. This is pretty much as expected as NTEXT stores its text value in the LOB not the table:
Table 'testTable'. Scan count 1, logical reads 1, physical reads 0, read-ahead reads 0, lob logical reads 60, lob physical reads 0, lob read-ahead reads 0.
Now, let's alter the column to be an NVARCHAR(MAX):
ALTER TABLE testTable ALTER COLUMN testText NVARCHAR(MAX) null
Now when we run the select query again with STATISTICS IO we still get a lot of LOB reads (though fewer than we did with NTEXT). So it's obvious that when SQL Server did the alter table, it didn't use the default NVARCHAR(MAX) setting of text in row, but kept the text in the LOB and still uses pointer lookups to get the text out of the LOB.
Table 'testTable'. Scan count 1, logical reads 1, physical reads 0, read-ahead reads 0, lob logical reads 40, lob physical reads 0, lob read-ahead reads 0.
This is not as expected and can be devastating for performance if you don't catch it, since NVARCHAR(MAX) with text not in row actually performs WORSE than NTEXT when doing SUBSTRING calls.
So how do we fix this problem? It's actually fairly easy. After running your alter table, run an update statement setting the column value to itself, like so:
UPDATE testTable SET testText = testText
SQL Server then moves the text from the LOB structure into the table (if it is less than 8,000 bytes). So when we run the select again with STATISTICS IO we get 0 LOB reads.
Table 'testTable'. Scan count 1, logical reads 1, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
YEA! This is what we want.
Now, just for grins, what do you think happens if we change the NVARCHAR(MAX) back to NTEXT? Well it turns out that SQL Server moves the text back to the LOB structure. Completely backwards from what it did when converting NTEXT to NVARCHAR(MAX).
I was looking at Reflector addins the other day and ran across one that would be an amazing time saver.
It's an addin that generates the Reflection.Emit code!
Anyone who has ever spent any time with the Reflection.Emit namespace should immediately realize how wonderful this tool has the potential to be (as long as the generated code is of good quality of course).
Also, the way it integrates with Reflector is pretty slick. It adds a "Reflection.Emit" choice to the list of languages you want Reflector to display the code in. Then, in the left pane, when you click a module, class, method, property, whatever, it displays in the right pane the Reflection.Emit code that you would have to write to generate the thing you clicked on.
Simple...and amazing!
I recently spent 6 days writing Reflection.Emit code to generate two fairly complex methods. 2 days each for writing the code, and 1 day each for debugging it and making it actually work. I probably could have cut that down to 1 to 2 days using this tool.
I haven't yet compared the addin's generated Reflection.Emit code to the code I've written manually to validate its quality, but just playing around with it, the generated code looks pretty good.
It can be found here:
http://www.codeplex.com/reflectoraddins/Wiki/View.aspx?title=ReflectionEmitLanguage&referringTitle=Home
I've spent a lot of time lately thinking about instrumentation and how to integrate it into software projects.
As a performance engineer I tend to think about instrumentation from the point of view of someone who wants to record the details of what a system is doing, and then dig through the data and use it to figure out what is wrong.
But as I’ve been talking to people the past few months about instrumentation, I’ve come to realize that instrumentation means different things to different people. Some people think of instrumentation as a high level, lightweight set of metrics that are easy to consume and understand, and from which performance deltas are easy to extrapolate; a management point of view. Other people, like me, think of it as recording low level details of what’s going on in the call stacks and SQL engine; a troubleshooter point of view. And then others think it's somewhere in between; everyone else.
Well, I think everyone is correct. There are different levels of instrumentation that are useful at different points in validating system health. There should be easy to consume and understand metrics for day to day health checks. Then there is medium level detail instrumentation that is used to figure out where a problem is, but takes a bit more effort to analyze. And if that isn’t enough to find and fix the problem, there is the dump-everything-to-file model that gives you all the data you need to understand what is going on in the system, but requires internal knowledge of the system and time to analyze the data. Also, each level builds upon the other, so there is as little duplicated effort as possible.
So I’ve tried to create an instrumentation model that demonstrates these different levels, the questions each level tries to answer, and when you move on to the next level.
The first level will provide you with the most bang for your buck early on, and it’s an easy way to tell if you have a problem, with as little dev effort as possible. Then as you get the high level metrics in, you can start building in the mid level metrics, and so on. The main thing is to not try and build the entire instrumentation framework up front before you put anything in. Start putting high level metrics in early and use them in your automated testing.
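As a rough illustration of levels that build on each other, here is a minimal sketch. Everything in it is hypothetical (the InstrumentationLevel names, the Instrumentation class, the in-memory list standing in for a real sink); it only shows the idea that call sites tag what they record with a level, and that the configured level decides how much detail is actually kept.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical levels: Health = cheap day to day metrics, Diagnostic = where is the
// problem, Trace = dump everything.
public enum InstrumentationLevel { Health = 1, Diagnostic = 2, Trace = 3 }

// Hypothetical recorder, for illustration only.
public sealed class Instrumentation
{
    private readonly List<string> _records = new List<string>();

    public Instrumentation(InstrumentationLevel level) { Level = level; }

    // The most detailed level that is currently being kept.
    public InstrumentationLevel Level { get; set; }

    public IList<string> Records { get { return _records; } }

    // Keeps the record only if its level is enabled; each level includes the ones below it.
    public void Record(InstrumentationLevel level, string name, double value)
    {
        if (level > Level) return;
        _records.Add(level + " " + name + "=" + value);
    }
}
```

Starting with only Health-level calls in the code matches the advice above: the Diagnostic and Trace calls can be added later without changing how the high level metrics are recorded or consumed.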
