Geeks With Blogs
John Conwell: aka Turbo
Research in the visual exploration of data
I'm still getting used to this whole schema-less, document-oriented database thing. I've been trying to figure out what document structure to store my Twitter data in, and I keep flipping back and forth. I've been writing software in the land of relational databases for 15 years, and consider myself pretty good at designing database schemas for enterprise-class solutions. But when it comes to this whole schema-less document store thing, I'm not sure what the best approach is.

With this Twitter data I'm analyzing, I really only have one chunk of data: a Twitter search result JSON document. Its data can be broken into two parts: fields about the post, and fields about who sent the post (there are also fields about who the post is sent to, but in this scenario I don't really care about those). So I have a few choices on how to store this data.

1) One MongoDb collection to store each individual Twitter post. Seems obvious, but then user data is duplicated.

2) Two MongoDb collections, one for the Twitter posts, and one for the users who sent them. Seems too RDBMS to me.

3) One MongoDb collection to store the users, where each user has a sub-document that stores all the posts for that user. No duplication, no RDBMS'isms.
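
To make the three layouts concrete, here is a rough sketch of how one post and its sender might look under each option, written as plain Python dictionaries (which map closely onto MongoDb documents). The field names and values are illustrative assumptions, not the actual fields of a Twitter search result.

```python
# Hypothetical field names and data; a real search result has many more fields.

# Option 1: one collection of posts, with the sender's data embedded
# (and therefore duplicated) in every post.
option1_post = {
    "_id": 1001,
    "text": "Trying out MongoDb",
    "user": {"id": 42, "name": "turbo"},
}

# Option 2: two collections; each post refers to its sender by id.
option2_user = {"_id": 42, "name": "turbo"}
option2_post = {"_id": 1001, "text": "Trying out MongoDb", "user_id": 42}

# Option 3: one collection of users, each holding an array of that user's posts.
option3_user = {
    "_id": 42,
    "name": "turbo",
    "posts": [{"id": 1001, "text": "Trying out MongoDb"}],
}
```

Note that in option 3 the user document grows with every post that user sends, while in options 1 and 2 each document stays roughly the same size.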

My initial effort went down the all too familiar RDBMS route: setting up two collections, one for the posts and one for the users. But as soon as I started playing with a few analytic aggregate queries, I immediately ran into a wall. There is no join operation across collections in MongoDb. Queries in MongoDb work against one collection, and one collection only (as far as I know). Now this structure doesn't leave me totally hosed. The posts collection does have a user id field. I could do all aggregate queries using this user id, return the results, then as I iterate through the results, look up the users in the users collection with this id.
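
The workaround described above amounts to a join done in application code. Here is a minimal sketch of the idea, using in-memory lists and dictionaries as stand-ins for the two collections; the data, field names, and the `posts_per_user` helper are all hypothetical, and no MongoDb driver is involved.

```python
from collections import Counter

# In-memory stand-ins for the two collections (hypothetical data and fields).
users = {
    42: {"_id": 42, "name": "turbo"},
    7: {"_id": 7, "name": "ryan"},
}
posts = [
    {"_id": 1, "user_id": 42, "text": "first"},
    {"_id": 2, "user_id": 7, "text": "second"},
    {"_id": 3, "user_id": 42, "text": "third"},
]


def posts_per_user(posts, users):
    """Aggregate on user_id first, then look each user up by that id."""
    # Step 1: the aggregate query, which touches only the posts collection.
    counts = Counter(post["user_id"] for post in posts)
    # Step 2: while iterating the results, fetch each user by id.
    return [
        {"name": users[user_id]["name"], "post_count": count}
        for user_id, count in counts.most_common()
    ]


report = posts_per_user(posts, users)
```

Against a real database, step 1 would be the aggregate query on the posts collection and step 2 would be one lookup per result in the users collection, so the cost grows with the number of distinct users in the result.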

But what about my other two options? The software engineer in me just doesn't like option 1. It's the whole duplication of data thing that goes against the grain of all that I hold near and dear to my heart when it comes to managing data.

So that leaves option 3. I think I'm going to run all my data collection in parallel for a while using options 2 and 3, write some analytic queries against both, and see which one works better.

I do think it's kind of funny that I spend so much time trying to figure out what schema to use to represent my data in a schema-less data store.

Posted on Tuesday, May 17, 2011 9:27 AM MongoDb

Comments on this post: Schema Design in Schema-less Datastores

# re: Schema Design in Schema-less Datastores
I think Option 1 is the most "document" oriented. You will run into Mongo's 4MB document size limit with option 3. Documents are intended to be denormalized. It reduces the amount of querying required during a read operation (since you typically read a specific record many more times than you write it, you should optimize for reads). There are set update operations for this case. Ex: if a user's name changed, you could, in a single operation, do where: {username: "bob" }, update: { username: "robert" } for all records in a collection. You can also create map-reduces that run in the background for every single view in your application and thus reduce the run-time calculations required to deliver a page. I don't really use that feature, but large sites/datasets do. I just love not being constrained to a schema and native support for persisting complex domain models.
Left by Ryan on May 17, 2011 8:58 PM

# re: Schema Design in Schema-less Datastores
Regarding option 3, keep in mind that each document is limited to 4MB (16MB in mongodb 1.8+). That severely limits how many posts each user can have. I'd go with option 2, and denormalize only the data you need (you won't need the password or salt fields in the 'posts' collection for example).
Left by tom on May 18, 2011 9:21 PM

# re: Schema Design in Schema-less Datastores
Good callouts on the size limitations in MongoDb. I wasn't aware of that. That pretty much kills option 3 from the design.

I think I'll go forward with option 1 for now and see how things go.
Left by turbo on May 23, 2011 9:30 AM


Copyright © John Conwell