I'm still getting used to this whole schema-less, document-oriented database thing. I've been trying to figure out what document structure to store my Twitter data in, and I keep flipping back and forth. I've been writing software in the land of relational databases for 15 years, and consider myself pretty good at designing database schemas for enterprise-class solutions. But when it comes to this whole schema-less document store thing, I'm not sure what the best approach is.
With this Twitter data I'm analyzing, I really only have one chunk of data: a Twitter search result JSON document. That data can be broken into two parts: fields about the post, and fields about who sent the post (there are also fields about who the post is sent to, but in this scenario I don't really care about those). So I have a few choices on how to store this data.
1) One MongoDB collection to store each individual Twitter post. Seems obvious, but then user data is duplicated.
2) Two MongoDB collections, one for the Twitter posts, and one for the users who sent them. Seems too RDBMS to me.
3) One MongoDB collection to store the users, where each user has a sub-document that stores all the posts for that user. No duplication, no RDBMS'isms.
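To make the three options concrete, here is a rough Python sketch of the document shapes each one would produce. It uses plain dicts and no database at all; the field names (`from_user_id`, `from_user`, `text`) and the sample posts are simplified assumptions for illustration, not the full Twitter search payload.

```python
# Hypothetical, trimmed-down search results; real payloads carry many more fields.
raw_results = [
    {"id": 1, "text": "hello mongo", "from_user_id": 10, "from_user": "alice"},
    {"id": 2, "text": "schemas are hard", "from_user_id": 10, "from_user": "alice"},
    {"id": 3, "text": "joins, anyone?", "from_user_id": 20, "from_user": "bob"},
]


def as_single_collection(results):
    """Option 1: one document per post, with the user fields repeated in each."""
    return [
        {
            "_id": r["id"],
            "text": r["text"],
            "user": {"id": r["from_user_id"], "name": r["from_user"]},
        }
        for r in results
    ]


def as_two_collections(results):
    """Option 2: posts point at users by id; each user is stored once."""
    posts = [
        {"_id": r["id"], "text": r["text"], "user_id": r["from_user_id"]}
        for r in results
    ]
    users = {}
    for r in results:
        users[r["from_user_id"]] = {"_id": r["from_user_id"], "name": r["from_user"]}
    return posts, list(users.values())


def as_embedded_posts(results):
    """Option 3: one document per user, with that user's posts embedded."""
    users = {}
    for r in results:
        user = users.setdefault(
            r["from_user_id"],
            {"_id": r["from_user_id"], "name": r["from_user"], "posts": []},
        )
        user["posts"].append({"id": r["id"], "text": r["text"]})
    return list(users.values())
```

Each function returns lists of dicts that could be inserted as documents into the corresponding collection(s).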
My initial effort went down the all too familiar RDBMS route: setting up two collections, one for the posts and one for the users. But as soon as I started playing with a few analytic aggregate queries, I ran into a wall. There is no join operation across collections in MongoDB. Queries in MongoDB work against one collection, and one collection only (as far as I know). Now this structure doesn't leave me totally hosed. The posts collection does have a user id field. I could do all aggregate queries using this user id, return the results, then as I iterate through the results, look up the users in the users collection with this id.
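That aggregate-then-look-up workaround amounts to doing the join in application code. Here is a minimal sketch of the idea in plain Python, with in-memory lists standing in for the two collections; the sample data and the function name `post_counts_with_users` are made up for illustration.

```python
from collections import Counter

# In-memory stand-ins for the two collections (hypothetical sample data).
posts = [
    {"_id": 1, "text": "hello mongo", "user_id": 10},
    {"_id": 2, "text": "schemas are hard", "user_id": 10},
    {"_id": 3, "text": "joins, anyone?", "user_id": 20},
]
users = [
    {"_id": 10, "name": "alice"},
    {"_id": 20, "name": "bob"},
]


def post_counts_with_users(posts, users):
    """Count posts per user id, then attach user details to each result row."""
    # Step 1: the "aggregate query", which only touches the posts.
    counts = Counter(p["user_id"] for p in posts)

    # Step 2: the manual "join", looking each user up by the id in the result.
    users_by_id = {u["_id"]: u for u in users}
    return [
        {
            "user_id": user_id,
            "name": users_by_id.get(user_id, {}).get("name"),
            "post_count": count,
        }
        for user_id, count in counts.most_common()
    ]


for row in post_counts_with_users(posts, users):
    print(row)
```

Against a real database, step 1 would be the aggregate query on the posts collection and step 2 would be one or more queries against the users collection, so the cost of the extra round trips is something to measure rather than assume.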
But what about my other two options? The software engineer in me just doesn't like option 1. It's the whole duplication of data thing that just goes against the grain of all that I hold near and dear to my heart when it comes to managing data.
So that leaves option 3. I think I'm going to run all my data collection in parallel for a while using options 2 and 3, write some analytic queries against both, and see which one works better.
I do think it's kind of funny that I spend so much time trying to figure out what schema to use to represent my data in a schema-less data store.