Geeks With Blogs
John Conwell: aka Turbo
Research in the visual exploration of data
May 2011 Entries
MongoDb MapReduce Bug: Part 2
In my previous post I mentioned a possible bug with MongoDb's MapReduce. Well, I played with it a bit further. I added a string field to each document to hold the numeric Twitter post id as a string value, then changed my map function to use this string value instead of the numeric id. This time the duplicate post query worked perfectly. Its results showed that there were no duplicate posts at all. So, what is the problem? Is it a bug in MongoDb's MapReduce implementation? I doubt ......
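The excerpt is cut off before any cause is given, so the following is only one possible explanation, stated here as an assumption: if numeric ids are compared as 64-bit floating-point values (as JavaScript numbers are), distinct integers above 2**53 can collapse onto the same key, while string keys keep every digit. A minimal Python sketch with made-up ids:

```python
from collections import Counter

# Hypothetical post ids: three distinct integers at or above 2**53.
post_ids = [2**53, 2**53 + 1, 2**53 + 2]

def count_keys(ids, key_fn):
    """Count how many ids map to each key produced by key_fn."""
    return Counter(key_fn(i) for i in ids)

# Assumption: a double-based engine sees each id as a float.
as_double = count_keys(post_ids, float)
# String keys preserve every digit of the id.
as_string = count_keys(post_ids, str)

false_duplicates = {k: n for k, n in as_double.items() if n > 1}
real_duplicates = {k: n for k, n in as_string.items() if n > 1}

print("duplicates seen with float keys:", false_duplicates)
print("duplicates seen with string keys:", real_duplicates)
```

With float keys two of the three ids share a key and look like duplicates; with string keys none do.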

Posted On Wednesday, May 25, 2011 8:33 AM

MongoDb MapReduce Bug?
I wanted to run a simple sanity check to make sure I didn't have any duplicate Twitter posts in my database. I didn't think I had any, but you can never be too sure, so I whipped up a simple MapReduce query to check. Right now I'm storing Twitter posts in MongoDb using the document schema shown below: { category, post: { post_id, created_date, from_user, from_user_id, geo, iso_lang, post_text } } Each document has two fields: a category field that holds what search was used to find the post, and a post field ......
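The original MapReduce query is not shown in this excerpt. As a rough illustration of the idea only, here is a pure-Python sketch of a map step that emits (post_id, 1) and a reduce step that sums the counts. The function names, sample documents, and category values are hypothetical; only the field names come from the schema above.

```python
from collections import defaultdict

def map_post(doc):
    """Map step: emit the post id as the key with a count of 1."""
    yield doc["post"]["post_id"], 1

def reduce_counts(key, values):
    """Reduce step: total the counts emitted for one key."""
    return sum(values)

def find_duplicates(docs):
    """Group emitted values by key, reduce them, keep keys seen more than once."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_post(doc):
            grouped[key].append(value)
    counts = {key: reduce_counts(key, values) for key, values in grouped.items()}
    return {key: n for key, n in counts.items() if n > 1}

# Hypothetical documents following the category/post schema.
sample_docs = [
    {"category": "wine", "post": {"post_id": 101, "post_text": "a"}},
    {"category": "wine", "post": {"post_id": 102, "post_text": "b"}},
    {"category": "merlot", "post": {"post_id": 101, "post_text": "a"}},
]

duplicates = find_duplicates(sample_docs)
print(duplicates)
```

Any key left in the result appeared in more than one document, which is what the sanity check is looking for.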

Posted On Monday, May 23, 2011 9:36 AM

Schema Design in Schema-less Datastores
I'm still getting used to this whole schema-less, document-oriented database thing. I've been trying to figure out what document structure to store my Twitter data in, and I keep flipping back and forth. I've been writing software in the land of relational databases for 15 years, and consider myself pretty good at designing database schemas for enterprise-class solutions. But when it comes to this whole schema-less document store thing, I'm not sure what the best approach is. With this Twitter ......

Posted On Tuesday, May 17, 2011 9:27 AM

Twitter Search Results and GMT
When using the Twitter Search API, the returned posts contain a "created_at" timestamp with the time set to GMT. This becomes an issue when using pymongo to store the datetime object in MongoDB. When a datetime object gets stored in MongoDB, mongo (or the pymongo library) updates the datetime object to reflect the current timezone. So in my case, since I live in Seattle, the datetime values get offset by 8 hours. For example, a Twitter post with a timestamp of "12/21/2010 06:07:56" gets ......
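To make the 8-hour offset concrete, here is a small sketch using only the Python standard library. It assumes the timestamp string format from the example above and a fixed UTC-8 offset for Seattle in winter. It does not reproduce pymongo's behavior; it only shows how tagging a parsed value as UTC keeps the instant unambiguous when viewed in another zone.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-8 offset (Pacific Standard Time, no DST handling).
PACIFIC_STANDARD = timezone(timedelta(hours=-8))

def parse_gmt(text):
    """Parse an 'MM/DD/YYYY HH:MM:SS' string and mark it explicitly as UTC."""
    naive = datetime.strptime(text, "%m/%d/%Y %H:%M:%S")
    return naive.replace(tzinfo=timezone.utc)

created_at = parse_gmt("12/21/2010 06:07:56")
local_view = created_at.astimezone(PACIFIC_STANDARD)

print(created_at.isoformat())
print(local_view.isoformat())
```

The two values print differently but refer to the same instant, so comparing them with `==` returns `True`.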

Posted On Tuesday, May 17, 2011 9:24 AM

Twitter and Rate Limiting
In my quest to dig into how people use social media to communicate about wine, I ran into a snag. One of the things I really wanted to do was collect geographic data about each post. When you do a Twitter search, a post can contain latitude and longitude data if the source supports GPS, but so far almost none of the posts about wine actually have this data. So the next best thing is to figure out where the user lives. A Twitter user profile has a location field. It's free text, unfortunately, but its ......

Posted On Wednesday, May 4, 2011 5:19 PM

Twitter ints, longs, strings...oh my
OK, lesson learned. I now know why Twitter data gives all ids in both numeric and string formats. When I first saw this, I thought "what a waste of data, repeating all numeric fields as both strings and numerics. That's stupid". Then I started noticing something odd. In my Python script, my Twitter ids weren't matching up to what I had in MongoDB. Turns out Python (simplejson to be exact) takes all numerics from the Twitter JSON document and turns them into ints. But if the number is too big, it overflows ......
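A hedged sketch of the defensive habit this suggests: prefer the string form of an id as the lookup key. The field names `id` and `id_str` and the sample id are assumptions about the payload, not taken from the excerpt, and `tweet_key` is a hypothetical helper. The sketch uses the standard `json` module in Python 3, which parses integers exactly, so the precision loss is demonstrated by a round trip through a 64-bit float instead.

```python
import json

# Hypothetical payload carrying the same id as a number and as a string.
raw = '{"id": 73000000000000001, "id_str": "73000000000000001"}'

def tweet_key(doc):
    """Return the id as a string, preferring the string field when present."""
    if "id_str" in doc:
        return doc["id_str"]
    return str(doc["id"])

post = json.loads(raw)
key = tweet_key(post)

# A round trip through a 64-bit float does not preserve this id.
survives_float = int(float(post["id"])) == post["id"]

print(key)
print(survives_float)
```

Using the string key everywhere (in Python and in the database) avoids depending on how each layer represents large integers.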

Posted On Wednesday, May 4, 2011 1:14 PM

First experience with Twitter Search API and MongoDB
I've been playing with a new side project: wine data analytics mined from Twitter and stored in MongoDb. I took a first stab at writing a Python celery service to search for wine-related Twitter posts and dump them in MongoDb. First lesson learned is that a ton of people post about the word 'Wine' on Sunday afternoons. I wrote the service to pull 100 posts, then search again using the lowest post id as the max id to return. After storing 11K posts in MongoDb I looked at the earliest date and realized ......
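The paging loop described above can be sketched as follows. This is not the original service: `fake_search` is a hypothetical in-memory stand-in for the real search call, and `collect_posts` is an illustrative name. The sketch assumes the max id bound is inclusive, so it passes the lowest id minus one to avoid fetching the same post twice.

```python
def fake_search(query, max_id=None, count=100):
    """Hypothetical stand-in for a search call: newest-first posts up to max_id."""
    all_ids = range(1000, 0, -1)
    matching = [i for i in all_ids if max_id is None or i <= max_id]
    return [{"post_id": i, "query": query} for i in matching[:count]]

def collect_posts(search, query, limit, page_size=100):
    """Page backwards through results, using the lowest id seen as the next bound."""
    collected = []
    max_id = None
    while len(collected) < limit:
        page = search(query, max_id=max_id, count=page_size)
        if not page:
            break  # no older posts left
        collected.extend(page)
        # Assumption: max_id is inclusive, so step just below the lowest id seen.
        max_id = min(p["post_id"] for p in page) - 1
    return collected[:limit]

posts = collect_posts(fake_search, "wine", limit=250)
print(len(posts), posts[0]["post_id"], posts[-1]["post_id"])
```

If the bound were exclusive instead, the `- 1` would skip one post per page, so that detail needs checking against the actual API.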

Posted On Monday, May 2, 2011 4:34 PM

Copyright © John Conwell