I needed a Twitter data set for my thesis and struggled to find one that was freely available. I ended up downloading the data I needed myself, so I thought I would release it here. It comes in a slightly annoying format, a MongoDB dump file, but it should be easy enough to restore into MongoDB and use from there. The structure of each record is shown below.
So if you wanted to go over all records in the database, you would do it like this:
```python
# MongoClient replaces Connection, which was removed in PyMongo 3.0
from pymongo import MongoClient

if __name__ == "__main__":
    # connect to mongodb and get tweets collection
    db_connection = MongoClient()
    fyp_db = db_connection.fyp
    tweets_db = fyp_db.tweets

    num_tweets = 0
    for user_col in tweets_db.find():
        # tweets stored directly on this record
        num_tweets += len(user_col['tweets'])
        # tweets stored under each of this record's friends
        for friend in user_col['friends']:
            num_tweets += len(user_col['friends'][friend])

    print(num_tweets)
```
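The counting logic in the loop above can also be pulled into a small function that works on plain dictionaries, which makes it easy to check without a running MongoDB instance. This is only a sketch: the function name `count_tweets` and the sample records are made up for illustration, and it assumes each record has a `tweets` list and a `friends` mapping from a friend's id to that friend's list of tweets, as the loop implies.

```python
def count_tweets(records):
    """Count every tweet in an iterable of user records.

    Assumes each record looks like:
        {'tweets': [...], 'friends': {friend_id: [...], ...}}
    """
    total = 0
    for record in records:
        # tweets stored directly on this record
        total += len(record.get('tweets', []))
        # tweets stored under each friend of this user
        for friend_tweets in record.get('friends', {}).values():
            total += len(friend_tweets)
    return total


# Hypothetical sample records, not taken from the real dump.
sample_records = [
    {'tweets': ['a', 'b'], 'friends': {'111': ['c'], '222': []}},
    {'tweets': [], 'friends': {'333': ['d', 'e', 'f']}},
]
sample_total = count_tweets(sample_records)
```

With a live database, the same function can be fed the cursor directly, for example `count_tweets(tweets_db.find())`.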
Download Twitter Mongo Dump
May 3, 2012 at 9:25 am
I will update this with more data when I get the time to download some.
For anyone interested, Twitter also offers a public sample stream (a small random sample of the firehose, not the full feed) here: https://stream.twitter.com/1/statuses/sample.json