How to access the Twitter Streaming API from Python?

What is involved in building an application to consume, process, and filter Twitter's firehose of real-time tweets?

  • Does Twitter's streaming API give you full access to the real-time data stream, or is it rate limited? What is the cost, if any? What volume of data comes in? Are there third-party services that offer an affordable way to consume and filter the stream?

  • Answer:

    To consume the Twitter firehose, or any of the smaller variants (gardenhose, sprinkler, etc.), you'll need a dedicated server with ample bandwidth. For the smaller levels you might be able to get by with a Virtual Private Server (VPS).

    Software-wise, you'll need to write it in a language that can hold connections open for long periods without becoming a memory hog or slowing down (all of the majors will work: PHP, Java, Ruby, Perl, etc.; we use PHP). You'll also need queuing software, like Beanstalkd (we use this one), Gearman, Starling, or RabbitMQ. You shouldn't work directly from the Twitter connection; instead, queue everything Twitter sends your way and have workers connect to that queue server to process the tweets. If you ever get a large influx of tweets, this will save you.

    All of Twitter's streaming APIs are rate limited to a specific percentage of the overall current volume. Unless you run a large company with a big checkbook, you won't get the firehose from them (which has all the tweets). They do not currently charge for the other streaming APIs that they offer. Per our agreement with Twitter, I cannot answer the volume questions.

    As far as third-party services go, Gnip is the only authorized third-party company offering streaming APIs. They recently released a product called PowerTrack that might interest you (I'm not affiliated with them and do not use their products currently).
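The layout described above, one process holding the streaming connection open while separate workers drain a queue, can be sketched in Python. This is a minimal illustration using the standard library's queue module in place of a dedicated queue server such as Beanstalkd or RabbitMQ; the stream itself is simulated here, and the worker logic is a placeholder:

```python
import json
import queue
import threading

# In production this queue would live on a dedicated queue server
# (Beanstalkd, RabbitMQ, etc.); queue.Queue stands in for it here.
tweet_queue = queue.Queue()
processed = []

def stream_reader(raw_lines):
    """Producer: holds the (simulated) streaming connection and queues
    each raw tweet without doing any processing itself."""
    for line in raw_lines:
        tweet_queue.put(line)
    tweet_queue.put(None)  # sentinel: stream closed

def worker():
    """Consumer: drains the queue and does the slow processing, so a
    burst of tweets only grows the queue instead of dropping data."""
    while True:
        item = tweet_queue.get()
        if item is None:
            break
        tweet = json.loads(item)
        processed.append(tweet["text"])  # real processing would go here
        tweet_queue.task_done()

# Simulated streaming payloads (one JSON document per line, as the
# streaming API delivers them).
raw = [json.dumps({"text": f"tweet {i}"}) for i in range(5)]

t = threading.Thread(target=worker)
t.start()
stream_reader(raw)
t.join()
print(len(processed))  # prints 5
```

Decoupling the reader from the workers is the key point: the reader only ever appends to the queue, so a spike in tweet volume is absorbed by queue depth rather than by a stalled connection.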

Joel Strellner at Quora

Other answers

Consuming a firehose is no easy task, primarily because of its volume (roughly 110 million tweets a day, which is a huge amount of data). We get data from multiple sources, and Apache Flume came to our rescue in dealing with the large data volume and scalability issues. I was not able to use Flume out of the box, as it was designed for log collection; I had to do several hacks to make it work for me. Flume also has an active community that is ready to help.

As to the second part, Gnip is the one and only source for consuming the firehose. Using PowerTrack (alongside the firehose, decahose, and similar products) you can filter the Twitter stream remotely, and they charge per thousand tweets. If you are not willing to pay, then the spritzer is the option. It gives you approximately 3-5% of the firehose, and you would have to do the filtering part yourself.

Vivek Krishna
