Twitter data storage and processing
We all know that Twitter is a microblogging and social networking service on which users create messages and post them known as tweets. As Twitter is a widely used platform for posting the messages, it generates lots of data. On Tweeter, about 6000 tweets created and posted per second. The total number of tweets sent are around 500 million per day or 200 billion tweets per year.
Although one tweet consists of a 140-characters message, it is seen that Tweeter generates more than 12 terabytes of data per day. That equals to the 84 terabytes per week and 4.3 petabytes per year. Such a huge amount of data is generated by the tweets that we make every day.
This data is processed, stored, cached, and analyzed every time the request is made. For such a massive content, twitter required huge storage and computing resources.
As the huge data is generated this is the problem of BigData. To overcome this problem, Twitter is making efficient use of BigData concepts. The first technology that Twitter uses to manage the data at a large scale is the Hadoop. By means of Hadoop, Twitter is using the concept of Distributed Storage. Hadoop clusters are running both compute and HDFS(Hadoop Distributed File System).
Twitter has multiple clusters storing over 500 PB of data. The biggest cluster of Twitter is over 10K nodes. Twitter runs 150K applications and launches 130M containers per day.
Originally Twitter is using MySQL to store the data. But as the data is kept on increasing the requirement of huge datastores is come up. And to speed up the data processing they require to use the concept of Distributed Storage. Twitter had introduced Gizzard — a framework for creating distributed datastores.
When we tweet it’s stored in an internal system called T-bird, which is built on top of Gizzard. Secondary indexes are stored in a Gizzard based system known as T-flock.
Unique IDs for each tweet are generated by Snowflake, which can be more evenly sharded across a cluster. FlockDB is used for ID to ID mapping, storing the relationships between IDs (uses Gizzard).
A Gizzard is Twitter’s distributed data storage framework built on top of MySQL (InnoDB). InnoDB was chosen because it doesn’t corrupt data and Gizzard is just a datastore.
Twitter originally intended to store MySQL backups using Hadoop, but now they are heavily using Hadoop for analytics.
Along with the tweet's message, users are sending media files like images, videos, etc. To store this media/object files, Twitter is using storage called Blobstore.