1. Requirement
1.1 Use Case
As a user,
- I can create my profile.
- I can post a tweet.
- I can list the tweets I posted before.
- I can follow / unfollow other users.
- I can list my followers and my followees.
- I can see others' tweets in my home timeline.
- I can see my own tweets in my profile timeline.
- I can read a tweet (unlike Facebook, you don't need to click into the message).
- I can delete my tweets, so that no one can see them any more.
NOTE: In an interview, you'd better list the common use cases, but pick one or two key ones to go in depth on. Time slips away quickly when you explain a design, and you want to show both the breadth and the depth of your knowledge.
1.2 Volume
Ask your interviewer for the TPS, unless they ask you to estimate it.
A useful way is to start from MAU or DAU based on the US population and the relative size of each use case.
It took me 20 minutes to think through the TPS of each use case below. That is absolutely not feasible in an interview, so in reality we should estimate just the reads and writes of one or two use cases.
Your interviewer would like to see how you estimate, not an accurate number, so I intentionally don't go too high on the DAU. On one hand, you may trap yourself in an over-challenging problem; on the other hand, we can start with the US market first, and if we still have time, think about scaling to the EU and the Far East by replication.
- MAU: ~20% of the US population - 60 Million
- DAU: 1/3 of MAU - 20 Million[^1]
- TPS
| API | Peak TPS | Reason |
|---|---|---|
| Show timeline (read tweets) | 10k | Assume every DAU visits twice a day, each visit with one refresh: 20e6 * 4 / (24 * 60 * 60) ≈ 1000. To cover peak hours and peak events, we give it a 10x buffer. |
| Post a tweet | 100 | 1% of read. |
| Comment a tweet | 1k | 10x the post rate. |
| Delete a tweet | 10 | Rare |
| List my tweets | 10 | Rare |
| Follow a user | 100 | 1% of read |
| Unfollow a user | 10 | Rare |
| List my followers | 100 | Same as follow |
| List my followees | 100 | Same as list followers |
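The read-TPS arithmetic in the table can be sketched as a back-of-envelope calculation (the inputs are the document's assumptions, not measured data):

```python
# Back-of-envelope peak read TPS for the home timeline.
SECONDS_PER_DAY = 24 * 60 * 60

dau = 20_000_000        # 1/3 of 60M MAU
reads_per_user = 4      # two visits a day, each visit plus one refresh
peak_buffer = 10        # headroom for peak hours and peak events

avg_read_tps = dau * reads_per_user / SECONDS_PER_DAY
peak_read_tps = avg_read_tps * peak_buffer

print(round(avg_read_tps))   # 926, rounded to ~1000 in the table
print(round(peak_read_tps))  # 9259, rounded to ~10k
```

The write-side rows then fall out as fractions of this read rate (post = 1% of read, comments = 10x posts, and so on).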
- Storage
  - 600 Million entries in the user table (assume MAU is 10% of all registered users)
  - 315 Million tweets per year (average post TPS is ~10, i.e. 10% of the 100 peak post TPS. 10 * 365 * 24 * 3600 ≈ 315 M)
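The same storage numbers, spelled out (again using the document's own assumptions):

```python
# Back-of-envelope storage estimation.
mau = 60_000_000
total_users = mau * 10              # assume MAU is 10% of registered users

avg_post_tps = 10                   # 10% of the 100 peak post TPS
tweets_per_year = avg_post_tps * 365 * 24 * 3600

print(total_users)       # 600000000 -> 600M user rows
print(tweets_per_year)   # 315360000 -> ~315M tweets/year
```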
2. API
Each use case can be exposed as a RESTful API.
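One hypothetical mapping from use cases to endpoints might look like this (the paths are illustrative, not Twitter's real API):

```python
# Illustrative REST endpoints, one per use case from section 1.1.
ENDPOINTS = {
    "create profile":   "POST   /users",
    "post a tweet":     "POST   /users/{user_id}/tweets",
    "list my tweets":   "GET    /users/{user_id}/tweets",
    "delete a tweet":   "DELETE /users/{user_id}/tweets/{tweet_id}",
    "follow a user":    "POST   /users/{user_id}/followees/{target_id}",
    "unfollow a user":  "DELETE /users/{user_id}/followees/{target_id}",
    "list followers":   "GET    /users/{user_id}/followers",
    "list followees":   "GET    /users/{user_id}/followees",
    "home timeline":    "GET    /users/{user_id}/timeline",
}

for use_case, endpoint in ENDPOINTS.items():
    print(f"{use_case:18s} {endpoint}")
```

Note that nouns name resources and the HTTP verb carries the action, so follow/unfollow share one resource path.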
3. Storage Choice
3.1 RDBMS
Both the user table and the tweets table are too big for an RDBMS without sharding. Usually a single MySQL table performs well with fewer than ~1M rows.
We can do sharding like this:
- Partition the user table by user id.
- Partition the tweets table by tweet id.
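The routing the proxy layer would do can be sketched as simple hash-based partitioning (the shard counts and hash choice here are illustrative assumptions):

```python
import hashlib

# Illustrative shard counts; real deployments pick these from capacity planning.
NUM_USER_SHARDS = 16
NUM_TWEET_SHARDS = 64

def shard_for(key: str, num_shards: int) -> int:
    """Deterministically map a partition key to a shard index."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

# The proxy routes every query by its partition key:
user_shard = shard_for("user:12345", NUM_USER_SHARDS)
tweet_shard = shard_for("tweet:987654321", NUM_TWEET_SHARDS)
```

Note how plain modulo hashing illustrates the rescaling pain discussed below: changing `num_shards` remaps almost every key, which is why consistent hashing is usually preferred.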
Sharding an RDBMS is painful and error-prone:
- You need a proxy layer to route requests.
- You cannot access data by columns other than the partition key.
- Rescaling is difficult and may require taking the whole system offline.
- You cannot join tables or select rows with a flexible WHERE clause across shards.
Therefore an RDBMS is not considered scalable here.
3.2 NoSQL
NoSQL is a good fit for this use case: no complex joins, no transactions, and eventual consistency is enough.
HBase, DynamoDB, MongoDB, and even Redis would all work.
I'll talk about my schema design using HBase, DynamoDB and Redis respectively in the next few articles.
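As a taste of how any of these stores could serve the home timeline, here is a minimal fan-out-on-write sketch, with in-memory structures standing in for the NoSQL store (the names and the 800-entry cap are illustrative assumptions, not the schema the later articles use):

```python
from collections import defaultdict, deque

followers = defaultdict(set)    # user_id -> set of follower ids
timelines = defaultdict(deque)  # user_id -> recent tweet ids, newest first
TIMELINE_CAP = 800              # keep only the most recent entries per user

def follow(follower_id: int, followee_id: int) -> None:
    followers[followee_id].add(follower_id)

def post_tweet(author_id: int, tweet_id: int) -> None:
    # Fan out: push the new tweet id into every follower's home timeline.
    for follower_id in followers[author_id]:
        timeline = timelines[follower_id]
        timeline.appendleft(tweet_id)
        while len(timeline) > TIMELINE_CAP:
            timeline.pop()  # evict the oldest entry

follow(2, 1)          # user 2 follows user 1
post_tweet(1, 1001)   # user 1 posts a tweet
print(list(timelines[2]))  # [1001]
```

Fan-out-on-write makes timeline reads a single key lookup at the cost of heavier writes, which matches our estimate that reads outnumber posts roughly 100:1.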
[^1]: Worldwide DAU in Q4 2018: Facebook 1,520M; Snap 186M; Twitter 126M. Reference: https://www.vox.com/2019/2/7/18215204/twitter-daily-active-users-dau-snapchat-q4-earnings
