swdev.online

Manage Dependencies for a Python Service

2020-09-04T21:27:32-07:00

Why Dependency Management
Compare Java Way and Python Way of Dependency Management
virtualenv
- Create venv folder with virtualenv
- Deploy dependencies
pipenv
- Create virtual environment with pipenv
- Deploy dependencies

Why Dependency Management

When I was a Python beginner, I would install packages with pip install into my system's Python environment and use that global environment to develop all my projects.

But how can I publish a project?
How can I deploy a python service to a cluster of hosts?
It's not feasible to log in to each host to download or update dependencies.

That's why we need a way to manage dependencies:

When you deploy your service's source code to a new host, it can automatically download dependencies.
The versions of the dependencies should be consistent with your development environment.
You can develop several projects in your local dev desktop and they can have different version sets.

Compare Java Way and Python Way of Dependency Management

Based on my personal experience, there are two ways of managing dependencies: Classpath and Virtual Environment.
In the Java world, the JDK doesn't manage dependencies for you. Your service uses Maven/Gradle/Ant to download jars in the build phase of deployment, and these jars are added to the classpath when the JVM starts the service from the command line.
In the Python world, pip downloads dependencies to Python's global library path by default (e.g. /usr/local/lib/python3.7/site-packages), so we need a virtual environment to act as a local library path (e.g. ./venv/lib/python3.7/site-packages).

NodeJS also uses the Virtual Environment way: it manages dependencies with a package.json file and installs them to a local node_modules folder instead of globally.

virtualenv

Create `venv` folder with `virtualenv`

This assumes python3.7 is already installed on the dev or prod host.

# Install virtualenv to global environment
$ pip3 install virtualenv

# create a folder named venv as a local python environment
$ virtualenv venv --python=python3.7

# activate the local environment
$ source ./venv/bin/activate

Now ./venv/bin/ is added to the front of PATH, so the python and pip commands resolve to the ones under this path. When you use pip to install dependencies, they go to ./venv/lib/python3.7/site-packages.
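To confirm from inside Python which environment is active, a small check like the following can help. This is only a sketch: in_virtual_environment is an illustrative name, and the check relies on sys.prefix differing from sys.base_prefix inside an environment (older virtualenv releases set sys.real_prefix instead).

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when the interpreter runs inside a virtual environment."""
    # Older virtualenv releases expose sys.real_prefix; venv and newer
    # virtualenv releases make sys.prefix differ from sys.base_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("inside a virtual environment:", in_virtual_environment())
```

Run it once before and once after source ./venv/bin/activate to see the interpreter path change.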

Deploy dependencies

Include a text file of dependencies with versions in your project.
Conventionally it is named requirements.txt.

You can manually create and edit the file, or use pip freeze > requirements.txt to generate it from a snapshot of your current venv.

An example of requirements.txt:

aiohttp==3.0.5
requests==2.18.4
beautifulsoup4==4.6.0
urllib3==1.22

In the startup script of your service, use pip install -r requirements.txt to download all dependency libraries before running the service code. (Assuming you are deploying your service to a cluster of hosts, you want to include all the steps needed to run the service in a startup script.)

Question: can we define only major versions and let pip find the most recent minor version?
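On the question above: pip's requirement specifiers do accept ranges, so a line such as requests>=2.18.4,<3 keeps the major version fixed while allowing newer minor releases. Below is a minimal sketch of rewriting exact pins into such ranges. loosen_pin is a hypothetical helper, not part of pip, and it assumes plain X.Y.Z style version strings.

```python
def loosen_pin(requirement: str) -> str:
    """Turn 'name==X.Y.Z' into 'name>=X.Y.Z,<X+1' (same major version only)."""
    name, sep, version = requirement.strip().partition("==")
    if not sep:
        # Not an exact pin (already a range, or unversioned): leave untouched.
        return requirement.strip()
    major = int(version.split(".")[0])
    return f"{name}>={version},<{major + 1}"


pinned = ["aiohttp==3.0.5", "requests==2.18.4", "beautifulsoup4==4.6.0", "urllib3==1.22"]
for line in pinned:
    print(loosen_pin(line))
```

The trade-off is that a range makes deployments less reproducible: two hosts installing on different days may resolve different versions.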

pipenv

A newer, better and easier virtual environment management tool is pipenv.

Create virtual environment with `pipenv`

# install `pipenv` to global environment
$ pip3 install pipenv

# It can install dependencies in `requirements.txt` or `Pipfile` automatically if either one exists.
$ pipenv install

Note that there is no venv folder in your project path. Instead there will be a folder with a generated name (the project folder name plus a hash) under ~/.local/share/virtualenvs/. In fact you never need to touch it or look at it.

Deploy dependencies

The actual dependency definition file is Pipfile, together with Pipfile.lock; both are generated automatically by pipenv install. We don't need requirements.txt anymore. Instead we include Pipfile and Pipfile.lock in our project source code.

Pipfile.lock pins the exact version of every dependency and records its sha256 hashes. A generated Pipfile may also specify full version numbers. We want to edit Pipfile to use version ranges, so that we can pick up recent updates of dependencies within those ranges in our local dev environment and then lock them into Pipfile.lock.

An example Pipfile snippet:

[[source]]
name = "pypi"
url = "https://pypi.org/simple"
verify_ssl = true

[dev-packages]

[packages]
flask = ">=1.1.0, <1.2.0"
flask-sqlalchemy = ">=2.4.0, <2.5.0"
flask-login = ">=0.5.0, <0.6.0"

[requires]
python_version = "3.8"

On the prod host, your startup script will look like this:

$ pipenv install
$ pipenv run python service.py

SSH Tunneling Example

2020-08-11T15:53:51-07:00

Format
Example
- 1. Remote Debug a service on a cloud server
- 2. Remote Debug a service on a cloud server through a bastion host

Format

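ssh -N -L LOCAL_PORT:TARGET_HOST:TARGET_PORT USER@SSH_SERVER

-N: Do not execute a remote command; only listen and forward.
-L: Listen on LOCAL_PORT on the client machine.
Any data sent to LOCAL_PORT is forwarded to TARGET_HOST:TARGET_PORT. Therefore the application on TARGET_HOST listening on TARGET_PORT can be used as if it were a local application.
TARGET_HOST is usually the same machine as SSH_SERVER, i.e. TARGET_HOST is localhost relative to SSH_SERVER.
TARGET_HOST can also be a machine other than SSH_SERVER, i.e. SSH_SERVER is used as a bridge when data is forwarded between the client's LOCAL_PORT and TARGET_HOST:TARGET_PORT.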
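As a small illustration of how the pieces of a local forward fit together, the sketch below assembles the ssh argument list from its parts. build_tunnel_command, the user name alice and the host cloud.example.com are hypothetical; only the -N and -L options come from the text above.

```python
from typing import List


def build_tunnel_command(local_port: int, target_host: str, target_port: int,
                         user: str, ssh_server: str) -> List[str]:
    """Build the argument list for a local port-forwarding ssh command."""
    # -L takes "local_port:target_host:target_port"; target_host is resolved
    # from the point of view of ssh_server, not of the client machine.
    forward = f"{local_port}:{target_host}:{target_port}"
    return ["ssh", "-N", "-L", forward, f"{user}@{ssh_server}"]


cmd = build_tunnel_command(5050, "localhost", 5050, "alice", "cloud.example.com")
print(" ".join(cmd))  # ssh -N -L 5050:localhost:5050 alice@cloud.example.com
```

The list form can be passed to subprocess.run directly, which avoids shell quoting issues.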

Example

1. Remote Debug a service on a cloud server

Forward the client machine's localhost:5050 to the cloud server's localhost:5050.

ssh -N -L 5050:localhost:5050 USER@CLOUD_SERVER

2. Remote Debug a service on a cloud server through a bastion host

On the client machine, forward local port 5022 to CLOUD_SERVER:22 through the bastion host. The goal is to use the client machine's port 5022 as if it were CLOUD_SERVER:22. After this setup, data sent to local port 5022 is forwarded to CLOUD_SERVER:22, so you can even ssh to CLOUD_SERVER:22 with ssh -p 5022 localhost.

ssh -A -N -L 5022:CLOUD_SERVER:22 USER@BASTION_HOST

On the client machine, forward local port 5050 to the cloud server's localhost:5050 through the local port 5022 tunnel.

ssh -p 5022 -N -L 5050:localhost:5050 localhost

Reading Dubbo Source Code - 1. Setup the Environment and Run Demo

2020-07-31T22:21:57-07:00

0. New Series
1. High level architecture and usage
2. Import to IntelliJ
3. Dubbo Modules (Artifacts)
4. Local Debug
5. Dubbo URL
6. Further questions that I don't have the answers yet

0. New Series

I'll start a new series on Reading Dubbo Source Code.
It might not be as popular as Spring Cloud for building microservices, but I think its source code is the best learning material for architecting a high-performance RPC framework.

1. High level architecture and usage

Dubbo is a Registry based RPC framework. Provider (service) registers endpoint and metadata to Registry. Consumer (client of service) subscribes and caches the service endpoint from Registry.

To use the RPC framework, the service side provides a Java interface to the client, and the client uses that interface to call the methods.
Dubbo's Spring integration provides a bean whose type is the interface and whose implementation is a proxy (which gets the endpoint from the Registry center and calls the provider service).
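The registry-plus-proxy idea can be illustrated with a toy model. This is not Dubbo's API or implementation: Registry, ServiceProxy and fake_transport are hypothetical names, and the network call is replaced by a function that only formats a string.

```python
from typing import Callable, Dict


class Registry:
    """Toy registry: maps an interface name to a provider endpoint."""

    def __init__(self) -> None:
        self._endpoints: Dict[str, str] = {}

    def register(self, interface: str, endpoint: str) -> None:
        self._endpoints[interface] = endpoint

    def lookup(self, interface: str) -> str:
        return self._endpoints[interface]


class ServiceProxy:
    """Looks like the service interface, but forwards every call to a transport."""

    def __init__(self, interface: str, registry: Registry,
                 transport: Callable[[str, str, tuple], object]) -> None:
        self._interface = interface
        self._registry = registry
        self._transport = transport

    def __getattr__(self, method: str):
        # Only called for attributes that are not found normally,
        # i.e. for the remote method names.
        def remote_call(*args):
            endpoint = self._registry.lookup(self._interface)  # consumer-side lookup
            return self._transport(endpoint, method, args)
        return remote_call


def fake_transport(endpoint: str, method: str, args: tuple) -> str:
    # Stand-in for the network call; a real framework would serialize and send.
    return f"{method}{args} sent to {endpoint}"


registry = Registry()
registry.register("org.apache.dubbo.demo.DemoService", "192.168.43.137:20880")
demo_service = ServiceProxy("org.apache.dubbo.demo.DemoService", registry, fake_transport)
print(demo_service.sayHello("world"))  # sayHello('world',) sent to 192.168.43.137:20880
```

The caller only sees an object with the interface's methods; endpoint discovery stays hidden inside the proxy.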

2. Import to IntelliJ

git clone https://github.com/apache/dubbo.git

IntelliJ -> Open -> find your dubbo folder
IntelliJ will automatically recognize it as a Maven project and download the dependencies.

3. Dubbo Modules (Artifacts)

4. Local Debug

4.1 Setup Zookeeper as Registry Center

Download: https://zookeeper.apache.org/releases.html
Any version will do - we can use release line 3.6 because ZooKeeper clients from 3.4 and 3.5 branch are fully compatible with 3.6 servers.

The start script zkServer.sh uses zoo.cfg as config file.
The default path is ./conf. As a local test environment, I will just copy the provided ./conf/zoo_sample.cfg to ./conf/zoo.cfg to make the start script work. This is a basic config that sets the port to 2181 for this single-host server and stores data in the /tmp folder.

cp ./conf/zoo_sample.cfg ./conf/zoo.cfg
./bin/zkServer.sh start

4.2 Provider and Consumer

dubbo-demo module provides three flavors to use Dubbo. The service provider is DemoService and the consumer is just a main method. Let's take the XML configured Spring as example.

The service provider creates the actual serviceImpl bean and passes it to the interface definition:

The service provider needs to register itself to the Registry Center.

Then the client side can use the interface class to create a proxy bean, and call the methods of the bean like local methods.

Client side code:

DemoService demoService = context.getBean("demoService", DemoService.class);
CompletableFuture<String> hello = demoService.sayHelloAsync("world");

4.3 Run

We use maven to build Dubbo. It can be done in CLI or in IntelliJ maven plugin.

mvn clean install -Dmaven.test.skip=true

Then we can start the server and the client just by running their main methods (click Run on main() in IntelliJ, or use the CLI commands given in dubbo-demo/README.md, which is the basic java -jar jarName.jar way of running a jar). Yes, the server is not using any Servlet container. Dubbo uses Netty to serve the dubbo protocol!

Logs from the server show that it automatically registers its current IP address (my laptop's IP in the LAN) and port (default 20880) to ZooKeeper (127.0.0.1:2181), along with the supported methods and interface name.

[01/08/20 01:29:26:649 PDT] main  INFO config.ServiceConfig:  
[DUBBO] Register dubbo service org.apache.dubbo.demo.DemoService url 
dubbo://192.168.43.137:20880/org.apache.dubbo.demo.DemoService? 
anyhost=true&application=demo-provider&bind.ip=192.168.43.137&bind.port=20880&deprecated=false
&dubbo=2.0.2&dynamic=true&generic=false&interface=org.apache.dubbo.demo.DemoService
&metadata-type=remote&methods=sayHello,sayHelloAsync&pid=88546&qos.port=22222&release=
&side=provider&timestamp=1596270566561 to registry registry://127.0.0.1:2181/org.apache.dubbo.registry.RegistryService?
application=demo-provider&dubbo=2.0.2&metadata-type=remote&pid=88546&qos.port=22222
&registry=zookeeper&timestamp=1596270566555, dubbo version: , 
current host: 192.168.43.137

[01/08/20 01:29:26:927 PDT] main  INFO transport.AbstractServer:  
[DUBBO] Start NettyServer bind /0.0.0.0:20880, 
export /192.168.43.137:20880, dubbo version: , current host: 192.168.43.137

The client side log shows that it successfully gets the service endpoint from ZooKeeper.

[01/08/20 01:41:52:344 PDT] main  INFO zookeeper.ZookeeperRegistry:  
[DUBBO] Register: consumer://192.168.43.137/org.apache.dubbo.demo.DemoService?
application=demo-consumer&category=consumers&check=false&dubbo=2.0.2&init=false
&interface=org.apache.dubbo.demo.DemoService&metadata-type=remote&methods=sayHello,sayHelloAsync
&pid=94685&qos.port=33333&side=consumer&sticky=false&timestamp=1596271312243, 
dubbo version: , current host: 192.168.43.137

[01/08/20 01:41:52:372 PDT] main  INFO zookeeper.ZookeeperRegistry:  
[DUBBO] Subscribe: consumer://192.168.43.137/org.apache.dubbo.demo.DemoService?
application=demo-consumer&category=providers,configurators,routers&check=false&dubbo=2.0.2
&init=false&interface=org.apache.dubbo.demo.DemoService&metadata-type=remote&methods=sayHello,sayHelloAsync
&pid=94685&qos.port=33333&side=consumer&sticky=false&timestamp=1596271312243, 
dubbo version: , current host: 192.168.43.137

[01/08/20 01:41:52:712 PDT] NettyClientWorker-1-1  INFO netty4.NettyClientHandler:  
[DUBBO] The connection of /192.168.43.137:62158 -> /192.168.43.137:20880 is established., 
dubbo version: , current host: 192.168.43.137

4.4 Zookeeper Review

$ bin/zkCli.sh -server 127.0.0.1:2181

[zk: 127.0.0.1:2181(CONNECTED) 1] ls /
[dubbo, zookeeper]

[zk: 127.0.0.1:2181(CONNECTED) 2] ls -R /dubbo

/dubbo/config/mapping/org.apache.dubbo.demo.DemoService/demo-provider
/dubbo/metadata/org.apache.dubbo.demo.DemoService/consumer/demo-consumer
/dubbo/metadata/org.apache.dubbo.demo.DemoService/provider/demo-provider
/dubbo/org.apache.dubbo.demo.DemoService/configurators
/dubbo/org.apache.dubbo.demo.DemoService/consumers
/dubbo/org.apache.dubbo.demo.DemoService/routers
/dubbo/org.apache.dubbo.demo.DemoService/providers/dubbo%3A%2F%2F192.168.43.137%3A20880%2Forg.apache.dubbo.demo.DemoService
%3Fanyhost%3Dtrue%26application%3Ddemo-provider%26deprecated%3Dfalse%26dubbo%3D2.0.2%26dynamic
%3Dtrue%26generic%3Dfalse%26interface%3Dorg.apache.dubbo.demo.DemoService
%26metadata-type%3Dremote%26methods%3DsayHello%2CsayHelloAsync%26pid%3D88546
%26release%3D%26side%3Dprovider%26timestamp%3D1596270566561

(I only list leaf nodes in above ls -R /dubbo results.)

The last item is the Dubbo URL which also got logged when service starts - "Register dubbo service...".

5. Dubbo URL

A URL is used as the data structure of the protocol. The metadata is stored as key-value pairs in the URL parameters.

dubbo://192.168.43.137:20880/org.apache.dubbo.demo.DemoService? 
anyhost=true&application=demo-provider&bind.ip=192.168.43.137&bind.port=20880&deprecated=false
&dubbo=2.0.2&dynamic=true&generic=false&interface=org.apache.dubbo.demo.DemoService
&metadata-type=remote&methods=sayHello,sayHelloAsync&pid=88546&qos.port=22222&release=
&side=provider&timestamp=1596270566561
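Because the metadata lives in ordinary URL parameters, a generic URL parser is enough to read it back. The sketch below uses Python's standard urllib.parse on a shortened copy of the URL above; it says nothing about how Dubbo itself parses URLs.

```python
from urllib.parse import parse_qsl, unquote, urlsplit

url = ("dubbo://192.168.43.137:20880/org.apache.dubbo.demo.DemoService"
       "?anyhost=true&application=demo-provider&methods=sayHello,sayHelloAsync"
       "&side=provider&timestamp=1596270566561")

parts = urlsplit(url)
params = dict(parse_qsl(parts.query))

print(parts.scheme, parts.hostname, parts.port)  # dubbo 192.168.43.137 20880
print(parts.path.lstrip("/"))                    # org.apache.dubbo.demo.DemoService
print(params["methods"].split(","))              # ['sayHello', 'sayHelloAsync']

# The provider node name stored in ZooKeeper is the same URL, percent-encoded.
encoded = "dubbo%3A%2F%2F192.168.43.137%3A20880%2Forg.apache.dubbo.demo.DemoService"
print(unquote(encoded))
```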

6. Further questions that I don't have the answers yet

How does the client configure retry and timeout?
Does the client use any connection pool? How to configure the pool size and idle connection timeout?
How does the client authenticate itself when calling the server? Or how does the server protect itself from being called by unauthorized clients?
What is the minimal dependency for the client and server to use dubbo? Is it dubbo-dependencies-bom?
What is the difference between dubbo-dependencies-bom and dubbo-bom?

An upward trend ID generation service

2020-07-02T01:15:27-07:00

1. First question - why do we need a sequential ID instead of a random UUID?
- 1.1 2nd Question is why do I write this article.
2. Requirement:
3. Solution 1: DB atomic update
4. Solution2: Redis INCRBY
- 4.1 The logic is the same.
- 4.2 Assessment
5. Solution3: Time-based on-host generation.
6. Conclusion

1. First question - why do we need a sequential ID instead of a random UUID?

UUID takes space! We have to store it as a HEX string with 32 chars and 4 '-'.
- UUID has 128bit in binary - will be 39 chars as Decimal string or 32 chars as HEX string.
- Consider the data type to store it:
  - MySQL BIGINT: 64bit, not enough for 128bit.
  - MySQL `VARCHAR(36)`: 36 chars to store the HEX string with 4 '-'s.
  - DynamoDB Number: the Java SDK maps BigInteger to it, but a Number holds at most 38 digits of precision, so a 128-bit value (up to 39 decimal digits) may not fit.
  - DynamoDB String: 36 chars to store the HEX string with 4 '-'s.
Some business requirement may require the ID to be in an upward trend.
- StatementId, OrderId, TweetId.
  - Twitter Example
    - Although there may be a creationTime column, twitter said they prefer to sort tweets by ID.
    - The timeline of a user is paginated. If the tweetId has order and is stored as sort key (e.g. MySQL indexed column, DDB Sort Key, HBase Column Key, etc.), the query to get next page will be very efficient.
UUID might have duplicates when there are already a lot of existing IDs. But this is not a practical concern, because the probability of finding a duplicate within 103 trillion version-4 UUIDs is one in a billion.
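The sizes and the collision estimate above can be checked with a few lines of Python. The probability uses the usual birthday approximation p ~ n^2 / (2N), with N = 2^122 because a version-4 UUID has 122 random bits.

```python
import uuid

u = uuid.uuid4()
print(len(str(u)))            # 36: 32 hex chars + 4 '-'
print(len(u.hex))             # 32: hex digits only
print(len(str(2**128 - 1)))   # 39: widest possible decimal form

# Birthday approximation for n random IDs drawn from N = 2^122 values.
n = 103e12
p = n * n / (2 * 2**122)
print(f"{p:.1e}")             # roughly one in a billion
```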

1.1 2nd Question is why do I write this article.

In my work projects, we were lazy and just used random UUIDs most of the time. But I also saw many systems, such as e-commerce and social media systems, use a number as the ID.
I'm trying to figure out the best way to generate universally unique IDs in distributed systems.

When I had almost completed my doc, I came across the original blog post by Twitter on why they developed SnowFlake.
I feel that the reasons they listed are very similar to my thoughts. And then I realized I might not need to write this article...

2. Requirement:

ID generation service
ID is in an upward trend
ID is globally unique
TPS is large enough for application requirement
- Normal system write TPS is small
  - FB/Twitter should be < 100TPS
- But in some events write TPS can be large
  - Double11 peak - 100kTPS order placement.
    - Assumption: 1billion RMB / 21s, 500RMB/order
    - order might be written to order table asynchronously
    - I believe they prepared orderId in advance.

3. Solution 1: DB atomic update

3.1 IDGenerator service logic

Service is a cluster.
Each host uses one background thread to fetch an id range periodically.
May prefetch many ranges and store them in a queue.
Clients request one ID at a time.
Consume the head of the queue until it's empty.
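The service logic above can be sketched as a small buffer that hands out IDs from prefetched ranges. RangeBuffer and make_fake_store are hypothetical names; the fake store stands in for the database, and the sketch is single-threaded, so a real service would need locking around the queue and a background thread calling refill.

```python
from collections import deque
from typing import Callable, Deque, Tuple

Range = Tuple[int, int]  # inclusive (first_id, last_id)


class RangeBuffer:
    """Hands out one ID at a time from ranges prefetched from a central store."""

    def __init__(self, fetch_range: Callable[[], Range], prefetch: int = 2) -> None:
        self._fetch_range = fetch_range
        self._prefetch = prefetch
        self._ranges: Deque[Range] = deque()
        self._next_id = 0
        self._last_id = -1  # start with an exhausted range

    def refill(self) -> None:
        # In the service this would run periodically on the background thread.
        while len(self._ranges) < self._prefetch:
            self._ranges.append(self._fetch_range())

    def next_id(self) -> int:
        if self._next_id > self._last_id:
            if not self._ranges:
                self.refill()  # fall back to a synchronous fetch
            self._next_id, self._last_id = self._ranges.popleft()
        value = self._next_id
        self._next_id += 1
        return value


def make_fake_store(step: int) -> Callable[[], Range]:
    state = {"max_id": 0}

    def fetch() -> Range:
        first = state["max_id"] + 1
        state["max_id"] += step
        return first, state["max_id"]

    return fetch


buffer = RangeBuffer(make_fake_store(step=3))
ids = [buffer.next_id() for _ in range(7)]
print(ids)  # [1, 2, 3, 4, 5, 6, 7]
```

With several hosts, each host's IDs increase, but IDs from different hosts interleave, which is why the sequence is only an upward trend.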

3.2 DB schema

id_name: the name of this ID
max_id: the current max used id in this sequence
step: the number of ids to allocate in one fetch

3.3 How to query (MySQL example)

3.3.1 Atomic Read and Update / Transaction

We query the current max_id and get the ID range [max_id+1, max_id+step].
To prevent other concurrent MySQL sessions from getting the same range, we need to add a write lock (FOR UPDATE, a pessimistic lock). Only after COMMIT can another session (also using FOR UPDATE) read max_id again.

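SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;
SET autocommit = 0;
BEGIN;
    -- Lock the row; other FOR UPDATE readers block here until COMMIT.
    SELECT max_id, step
    FROM id_generator 
    WHERE id_name = 'OrderId' FOR UPDATE;

    -- Reserve [max_id + 1, max_id + step] for this host.
    UPDATE id_generator 
    SET max_id = max_id + step
    WHERE id_name = 'OrderId';
COMMIT;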

3.3.2 Optimistic Lock / Lockless

Adding a version column is the common way of applying an optimistic lock. An optimistic lock only performs better than a real lock when concurrency is low.

In this case, max_id is definitely updated concurrently by all hosts in the service cluster. So I don't think it's a good idea to use an optimistic lock.

But for reference, let me still put the logic here.

Read max_id and version. Use [max_id + 1, max_id + step] as the fetched range.
Update max_id only when version has not changed.
If version has changed, retry steps 1 and 2.

SELECT max_id, version, step
FROM id_generator 
WHERE id_name = 'OrderId';

-- {version} is the value read above; 0 affected rows means another host won the race.
UPDATE id_generator
SET max_id = max_id + step, version = version + 1
WHERE version = {version} AND id_name = 'OrderId';
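The read, conditional update and retry loop can be sketched end to end as follows. It uses Python's built-in sqlite3 with an in-memory database purely as a stand-in for MySQL so that it runs anywhere; fetch_range and max_retries are illustrative names, and the table layout follows the schema described earlier plus the version column.

```python
import sqlite3
from typing import Tuple


def fetch_range(conn: sqlite3.Connection, id_name: str, max_retries: int = 10) -> Tuple[int, int]:
    """Reserve [max_id + 1, max_id + step] using a version check instead of a lock."""
    for _ in range(max_retries):
        max_id, version, step = conn.execute(
            "SELECT max_id, version, step FROM id_generator WHERE id_name = ?",
            (id_name,),
        ).fetchone()
        cursor = conn.execute(
            "UPDATE id_generator SET max_id = max_id + step, version = version + 1 "
            "WHERE id_name = ? AND version = ?",
            (id_name, version),
        )
        conn.commit()
        if cursor.rowcount == 1:  # nobody changed the row in between
            return max_id + 1, max_id + step
        # rowcount == 0: another host won the race, so read again and retry
    raise RuntimeError("too much contention on " + id_name)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE id_generator (id_name TEXT PRIMARY KEY, max_id INTEGER, "
    "step INTEGER, version INTEGER)"
)
conn.execute("INSERT INTO id_generator VALUES ('OrderId', 1000000, 500, 0)")
conn.commit()

print(fetch_range(conn, "OrderId"))  # (1000001, 1000500)
print(fetch_range(conn, "OrderId"))  # (1000501, 1001000)
```

Under heavy contention most attempts lose the race and retry, which is the cost the text above warns about.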

3.4 Assessment

Because of the write lock in the transaction, the transactions are executed sequentially.
Throughput will be limited, and P99 latency will be high, possibly to the point of timeouts.

Scalability:
- Service cluster is scalable.
- DB write is not scalable because one ID name uses only one row. (Usually this causes HotKey problem.)
- However, we may not need to scale, because the background thread can fetch a large range each time, at the cost of the sequence not being strictly increasing.
- Example:
  - Clients request 100k ID per second.
  - Cluster has 200 hosts, with 500TPS/host.
  - We want the IDs to be k-sorted with k ~ 1.
  - Every host fetches a range from DB every 1s
    - The range step will be 500 (100k / 200 * 1)
    - DB write is 200/1 = 200 TPS.
Availability:
- Must have synchronous replication of the master node.
  - MySQL master-slave replication
  - DDB partition has three nodes, one is leader, the other two are followers. We need strong consistency for write.
  - Leaderless architecture, such as Cassandra.
General Drawback:
The disadvantage of this strategy is obvious compared with SnowFlake algorithm.
It depends on an extra system.
It seems high cost / limited scalability / low availability.
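The sizing example in the scalability notes above is simple arithmetic, reproduced here so the numbers can be varied. The variable names are illustrative.

```python
client_tps = 100_000      # IDs requested per second across the cluster
hosts = 200
fetch_interval_s = 1      # each host fetches one range per interval

per_host_tps = client_tps // hosts
step = per_host_tps * fetch_interval_s
db_write_tps = hosts / fetch_interval_s

print(per_host_tps, step, db_write_tps)  # 500 500 200.0

# Fetching every 10 s instead cuts DB writes tenfold but makes ranges larger,
# so IDs issued by different hosts are less tightly ordered.
print(per_host_tps * 10, hosts / 10)     # 5000 20.0
```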

4. Solution2: Redis INCRBY

4.1 The logic is the same.

This is the same idea as above.
Service fetches a range from Redis by increasing the max_id.

INCRBY is atomic and it returns the value after the increase.

redis> SET OrderId "1000000"
"OK"
redis> INCRBY OrderId 500
(integer) 1000500
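Turning the INCRBY return value into a range looks like this. To keep the sketch runnable without a Redis server, FakeRedis is a hypothetical in-memory stand-in whose incrby mimics the command's behavior; fetch_range_from_redis is also an illustrative name.

```python
import threading
from typing import Dict, Tuple


class FakeRedis:
    """In-memory stand-in with an atomic incrby, for illustration only."""

    def __init__(self) -> None:
        self._data: Dict[str, int] = {}
        self._lock = threading.Lock()

    def set(self, key: str, value: int) -> None:
        with self._lock:
            self._data[key] = int(value)

    def incrby(self, key: str, amount: int) -> int:
        with self._lock:
            self._data[key] = self._data.get(key, 0) + amount
            return self._data[key]


def fetch_range_from_redis(client: FakeRedis, key: str, step: int) -> Tuple[int, int]:
    # INCRBY returns the value after the increase, which is the end of our range.
    last_id = client.incrby(key, step)
    return last_id - step + 1, last_id


client = FakeRedis()
client.set("OrderId", 1000000)
print(fetch_range_from_redis(client, "OrderId", 500))  # (1000001, 1000500)
```

No read-then-write is needed: one atomic command both reserves the range and tells the caller where it ends.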

4.2 Assessment

We can still say it has the same problem as above
- extra dependency
- limited scalability
- low availability
Redis does offer higher throughput than MySQL.
But Redis replication seems to support only the asynchronous mode.
(MySQL supports full sync, semi-sync and async replication; of these we need full sync, so that replicas stay consistent and a slave can be switched to master without losing data.)

5. Solution3: Time-based on-host generation.

I think SnowFlake and TimeBasedUUID are the same idea.
SnowFlake wins because it has 64bits while UUID has 128bits.

TimeBasedUUID
Java UUID

5.1 128 bit UUID

| Time (100ns, i.e. 1e-7s precision) | 60 bit |
| Version (Timebased, DCE Security, NameBased, Random) | 4 bit |
| MAC | 48 bit |
| Sequence Number | 14 bit |
| Variant (2 (Leach-Salz)) | 2 bit |

The generated UUID will be unique for approximately 8,925 years so long as less than 10,000 IDs are generated per millisecond on the same device (as identified by its MAC address).
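Python's standard uuid module exposes the same fields, which gives a quick way to see the layout in the table above. This is an inspection of one generated value, not the Java implementation the section refers to.

```python
import uuid

u = uuid.uuid1()  # time-based UUID

print(u.version)                        # 1
print(u.time.bit_length() <= 60)        # 100 ns ticks since 1582-10-15
print(u.node.bit_length() <= 48)        # node id, normally the MAC address
print(u.clock_seq.bit_length() <= 14)   # sequence number
print(u.variant == uuid.RFC_4122)       # the 2-bit variant
```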

5.2 64 bit SnowFlake Id

| 0 | 1 bit |
| timestamp in milliseconds | 41 bit |
| Machine id | 10 bit |
| sequence no | 12 bit |

Machine id can be based on the private IP, which is always unique within a LAN.
(In contrast, TimeBasedUUID uses the MAC address as machine ID, which carries some risk of privacy leakage.)

The bits can be redefined. Under the default config, there can be at most 2^10=1024 machines. There can be at most 2^12=4096 IDs generated for 1 millisecond.
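A minimal sketch of a generator with this 1 + 41 + 10 + 12 bit layout follows. It is not Twitter's code: SnowflakeGenerator is an illustrative name, the custom epoch (2020-01-01 UTC) is an assumption, the clock is injectable so the behavior can be tested, and the class is not thread-safe.

```python
import time
from typing import Callable


class SnowflakeGenerator:
    """Sketch of the 1 + 41 + 10 + 12 bit layout described above."""

    MACHINE_BITS = 10
    SEQUENCE_BITS = 12

    def __init__(self, machine_id: int, epoch_ms: int = 1_577_836_800_000,
                 clock_ms: Callable[[], int] = lambda: int(time.time() * 1000)) -> None:
        if not 0 <= machine_id < (1 << self.MACHINE_BITS):
            raise ValueError("machine_id must fit in 10 bits")
        self._machine_id = machine_id
        self._epoch_ms = epoch_ms  # assumed custom epoch: 2020-01-01 UTC
        self._clock_ms = clock_ms
        self._last_ms = -1
        self._sequence = 0

    def next_id(self) -> int:
        now = self._clock_ms()
        if now < self._last_ms:
            raise RuntimeError("clock moved backwards")
        if now == self._last_ms:
            self._sequence = (self._sequence + 1) & ((1 << self.SEQUENCE_BITS) - 1)
            if self._sequence == 0:
                # 4096 IDs already issued in this millisecond: wait for the next one.
                while now <= self._last_ms:
                    now = self._clock_ms()
        else:
            self._sequence = 0
        self._last_ms = now
        timestamp = now - self._epoch_ms
        return ((timestamp << (self.MACHINE_BITS + self.SEQUENCE_BITS))
                | (self._machine_id << self.SEQUENCE_BITS)
                | self._sequence)


gen = SnowflakeGenerator(machine_id=7)
a, b = gen.next_id(), gen.next_id()
print(a < b, (a >> 12) & 0x3FF)  # True 7
```

Raising on a backwards clock is one possible policy; waiting until the clock catches up is another.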

I saw Baidu has a variation that changed timestamp to seconds, so there can be more machine ids.

5.3 Problem

We can get the epoch timestamp using System.currentTimeMillis() in Java. But the absolute millisecond value is not very reliable because of clock drift and delays in NTP sync.
After an NTP sync, the clock may go back to an earlier time to correct the drift.
I found a good article on it: How System Clocks Can Cause Mysterious Faults?

6. Conclusion

I still think that in most cases, when the data size is not big (data generation is slow or retention time is limited), using a random UUID is the best way because of its simplicity.
Otherwise, using a customized SnowFlake library seems much better than managing a database or a Redis cluster.


Idempotency of a PUT Request

2020-07-02T01:15:34-07:00

1. How To Avoid Duplicate Order
2. In general how to guarantee Idempotence of an API put request.

1. How To Avoid Duplicate Order

E-commerce Scenario - Common Flow of Placing Order

Go to Cart
Cart Page + Checkout button
Checkout Page + Place order button
Payment:
US E-commerce has no such step - Credit card or PayPal are filled in Checkout Page.
Success Page

Cause:

User clicks PlaceOrder multiple times
Backend retries the call to Order service

Solution 1: One-time token

Checkout page requests one-time token
Store token in Redis as Key.
- Set expired time in case there is no order placed.
- PEXPIRE token 5000
Place order request comes with the token
Order system verifies token by
- DEL token
- Return 0 means the key is already removed.
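The token flow above can be sketched with an in-memory stand-in for Redis. TokenStore and place_order are hypothetical names; issue mimics storing the key with PEXPIRE, and delete mimics DEL by returning 1 only for the first caller.

```python
import secrets
import time
from typing import Dict


class TokenStore:
    """In-memory stand-in for the Redis key space, for illustration only."""

    def __init__(self) -> None:
        self._expires_at: Dict[str, float] = {}

    def issue(self, ttl_ms: int = 5000) -> str:
        token = secrets.token_hex(16)
        # Equivalent to storing the key and calling PEXPIRE token 5000.
        self._expires_at[token] = time.monotonic() + ttl_ms / 1000
        return token

    def delete(self, token: str) -> int:
        """Mimic DEL: return 1 if a live key was removed, else 0."""
        expires_at = self._expires_at.pop(token, None)
        if expires_at is None or expires_at < time.monotonic():
            return 0
        return 1


def place_order(store: TokenStore, token: str) -> str:
    if store.delete(token) == 0:
        return "rejected: duplicate or expired request"
    return "order created"


store = TokenStore()
token = store.issue()
print(place_order(store, token))  # order created
print(place_order(store, token))  # rejected: duplicate or expired request
```

Because the delete is the check, two concurrent requests with the same token cannot both pass.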

Assessment:

Scalability: High.
- Redis Sharding.
Availability: High.
- Service is down after DEL token: retries will fail. Place order will fail.
- Redis node is down. Token not available, place order will fail.

Solution 2: Leverage DB Primary Key uniqueness

Checkout page requests orderId
Place order request comes with the orderId
DB creates order using orderId as primary key.
Insert failure means PK already exists, this is a duplicate request.
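The same idea in code: the second insert with the same orderId fails on the primary key, and the failure is reported as a duplicate. sqlite3 is used here only as a stand-in database so the sketch runs anywhere; create_order and the orders table are illustrative.

```python
import sqlite3


def create_order(conn: sqlite3.Connection, order_id: int, item: str) -> bool:
    """Insert the order; return False when order_id was already used."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO orders (order_id, item) VALUES (?, ?)",
                (order_id, item),
            )
        return True
    except sqlite3.IntegrityError:
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, item TEXT)")

print(create_order(conn, 1234567, "book"))  # True
print(create_order(conn, 1234567, "book"))  # False, duplicate request
```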

Assessment:

Scalability: High
Availability: Depends on proxy. If using consistent hashing which can remove the failure node, then it's high.
Security: Attacker can use invalid key, we have to verify it.

2. In general how to guarantee Idempotence of an API put request.

We can have a primary key in an insert directive. The primary key is unique id of the data, so retried writes won't succeed.

Take SQL for example -

INSERT INTO users (uid, age, gender, createTime)
VALUES (1234567, 20, 'male', 1593684708);

Can we assume all INSERTs are idempotent when there is a Primary Key in the request?

Yes, when the Primary Key comes from an ID generation service, or even from client side auto generation.

DDB and MongoDB generate UUID or ObjectId at client side:

DynamoDB @DynamoDBAutoGeneratedKey: the UUID is 128 bits in memory, but in the DB it's a string of 32 chars and 4 "-".
MongoDB ObjectId: a 12-byte value in BSON.

This means we should try to avoid using the AUTO_INCREMENT feature on the DB server side.

MySQL AUTO_INCREMENT

Conclusion:
We can guarantee idempotency to DB when the write has a primary key generated on client side - from an ID Generation Service or client side random UUID.

However, the actual idempotency is determined by the upstream services. Will they retry the calls to the DAO service?

The ideal way is for the top level service to provide a one-time token. The DAO layer checks the token using Solution 1 of Section 1.

Design Twitter - 1: Requirement and Storage Selection

2019-11-09T14:47:01-08:00

1. Requirement
- 1.1 Use Case
- 1.2 Volume
2. API
3. Storage Choice
- 3.1 RDBMS
- 3.2 NoSQL

1. Requirement

1.1 Use Case

As a user

I can create my profile
I can post a tweet
I can list my tweets posted before
I can follow / unfollow other users
I can list my followers and my followees
I can see others' tweets in my homepage timeline
I can see my own tweets in my profile timeline
I can read a tweet (unlike Facebook, you don't need to click into the message.)
I can delete my tweets, so that no one can see them any more.

NOTE: If this is an interview, you'd better list some common use cases, but pick one or two key use cases to go in depth. Time slips away quickly when you explain a design. You want to show the breadth and depth of your knowledge.

1.2 Volume

Ask your interviewer for the TPS, unless they ask you to estimate it.

A useful way is to start from MAU or DAU based on US population and the relative size of each use case.

It took me 20 minutes to think about the TPS of each use case below. This is absolutely not feasible in an interview. So in reality, we should go with just the read and write paths of one or two use cases.

Your interviewer would like to see how you estimate, not the exact number. So I intentionally do not go too high for the DAU. On one hand, you may trap yourself in an over-challenging problem. On the other hand, we can start with the US market first and, if we still have time, think about scaling to EU and FE by replication.

MAU: about 20% of US people - 60 Million
DAU: 1/3 MAU - 20 Million ¹
TPS

API	Peak TPS	Reason
Show timeline (read tweets)	10k	Assume every DAU access it twice with one refresh. 20e6 * 4 / (24 * 60 * 60) ~ 1000. Consider peak hours and peak events, we give it 10 times buffer.
Post a tweet	100	1% of read.
Comment a tweet	1k	10 times of post.
Delete a tweet	10	Rare
List my tweets	10	Rare
Follow a user	100	1% of read
Unfollow a user	10	Rare
List my followers	100	Same as follow
List my followees	100	Same as list followers

Storage
600 Million entries in user table (Assume 10% of all users are MAU)
315 Million tweets per year (Assume the average post TPS is 10% of the peak post TPS, i.e. 10. 10 * 365 * 24 * 3600 = 315 M)
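The estimates above are plain arithmetic and can be reproduced as follows. The ratios (4 reads per user per day, 10x peak buffer, posts at 1% of reads, average at 10% of peak) are the assumptions stated in the table and storage notes; the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

dau = 20_000_000
reads_per_user_per_day = 4   # two visits, each with one refresh
avg_read_tps = dau * reads_per_user_per_day / SECONDS_PER_DAY

rounded_read_tps = int(round(avg_read_tps, -3))  # ~926 rounds to 1000
peak_read_tps = 10 * rounded_read_tps            # 10x buffer for peak hours

peak_post_tps = peak_read_tps // 100             # 1% of read
avg_post_tps = peak_post_tps // 10               # average is 10% of peak
tweets_per_year = avg_post_tps * 365 * SECONDS_PER_DAY

print(round(avg_read_tps), peak_read_tps, peak_post_tps, tweets_per_year)
# 926 10000 100 315360000
```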

2. API

Each use case can be a RESTful API.

3. Storage Choice

3.1 RDBMS

Both user table and tweets table are too big for RDBMS without sharding. Usually one MySQL table is good for < 1M rows.

We can do sharding like this:

partition user table based on user id.
partition tweets table based on tweet id.
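One simple way to route by key is a modulo on the id. The post does not say which routing function to use, so modulo routing and the name shard_for are assumptions made for this sketch.

```python
def shard_for(key: int, shard_count: int) -> int:
    """Route a user id or tweet id to a shard index."""
    return key % shard_count


print(shard_for(1234567, 4))  # 3

# Going from 4 to 5 shards changes the target shard for most keys.
ids = range(1000)
moved = sum(1 for i in ids if shard_for(i, 4) != shard_for(i, 5))
print(moved)  # 800 of 1000 keys would have to move
```

With this scheme, adding a single shard relocates most of the data, which is why consistent hashing is often preferred when the shard count is expected to change.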

Sharding RDBMS is painful and error prone:

Need a proxy layer to route requests.
Cannot efficiently access data by columns other than the partition key.
Rescaling is difficult and may require turning off the whole system.
You cannot join tables or filter with a flexible WHERE clause across shards.

Therefore RDBMS is not considered as scalable.

3.2 NoSQL

NoSQL is good for this use case: no complex joins, no transactions, and eventual consistency is enough.

HBase/DynamoDB/MongoDB and even Redis will all work.

I'll talk about my schema design using HBase, DynamoDB and Redis respectively in the next few articles.

¹ Worldwide DAU in Q4 2018: Facebook 1,520M; Snap 186M; Twitter 126M. Reference: https://www.vox.com/2019/2/7/18215204/twitter-daily-active-users-dau-snapchat-q4-earnings
