<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

  <title><![CDATA[swdev.online]]></title>
  <link href="https://swdev.online/atom.xml" rel="self"/>
  <link href="https://swdev.online/"/>
  <updated>2020-09-06T02:07:14-07:00</updated>
  <id>https://swdev.online/</id>
  <author>
    <name><![CDATA[]]></name>
    
  </author>
  <generator uri="http://www.mweb.im/">MWeb</generator>
  
  <entry>
    <title type="html"><![CDATA[Manage Dependencies for a Python Service]]></title>
    <link href="https://swdev.online/manage-dependencies-python.html"/>
    <updated>2020-09-04T21:27:32-07:00</updated>
    <id>https://swdev.online/manage-dependencies-python.html</id>
    <content type="html"><![CDATA[
<ul>
<li>
<a href="#toc_0">Why Dependency Management</a>
</li>
<li>
<a href="#toc_1">Compare Java Way and Python Way of Dependency Management</a>
</li>
<li>
<a href="#toc_2">virtualenv</a>
<ul>
<li>
<a href="#toc_3">Create <code>venv</code> folder with <code>virtualenv</code></a>
</li>
<li>
<a href="#toc_4">Deploy dependencies</a>
</li>
</ul>
</li>
<li>
<a href="#toc_5">pipenv</a>
<ul>
<li>
<a href="#toc_6">Create virtual environment with <code>pipenv</code></a>
</li>
<li>
<a href="#toc_7">Deploy dependencies</a>
</li>
</ul>
</li>
</ul>


<span id="more"></span><!-- more -->

<h2 id="toc_0">Why Dependency Management</h2>

<p>When I was a Python beginner, I installed packages with <code>pip install</code> into my system&#39;s Python environment and used that global environment to develop all my projects. </p>

<p>But how can I publish a project? <br/>
How can I deploy a python service to a cluster of hosts? <br/>
It&#39;s not feasible to log in to each host to download or update dependencies.</p>

<p>That&#39;s why we need a way to manage dependencies:</p>

<ul>
<li>When you deploy your service&#39;s source code to a new host, it can automatically download dependencies.</li>
<li>The versions of the dependencies should be consistent with your development environment.</li>
<li>You can develop several projects in your local dev desktop and they can have different version sets.</li>
</ul>

<h2 id="toc_1">Compare Java Way and Python Way of Dependency Management</h2>

<p>Based on my personal experience, there are two ways of managing dependencies: classpath and virtual environment. <br/>
In the Java world, the JDK doesn&#39;t manage dependencies for you. Your service uses <code>Maven/Gradle/Ant</code> to download <code>jars</code> during the build phase of deployment, and these <code>jars</code> are added to the classpath when the JVM starts the service from the command line. <br/>
In the Python world, <code>pip</code> downloads dependencies to the interpreter&#39;s global library path by default (e.g. <code>/usr/local/lib/python3.7/site-packages</code>), so we need a virtual environment to act as a local library path (e.g. <code>./venv/lib/python3.7/site-packages</code>).</p>

<p>NodeJS also uses the virtual environment approach: it manages dependencies with a <code>package.json</code> file and installs them into a local <code>node_modules</code> folder instead of globally. </p>

<h2 id="toc_2">virtualenv</h2>

<h3 id="toc_3">Create <code>venv</code> folder with <code>virtualenv</code></h3>

<p>This assumes python3.7 is already installed on the dev or prod host. </p>

<pre><code class="language-text"># Install virtualenv to global environment
$ pip3 install virtualenv

# create a folder named venv as a local python environment
$ virtualenv venv --python=python3.7

# activate the local environment
$ source ./venv/bin/activate

</code></pre>

<p>Now <code>./venv/bin/</code> is at the front of <code>PATH</code>, so the <code>python</code> and <code>pip</code> commands resolve to the ones under this path. When you use <code>pip</code> to install dependencies, they go to <code>./venv/lib/python3.7/site-packages</code>.</p>
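
<p>To confirm which environment an interpreter is actually using, a minimal sketch like the following can help. It relies only on the standard library. The helper name <code>in_virtual_environment</code> is made up for illustration, and the <code>sys.real_prefix</code> fallback assumes older <code>virtualenv</code> releases, which set that attribute instead of <code>sys.base_prefix</code>.</p>

<pre><code class="language-python">import sys
import sysconfig


def in_virtual_environment():
    """Return True when this interpreter runs from a virtual environment."""
    # venv and recent virtualenv keep the original interpreter location in
    # sys.base_prefix; older virtualenv releases (assumption) used sys.real_prefix.
    base_prefix = getattr(sys, "real_prefix", None) or getattr(sys, "base_prefix", sys.prefix)
    return sys.prefix != base_prefix


print("interpreter:", sys.executable)
print("in virtual environment:", in_virtual_environment())
# The directory where pip installs pure-Python packages for this interpreter.
print("site-packages:", sysconfig.get_paths()["purelib"])
</code></pre>

<p>Run it once before and once after <code>source ./venv/bin/activate</code>; the reported paths should switch from the global location to the <code>venv</code> folder.</p>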

<h3 id="toc_4">Deploy dependencies</h3>

<p>Include a text file of dependencies with versions in your project. <br/>
Conventionally it is named <code>requirements.txt</code>.</p>

<p>You can create and edit the file manually, or use <code>pip freeze &gt; requirements.txt</code> to generate it from a snapshot of your current <code>venv</code>. </p>

<p>An example of <code>requirements.txt</code>:</p>

<pre><code class="language-text">aiohttp==3.0.5
requests==2.18.4
beautifulsoup4==4.6.0
urllib3==1.22
</code></pre>

<p>In your service&#39;s startup script (assuming you are deploying to a cluster of hosts, you want every step needed to run the service in one startup script), run <code>pip install -r requirements.txt</code> to download all dependency libraries before running the service code.</p>
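
<p>Since the goal is for deployed versions to match the development environment, a startup script could also verify the pins after installing. The following is a rough sketch under stated assumptions: it needs Python 3.8+ for <code>importlib.metadata</code>, it only understands plain <code>name==version</code> lines (no extras or environment markers), and the helper names <code>parse_pins</code> and <code>find_mismatches</code> are hypothetical.</p>

<pre><code class="language-python">from importlib import metadata  # standard library since Python 3.8


def parse_pins(requirements_text):
    """Collect 'name==version' pins, skipping blank lines and comments."""
    pins = {}
    for raw_line in requirements_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins):
    """Map each unsatisfied pin to (wanted_version, installed_version_or_None)."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


# Illustrative input; a real script would read requirements.txt from disk.
example = "aiohttp==3.0.5\nrequests==2.18.4  # pinned\n"
print(parse_pins(example))
</code></pre>

<p>A startup script could call <code>find_mismatches</code> on the parsed file and refuse to start the service when the result is not empty.</p>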

<ul>
<li>Question: can we define only major versions and let <code>pip</code> find the most recent minor version? </li>
</ul>

<h2 id="toc_5">pipenv</h2>

<p>A newer, better and easier virtual environment management tool is <code>pipenv</code>. </p>

<h3 id="toc_6">Create virtual environment with <code>pipenv</code></h3>

<pre><code class="language-text"># install `pipenv` to global environment
$ pip3 install pipenv

# It can install dependencies in `requirements.txt` or `Pipfile` automatically if either one exists.
$ pipenv install
</code></pre>

<p>Note that there is no <code>venv</code> folder in your project path. Instead, there will be a folder with a generated name under <code>~/.local/share/virtualenvs/</code>. You never need to touch it or look at it. </p>

<h3 id="toc_7">Deploy dependencies</h3>

<p>The actual dependency definition file is <code>Pipfile</code>, together with <code>Pipfile.lock</code>; both are generated automatically by <code>pipenv install</code>. We don&#39;t need <code>requirements.txt</code> anymore. Instead, we include <code>Pipfile</code> and <code>Pipfile.lock</code> in our project source code.</p>

<p><code>Pipfile.lock</code> pins the exact version of every dependency and records sha256 hashes to verify them. When generated, <code>Pipfile</code> also specifies full version numbers. We want to edit <code>Pipfile</code> to use version ranges instead, so that our local dev environment picks up recent minor-version updates of dependencies, which are then locked into <code>Pipfile.lock</code>.</p>

<p>An example <code>Pipfile</code> snippet:</p>

<pre><code class="language-text">[[source]]
name = &quot;pypi&quot;
url = &quot;https://pypi.org/simple&quot;
verify_ssl = true

[dev-packages]

[packages]
flask = &quot;&gt;=1.1.0, &lt;1.2.0&quot;
flask-sqlalchemy = &quot;&gt;=2.4.0, &lt;2.5.0&quot;
flask-login = &quot;&gt;=0.5.0, &lt;0.6.0&quot;

[requires]
python_version = &quot;3.8&quot;

</code></pre>
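
<p>To see which exact versions ended up locked for ranges like the ones above, <code>Pipfile.lock</code> can be read as plain JSON. This is only a sketch: the <code>sample</code> below is a hand-written, heavily abbreviated stand-in for a real lock file (the hash is a placeholder and the locked version is invented), and <code>locked_versions</code> is a hypothetical helper name.</p>

<pre><code class="language-python">import json


def locked_versions(lock_text, section="default"):
    """Map each package in one Pipfile.lock section to its pinned version string."""
    lock = json.loads(lock_text)
    return {name: entry.get("version") for name, entry in lock.get(section, {}).items()}


# Abbreviated, made-up lock content; real files list every dependency and hash.
sample = json.dumps({
    "_meta": {"requires": {"python_version": "3.8"}},
    "default": {"flask": {"hashes": ["sha256:..."], "version": "==1.1.2"}},
    "develop": {},
})
print(locked_versions(sample))
</code></pre>

<p>The <code>default</code> section corresponds to <code>[packages]</code> and <code>develop</code> to <code>[dev-packages]</code>.</p>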

<p>On prod host, your startup script will be like this:</p>

<pre><code class="language-text">$ pipenv install
$ pipenv run python service.py
</code></pre>

]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[SSH Tunneling Example]]></title>
    <link href="https://swdev.online/ssh-tunneling-example.html"/>
    <updated>2020-08-11T15:53:51-07:00</updated>
    <id>https://swdev.online/ssh-tunneling-example.html</id>
    <content type="html"><![CDATA[
<ul>
<li>
<a href="#toc_0">Format</a>
</li>
<li>
<a href="#toc_1">Example</a>
<ul>
<li>
<a href="#toc_2">1. Remote Debug a service on a cloud server &lt;ssh_server&gt;</a>
</li>
<li>
<a href="#toc_3">2. Remote Debug a service on a cloud server &lt;target_host&gt; through a bastion host &lt;ssh_server&gt;</a>
</li>
</ul>
</li>
</ul>


<span id="more"></span><!-- more -->

<h2 id="toc_0">Format</h2>

<p><code>ssh -N -L &lt;local port&gt;:&lt;remote host&gt;:&lt;remote port&gt; &lt;SSH server&gt;<br/>
</code></p>

<ul>
<li><code>-N</code>: Do not execute a remote command; only listen and forward. </li>
<li><code>-L</code>: Listen on <code>&lt;local port&gt;</code> on the client machine.</li>
<li>Any data sent to this <code>&lt;local port&gt;</code> is forwarded to <code>&lt;remote host&gt;:&lt;remote port&gt;</code>. The application listening on <code>&lt;remote port&gt;</code> of <code>&lt;remote host&gt;</code> can therefore be used as if it were a local application. </li>
<li><code>&lt;remote host&gt;</code> is usually the same machine as <code>&lt;SSH server&gt;</code>, i.e. <code>&lt;remote host&gt;</code> is <code>localhost</code> relative to <code>&lt;SSH server&gt;</code>.</li>
<li><code>&lt;remote host&gt;</code> can also be a different machine from <code>&lt;SSH server&gt;</code>, in which case <code>&lt;SSH server&gt;</code> acts as a bridge that forwards data between <code>&lt;SSH client&gt;:&lt;local port&gt;</code> and <code>&lt;remote host&gt;:&lt;remote port&gt;</code>.</li>
</ul>
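
<p>When tunnels are opened from a script, the format above can be assembled programmatically. The sketch below only builds the argument list and does not run anything; <code>local_forward_command</code> and the example host name are hypothetical.</p>

<pre><code class="language-python">def local_forward_command(local_port, remote_host, remote_port, ssh_server, ssh_port=22):
    """Build the argument list for: ssh -N -L local:remote_host:remote_port ssh_server."""
    forward = "{}:{}:{}".format(local_port, remote_host, remote_port)
    argv = ["ssh", "-N", "-L", forward]
    if ssh_port != 22:
        # Only needed when the SSH server listens on a non-default port.
        argv += ["-p", str(ssh_port)]
    argv.append(ssh_server)
    return argv


# Hypothetical server name, for illustration only.
print(" ".join(local_forward_command(5050, "localhost", 5050, "user@ssh-server.example")))
</code></pre>

<p>The returned list can be handed to <code>subprocess.Popen</code> to keep the tunnel open in the background.</p>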

<p><img src="media/15971864317792/15971864950747.jpg" alt=""/></p>

<h2 id="toc_1">Example</h2>

<h3 id="toc_2">1. Remote Debug a service on a cloud server &lt;ssh_server&gt;</h3>

<p>Forward the client machine&#39;s <code>localhost:5050</code> to <code>&lt;ssh_server&gt;</code>&#39;s <code>localhost:5050</code>:</p>

<p><code>ssh -N -L 5050:localhost:5050 &lt;username&gt;@&lt;ssh_server&gt;</code></p>

<h3 id="toc_3">2. Remote Debug a service on a cloud server &lt;target_host&gt; through a bastion host &lt;ssh_server&gt;</h3>

<p>On the client machine, forward local port <code>5022</code> to <code>&lt;target_host&gt;:22</code>. The goal is to use the client machine&#39;s port <code>5022</code> as if it were <code>&lt;target_host&gt;:22</code>. Once this is set up, data sent to local port <code>5022</code> is forwarded to <code>&lt;target_host&gt;:22</code>, so you can even SSH to <code>&lt;target_host&gt;</code> with <code>ssh -p 5022 localhost</code>.</p>

<p><code>ssh -A -N -L 5022:&lt;target_host&gt;:22 &lt;username&gt;@&lt;ssh_server&gt;</code></p>

<p>Then, still on the client machine, forward local port <code>5050</code> to <code>&lt;target_host&gt;</code>&#39;s <code>localhost:5050</code> through the tunnel that now listens on local port <code>5022</code>.</p>

<p><code>ssh -p 5022 -N -L 5050:localhost:5050 localhost</code></p>

]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Reading Dubbo Source Code - 1. Setup the Environment and Run Demo]]></title>
    <link href="https://swdev.online/reading-dubbo-source-code--1.html"/>
    <updated>2020-07-31T22:21:57-07:00</updated>
    <id>https://swdev.online/reading-dubbo-source-code--1.html</id>
    <content type="html"><![CDATA[
<ul>
<li>
<a href="#toc_0">0. New Series</a>
</li>
<li>
<a href="#toc_1">1. High level architecture and usage</a>
</li>
<li>
<a href="#toc_2">2. Import to IntelliJ</a>
</li>
<li>
<a href="#toc_3">3. Dubbo Modules (Artifacts)</a>
</li>
<li>
<a href="#toc_4">4. Local Debug</a>
<ul>
<li>
<a href="#toc_5">4.1 Setup Zookeeper as Registry Center</a>
</li>
<li>
<a href="#toc_6">4.2 Provider and Consumer</a>
</li>
<li>
<a href="#toc_7">4.3 Run</a>
</li>
<li>
<a href="#toc_8">4.4 Zookeeper Review</a>
</li>
</ul>
</li>
<li>
<a href="#toc_9">5. Dubbo URL</a>
</li>
<li>
<a href="#toc_10">6. Further questions I don&#39;t have answers to yet</a>
</li>
</ul>


<span id="more"></span><!-- more -->

<h2 id="toc_0">0. New Series</h2>

<p>I&#39;m starting a new series on <strong>Reading Dubbo Source Code</strong>. <br/>
Dubbo might not be as popular as Spring Cloud for building microservices, but I think its source code is the best learning material for architecting a high-performance RPC framework. </p>

<h2 id="toc_1">1. High level architecture and usage</h2>

<p>Dubbo is a Registry based RPC framework. Provider (service) registers endpoint and metadata to Registry. Consumer (client of service) subscribes and caches the service endpoint from Registry.<br/>
<img src="https://camo.githubusercontent.com/660e543510891254fa0ca6138af3350458aa0582/687474703a2f2f647562626f2e6170616368652e6f72672f696d672f6172636869746563747572652e706e67" alt=""/></p>

<p>To use the RPC framework, the service side provides a Java interface to the client, and the client calls methods through that interface.<br/>
The Dubbo Spring integration provides a bean whose type is the interface and whose implementation is a proxy (it gets the endpoint from the Registry and calls the provider service).</p>

<h2 id="toc_2">2. Import to IntelliJ</h2>

<pre><code class="language-text">git clone https://github.com/apache/dubbo.git
</code></pre>

<p>IntelliJ -&gt; <code>Open</code> -&gt;  find your dubbo folder<br/>
IntelliJ will automatically recognize it as a Maven project and download dependencies. </p>

<h2 id="toc_3">3. Dubbo Modules (Artifacts)</h2>

<p><img src="media/15962593173836/15962632902281.jpg" alt=""/></p>

<h2 id="toc_4">4. Local Debug</h2>

<h3 id="toc_5">4.1 Setup Zookeeper as Registry Center</h3>

<p>Download: <a href="https://zookeeper.apache.org/releases.html">https://zookeeper.apache.org/releases.html</a><br/>
Any version will do - we can use release line 3.6 because ZooKeeper clients from 3.4 and 3.5 branch are fully compatible with 3.6 servers.</p>

<p>The start script <code>zkServer.sh</code> uses <code>zoo.cfg</code> as config file. <br/>
The default path is <code>./conf</code>. For a local test environment, I just copy the provided  <code>./conf/zoo_sample.cfg</code> to <code>./conf/zoo.cfg</code> to make the start script work. This basic config sets the port to <code>2181</code> for this single-host server and stores data under the <code>/tmp</code> folder. </p>

<pre><code class="language-text">cp ./conf/zoo_sample.cfg ./conf/zoo.cfg
./bin/zkServer.sh start
</code></pre>

<h3 id="toc_6">4.2 Provider and Consumer</h3>

<p>The <code>dubbo-demo</code> module provides three flavors of using Dubbo. The service provider is DemoService and the consumer is just a main method. Let&#39;s take the XML-configured Spring flavor as an example. </p>

<p>The service provider creates the actual serviceImpl bean and passes it to the <code>interface</code> definition:</p>

<pre><code class="language-markup">&lt;bean id=&quot;demoService&quot; class=&quot;org.apache.dubbo.demo.provider.DemoServiceImpl&quot;/&gt; 
&lt;dubbo:service interface=&quot;org.apache.dubbo.demo.DemoService&quot; ref=&quot;demoService&quot;/&gt;
</code></pre>

<p>The service provider needs to register itself with the Registry Center.</p>

<pre><code class="language-markup">&lt;dubbo:registry address=&quot;zookeeper://127.0.0.1:2181&quot;/&gt; 
</code></pre>

<p>Then the client side can use the interface class to create a proxy bean, and call the methods of the bean like local methods. </p>

<pre><code class="language-markup">&lt;dubbo:reference id=&quot;demoService&quot; check=&quot;false&quot;  
                 interface=&quot;org.apache.dubbo.demo.DemoService&quot;/&gt;
</code></pre>

<p>Client side code:</p>

<pre><code class="language-java">DemoService demoService = context.getBean(&quot;demoService&quot;, DemoService.class);
CompletableFuture&lt;String&gt; hello = demoService.sayHelloAsync(&quot;world&quot;);
</code></pre>

<h3 id="toc_7">4.3 Run</h3>

<p>We use maven to build Dubbo. It can be done in CLI or in IntelliJ maven plugin. <br/>
<img src="media/15962593173836/15962699434104.jpg" alt=""/></p>

<pre><code class="language-text">mvn clean install -Dmaven.test.skip=true 
</code></pre>

<p>Then we can start server and client just by running their main methods (Click the <code>run</code> on <code>main()</code> in IntelliJ, or use the CLI given in <code>dubbo-demo/README.md</code> which is the basic way to run a jar <code>java -jar jarName.jar</code>). Yes, the server is not using any Servlet container. Dubbo uses Netty to serve dubbo protocol!</p>

<p>Logs from the server show that it automatically registers its current IP address (my laptop&#39;s IP on the LAN) and port (default 20880) with ZooKeeper (127.0.0.1:2181), along with the supported methods and interface name. </p>

<pre><code class="language-txt">[01/08/20 01:29:26:649 PDT] main  INFO config.ServiceConfig:  
[DUBBO] Register dubbo service org.apache.dubbo.demo.DemoService url 
dubbo://192.168.43.137:20880/org.apache.dubbo.demo.DemoService? 
anyhost=true&amp;application=demo-provider&amp;bind.ip=192.168.43.137&amp;bind.port=20880&amp;deprecated=false
&amp;dubbo=2.0.2&amp;dynamic=true&amp;generic=false&amp;interface=org.apache.dubbo.demo.DemoService
&amp;metadata-type=remote &amp;methods=sayHello,sayHelloAsync&amp;pid=88546&amp;qos.port=22222&amp;release=
&amp;side=provider&amp;timestamp=1596270566561 to registry registry://127.0.0.1:2181/org.apache.dubbo.registry.RegistryService?
application=demo-provider &amp;dubbo=2.0.2&amp;metadata-type=remote&amp;pid=88546&amp;qos.port=22222
&amp;registry=zookeeper&amp;timestamp=1596270566555, dubbo version: , 
current host: 192.168.43.137

[01/08/20 01:29:26:927 PDT] main  INFO transport.AbstractServer:  
[DUBBO] Start NettyServer bind /0.0.0.0:20880, 
export /192.168.43.137:20880, dubbo version: , current host: 192.168.43.137
</code></pre>

<p>The client-side log shows that it successfully gets the service endpoint from ZooKeeper. </p>

<pre><code class="language-txt">[01/08/20 01:41:52:344 PDT] main  INFO zookeeper.ZookeeperRegistry:  
[DUBBO] Register: consumer://192.168.43.137/org.apache.dubbo.demo.DemoService?
application=demo-consumer&amp;category=consumers&amp;check=false&amp;dubbo=2.0.2&amp;init=false
&amp;interface=org.apache.dubbo.demo.DemoService&amp;metadata-type=remote&amp;methods=sayHello,sayHelloAsync
&amp;pid=94685&amp;qos.port=33333&amp;side=consumer&amp;sticky=false&amp;timestamp=1596271312243, 
dubbo version: , current host: 192.168.43.137

[01/08/20 01:41:52:372 PDT] main  INFO zookeeper.ZookeeperRegistry:  
[DUBBO] Subscribe: consumer://192.168.43.137/org.apache.dubbo.demo.DemoService?
application=demo-consumer&amp;category=providers,configurators,routers&amp;check=false&amp;dubbo=2.0.2
&amp;init=false&amp;interface=org.apache.dubbo.demo.DemoService&amp;metadata-type=remote&amp;methods=sayHello,sayHelloAsync
&amp;pid=94685&amp;qos.port=33333&amp;side=consumer&amp;sticky=false&amp;timestamp=1596271312243, 
dubbo version: , current host: 192.168.43.137

[01/08/20 01:41:52:712 PDT] NettyClientWorker-1-1  INFO netty4.NettyClientHandler:  
[DUBBO] The connection of /192.168.43.137:62158 -&gt; /192.168.43.137:20880 is established., 
dubbo version: , current host: 192.168.43.137
</code></pre>

<h3 id="toc_8">4.4 Zookeeper Review</h3>

<pre><code class="language-cli">$ bin/zkCli.sh -server 127.0.0.1:2181

[zk: 127.0.0.1:2181(CONNECTED) 1] ls /
[dubbo, zookeeper]

[zk: 127.0.0.1:2181(CONNECTED) 2] ls -R /dubbo

/dubbo/config/mapping/org.apache.dubbo.demo.DemoService/demo-provider
/dubbo/metadata/org.apache.dubbo.demo.DemoService/consumer/demo-consumer
/dubbo/metadata/org.apache.dubbo.demo.DemoService/provider/demo-provider
/dubbo/org.apache.dubbo.demo.DemoService/configurators
/dubbo/org.apache.dubbo.demo.DemoService/consumers
/dubbo/org.apache.dubbo.demo.DemoService/routers
/dubbo/org.apache.dubbo.demo.DemoService/providers/dubbo%3A%2F%2F192.168.43.137%3A20880%2Forg.apache.dubbo.demo.DemoService
%3Fanyhost%3Dtrue%26application%3Ddemo-provider%26deprecated%3Dfalse%26dubbo%3D2.0.2%26dynamic
%3Dtrue%26generic%3Dfalse%26interface%3Dorg.apache.dubbo.demo.DemoService
%26metadata-type%3Dremote%26methods%3DsayHello%2CsayHelloAsync%26pid%3D88546
%26release%3D%26side%3Dprovider%26timestamp%3D1596270566561
</code></pre>

<p>(I only list leaf nodes in above <code>ls -R /dubbo</code> results.)</p>

<p>The last item is the Dubbo URL which also got logged when service starts - &quot;Register dubbo service...&quot;. </p>

<h2 id="toc_9">5. Dubbo URL</h2>

<p>Dubbo uses the URL as the data structure of its protocol. The metadata is stored as key-value pairs in the URL parameters.</p>

<pre><code class="language-text">dubbo://192.168.43.137:20880/org.apache.dubbo.demo.DemoService? 
anyhost=true&amp;application=demo-provider&amp;bind.ip=192.168.43.137&amp;bind.port=20880&amp;deprecated=false
&amp;dubbo=2.0.2&amp;dynamic=true&amp;generic=false&amp;interface=org.apache.dubbo.demo.DemoService
&amp;metadata-type=remote &amp;methods=sayHello,sayHelloAsync&amp;pid=88546&amp;qos.port=22222&amp;release=
&amp;side=provider&amp;timestamp=1596270566561
</code></pre>
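
<p>Because the metadata lives in ordinary URL parameters, any URL parser can take such an address apart. The sketch below uses Python&#39;s standard <code>urllib.parse</code>, not Dubbo&#39;s own URL class; <code>parse_dubbo_url</code> is a hypothetical helper, and the URL is a shortened version of the provider URL above, rebuilt with <code>urlencode</code>.</p>

<pre><code class="language-python">from urllib.parse import parse_qs, urlencode, urlsplit


def parse_dubbo_url(url):
    """Split a Dubbo URL into protocol, address, interface path and parameters."""
    parts = urlsplit(url)
    # parse_qs returns a list per key; the parameters here are single-valued.
    query = parse_qs(parts.query, keep_blank_values=True)
    params = {key: values[0] for key, values in query.items()}
    return {
        "protocol": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,
        "interface": parts.path.lstrip("/"),
        "params": params,
    }


# A subset of the parameters from the log; urlencode percent-encodes the comma.
query_string = urlencode({
    "application": "demo-provider",
    "methods": "sayHello,sayHelloAsync",
    "release": "",
    "side": "provider",
})
url = "dubbo://192.168.43.137:20880/org.apache.dubbo.demo.DemoService?" + query_string
info = parse_dubbo_url(url)
print(info["protocol"], info["host"], info["port"], info["interface"])
print(info["params"]["methods"].split(","))
</code></pre>

<p>The percent-encoded form produced by <code>urlencode</code> resembles the provider node name that ZooKeeper stores under <code>/dubbo/.../providers</code>.</p>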

<h2 id="toc_10">6. Further questions I don&#39;t have answers to yet</h2>

<ol>
<li>How does the client configure retry and timeout?</li>
<li>Does the client use any connection pool? How to configure the pool size and idle connection timeout?</li>
<li>How does the client authenticate itself when calling the server? Or, how does the server protect itself from being called by unauthorized clients?</li>
<li>What is the minimal dependency for the client and server to use Dubbo? Is it <code>&lt;artifactId&gt;dubbo-dependencies-bom&lt;/artifactId&gt;</code>?</li>
<li>What is the difference between <code>&lt;artifactId&gt;dubbo-dependencies-bom&lt;/artifactId&gt;</code> and <code>&lt;artifactId&gt;dubbo-bom&lt;/artifactId&gt;</code>?</li>
</ol>

]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[An upward trend ID generation service]]></title>
    <link href="https://swdev.online/an-upward-trend-id-generation.html"/>
    <updated>2020-07-02T01:15:27-07:00</updated>
    <id>https://swdev.online/an-upward-trend-id-generation.html</id>
    <content type="html"><![CDATA[
<ul>
<li>
<a href="#toc_0">1. First question - why do we need a sequential ID instead of a random UUID?</a>
<ul>
<li>
<a href="#toc_1">1.1 2nd Question is why do I write this article.</a>
</li>
</ul>
</li>
<li>
<a href="#toc_2">2. Requirement:</a>
</li>
<li>
<a href="#toc_3">3. Solution 1: DB atomic update</a>
<ul>
<li>
<a href="#toc_4">3.1 IDGenerator service logic</a>
</li>
<li>
<a href="#toc_5">3.2 DB schema</a>
</li>
<li>
<a href="#toc_6">3.3 How to query (MySQL example)</a>
<ul>
<li>
<a href="#toc_7">3.3.1 Atomic Read and Update / Transaction</a>
</li>
<li>
<a href="#toc_8">3.3.2 Optimistic Lock / Lockless</a>
</li>
</ul>
</li>
<li>
<a href="#toc_9">3.4 Assessment</a>
</li>
</ul>
</li>
<li>
<a href="#toc_10">4. Solution2: Redis INCRBY</a>
<ul>
<li>
<a href="#toc_11">4.1 The logic is the same.</a>
</li>
<li>
<a href="#toc_12">4.2 Assessment</a>
</li>
</ul>
</li>
<li>
<a href="#toc_13">5. Solution3: Time-based on-host generation.</a>
<ul>
<li>
<a href="#toc_14">5.1 128 bit UUID</a>
</li>
<li>
<a href="#toc_15">5.2 64 bit SnowFlake Id</a>
</li>
<li>
<a href="#toc_16">5.3 Problem</a>
</li>
</ul>
</li>
<li>
<a href="#toc_17">6. Conclusion</a>
</li>
</ul>


<span id="more"></span><!-- more -->

<h2 id="toc_0">1. First question - why do we need a sequential ID instead of a random UUID?</h2>

<ul>
<li><p>UUID takes space! We have to store it as HEX string, with 32 chars &amp; 4 &#39;-&#39;.</p>
<ul>
<li>UUID has 128bit in binary - will be 39 chars as Decimal string or 32 chars as HEX string.</li>
<li>Consider the data type to store it:
<ul>
<li>MySQL <code>BIGINT</code>: 64bit, not enough for 128bit.</li>
<li>MySQL <code>VARCHAR(36)</code>: 36 chars to store the HEX string with 4 &#39;-&#39;s. </li>
<li>DynamoDB <code>Number</code>: Maps to Java BigInteger, but precision is limited to 38 digits, so a 128-bit value (up to 39 decimal digits) does not always fit. It is also stored as a Decimal string rather than as compact binary.</li>
<li>DynamoDB <code>String</code>: 36 chars to store the HEX string with 4 &#39;-&#39;s. </li>
</ul></li>
</ul></li>
<li><p>Some business requirement <code>may</code> require the ID to be in an upward trend. </p>
<ul>
<li>StatementId, OrderId, TweetId.
<ul>
<li>Twitter Example
<ul>
<li>Although there may be a <code>creationTime</code> column, twitter said they prefer to <a href="https://blog.twitter.com/engineering/en_us/a/2010/announcing-snowflake.html">sort tweets by ID</a>. </li>
<li>The timeline of a user is paginated. If the tweetId has order and is stored as sort key (e.g. MySQL indexed column, DDB Sort Key, HBase Column Key, etc.), the query to get next page will be very efficient. </li>
</ul></li>
</ul></li>
</ul></li>
<li><p>UUID might have duplicates when there are already a lot of existing IDs. But this is not a practical concern because <a href="https://en.wikipedia.org/wiki/Universally_unique_identifier#Collisions">the probability to find a duplicate within 103 trillion version-4 UUIDs is one in a billion</a></p></li>
</ul>

<h3 id="toc_1">1.1 2nd Question is why do I write this article.</h3>

<p>In my work projects, we were lazy and just used random UUIDs most of the time. But I also saw many systems, such as e-commerce and social media systems, use a number as the ID. <br/>
I&#39;m trying to figure out the best way to generate universally unique IDs for distributed systems. </p>

<p>When I had almost finished this doc, I came across Twitter&#39;s original blog post on why they developed SnowFlake. <br/>
The reasons they listed are very similar to my own thoughts, and I realized I might not need to write this article...</p>

<h2 id="toc_2">2. Requirement:</h2>

<ul>
<li>ID generation service</li>
<li>ID is in an upward trend</li>
<li>ID is globally unique </li>
<li>TPS is large enough for application requirement
<ul>
<li>Normal system write TPS is small
<ul>
<li>FB/Twitter should be &lt; 100TPS</li>
</ul></li>
<li>But in some events write TPS can be large
<ul>
<li>Double11 peak - <strong>100kTPS</strong> order placement.
<ul>
<li>Assumption: <a href="source?">1billion RMB / 21s</a>, 500RMB/order</li>
<li>order might be written to order table asynchronously</li>
<li>I believe they prepared orderId in advance.</li>
</ul></li>
</ul></li>
</ul></li>
</ul>
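
<p>The 100k TPS figure for the Double11 peak follows from the two assumed numbers in the list above (1 billion RMB in 21 seconds, 500 RMB per order). A quick check of that arithmetic, using only those assumptions:</p>

<pre><code class="language-python"># Both inputs are the rough assumptions stated in the list, not measured data.
revenue_rmb = 1_000_000_000
window_seconds = 21
average_order_rmb = 500

orders_per_second = revenue_rmb / window_seconds / average_order_rmb
print(round(orders_per_second))  # 95238, i.e. roughly 100k orders per second
</code></pre>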

<h2 id="toc_3">3. Solution 1: DB atomic update</h2>

<h3 id="toc_4">3.1 IDGenerator service logic</h3>

<ul>
<li>Service is a cluster.</li>
<li>Each host uses one background thread to fetch an id range periodically.</li>
<li>May prefetch many ranges and store them in a queue. </li>
<li>Clients request one ID at a time. </li>
<li>Consume the head of the queue until it&#39;s empty. </li>
</ul>
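
<p>The service logic above can be sketched as follows. This is a simplified illustration, not a production design: <code>InMemoryRangeStore</code> is a stand-in for the database row, <code>IdGenerator</code> is a hypothetical class name, and ranges are refilled synchronously when the queue runs dry instead of by a background thread.</p>

<pre><code class="language-python">import threading
from collections import deque


class InMemoryRangeStore:
    """Stand-in for the DB row: hands out consecutive ID ranges atomically."""

    def __init__(self, max_id=0, step=500):
        self._max_id = max_id
        self._step = step
        self._lock = threading.Lock()

    def fetch_range(self):
        with self._lock:
            first = self._max_id + 1
            self._max_id += self._step
            return first, self._max_id  # inclusive bounds


class IdGenerator:
    """One instance per service host; serves one ID at a time from queued ranges."""

    def __init__(self, store, prefetch=2):
        self._store = store
        self._prefetch = prefetch  # number of ranges to keep queued, at least 1
        self._ranges = deque()
        self._current = iter(())
        self._lock = threading.Lock()

    def _refill(self):
        # A real service would do this periodically in a background thread.
        for _ in range(self._prefetch - len(self._ranges)):
            self._ranges.append(self._store.fetch_range())

    def next_id(self):
        with self._lock:
            while True:
                value = next(self._current, None)
                if value is not None:
                    return value
                if not self._ranges:
                    self._refill()
                first, last = self._ranges.popleft()
                self._current = iter(range(first, last + 1))


store = InMemoryRangeStore(step=3)
host_a = IdGenerator(store)
host_b = IdGenerator(store)
print([host_a.next_id() for _ in range(4)])  # [1, 2, 3, 4]
print([host_b.next_id() for _ in range(2)])  # [7, 8]
</code></pre>

<p>Host B starts at 7 because host A already prefetched two ranges. IDs are unique and trend upward overall, but they are not strictly increasing across hosts.</p>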

<h3 id="toc_5">3.2 DB schema</h3>

<pre><code class="language-text">id_name: the name of this ID
max_id: the current max used id in this sequence
step: the number of ids to allocate in one fetch
</code></pre>

<h3 id="toc_6">3.3 How to query (MySQL example)</h3>

<h4 id="toc_7">3.3.1 Atomic Read and Update / Transaction</h4>

<ol>
<li>We query the current <code>max_id</code> and get the ID range <code>[max_id+1, max_id+step]</code>. </li>
<li>To prevent other concurrent MySQL threads from getting the same range, we need to take a <code>Write lock</code> (<code>FOR UPDATE</code>, <code>Pessimistic lock</code>). Only after <code>COMMIT</code> can another thread (also using <code>FOR UPDATE</code>) read <code>max_id</code> again. </li>
</ol>

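<pre><code class="language-sql">SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;
SET autocommit = 0;
START TRANSACTION;
    -- Lock the row; the fetched range is [max_id + 1, max_id + step].
    SELECT max_id, step
    FROM id_generator
    WHERE id_name = &#39;OrderId&#39; FOR UPDATE;

    -- Advance the counter while still holding the row lock.
    UPDATE id_generator
    SET max_id = max_id + step
    WHERE id_name = &#39;OrderId&#39;;
COMMIT;
</code></pre>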

<h4 id="toc_8">3.3.2 Optimistic Lock / Lockless</h4>

<p>Adding a <code>version</code> column is the common way to apply an optimistic lock. An optimistic lock only performs better than an actual lock when contention is low. </p>

<p>In this case, <code>max_id</code> is definitely something updated concurrently by all hosts in the service cluster.  So I don&#39;t think it&#39;s a good idea to use optimistic lock. </p>

<p>But for reference purpose, let me still put the logic here. </p>

<ol>
<li>Read the max_id and version. Use [max_id + 1, max_id + step] as the fetched range.</li>
<li>Update max_id only when version is not changed. </li>
<li>If version changed, need to retry step 1 and 2. </li>
</ol>

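<pre><code class="language-sql">-- Step 1: read the current state.
SELECT max_id, version, step
FROM id_generator
WHERE id_name = &#39;OrderId&#39;;

-- Step 2: {version} is the value read in step 1.
-- If no row is updated, another host won the race: retry from step 1.
UPDATE id_generator
SET max_id = max_id + step, version = version + 1
WHERE version = {version} AND id_name = &#39;OrderId&#39;;
</code></pre>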
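
<p>The read, conditional update, and retry steps can be tried end to end with a small script. This sketch uses the standard library&#39;s SQLite as a stand-in for MySQL, so locking behavior differs from a real deployment; the table layout mirrors the schema above plus a <code>version</code> column, and <code>fetch_range</code> is a hypothetical helper.</p>

<pre><code class="language-python">import sqlite3


def fetch_range(conn, id_name, max_retries=5):
    """Claim [max_id + 1, max_id + step] using an optimistic version check."""
    for _ in range(max_retries):
        max_id, version, step = conn.execute(
            "SELECT max_id, version, step FROM id_generator WHERE id_name = ?",
            (id_name,),
        ).fetchone()
        cursor = conn.execute(
            "UPDATE id_generator SET max_id = max_id + step, version = version + 1 "
            "WHERE id_name = ? AND version = ?",
            (id_name, version),
        )
        conn.commit()
        if cursor.rowcount == 1:  # nobody changed the row since we read it
            return max_id + 1, max_id + step
    raise RuntimeError("gave up after too many conflicting updates")


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE id_generator "
    "(id_name TEXT PRIMARY KEY, max_id INTEGER, version INTEGER, step INTEGER)"
)
conn.execute("INSERT INTO id_generator VALUES ('OrderId', 0, 0, 500)")
conn.commit()
print(fetch_range(conn, "OrderId"))  # (1, 500)
print(fetch_range(conn, "OrderId"))  # (501, 1000)
</code></pre>

<p>Under heavy contention most attempts would fail the version check and loop, which is why the pessimistic lock looks like the better fit here.</p>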

<h3 id="toc_9">3.4 Assessment</h3>

<p>Because of the <code>Write lock</code> in the transaction, the transactions are executed sequentially. <br/>
Throughput will be limited, and P99 latency will be high; requests may even time out. </p>

<ul>
<li><p>Scalability: </p>
<ul>
<li>Service cluster is scalable. </li>
<li>DB write is not scalable because one ID name uses only one row.  (Usually this causes HotKey problem.) </li>
<li><p>However we may not need to scale because the background thread can fetch a large range each time, at the cost of sequence being not exactly increasing. </p></li>
<li><p>Example:</p>
<ul>
<li>Clients request 100k ID per second. </li>
<li>Cluster has 200 hosts, with 500TPS/host. </li>
<li>We want the IDs to be <a href="https://blog.twitter.com/engineering/en_us/a/2010/announcing-snowflake.html">k-sorted</a> with k~1.</li>
<li>Every host fetches a range from DB every 1s
<ul>
<li>The range step will be 500 (100k / 200 * 1)</li>
<li>DB write is 200/1 = 200 TPS.</li>
</ul></li>
</ul></li>
</ul></li>
<li><p>Availability: </p>
<ul>
<li>Must have synchronous replication of the master node. 
<ul>
<li>MySQL master-slave replication</li>
<li>DDB partition has three nodes, one is leader, the other  two are followers. We need strong consistency for write. </li>
<li>Leaderless architecture, such as Cassandra.<br/></li>
</ul></li>
</ul></li>
<li><p>General Drawback:<br/>
Compared with the SnowFlake algorithm, the disadvantage of this strategy is obvious. <br/>
It depends on an extra system. <br/>
That appears to mean higher cost, limited scalability, and lower availability.</p></li>
</ul>
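
<p>The sizing example in the scalability notes above can be checked with a few lines of arithmetic. The function name <code>range_sizing</code> is hypothetical; the inputs are the numbers from the example (100k IDs per second, 200 hosts, one fetch per second).</p>

<pre><code class="language-python">def range_sizing(client_ids_per_second, hosts, fetch_interval_seconds):
    """Return (IDs per second per host, range step, DB writes per second)."""
    per_host_tps = client_ids_per_second / hosts
    # Each fetch must cover everything a host hands out until its next fetch.
    step = per_host_tps * fetch_interval_seconds
    # Every host performs one DB write per fetch interval.
    db_writes_per_second = hosts / fetch_interval_seconds
    return per_host_tps, step, db_writes_per_second


print(range_sizing(100_000, 200, 1))  # (500.0, 500.0, 200.0)
</code></pre>

<p>Lengthening the fetch interval lowers the DB write rate proportionally, at the cost of IDs being less tightly ordered across hosts.</p>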

<h2 id="toc_10">4. Solution2: Redis INCRBY</h2>

<h3 id="toc_11">4.1 The logic is the same.</h3>

<p>This is the same idea as above. <br/>
Service fetches a range from Redis by increasing the max_id.</p>

<p>INCRBY is atomic and returns the value after the increase.</p>

<pre><code class="language-text">redis&gt; SET OrderId &quot;1000000&quot;
&quot;OK&quot;
redis&gt; INCRBY OrderId 500
(integer) 1000500
</code></pre>
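
<p>Since <code>INCRBY</code> returns the value after the increase, the claimed range ends at the returned value and starts <code>step - 1</code> below it. The sketch below shows that bookkeeping with a tiny in-memory stand-in; <code>FakeRedis</code> is not a real client and only imitates the <code>set</code> and <code>incrby</code> calls used here, and <code>fetch_range</code> is a hypothetical helper.</p>

<pre><code class="language-python">class FakeRedis:
    """In-memory stand-in that imitates SET and INCRBY on integer values."""

    def __init__(self):
        self._data = {}

    def set(self, name, value):
        self._data[name] = int(value)

    def incrby(self, name, amount=1):
        self._data[name] = self._data.get(name, 0) + amount
        return self._data[name]  # like Redis, return the value after the increase


def fetch_range(client, key, step):
    """Claim the next step IDs and return the inclusive (first, last) bounds."""
    new_max = client.incrby(key, step)
    return new_max - step + 1, new_max


client = FakeRedis()
client.set("OrderId", 1000000)
print(fetch_range(client, "OrderId", 500))  # (1000001, 1000500)
</code></pre>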

<h3 id="toc_12">4.2 Assessment</h3>

<p>It still has the same problems as above:</p>

<ul>
<li>extra dependency</li>
<li>limited scalability</li>
<li>low availability</li>
</ul>

<p>Redis does offer higher throughput than MySQL. <br/>
But Redis replication seems to support only the asynchronous mode. <br/>
(MySQL supports full sync, semi-sync and async replication, of which we need full sync so that replicas stay consistent and a slave can be promoted to master without losing data.)</p>

<h2 id="toc_13">5. Solution3: Time-based on-host generation.</h2>

<p>I think SnowFlake and TimeBasedUUID are the same idea.<br/>
SnowFlake wins because it has 64bits while UUID has 128bits.</p>

<p><a href="https://logging.apache.org/log4j/log4j-2.2/log4j-core/apidocs/org/apache/logging/log4j/core/util/UuidUtil.html#getTimeBasedUuid()">TimeBasedUUID</a><br/>
<a href="https://docs.oracle.com/javase/6/docs/api/java/util/UUID.html">Java UUID</a></p>

<h3 id="toc_14">5.1 128 bit UUID</h3>

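<table>
<tr><th>Field</th><th>Size</th></tr>
<tr><td>Time (100ns, i.e. 1e-7s precision)</td><td>60 bit</td></tr>
<tr><td>Version (Timebased, DCE Security, NameBased, Random)</td><td>4 bit</td></tr>
<tr><td>Mac</td><td>48 bit</td></tr>
<tr><td>Sequence Number</td><td>14 bit</td></tr>
<tr><td>Variant (2 (Leach-Salz))</td><td>2 bit</td></tr>
</table>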

<p>The generated UUID will be unique for approximately 8,925 years so long as less than 10,000 IDs are generated per millisecond on the same device (as identified by its MAC address).</p>

<h3 id="toc_15">5.2 64 bit SnowFlake Id</h3>

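<table>
<tr><th>Field</th><th>Size</th></tr>
<tr><td>0</td><td>1 bit</td></tr>
<tr><td>timestamp in milliseconds</td><td>41 bit</td></tr>
<tr><td>Machine id</td><td>10 bit</td></tr>
<tr><td>sequence no</td><td>12 bit</td></tr>
</table>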

<p>Machine id can be based on private ip - always unique in a LAN.<br/>
(In contrast, TimeBasedUUID is using Mac as machine ID, which has some privacy leakage risk.)</p>

<p>The bits can be redefined. Under the default config, there can be at most 2<sup>10</sup> = 1024 machines, and each machine can generate at most 2<sup>12</sup> = 4096 IDs per millisecond. </p>
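
<p>The layout above can be illustrated by packing and unpacking the three fields. This is a sketch, not Twitter&#39;s implementation: <code>CUSTOM_EPOCH_MS</code> is an arbitrary epoch chosen for the example (2020-01-01 UTC), <code>compose</code> and <code>decompose</code> are hypothetical names, and multiplication by powers of two is used in place of bit shifts.</p>

<pre><code class="language-python">TIMESTAMP_BITS = 41
MACHINE_BITS = 10
SEQUENCE_BITS = 12
CUSTOM_EPOCH_MS = 1577836800000  # arbitrary example epoch: 2020-01-01T00:00:00Z


def compose(timestamp_ms, machine_id, sequence, epoch_ms=CUSTOM_EPOCH_MS):
    """Pack timestamp, machine id and sequence number into one 64-bit integer."""
    elapsed = timestamp_ms - epoch_ms
    if elapsed not in range(2 ** TIMESTAMP_BITS):
        raise ValueError("timestamp does not fit in 41 bits")
    if machine_id not in range(2 ** MACHINE_BITS):
        raise ValueError("machine id does not fit in 10 bits")
    if sequence not in range(2 ** SEQUENCE_BITS):
        raise ValueError("sequence does not fit in 12 bits")
    # Multiplying by 2**n is the same as shifting left by n bits.
    return (elapsed * 2 ** MACHINE_BITS + machine_id) * 2 ** SEQUENCE_BITS + sequence


def decompose(snowflake_id, epoch_ms=CUSTOM_EPOCH_MS):
    """Recover (timestamp_ms, machine_id, sequence) from a packed ID."""
    rest, sequence = divmod(snowflake_id, 2 ** SEQUENCE_BITS)
    elapsed, machine_id = divmod(rest, 2 ** MACHINE_BITS)
    return elapsed + epoch_ms, machine_id, sequence


example_id = compose(CUSTOM_EPOCH_MS + 1, machine_id=3, sequence=7)
print(example_id)             # 4206599
print(decompose(example_id))  # (1577836800001, 3, 7)
</code></pre>

<p>Because the timestamp occupies the most significant bits, IDs generated in a later millisecond always compare greater, regardless of machine id or sequence number.</p>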

<p>I saw Baidu has a variation that changed timestamp to seconds, so  there can be more machine ids.</p>

<h3 id="toc_16">5.3 Problem</h3>

<p>In Java we can get the epoch timestamp using <code>System.currentTimeMillis()</code>. But the absolute millisecond value is not very reliable because of delays in NTP sync.<br/>
After an NTP sync, the clock may go back to a prior second to correct <a href="https://en.wikipedia.org/wiki/Clock_drift">clock drift</a>, so a time-based generator may see a timestamp it has already used.<br/>
I found a good article on it - <a href="https://codeburst.io/why-shouldnt-you-trust-system-clocks-72a82a41df93">How System Clocks Can Cause Mysterious Faults?</a>.</p>
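
<p>One possible guard against this, sketched under assumptions rather than taken from any particular library: wrap the clock, remember the last timestamp handed out, and refuse to go backwards. The names <code>GuardedClock</code>, <code>ClockMovedBackwards</code> and <code>tolerance_ms</code> are hypothetical.</p>

```python
import time


class ClockMovedBackwards(Exception):
    """Raised when the wall clock is far behind the last timestamp handed out."""


class GuardedClock:
    """Hypothetical wrapper that never returns a timestamp older than the previous one."""

    def __init__(self, source=lambda: int(time.time() * 1000), tolerance_ms=5):
        self.source = source              # returns wall-clock time in milliseconds
        self.tolerance_ms = tolerance_ms  # small step backs are absorbed, larger ones rejected
        self.last_ms = -1

    def now_ms(self):
        now = self.source()
        if now < self.last_ms:
            drift = self.last_ms - now
            if drift > self.tolerance_ms:
                # Too far back: fail loudly instead of risking duplicate IDs.
                raise ClockMovedBackwards(f"clock went back {drift} ms")
            # Small step back: reuse the last timestamp so time never decreases.
            now = self.last_ms
        self.last_ms = now
        return now
```

<p>An ID generator that reads time only through such a wrapper either keeps counting within the last known millisecond or stops generating until an operator looks at the host.</p>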

<h2 id="toc_17">6. Conclusion</h2>

<ol>
<li><p>I still think that in most cases, when the data size is not big (data generation is slow or retention time is limited), using RandomUUID is the best way because of its simplicity.</p></li>
<li><p>Otherwise, using a customized SnowFlake library seems much better than managing a database or a Redis cluster.</p></li>
</ol>

<p>Reference:</p>

<ol>
<li><a href="https://gitbook.cn/books/5eb683f8a04d4f3d96d5b8cc/index.html">Nine ways to generate unique ID</a></li>
<li><a href="https://www.cnblogs.com/jajian/p/11101213.html">Distributed universal ID generation</a></li>
<li><a href="https://blog.twitter.com/engineering/en_us/a/2010/announcing-snowflake.html">Announcing Snowflake</a></li>
</ol>

]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Idempotency of a PUT Request]]></title>
    <link href="https://swdev.online/idempotency-of-a-put-request.html"/>
    <updated>2020-07-02T01:15:34-07:00</updated>
    <id>https://swdev.online/idempotency-of-a-put-request.html</id>
    <content type="html"><![CDATA[
<ul>
<li>
<a href="#toc_0">1. How To Avoid Duplicate Order</a>
<ul>
<li>
<a href="#toc_1">E-commerce Scenario - Common Flow of Placing Order</a>
</li>
<li>
<a href="#toc_2">Cause:</a>
</li>
<li>
<a href="#toc_3">Solution 1: One-time token</a>
</li>
<li>
<a href="#toc_4">Solution 2: Leverage DB Primary Key uniqueness</a>
</li>
</ul>
</li>
<li>
<a href="#toc_5">2. In general how to guarantee Idempotence of an API put request.</a>
</li>
</ul>


<span id="more"></span><!-- more -->

<h2 id="toc_0">1. How To Avoid Duplicate Order</h2>

<h3 id="toc_1">E-commerce Scenario - Common Flow of Placing Order</h3>

<ul>
<li>Go to Cart</li>
<li>Cart Page + Checkout button </li>
<li>Checkout Page + Place order button </li>
<li>Payment: <br/>
US e-commerce sites have no such step - credit card or PayPal details are entered on the Checkout Page.</li>
<li>Success Page</li>
</ul>

<h3 id="toc_2">Cause:</h3>

<ul>
<li>User clicks PlaceOrder multiple times</li>
<li>Backend retries the call to Order service</li>
</ul>

<h3 id="toc_3">Solution 1: One-time token</h3>

<ul>
<li>Checkout page requests one-time token</li>
<li>Store token in Redis as Key.
<ul>
<li>Set an expiry time in case no order is placed.</li>
<li><a href="https://redis.io/commands/pexpire">PEXPIRE token 5000</a></li>
</ul></li>
<li>Place order request comes with the token</li>
<li>Order system verifies token by
<ul>
<li><a href="https://redis.io/commands/del">DEL token</a></li>
<li>A return value of 0 means the key was already removed (or has expired), so the request is a duplicate.</li>
</ul></li>
</ul>
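
<p>The steps above can be sketched as follows. To keep the example self-contained it uses a small in-memory stand-in (<code>FakeRedis</code>, a hypothetical class) instead of a real Redis client, and it sets the expiry with the key in one call rather than a separate <code>PEXPIRE</code>. The function names <code>issue_token</code> and <code>place_order</code> are assumptions as well.</p>

```python
import time
import uuid


class FakeRedis:
    """In-memory stand-in for the two Redis operations used here (SET with expiry, DEL)."""

    def __init__(self):
        self._data = {}  # key -> expiry time on the monotonic clock, in seconds

    def set(self, key, value, px):
        # px is the time to live in milliseconds.
        self._data[key] = time.monotonic() + px / 1000.0
        return True

    def delete(self, key):
        expires_at = self._data.pop(key, None)
        # Like DEL: 1 if a live key was removed, 0 if it was missing or expired.
        return 1 if expires_at is not None and expires_at > time.monotonic() else 0


def issue_token(store, ttl_ms=5000):
    """Checkout page: create a one-time token that expires if no order is placed."""
    token = uuid.uuid4().hex
    store.set(token, "1", px=ttl_ms)
    return token


def place_order(store, token):
    """Order system: the first request deletes the token, any repeat finds nothing."""
    if store.delete(token) == 0:
        return "duplicate-or-expired"
    # ... create the order here (omitted) ...
    return "order-created"
```

<p>The check works because deleting a key is atomic: of several concurrent requests carrying the same token, only one can see the delete return 1.</p>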

<p>Assessment: </p>

<ul>
<li>Scalability: High. 
<ul>
<li>Redis Sharding.</li>
</ul></li>
<li>Availability: High.
<ul>
<li>If the service goes down after DEL token, retries will fail and the order will not be placed. </li>
<li>If the Redis node is down, the token is not available and the order will not be placed. </li>
</ul></li>
</ul>

<h3 id="toc_4">Solution 2: Leverage DB Primary Key uniqueness</h3>

<ul>
<li>Checkout page requests orderId</li>
<li>Place order request comes with the orderId</li>
<li>DB creates order using orderId as primary key. </li>
<li>Insert failure means PK already exists, this is a duplicate request. </li>
</ul>
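
<p>A minimal sketch of this approach, using Python's built-in <code>sqlite3</code> module as a stand-in for the real order database. The table layout and the helper name <code>create_order</code> are assumptions for illustration.</p>

```python
import sqlite3


def create_order(conn, order_id, item):
    """Insert an order; return False if this order_id was already used."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO orders (order_id, item) VALUES (?, ?)",
                (order_id, item),
            )
        return True
    except sqlite3.IntegrityError:
        # The primary key already exists: treat the request as a duplicate.
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, item TEXT)")

first = create_order(conn, "order-1001", "book")   # original request
retry = create_order(conn, "order-1001", "book")   # retried request with the same orderId
rows = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

<p>The retry is rejected by the database itself, so only one row exists no matter how many times the request is repeated.</p>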

<p>Assessment: </p>

<ul>
<li>Scalability: High</li>
<li>Availability: Depends on the proxy. If it uses consistent hashing, which can remove a failed node, then availability is high.</li>
<li>Security: An attacker can send an invalid key, so we have to verify it.</li>
</ul>

<h2 id="toc_5">2. In general how to guarantee Idempotence of an API put request.</h2>

<p>We can include a primary key in an <code>insert</code> statement. The primary key is the unique id of the data, so a retried write with the same key won&#39;t succeed a second time. </p>

<p>Take SQL for example - </p>

<pre><code class="language-sql">INSERT INTO users (uid, age, gender, createTime)
VALUES (1234567, 20, &#39;male&#39;, 1593684708);
</code></pre>

<p>Can we assume all INSERTs are idempotent when there is a Primary Key in the request? </p>

<p>Yes, when the Primary Key comes from an <a href="/an-upward-trend-id-generation.html">ID generation service</a>, or even from client side auto generation. </p>

<p>DynamoDB (DDB) and MongoDB generate a UUID or ObjectId at client side: </p>

<ul>
<li><p>DynamoDB <a href="https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DynamoDBMapper.Annotations.html#DynamoDBMapper.Annotations.DynamoDBAutoGeneratedKey">@DynamoDBAutoGeneratedKey</a>: the UUID is 128 bits in memory, but in the DB it&#39;s a <a href="https://docs.oracle.com/javase/6/docs/api/java/util/UUID.html#toString()">string</a> of 32 hex chars and 4 &quot;-&quot;.</p></li>
<li><p>MongoDB <a href="https://docs.mongodb.com/manual/reference/bson-types/#objectid">ObjectId</a>: a 12 byte value in BSON.</p></li>
</ul>
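
<p>For illustration, here is the same size relationship with Python&#39;s standard <code>uuid</code> module (not the DynamoDB mapper itself). The variable names are assumptions.</p>

```python
import uuid

# Generate the key once on the client, before the first attempt,
# and reuse the same value for every retry of that write.
order_key = uuid.uuid4()       # random (version 4) UUID
as_bytes = order_key.bytes     # 16 bytes = 128 bits in memory
as_text = str(order_key)       # canonical text form: 32 hex chars and 4 "-"
```

<p>Storing the text form costs 36 characters per key, which is the price paid for a key that needs no coordination with the database.</p>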

<p>This means we should try to avoid using the AUTO_INCREMENT feature on the DB server side, because the server assigns a new key to every retried insert and a duplicate row is created:</p>

<ul>
<li>MySQL <a href="https://dev.mysql.com/doc/refman/8.0/en/example-auto-increment.html">AUTO_INCREMENT</a></li>
</ul>

<p><strong>Conclusion</strong>: <br/>
We can guarantee idempotency to DB when the write has a primary key generated on client side - from an ID Generation Service or client side random UUID. </p>

<p>However the actual idempotency is determined by upstream services. Will they retry the calls to the DAO service? </p>

<p>The ideal way is for the top level service to provide a one-time token. The DAO layer then checks the token using <strong>Solution 1</strong> of Section 1.</p>

]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Design Twitter - 1: Requirement and Storage Selection]]></title>
    <link href="https://swdev.online/design-twitter-using-no-sql-1.html"/>
    <updated>2019-11-09T14:47:01-08:00</updated>
    <id>https://swdev.online/design-twitter-using-no-sql-1.html</id>
    <content type="html"><![CDATA[
<ul>
<li>
<a href="#toc_0">1. Requirement</a>
<ul>
<li>
<a href="#toc_1">1.1 Use Case</a>
</li>
<li>
<a href="#toc_2">1.2 Volume</a>
</li>
</ul>
</li>
<li>
<a href="#toc_3">2. API</a>
</li>
<li>
<a href="#toc_4">3. Storage Choice</a>
<ul>
<li>
<a href="#toc_5">3.1 RDBMS</a>
</li>
<li>
<a href="#toc_6">3.2 NoSQL</a>
</li>
</ul>
</li>
</ul>


<span id="more"></span><!-- more -->

<h2 id="toc_0">1. Requirement</h2>

<h3 id="toc_1">1.1 Use Case</h3>

<p>As a user</p>

<pre><code class="language-text">I can create my profile
I can post a tweet
I can list my tweets posted before
I can follow / unfollow other users
I can list my followers and my followees
I can see others&#39; tweets in my homepage timeline
I can see my own tweets in my profile timeline
I can read a tweet (unlike facebook, you don&#39;t need to click into the message.)
I can delete my tweets, so that no one can see them any more.
</code></pre>

<p>NOTE: If this is an interview, you&#39;d better list some common use cases, but pick one or two key use cases to go in depth. Time slips away quickly when you explain a design. You want to show the breadth and depth of your knowledge.</p>

<h3 id="toc_2">1.2 Volume</h3>

<p>Ask your interviewer for the TPS, unless they ask you to estimate it. </p>

<p>A useful way is to start from MAU or DAU based on US population and the relative size of each use case. </p>

<p>It took me 20 minutes to think about the TPS of each use case below. This is absolutely not feasible in an interview. So in reality, we should go with just the read and write of one use case or two.</p>

<p>Your interviewer would like to see how you estimate, not an accurate number. So I intentionally do not go too high for the DAU. On one hand, you may trap yourself in an over-challenging problem. On the other hand, we can start with the US market first, and if we still have time, we can think about scaling it to EU and FE (Far East) by replication. </p>

<ul>
<li>MAU: 20% of US people (roughly 300 Million) - 60 Million</li>
<li><p>DAU: 1/3 MAU - 20 Million <sup id="fnref1"><a href="#fn1" rel="footnote">1</a></sup></p></li>
<li><p>TPS</p></li>
</ul>

<table>
<thead>
<tr>
<th>API</th>
<th>Peak TPS</th>
<th style="text-align: right">Reason</th>
</tr>
</thead>

<tbody>
<tr>
<td>Show timeline (read tweets)</td>
<td>10k</td>
<td style="text-align: right">Assume every daily active user opens it twice a day with one refresh each time, i.e. 4 reads. <br>20e6 * 4 / (24 * 60 * 60) ~ 1000. <br>Considering peak hours and peak events, we give it a 10 times buffer.</td>
</tr>
<tr>
<td>Post a tweet</td>
<td>100</td>
<td style="text-align: right">1% of read.</td>
</tr>
<tr>
<td>Comment a tweet</td>
<td>1k</td>
<td style="text-align: right">10 times the post rate.</td>
</tr>
<tr>
<td>Delete a tweet</td>
<td>10</td>
<td style="text-align: right">Rare</td>
</tr>
<tr>
<td>List my tweets</td>
<td>10</td>
<td style="text-align: right">Rare</td>
</tr>
<tr>
<td>Follow a user</td>
<td>100</td>
<td style="text-align: right">1% of read</td>
</tr>
<tr>
<td>Unfollow a user</td>
<td>10</td>
<td style="text-align: right">Rare</td>
</tr>
<tr>
<td>List my followers</td>
<td>100</td>
<td style="text-align: right">Same as follow</td>
</tr>
<tr>
<td>List my followees</td>
<td>100</td>
<td style="text-align: right">Same as list followers</td>
</tr>
</tbody>
</table>

<ul>
<li>Storage<br/>
600 Million entries in user table (assume MAU is 10% of all users)<br/>
315 Million tweets per year (assume the average post TPS is 10% of the peak, i.e. 10; 10 * 365 * 24 * 3600 = 315 M)</li>
</ul>
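
<p>The arithmetic behind these estimates can be checked with a few lines of Python. The variable names are only for illustration, and the inputs are the assumptions stated above (60 Million MAU, 4 timeline reads per user per day, 10 times peak buffer).</p>

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY

mau = 60_000_000
dau = mau // 3                                   # 20 million
reads_per_user_per_day = 4                       # two visits, one refresh each
avg_read_tps = dau * reads_per_user_per_day / SECONDS_PER_DAY   # about 926, rounded to 1000
peak_read_tps = 10_000                           # ~1000 average with a 10x buffer
peak_post_tps = peak_read_tps // 100             # 1% of reads
avg_post_tps = peak_post_tps // 10               # 10% of peak
tweets_per_year = avg_post_tps * SECONDS_PER_YEAR
total_users = mau * 10                           # MAU is 10% of all users

print(round(avg_read_tps), tweets_per_year, total_users)
```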

<h2 id="toc_3">2. API</h2>

<p>Each use case can be exposed as a RESTful API.</p>

<h2 id="toc_4">3. Storage Choice</h2>

<h3 id="toc_5">3.1 RDBMS</h3>

<p>Both the user table and the tweets table are too big for an RDBMS without sharding: hundreds of millions of rows is far more than a single MySQL table handles comfortably. </p>

<p>We can do sharding like this:</p>

<ul>
<li>partition user table based on user id.</li>
<li>partition tweets table based on tweet id.</li>
</ul>

<p>Sharding RDBMS is painful and error prone:</p>

<ul>
<li> Need a proxy layer to route requests. </li>
<li> Cannot efficiently access data by columns other than the partition key.</li>
<li> Rescaling is difficult and may require turning off the whole system.</li>
<li> You cannot join tables across shards or select rows using a flexible where clause.</li>
</ul>
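
<p>To see why rescaling hurts, here is a small sketch that assumes the proxy routes by simple modulo on the id (a hypothetical routing rule, <code>shard_for</code>) and then counts how many rows must move when a fifth shard is added to four.</p>

```python
def shard_for(key, shard_count):
    """Route a numeric id to a shard by simple modulo (hypothetical routing rule)."""
    return key % shard_count


user_ids = range(1000)
before = {uid: shard_for(uid, 4) for uid in user_ids}   # 4 shards
after = {uid: shard_for(uid, 5) for uid in user_ids}    # 5 shards
moved = sum(1 for uid in user_ids if before[uid] != after[uid])

print(moved)  # 800 of 1000 ids land on a different shard
```

<p>With this rule 80% of the rows change shards after adding one node, which is why resharding often means a large data migration.</p>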

<p>Therefore an RDBMS is not considered scalable enough for this use case.</p>

<h3 id="toc_6">3.2 NoSQL</h3>

<p>NoSQL is a good fit for this use case - no complex joins, no transactions, and eventual consistency is enough.</p>

<p>HBase, DynamoDB, MongoDB and even Redis will all work. </p>

<p>I&#39;ll talk about my schema design using HBase, DynamoDB and Redis respectively in the next few articles. </p>

<div class="footnotes">
<hr/>
<ol>

<li id="fn1">
<p>World wide DAU in Q4 2018: Facebook 1,520M; Snap 186M; Twitter 126M. Reference: <a href="https://www.vox.com/2019/2/7/18215204/twitter-daily-active-users-dau-snapchat-q4-earnings">https://www.vox.com/2019/2/7/18215204/twitter-daily-active-users-dau-snapchat-q4-earnings</a>&nbsp;<a href="#fnref1" rev="footnote">&#8617;</a></p>
</li>

</ol>
</div>

]]></content>
  </entry>
  
</feed>
