Global Incremental ID

A internal research note from 2021, there are more and more data structures in this topic have been introduced.

Dec 13, 2023

Global Incremental ID plays a crucial role in the distributed computing. Its primary utility lies in addressing the challenge of ID collisions, particularly during data backup, restoration, or the maintenance of global event order. Various solutions, such as UUID, Snowflake ID, and other alternatives, have been devised, each with its own set of advantages and drawbacks.

Bellow is the note regarding the pursuit of a solution to guarantee a global incremental ID within a distributed system. It's essential to highlight that the solutions bellow do not extensively address the imperative of ensuring monotonically incrementing auto-increment fields.It requires a high level of isolation, potentially leading to performance issues.

UUID v1/GUIDs:

A universally unique identifier (UUID) is a 128-bit label used for information in computer systems, normally written in hexadecimal and split by dashes into five groups. There are officially 5 types of UUID values, the a time-based version is widely used in this context.

Pros:

Unique and comparable by timestamp, it can be used as a primary key in database to eliminate the ID fragmentation.
Popular standard format which is supported in multiple languages, platforms, libraries... so engineers can easily generate the IDs offline without additional call to other services.

Cons:

128 bit is considered a big size for an ID
Using Time, MAC address or Network interface are predictable, It’s not safe when exposing externally

Snowflake ID

Snowflake ID is another type of global unique id with small size(64 bit). Among the 64 bits, 1 bit is not used, and then 41 bits are used as milliseconds, 10 bits as working machine id and 12 bits as serial number. An example is shown in the figure above

Unused(1 bit): 0, to generate positive number.
The timestamp(41 bits): The unit is milliseconds. 41 bit can represent as many as 2 ^ 41 - 1, that is to say, it can identify 2 ^ 41 - 1 millisecond value, which can be converted into 69 years.
Machine info(10 bit): This means that it can represent up to 2 ^ 10
The serial number(12 bits): The serial number of the id generated simultaneously in one millisecond. The maximum positive integer that 12 bits can represent is 2 ^ 12 - 1 = 4096, which means that the number represented by 12 bits can be used to distinguish 4096 different IDS in the same millisecond.

Actually, Snowflake has multiple settings corresponding to its variants, depending on the requirement of each company they will decide for that setting.

Pros

“Roughly” time ordered.
Less memory required than UUID.

Cons:

Introduce more complexity (machine room id, machine id) when generating an id.
Limited lifetime (69 years)

Snowflake ID variants

Snowflake ID also has many variants, depending on the requirement, the data format will be setup differently. Below are some popular examples:

Sonyflake

Sonyflake is a distributed unique ID generator inspired by Twitter's Snowflake with different setting, Sonyflake focuses on lifetime and performance on many host/core environment. So it has a different bit assignment from Snowflake. A Sonyflake ID is composed of

39 bits for time in units of 10 msec
8 bits for a sequence number
16 bits for a machine id (private IP)

Pros

The lifetime (174 years) is longer than that of Snowflake (69 years)
It can work in more distributed machines (2^16) than Snowflake (2^10)
Stateless, can be used as either library or service

Cons

It can generate 2^8 IDs per 10 msec at most in a single machine/thread (slower than Snowflake)

Instagram Snowflake ID

Based on Snowflake ID with:

41 bits for time in milliseconds (gives us 41 years of IDs with a custom epoch)
13 bits that represent the logical shard ID
10 bits that represent an auto-incrementing sequence, modulus 1024. This means we can generate 1024 IDs, per shard, per millisecond

This “We’ve delegated ID creation to each table inside each shard, by using PL/PGSQL, Postgres’ internal programming language, and Postgres’ existing auto-increment functionality.” → Not really fit with our system.

Baidu Snowflake ID

Based on the snowflake algorithm.

sign(1bit): The highest bit is always 0.
delta seconds (28 bits): The next 28 bits, represents delta seconds since a customer epoch (2016-05-20).
worker id (22 bits): The next 22 bits, represents the worker node id, maximum value will be 4.2 million.
sequence (13 bits): the last 13 bits, represents sequence within the one second, maximum is 8192 per second by default.

Pros:

It can work in more distributed machines (2^22) than Snowflake (2^10)
It can generate 2^13 IDs per 1 ms at most in a single machine/thread (faster than Snowflake)

Cons:

The maximum time will be 8.7 years (smaller than Snowflake)

KSUIDs

“KSUID is for K-Sortable Unique IDentifier. It is a kind of globally unique identifier similar to a RFC 4122 UUID, built from the ground-up to be "naturally" sorted by generation timestamp without any special type-aware logic.”

Binary KSUIDs uses 20-bytes which can be stored:
- 32-bit unsigned integer UTC timestamp
- 128-bit randomly generated payload.
Value presentation in in text is 27 chars
- String format is encoded to base-62 (0-9A-Za-z);
- String format is URL safe and has no hyphens.

Pros:

Naturally ordered by generation time
Collision-free, coordination-free, dependency-free
Highly portable representations: The text and binary representations are lexicographically sortable, easy to copy-paste or present friendly.
Providing over 100 years of life.

Cons:

Using lager memory space

Centralizing Auto-Increments

Centralizing Auto Increments normally uses an backend database to keep the Id incremental state, it could be MySQL, Postgres, Redis, ectd...

Ticket server

Ticket server is a Distributed Unique Primary Keys solution from Flickr to solve the database sharing and migrate data between databases, so they need primary keys to be globally unique. Ticket service is using REPLACE INTO query in MySQL to maintain the state of global IDs.

Ticket server is a dedicated database server, with a single database on it, and in that database there are tables like Tickets32 for 32-bit IDs, and Tickets64 for 64-bit IDs. The Tickets64 schema looks like:

CREATE TABLE `Tickets64` (
  `id` bigint(20) unsigned NOT NULL auto_increment, -- 2^64 ~ 10^19
  `stub` char(1) NOT NULL default '',
  PRIMARY KEY  (`id`),
  UNIQUE KEY `stub` (`stub`)
) ENGINE=InnoDB

Create a new globally unique 64-bit ID I issue the following SQL:

REPLACE INTO Tickets64 (stub) VALUES ('id');
SELECT LAST_INSERT_ID();

Get a single row that looks something like:

SELECT * FROM Tickets64 returns id 
+-------------------+-------------+
| id                | stub        |
+-------------------+-------------+
| 72157623227190423 | event_id    |
+-------------------+-------------+

To obtain distinct sequences of IDs for various types of events in a distributed system, all that is required is the creation of a new table with the identical schema. Utilizing separate tables for different event types serves to minimize the likelihood of table locking.

Pros:

IDs can be provisioned in a monotonically increasing fashion.
Safety is enhanced as the ID excludes device information.
Different sequences of IDs can be established for various event types.

Cons:

The ticket server has a potential single point of failure in high-scale systems. Enhancing high availability (HA) requires a more robust database backend solution, such as a counter type in Cassandra being a viable option for achieving synchronization across distributed nodes
Performance concerns arise in the presence of numerous concurrent requests due to database locking.

Pre provision Id generation server

In this approach, a centralized server pre-provisions segments of IDs in advance and loads them into memory. Clients then query from this memory. As the usage of the current number segment approaches a specific threshold, the next available number segment is loaded asynchronously. This ensures a constant availability of number segments in memory..

Tinyid(DiDi)

Tinyid system architecture diagram

Tinyid-server offer 2 methods nextId and getNextSegmentId
- NextId is to get the next id. When nextId is called, bizType will be passed in. The id data of each bizType is isolated, and the generated id will use the IdGenerator generated by the bizType type.
- getNextSegmentId is to get the next available number segment, Tinyid-client will get the available number segment through this interface
IdGeneratorFactory is a factory that produces specific IdGenerator, and each biz_type generates an IdGenerator instance. Through the factory, we can add biz_type to db at any time without restarting the servic. IdGeneratorFactory actually has two subclasses: IdGeneratorFactoryServer and IdGeneratorFactoryClient. The difference is that the getNextSegmentId is different, one is DbGet and the other is HttpGet
CachedIdGenerator is a specific id generator object, holding currentSegmentId and nextSegmentId objects, and is responsible for the core process of nextId. The nextId is finally generated by the AtomicLong.andAndGet(delta) method.

Pros

Avoid database locking
Scalable and low latency solution

Cons:

It requires a complex setup

How to use

Library

https://github.com/oklog/ulid
https://github.com/sony/sonyflake
https://github.com/callicoder/java-snowflake
https://github.com/baidu/uid-generator
https://github.com/didi/tinyid
https://github.com/segmentio/ksuid

Hung's Notes

Discussion about this post