The ID Problem
For a large part of the history of databases, developers have relied upon a tried-and-true method of generating primary keys: the auto-incrementing ID. The way this works is fairly simple and is exactly the way it sounds: every new row increments the ID by 1.
- Row 1 gets ID 1
- Row 2 gets ID 2
- Row 3 gets ID 3
- Row N gets ID N
This works effectively forever, up until you hit MAXINT, which is somewhere around 2.1 billion for a 32-bit integer and virtually unlimited in the 64-bit space.
As I mentioned, this works brilliantly... except for one thing: it makes IDs extremely predictable.
What's wrong with that? Well, imagine that you are a malicious actor. You want to disrupt someone else's account. You create your own account, and you realize you are assigned User ID 5000. Maybe you see this in API responses, maybe it's in the URL, maybe it's explicitly defined on the account page somewhere. Regardless, you know you are user 5000.
This gives you a bit of information, including but not limited to the following:
- There are at a minimum 5000 users
- You are the 5000th user
- There is presumably a user with an ID of 4999, 4998, 4997, 4996,... and so on
Imagine you go to your Direct Messages page, then look at the network tab. You see a GET request to
/api/users/5000/messages. In the response, you see an array of ostensibly private messages.
"Hmm," you think. "What if I make an identical request and swap 5000 out for 4999?"
If the site hasn't taken the appropriate precautions, you may be able to see another user's "private" messages! No bueno.
But even if they have solved for that, you still know they have 5000 users, and as perhaps a competitor, you may be able to glean additional information about the company's size, revenue, and growth based off of that number. Still not good for the company.
UUIDs and GUIDs
One possible way to solve this problem is to simply assign a UUID to each entity. Instead of an integer like 5000, you assign a UUID like
ba0f5a59-0299-47c9-9890-a71d6d82b191. For UUIDs, this is basically guaranteed to be unique. The length and complexity of the UUID means that guessing another user's UUID is exceedingly unlikely. Malicious attempts to scrape data would be easily detected.
But there are problems with this approach. For one thing, it's a pretty complex thing to look at. What are each section? Is it useful to talk about User
ba0f5a59-0299-47c9-9890-a71d6d82b191 with other humans? When customers see an error message with UUIDs, is it easy for them to read them back to Customer Support?
There's a time and place for UUIDs, but in this case, perhaps there's a better solution.
Enter the Snowflake ID. First developed by Twitter in July of 2010, the Snowflake ID is a deterministic-but-pseudo-random 64-bit integer that is roughly sortable. We'll come back to that in a moment.
There are 64 bits in a Snowflake ID. That is, in binary, there are sixty-four ones and zeroes back to back. This is a large enough space that we will practically never run out of integer space. The total number of possible values is 2^64, which is 18,446,744,073,709,551,616. You could assign each person in the world a unique ID over 2 BILLION times before having to repeat a number. There are very few data sets in the real world that approach these kinds of numbers. Therefore, 64 bits is a pretty safe range of integers to use for an ID space like this.
Because we're treating them as 64 bits and not a single integer, we can split up those bits into different roles. This is effectively how it works:
The first 41 bits are for the timestamp. As I mentioned earlier, Snowflake IDs are roughly sortable. The key to this is to ensure that the first 41 bits are time-based! As time goes on, this number goes higher and higher. That means an ID generated now should be lower than an ID generated 5 seconds from now.
How granular is the time? Well, with 41 bits we can basically only get to milliseconds since epoch, which in human terms is milliseconds since January 1, 1970. Don't ask me why they picked that date. Just roll with it.
The next 10 bits are for the machine ID, which is unique per machine. This is something that helps prevent clashes.
The next 12 bits are for the sequence number, which is also defined on a per-machine basis. If you have two servers generating IDs, this guarantees that if you have two IDs generated at the exact same millisecond, they'll wind up with different numbers. Very important.
The Lost Bit??
Savvy readers will notice that 41+10+12=63, so we appear to be missing a bit. That initial bit is reserved for signed vs unsigned integers. Technically, it's a 63-bit integer with a 1-bit positive-or-negative sign.
The Final Number
Now you append each section to each other, so now you wind up with 1+41+10+12 = 64 bits. Then we can take those 64 bits and convert into an integer. And that's our Snowflake ID!
How to Do it
Now that we know how Snowflake IDs are generated, let's write some code to do it.