on
Shard DB: A CaskDB Implementation
Shard DB: A Cask DB implementation
I've been writing code for my entire adult life, but I've never felt like I truly understood some of the really complex tech that powers so much of our lives. For example, I'm endlessly fascinated that people can just write programs to procedurally generate graphical worlds. I don't even know how to begin generating graphics. But in the last 10 years, I've spent a lot of time working with various datastores, from MySQL to Elasticsearch to Redis, and I think I understand how all the pieces fit together at a high level.
That's why I was really excited to stumble upon Riak's CaskDB paper. Go ahead. Read it. It's only six pages long, and most of that consists of diagrams.
And I thought to myself... why not give it a shot? It's fairly straightforward!
All code below is in Go, which is my preferred language, although this code should be simple enough to implement in any language you choose.
How it Works
At its core, Cask is a hybrid file-based and memory-based key-value datastore. We store the keys and values to disk in append-only files, and we store the keys and value positions in memory. The files are append-only, which means writes are extremely fast. This does mean that files can get large, but that gets addressed a little later.
File-based data structure
Imagine we have the following data:
key | value |
---|---|
name | Maximus Pegasus |
Given this particular key/value pair, we need to generate some information, then generate a fixed-length header to give us some metadata about the key and value. The fixed-length header is a total of 16 bytes:
- Timestamp (64 bits / 8 bytes)
- Key Size (32 bits / 4 bytes)
- Value Size (32 bits / 4 bytes)
Then we'll append the {header}{key}{value}
as bytes in the data file, recording the position where this entry begins.
In the above example, we have a key size of len(name) -> 4
and a value size of len(value) -> 15
. So our fixed header would be something along the lines of:
timestamp | key_size | value_size | key | value |
---|---|---|---|---|
1652215185 | 4 | 15 | name | Maximus Pegasus |
The total length of this record would be:
16 (fixed-length header) + 4 (key) + 15 (value) = 31 bytes
.
Note: as I mentioned earlier, the data store files are append-only, so as we add or update new data, they get tacked onto the end of the file. You wind up with something like this:
Memory-based data structure
Now that we've established our file format, we need some way to look up this data. Riak's CaskDB spec has a pretty elegant solution: a key directory. At its core, it's basically a hashmap where the key is... well, the key, and the map contains the value size and the position of the record.
Imagine we have the following records:
key | value |
---|---|
name | Maximus Pegasus |
job | Chief Wing Repair Officer |
age | 23 |
legs | 4 |
wings | 2 |
Our file would look something like this:
timestamp | key_size | value_size | key | value |
---|---|---|---|---|
1652215185 | 4 | 15 | name | Maximus Pegasus |
1652216123 | 3 | 25 | job | Chief Wing Repair Officer |
1652218141 | 3 | 2 | age | 23 |
1652219662 | 4 | 1 | legs | 4 |
1652212031 | 5 | 1 | wings | 2 |
The length of each record would be 31, 44, 21, 21, and 22 bytes, respectively. This is important, so keep this in mind for now.
Now we need to construct our Key Directory. The Key Directory is basically a list of the keys, the value lengths, and the positions of those records. Those record lengths above? Those values are used to determine the ValuePosition
. Given the above examples, you would have a file with records at these offsets:
key | length | offset (sum of previous lengths) |
---|---|---|
name | 31 | 0 |
job | 44 | 31 |
age | 21 | 75 |
legs | 21 | 96 |
wings | 22 | 118 |
Why is this useful? Well, it means that when we look up the key age
, we know exactly where to start parsing the record! It starts at byte 75!
The Code
OK, now that we know how this works, more or less, we can start writing some code. We'll break this up into five parts:
- The fixed-length header
- The variable-length key/values
- Writing to Disk
- Creating the Key Directory
- Read a value from disk
The Header
First, we'll define our fixed-length header:
type Header struct {
Timestamp int64
KeySize int32
ValSize int32
}
Note: I know that the Riak Bitcask paper says the timestamp should be int32, but Go returns Unix timestamps in 64-bit integers, and I feel like that's acceptable, so I'm leaving it at 64 bits.
Next, we need to encode our Header
into bytes. This is pretty easy with Go's bytes
package:
func (h Header) Encode() []byte {
buf := new(bytes.Buffer)
binary.Write(buf, binary.LittleEndian, h.Timestamp)
binary.Write(buf, binary.LittleEndian, h.KeySize)
binary.Write(buf, binary.LittleEndian, h.ValSize)
return buf.Bytes()
}
Decoding this is a little more complicated, but only because we're dealing with variable byte sizes. If we look at our Header, we have the following sizes:
field | bits | bytes |
---|---|---|
Timestamp | 64 | 8 |
KeySize | 32 | 4 |
ValSize | 32 | 4 |
So, in order to decode, we need to do the following:
- Grab the next N bytes
- Decode from bytes to appropriate integer value
- Assign to struct
The code would look something like this:
func (h *Header) Decode(data []byte) {
reader := bytes.NewReader(data)
ts := make([]byte, 8)
_, err := reader.ReadAt(ts, 0)
if err != nil {
panic(err)
}
ks := make([]byte, 4)
_, err = reader.ReadAt(ks, 8)
if err != nil {
panic(err)
}
vs := make([]byte, 4)
_, err = reader.ReadAt(vs, 12)
if err != nil {
panic(err)
}
h.Timestamp = int64(binary.LittleEndian.Uint64(ts))
h.KeySize = int32(binary.LittleEndian.Uint32(ks))
h.ValSize = int32(binary.LittleEndian.Uint32(vs))
}
First, we grab the first 8 bytes which represent the 64-bit timestamp. Then we grab the next 4 bytes, which represent the 32-bit key size, and then we grab the final 4 bytes, which represent the 32-bit value size. Now we have our fixed-length header!
Here's a quick unit test to sanity check ourselves:
func TestEncodeKV(t *testing.T) {
kv := KeyValue{
Timestamp: time.Now().Unix(),
Key: "name",
Value: "Maximus Pegasus",
}
size, data := kv.Encode()
fmt.Println(size)
result := KeyValue{}
result.Decode(data)
assert.Equal(t, result.Timestamp, kv.Timestamp)
assert.Equal(t, result.Key, kv.Key)
assert.Equal(t, result.Value, kv.Value)
assert.Equal(t, result.Size, size)
}
The Variable Length Keys / Values
If you recall from the Bitcask paper and our explanation above, our entire record is composed of the fixed-length header that we just created, plus the actual key and actual value.
I called this a KeyValue
item, and it looks something like this:
type KeyValue struct {
Key string
Value string
Size int
}
Note: the Size
field is primarily for sanity-checking our work later. It's not strictly necessary here.
We'll encode this struct in exactly the same was awe did the header -- but this record would actually include the header as well, which we can construct from the key and value!
func (kv KeyValue) Encode() (int, []byte) {
header := Header{
Timestamp: kv.Timestamp,
KeySize: int32(len(kv.Key)),
ValSize: int32(len(kv.Value)),
}
buf := new(bytes.Buffer)
buf.Write(header.Encode())
buf.Write([]byte(kv.Key))
buf.Write([]byte(kv.Value))
return buf.Len(), buf.Bytes()
}
First, we create the Header based on the length of the key and the value and the timestamp and encode that to a byte slice. Once we have that, we simply put that at the beginning of our bytes buffer, then add the bytes for the key and value, and voila! Our record!
Decoding it is once again as simple as decoding the Header, with the exact same idea. This time, we'll decode the Header first, so that we know how many bytes long the key and value are, then skip ahead 16 bytes to grab the key and value.
func (kv *KeyValue) Decode(data []byte) {
reader := bytes.NewReader(data)
header := Header{}
header.Decode(data)
offset := int64(16)
key := make([]byte, header.KeySize)
_, err := reader.ReadAt(key, offset)
if err != nil {
panic(err)
}
offset += int64(len(key))
value := make([]byte, header.ValSize)
_, err = reader.ReadAt(value, offset)
if err != nil {
panic(err)
}
kv.Timestamp = header.Timestamp
kv.Key = string(key)
kv.Value = string(value)
kv.Size = len(data)
}
Writing to Disk
Now that we can encode and decode our data, it's time to write that to disk. In the Bitcask paper, the rule is that you can only write to one file at a time -- the "active" file -- and so this part is fairly easy! Let's just write to a single file for now.
First, we'll load a file that will allow us to write in append mode:
func loadFile(path string) error {
f, err := os.OpenFile(file, os.O_RDWR, fs.ModeAppend)
if err != nil {
handleError(err) // whatever this means to you
}
return f
}
Next, we'll create a function to set our key and value, then encode our key/value with the functions we created above. It would look something like this:
func Set(key string, value string) error {
timestamp := time.Now().Unix()
kv := KeyValue{
Timestamp: timestamp,
Key: key,
Value: value,
}
size, data := kv.Encode()
f := loadFile("test.db")
position, _ := f.Seek(0, io.SeekEnd) // set the offset at the end of the file -- append-only, remember?
n, err := f.Write(data) // write the data to disk
if err != nil {
handleError(err)
}
// Lets' sanity check ourselves to make sure we wrote what we thought we were going to write
if n != size {
return log.Fatalf("Oops! Failed to write expected length! Expected %d, got %d!", size, n)
}
// No problem!
return nil
}
That's it! Now we've encoded our KeyValue
and written it to disk.
Level Up Opportunity: to clean this up further, don't load the file on every Set
command, but instead keep the file-handler in memory. I'll leave this as an exercise to you!
Create the Key Directory
We need to do one more thing before we move on, and that's add this key to the Key Directory. The Key Directory is the gizmo that lets us find the values we've already written. As discussed above, the Key Directory is fairly simple, and it looks like this:
var KeyDirectory = make(map[string]KeyEntry)
type KeyEntry struct {
Timestamp int64
ValueSize int32
ValuePosition int64
}
The KeyDirectory
is just a map[string]KeyEntry
where the key is the... well, the key, and the value is a struct that tells us where in the active file to find that key. Pretty simple!
At the end of your Set
function, simply add this code:
keyEntry := KeyEntry{
Timestamp: timestamp,
ValueSize: int32(size),
ValuePosition: position,
}
KeyDirectory[key] = keyEntry
This simply creates the Key Directory entry and puts it into memory!
Your final code for the Set
function should look like this:
func Set(key string, value string) error {
timestamp := time.Now().Unix()
kv := KeyValue{
Timestamp: timestamp,
Key: key,
Value: value,
}
size, data := kv.Encode()
f := loadFile("test.db")
position, _ := f.Seek(0, io.SeekEnd) // set the offset at the end of the file -- append-only, remember?
n, err := f.Write(data) // write the data to disk
if err != nil {
handleError(err)
}
// Lets' sanity check ourselves to make sure we wrote what we thought we were going to write
if n != size {
return log.Fatalf("Oops! Failed to write expected length! Expected %d, got %d!", size, n)
}
keyEntry := KeyEntry{
Timestamp: timestamp,
ValueSize: int32(size),
ValuePosition: position,
}
KeyDirectory[key] = keyEntry
// No problem!
return nil
}
Read value from disk
Now that we've written to disk and created our Key Directory, actually grabbing the value from disk couldn't be simpler. It follows three basic steps:
- Grab the
KeyEntry
from the key directory map - Seek to the
ValuePosition
- Decode the
KeyValue
record starting from that position!
Note: technically, you could grab just the value since we know the size, but since we've already written the code to decode the whole record, why not just do that as well? It's going to be just as fast in 99.999% of cases.
You could probably write this code yourself, but just in case, here's how I might do it:
func Get(key string) string {
f := loadFile("test.db")
keyEntry, ok := KeyDirectory[key]
if !ok {
log.Fatalf("Invalid key %s", key)
}
f.Seek(0, int(keyEntry.ValuePosition)) // jump to the offset of the record
data, := make([]byte, keyEntry.ValueSize)
_, err := f.Read(data)
if err != nil {
panic(err)
}
entry := &KeyValue{}
entry.Decode(data)
return string(entry.Value)
}
Next Steps
Now that you have a functioning key/value store that persists data to disk and keeps its access lookup in memory, you're pretty much ready to rock. There are some optimizations to make, and the Riak Bitcask paper has some ideas on how to improve it. Here are just a few of them:
- Organize your code to keep the active file handler in memory instead of loading it on get/set.
- Since the file is append-only, you will eventually get duplicate records and file sizes can explode. Once the file hits a certain size, migrate the data to a newer, smaller file.
- Create a Hint File (see Riak Bitcask docs for more explanation) to speed up app startup times.
- Allow reads from older files instead of the only active file. (Remember: writes will only go to the active file handler!).
- Right now our Bitcask implementation only supports string values. Modify this to support multiple value types: ints, floats, strings, bytes, maybe even complex data structures.