Shard DB: A CaskDB Implementation


Shard DB: A Cask DB implementation

I've been writing code for my entire adult life, but I've never felt like I truly understood some of the really complex tech that powers so much of our lives. For example, I'm endlessly fascinated that people can just write programs to procedurally generate graphical worlds. I don't even know how to begin generating graphics. But in the last 10 years, I've spent a lot of time working with various datastores, from MySQL to Elasticsearch to Redis, and I think I understand how all the pieces fit together at a high level.

That's why I was really excited to stumble upon Riak's CaskDB paper. Go ahead. Read it. It's only six pages long, and most of that consists of diagrams.

And I thought to myself... why not give it a shot? It's fairly straightforward!

All code below is in Go, which is my preferred language, although this code should be simple enough to implement in any language you choose.

How it Works

At its core, Cask is a hybrid file-based and memory-based key-value datastore. We store the keys and values to disk in append-only files, and we store the keys and value positions in memory. The files are append-only, which means writes are extremely fast. This does mean that files can get large, but that gets addressed a little later.

File-based data structure

Imagine we have the following data:

key value
name Maximus Pegasus

Given this particular key/value pair, we need to generate some information, then generate a fixed-length header to give us some metadata about the key and value. The fixed-length header is a total of 16 bytes:

Then we'll append the {header}{key}{value} as bytes in the data file, recording the position where this entry begins.

In the above example, we have a key size of len(name) -> 4 and a value size of len(value) -> 15. So our fixed header would be something along the lines of:

timestamp key_size value_size key value
1652215185 4 15 name Maximus Pegasus

The total length of this record would be:

16 (fixed-length header) + 4 (key) + 15 (value) = 31 bytes.

Note: as I mentioned earlier, the data store files are append-only, so as we add or update new data, they get tacked onto the end of the file. You wind up with something like this:

A visual representation of the file structure

Memory-based data structure

Now that we've established our file format, we need some way to look up this data. Riak's CaskDB spec has a pretty elegant solution: a key directory. At its core, it's basically a hashmap where the key is... well, the key, and the map contains the value size and the position of the record.

Imagine we have the following records:

key value
name Maximus Pegasus
job Chief Wing Repair Officer
age 23
legs 4
wings 2

Our file would look something like this:

timestamp key_size value_size key value
1652215185 4 15 name Maximus Pegasus
1652216123 3 25 job Chief Wing Repair Officer
1652218141 3 2 age 23
1652219662 4 1 legs 4
1652212031 5 1 wings 2

The length of each record would be 31, 44, 21, 21, and 22 bytes, respectively. This is important, so keep this in mind for now.

Now we need to construct our Key Directory. The Key Directory is basically a list of the keys, the value lengths, and the positions of those records. Those record lengths above? Those values are used to determine the ValuePosition. Given the above examples, you would have a file with records at these offsets:

key length offset (sum of previous lengths)
name 31 0
job 44 31
age 21 75
legs 21 96
wings 22 118

Why is this useful? Well, it means that when we look up the key age, we know exactly where to start parsing the record! It starts at byte 75!

The Code

OK, now that we know how this works, more or less, we can start writing some code. We'll break this up into five parts:

  1. The fixed-length header
  2. The variable-length key/values
  3. Writing to Disk
  4. Creating the Key Directory
  5. Read a value from disk

The Header

First, we'll define our fixed-length header:

type Header struct {
    Timestamp int64
    KeySize int32
    ValSize int32
}

Note: I know that the Riak Bitcask paper says the timestamp should be int32, but Go returns Unix timestamps in 64-bit integers, and I feel like that's acceptable, so I'm leaving it at 64 bits.

Next, we need to encode our Header into bytes. This is pretty easy with Go's bytes package:

func (h Header) Encode() []byte {
    buf := new(bytes.Buffer)

    binary.Write(buf, binary.LittleEndian, h.Timestamp)
	binary.Write(buf, binary.LittleEndian, h.KeySize)
	binary.Write(buf, binary.LittleEndian, h.ValSize)

	return buf.Bytes()
}

Decoding this is a little more complicated, but only because we're dealing with variable byte sizes. If we look at our Header, we have the following sizes:

field bits bytes
Timestamp 64 8
KeySize 32 4
ValSize 32 4

So, in order to decode, we need to do the following:

  1. Grab the next N bytes
  2. Decode from bytes to appropriate integer value
  3. Assign to struct

The code would look something like this:

func (h *Header) Decode(data []byte) {
	reader := bytes.NewReader(data)

	ts := make([]byte, 8)
	_, err := reader.ReadAt(ts, 0)
	if err != nil {
		panic(err)
	}

	ks := make([]byte, 4)
	_, err = reader.ReadAt(ks, 8)
	if err != nil {
		panic(err)
	}

	vs := make([]byte, 4)
	_, err = reader.ReadAt(vs, 12)
	if err != nil {
		panic(err)
	}

	h.Timestamp = int64(binary.LittleEndian.Uint64(ts))
	h.KeySize = int32(binary.LittleEndian.Uint32(ks))
	h.ValSize = int32(binary.LittleEndian.Uint32(vs))
}

First, we grab the first 8 bytes which represent the 64-bit timestamp. Then we grab the next 4 bytes, which represent the 32-bit key size, and then we grab the final 4 bytes, which represent the 32-bit value size. Now we have our fixed-length header!

Here's a quick unit test to sanity check ourselves:

func TestEncodeKV(t *testing.T) {
	kv := KeyValue{
		Timestamp: time.Now().Unix(),
		Key:       "name",
		Value:     "Maximus Pegasus",
	}

	size, data := kv.Encode()

	fmt.Println(size)
	result := KeyValue{}
	result.Decode(data)

	assert.Equal(t, result.Timestamp, kv.Timestamp)
	assert.Equal(t, result.Key, kv.Key)
	assert.Equal(t, result.Value, kv.Value)
	assert.Equal(t, result.Size, size)
}

The Variable Length Keys / Values

If you recall from the Bitcask paper and our explanation above, our entire record is composed of the fixed-length header that we just created, plus the actual key and actual value.

I called this a KeyValue item, and it looks something like this:


type KeyValue struct {
	Key       string
	Value     string
	Size      int
}

Note: the Size field is primarily for sanity-checking our work later. It's not strictly necessary here.

We'll encode this struct in exactly the same was awe did the header -- but this record would actually include the header as well, which we can construct from the key and value!

func (kv KeyValue) Encode() (int, []byte) {

	header := Header{
		Timestamp: kv.Timestamp,
		KeySize:   int32(len(kv.Key)),
		ValSize:   int32(len(kv.Value)),
	}

	buf := new(bytes.Buffer)
	buf.Write(header.Encode())
	buf.Write([]byte(kv.Key))
	buf.Write([]byte(kv.Value))

	return buf.Len(), buf.Bytes()
}

First, we create the Header based on the length of the key and the value and the timestamp and encode that to a byte slice. Once we have that, we simply put that at the beginning of our bytes buffer, then add the bytes for the key and value, and voila! Our record!

Decoding it is once again as simple as decoding the Header, with the exact same idea. This time, we'll decode the Header first, so that we know how many bytes long the key and value are, then skip ahead 16 bytes to grab the key and value.

func (kv *KeyValue) Decode(data []byte) {
	reader := bytes.NewReader(data)

	header := Header{}
	header.Decode(data)

	offset := int64(16)
	key := make([]byte, header.KeySize)
	_, err := reader.ReadAt(key, offset)
	if err != nil {
		panic(err)
	}
	offset += int64(len(key))

	value := make([]byte, header.ValSize)
	_, err = reader.ReadAt(value, offset)
	if err != nil {
		panic(err)
	}

	kv.Timestamp = header.Timestamp
	kv.Key = string(key)
	kv.Value = string(value)
	kv.Size = len(data)
}

Writing to Disk

Now that we can encode and decode our data, it's time to write that to disk. In the Bitcask paper, the rule is that you can only write to one file at a time -- the "active" file -- and so this part is fairly easy! Let's just write to a single file for now.

First, we'll load a file that will allow us to write in append mode:

func loadFile(path string) error {
    f, err := os.OpenFile(file, os.O_RDWR, fs.ModeAppend)
    if err != nil {
        handleError(err) // whatever this means to you
    }

    return f
}

Next, we'll create a function to set our key and value, then encode our key/value with the functions we created above. It would look something like this:

func Set(key string, value string) error {
    timestamp := time.Now().Unix()

    kv := KeyValue{
        Timestamp: timestamp,
        Key: key,
        Value: value,
    }

    size, data := kv.Encode()

    f := loadFile("test.db")
    position, _ := f.Seek(0, io.SeekEnd) // set the offset at the end of the file -- append-only, remember?
    n, err := f.Write(data) // write the data to disk
    if err != nil {
        handleError(err)
    }

    // Lets' sanity check ourselves to make sure we wrote what we thought we were going to write
    if n != size {
        return log.Fatalf("Oops! Failed to write expected length! Expected %d, got %d!", size, n)
    }

    // No problem!
    return nil
}

That's it! Now we've encoded our KeyValue and written it to disk.

Level Up Opportunity: to clean this up further, don't load the file on every Set command, but instead keep the file-handler in memory. I'll leave this as an exercise to you!

Create the Key Directory

We need to do one more thing before we move on, and that's add this key to the Key Directory. The Key Directory is the gizmo that lets us find the values we've already written. As discussed above, the Key Directory is fairly simple, and it looks like this:

var KeyDirectory = make(map[string]KeyEntry)

type KeyEntry struct {
    Timestamp int64
    ValueSize int32
    ValuePosition int64
}

The KeyDirectory is just a map[string]KeyEntry where the key is the... well, the key, and the value is a struct that tells us where in the active file to find that key. Pretty simple!

At the end of your Set function, simply add this code:

keyEntry := KeyEntry{
    Timestamp: timestamp,
    ValueSize: int32(size),
    ValuePosition: position,
}

KeyDirectory[key] = keyEntry

This simply creates the Key Directory entry and puts it into memory!

Your final code for the Set function should look like this:

func Set(key string, value string) error {
    timestamp := time.Now().Unix()

    kv := KeyValue{
        Timestamp: timestamp,
        Key: key,
        Value: value,
    }

    size, data := kv.Encode()

    f := loadFile("test.db")
    position, _ := f.Seek(0, io.SeekEnd) // set the offset at the end of the file -- append-only, remember?
    n, err := f.Write(data) // write the data to disk
    if err != nil {
        handleError(err)
    }

    // Lets' sanity check ourselves to make sure we wrote what we thought we were going to write
    if n != size {
        return log.Fatalf("Oops! Failed to write expected length! Expected %d, got %d!", size, n)
    }

    keyEntry := KeyEntry{
        Timestamp: timestamp,
        ValueSize: int32(size),
        ValuePosition: position,
    }

    KeyDirectory[key] = keyEntry

    // No problem!
    return nil
}

Read value from disk

Now that we've written to disk and created our Key Directory, actually grabbing the value from disk couldn't be simpler. It follows three basic steps:

  1. Grab the KeyEntry from the key directory map
  2. Seek to the ValuePosition
  3. Decode the KeyValue record starting from that position!

Note: technically, you could grab just the value since we know the size, but since we've already written the code to decode the whole record, why not just do that as well? It's going to be just as fast in 99.999% of cases.

You could probably write this code yourself, but just in case, here's how I might do it:

func Get(key string) string {
    f := loadFile("test.db")
    keyEntry, ok := KeyDirectory[key]
    if !ok {
        log.Fatalf("Invalid key %s", key)
    }

    f.Seek(0, int(keyEntry.ValuePosition)) // jump to the offset of the record
    data, := make([]byte, keyEntry.ValueSize)
    _, err := f.Read(data)
    if err != nil {
        panic(err)
    }

    entry := &KeyValue{}
    entry.Decode(data)

    return string(entry.Value)
}

Next Steps

Now that you have a functioning key/value store that persists data to disk and keeps its access lookup in memory, you're pretty much ready to rock. There are some optimizations to make, and the Riak Bitcask paper has some ideas on how to improve it. Here are just a few of them: