Explaining Varints

Abstract: Discussion about sending data across the wire in fixed-width and variable-width formats; how varints work; implementing varints in Rust.

How can a computer send data to another computer? There are a bunch of different perspectives on this question - networking, hardware, security, communication protocols, and a million others. I want to focus on just one: a very simple serialization perspective. Serialization (and, going the other way, deserialization) deals with transforming data into a format that can be transmitted across a network. The question now becomes: how can a computer format data to send across the wire (read: network) in a way that another computer can understand?

Let’s break it down to a simple use case - how does a computer send a single non-negative integer to another computer? Let’s assume we have two computers that communicate with each other by passing sequences of bits. Let’s assume they know the mechanics of passing messages to each other (they know how to send and receive on a network, they know how to parse incoming bits, etc) and that the network is stable (bits arrive in order, bits are always delivered).

Sending a single bit

So we have two computers, A and B. A will always send to B. All A wants to do is send single non-negative integers to B.

B <- A

A wants to send the number 1 to B. This is pretty simple - we can capture the value of 1 in a single bit.

B <- A
bits: 1

Sending multiple bits

Now A wants to send the number 8 to B. This is also pretty simple - we need to expand the number of bits we’re sending, but it’s still only four.

B <- A
bits: 1000

Unfortunately, there’s a little snag. What happens if the 0th bit (in this case, the last 0) doesn’t make it to computer B? While A is putting bits on the network, it crashes before it can output the last bit. From B’s perspective, it has received the bits 100, which is a valid non-negative integer.

We haven’t defined a standard for how large our messages will be, so B never knows when a message should end and assumes that any message it receives is valid and complete. From B’s perspective, this is a perfectly fine transaction and it doesn’t know that there were further bits that did not complete the journey. Data has been transferred between the two computers, with no ability by A or B to know that B has misinterpreted the data.

Fixed-width messages

One way to deal with this problem is to set a contract between A and B. We can define a certain number of bits required by all messages between A and B so that B can determine which messages are valid and which messages are incomplete.

One straightforward method is to define a fixed size of the messages being sent to B. An easy size to reach for is a byte. A and B can both know that each message will be the size of a byte - A will always send the non-negative integers encased in the size of a byte, and B will know that only once it receives an entire byte is a message complete. Sending 79 looks like the following:

B <- A
bits: 0100 1111

If B only receives six bits, it knows something is wrong and can handle that error accordingly.

Sending a 32-bit fixed-width message

The obvious issue with a byte-sized message is that 8 bits really isn’t a lot. That gives us a range of 256 non-negative integers to work with. We probably need something a bit bigger. Taking a cue from Java’s default integer size, we can settle on a fixed size of 32 bits. That gives us 4,294,967,296 non-negative integers - that’s a pretty big number! This comes out to a 4 byte message. Sending 1,423,998,764 looks like this:

B <- A
bits: 01010100 11100000 01111111 00101100
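
As a rough sketch, this fixed-width contract maps neatly onto Rust’s to_be_bytes/from_be_bytes helpers. The function names and the big-endian byte order here are my own assumptions - the contract could just as easily pick little-endian:

    // A sends a u32 as exactly four bytes (big-endian by convention here).
    fn encode_fixed32(n: u32) -> [u8; 4] {
        n.to_be_bytes()
    }

    // B only accepts a message once it has all four bytes.
    fn decode_fixed32(bytes: &[u8]) -> Option<u32> {
        if bytes.len() != 4 {
            return None; // incomplete (or overlong) message
        }
        Some(u32::from_be_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]))
    }

    fn main() {
        let msg = encode_fixed32(1_423_998_764);
        assert_eq!(msg, [0b0101_0100, 0b1110_0000, 0b0111_1111, 0b0010_1100]);
        assert_eq!(decode_fixed32(&msg), Some(1_423_998_764));
        assert_eq!(decode_fixed32(&msg[..3]), None); // B can spot a short message
    }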

Limits of using a 32-bit fixed-width message

Now, let’s send the number 1 again. What does this look like?

B <- A
bits: 00000000 00000000 00000000 00000001

That’s a lot of extra bits just to send a single bit worth of data! Unfortunately, we can’t do much here, as both computers expect messages to be a fixed width of 32 bits. Even larger numbers such as 50,000 only need half of the bits in the 32-bit message:

B <- A
bits: 00000000 00000000 11000011 01010000

We also reach a limit on how big a number sent over the wire can be. If we want to send a number bigger than 4,294,967,295, it simply won’t fit in a 32-bit message.

So, it boils down to a lot of wasted bits for small numbers and a limit on the maximum magnitude of numbers that we can send using a fixed-width serialization format.

Variable-width messages

One solution to our fixed-width woes is to get a little smarter with the contract we define between A and B. So far we’ve defined a contract that specifies the exact number of bytes a whole message will be - our earlier example was a 32-bit, or 4-byte, message. We can do better by slightly changing the contract we already defined and adding another small component:

  1. Define the size of each chunk of a message, rather than the entire message
  2. Use a single bit in each chunk of a message to indicate whether there is another chunk

Rather than defining the required number of bits for an entire message, we can break the message into smaller parts and require a smaller number of bits for each sub-message. We can then chain together sub-messages to be as short or as long as we want, giving us the flexibility we lacked with fixed-width messages. Do we need a number in the undecillion range? Great - we can specify that with a bunch of sub-messages. Need to send a measly one? Awesome! Just a single sub-message will do.

But what does this format actually look like? A common format used for this type of contract is a variable-length quantity, where:

  1. The defined size of each chunk of a message is 8 bits, or a byte
  2. The single bit in each chunk of a message is the most-significant bit, or the left-hand bit.

This means that each chunk of a message has a range of 128 (2⁷) non-negative integers, since 1 bit is used to indicate whether another chunk follows.

There are a few dialects of variable-length encoding that differ in small aspects, but for the rest of this post we’re going to focus specifically on varints for unsigned integers¹. Why varints? Well, Google Protocol Buffers uses varints to serialize integers in their protobuf messages.

Encoding Varints

Varints follow the rules for variable-length quantities specified above - byte-sized message chunks, MSB indicating an additional chunk - but order the chunks from least-significant to most-significant.

An example will help clarify this. Let’s use 247398 as an integer we want to encode as a varint. The algorithm for encoding a non-negative integer into a varint is as follows:

1) Convert the integer to its binary format.

247398
111100011001100110

2) Slice into 7-bit chunks.

1111 0001100 1100110

3) Pad each chunk out to a full byte: set the added MSB to 1 for every chunk except the most-significant, or left-hand, chunk, which is zero-padded to a full byte instead.

00001111 10001100 11100110

4) Flip the order of the chunks, so that they’re ordered from least-significant to most-significant.

00001111 10001100 11100110
   (1)      (2)      (3)

to

11100110 10001100 00001111
   (3)      (2)      (1)

And we have our varint for 247398! The first two bytes have the most-significant bit set to 1, indicating that more chunks are coming. The final byte has the most-significant bit set to 0, indicating it is the final chunk.
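
Here’s one way that encoding algorithm might look in Rust - a minimal sketch, assuming a u64 input and using bit operations to carve out the 7-bit chunks:

    // Encode a non-negative integer as a varint: 7 payload bits per byte,
    // least-significant chunk first, MSB set on every byte except the last.
    fn encode_varint(mut n: u64) -> Vec<u8> {
        let mut out = Vec::new();
        loop {
            let chunk = (n & 0x7F) as u8; // grab the lowest 7 bits
            n >>= 7;
            if n == 0 {
                out.push(chunk); // final chunk: MSB stays 0
                return out;
            }
            out.push(chunk | 0x80); // more chunks follow: set the MSB
        }
    }

    fn main() {
        // 247398 -> 11100110 10001100 00001111, matching the worked example above
        assert_eq!(encode_varint(247_398), vec![0b1110_0110, 0b1000_1100, 0b0000_1111]);
        // 1 fits in a single chunk
        assert_eq!(encode_varint(1), vec![0b0000_0001]);
    }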

Decoding Varints

To decode this varint, just reverse the algorithm:

1) Flip the order of the bytes, so that they’re ordered from most-significant to least-significant.

11100110 10001100 00001111
   (3)      (2)      (1)

to

00001111 10001100 11100110
   (1)      (2)      (3)

2) Strip the most-significant bit for each chunk.

0001111 0001100 1100110

3) Concatenate the 7-bit chunks.

000111100011001100110

4) Convert binary to integer form.

000111100011001100110
247398

And we’re back to our original integer, 247398!
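
And a matching decoder sketch, under the same assumptions as the encoder above. It returns None if the bytes run out before a chunk with a cleared MSB appears, and it does not guard against inputs too large for a u64 - it’s only a sketch:

    // Decode a varint: strip the MSB from each byte and shift the 7-bit
    // payload into place, stopping at the first byte whose MSB is 0.
    fn decode_varint(bytes: &[u8]) -> Option<u64> {
        let mut result: u64 = 0;
        for (i, &byte) in bytes.iter().enumerate() {
            let payload = (byte & 0x7F) as u64; // drop the continuation bit
            result |= payload << (7 * i);       // chunk i holds bits 7*i ..= 7*i + 6
            if byte & 0x80 == 0 {
                return Some(result); // MSB clear: this was the final chunk
            }
        }
        None // every byte had its MSB set, so the message is incomplete
    }

    fn main() {
        assert_eq!(decode_varint(&[0b1110_0110, 0b1000_1100, 0b0000_1111]), Some(247_398));
        assert_eq!(decode_varint(&[0b1110_0110, 0b1000_1100]), None); // truncated varint
    }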

Varints in Google Protocol Buffers

Varints are particularly useful in Google Protocol Buffer format. They’re used in a variety of situations, such as:

  1. Enums use varint encoding. This means passing enum values between 0 and 127 around is really efficient, since each one can be packed into a single byte.
  2. Strings, Embedded Messages, and all other “wire type 2” values use a varint to specify their byte length.
  3. Keys (also known as tags) for each message use varints. A key is defined as the field number + wire type, where the wire type is the last three bits of the varint.

This is so cool! We can use varints to encode the field number and type of a field, whose value may itself be a varint. We also use varints to encode the size of a field, which lets clients efficiently skip values they don’t know or care about, without limiting how big a field’s value can be.
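
As a concrete illustration of that last point (a sketch of the key layout described above, not code from the protobuf library), the key is just the field number shifted left by three bits with the wire type in the low three bits - and since keys are usually small numbers, the key itself typically fits in a single-byte varint:

    // A protobuf key packs the field number and wire type together:
    // key = (field_number << 3) | wire_type, and the key is then varint-encoded.
    fn make_key(field_number: u64, wire_type: u64) -> u64 {
        (field_number << 3) | wire_type
    }

    fn main() {
        // Field 1 with wire type 0 (varint) -> key 0x08, a single-byte varint.
        assert_eq!(make_key(1, 0), 0x08);
        // Field 2 with wire type 2 (length-delimited, e.g. a string) -> key 0x12.
        assert_eq!(make_key(2, 2), 0x12);
    }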

Rust Implementation

I wanted to write a quick little program to convert non-negative integers to their varint encoding and display the result in hexadecimal. This was my first time using Rust (an experience worthy of a separate blog post), and it was quite fun to implement the encoding algorithm above. Whether or not I’ll implement the decoding step is TBD.

Check it out for yourself!
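
For a rough idea of what such a program might look like, here’s a minimal sketch (the encode_varint helper from the encoding section is repeated so the snippet stands on its own; the linked implementation may differ):

    // Convert a non-negative integer to its varint bytes and print them in hex.
    fn encode_varint(mut n: u64) -> Vec<u8> {
        let mut out = Vec::new();
        loop {
            let chunk = (n & 0x7F) as u8;
            n >>= 7;
            if n == 0 {
                out.push(chunk);
                return out;
            }
            out.push(chunk | 0x80);
        }
    }

    fn main() {
        // Take the number from the command line, falling back to the example above.
        let n: u64 = std::env::args()
            .nth(1)
            .and_then(|s| s.parse().ok())
            .unwrap_or(247_398);
        let hex: Vec<String> = encode_varint(n).iter().map(|b| format!("{:02x}", b)).collect();
        println!("{}", hex.join(" ")); // 247398 -> "e6 8c 0f"
    }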

Conclusion

Variable-length integer encoding is a pivotal step in efficiently passing data around over the wire. It enables flexibility in dealing with large numbers and considerable space savings when sending small integers. Serialization formats are not one-size-fits-all, but varints play a particularly important role in ensuring Google Protocol Buffer messages are efficient in size and information.


  1. Why just unsigned integers? Well, this is a hard enough concept, and introducing signed integers adds a few interesting but more complex wrinkles.