Variable-length quantity

A variable-length quantity (VLQ) is a universal code that uses an arbitrary number of binary octets (eight-bit bytes) to represent an arbitrarily large integer. A VLQ is essentially a base-128 representation of an unsigned integer, with the eighth bit of each octet used to mark continuation. See the example below.

Applications and history

Base-128 compression is known by many names: VB (Variable Byte), VByte, Varint, VInt, EncInt, etc.
A variable-length quantity was defined for use in the standard MIDI file format to save space on resource-constrained systems, and is also used in the later Extensible Music Format.
Base-128 is also used in ASN.1 BER encoding to encode tag numbers and Object Identifiers. It is also used in the WAP environment, where it is called variable length unsigned integer or uintvar. The DWARF debugging format defines a variant called LEB128, where the least significant group of 7 bits is encoded in the first byte and the most significant bits are in the last byte. Google Protocol Buffers use a similar format for a compact representation of integer values, as do Oracle Portable Object Format and the Microsoft .NET Framework "7-bit encoded int" in the BinaryReader and BinaryWriter classes.
It is also used extensively in web browsers for source maps, which contain many integer line and column number mappings, to keep the size of the map to a minimum.
Variable-width integers in LLVM use a similar principle. The encoding chunks are little-endian and need not be 8 bits in size. The LLVM documentation describes a field that uses 4-bit chunks, each consisting of 1 continuation bit and 3 payload bits.

General structure

The encoding assumes an octet (an 8-bit byte) whose most significant bit (MSB), also commonly known as the sign bit, is reserved to indicate whether another VLQ octet follows. Each octet has the layout A Bₙ, where A is bit 7 and Bₙ occupies bits 6 to 0.
If A is 0, then this is the last VLQ octet of the integer. If A is 1, then another VLQ octet follows.
B is a 7-bit number and n is the position of the VLQ octet, where B₀ is the least significant. The VLQ octets are arranged most significant first in a stream.
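The structure above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation; the function names `encode_vlq` and `decode_vlq` are hypothetical and chosen only for this example.

```python
def encode_vlq(n: int) -> bytes:
    """Encode a non-negative integer as a VLQ, most significant octet first."""
    if n < 0:
        raise ValueError("basic VLQ is only defined for unsigned integers")
    # The last octet carries the lowest 7 bits with the continuation bit clear.
    octets = [n & 0x7F]
    n >>= 7
    # Every earlier octet carries 7 more bits with the continuation bit set.
    while n:
        octets.append(0x80 | (n & 0x7F))
        n >>= 7
    return bytes(reversed(octets))


def decode_vlq(data: bytes) -> int:
    """Decode one VLQ from the start of data."""
    value = 0
    for octet in data:
        value = (value << 7) | (octet & 0x7F)
        if not octet & 0x80:  # continuation bit clear: last octet
            return value
    raise ValueError("truncated VLQ: last octet has its continuation bit set")
```

Because the decoder simply accumulates 7-bit groups, leading 0x80 octets contribute nothing to the value, which is the zero-padding redundancy discussed under Variants.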

Variants

The general VLQ encoding is simple, but in basic form is only defined for unsigned integers, and is somewhat redundant, since prepending 0x80 octets corresponds to zero padding. There are various signed number representations to handle negative numbers, and techniques to remove the redundancy.

Group Varint Encoding

Google developed Group Varint Encoding (GVE) after observing that traditional VLQ encoding incurs many CPU branches during decompression. GVE uses a single byte as a header for four variable-length uint32 values. The header byte holds four 2-bit numbers, each giving the storage length of one of the following four uint32s. Such a layout eliminates the need to check and remove VLQ continuation bits, so data bytes can be copied directly to their destination. This reduces CPU branches, making GVE faster than VLQ on modern pipelined CPUs.
PrefixVarint is a similar design but with a uint64 maximum. It is said to have "been invented multiple times independently". It can be extended into a chained version with arbitrarily many continuations.
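The Group Varint layout described above can be sketched as follows. The text does not fix the bit order inside the header or the byte order of the values, so this sketch assumes that the first value's length sits in the two highest header bits, that each 2-bit field stores the length minus one, and that values are stored little-endian. The function names are hypothetical.

```python
def encode_group_varint(values):
    """Pack exactly four uint32 values behind a one-byte length header."""
    if len(values) != 4:
        raise ValueError("group varint encodes exactly four values")
    header = 0
    body = bytearray()
    for v in values:
        if not 0 <= v < 2**32:
            raise ValueError("value does not fit in uint32")
        length = max(1, (v.bit_length() + 7) // 8)  # 1 to 4 bytes
        header = (header << 2) | (length - 1)       # assumed: length-1 per field
        body += v.to_bytes(length, "little")        # assumed: little-endian
    return bytes([header]) + bytes(body)


def decode_group_varint(data):
    """Unpack four uint32 values; no continuation bits need to be tested."""
    header = data[0]
    pos = 1
    values = []
    for shift in (6, 4, 2, 0):
        length = ((header >> shift) & 0b11) + 1
        values.append(int.from_bytes(data[pos:pos + length], "little"))
        pos += length
    return values
```

The decoder reads the header once and then copies whole byte runs, which is where the reduction in per-byte branching comes from.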

Signed Numbers

Sign bit

Negative numbers can be handled using a sign bit, which only needs to be present in the first octet.
In the data format for Unreal Packages used by the Unreal Engine, a variable-length quantity scheme called Compact Indices is used. The only difference in this encoding is that the first VLQ octet has bit 6 (the second most significant bit) reserved to indicate whether the encoded integer is positive or negative. Every subsequent VLQ octet follows the general structure. In the first octet, A is bit 7, B is bit 6, and the number chunk C occupies bits 5 to 0.
If A is 0, then this is the last VLQ octet of the integer. If A is 1, then another VLQ octet follows.
If B is 0, then the VLQ represents a positive integer. If B is 1, then the VLQ represents a negative number.
C is the number chunk being encoded and n is the position of the VLQ octet, where C₀ is the least significant. The VLQ octets are arranged most significant first in a stream.

Zigzag encoding

An alternative way to encode negative numbers is to use the least significant bit for the sign. This is notably done in Google Protocol Buffers, and is known as zigzag encoding for signed integers. One can encode the numbers so that (in binary) encoded 0 corresponds to 0, 1 to −1, 10 to 1, 11 to −2, 100 to 2, etc.: counting up alternates between nonnegative and negative, whence the name "zigzag encoding". Concretely, transform the integer n as (n << 1) ^ (n >> (k - 1)) for fixed k-bit integers, where >> is an arithmetic (sign-extending) right shift.
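A short sketch of the zigzag mapping in Python, assuming k = 32 as a default width. The function names are hypothetical. Python's >> on negative integers is already an arithmetic shift, so the usual shift-and-XOR formula carries over directly.

```python
def zigzag_encode(n: int, k: int = 32) -> int:
    """Map a signed k-bit integer to an unsigned one: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ..."""
    if not -(1 << (k - 1)) <= n < (1 << (k - 1)):
        raise ValueError("value does not fit in a signed k-bit integer")
    # n >> (k - 1) is 0 for non-negative n and -1 (all ones) for negative n.
    return (n << 1) ^ (n >> (k - 1))


def zigzag_decode(z: int) -> int:
    """Invert zigzag_encode: the low bit is the sign, the rest is the magnitude."""
    return (z >> 1) ^ -(z & 1)
```

The resulting unsigned value can then be passed to any unsigned VLQ encoder, so small negative numbers stay short.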

Two's complement

LEB128 uses two's complement to represent signed numbers. In this scheme of representation, n bits encode a range from −2ⁿ⁻¹ to 2ⁿ⁻¹ − 1, and all negative numbers start with a 1 in the most significant bit. In Signed LEB128, the input is sign-extended so that its length is a multiple of 7 bits. From there the encoding proceeds as usual.
In LEB128, the stream is arranged least significant first.
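As an illustration of the signed, least-significant-first scheme just described, here is a minimal Python sketch. The function names are hypothetical, and the loop relies on Python's arithmetic right shift to perform the sign extension implicitly.

```python
def encode_sleb128(n: int) -> bytes:
    """Encode a signed integer as Signed LEB128 (least significant group first)."""
    out = bytearray()
    while True:
        group = n & 0x7F
        n >>= 7  # arithmetic shift keeps the sign
        # Stop once the remaining bits are pure sign extension of bit 6.
        sign_bit_set = bool(group & 0x40)
        if (n == 0 and not sign_bit_set) or (n == -1 and sign_bit_set):
            out.append(group)
            return bytes(out)
        out.append(group | 0x80)  # more groups follow


def decode_sleb128(data: bytes) -> int:
    """Decode one Signed LEB128 value from the start of data."""
    result = 0
    shift = 0
    for octet in data:
        result |= (octet & 0x7F) << shift
        shift += 7
        if not octet & 0x80:
            if octet & 0x40:          # sign bit of the final group is set
                result -= 1 << shift  # sign-extend
            return result
    raise ValueError("truncated LEB128 value")
```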

Removing redundancy

With the VLQ encoding described above, any number that can be encoded with N octets can also be encoded with more than N octets simply by prepending additional 0x80 octets as zero-padding. For example, the decimal number 358 can be encoded as the 2-octet VLQ 0x8266; written with leading zeros as 0358 it can be encoded as the 3-octet VLQ 0x808266, as 00358 in the 4-octet VLQ 0x80808266, and so forth.
However, the VLQ format used in Git removes this prepending redundancy and extends the representable range of shorter VLQs by adding an offset to VLQs of 2 or more octets, in such a way that the lowest possible value for an (N+1)-octet VLQ becomes exactly one more than the maximum possible value for an N-octet VLQ. In particular, since a 1-octet VLQ can store a maximum value of 127, the minimum 2-octet VLQ is assigned the value 128 instead of 0. Consequently, the maximum value of such a 2-octet VLQ is 16511 instead of just 16383. Similarly, the minimum 3-octet VLQ has a value of 16512 instead of zero, which means that the maximum 3-octet VLQ is 2113663 instead of just 2097151.
In this way, there is one and only one encoding of each integer, making this a base-128 bijective numeration.
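The offset scheme can be sketched as below. This is written from the description above rather than copied from Git's source, and the function names are hypothetical. The only change from the plain encoding is the "minus one" applied before each continuation octet is emitted, and the matching "plus one" when decoding.

```python
def encode_offset_vlq(n: int) -> bytes:
    """Encode a non-negative integer so that every value has exactly one encoding."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    octets = [n & 0x7F]
    n >>= 7
    while n:
        n -= 1  # the offset: shorter encodings already cover the smaller values
        octets.append(0x80 | (n & 0x7F))
        n >>= 7
    return bytes(reversed(octets))


def decode_offset_vlq(data: bytes) -> int:
    """Decode one offset-encoded VLQ from the start of data."""
    it = iter(data)
    octet = next(it)
    value = octet & 0x7F
    while octet & 0x80:
        octet = next(it)
        value = ((value + 1) << 7) | (octet & 0x7F)
    return value
```

With this scheme the octets 0x80 0x00 decode to 128 rather than 0, so prepending 0x80 no longer acts as padding.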

Examples

Here is a worked out example for the decimal number 137:

Represent the value in binary notation (137 is 10001001).
Break it up into groups of 7 bits starting from the least significant bit (here 0000001 and 0001001). This is equivalent to representing the number in base 128.
Take the lowest 7 bits; that gives you the least significant byte (0000 1001). This byte comes last.
For all the other groups of 7 bits, set the MSB to 1 (which gives 1000 0001 in our example). Thus 137 becomes 1000 0001 0000 1001, where the leading bit of each octet (1 in the first, 0 in the second) is the bit we added. These added bits denote whether another byte follows. Thus, by definition, the very last byte of a variable-length integer has 0 as its MSB.

Another way to look at this is to represent the value in base-128, and then set the MSB of all but the last base-128 digit to 1.
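The base-128 view of the 137 example can be traced step by step in Python. The helper name `base128_digits` is hypothetical and exists only for this illustration.

```python
def base128_digits(n: int) -> list:
    """Return the base-128 digits of a non-negative integer, most significant first."""
    digits = []
    while True:
        digits.append(n % 128)
        n //= 128
        if n == 0:
            break
    return digits[::-1]


digits = base128_digits(137)  # 137 = 1 * 128 + 9
# Set the MSB on every digit except the last one.
octets = [d | 0x80 for d in digits[:-1]] + [digits[-1]]
bits = " ".join(f"{o:08b}" for o in octets)
print(digits, [hex(o) for o in octets], bits)
```

For 137 this yields the digits 1 and 9, the octets 0x81 and 0x09, and the bit pattern 10000001 00001001.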
The Standard MIDI File format specification gives more examples:

Integer (decimal)	Integer (hexadecimal)	Integer (binary)	Variable-length quantity (hexadecimal)	Variable-length quantity (binary)
0	0x00000000	00000000 00000000 00000000 00000000	0x00	00000000
127	0x0000007F	00000000 00000000 00000000 01111111	0x7F	01111111
128	0x00000080	00000000 00000000 00000000 10000000	0x81 0x00	10000001 00000000
8192	0x00002000	00000000 00000000 00100000 00000000	0xC0 0x00	11000000 00000000
16383	0x00003FFF	00000000 00000000 00111111 11111111	0xFF 0x7F	11111111 01111111
16384	0x00004000	00000000 00000000 01000000 00000000	0x81 0x80 0x00	10000001 10000000 00000000
2097151	0x001FFFFF	00000000 00011111 11111111 11111111	0xFF 0xFF 0x7F	11111111 11111111 01111111
2097152	0x00200000	00000000 00100000 00000000 00000000	0x81 0x80 0x80 0x00	10000001 10000000 10000000 00000000
134217728	0x08000000	00001000 00000000 00000000 00000000	0xC0 0x80 0x80 0x00	11000000 10000000 10000000 00000000
268435455	0x0FFFFFFF	00001111 11111111 11111111 11111111	0xFF 0xFF 0xFF 0x7F	11111111 11111111 11111111 01111111
