Bfloat16 floating-point format


The bfloat16 floating-point format is a computer number format occupying 16 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point. The format is a truncated version of the 32-bit IEEE 754 single-precision floating-point format (binary32), intended to accelerate machine learning and near-sensor computing. It preserves the approximate dynamic range of 32-bit floating-point numbers by retaining 8 exponent bits, but supports only 8 bits of precision rather than the 24-bit significand of the binary32 format. Even more so than single-precision 32-bit floating-point numbers, bfloat16 numbers are unsuitable for integer calculations, but this is not their intended use. Bfloat16 is used to reduce the storage requirements and increase the calculation speed of machine learning algorithms.
The bfloat16 format is used in Intel AI processors such as the Nervana NNP-L1000, Xeon processors, and Intel FPGAs; in Google Cloud TPUs; and in TensorFlow. ARMv8.6-A also supports the bfloat16 format. As of October 2019, AMD has added support for the format to its ROCm libraries.

bfloat16 floating-point format

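bfloat16 has the following format:
Sign bit: 1 bit
Exponent width: 8 bits
Significand precision: 8 bits (7 explicitly stored, with an implicit leading bit), as opposed to 24 bits in the single-precision format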
The bfloat16 format, being a truncated IEEE 754 single-precision 32-bit float, allows for fast conversion to and from that format. In conversion to bfloat16, the exponent bits are preserved while the significand field can be reduced by truncation (setting aside the NaN special case, where truncation could clear every nonzero significand bit and turn a NaN into an infinity). Preserving the exponent bits maintains the 32-bit float's range of ≈ 10^−38 to ≈ 3 × 10^38.
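The truncating conversion described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the standard `struct` module is used only to reach the binary32 bit pattern, and NaN inputs are not given any special treatment.

```python
import struct

def float32_to_bfloat16_bits(x: float) -> int:
    """Truncate x to bfloat16 by keeping the upper 16 bits of its binary32 encoding.

    Hypothetical helper. Caveats: values outside the binary32 range make
    struct.pack raise OverflowError, and a NaN whose nonzero significand bits
    all sit in the discarded lower 16 bits would come out as an infinity.
    """
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    return bits32 >> 16  # sign, 8 exponent bits, top 7 significand bits

def bfloat16_bits_to_float(bits16: int) -> float:
    """Widen a bfloat16 bit pattern to binary32 by appending 16 zero bits."""
    (x,) = struct.unpack("<f", struct.pack("<I", (bits16 & 0xFFFF) << 16))
    return x

print(hex(float32_to_bfloat16_bits(1.0)))    # 0x3f80
print(hex(float32_to_bfloat16_bits(-2.0)))   # 0xc000
print(bfloat16_bits_to_float(0x4049))        # 3.140625
```

Widening is exact because every bfloat16 value is also a binary32 value; only the narrowing direction loses information.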
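The bits are laid out as follows, from most significant to least significant:
bit 15 = sign (1 bit)
bits 14 to 7 = exponent (8 bits)
bits 6 to 0 = fraction (7 bits)

Contrast with bfloat16 and single precision

fmt        s_exponent_fraction
bfloat16 = S_EEEEEEEE_FFFFFFF
binary32 = S_EEEEEEEE_FFFFFFFFFFFFFFFFFFFFFFF

Legend

S: sign
E: exponent
F: fraction (trailing significand)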

The bfloat16 binary floating-point exponent is encoded using an offset-binary representation, with the zero offset being 127 (known as the exponent bias in the IEEE 754 standard).
Thus, to obtain the true exponent, the offset of 127 must be subtracted from the value of the exponent field.
The minimum and maximum values of the exponent field are interpreted specially, as in the IEEE 754 standard formats: an all-zero exponent field encodes zero and subnormal numbers, and an all-ones exponent field encodes infinities and NaNs.
The minimum positive normal value is 2^−126 ≈ 1.18 × 10^−38 and the minimum positive (subnormal) value is 2^(−126−7) = 2^−133 ≈ 9.2 × 10^−41.
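The decoding rules above (bias of 127, special handling of the smallest and largest exponent fields) can be illustrated with a short Python sketch. The function name `decode_bfloat16` is hypothetical, and the subnormal branch assumes the usual IEEE 754 interpretation of an all-zero exponent field.

```python
def decode_bfloat16(bits: int) -> float:
    """Interpret a 16-bit pattern as a bfloat16 value (illustrative sketch)."""
    sign = -1.0 if (bits >> 15) & 1 else 1.0
    exponent = (bits >> 7) & 0xFF   # biased exponent field
    fraction = bits & 0x7F          # 7 stored significand bits

    if exponent == 0xFF:
        # Maximum exponent field: infinity if the fraction is zero, else NaN.
        return sign * float("inf") if fraction == 0 else float("nan")
    if exponent == 0:
        # Minimum exponent field: zero or subnormal, no implicit leading 1.
        return sign * fraction * 2.0 ** (-126 - 7)
    # Normal number: implicit leading 1, true exponent = field - 127.
    return sign * (1 + fraction / 128) * 2.0 ** (exponent - 127)

print(decode_bfloat16(0x3F80))                 # 1.0
print(decode_bfloat16(0xC000))                 # -2.0
print(decode_bfloat16(0x0080) == 2.0 ** -126)  # True (minimum positive normal)
print(decode_bfloat16(0x0001) == 2.0 ** -133)  # True (minimum positive subnormal)
```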

Encoding of special values

Positive and negative infinity

Just as in IEEE 754, positive and negative infinity are represented with their corresponding sign bits, all 8 exponent bits set and all significand bits zero. Explicitly,
val s_exponent_signcnd
+inf = 0_11111111_0000000
-inf = 1_11111111_0000000

Not a Number

Just as in IEEE 754, NaN values are represented with either sign bit, all 8 exponent bits set and not all significand bits zero. Explicitly,
val s_exponent_signcnd
+NaN = 0_11111111_klmnopq
-NaN = 1_11111111_klmnopq
where at least one of k, l, m, n, o, p, or q is 1. As with IEEE 754, NaN values can be quiet or signaling, although there are no known uses of signaling bfloat16 NaNs as of September 2018.
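The encodings of infinities and NaNs can be checked with simple bit tests, sketched here in Python. The function name is hypothetical, and the quiet/signaling split is an assumption: it follows the common convention in which the most significant significand bit (k above) is set for a quiet NaN, which the text here does not itself specify.

```python
def classify_bfloat16(bits: int) -> str:
    """Classify a bfloat16 bit pattern as finite, infinite, or NaN (sketch)."""
    sign = "-" if (bits >> 15) & 1 else "+"
    exponent = (bits >> 7) & 0xFF
    fraction = bits & 0x7F

    if exponent != 0xFF:
        return "finite"
    if fraction == 0:
        return sign + "inf"
    # Assumption: top significand bit set means quiet, clear means signaling.
    return "qNaN" if fraction & 0x40 else "sNaN"

print(classify_bfloat16(0x7F80))  # +inf
print(classify_bfloat16(0xFF80))  # -inf
print(classify_bfloat16(0xFFC1))  # qNaN
print(classify_bfloat16(0xFF81))  # sNaN
```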

Range and precision

Bfloat16 is designed to maintain the number range from the 32-bit IEEE 754 single-precision floating-point format, while reducing the precision from 24 bits to 8 bits. This means that the precision is between two and three decimal digits, and bfloat16 can represent finite values up to about 3.4 × 10^38.
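A rough way to see what 8 bits of precision means in practice is to compute the equivalent number of decimal digits and to try a few integers just above 2^8. The sketch below is illustrative only; `truncate_to_bfloat16` is a hypothetical helper that assumes conversion by truncation of the binary32 encoding.

```python
import math
import struct

PRECISION_BITS = 8  # 7 stored significand bits plus the implicit leading 1

# Equivalent decimal precision: 8 * log10(2).
decimal_digits = PRECISION_BITS * math.log10(2)

# Gap between 1.0 and the next larger bfloat16 value: 2^-7.
epsilon = 2.0 ** -(PRECISION_BITS - 1)

def truncate_to_bfloat16(x: float) -> float:
    """Clear the low 16 bits of the binary32 encoding of x (truncation)."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    (y,) = struct.unpack("<f", struct.pack("<I", bits32 & 0xFFFF0000))
    return y

print(round(decimal_digits, 2))      # 2.41
print(epsilon)                       # 0.0078125
print(truncate_to_bfloat16(256.0))   # 256.0
print(truncate_to_bfloat16(257.0))   # 256.0 (257 is not representable)
print(truncate_to_bfloat16(258.0))   # 258.0
```

Above 256, consecutive integers are no longer all representable, which is one concrete reason the format is a poor fit for integer arithmetic.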

Examples

These examples are given in bit representation, in hexadecimal and binary, of the floating-point value. This includes the sign, exponent, and significand.
3f80 = 0 01111111 0000000 = 1
c000 = 1 10000000 0000000 = −2
7f7f = 0 11111110 1111111 = (2^8 − 1) × 2^−7 × 2^127 ≈ 3.38953139 × 10^38
0080 = 0 00000001 0000000 = 2^−126 ≈ 1.175494351 × 10^−38
The maximum positive finite value of a normal bfloat16 number is 3.38953139 × 10^38, slightly below (2^24 − 1) × 2^−23 × 2^127 ≈ 3.402823466 × 10^38, the maximum finite positive value representable in single precision.
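The two maxima quoted above can be reproduced with exact integer arithmetic. This is a small check written for illustration; the constant names are not taken from any library.

```python
# Largest finite bfloat16 value: (2^8 - 1) * 2^-7 * 2^127, as an exact integer.
BF16_MAX = (2**8 - 1) * 2**(127 - 7)

# Largest finite binary32 value: (2^24 - 1) * 2^-23 * 2^127, as an exact integer.
F32_MAX = (2**24 - 1) * 2**(127 - 23)

print(f"{BF16_MAX:.8e}")   # 3.38953139e+38
print(f"{F32_MAX:.8e}")    # 3.40282347e+38
print(BF16_MAX < F32_MAX)  # True
```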

Zeros and infinities

0000 = 0 00000000 0000000 = 0
8000 = 1 00000000 0000000 = −0
7f80 = 0 11111111 0000000 = infinity
ff80 = 1 11111111 0000000 = −infinity

Special values

4049 = 0 10000000 1001001 = 3.140625 ≈ π
3eab = 0 01111101 0101011 = 0.333984375 ≈ 1/3
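The pattern 3eab for 1/3 is what rounding to nearest produces; plain truncation of the binary32 encoding of 1/3 (3eaaaaab) would give 3eaa instead. The sketch below compares the two, assuming round-to-nearest, ties-to-even as the rounding mode. Both function names are hypothetical, and the NaN handling is one possible choice rather than a prescribed behavior.

```python
import math
import struct

def to_bits32(x: float) -> int:
    """Return the binary32 bit pattern of x as an unsigned integer."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    return bits32

def bfloat16_truncate(x: float) -> int:
    """Convert by dropping the low 16 bits (round toward zero)."""
    return to_bits32(x) >> 16

def bfloat16_round_nearest_even(x: float) -> int:
    """Convert with round-to-nearest, ties-to-even (illustrative sketch)."""
    bits32 = to_bits32(x)
    if (bits32 & 0x7F800000) == 0x7F800000 and (bits32 & 0x007FFFFF):
        # NaN: keep the upper bits and set a significand bit so it stays a NaN.
        return (bits32 >> 16) | 0x0040
    lsb = (bits32 >> 16) & 1          # lowest bit that survives the conversion
    rounded = bits32 + 0x7FFF + lsb   # adding the bias rounds ties to even
    return (rounded >> 16) & 0xFFFF

print(hex(bfloat16_truncate(1 / 3)))                # 0x3eaa
print(hex(bfloat16_round_nearest_even(1 / 3)))      # 0x3eab
print(hex(bfloat16_round_nearest_even(math.pi)))    # 0x4049
```

For π both methods agree on 4049, because the discarded bits of its binary32 encoding (0fdb) are below the halfway point.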

NaNs

ffc1 = x 11111111 1000001 => qNaN
ff81 = x 11111111 0000001 => sNaN