A primer on the binary system: two's complement, IEEE-754 and more, explained

Modern 64-bit processors are capable of storing $2^{64}$ different values within their general-purpose registers. Each one of those possible values is simply a different combination of 64 sequential bits - and each one of those bits can, of course, exist in one of two states at a single point in time.

Given any number of bits $N$, the maximum number of distinct values representable with that bit count is $2^N$.

Each one of those values can also be thought of as a different number. We humans designed the binary system out of the need to represent the many different numbers found in the decimal system - the one we think and operate in. It turns out, however, that this is no small feat. As we’ll see, representing positive (i.e. “unsigned”) integers is easy - but how do we represent negative integers? And what about non-integers (i.e. decimal, real numbers)?

On “numbers” vs “values”

While we most often think of binary values as numbers, they can also simply represent different abstract ideas. For example, think of the many syscalls exposed by the Linux kernel. Many of them take a “flags” argument - a value composed of one or more binary-OR’d flags, each of which is a value with exactly one set bit. We never care what the decimal representation of the final flags argument is (after all the individual options have been OR’d together) - we simply use the macros provided by the appropriate header to construct the final value. Nothing stops us, however, from observing the decimal value of the final argument and using it directly in the future. I wouldn’t call that good practice though. For example:

#include <stdint.h>
#include <fcntl.h>

// with O_CREAT, open() needs a third (mode) argument - 0644 here
int32_t fd1 = open("/some/file", O_APPEND | O_CREAT | O_NONBLOCK, 0644),
        fd2 = open("/some/other/file", 3136, 0644);

Both open() calls receive exactly the same value as their second argument, i.e. the exact same set of flags, but there’s no reason to ever use the second variant, unless you’re doing code golf.

Representations

Now, let’s take a look at how different kinds of numbers from the decimal system are represented in binary form. I will only be describing the systems/encodings in use today.

Positive (unsigned) integers

“unsigned” as in “no sign” - the value is absolute

Representing exclusively positive integers in a binary format is trivial. Each bit in the sequence carries a value twice as large as that of the previous one, starting at $2^0$, or $1$. The most significant bit carries the value $2^{N - 1}$, where $N$ is the total number of bits. The range of decimal values representable using $N$ bits in this system goes from $0$ to $2^N - 1$ (one of the $2^N$ patterns is taken up by zero, where all bits are unset). To convert a value encoded in this binary format to a decimal value, we simply add the values of all the “set” (i.e. “$1$”) bits together.

That’s all that there really is to it. Here are some examples using 8-bit sequences, for the sake of clarity:

Note that, when representing a single byte sequence in textual form, the most significant bit is most often written first.

0 0 0 0 1 0 0 0 - $8$ in decimal

1 0 0 0 0 0 0 0 - $128$ in decimal

1 1 0 0 1 0 0 1 - $201$ in decimal
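As a quick sanity check, here is a minimal sketch of the examples above (it assumes a compiler that accepts 0b binary literals - a common extension, standardized in C23):

#include <stdio.h>
#include <stdint.h>

int main(void) {
    // Each set bit contributes 2^position to the value.
    uint8_t a = 0b00001000; // 2^3                   = 8
    uint8_t b = 0b10000000; // 2^7                   = 128
    uint8_t c = 0b11001001; // 2^7 + 2^6 + 2^3 + 2^0 = 201

    printf("%d %d %d\n", a, b, c); // prints: 8 128 201
}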

Signed integers

So, how do we represent negative integers? From the get-go it is obvious that this is going to require more work - an encoding of some kind. Historically, there have existed several solutions to this problem. I will describe two of them: the oldest and most trivial one - “sign and magnitude”, and the one currently in use in most modern systems - “two’s complement”.

Sign and magnitude

In this method, the most significant bit denotes the sign of the value (+, positive ($0$) or -, negative ($1$)), while the remainder of the bits in the sequence represent the absolute value (magnitude), which is why it is called sign and magnitude.

The magnitude is encoded just as a regular unsigned integer. If we imagine an $N$-bit sequence, using “sign and magnitude” we can represent decimal values from $-(2^{N - 1} - 1)$ to $+(2^{N - 1} - 1)$, inclusive. For example, using an 8-bit sequence we can represent decimal values from $-127$ to $+127$.

An interesting property of “sign and magnitude” is that there are two ways to represent zero: 0 0 0 0 0 0 0 0 (+0) and 1 0 0 0 0 0 0 0 (-0), thus essentially wasting a value. In addition, basic operations on “sign and magnitude”-encoded values (such as comparisons or even basic arithmetic) have to account for the sign bit, making them computationally more expensive - something that is not a problem with the “two’s complement” method, as we’ll see.

Two’s complement

In “two’s complement”, negative counterparts of a positive integer (and vice-versa) are obtained by:

  1. Performing a binary NOT operation on the whole bit sequence (i.e. reversing all of the bits)
  2. Adding 1 to the previous result (regular binary addition)

Another method that is easier to visualise is locating the first set ("$1$") bit from the right, and then inverting all of the bits to the left of that bit.

Performing two’s complement conversion on a negative number will, of course, convert it back to a positive one. The absolute value stays the same.

Despite not relying solely on a single bit to denote the sign of the value, the highest-order bit still does just that - 0 for positive, and 1 for negative.

Using $N$ bits, “two’s complement” can represent decimal values from $-2^{N - 1}$ to $+2^{N - 1} - 1$ (i.e. from $-128$ to $+127$ using an $8$-bit sequence). The range is not “symmetric” because we need to represent zero as well.

Seeing as we can encode (using 8 bits, as an example) positive values from $0$ to $+127$, and seeing as $+127$’s bit pattern is the following: 0 1 1 1 1 1 1 1, it is apparent why we cannot further encode larger positive values: it would require us to flip the highest order bit to a $1$, which would no longer denote a positive value. That’s why the upper range is capped where all bits but the most significant one are set ("$1$") - i.e. the value $2^{N - 1} - 1$.

Two’s complement solves many of the problems of the methods that came before it:

  • there’s only one way to represent a zero (0 0 0 0 0 0 0 0)
    • a side effect of this is that it encodes one more value than the other solutions, on the negative side of the range
    • using 8 bits, sign and magnitude encodes values from $-127$ to $+127$, while two’s complement encodes values from $-128$ to $+127$
  • arithmetic operations are trivial and require no additional handling - for example, adding two two’s-complement-encoded values is nothing more than regular binary addition, regardless of the operands’ signs (see the sketch right after this list)
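To demonstrate that last point, here is a minimal sketch (assuming the usual two’s complement representation, which C23 now mandates and which every mainstream platform uses): adding the raw bit patterns of two signed 8-bit values with plain unsigned, modulo-256 addition yields the correct signed sum, with no sign handling anywhere.

#include <stdio.h>
#include <stdint.h>

int main(void) {
    int8_t a = -66, b = 75;

    // Add the raw 8-bit patterns with plain unsigned (modulo-256) addition.
    uint8_t raw_sum = (uint8_t)((uint8_t)a + (uint8_t)b);

    // Reading the result back as a signed byte gives the expected sum.
    printf("%d\n", (int8_t)raw_sum); // prints: 9 (i.e. -66 + 75)
}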

Below are a couple of step-by-step examples of converting an unsigned integer to its negative two’s complement counterpart:

  • 0 1 0 0 0 0 1 0 ($66$)

    1. Perform a binary NOT on the bit sequence: 1 0 1 1 1 1 0 1
    2. Add $1$ to the previous result: 1 0 1 1 1 1 1 0
    3. $-66$
  • 0 0 0 0 0 0 0 1 ($1$)

    1. Perform a binary NOT on the bit sequence: 1 1 1 1 1 1 1 0
    2. Add $1$ to the previous result: 1 1 1 1 1 1 1 1
    3. $-1$

The process is of course exactly the same regardless of the length of the bit sequence (e.g. $32$, $64$, etc…)
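The same recipe can be carried out in code. Below is a small sketch (again assuming a two’s complement platform): negating $66$ “by hand” with a bitwise NOT followed by adding $1$ produces the same bit pattern as the compiler’s own negation.

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint8_t x = 66; // 0 1 0 0 0 0 1 0

    uint8_t manual  = (uint8_t)(~x + 1);    // binary NOT, then add 1
    uint8_t builtin = (uint8_t)(int8_t)-66; // the compiler's -66, viewed as raw bits

    // Both print 190, i.e. the pattern 1 0 1 1 1 1 1 0 from the worked example above.
    printf("%d %d\n", manual, builtin);
}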

The most negative number

The most negative number representable using two’s complement in an $N$-bit sequence is a special case. We know that applying the two’s complement operation to any valid bit sequence will flip its sign, i.e. $-5$ would become $+5$, $+77$ would become $-77$, $-112$ would become $+112$ and so on. The same is not true for the most negative number - $-2^{N - 1}$. If we take an 8-bit sequence as an example, we know that it can represent decimal numbers ranging from $-128$ to $+127$ (exactly why is explained above). So what happens if we apply two’s complement to (i.e. try to negate) the bit sequence representing $-128$? We would expect to obtain $+128$ as the result - but we know that it cannot be represented (it is larger by 1 than the largest representable number). So what do we get? We get the exact same bit sequence. We can see this in action if we apply two’s complement to the lowest number in an 8-bit system, i.e. $-128$:

  1. 1 0 0 0 0 0 0 0 ($-128$ in two’s complement)
  2. 0 1 1 1 1 1 1 1 (reversing the bits)
  3. 1 0 0 0 0 0 0 0 (after adding $1$) - the same bit sequence that we started with!

Thus, the most negative number is a special case in the two’s complement system, and it has to be treated with special care. The C and C++ standards in particular leave the behavior of negating the most negative number undefined, since the mathematically correct result cannot be represented. As an example, the following C snippet produces -128 as the result on an x86-64 machine (as is, well, expected in this special case):

signed char b = (signed char) -128 * (signed char) -1;

Real numbers

Floating-point arithmetic is considered an esoteric subject by many people. -David Goldberg


It’s immediately apparent that we need yet another encoding in order to represent and operate on non-integer (real) numbers.

Fixed-point representation

The most trivial way to represent real numbers in a binary format is by employing a “fixed-point representation”. That is, in an N-bit pattern, choose a position between two adjacent bits at which the radix point (also called the binary point) will be placed. Bits to the left of the radix point represent the whole (integer) part of the number, while bits to the right of it represent the fractional part. Starting from the radix point, the bits to the right of it represent negative powers of 2: the highest-order bit of the fractional part represents the value $2^{-1}$ ($0.5$), the next one $2^{-2}$ ($0.25$), and so on down to $2^{-F}$ for the lowest-order bit, where $F$ is the number of fractional bits (e.g. $N / 2$ if the point is placed exactly in the middle). This of course applies to bit patterns of any length.

The most glaring issue of this system is that the magnitude of the integer part as well as the degree of precision of the fractional part are fixed, dependent on the position of the binary point. If we take an 8-bit wide sequence as an example, and if we assume that the binary point is fixed exactly in the middle, then its integer part ranges from $0$ through $15$ (though two’s complement could be utilized here to provide negative numbers), and its fractional part ranges from $0$ to $0.9375$ in steps of $0.0625$. If at any point we need a larger integer part (for example we need to store the value $38$) at the cost of less precision in the fractional part, we are out of luck, as the radix point’s position is fixed.

The other major problem is that fractional values that aren’t combinations of integer powers of 2 are unrepresentable. To illustrate this, let’s imagine $0.2$ (or $\frac{1}{5}$). It cannot be represented as a power of two, nor as a combination of any number of powers of two. In contrast, $0.75$, for example, is representable as $2^{-1} + 2^{-2}$. So how do we represent those values? Well, we cannot. This problem is unfortunately not solved by any other, more advanced representation, as we’ll see later. The best we can do is approximate those values. The more bits we have to work with, the better our precision is, and the “closer” we can get to the desired, ultimately irrepresentable value. For example, using an 8-bit fixed-point representation where the radix point is placed in the middle, the closest we can get to $0.2$ is 0000.0011, or $0.1875$, i.e. $2^{-4} + 2^{-3} = \frac{1}{16} + \frac{1}{8}$. As the bit count increases, so does our precision, and we get closer and closer to $0.2$, but we never quite get there. The closest we can get to $0.35$ is 0000.0110, or $0.375$. Note that by “close” I don’t necessarily mean “just before” or “slightly less” - the closest possible approximation may also be larger than the desired number.
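To make this concrete, here is a minimal sketch of the 8-bit format used above (4 integer bits, 4 fractional bits); the to_fixed/from_fixed helpers are illustrative names I made up, not part of any real library:

#include <stdio.h>
#include <stdint.h>
#include <math.h>

// Store a real number as "value * 16", i.e. with the radix point fixed
// between bit 3 and bit 4 (4 integer bits, 4 fractional bits).
static uint8_t to_fixed(double x) { return (uint8_t)round(x * 16.0); }
static double from_fixed(uint8_t f) { return f / 16.0; }

int main(void) {
    // 0.75 = 2^-1 + 2^-2 is exactly representable...
    printf("%g\n", from_fixed(to_fixed(0.75))); // prints: 0.75

    // ...but 0.2 is not - the closest representable value is 0.1875 (0000.0011).
    printf("%g\n", from_fixed(to_fixed(0.2)));  // prints: 0.1875
}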

This issue is actually present in the decimal number system that we use every day: in it, we cannot exactly represent a value such as $\frac{1}{3}$ (one third). We usually approximate it by writing it as $0.33333…$, with the digits after the decimal point repeating “forever” (often indicated by placing a dot above the last written digit).

To get around the first issue of fixed-point representations, a floating-point representation can be used. The name is quite self explanatory - the binary point is no longer fixed to a specific position and can instead “float” around on demand. This allows for tradeoffs: a larger magnitude for the integer part of a number with lower fractional precision, or vice versa. The industry standard for representing floating point numbers in computers is the IEEE-754 standard, first introduced in 1985.

IEEE-754 (floating-point representation)

The standard was first introduced in 1985 and describes how floating point numbers of various widths are to be stored in memory, as well as how various arithmetic operations on those numbers are to behave. While all of the semantics of the standard could be implemented in software, they were commonly implemented in hardware by so-called FPUs (“floating-point units”) for the sake of efficiency and speed. The first piece of hardware to implement IEEE-754 (a draft of it, in fact, as the chip predates the finished standard) was the Intel 8087 floating-point co-processor. In modern processors, IEEE-754 operations are implemented directly on the chip, eliminating the need for an external FPU.

The 2008 revision of the standard can be found here, though it is of course very technical and thorough.

Scientific notation

Representation of floating point numbers as described by IEEE-754 shares a lot of semantics with scientific notation, which I will first describe.

Scientific notation is a way of representing numerical values in the decimal number system, that are way too long to be conveniently written down in their entirety. The general form is this:

$$M \times 10^N $$

, where $M$ is a real number called the mantissa/significand, and $N$ is an integer (whole number) called the exponent (this terminology comes into use again later). If $M$ is negative, it is prefixed with a minus sign.

Examples:

  • $88 120 000 000$ in decimal notation can be written as $8.812 \times 10^{10}$ in scientific notation.
  • $-67 000$ in decimal notation can be written as $-6.7 \times 10^4$ in scientific notation
  • $0.000129$ in decimal notation can be written as $1.29 \times 10^{-4}$

While the mantissa $M$ can generally be any real number, most commonly a subset of scientific notation called normalized notation is used, whereby the absolute value of the mantissa is constrained between 1 and 10, i.e. $1 \leq |M| \lt 10$.

Meaning that, while $0.000129$ can be written as $12.9 \times 10^{-5}$, or even $1290 \times 10^{-7}$, in normalized notation the mantissa is constrained to $1.29$, the exponent is therefore chosen accordingly to be $-4$, and the final expression ends up being $1.29 \times 10^{-4}$.

The degree of precision of the final value depends, of course, on the number of significant digits in the mantissa. For example, the significand $1.2345$ allows for precise values such as $0.012345$ or $0.12345$, while a mantissa with fewer significant digits, for example $1.234$, would only allow for less precise values such as $0.01234$ or $0.1234$. This is explained in a bit more detail here.


Now back to IEEE-754. The reason that I brought up scientific notation in the first place is that real numbers in computers are represented in much the same way, as we’ll see.

The standard defines 4 levels of precision, however the two most commonly used and implemented ones are:

  • single-precision: 32 bits - referred to as a float by the C standard
  • double-precision: 64 bits - referred to as a double by the C standard

Encoding in memory

For the sake of brevity I will be presenting all examples in the single-precision format. The double-precision format follows exactly the same rules, however as the bit count increases, some of the constants used in the calculations differ. I will note those along the way.

The bitfields of floating point numbers encoded as per IEEE-754 are split into 3 parts:

  • 1 sign bit
  • 8 biased exponent bits (11 in double-precision)
  • 23 mantissa bits (52 in double-precision)


The final value is computed from these three fields as follows:

$$(-1)^s \times (1 + \text{mantissa}) \times 2^{\text{exponent} - 127}$$

The most significant bit - the sign bit - denotes the sign of the final number: $0$ for positive and $1$ for negative. As a result of this, two zero values are representable: $+0$ and $-0$.

The next 8 bits represent the biased exponent that is used to scale the value of the mantissa. The biased exponent is treated as a regular 8-bit unsigned integer, however it is $127$-biased (the stored value is increased or “shifted” by 127) to allow for storage of negative exponents. To obtain the actual value of the exponent, subtract $127$ from its value, e.g. a biased exponent of $231$ would denote an actual exponent of $231-127 = 104$. A bias is used instead of an encoding such as “two’s complement” for efficient comparisons. Once the bias is taken into account, the value of the exponent ranges from $-126$ to $+127$ (not $-127$ to $128$ as you may believe - cases where the exponent field is all $0$s / all $1$s are treated specially, more on this later).

The last 23 bits represent the mantissa (analogous to a significand in scientific notation). Its bits represent negative powers of two, starting with $2^{-1}$ at the most significant bit, all the way through to $2^{-23}$ at the least significant bit. A non-existent $1$-bit carrying the value $2^0$ ($1.0$) is implied to be placed just between the exponent and mantissa bits - that’s what the $+1$ denotes in the equation pictured above, and is also why the mantissa is guaranteed to be between $1.0$ and $2.0$ in value, i.e. $1.0 \le mantissa \lt 2.0$. This is called the “leading bit convention” or the “implicit bit convention”, and it allows the format to have an extra bit of precision. This is true in all cases but when dealing with a denormalized number, indicated by the biased exponent field being all 0s - more on this later.

Once all three components are known, obtaining the final value is quite trivial - the formula pictured above is quite self-explanatory. This article provides a nice additional visual explanation. IEEE-754 representations in no way solve the representation issues discussed earlier. If the value cannot be represented by a combination of powers of two ranging from $-1$ to $-23$, increased by $1.0$ and then scaled (multiplied) by a power of two ranging from $-126$ to $127$, then it cannot be exactly represented, and is instead approximated as closely as possible. This holds for integers as well - for example, the value $16,777,217$ cannot be exactly represented in the single-precision IEEE-754 format, and is instead rounded down to $16,777,216$. Perhaps somewhat counter-intuitively, the value $5$ can be exactly represented. By setting the exponent to $2$ (a biased exponent of $129$), and the mantissa field to $0.25$ (i.e. setting its second most significant bit, the one carrying the value $2^{-2}$ or $0.25$), the final equation becomes $1 \times 2^2 \times (1 + 0.25) = 1.25 \times 4 = 5$.
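We can poke at this from C with a small sketch (assuming the platform’s float is the IEEE-754 single-precision format, which holds on any mainstream system today). It extracts the three fields from the value $5.0$ and also demonstrates the $16,777,217$ rounding mentioned above:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    float f = 5.0f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits); // reinterpret the float's raw bits

    uint32_t sign     = bits >> 31;
    uint32_t biased   = (bits >> 23) & 0xFF; // 8-bit biased exponent field
    uint32_t mantissa = bits & 0x7FFFFF;     // 23 mantissa bits

    // For 5.0: sign 0, biased exponent 129 (actual exponent 2),
    // mantissa 0x200000 (only the 2^-2 bit set, i.e. 0.25).
    printf("sign=%lu biased_exp=%lu mantissa=0x%06lX\n",
           (unsigned long)sign, (unsigned long)biased, (unsigned long)mantissa);

    // 2^24 + 1 is not exactly representable and rounds to 2^24.
    printf("%.1f\n", 16777217.0f); // prints: 16777216.0
}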

Representing zeroes

Zeroes are represented by setting the exponent and mantissa fields to all $0$s. The sign bit determines if the zero is positive ($+0$) or negative ($-0$).

Denormalized numbers

When the exponent portion of a number is all $0$s, the value is said to be denormalized. This comes with a couple of deviations from the regular rules for representing floating point values (and as such, the general formula cannot be used in this case). Firstly, the exponent’s actual value (when the 127-bias is taken into account) isn’t $-127$ as you might expect, but rather $-126$, just as if the biased exponent were $1$. Secondly, the implied $1$-bit no longer exists, meaning the mantissa is constrained between $0.0$ and $1.0$, i.e. $0.0 \le mantissa \lt 1.0$. As the leading digit of the mantissa is a 0, this allows for representation of numbers even smaller than the smallest normalized number. For example, the numbers closest to $0$ representable using the normalized single-precision format are $\pm2^{-126}$ ($\approx\pm1.17549 \times 10^{-38}$) - this is the case when the mantissa’s bits are all $0$s and the biased exponent is $1$. In contrast, the numbers closest to $0$ representable when utilizing denormalized values are $\pm2^{-23} \times 2^{-126} = \pm2^{-149}$ ($\approx\pm1.40130 \times 10^{-45}$) - this being the case when only the mantissa’s least significant bit is set and the exponent’s bits are, of course, all $0$s. Any non-zero number with magnitude smaller than the smallest normalized number is considered subnormal. The existence of denormalized numbers guarantees that there is always a representable non-zero difference between any two nearby floating point numbers, so an operation such as $x - y$ can never produce a zero result unless the operands are actually equal.
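These limits are easy to observe from C - a quick sketch, assuming IEEE-754 singles and a libm providing nextafterf (link with -lm on Linux):

#include <stdio.h>
#include <float.h>
#include <math.h>

int main(void) {
    // Smallest positive *normalized* single-precision value: 2^-126.
    printf("%e\n", FLT_MIN);                // ~1.175494e-38

    // Smallest positive *denormalized* (subnormal) value: 2^-149.
    printf("%e\n", nextafterf(0.0f, 1.0f)); // ~1.401298e-45

    // Thanks to subnormals, x - y is only ever zero when x == y.
    float x = 3.0e-38f, y = 2.5e-38f;       // two distinct normalized values
    printf("%e\n", x - y);                  // ~5e-39: a subnormal, not zero
}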

Other special values

Since the IEEE-754 standard also requires conforming implementations to support a number of arithmetic operations (e.g. taking the square root of a value), and since the results of many of those operations are invalid or undefined for certain operands, the standard defines five exceptions. The results of operations which raise exceptions are often one of three special values: NaN (“not-a-number”), +infinity (+∞), and -infinity (-∞). Their bitfield representations are listed in the table below:

Type | Sign bit | Biased exponent | Mantissa
NaN | any | all 1s | any non-zero value
+∞ | 0 | all 1s | all 0s
-∞ | 1 | all 1s | all 0s

Mathematically undefined operations, such as taking the square root of a negative number (e.g. $\sqrt{-5.0}$), result in a NaN. Dividing a non-zero number by zero results in ±∞, with the sign determined by the signs of the two operands, while $0 / 0$ again results in a NaN.
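For example, a tiny sketch (assuming an IEEE-754-conforming C implementation - Annex F - which is what any mainstream platform gives you):

#include <stdio.h>
#include <math.h>

int main(void) {
    printf("%f\n", sqrt(-5.0)); // invalid operation -> prints: nan (or -nan)
    printf("%f\n", 1.0 / 0.0);  // division by zero  -> prints: inf
    printf("%f\n", -1.0 / 0.0); //                   -> prints: -inf
    printf("%f\n", 0.0 / 0.0);  // invalid operation -> prints: nan (or -nan)
}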

Precision explained in detail

The larger the absolute magnitude of the integer part of a number is, the less precision is available to cover the fractional part of the number. This is illustrated by this initially quite elusive (at least to me) graph found on the IEEE-754 Wikipedia page, pictured below:

[graph from the IEEE-754 article on Wikipedia: the precision of the binary floating point formats plotted against the magnitude of the represented value]

To understand why this is, let’s recap what roles the parts of an IEEE-754 format actually play.

The exponent tells us between which two consecutive powers of 2 our value falls - for example $[0.25, 0.5]$ (between powers of two $-2$ and $-1$), $[1, 2]$ (between powers of two $0$ and $1$), $[16, 32]$ (between powers of two $4$ and $5$), all the way up to $[2^{127}, 2^{128}]$. If the exponent is, for example, $18$, we know that the final value will fall somewhere within $[2^{18}, 2^{19}]$, or $[262144, 524288]$.

The mantissa then divides the window/range indicated by the exponent into $2^{23}$ (or $8388608$) “buckets” - individual representable values. In other words, it approximates the exact location/offset of the final number within the range indicated by the exponent. Since the mantissa is constrained between $1.0$ and $2.0$ in normalized form, and since the final value is obtained by multiplying the mantissa by the power of two given by the exponent, it is clear why the exponent only needs to store the lower end of the range. The closer the mantissa is to $1.0$, the closer the final number is to the lower end of the range, and the closer it is to $2.0$, the closer the final number is to the upper end.

The shorter the range within which the final number falls, the more ‘precisely’ its location/offset within that range can be approximated, because each of the $8388608$ buckets dividing the range represents a smaller value. For example, imagine that we are trying to represent a number in the range $[1, 2]$. The width of that range is $(2 - 1) = 1$, and each of the $8388608$ different mantissa values dividing it carries the value $\frac{2 - 1}{8388608}$, or $0.00000011920928955078125$ - that value being our precision, i.e. the smallest difference between two consecutive numbers in that range.

In contrast, had we instead wanted to represent a value in the range $[512, 1024]$ (so $9$ being the exponent), the precision would ‘fall down’ to $\frac{1024 - 512}{8388608}$, or $0.00006103515625$. Quite a bit less precise.

Let’s see an example which showcases the precision in action: say we want to represent $65536$. It is trivial to represent exactly that value using only the exponent, as it is a direct power of two. We simply set the sign to $0$, the exponent to $16$ (or $143$ when accounting for the bias), and we leave the mantissa at $0$. Now, the exponent’s value tells us that we can represent values in the range $[2^{16}, 2^{17}]$, or $[65536, 131072]$ - essentially we are covering a range of $65536$ values. Our precision in that case is $\frac{65536}{8388608} = 0.0078125$, or $7.8125\text{e-}3$. If we then flip the least significant bit of the mantissa on (the one carrying the value $2^{-23}$), the final value becomes $+1 \times 2^{16} \times (1 + 2^{-23}) = 65536.0078125$. This value is the next largest representable one after $65536$, and it is larger than it by exactly the value of the precision in that range (i.e. $7.8125\text{e-}3$). After it, the next largest value is $+1 \times 2^{16} \times (1 + 2^{-22}) = 65536.015625$. Again, the difference between it and the previous value is the precision: $65536.015625 - 65536.0078125 = 0.0078125$. Past $65536$, we cannot make jumps smaller/more precise than $7.8125\text{e-}3$.
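A quick sketch confirming this (nextafterf returns the next representable float in the direction of its second argument):

#include <stdio.h>
#include <math.h>

int main(void) {
    // The next representable float after 65536 is 65536 + 0.0078125.
    printf("%.7f\n", nextafterf(65536.0f, INFINITY)); // prints: 65536.0078125
    // Anything closer to 65536 than that simply rounds back down to it.
    printf("%.7f\n", 65536.001f);                     // prints: 65536.0000000
}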

From the graph pictured above we can observe that (at least concerning the 32-bit single-precision format) somewhere between $10^6$ and $10^8$ the precision reaches $10^0 = 1$, meaning that past that particular cutoff point, we lose the ability to represent non-integer numbers, as our precision itself becomes a whole number without a fractional part. In fact, we can figure out the exact range at which the cutoff point appears:

$$\frac{x}{2^{23}} = 1 \Rightarrow x = 2^{23}$$

From this we see that when the exponent reaches $23$, or rather when our whole integer part crosses into the $[2^{23}, 2^{24}]$ range, we lose the ability to represent non-integer values.

We can easily confirm this by writing a tiny C program:

#include <stdio.h>

int main(void) {
    const float val1 = 4194305.5f, // just above 2^22
                val2 = 5194305.5f,
                val3 = 6194305.5f,
                val4 = 8388607.5f; // just below 2^23

    const float val5 = 8388608.5f, // just above 2^23
                val6 = 8388610.5f; // 2^23 + 2.5

    printf("val1: %.5f\n", val1);
    printf("val2: %.5f\n", val2);
    printf("val3: %.5f\n", val3);
    printf("val4: %.5f\n", val4);
    puts("---------");
    printf("val5: %.10f\n", val5);
    printf("val6: %.10f\n", val6);
}

Variables val1 through val4 all fall into the $[2^{22}, 2^{23}]$ range, i.e. the last range that supports the fractional part. The precision in that range is $\frac{2^{23} - 2^{22}}{2^{23}} = 0.5$, so we use that in the code.

Variables val5 and val6 dip into the very beginning of the $[2^{23}, 2^{24}]$ range - the first one that loses support for fractional parts. We can prove this by examining the output of the code above:

val1: 4194305.50000
val2: 5194305.50000
val3: 6194305.50000
val4: 8388607.50000
---------
val5: 8388608.0000000000
val6: 8388610.0000000000

As far as the double-precision format goes, that cutoff point comes a lot later - at $2^{52}$, which isn’t even visible on the graph.