UTF-8

2020-12-03

As of writing, Unicode contains 1,112,064 possible codepoints.

Actual codepoints range from 0 through 0x10FFFF though a part is reserved for backwards compatibility of other codecs with UTF-16, resulting in 1,112,064 actually available codepoints.

It’s important to note that not all (in fact most) codepoints are not characters. A common example is the Combining Diaeresis (e.g. ï) or the “double dot above” mark that, when rendered, simply appears above the adjecent previous character. In fact I can’t even paste it in neovim on its own, it simply appears above the previously typed character.

UTF-8

UTF-8 is a variable Unicode encoding in a sense that not all codepoints occupy the same amount of bytes in their encoded form. This is different from, for example, UTF-32 where every single codepoint always occupies exactly 4 bytes. This makes it very easy to parse but it also makes it very memory inefficient. Back to UTF-8, the maximum number of bytes an encoded codepoint can occupy is 4, even though only 21 bits are required to represent all possible codepoints, and we’ll see why shortly.

Encoding rules

Number of bytes	First code point	Last code point	Byte 1	Byte 2	Byte 3	Byte 4
1	U+0000	U+007F	0xxxxxxx	N/A	N/A	N/A
2	U+0080	U+07FF	110xxxxx	10xxxxxx	N/A	N/A
3	U+0800	U+FFFF	1110xxxx	10xxxxxx	10xxxxxx	N/A
4	U+10000	U+10FFFF	11110xxx	10xxxxxx	10xxxxxx	10xxxxxx

These rules made little sense to me at the beginning. However, as I read more and more of the Wikipedia article on UTF-8 things became clearer. For example, all codepoints between U+0000 and U+007F ooccupy exactly one byte and that byte cannot be larger than 127 (0x7F) in order to maintain complete interchangeability with US-ASCII. If we encounter a single byte <= 0x7F, we know that it’s a single byte codepoint and an ASCII character. (though (byte & 0x80) == 0 is a more portable check if we’re also dealing with signed types)

The restrictions for the 3 following codepoint ranges exist for similar reasons (mainly UTF-16 compatibility I think, but I’m not too sure on these), for example, for all codepoints between 0x800 and 0xFFFF, 3 bytes are required even though all numbers in that range can fit into 2 bytes, since 5 bits in the 2byte form are reserved.

Example

Let’s look at how the euro sign (€) is encoded in UTF-8.

The Unicode codepoint of the euro sign is U+20AC. If we look at the table above, we see that it falls between codepoint U+0800 and codepoint U+FFFF, meaning that it takes 3 bytes to encode in memory.
0x20AC in binary are the two bytes: 0b00100000 and 0b10101100.
Byte by byte we “replace” all the available bytes from the sequence in the table by the bytes from the codepoint. If you look at the table, you can see that for codepoints that take 3 bytes to encode, we have exactly 16 bits (2 bytes) available.
When we replace them, we get: 0b11100010, 0b10000010 and 0b10101100, or in hex: 0xE2, 0x82, 0xAC.
From our shell we can see that this is exactly how the € is encoded:

1echo -n "€" | xxd -g 1 -c 10 -u
200000000: E2 82 AC                       ...

Writing an UTF-8 decoder

This is something I’ve wanted to do for a long time just because of how vital a part UTF-8 plays, well, everywhere. Literally all text that you see nowadays on screen was at some point encoded as UTF-8, then decoded to get all invididual codepoints, and then rendered to the screen with a relevant font.

String related functions in C’s stdlib pay no attention to any sort of encoding at all. They follow the ‘1 byte = 1 character’ mantra. Support for different encodings exists via Wide strings and Multi-byte strings but I’ve never had a reason to use those functions. They are also very intertwined with the C locale, which is also a very complicated topic as evident by the infamous mpv developer C locale rant.

Writing a decoder turned out to be way easier than I thought. The gist of basically the entire operation is the following: peek at the first byte of the string -> find out how many bytes the codepoint that it’s a part of occupies -> construct a codepoint from those bytes -> seek that many bytes forward in the stream -> return. Repeat for the entire string.

The only part I struggled with for a bit was the bitwise manipulation regarding specific bytes as I had little experience with that prior to this project - that is now on GitHub as butf8. In the future I’m planning to add encoding as well as indexing support and error checking - all of which should be fairly trivial.

UTF-8

Encoding rules

Example

Writing an UTF-8 decoder

Resources