Introducing F-UTF-8

F-UTF-8 (an acronym) is an extension of UTF-8 that hates you.

Standard UTF-8 is cool. It’s a variable-width encoding, which means that some Unicode codepoints are encoded using 1 byte, some using 2, some 3, and some are even 4 bytes long.

Now, the mechanism used to determine the length of a codepoint in UTF-8 is quite simple, actually! For ASCII characters, just output the byte as-is. Then, for all other characters, the first byte of the sequence has its N highest bits set, and the (N+1)th highest bit unset, where N is the number of bytes used in the final encoding. The rest of the bytes all start with the two top bits 10, and all the bits not used by these markers encode the actual binary data of the codepoint. Here, let me show you:

| Character | Binary codepoint | UTF-8 bytes |
|---|---|---|
| a | 01100001 | 01100001 |
| ä | 11100100 | 11000011 + 10100100 |
| あ | 00110000 + 01000010 | 11100011 + 10000001 + 10000010 |
| 🦆 | 00000001 + 11111001 + 10000110 | 11110000 + 10011111 + 10100110 + 10000110 |

The leading bits of the first byte (0, 110, 1110 or 11110) tell you how long the sequence is, a leading 10 lets you know that you’re in the middle of a multibyte sequence, and all the remaining bits encode the data.
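If you’d rather read the rule as code, here is a minimal sketch of the encoding described above. The name `encode_utf8_manual` is made up for illustration; in real code you would just call `char::encode_utf8` from the standard library.

```rust
/// Encode one character as standard UTF-8, by hand.
/// `encode_utf8_manual` is a hypothetical helper, not a std function.
fn encode_utf8_manual(c: char) -> Vec<u8> {
    let cp = c as u32;
    match cp {
        // 1 byte: 0xxxxxxx (ASCII, emitted as-is)
        0x0000..=0x007F => vec![cp as u8],
        // 2 bytes: 110xxxxx 10xxxxxx
        0x0080..=0x07FF => vec![
            0b1100_0000 | (cp >> 6) as u8,
            0b1000_0000 | (cp & 0x3F) as u8,
        ],
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        0x0800..=0xFFFF => vec![
            0b1110_0000 | (cp >> 12) as u8,
            0b1000_0000 | ((cp >> 6) & 0x3F) as u8,
            0b1000_0000 | (cp & 0x3F) as u8,
        ],
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        _ => vec![
            0b1111_0000 | (cp >> 18) as u8,
            0b1000_0000 | ((cp >> 12) & 0x3F) as u8,
            0b1000_0000 | ((cp >> 6) & 0x3F) as u8,
            0b1000_0000 | (cp & 0x3F) as u8,
        ],
    }
}
```

Calling it on `'ä'` gives the two bytes 11000011 and 10100100, the same as the table row.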

Why am I telling you this? Well, as you may have noticed above, UTF-8 inserts some padding zero bits in front of the data bits. This is of course done to make the final result byte-aligned. Buuuuuuuut… Who’s stopping you from adding more padding?

… The Unicode Consortium. They’re stopping you from adding more padding. It’s what they call “an overlong encoding” and “invalid UTF-8” and other nonsense like that. Well, I stand up to the establishment! I will make the overlongest of overlong encodings, consortium be darned!
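To see the establishment at work, here is a small sketch that stuffs an ASCII byte into the two-byte 110xxxxx 10xxxxxx form and asks Rust’s standard library for its opinion. The helper names `overlong_two_byte` and `std_accepts` are made up for this example; `std::str::from_utf8` is the real validator.

```rust
/// Pack an ASCII byte into the two-byte form 110xxxxx 10xxxxxx.
/// The result is an overlong encoding, since one byte would have been enough.
/// (Hypothetical helper, for illustration only.)
fn overlong_two_byte(ascii: u8) -> [u8; 2] {
    [
        0b1100_0000 | (ascii >> 6),   // lead byte carries the top bits
        0b1000_0000 | (ascii & 0x3F), // continuation byte carries the low 6 bits
    ]
}

/// Returns true if Rust's standard library accepts `bytes` as valid UTF-8.
fn std_accepts(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}
```

For `b'a'` the helper produces 11000001 + 10100001, and `std_accepts` rejects it, while the plain one-byte `a` passes.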

So with that said, welcome to F-UTF-8! This is an extension of UTF-8 which not only permits overlong encodings with too much padding, it requires them. In fact, it requires putting in the maximal amount of padding for every single codepoint. The table above turns into:

| Character | Binary codepoint | F-UTF-8 bytes |
|---|---|---|
| a | 01100001 | 11111111 + 10000000 + 10000000 + 10000000 + 10000000 + 10000000 + 10000001 + 10100001 |
| ä | 11100100 | 11111111 + 10000000 + 10000000 + 10000000 + 10000000 + 10000000 + 10000011 + 10100100 |
| あ | 00110000 + 01000010 | 11111111 + 10000000 + 10000000 + 10000000 + 10000000 + 10000011 + 10000001 + 10000010 |
| 🦆 | 00000001 + 11111001 + 10000110 | 11111111 + 10000000 + 10000000 + 10000000 + 10000000 + 10011111 + 10100110 + 10000110 |
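Going by the table alone, every character appears to become exactly 8 bytes: a 11111111 lead byte, then seven 10xxxxxx bytes holding the codepoint right-aligned in 42 bits. Below is a minimal sketch under that assumption. It is inferred from the table, not taken from the actual implementation, and the names `encode_futf8_sketch` and `decode_futf8_sketch` are hypothetical.

```rust
/// Encode one character as 8 bytes: 0xFF followed by seven `10xxxxxx` bytes.
/// Layout inferred from the table above; an assumption, not a spec.
fn encode_futf8_sketch(c: char) -> [u8; 8] {
    let cp = c as u64;
    // Start with every byte set to the bare continuation marker 10000000.
    let mut out = [0b1000_0000u8; 8];
    out[0] = 0xFF; // lead byte: all eight bits set
    for i in 0..7 {
        // Byte i + 1 carries six bits of the codepoint, most significant first.
        let shift = 6 * (6 - i);
        out[i + 1] |= ((cp >> shift) & 0x3F) as u8;
    }
    out
}

/// Decode the 8-byte form back into a character, or None if it is malformed.
fn decode_futf8_sketch(bytes: &[u8; 8]) -> Option<char> {
    if bytes[0] != 0xFF {
        return None;
    }
    let mut cp: u64 = 0;
    for &b in &bytes[1..] {
        // Every following byte must start with the bits 10.
        if (b & 0b1100_0000) != 0b1000_0000 {
            return None;
        }
        cp = (cp << 6) | (b & 0x3F) as u64;
    }
    // Reject values beyond the Unicode range; from_u32 rejects surrogates.
    if cp > 0x10FFFF {
        return None;
    }
    char::from_u32(cp as u32)
}
```

Under this layout, `encode_futf8_sketch('a')` reproduces the first table row byte for byte, and decoding the result gives `'a'` back. Since 11111111 never appears in standard UTF-8, an ordinary validator refuses the output on the very first byte.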

This is glorious! This is beautiful! Rejoice!

As is typical for trendsetters like this, I have implemented this new standard in Rust.

Comparisons

Standard UTF-8 is ubiquitous in digital communications and data storage, for a few simple reasons:

F-UTF-8 has none of these features!

  1. Unless you’re Japanese. 

  2. Except the Windows API[3], and JavaScript, and Qt 

  3. for now

