F-UTF-8 (an acronym) is an extension of UTF-8 that hates you.
Standard UTF-8 is cool. It’s a variable-width encoding, which means that some Unicode codepoints are encoded using 1 byte, some using 2, some 3, and some are even 4 bytes long.
Now, the mechanism used to determine the length of a codepoint in UTF-8 is quite simple, actually! ASCII characters are output as-is, as a single byte. For all other characters, the first byte of the sequence has its N highest bits set and the (N+1)th highest bit unset, where N is the number of bytes used in the final encoding. The rest of the bytes all have their two top bits set to 10, and the remaining bits encode the actual binary data of the codepoint. Here, let me show you:
Character | Binary codepoint | UTF-8 bytes |
---|---|---|
a | 01100001 | 01100001 |
ä | 11100100 | 11000011 + 10100100 |
あ | 00110000 + 01000010 | 11100011 + 10000001 + 10000010 |
🦆 | 00000001 + 11111001 + 10000110 | 11110000 + 10011111 + 10100110 + 10000110 |
Red bits encode the data, blue bits tell you how long the sequence is, and green bits let you know that you’re in the middle of a multibyte sequence.
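The mechanism above can be sketched in a few lines of Rust (the standard library's `char::encode_utf8` already does this for real; this function is just an illustration of the rules):

```rust
// A sketch of the UTF-8 encoding rules described above.
// Note: a real encoder would also reject surrogates (U+D800..U+DFFF).
fn encode_utf8(cp: u32) -> Vec<u8> {
    match cp {
        // ASCII: output the byte as-is
        0x00..=0x7F => vec![cp as u8],
        // 2 bytes: 110xxxxx 10xxxxxx
        0x80..=0x7FF => vec![
            0b1100_0000 | (cp >> 6) as u8,
            0b1000_0000 | (cp & 0x3F) as u8,
        ],
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        0x800..=0xFFFF => vec![
            0b1110_0000 | (cp >> 12) as u8,
            0b1000_0000 | ((cp >> 6) & 0x3F) as u8,
            0b1000_0000 | (cp & 0x3F) as u8,
        ],
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        0x10000..=0x10FFFF => vec![
            0b1111_0000 | (cp >> 18) as u8,
            0b1000_0000 | ((cp >> 12) & 0x3F) as u8,
            0b1000_0000 | ((cp >> 6) & 0x3F) as u8,
            0b1000_0000 | (cp & 0x3F) as u8,
        ],
        _ => panic!("not a Unicode codepoint"),
    }
}

fn main() {
    // The rows of the table above:
    assert_eq!(encode_utf8('a' as u32), vec![0x61]);
    assert_eq!(encode_utf8('ä' as u32), vec![0xC3, 0xA4]);
    assert_eq!(encode_utf8('あ' as u32), vec![0xE3, 0x81, 0x82]);
    assert_eq!(encode_utf8('🦆' as u32), vec![0xF0, 0x9F, 0xA6, 0x86]);
}
```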
Why am I telling you this? Well, if you notice above, UTF-8 inserts some padding zero bits into the red section. This is of course done to make the final result byte-aligned. Buuuuuuuut… Who’s stopping you from adding more padding?
… The Unicode Consortium. They’re stopping you from adding more padding. It’s what they call “an overlong encoding” and “invalid UTF-8” and other nonsense like that. Well, I stand up to the establishment! I will make the overlongest of overlong encodings, consortium be darned!
So with that said, welcome to F-UTF-8! This is an extension of UTF-8 which not only permits overlong encodings with too much padding, it requires them. In fact, it requires the maximal amount of padding for every single codepoint. The table above turns into:
Character | Binary codepoint | F-UTF-8 bytes |
---|---|---|
a | 01100001 | 11111111 + 10000000 + 10000000 + 10000000 + 10000000 + 10000000 + 10000001 + 10100001 |
ä | 11100100 | 11111111 + 10000000 + 10000000 + 10000000 + 10000000 + 10000000 + 10000011 + 10100100 |
あ | 00110000 + 01000010 | 11111111 + 10000000 + 10000000 + 10000000 + 10000000 + 10000011 + 10000001 + 10000010 |
🦆 | 00000001 + 11111001 + 10000110 | 11111111 + 10000000 + 10000000 + 10000000 + 10000000 + 10011111 + 10100110 + 10000110 |
This is glorious! This is beautiful! Rejoice!
As is typical for trendsetters like me, I have implemented this new standard in Rust.
## Comparisons
Standard UTF-8 is ubiquitous in digital communications and data storage, for a few simple reasons:
- It’s compact; UTF-8 leads to space and bandwidth savings¹.
- It’s compatible with ASCII; programs that handle UTF-8 support the enormous world of software dispensing ASCII, out of the box!
- It’s standardized; UTF-8 is specified by the Unicode Consortium and adopted by just about every major technical system and programming language worldwide².
- It has sane error handling and fallback mechanisms; you’ll be glad when you’re messing around with bytes you don’t exactly know the source of.
F-UTF-8 has none of these features!
- It’s bloated; every single codepoint is encoded with 8 bytes, which is twice as wide as even the chonky UTF-32! More than half of this space is wasted on the unnecessary padding!
- It’s incompatible with ASCII; code points that fit in one ASCII byte are nevertheless expanded into 8. Hell, it’s even incompatible with UTF-8, due to its liberal use of 0xFF bytes (which are always invalid in UTF-8).
- It’s niche and nonstandard; do you trust a random git repo to be well supported?
- It has no error handling; if something fails to decode, TOO BAD!
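Those last two points are easy to see from the standard library's side of the fence: a standard decoder rejects F-UTF-8 outright (0xFF can never appear in valid UTF-8), while UTF-8's own fallback machinery degrades gracefully instead of giving up. A quick sketch:

```rust
fn main() {
    // F-UTF-8 for 'a', per the table above. Every byte sequence
    // starts with 0xFF, which is invalid in standard UTF-8, so a
    // strict decoder rejects it outright.
    let futf8_a: &[u8] = &[0xFF, 0x80, 0x80, 0x80, 0x80, 0x80, 0x81, 0xA1];
    assert!(std::str::from_utf8(futf8_a).is_err());

    // UTF-8's fallback mechanism: lossy decoding substitutes
    // U+FFFD (the replacement character) rather than failing.
    let lossy = String::from_utf8_lossy(futf8_a);
    assert!(lossy.contains('\u{FFFD}'));
}
```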