F-UTF-8 (an acronym) is an extension of UTF-8 that hates you.
Standard UTF-8 is cool. It’s a variable-width encoding, which means that some Unicode codepoints are encoded using 1 byte, some using 2, some 3, and some are even 4 bytes long.
Now, the mechanism used to determine the length of a codepoint in UTF-8 is quite simple, actually! ASCII characters are output as-is, as a single byte. For all other characters, the first byte of the sequence has its N highest bits set and the (N+1)th highest bit unset, where N is the number of bytes used in the final encoding. The rest of the bytes all have their two top bits set to 10, and the remaining bits encode the actual binary data of the codepoint. Here, let me show you:
Character | Binary codepoint | UTF-8 bytes |
---|---|---|
a | 01100001 | 01100001 |
ä | 11100100 | 11000011 + 10100100 |
あ | 00110000 + 01000010 | 11100011 + 10000001 + 10000010 |
🦆 | 00000001 + 11111001 + 10000110 | 11110000 + 10011111 + 10100110 + 10000110 |
Red bits encode the data, blue bits tell you how long the sequence is, and green bits let you know that you’re in the middle of a multibyte sequence.
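The mechanism above can be sketched in a few lines of Rust (the standard library's `char::encode_utf8` already does this for real; this function is just an illustration of the rules):

```rust
// A sketch of the UTF-8 encoding rules described above.
// Note: a real encoder would also reject surrogates (U+D800..U+DFFF).
fn encode_utf8(cp: u32) -> Vec<u8> {
    match cp {
        // ASCII: output the byte as-is
        0x00..=0x7F => vec![cp as u8],
        // 2 bytes: 110xxxxx 10xxxxxx
        0x80..=0x7FF => vec![
            0b1100_0000 | (cp >> 6) as u8,
            0b1000_0000 | (cp & 0x3F) as u8,
        ],
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        0x800..=0xFFFF => vec![
            0b1110_0000 | (cp >> 12) as u8,
            0b1000_0000 | ((cp >> 6) & 0x3F) as u8,
            0b1000_0000 | (cp & 0x3F) as u8,
        ],
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        0x10000..=0x10FFFF => vec![
            0b1111_0000 | (cp >> 18) as u8,
            0b1000_0000 | ((cp >> 12) & 0x3F) as u8,
            0b1000_0000 | ((cp >> 6) & 0x3F) as u8,
            0b1000_0000 | (cp & 0x3F) as u8,
        ],
        _ => panic!("not a Unicode codepoint"),
    }
}

fn main() {
    // The rows of the table above:
    assert_eq!(encode_utf8('a' as u32), vec![0x61]);
    assert_eq!(encode_utf8('ä' as u32), vec![0xC3, 0xA4]);
    assert_eq!(encode_utf8('あ' as u32), vec![0xE3, 0x81, 0x82]);
    assert_eq!(encode_utf8('🦆' as u32), vec![0xF0, 0x9F, 0xA6, 0x86]);
}
```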
Why am I telling you this? Well, if you notice above, UTF-8 inserts some padding zero bits into the red section. This is of course done to make the final result byte-aligned. Buuuuuuuut… Who’s stopping you from adding more padding?
… The Unicode Consortium. They’re stopping you from adding more padding. It’s what they call “an overlong encoding” and “invalid UTF-8” and other nonsense like that. Well, I stand up to the establishment! I will make the overlongest of overlong encodings, consortium be darned!
So with that said, welcome to F-UTF-8! This is an extension of UTF-8 which not only permits overlong encodings with too much padding, it requires them. In fact, it requires the maximal amount of padding for every single codepoint. The table above turns into:
Character | Binary codepoint | F-UTF-8 bytes |
---|---|---|
a | 01100001 | 11111111 + 10000000 + 10000000 + 10000000 + 10000000 + 10000000 + 10000001 + 10100001 |
ä | 11100100 | 11111111 + 10000000 + 10000000 + 10000000 + 10000000 + 10000000 + 10000011 + 10100100 |
あ | 00110000 + 01000010 | 11111111 + 10000000 + 10000000 + 10000000 + 10000000 + 10000011 + 10000001 + 10000010 |
🦆 | 00000001 + 11111001 + 10000110 | 11111111 + 10000000 + 10000000 + 10000000 + 10000000 + 10011111 + 10100110 + 10000110 |
This is glorious! This is beautiful! Rejoice!
As is typical for trendsetters like me, I have implemented this new standard in Rust.
## Comparisons
Standard UTF-8 is ubiquitous in digital communications and data storage, for a few simple reasons:
- It’s compact; UTF-8 leads to space and bandwidth savings¹.
- It’s compatible with ASCII; programs that handle UTF-8 support the enormous world of software dispensing ASCII, out of the box!
- It’s standardized; UTF-8 is specified by the Unicode Consortium and adopted by just about every major technical system and programming language worldwide².
- It has sane error handling and fallback mechanisms; you’ll be glad when you’re messing around with bytes you don’t exactly know the source of.
F-UTF-8 has none of these features!
- It’s bloated; every single codepoint is encoded with 8 bytes, which is twice as wide as even the chonky UTF-32! More than half of this space is wasted on the unnecessary padding!
- It’s incompatible with ASCII; code points that fit in one ASCII byte are nevertheless expanded into 8. Hell, it’s even incompatible with UTF-8, due to its liberal use of 0xFF bytes (which are always invalid in UTF-8).
- It’s niche and nonstandard; do you trust a random git repo to be well supported?
- It has no error handling; if something fails to decode, TOO BAD!
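Those last two points are easy to see from the standard library's side of the fence: a standard decoder rejects F-UTF-8 outright (0xFF can never appear in valid UTF-8), while UTF-8's own fallback machinery degrades gracefully instead of giving up. A quick sketch:

```rust
fn main() {
    // F-UTF-8 for 'a', per the table above. Every byte sequence
    // starts with 0xFF, which is invalid in standard UTF-8, so a
    // strict decoder rejects it outright.
    let futf8_a: &[u8] = &[0xFF, 0x80, 0x80, 0x80, 0x80, 0x80, 0x81, 0xA1];
    assert!(std::str::from_utf8(futf8_a).is_err());

    // UTF-8's fallback mechanism: lossy decoding substitutes
    // U+FFFD (the replacement character) rather than failing.
    let lossy = String::from_utf8_lossy(futf8_a);
    assert!(lossy.contains('\u{FFFD}'));
}
```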