Introducing F-UTF-8

F-UTF-8 (an acronym) is an extension of UTF-8 that hates you.

Standard UTF-8 is cool. It’s a variable-width encoding, which means that some Unicode codepoints are encoded using 1 byte, some using 2, some 3, and some are even 4 bytes long.

Now, the mechanism used to determine the length of an encoded codepoint in UTF-8 is quite simple, actually! For ASCII characters, just output the byte as-is. Then, for all other characters, the first byte of the sequence has its N highest bits set, and the (N+1)th highest bit unset, where N is the number of bytes used in the final encoding. The rest of the bytes all have their two top bits set to 0b10, and the remaining bits of every byte encode the actual binary data of the codepoint. Here, let me show you:

| Character | Binary codepoint | UTF-8 bytes |
|---|---|---|
| a | 0b01100001 | 0b01100001 |
| ä | 0b11100100 | 0b11000011 + 0b10100100 |
| あ | 0b00110000 + 0b01000010 | 0b11100011 + 0b10000001 + 0b10000010 |
| 🦆 | 0b00000001 + 0b11111001 + 0b10000110 | 0b11110000 + 0b10011111 + 0b10100110 + 0b10000110 |

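Red bits encode the data (everything after the prefix of each byte), blue bits tell you how long the sequence is (the leading run of ones in the first byte, plus the zero that ends it), and green bits let you know that you’re in the middle of a multibyte sequence (the leading 10 of every following byte).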
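If you'd rather read code than squint at bits, here is a minimal sketch of those rules in Rust. The function name `encode_utf8` is made up for this post (the standard library already does all of this for you), and it assumes the input is a valid `char`, so surrogates never show up.

```rust
/// Encode one Unicode scalar value as standard UTF-8.
/// `encode_utf8` is a hypothetical helper name, not a library function.
fn encode_utf8(c: char) -> Vec<u8> {
    let cp = c as u32;
    // Continuation byte: top two bits are 0b10, low six bits carry data.
    let cont = |shift: u32| 0b1000_0000 | ((cp >> shift) & 0b0011_1111) as u8;
    match cp {
        // ASCII: output the byte as-is.
        0x0000..=0x007F => vec![cp as u8],
        // 2 bytes: 0b110xxxxx, then one continuation byte.
        0x0080..=0x07FF => vec![0b1100_0000 | (cp >> 6) as u8, cont(0)],
        // 3 bytes: 0b1110xxxx, then two continuation bytes.
        0x0800..=0xFFFF => vec![0b1110_0000 | (cp >> 12) as u8, cont(6), cont(0)],
        // 4 bytes: 0b11110xxx, then three continuation bytes.
        _ => vec![0b1111_0000 | (cp >> 18) as u8, cont(12), cont(6), cont(0)],
    }
}
```

Calling `encode_utf8('ä')` should give the same two bytes as `"ä".as_bytes()`, matching the table row for ä.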

Why am I telling you this? Well, if you notice above, UTF-8 inserts some padding zero bits into the red section. This is of course done to make the final result byte-aligned. Buuuuuuuut… Who’s stopping you from adding more padding?

… The Unicode Consortium. They’re stopping you from adding more padding. It’s what they call “an overlong encoding” and “invalid UTF-8” and other nonsense like that. Well, I stand up to the establishment! I will make the overlongest of overlong encodings, consortium be darned!
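To watch the establishment in action, here is a small sketch that stuffs an ASCII character into a two-byte sequence and hands it to a standards-abiding validator (Rust's `std::str::from_utf8`). The helper names `overlong_two_byte` and `is_valid_utf8` are invented for illustration.

```rust
/// Hypothetical helper: encode an ASCII byte as an overlong 2-byte
/// sequence (0b110000xx + 0b10xxxxxx), padding the data with extra zeros.
fn overlong_two_byte(ascii: u8) -> [u8; 2] {
    [
        0b1100_0000 | (ascii >> 6),
        0b1000_0000 | (ascii & 0b0011_1111),
    ]
}

/// Hypothetical helper: ask the standard library whether the bytes are
/// valid UTF-8.
fn is_valid_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}
```

Here `overlong_two_byte(b'a')` produces the bytes 0xC1 0xA1, which carry exactly the same data bits as a plain `a`, and `is_valid_utf8` refuses them anyway.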

So with that said, welcome to F-UTF-8! This is an extension of UTF-8 which not only permits overlong encodings with too much padding, it requires them. In fact, it requires putting the maximal amount of padding into every single character, so that each one takes up 8 bytes. The table above turns into:

| Character | Binary codepoint | F-UTF-8 bytes |
|---|---|---|
| a | 0b01100001 | 0b11111111 + 0b10000000 + 0b10000000 + 0b10000000 + 0b10000000 + 0b10000000 + 0b10000001 + 0b10100001 |
| ä | 0b11100100 | 0b11111111 + 0b10000000 + 0b10000000 + 0b10000000 + 0b10000000 + 0b10000000 + 0b10000011 + 0b10100100 |
| あ | 0b00110000 + 0b01000010 | 0b11111111 + 0b10000000 + 0b10000000 + 0b10000000 + 0b10000000 + 0b10000011 + 0b10000001 + 0b10000010 |
| 🦆 | 0b00000001 + 0b11111001 + 0b10000110 | 0b11111111 + 0b10000000 + 0b10000000 + 0b10000000 + 0b10000000 + 0b10011111 + 0b10100110 + 0b10000110 |
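The rows above can be reproduced with a short sketch. This is not the actual implementation, just my reading of the table: it assumes every character becomes a 0b11111111 lead byte followed by seven continuation bytes, which together hold the codepoint zero-padded to 42 bits. The names `encode_f_utf8` and `decode_f_utf8` are hypothetical.

```rust
/// Encode one character as 8 bytes: a 0xFF lead byte, then seven
/// continuation bytes (0b10xxxxxx) holding the zero-padded codepoint.
/// Assumption: this layout is inferred from the table, not from a spec.
fn encode_f_utf8(c: char) -> [u8; 8] {
    // u64 because the payload is 7 * 6 = 42 bits wide.
    let cp = c as u64;
    let mut out = [0xFF_u8; 8];
    for i in 1..8 {
        // Byte 1 carries the highest six payload bits, byte 7 the lowest.
        let shift = 6 * (7 - i);
        out[i] = 0b1000_0000 | ((cp >> shift) & 0b0011_1111) as u8;
    }
    out
}

/// Decode 8 bytes back into a character, or None if the bytes do not
/// follow the layout assumed above.
fn decode_f_utf8(bytes: &[u8; 8]) -> Option<char> {
    if bytes[0] != 0xFF {
        return None;
    }
    let mut cp: u64 = 0;
    for &b in &bytes[1..] {
        // Every trailing byte must look like 0b10xxxxxx.
        if (b & 0b1100_0000) != 0b1000_0000 {
            return None;
        }
        cp = (cp << 6) | u64::from(b & 0b0011_1111);
    }
    // Reject anything that does not fit in 32 bits or is not a valid
    // Unicode scalar value.
    if cp > u64::from(u32::MAX) {
        return None;
    }
    char::from_u32(cp as u32)
}
```

Under these assumptions `decode_f_utf8(&encode_f_utf8('🦆'))` hands the duck back, while `std::str::from_utf8` rejects every single encoded character, since 0xFF never appears in valid UTF-8.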

This is glorious! This is beautiful! Rejoice!

As is typical for trendsetters like this, I have implemented this new standard in Rust.

Comparisons

Standard UTF-8 is ubiquitous in digital communications and data storage, for a few simple reasons:

F-UTF-8 has none of these features!

  1. Unless you’re Japanese. 

  2. Except the Windows API[3], and JavaScript, and Qt

  3. for now
