Understanding String Encoding
Encoding
Strings are represented as raw bytes in memory, and bytes are meaningless unless we agree on a rule for reading and writing them. Encodings are standards for both translating raw bytes into characters and writing characters back into raw bytes.
In other words, characters are the truths while encodings are languages for the truths; an encoding is the language of language.
- Binary: The Language of Electricity.
- Hex: The Language of Binary Groups.
- Encoding: The Language of Characters.
- Character Set: The Truth of Human Expression.
Coded Character Set
A Character Set is a certain set of concrete symbols, such as A to Z in human language. A Coded Character Set is a Character Set with each item uniquely mapped to a numeric value.^1 These numeric values are called Code Points.
Coded Charset is a Map
A coded character set is therefore a map from characters to numeric values.^1 A character set itself has no inherent index; its symbols exist abstractly in mathematics, but to represent their range on a computer we have to cope with the size of integers.
NOTE
In Unix and Unix-like systems, the term charmap is commonly used.
Encoding as Charset?
In the early days of computing, there was no intermediate map like a Coded Character Set: encodings translated characters directly into bytes and vice versa. With no intermediate map, the encoding itself became the de facto Coded Character Set, which is the very reason so many ancient encodings are listed alongside the UTF encodings in your editor. ASCII, for example, stores A as the concrete byte value 65, while 65 is also the "code point" of A in the ASCII table, meaning it has no abstraction gap.
NOTE
Of course, ASCII has no concept of a code point; we just borrow the idea from Unicode to apply Unicode's mental model to ASCII.
// "code point" in ASCII is identical to its character presence in byte
// meaning that it has no abstraction gap
_ = Encoding.ASCII.GetBytes("AB") is [65, 66]; // true
_ = Encoding.ASCII.GetBytes("AB") is [(byte)'A', (byte)'B']; // true
Unicode
Universal Character Set
The Universal Character Set (UCS) is a Coded Character Set designed by Unicode. It originally contained 2^16 characters, enough to cover all human characters known at that time, using 16-bit (2-byte) values. This initial version of UCS is known as UCS-2; the 2 means it takes 2 bytes to represent any character in the set (by code point).
What is a Plane
But our civilization keeps evolving and creates more and more symbols to be added to UCS. Since 16 bits obviously aren't enough to represent them all, Unicode designed a larger range for UCS. The original 2^16 range was afterwards named the Basic Multilingual Plane (BMP), which includes the most commonly used characters. Unicode added 16 extra planes (17 in total) for future use, namely the supplementary planes. Each supplementary plane has the same number of slots: 65,536.
Not all slots of those planes are assigned code points; it's just a pre-allocation.
// 'A' is in BMP
_ = new Rune('A').Plane is 0;
// thumbs-up emoji is in plane 1
_ = "👍".EnumerateRunes().Select(r => r.Plane).ToArray() is [1];
Surrogate Pair
Anything beyond the BMP in UCS is represented by a surrogate pair: two surrogates compose into a single code point beyond 16 bits (stored as an Int32). Only code points within certain ranges can serve as part of a surrogate pair. The surrogate values are carved out of the BMP, grouped as high surrogates (D800-DBFF) and low surrogates (DC00-DFFF); each range has 1,024 candidates. That gives us 1024 × 1024 possibilities for characters beyond the BMP, exactly the number of code points in the supplementary planes.
NOTE
The high surrogates and low surrogates, 1024 + 1024 = 2048 in total, are reserved code points in the BMP.
Given that 2,048 code points are reserved as surrogates, 1,112,064 code points remain available for actual use.
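The arithmetic above is easy to verify; a quick check in C#:

```csharp
using System;

// 17 planes × 65,536 slots each
int total = 17 * 65536;          // 1,114,112 code points in total
int surrogates = 1024 + 1024;    // 2,048 reserved surrogate code points
Console.WriteLine(total - surrogates);        // 1112064 usable code points
// 1024 × 1024 pairs exactly cover the 16 supplementary planes
Console.WriteLine(1024 * 1024 == 16 * 65536); // True
```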
Unicode officially names the code points in the BMP (excluding the surrogates) together with those composed by surrogate pairs Unicode Scalar Values. Due to the variable-width nature of UCS, some languages have a dedicated type to describe a Unicode Scalar Value. System.Text.Rune is an implementation available since .NET Core 3.0; the term Rune came from the Go programming language^2.
// Too many characters in character literal [CS1012]
// _ = '👍'; // 👍 doesn't fit in a single 16-bit char
// .NET counts Length in 16-bit char units
// a surrogate pair occupies 32 bits, which results in a Length of 2
_ = "👍".Length is 2; // it's a surrogate pair
// System.Text.Rune is a proper abstraction on Unicode Scalar Value
_ = "👍".EnumerateRunes().Count() is 1; // it's a single unit in Unicode
Two surrogates are not combined into a new code point by directly concatenating their bits; there is a simple transformation formula.
C# has a \u escape syntax for 4-digit hex UCS code points, and \U for 8-digit hex code points (primarily for the calculated code points of surrogate pairs).
Console.WriteLine("\U0001F44D" == "👍"); // True
NOTE
Rune.Value is the calculated code point value.
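The transformation formula offsets each surrogate by the start of its range, then adds the result on top of the BMP; a minimal sketch using the thumbs-up emoji:

```csharp
using System;

// 👍 (U+1F44D) is the surrogate pair D83D DC4D in UTF16
char high = '\uD83D';
char low = '\uDC4D';

// the formula: 0x10000 + (high − 0xD800) × 0x400 + (low − 0xDC00)
int codePoint = 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00);
Console.WriteLine(codePoint == 0x1F44D); // True

// the BCL performs the same calculation
Console.WriteLine(char.ConvertToUtf32(high, low) == 0x1F44D); // True
```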
Combining Character^3
One character might have a counterpart/variant in other cultures, with a little modification. Chinese pinyin has diacritics (变音符号) on finals (韵母): ü is a modification of u, and appending another diacritic to ü makes it the final ǚ. Here u is the Base Letter and the diacritics are Combining Marks^4.
IMPORTANT
Combining Mark is just one of the types of Combining Character in the Unicode specification. We can inspect the category of a Unicode character using Rune.GetUnicodeCategory.
Most of these letters with diacritics have a dedicated code point in the BMP, which is called a Precomposed Character:
// They're all on BMP
_ = "u".Length is 1;
_ = "ü".Length is 1;
_ = "ǚ".Length is 1;
_ = new Rune('u').Plane is 0;
_ = new Rune('ü').Plane is 0;
_ = new Rune('ǚ').Plane is 0;
But they also have a composed form, which combines the base letter with combining marks, as the specification describes:
In the Unicode Standard, all combining characters are to be used in sequence following the base characters to which they apply.^5
We can "unpack" combined characters using Normalization.
The benefit of inventing yet another form for existing characters is that the composed form forces encodings to read the base letter before the combining marks, so older Unicode implementations can still display the base and render the combining characters they don't understand as non-characters (a box or question mark).
This preserves some backward compatibility. A typical example is emoji, which are also composable: an emoji can be composed of a regular base 👍 (U+1F44D) with a skin tone (U+1F3FC), as in the following example. When someone receives the emoji with a skin tone on an old device with an old implementation, it displays a regular thumbs-up followed by a non-character, since the old implementation doesn't support the skin tone. This composing method carries the important information from the modern world to ancient clients.
NOTE
Both the emoji base and the skin tones are on Plane 1; their 5-digit code points already imply this.
_ = "👍".Length is 2; // it's a surrogate pair
// emoji with skin tone is 64-bit
_ = "👍🏼".Length is 4;
// System.Text.Rune is a proper abstraction on Unicode Scalar Value
_ = "👍".EnumerateRunes().Count() is 1; // it's a single unit in Unicode
// emoji with skin tone is two scalar values, hence two Runes
_ = "👍🏼".EnumerateRunes().Count() is 2;
// confirms that the regular thumbs-up is the head of the skin tone version
Console.WriteLine("👍🏼".EnumerateRunes().First().Value == "👍".EnumerateRunes().First().Value); // True
Unicode Normalization
The term Unicode Normalization describes conversion between the composed and decomposed forms mentioned above. Before discussing normalization, we must understand the difference between Canonical and Compatible.
Conceptually, compatibility characters are characters that would not have been encoded in the Unicode Standard except for compatibility and round-trip convertibility with other standards. Such standards include international, national, and vendor character encoding standards.^6
That is, even though Unicode has a composing strategy to represent as many characters as possible, it has to retain backward compatibility with older encodings. So some composed-looking characters, such as ¼, are actually not canonically decomposable. The standard nevertheless specifies them as compatibility decomposable, under a dedicated Normalization Form. C# has string.Normalize(NormalizationForm) to convert a string to a target form when possible. Of course, the byte presence differs in another normalization form.
byte[] original = Encoding.Unicode.GetBytes("¼");
byte[] decomposed = Encoding.Unicode.GetBytes("¼".Normalize(NormalizationForm.FormKD));
// byte presence is different
_ = original.SequenceEqual(decomposed) is false;
// different normalization forms are not identical on comparison
_ = ("¼" == "¼".Normalize(NormalizationForm.FormKD)) is false;
Keywords for compatibility decomposable characters include: <initial>, <medial>, <final>, <isolated>, <wide>, <narrow>, <small>, <square>, <vertical>, <circle>, <noBreak>, <fraction>, <sub>, <super>, and <compat>.^7
NOTE
See the full list of decomposable characters; ¼ is under the keyword <fraction>.
// ǚ is not a compatibility character
_ = "ǚ".Normalize(NormalizationForm.FormD)
    .EnumerateRunes()
    .Select(r => r.Value)
    .ToArray() is ['u', '\u0308', '\u030C'];
// ¼ is a compatibility character
_ = "¼".Normalize(NormalizationForm.FormKD)
    .EnumerateRunes()
    .Select(r => r.Value)
    .ToArray() is ['1', '\u2044', '4'];
Unicode Category
Every Unicode character has a named group in the specification, which we can inspect with Rune.GetUnicodeCategory. The classification is rather coarse; the specification currently defines 30 general categories of characters.
_ = Rune.GetUnicodeCategory(
    "é".Normalize(NormalizationForm.FormD)
        .EnumerateRunes()
        .ElementAt(1) // the diacritic is a NonSpacingMark
) is UnicodeCategory.NonSpacingMark;
_ = Rune.GetUnicodeCategory(
    "👍🏼".EnumerateRunes()
        .ElementAt(1) // skin tone is a ModifierSymbol
) is UnicodeCategory.ModifierSymbol;
Unicode Transformation Format
TIP
UTF is in fact a derived format based on the Universal Coded Character Set (UCS). This implies that we always have a coded character set before we implement an encoding; we need the map as a guide for the transformation. The charset itself is already an interface, an intermediate presence, allowing one to implement conversion from one encoding to another.
UTF16 is the direct child implemented upon UCS. As aforementioned, UCS originally represented characters only as 16-bit code points, so UTF16 naturally adopted a 16-bit minimum unit to represent characters in bytes. In UTF16-first languages like C#, the char type serves as an abstraction layer for characters in Plane 0.
sizeof(char) == 2 * sizeof(byte) // 2 * 8 = 16 bits
In the BMP, the first plane of UCS, each character's byte presence in UTF16 is identical to its code point, because why not? This implies that UTF16, just like ASCII, originally had no intermediate mapping, until surrogates were added.
Little & Big Endian
TODO: figure out what's word size and how CPU process certain size when writing string bytes. TODO: and how such architecture difference requires BOM on UTF16.
For files, a non-visible indicator preceding the first actual value is required to tell the byte order (aka endianness), namely a byte order mark (BOM). The BOM character is U+FEFF:
- Bytes FE FF at the start indicate big-endian.
- Bytes FF FE at the start indicate little-endian (a decoder naively reading them big-endian sees the noncharacter U+FFFE, which signals the swap).
_ = Encoding.Unicode.GetBytes("A") is [65, 0];
_ = Encoding.BigEndianUnicode.GetBytes("A") is [0, 65];
// the endianness applies to each 16-bit code unit
_ = Encoding.Unicode.GetBytes("👍") is [61, 216, 77, 220];
_ = Encoding.BigEndianUnicode.GetBytes("👍") is [216, 61, 220, 77];
Why UTF8 and UTF32?
UTF16 is not ASCII-compatible because its code unit is 16 bits, while ASCII as an encoding requires only 8. UTF8 was then invented primarily for backward compatibility with ASCII, and to be more memory/storage efficient.
The byte count of a UTF8 character varies from 1 to 4, making it memory efficient.
- U+0000 to U+007F (0-127): ASCII characters take 1 byte.
- U+0080 to U+07FF (128-2047): Latin, Cyrillic alphabets etc. take 2 bytes.
- U+0800 to U+FFFF (2048-65535): the rest of the BMP takes 3 bytes.
- U+010000 to U+10FFFF (65536-1114111): supplementary plane characters (surrogate pairs in UTF16) take 4 bytes.
Notice that UTF8 takes one more byte than UTF16 for most BMP characters beyond ASCII. UTF8 is therefore more memory-efficient when content is primarily ASCII, while UTF16 outperforms when content contains many other BMP characters, such as CJK.
Almost all old C/C++ libraries prefer UTF8 due to its ASCII compatibility, because the char type in C is 8 bits. UTF8 also prevailed on the web since it's more compact and transmission-efficient: an HTML document always has more ASCII than CJK characters.
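The trade-off is easy to measure with Encoding.GetByteCount; a small sketch ("你好" is my sample CJK string, not from the article):

```csharp
using System.Text;

// ASCII-heavy content: UTF8 uses half the bytes of UTF16
_ = Encoding.UTF8.GetByteCount("hello") is 5;
_ = Encoding.Unicode.GetByteCount("hello") is 10;
// CJK characters sit in UTF8's 3-byte range, so UTF16 wins here
_ = Encoding.UTF8.GetByteCount("你好") is 6;
_ = Encoding.Unicode.GetByteCount("你好") is 4;
```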
UTF8 doesn't have the endianness issue of UTF16, because when reading the current byte it already knows how much to read next. Each starting byte implies the range the character falls into; the ranges were specifically designed so that each has a distinct leading bit pattern.
Once you read A, binary 01000001, the leading 0 implies it's an ASCII character: read just this one byte as a whole character. If the current byte is 11100101, the leading 1110 says to read the next two bytes and treat the three as a whole. This design makes UTF8 self-describing at the byte level, with no byte-order issue.
| First Byte Bit Pattern | Meaning | Total Bytes |
|---|---|---|
| 0xxxxxxx | ASCII Character | 1 Byte |
| 110xxxxx | Latin, Cyrillic alphabets etc | 2 Bytes |
| 1110xxxx | The rest characters of BMP | 3 Bytes |
| 11110xxx | Supplementary plane characters | 4 Bytes |
| 10xxxxxx | Continuation Byte | (Invalid as a start) |
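The table above can be turned into a tiny decoder sketch; SequenceLength is a hypothetical helper name, not a BCL API:

```csharp
using System;
using System.Text;

// derive the UTF8 sequence length from the first byte's leading bits
static int SequenceLength(byte first) => first switch
{
    < 0b1000_0000 => 1,                    // 0xxxxxxx: ASCII
    >= 0b1100_0000 and < 0b1110_0000 => 2, // 110xxxxx
    >= 0b1110_0000 and < 0b1111_0000 => 3, // 1110xxxx
    >= 0b1111_0000 and < 0b1111_1000 => 4, // 11110xxx
    _ => throw new ArgumentException("continuation byte or invalid start")
};

_ = SequenceLength(Encoding.UTF8.GetBytes("A")[0]) is 1;
_ = SequenceLength(Encoding.UTF8.GetBytes("ü")[0]) is 2;
_ = SequenceLength(Encoding.UTF8.GetBytes("你")[0]) is 3;
_ = SequenceLength(Encoding.UTF8.GetBytes("👍")[0]) is 4;
```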
UTF32 is simply a radical evolution over UTF16: it takes 32 bits for every character, making it consistent in width but very space-consuming, and it also has the byte-order problem.
Unicode as Interface
To convert UTF16 to UTF8, we use UCS as an interface: first convert the UTF16 bytes into their code points, then convert the code points into UTF8 bytes. UTF encodings, then, are implementations of how to convert bytes to UCS code points and vice versa.
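The BCL exposes this interface idea directly as Encoding.Convert, which re-encodes bytes by going through the shared code points; a minimal sketch:

```csharp
using System.Linq;
using System.Text;

// re-encode UTF16 bytes as UTF8 by meeting at the code-point level
byte[] utf16Bytes = Encoding.Unicode.GetBytes("A👍");
byte[] utf8Bytes = Encoding.Convert(Encoding.Unicode, Encoding.UTF8, utf16Bytes);
// same code points, different byte presence
_ = utf8Bytes.SequenceEqual(Encoding.UTF8.GetBytes("A👍")); // true
_ = utf16Bytes.Length is 6; // 'A' (2 bytes) + surrogate pair (4 bytes)
_ = utf8Bytes.Length is 5;  // 'A' (1 byte) + U+1F44D (4 bytes)
```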