User:Twrl/Unicode
Unicode is a standard way of representing text in many different scripts and languages. As of Unicode version 8.0 the standard covers more than 120,000 characters in 129 different scripts.
Structure
Unicode itself is an abstract thing. It consists of abstract characters and code points, their properties, and rules about composing and decomposing them. The actual representation of Unicode text is in one of the various Unicode Transformation Formats (the UTF-*s that you see around the place).
The practical upshot is that there are multiple levels of abstraction. There is not a one-to-one correspondence between code points and abstract characters: for example, é can be represented by the single code point U+00E9 (LATIN SMALL LETTER E WITH ACUTE), or by the sequence U+0065 (LATIN SMALL LETTER E) followed by U+0301 (COMBINING ACUTE ACCENT). The two representations are canonically equivalent, and for almost all purposes they should be treated as identical.
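To make the distinction concrete, here is a minimal sketch that stores both spellings of é as sequences of code points. The names precomposed, decomposed and sameCodePoints are made up for this illustration. It shows that a plain element-by-element comparison sees two different strings, even though they represent the same abstract character.

```cpp
#include <string>

// Precomposed form: U+00E9 LATIN SMALL LETTER E WITH ACUTE.
const std::u32string precomposed = {0x00E9};

// Decomposed form: U+0065 LATIN SMALL LETTER E, then U+0301 COMBINING ACUTE ACCENT.
const std::u32string decomposed = {0x0065, 0x0301};

// Compares code point by code point, with no normalisation applied.
// (Hypothetical helper; a real comparison would normalise both sides first.)
bool sameCodePoints(const std::u32string& a, const std::u32string& b) {
    return a == b;
}
```

Here sameCodePoints(precomposed, decomposed) is false, which is why text usually needs to be normalised to a single form before it is compared or searched.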
A code point is a number in the range U+0000 to U+10FFFF. These are conventionally written with the prefix U+ rather than 0x to make the semantics clear: they are abstract codes, not data values. Each code point has various properties associated with it, such as a name, category, block, case conversions, directionality, and so on. Not every property is defined for every code point.
Code points are organised into planes and blocks. There are 17 planes, numbered 0 to 10 in hex; the plane number is the high-order part of the code point's value. The majority of characters are assigned in plane 0 (the Basic Multilingual Plane) or plane 1 (the Supplementary Multilingual Plane).
Plane | Code Range | Name |
---|---|---|
0 | U+0000 to U+FFFF | Basic Multilingual Plane |
1 | U+10000 to U+1FFFF | Supplementary Multilingual Plane |
2 | U+20000 to U+2FFFF | Supplementary Ideographic Plane |
3 | U+30000 to U+3FFFF | Unused |
4 | U+40000 to U+4FFFF | Unused |
5 | U+50000 to U+5FFFF | Unused |
6 | U+60000 to U+6FFFF | Unused |
7 | U+70000 to U+7FFFF | Unused |
8 | U+80000 to U+8FFFF | Unused |
9 | U+90000 to U+9FFFF | Unused |
10 | U+A0000 to U+AFFFF | Unused |
11 | U+B0000 to U+BFFFF | Unused |
12 | U+C0000 to U+CFFFF | Unused |
13 | U+D0000 to U+DFFFF | Unused |
14 | U+E0000 to U+EFFFF | Supplementary Special-purpose Plane |
15 | U+F0000 to U+FFFFF | Supplementary Private Use Area-A |
16 | U+100000 to U+10FFFF | Supplementary Private Use Area-B |
As you can see, a very large chunk of the code point space is completely unused, and another big chunk is given over to private use: all of planes 15 and 16, plus a smaller Private Use Area in the Basic Multilingual Plane. That leaves plenty of room for people into conlanging and conscripting to define their own characters (the ConScript Unicode Registry tries to coordinate use of the PUAs), or more likely for linguists and anthropologists to try out different things.
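Because the plane is just the high-order part of the code point, it can be recovered with a shift. The sketch below uses two made-up helper names, planeOf and isPrivateUse. The private use ranges are written out explicitly because the last two code points of every plane are noncharacters, so planes 15 and 16 are not private use right up to their final value.

```cpp
// Returns the plane number (0 to 16) of a code point, or -1 if the value
// is above U+10FFFF and so is not a code point at all.
int planeOf(char32_t codePoint) {
    if (codePoint > 0x10FFFF) return -1;
    return static_cast<int>(codePoint >> 16);
}

// True for the three private use ranges: U+E000..U+F8FF in the BMP,
// U+F0000..U+FFFFD in plane 15, and U+100000..U+10FFFD in plane 16.
bool isPrivateUse(char32_t codePoint) {
    if (codePoint >= 0xE000 && codePoint <= 0xF8FF) return true;
    if (codePoint >= 0xF0000 && codePoint <= 0xFFFFD) return true;
    return codePoint >= 0x100000 && codePoint <= 0x10FFFD;
}
```

For example, planeOf(0x1F600) is 1 and planeOf(0x10FFFF) is 16.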
Blocks roughly correspond to scripts, or script variants. A lot of them are familiar, for example Basic Latin, General Punctuation, Box Drawing, etc. Blocks are useful for organising such a vast array of code points, because each block gives a name to a contiguous range of characters with related usage.
Encoding
There are various ways of encoding Unicode text. These are called Unicode Transformation Formats. The three most common are UTF-8, UTF-16, and UTF-32.
UTF-32
UTF-32 is the simplest. Each code point is represented as a 32-bit integer, whose value is the value of the code point. There's no real transformation step. A string in UTF-32 might be represented as
char32_t* myString;
and manipulated just like a regular C-style ASCII string. It is still worth bearing in mind that there is more than one possible sequence of code points to represent some characters.
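As an illustration of the "just like a C string" point, here is a sketch of a strlen-style function for zero-terminated UTF-32. The name u32len is invented for this example and is not a standard library function.

```cpp
#include <cstddef>

// Length of a zero-terminated UTF-32 string, counted in code points.
// Works exactly like strlen, because each element is one whole code point.
std::size_t u32len(const char32_t* s) {
    std::size_t n = 0;
    while (s[n] != 0) ++n;
    return n;
}
```

Note that this counts code points, not characters as a reader would see them: a letter followed by a combining accent still counts as two.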
UTF-16
UTF-16 is what vendors have historically meant when they talk about Unicode support. In UTF-16, text is represented as a sequence of 16-bit values. Every code point in the Basic Multilingual Plane is coded as its own value. Every code point in the other planes is represented as a sequence of two values called a surrogate pair.
There is a reserved range in the Basic Multilingual Plane, code points U+D800 to U+DFFF, which is never assigned to characters. These values are used by UTF-16 to construct surrogate pairs.
Each surrogate pair consists of a high surrogate (in the range U+D800 to U+DBFF) followed by a low surrogate (in the range U+DC00 to U+DFFF). To encode a code point outside the BMP into UTF-16:
- Subtract 0x10000 from the code point, leaving a 20-bit number
- Add the 10 most significant bits to the value 0xD800; this gives the value of the high surrogate
- Add the 10 least significant bits to the value 0xDC00; this gives the value of the low surrogate
For example, a simple function (in C++) which reads a UTF-32 stream and writes to a UTF-16 stream might look like this:
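#include <istream>
#include <ostream>

// Reads UTF-32 code points from utf32in and writes UTF-16 code units to utf16out.
void translate(std::basic_istream<char32_t>& utf32in, std::basic_ostream<char16_t>& utf16out) {
    char32_t u32;
    // Test the read itself: checking eof() before get() would process one bogus value at end of input.
    while (utf32in.get(u32)) {
        if (u32 > 0xFFFF) {
            // Outside the BMP: split the 20-bit offset into a surrogate pair.
            u32 -= 0x10000;
            char16_t u16hi = static_cast<char16_t>(0xD800 + ((u32 >> 10) & 0x3FF));
            char16_t u16lo = static_cast<char16_t>(0xDC00 + (u32 & 0x3FF));
            utf16out.put(u16hi).put(u16lo);
        } else {
            // Inside the BMP: the code unit is the code point's own value.
            utf16out.put(static_cast<char16_t>(u32));
        }
    }
}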
Obviously there are other things that you can do in a function like this. For example, you can normalise as you go, you can check that the code points are actually defined, etc.
Since virtually all modern text only requires code points in the Basic Multilingual Plane, surrogate pairs are actually quite rare in general use and functions to encode and decode them are often under-tested. On the plus side, the algorithm is relatively simple so there's little to go wrong with it.
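Decoding is the same arithmetic run backwards. The following is a minimal sketch, not a definitive implementation: decodeUtf16 is a name invented here, it works on whole strings rather than streams, and replacing unpaired surrogates with U+FFFD (REPLACEMENT CHARACTER) is a policy chosen for this example. Other code might reasonably reject such input instead.

```cpp
#include <cstddef>
#include <string>

// Decodes UTF-16 code units into UTF-32 code points.
// Assumption of this sketch: unpaired surrogates become U+FFFD.
std::u32string decodeUtf16(const std::u16string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t unit = in[i];
        bool isHigh = unit >= 0xD800 && unit <= 0xDBFF;
        bool isLow = unit >= 0xDC00 && unit <= 0xDFFF;
        bool nextIsLow = i + 1 < in.size() &&
                         in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF;
        if (isHigh && nextIsLow) {
            // Recombine the two 10-bit halves and add back the 0x10000 offset.
            char32_t low = in[i + 1];
            out.push_back(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00));
            ++i;  // the low surrogate has been consumed too
        } else if (isHigh || isLow) {
            // A surrogate without its partner is not valid UTF-16.
            out.push_back(0xFFFD);
        } else {
            // A BMP code unit is its own code point.
            out.push_back(unit);
        }
    }
    return out;
}
```

For example, the pair 0xD83D 0xDE00 decodes to U+1F600: (0xD83D - 0xD800) is 0x3D, shifted left by 10 gives 0xF400, plus (0xDE00 - 0xDC00) gives 0xF600, plus 0x10000 gives 0x1F600. Test data for a decoder like this should deliberately include characters outside the BMP and broken pairs, since ordinary text rarely exercises those paths.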
UTF-8
UTF-8 is probably the most common of the Unicode encodings today, and is very widely used on the web and in email. It has the advantage of being compatible with 7-bit ASCII, which makes it good for applications where there's likely to be legacy data. The downside is that UTF-8 encodes code points as sequences of 1, 2, 3, or 4 bytes, which makes it slightly more complicated to implement than UTF-16.
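To show where the 1, 2, 3, or 4 bytes come from, here is a sketch of a UTF-8 encoder for a single code point. The function name appendUtf8 and the choice to report bad input with a bool are assumptions made for this example. The bit layout itself is the standard one: the lead byte's high bits announce the sequence length, and every following byte starts with the bits 10 and carries six bits of the code point.

```cpp
#include <string>

// Appends the UTF-8 encoding of one code point to out.
// Returns false (and appends nothing) for values that cannot be encoded:
// surrogates (U+D800..U+DFFF) and anything above U+10FFFF.
bool appendUtf8(char32_t cp, std::string& out) {
    if (cp <= 0x7F) {
        // 1 byte: 0xxxxxxx, identical to ASCII.
        out.push_back(static_cast<char>(cp));
    } else if (cp <= 0x7FF) {
        // 2 bytes: 110xxxxx 10xxxxxx.
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp <= 0xFFFF) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx.
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp <= 0x10FFFF) {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        return false;
    }
    return true;
}
```

With this sketch, U+0041 encodes as the single byte 41, U+00E9 as C3 A9, U+20AC as E2 82 AC, and U+1F600 as F0 9F 98 80.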
Others
There are several other encodings for Unicode. The best known are probably UTF-7, UTF-EBCDIC, and GB 18030. Wikipedia has a good article on the different encodings.