13. Character Sets and Encodings

There ain't no such thing as plain text.
— Spolsky/Atwood 

Character sets and encodings are what we use to store and transmit text with computers, making them fundamental to most every application. The two terms are sometimes used interchangeably, but the difference is important.

While the character set declares which symbols are defined and in what order, the encoding declares how those symbols are represented at the byte level, in storage or in transit over a network. If we’ve created a message and want to actually read it again later, these details are just as significant as knowledge of the human language used to convey it. Therefore, we shall strive to:

Tip:  on Text

Always declare an encoding when creating or handling textual data.

It does not make sense to have a string without knowing what encoding it uses.
— Joel Spolsky, Blogger/co-founder StackExchange 

The point Misters Spolsky and Atwood are hammering home above is that any piece of text with no encoding specified will soon have some or all of its meaning lost—later to be rendered as garbage.

Due to the complexities of human languages and historical reasons, this subject is more complicated than you might first imagine.

Hint:  Missing Characters?

If symbols in the text below are not displayed correctly, you may need to install additional fonts described below.

13.1. History

The Internet has progressed from smart people in front of dumb terminals, to dumb people in front of smart terminals.
— Proverb

Building on the human need to communicate, the history of text encoding for communication purposes is quite colorful. A few of the most notable early advances are listed below:

  • Electrical Telegraph  - early 1800s-1860s, made long-distance text-based communication possible for the first time, eventually supplanting the “Pony Express.” 

    _images/morse_code.svg

    Fig. 15 International Morse Code 

  • Morse Code  - 1830s-1840s,
    a manual telegraph code.
    S.O.S. –>  •••  –––  •••


  • Braille  (Version 2) - 1837,
    the first binary form of writing,
    designed for blind folks.


_images/baudot_patent.png

Fig. 16 Baudot’s US Patent 

  • Baudot code  (5-bit code) - 1870s, an automated (performed by machines) telegraph code.


  • Teletypes and Teleprinters - 1910s-1970s

    Invented to speed communication and reduce costs—messages could be transmitted remotely without the need for operators trained in Morse code.

How did they work? Take a look. Rotary dial included! 

In computing, especially in Unix-like operating systems, the legacy of teletypes lives on in the designations for serial ports and consoles (i.e., the text-only display mode), which have the prefix tty, such as /dev/tty5 for the fifth virtual console.
— linfo 
  • EBCDIC  - 1960s, an 8-bit text encoding for IBM mainframes, extended from the 6-bit BCD codes of the 1950s; the only computer encoding of that era besides ASCII you’ll still find mentioned in modern times.

See also:  Books

Code, by Charles Petzold

This nerdy book explores the early history of communications, writing systems, text encodings, number systems, basic electronics, and logic gates, and introduces computer architecture to the beginner. Along the way, you’ll build a simple computer from the pieces discovered and invented. Petzold’s enthusiasm draws you in, making for a thrilling read despite a subject matter some might consider dry. Recommended for the education of every computing professional who wants to know how it all actually works under the hood.

13.2. ASCII, a Seven-Bit Encoding

An encoding… with the unlikely pronunciation of ASS-key.
— Unknown (read the phrase somewhere, no help from Google)
                ____        .-'''-.      _______    .-./`)  .-./`)
    .-,       .'  __ `.    / _     \    /   __  \   \ .-.') \ .-.')     .-,
 ,-.|  \ _   /   '  \  \  (`' )/`--'   | ,_/  \__)  / `-' \ / `-' \  ,-.|  \ _
 \  '_ /  |  |___|  /  | (_ o _).    ,-./  )         `-'`"`  `-'`"`  \  '_ /  |
 _`,/ \ _/      _.-`   |  (_,_). '.  \  '_ '`)       .---.   .---.   _`,/ \ _/
(  '\_/ \    .'   _    | .---.  \  :  > (_)  )  __   |   |   |   |  (  '\_/ \
 `"/  \  )   |  _( )_  | \    `-'  | (  .  .-'_/  )  |   |   |   |   `"/  \  )
   \_/``"    \ (_ o _) /  \       /   `-'`-'     /   |   |   |   |     \_/``"
              '.(_,_).'    `-...-'      `._____.'    '---'   '---'
Flower Power font, courtesy FIGlet .
_images/tty.png

Fig. 17 The Teletype Model 33, one of the first terminals to employ ASCII. 

Introduced in 1963 then updated in 1967 and 1986, ASCII  (aka ISO 646) is the American Standard Code for Information Interchange, a ubiquitous standard for character encoding. Developed from automatic telegraph codes  (those performed by machines, with no human operator required as in Morse code), and printing teletype  (tty) technology, it was defined as a seven-bit code, therefore able to encode 128 different values. ASCII was carefully designed,  as was common in the days of acute resource constraints—each bit cost money in storage and in transit.

Centered around English text, it contains the twenty-six letter basic Latin alphabet without diacritics, in upper and lower case (a shiny new feature at the time), the numerals 0-9, the most common punctuation characters, and thirty-two now largely obsolete device- and transmission-control characters (thirty-three counting DEL). It also arranged letters and digits in numeric order, for convenient sorting.

Diving in!

Let’s take a closer look. Below ASCII is rendered as 8 columns of 16 rows (4-bits), with indexes in Decimal then Hex:

 0 00 NUL  16 10 DLE  32 20    48 30 0  64 40 @  80 50 P   96 60 `  112 70 p
 1 01 SOH  17 11 DC1  33 21 !  49 31 1  65 41 A  81 51 Q   97 61 a  113 71 q
 2 02 STX  18 12 DC2  34 22 "  50 32 2  66 42 B  82 52 R   98 62 b  114 72 r
 3 03 ETX  19 13 DC3  35 23 #  51 33 3  67 43 C  83 53 S   99 63 c  115 73 s
 4 04 EOT  20 14 DC4  36 24 $  52 34 4  68 44 D  84 54 T  100 64 d  116 74 t
 5 05 ENQ  21 15 NAK  37 25 %  53 35 5  69 45 E  85 55 U  101 65 e  117 75 u
 6 06 ACK  22 16 SYN  38 26 &  54 36 6  70 46 F  86 56 V  102 66 f  118 76 v
 7 07 BEL  23 17 ETB  39 27 '  55 37 7  71 47 G  87 57 W  103 67 g  119 77 w
 8 08 BS   24 18 CAN  40 28 (  56 38 8  72 48 H  88 58 X  104 68 h  120 78 x
 9 09 HT   25 19 EM   41 29 )  57 39 9  73 49 I  89 59 Y  105 69 i  121 79 y
10 0A LF   26 1A SUB  42 2A *  58 3A :  74 4A J  90 5A Z  106 6A j  122 7A z
11 0B VT   27 1B ESC  43 2B +  59 3B ;  75 4B K  91 5B [  107 6B k  123 7B {
12 0C FF   28 1C FS   44 2C ,  60 3C <  76 4C L  92 5C \  108 6C l  124 7C |
13 0D CR   29 1D GS   45 2D -  61 3D =  77 4D M  93 5D ]  109 6D m  125 7D }
14 0E SO   30 1E RS   46 2E .  62 3E >  78 4E N  94 5E ^  110 6E n  126 7E ~
15 0F SI   31 1F US   47 2F /  63 3F ?  79 4F O  95 5F _  111 6F o  127 7F DEL

These eight columns were called “sticks” in the official docs.

The 4-bit columns were meaningful in the design of ASCII. The original influence was binary-coded decimal and one of the major design choices involved which column should contain the decimal digits. ASCII was very carefully designed; essentially no character has its code by accident. Everything had a reason, although some of those reasons are long obsolete.
— kps on HN 

Below is a different view, to make a few relationships more obvious: four columns of thirty-two (5 bits at a time) in binary, with the first two bits set off by a hyphen:

Group 00          01              10              11
--------------    ------------    ------------    ------------
00-00000  NUL     01-00000  SP    10-00000  @     11-00000  `
00-00001  SOH     01-00001  !     10-00001  A     11-00001  a
00-00010  STX     01-00010  "     10-00010  B     11-00010  b
00-00011  ETX     01-00011  #     10-00011  C     11-00011  c
00-00100  EOT     01-00100  $     10-00100  D     11-00100  d
00-00101  ENQ     01-00101  %     10-00101  E     11-00101  e
00-00110  ACK     01-00110  &     10-00110  F     11-00110  f
00-00111  BEL     01-00111  '     10-00111  G     11-00111  g
00-01000  BS      01-01000  (     10-01000  H     11-01000  h
00-01001  HT      01-01001  )     10-01001  I     11-01001  i
00-01010  LF      01-01010  *     10-01010  J     11-01010  j
00-01011  VT      01-01011  +     10-01011  K     11-01011  k
00-01100  FF      01-01100  ,     10-01100  L     11-01100  l
00-01101  CR      01-01101  -     10-01101  M     11-01101  m
00-01110  SO      01-01110  .     10-01110  N     11-01110  n
00-01111  SI      01-01111  /     10-01111  O     11-01111  o
00-10000  DLE     01-10000  0     10-10000  P     11-10000  p
00-10001  DC1     01-10001  1     10-10001  Q     11-10001  q
00-10010  DC2     01-10010  2     10-10010  R     11-10010  r
00-10011  DC3     01-10011  3     10-10011  S     11-10011  s
00-10100  DC4     01-10100  4     10-10100  T     11-10100  t
00-10101  NAK     01-10101  5     10-10101  U     11-10101  u
00-10110  SYN     01-10110  6     10-10110  V     11-10110  v
00-10111  ETB     01-10111  7     10-10111  W     11-10111  w
00-11000  CAN     01-11000  8     10-11000  X     11-11000  x
00-11001  EM      01-11001  9     10-11001  Y     11-11001  y
00-11010  SUB     01-11010  :     10-11010  Z     11-11010  z
00-11011  ESC     01-11011  ;     10-11011  [     11-11011  {
00-11100  FS      01-11100  <     10-11100  \     11-11100  |
00-11101  GS      01-11101  =     10-11101  ]     11-11101  }
00-11110  RS      01-11110  >     10-11110  ^     11-11110  ~
00-11111  US      01-11111  ?     10-11111  _     11-11111  DEL

Note: Each column above differs only in the first two bits (from the left). This form makes the relationships between the control characters and letters more apparent across rows. On a traditional keyboard, an unshifted letter key produces a character with its first two bits set to one, in most cases a lowercase letter (last column). Also:

  • The Ctrl key clears the first two bits, producing the “control character” found in the leftmost column of the same row.

  • The Shift key clears the second bit producing an uppercase character (in most cases), simplifying “case-insensitive character matching and construction of keyboards/printers.”  This is why the cases have a difference of thirty-two. For example in Python:

    >>> ord('a') - ord('A')
    32
    >>> ord('z') - ord('Z')
    32
    
  • Note the Delete key is all ones, which effectively “rubs out” a character on paper tape or punch card.  In fact, early keyboards labeled it as “Rubout.”

Modern keyboards are a bit more complicated, but this explains several design cues of ASCII. One final view is up next: if we factor out the last five bits, which each row has in common, the relationships become even clearer:

Last-5  00    01   10   11      ← First 2 bits
  ↓
00000   NUL   SP    @    `
00001   SOH    !    A    a
00010   STX    "    B    b
00011   ETX    #    C    c
00100   EOT    $    D    d      <-- End transmission
00101   ENQ    %    E    e
00110   ACK    &    F    f
00111   BEL    '    G    g      <-- Bell
01000   BS     (    H    h      <-- Backspace
01001   TAB    )    I    i      <-- Tab
01010   LF     *    J    j
01011   VT     +    K    k
01100   FF     ,    L    l      <-- Form feed
01101   CR     -    M    m      <-- Carriage return
01110   SO     .    N    n
01111   SI     /    O    o
10000   DLE    0    P    p
10001   DC1    1    Q    q
10010   DC2    2    R    r
10011   DC3    3    S    s
10100   DC4    4    T    t
10101   NAK    5    U    u
10110   SYN    6    V    v
10111   ETB    7    W    w
11000   CAN    8    X    x
11001   EM     9    Y    y
11010   SUB    :    Z    z
11011   ESC    ;    [    {      <-- Escape
11100   FS     <    \    |
11101   GS     =    ]    }
11110   RS     >    ^    ~
11111   US     ?    _    DEL

Courtesy, user soneil at HN :

If you think of each byte as being 2 bits of “group” and 5 bits of “character”:

00 11011 is Escape
10 11011 is [

The Ctrl key sets the first two bits to zero, eh? 🤔  All of a sudden these keyboard combinations make a lot more sense, no? Mind, blown. (A quick check in Python follows the list below.)

  • Ctrl+D - EOT - End transmission

  • Ctrl+G - BEL - Ring bell

  • Ctrl+H - BS  - Backspace

  • Ctrl+I - TAB - Tab

  • Ctrl+L - FF  - Form feed/Clear screen

  • Ctrl+M - CR  - Carriage return

  • Ctrl+[ - ESC - Escape
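
Skeptical? Here’s a quick sanity check of those bit relationships in Python; the masks simply apply the “clear the top two bits” and “clear the second bit” rules described above:

    >>> bin(ord('D'))                 # 'D' is 0b10_00100: group 10, character 00100
    '0b1000100'
    >>> ord('D') & 0b0011111          # Ctrl clears the top two bits: 4 is EOT (Ctrl+D)
    4
    >>> chr(ord('[') & 0b0011111)     # Ctrl+[ --> ESC (0x1b)
    '\x1b'
    >>> chr(ord('a') & 0b1011111)     # Shift clears the second bit: 'a' --> 'A'
    'A'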

See also: Moar History

Lots of additional juicy details may be found in the works below:

13.3. Eight-bit Encodings

As desktop computers increasingly outnumbered those in laboratories during the 1980s (the Personal Computer revolution,  woot!), demand for characters in languages other than English skyrocketed. It wasn’t long before extended  8-bit variants based on ASCII (such as the legendary IBM code page 437) were being whipped up left and right to support the plethora of languages used around the world. Quite a bit of space was also allocated to box-drawing characters, used for text-mode dialogs and frames on IBM and Commodore computers of the era. An intermediate step on the way to communication Nirvana.

While the strategy mostly worked, there were some significant drawbacks.

On some PCs the character code 130 would display as é, but on computers sold in Israel it was the Hebrew letter Gimel (ג), so when Americans would send their résumés to Israel they would arrive as rגsumגs.
— Joel Spolsky, Blogger/co-founder StackExchange 
  • Only one language could be used in a particular document, and often per system at a time.

  • Document interchange was difficult, as different “code pages” needed to be swapped in and out to view them.

  • Lack of industry-wide standards made documents non-interoperable across operating systems, and often across versions of the same one.

  • Fonts often needed to be installed, without fallbacks.
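
For a concrete taste of the problem Spolsky describes above, here’s a minimal sketch in Python, assuming (for illustration) that the American PC used IBM code page 437 and the Israeli one used code page 862:

    >>> bytes([130]).decode('cp437')             # IBM PC code page (US): é
    'é'
    >>> bytes([130]).decode('cp862')             # IBM PC code page (Hebrew): Gimel
    'ג'
    >>> 'résumé'.encode('cp437').decode('cp862') # same bytes, wrong code page
    'rגsumג'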

While there were hundreds of these encodings, two of the most common eight-bit character encodings are detailed below, as you might still encounter them in the wild. (Older encodings particular to your locale may be found as well.)

ISO 8859-## (aka latin1)


The most popular text encodings used on the early web, “Web 1.0,” during the interim between ASCII and Unicode were the ISO/IEC 8859  series. Separated into numbered parts, ISO 8859-1 through -16, they support many European and Middle-Eastern languages written with the Latin (including its Celtic variants), Cyrillic, Greek, Arabic, Hebrew, and Thai scripts.

Windows-1252 and Others

Sometimes erroneously called ANSI format , Windows-1252  and other “code pages”  substantially overlap with ISO 8859 series mentioned above.

They differ mainly in the removal of a block of rarely-used extended control characters, called C1. Microsoft decided to replace them with a few perhaps more useful characters such as curly quotes and later the Euro (€) symbol, though this led to further incompatibilities. 

Of course these 8-bit encodings (with space for up to 256 characters) were never able to handle the Asian writing systems (CJKV ) that have thousands of symbols . When the Internet (specifically the world-wide web) exploded it soon became clear that swapping code-pages was a glaring dead-end.

Luckily, Unicode had already been invented and folks started noticing. With the success of Windows XP  in 2001, Unicode had hit its stride and was finally being used by “the masses.”

13.4. DBCS

…or double-byte character sets  are traditionally fixed-length encodings taking up 16 bits of space per character (with room for roughly 65 thousand glyphs), as needed for the CJKV  (aka East-Asian) writing systems.

The term DBCS is now somewhat obsolete, with “double” being swapped for “multiple.” These encodings, while once popular, are in decline due to the emergence of Unicode. Another intermediate step on the way to…

13.5. Unicode

It’s hard to say anything pithy about Unicode that is entirely correct.
— John D. Cook, Why Unicode is subtle 

Unicode is an ambitious attempt to catalog all the symbols used by humans, currently and throughout history, into a single reference, called the Universal Character Set.  Why everything?  So that all text written in any language can be exchanged anywhere. The goal is a complete solution to symbolic communication between humans. This is a much larger job than often realized, due to the complexities of the world’s writing systems, and why the standard is still augmented and improved regularly, over twenty-five years after its first release in 1991.

Universal Character Set (UCS)

Unicode is like a big database of characters and properties and rules, and on its own says nothing about how to encode a sequence of characters into a sequence of bytes for actual use by a computer.
— James Bennett 

Note that the UCS is not concerned with storage formats; we shall deal with those shortly. Rather, the catalog of symbols is indexed by what are called code points—simply a list of non-negative integers, each with matching metadata such as a standard name and type attributes (letter, number, symbol…), lined up in one long numeric space. One might think of them like the number-line  you learned in math class in elementary or middle school.
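
Python’s built-in unicodedata module gives a peek into this catalog. Each code point is just an integer, with a standard name and category attached:

    >>> import unicodedata
    >>> ord('ñ')                      # the code point, just an integer
    241
    >>> hex(ord('ñ'))
    '0xf1'
    >>> unicodedata.name('ñ')         # its standard name in the catalog
    'LATIN SMALL LETTER N WITH TILDE'
    >>> unicodedata.category('ñ')     # 'Ll' == Letter, lowercase
    'Ll'
    >>> chr(0x1F600)                  # and back from code point to character
    '😀'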

“Sixteen Is Not Enough”

Despite a common early misconception that Unicode was limited to 16 bits or 65k characters, it is not. Currently Unicode allows for 17 planes of 65,536 possible characters, for a total of just over one million (1,114,112) possible code points.  The current standard as of this writing, Unicode version 12.1 (2019), contains a total of 137,929 assigned characters. 

13.5.1. Planes and Blocks

As mentioned, Unicode consists of seventeen planes  i.e., of existence. Each plane is further divided into blocks, containing the following (highlights):

  0. Basic Multilingual Plane (BMP)

    The first plane is designed around backwards compatibility and tries to pack in everything practical that’s possible to fit. Blocks enshrined in history:

    1. ASCII, as “Basic Latin”

    2. ISO 8859-1 (the top half), as “Latin 1 Supplement”

    It also contains:

    • Most modern languages (with non-huge numbers of codepoints)

    • Almost all common symbols, punctuation, currency

    • Mathematical operators and most symbols

    • Dingbats, geometric, and box-drawing symbols

  1. Supplementary Multilingual Plane (SMP)

    The second plane contains:

    • Historical scripts:

      • Ancient Greek

      • Egyptian Hieroglyphs

      • South, Central, and East Asian scripts.

    • Additional math and music notation (some historical)

    • Game symbols (cards, dominoes)

    • Emoji and other picto-graphics (food, sports)

  2. Supplementary Ideographic Plane
    Additional and historic Asian (CJKV) ideographs.

  3. – 13. Unassigned Planes

  14. Special-Purpose Plane

    Currently contains non-graphical characters.

  15. & 16. Private Use Area

    Available for character assignment by third parties. Icon fonts such as FontAwesome often use private-use code points, though FontAwesome itself uses the smaller Private Use Area block inside the BMP (plane 0).

13.5.2. Encodings

_images/real.jpg

Fig. 18 Shit just got real–Dora the Exploder version.

Encodings are where the “rubber hits the road,” so to speak, moving from the theoretical to concrete. In other words, we need to convert the symbols in our catalog into bits in memory, on disk, or traveling over the ‘net. How do we do that?  With encodings.

Early on it was assumed that since eight bits wasn’t enough (it wasn’t), we should double that to sixteen. As any self-respecting geek knows, that does a lot more than double the number of slots available; it makes room for over 65 thousand characters! (2¹⁶ = 65,536) To someone in the West that probably sounds like an incredible amount of space for symbols, on the order of perhaps-even-too much? I mean, we’ve been limping along with 256 at a time so far, right gramps?

Well… no. Not long afterwards someone realized that Chinese (actually CJKV) alone has on the order of roughly 100,000 characters!  ((boggle))  Though most are obscure, historical variants, and/or not regularly used. Still, sixteen bits is not nearly enough when we consider all the other things we’d want in a universal character set.

Definitions

The names for Unicode encodings are lame if you ask me, and not very helpful in describing what they are. Let’s clarify a couple of concepts first.

Fixed-length encodings:

So far we’ve studied only fixed-length encodings. That means each character is defined with a fixed number of bits, and that never changes, period. e.g.:

  • Baudot code is 5 bits per character.

  • ASCII is 7 bits per character.

  • ISO 8859 is 8 bits per character.

This design has some helpful properties:

  • It’s easy to measure the length of a string.

  • You can start reading anywhere in the string and get a correct sub-string.

  • One can slice strings without worry.

The main drawback is that if we increase the number space by blindly adding more bits, a lot of it will be wasted on simple strings that didn’t need it before. I.e.: frequently-used text now takes two bytes of space per character, where it used to take only one.

Variable-length encodings:

Save space, hooray! This is accomplished by using only one byte for frequently-used text, but two or more bytes for the less frequently-used. Using special values that act as escape codes, an additional byte can be added each time we run out of space, as the encoding moves from the essential to the obscure. 

This design has helpful properties as well:

  • Frequently-used chars are compact.

  • Medium-usage chars are medium.

  • Less frequently-used chars are possible, the sky is the limit.

Great, but variable length encodings have drawbacks as well—losing the helpful features of the fixed-length encoding:

  • Counting length requires inspecting the whole string.

  • Slicing the string or reading a sub-string becomes problematic.

There are ways to program around these issues, but they require additional processor time and perhaps memory too.
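
To make the trade-off concrete, here’s a quick Python comparison, peeking ahead at the UTF encodings defined just below. Note that len() on a str counts code points, while the length of the encoded bytes varies:

    >>> len('hello')                      # five code points...
    5
    >>> len('hello'.encode('utf-8'))      # ...five bytes in a variable-length encoding
    5
    >>> len('héllo'.encode('utf-8'))      # 'é' needs two bytes
    6
    >>> len('日本語'.encode('utf-8'))     # three characters, nine bytes
    9
    >>> len('hello'.encode('utf-32'))     # fixed-length: 4 bytes each, plus a 4-byte BOM
    24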

With those concepts out of the way, let’s look at the types of Unicode encodings.

UCS (fixed, deprecated):

Confusingly named after the Universal Character Set (when they thought these would be the only encodings), these are the original fixed-length encodings of Unicode in the tradition of ASCII and ISO 8859. They are given a suffix denoting the number of bytes allocated per symbol, e.g.:

  • UCS-2 (2 bytes, 16 bits, ~65k chars)

  • UCS-4 (4 bytes, 32 bits [actually fewer at 21], ~1.1M chars)

UTF (variable):

Designated the “Unicode Transformation Format” (huh?), these encodings are variable-length and came later. They are given a suffix denoting the number of bits (sigh) allocated per symbol, e.g.:

  • UTF-8 (1-4 bytes)

  • UTF-16 (2-4 bytes)

  • UTF-32 (4 bytes, 32 bits [actually limited to 21])

All three can encode the full ~1.1M character space.

The reason the encodings offer less than their full bit-space is that the “escape sequences” described earlier (used to signal additional bytes) reserve several blocks of values, each block larger than the last, reducing the usable range.

Variable-size encodings have come to dominate over time, as the drawbacks in processing speed have been outweighed by storage and transfer concerns, especially as computers have increased in processing power  every year. Bottom line—UCSs are out and deprecated, while UTFs are in! Let’s take a look at how that happened.

13.5.2.1. UCS-2 –> UTF-16

The “First Stab” at the Problem, aka UCS-2

The first encoding of Unicode, UCS-2, was released in 1991. This is where the misconception started that Unicode was sixteen-bit only. Indeed UCS-2 is limited to sixteen bits, and so only capable of encoding the Basic Multilingual Plane (BMP). It was later deprecated for this reason.

Early implementors of Unicode that chose UCS-2 included:

  • Windows NT

    • Visual Studio, Visual Basic

    • COM, etc.

  • Joliet file system for CD-ROM media (an ISO 9660 extension)

  • Java, the programming language

  • JavaScript, the programming language

Windows, Java, and JavaScript were later upgraded to use…

UTF-16

UTF-16 is the variable-length extension of UCS-2, created for Unicode 2.0 in 1996. It uses 16-bit values by default, but can expand to 32 bits to express the upper planes when needed, using what are called “surrogate pairs.” This makes it the most efficient Unicode encoding for Asian languages such as Chinese, Japanese Kanji, and Korean Hanja (CJKV), as common characters take only two bytes. 

Unfortunately, this encoding doubles the size of formerly eight-bit strings that fit in ISO 8859.

Also, the need for UTF-16’s “surrogate pairs” for characters beyond the first 16 bits is what limits the Unicode standard to ~1.1 million code points. 

Endianness and BOM

Because each character uses more than one byte, Endianness  becomes an issue for both encoders and decoders. Byte order may be signaled by prepending the Byte Order Mark (BOM), U+FEFF, as a tip to the decoder:

  • Read in the intended byte order, the first character decodes as U+FEFF and is simply ignored.

  • Read with the bytes swapped, it decodes as U+FFFE, a guaranteed non-character value, telling the decoder to reverse the byte order.

It may also be stated explicitly as the encoding type: UTF-16BE or UTF-16LE. When using these encodings, do not prepend a BOM.
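
A small sketch of both ideas in Python follows: surrogate pairs for characters above the BMP, and the BOM versus the explicit -LE/-BE variants. (Output shown is from a little-endian machine.)

    >>> 'A'.encode('utf-16-le')        # BMP character: one 16-bit code unit
    b'A\x00'
    >>> '😀'.encode('utf-16-le')       # above the BMP: a surrogate pair, four bytes
    b'=\xd8\x00\xde'
    >>> 'hi'.encode('utf-16')          # generic form prepends the BOM (little-endian here)
    b'\xff\xfeh\x00i\x00'
    >>> 'hi'.encode('utf-16-be')       # explicit byte order: no BOM needed
    b'\x00h\x00i'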

13.5.2.2. UCS-4 –> UTF-32

UCS-4 was never officially released, but was repackaged as UTF-32 and introduced with Unicode 2.0 as well. If you thought UTF-16 was wasteful, you ain’t seen nothin’ yet. UTF-32 is markedly wasteful of space, utilizing a full four bytes of memory for each character, whether needed or not (mostly not, since due to other compatibility rules twenty-one bits are the most it could potentially use anyway).

However, when memory usage is not an issue, UTF-32 can be useful for boosting text processing speed. Recall that with fixed-length encodings every character is the same size, so characters are directly addressable and can be accessed efficiently (in constant time), as you would the elements of an array. The presence of combining characters can reduce these advantages, however. Endianness is a factor here as well.

Analysis

We’ve got choices of UTF-16 and UTF-32 now from Unicode 2.0, but they’re still not so compelling are they? As this amusing take by Joel on Software illustrates:

For a while it seemed like that might be good enough, but programmers were complaining. “Look at all those zeros!” they said, since they were Americans and they were looking at English text which rarely used code points above U+00FF. Also they were liberal hippies in California who wanted to conserve (sneer). If they were Texans they wouldn’t have minded guzzling twice the number of bytes. But those Californian wimps couldn’t bear the idea of doubling the amount of storage it took for strings. 

There must be a better way, hah. Moving on.

13.5.2.3. UTF-8 FTW, Baby!

UTF-8 was designed, in front of my eyes, on a placemat in a New Jersey diner one night in September or so, 1992.
— Rob Pike 

Fortunately some smart people were thinking the same thing. Finally the problem landed in the lap of Ken Thompson , who designed it with Rob Pike  “cheering him on.”  Yes, that Ken and Rob, creators of Unix:

What happened was this. We had used the original UTF (16) from ISO 10646 to make Plan 9  support 16-bit characters, but we hated it. We were close to shipping the system when, late one afternoon, I received a call from some folks, I think at IBM, who were in an X/Open committee meeting. They wanted Ken and me to vet their file-system safe UTF design. We understood why they were introducing a new design, and Ken and I suddenly realized there was an opportunity to use our experience to design a really good standard and get the X/Open guys to push it out. We suggested this and the deal was, if we could do it fast, OK. So we went to dinner, Ken figured out the bit-packing, and when we came back to the lab after dinner we called the X/Open guys and explained our scheme. We mailed them an outline of our spec, and they replied saying that it was better than theirs, and how fast could we implement it? 

The design turned out to have a number of very useful properties.

Backward Compatibility

UTF-8 is a variable-length encoding that can store a character using one to four bytes (shortened from six when Unicode was limited to twenty-one bits in 2003). It is designed for backward compatibility with ASCII, as every code point from 0-127 is stored in a single byte. Additionally, it’s compatible with program code that uses a single null byte (value 0) to terminate strings.

¡Every ASCII file is also valid UTF-8!  ¡Arriba! ¡Arriba! 😍

How does it work? Like this, the underscore “ˍ” character below representing a container for a significant bit value:

Avail. Bits    First       Last        Byte 1      Byte 2      Byte 3      Byte 4

 7             U+0000      U+007F      0ˍˍˍˍˍˍˍ
11             U+0080      U+07FF      110ˍˍˍˍˍ   10ˍˍˍˍˍˍ
16             U+0800      U+FFFF      1110ˍˍˍˍ   10ˍˍˍˍˍˍ    10ˍˍˍˍˍˍ
21             U+10000     U+10FFFF    11110ˍˍˍ   10ˍˍˍˍˍˍ    10ˍˍˍˍˍˍ    10ˍˍˍˍˍˍ

As Wikipedia describes well:

The first 128 characters (US-ASCII) need only one byte. The next 1,920 characters need two bytes to encode, which covers the remainder of almost all Latin-script alphabets, and also Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac, as well as Combining Diacritical Marks. Three bytes are needed for characters in the rest of the Basic Multilingual Plane, which contains virtually all characters in common use[11] including most Chinese, Japanese and Korean characters. Four bytes are needed for characters in the other planes of Unicode, which include less common CJKV characters, various historic scripts, mathematical symbols, and emoji (pictographic symbols). 
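
In Python the table above is easy to verify. Watch the encoded byte counts grow with the code point:

    >>> 'A'.encode('utf-8')                     # U+0041, 1 byte: plain ASCII
    b'A'
    >>> 'é'.encode('utf-8')                     # U+00E9, 2 bytes
    b'\xc3\xa9'
    >>> '€'.encode('utf-8')                     # U+20AC, 3 bytes (BMP, above U+07FF)
    b'\xe2\x82\xac'
    >>> '😀'.encode('utf-8')                    # U+1F600, 4 bytes (above the BMP)
    b'\xf0\x9f\x98\x80'
    >>> [bin(b) for b in 'é'.encode('utf-8')]   # the 110… / 10… prefixes from the table
    ['0b11000011', '0b10101001']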

Self-Synchronization

The length of the multi-byte UTF-8 sequences is easily determined just by looking at their prefix codes:

  • Initial byte prefix, one of:

    • 0ˍˍˍˍˍˍˍ

    • 11ˍˍˍˍˍˍ

  • Continuation byte prefix:

    • 10ˍˍˍˍˍˍ

Note in the table above that the number of leading ones in the first byte gives the total number of bytes in the character (when the first bit is not 0). That’s a great property when looking at encoded text in a hex editor, or when writing decoders or other bit-twiddling programs.

Because continuation bytes never share a prefix with the initial byte of a character sequence, a search will never find one character’s bytes inside another’s. This property makes slicing and dicing UTF-8 strings much less worrisome.

Last but not least, UTF-8 has no byte-order (endianness) dependency, because it is byte oriented. 
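
Here’s a tiny sketch that leans on that prefix property, counting characters in a UTF-8 byte string by skipping the 10ˍˍˍˍˍˍ continuation bytes:

    >>> data = 'naïve 😀'.encode('utf-8')
    >>> data
    b'na\xc3\xafve \xf0\x9f\x98\x80'
    >>> len(data)                                 # bytes
    11
    >>> # code points: count every byte whose top two bits are not 10
    >>> sum(1 for b in data if (b & 0b11000000) != 0b10000000)
    7
    >>> len('naïve 😀')
    7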

Auto-detection and Fallback

In addition to complete ASCII support, there is partial ISO 8859/Windows-1252 support as well. When given such “extended” text by mistake, a UTF-8 decoder can often recognize and fix it, because most of those byte sequences are invalid in UTF-8. The offending bytes can then be re-interpreted as characters from the Latin-1 Supplement  block instead.

Comparison

                              UTF-8      UTF-16     UTF-32

Highest code point            10FFFF     10FFFF     10FFFF
Code unit size                8 bit      16 bit     32 bit
Byte order dependent          no         yes        yes
Fewest bytes per character    1          2          4
Most bytes per character      4          4          4

13.5.3. Complexity and Foot-Dragging

Unicode is a fascinating mess.
— Armin Ronacher 
_images/xkcd_uni.png

Fig. 19 Right to Left Override Character (U+202E) 

Using Unicode day to day is not hard on modern systems, but when one digs in a bit things get a little scarier, which tends to turn off the “graybeards who grew up on ASCII.” Unicode has:

  • Planes and Blocks

  • Full characters, and “combining characters”

    • e.g.,  É or (E and ´) in combination.

    • Normalization , or conversion to a standard representation. For example, compare the character “n” plus a combining tilde “~” with the composed version: “ñ”. (A short demonstration follows this list.)

  • Bi-directional text display.

    • Oh, up and down too!

  • Collation rules

  • Regular expression rules 

  • Historical scripts, dead languages

    Phoenician and Aramaic anyone? Cuneiform, Runes?

  • Duplicate code points with different semantics. e.g.:

    • Roman Letter V vs. the Roman Numeral Ⅴ, which is used as a number.

    • Greek letter Omega Ω, and the symbol for electrical resistance in Ohms, Ω.

  • Font availability across operating systems.

  • Encoding issues.
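
As promised above, here’s a short look at combining characters and normalization, using Python’s standard unicodedata module (the example characters are arbitrary):

    >>> import unicodedata
    >>> composed = 'ñ'                           # one code point, U+00F1
    >>> decomposed = 'n\u0303'                   # 'n' plus COMBINING TILDE
    >>> print(decomposed)
    ñ
    >>> composed == decomposed                   # they render alike but compare unequal
    False
    >>> len(composed), len(decomposed)
    (1, 2)
    >>> unicodedata.normalize('NFC', decomposed) == composed   # compose before comparing
    True
    >>> len(unicodedata.normalize('NFD', composed))            # or decompose: two code points again
    2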

Language Issues

The real craziness with Unicode isn't in the sheer number of characters that have been assigned. The real fun starts when you look at how all these characters interact with one another.
— Ben Frederickson, Unicode is Kind of Insane 

A tiny slice of what the Unicode designers had to deal with, questions and trivia to ponder:

  • Greek has two forms of the lower-case sigma: “ς” used only as the final letter of a word, and the more familiar “σ” used everywhere else. Is that one letter, or two?

  • Is the German Eszett  “ß” a separate letter or a decorative way of writing “ss?” Germans themselves have disagreed over the years.

  • Does a diaeresis/umlaut mark over a letter e.g.: ö, make it a different letter? German says yes, English/Spanish/Portuguese say no.

    WorldMaker on HN:

    Language is messy. At some point you have to start getting into the raw philosophy of language and it’s not just a technical problem at that point but a political problem and an emotional problem.

    Which language is “right”? The one that thinks that diaresis is merely a modifier or the one that thinks of an accented letter as a different letter from the unmodified? There isn’t a right and wrong here, there’s just different perspectives, different philosophies, huge histories of language evolution and divergence, and lots of people reusing similar looking concepts for vastly different needs.  

  • JoelSky chimes in:

    If a letter’s shape changes at the end of the word, is that a different letter? Hebrew says yes, Arabic says no.

    Anyway, the smart people at the Unicode consortium have been figuring this out for the last decade or so, accompanied by a great deal of highly political debate, and you don’t have to worry about it. They’ve figured it all out already. 

Lazy Transition

Back in my day, we didn't even have uppercase, and we liked it!
— Gramps on Computing (in the voice of Wilford "Dia-BEET-uss" Brimley)
_images/enough.jpg

Fig. 20 I beg to differ… 

You have to understand (as a newbie) what it was like around the turn of the thentury, cough, get my teeth! Most of the hand-wringing over Unicode back then had to do with the fact that the existing docs with eight-bit code-pages were a mess (who would convert them?), Unicode was still immature and transition incomplete, UTF-8 wasn’t invented/well-publicized yet, 16-bit encodings equaled wasted space, and lots of legacy systems were still around that couldn’t handle it.

I think it's also worth remembering the combination of arrogance and laziness which was not uncommon in the field, especially in the 90s.
— acdha on HN 

Meanwhile Americans, who by cosmic luck had one of the simplest alphabets on the planet, and coincidentally happened to own the big computing, IT, and operating systems companies thought, “what’s the problem? Things look fine from here! Hell, ASCII was good enough for Gramps, why not me?” Well yeah, not very forward thinking, but the issues listed previously were real as well.

What’s the Verdict?

An interesting judgement from etrnloptimist at reddit :

The question isn’t whether Unicode is complicated or not. Unicode is complicated because languages are complicated.

The real question is whether it is more complicated than it needs to be. I would say that it is not.

Agreed, though if we were to start over, yes, a few things might be simplified. But let me ask you: how many times has a complex project landed a perfectly designed 1.0?

13.5.4. The Growth of UTF-8

Somewhere around 2010, perhaps 2012 (not sure exactly when, because it happened so gradually this author wasn’t paying attention), the tide turned. But only after:

  • UTF-8 was invented, solving:

    • Storage concerns

    • Backwards compatibility

    • Self-synchronization

  • Out with the old Windows, in with the new, w/ better support

  • The Web took it and ran! Many old docs were updated.

  • iPhone was released, w/ better support

  • Emoji became a sought after feature, first in Asia, then the world.

Suddenly everything was Unicode! No more hurdles to cross and very little for the user or developer to do; it’s just handled by the operating system except in obscure circumstances. Most if not all of the issues preventing Unicode’s success had been solved. “Just use UTF-8 everywhere”  became a rallying cry! That led to the explosive growth seen in the chart below.

_images/utf8.svg

Fig. 21 Courtesy 

13.5.5. Now What!? AKA, Actually Working With It

🙈 🙉 🙊

One thing that’s really bugged me in all the Unicode docs, tutorials, and blog posts I’ve read over the years is that they teach you all the facts about it (like this chapter so far) and then walk away.

Okay ((brain filled with tons of facts about Unicode)), so now I’m sitting in front of my editor and it’s time to start taking advantage of Unicode! What now? When do I encode? When do I decode? What about the encode and decode exceptions that pop up in various languages?

Somehow that part never got explained. 😠 The writers were just too tired to write part two, I guess. 🤷

So, even armed with copious Unicode knowledge, as a developer you’d still find yourself patching programs here and there in a piecemeal fashion, catching errors and encoding/decoding data with no rhyme or reason. You’d never be sure you had handled every case, and sometimes a fix in one part of the program would break something in another. (Part of the problem was programming languages ill-equipped for the Unicode transition, but I digress.) Sheesh.

Cue THX sound effects - SCHWIIIIIIiiinnnggg… 

I’ll let you in on the secret. Years, and I mean years later as in close to a decade of fumbling around in the dark, the answer was finally stumbled upon—by me anyway Grrrrr… 

Finally, a mental model for What. To. Do. to solve these issues once and for all. What was it, you ask… ??

“The Unicode Sandwich”

Decode from Input
100% All-Unicode Patty
Encode to Output
  1. Top Bun - Read encoded bytes from input (file, network) and decode them into Unicode.

  2. Meat - Process in memory as Unicode text.

  3. Bottom Bun - Encode to bytes on output as close to the edge as possible then:

    • Save to disk

    • Write to network

That’s it for the sandwich ((burp)). Achievement unlocked, enlightenment followed, whew.
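
Here’s a minimal sketch of the sandwich in Python; the file names and the choice of UTF-8 are illustrative assumptions, not requirements:

    # Top Bun: decode bytes to text at the input edge
    with open('input.txt', 'rb') as infile:          # hypothetical input file
        text = infile.read().decode('utf-8')

    # Meat: process purely as str (Unicode) in memory
    text = text.strip().upper()

    # Bottom Bun: encode back to bytes only at the output edge
    with open('output.txt', 'wb') as outfile:        # hypothetical output file
        outfile.write(text.encode('utf-8'))

(Passing encoding='utf-8' to open() in text mode does the same decoding and encoding for you; the sandwich is still there, just baked into the buns.)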

String Indexing

Unfortunately, as long as you believe that you can index into a Unicode string, your code is going to break. The only question is how soon.
— ekidd on HN 

Next up are the problems with finding lengths and offsets in strings—otherwise known as indexing—when those strings may contain not only full characters but also combining characters. It’s important to remember with Unicode that:

  • One code point != one character

  • Code points are not the same as characters/graphemes/glyphs

As ekidd mentions above, indexing may work fine until someone slips in an obscure character. There is no complete solution to the problem, unfortunately. Strings may need to be normalized  first and/or inspected programmatically.
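
A small demonstration of why naive indexing bites (code points are not graphemes):

    >>> s = 'cafe\u0301'                       # 'café' written with a combining acute accent
    >>> print(s)
    café
    >>> len(s)                                 # five code points, four visible characters
    5
    >>> s[:4]                                  # slicing strips the accent from its base letter
    'cafe'
    >>> import unicodedata
    >>> len(unicodedata.normalize('NFC', s))   # normalization helps here...
    4
    >>> len('👨\u200d👩\u200d👧')               # ...but a family emoji is still five code points
    5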

Regular expressions are another area where Unicode complexities should be anticipated  and dealt with. 

13.5.6. Fonts and Errors

To conclude this section, did we happen to mention you’ll need comprehensive space-consuming fonts to see all these amazing characters? Don’t despair, operating system vendors have been improving Unicode support every year. While they’ve been good at providing fonts for various languages, there still seem to be symbols missing here and there. Perhaps, you’ve got a copy of an older OS that can’t be updated.

For those reasons, you may find the Symbola font quite useful. It’s a symbol font that contains a fully up-to-date set of the upper blocks and planes implemented, from math to music to food to emoji! The latest version (up to Unicode v13) can be found here.  Fonts for “ancient scripts” are found there as well. Enjoy.

It’s also worth mentioning that modern OSs tend to replace the emoji block with full-colored icons. This is somewhat of a mixed blessing, but a common practice now, so you’ve been warned.

Hint:  Mo-ji-ba-ke

文 字
化 け

What is this Mojibake?  Not the cute little ice-cream balls, and not pronounced “bake” as in an oven, but “bah-keh”; it means “character transform” in Japanese.

Simply put, it is jumbled text that has been decoded using the wrong character encoding for display to the user, and may look similar to this: æ–‡å—化ã‘

While you’ll often see boxes, question marks, and hexadecimal numbers, (similar to the lack of font support mentioned above) additional corruption may occur. This may result in a loss of sync over where one character starts and ends, display in an incorrect writing system, or worse.
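
You can manufacture mojibake of your own in Python by decoding UTF-8 bytes with the wrong codec (Latin-1 here, so the garbage looks a bit different from the Windows-1252 flavored example above):

    >>> garbled = '文字化け'.encode('utf-8').decode('latin-1')   # wrong codec on purpose
    >>> garbled
    'æ\x96\x87å\xad\x97å\x8c\x96ã\x81\x91'
    >>> garbled.encode('latin-1').decode('utf-8')                # recoverable, if you know what happened
    '文字化け'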

¡d∩ ƃuᴉddɐɹM 

Godspeed, Unicode. We await your frowning pile of poo  with open arms.

  • “There ain’t no such thing as plain text.”

  • “Always declare an encoding when creating or handling textual data.” (Professionals do.)

  • The history of telecommunications is interesting and a precursor to computing and networking.

  • 7-bit ASCII is a fundamental industry standard and backward compatibility a bridge to earlier times.

  • The “Control” key on your keyboard maps keys to ASCII control values by clearing the top two bits of the seven from the corresponding letter.

  • The Delete key once was called Rubout because it punched all the holes in paper tape or punch card, directing the reader to ignore it.

  • There were hundreds of 8-bit ASCII extensions, now considered a “dead-end.” Two of the most famous, which live on, are ISO 8859 and Windows-1252.

  • Unicode is an ambitious attempt to catalog all the symbols used by humans, currently and throughout history into a single reference, called the Universal Character Set. (Resulting in:)

  • “Unicode is a fascinating mess.”

  • Unicode is not limited to 16-bit encodings, which are insufficient to solve the full problem.

  • UTF-16 is the most compact encoding for modern texts in “CJKV” (Asian) languages.

  • UTF-8 is the best all around encoding for other uses, helping propel it to incredible popularity. Use it!

  • When programming, use the “Unicode Sammich” model to remember how to handle textual data properly.

  • The Symbola font is highly useful when an OS vendor has let you down.