13.   Character Sets and Encodings

There ain't no such thing as plain text.
— Spolsky/Atwood 

Character sets and encodings are how we store and transmit text with computers. While the two terms are sometimes used interchangeably, the set declares which symbols are defined and in what order, while the encoding declares how those symbols are represented as bytes when stored or in transit over a network.

It does not make sense to have a string without knowing what encoding it uses.
— Joel Spolsky, Blogger/co-founder StackExchange 

The point Misters Spolsky and Atwood allude to above is that for any piece of text with no encoding specified as metadata, some portion of meaning will be lost—rendered as garbage.

Tip:  on Text

Always declare an encoding when creating or handling textual data.

Due to the complexities of human languages and historical reasons, this subject is more complicated than you might imagine at first glance.

Hint:  Missing Characters?

If symbols in the text below are not displayed correctly, you may need to install additional fonts described below.

13.1.   History

The Internet has progressed from smart people in front of dumb terminals, to dumb people in front of smart terminals.
— Proverb

Fig. 15 Baudot’s US Patent 

Building on the human need to communicate, the history of text encoding for communication purposes is quite colorful. A few of the most notable early advances are listed below:

  • Braille  (Version 2) - 1837, the first binary form of writing.

  • Baudot code  (5-bit code) - 1870s, an automated (performed by machines) telegraph code.

  • Teletypes, Teleprinters - 1910s

    Invented to speed communication and reduce costs—messages could be transmitted remotely without the need for operators trained in Morse code.

    In computing, especially in Unix-like operating systems, the legacy of teletypes lives on in the designations for serial ports and consoles (i.e., the text-only display mode), which have the prefix tty, such as /dev/tty5 for the fifth virtual console.
    — linfo 
  • EBCDIC  (8-bit code) - early 1960s, a text encoding for IBM mainframes that extended the 6-bit codes of the 1950s; the only non-ASCII legacy encoding for computers you’ll likely find mentioned in modern times.

See also:  Books

Code, by Charles Petzold

This nerdy book explores the early history of communications, writing systems, text encodings, number systems, basic electronics, logic gates, and introduces computer architecture. Together, you’ll build a simple computer with the pieces both discovered and invented. Petzold’s enthusiasm draws you in, making for a thrilling read despite a subject matter some might consider dry. Recommended for the education of every computing professional who wants to know how it actually works.

13.2.   ASCII

An encoding… with the unlikely pronunciation of ASS-key.
— Unknown (read the phrase somewhere, no help from Google)
                 ____        .-'''-.      _______    .-./`)  .-./`)
    .-,        .'  __ `.    / _     \    /   __  \   \ .-.') \ .-.')      .-,
 ,-.|  \ _    /   '  \  \  (`' )/`--'   | ,_/  \__)  / `-' \ / `-' \   ,-.|  \ _
 \  '_ /  |   |___|  /  | (_ o _).    ,-./  )         `-'`"`  `-'`"`   \  '_ /  |
 _`,/ \ _/       _.-`   |  (_,_). '.  \  '_ '`)       .---.   .---.    _`,/ \ _/
(  '\_/ \     .'   _    | .---.  \  :  > (_)  )  __   |   |   |   |   (  '\_/ \
 `"/  \  )    |  _( )_  | \    `-'  | (  .  .-'_/  )  |   |   |   |    `"/  \  )
   \_/``"     \ (_ o _) /  \       /   `-'`-'     /   |   |   |   |      \_/``"
               '.(_,_).'    `-...-'      `._____.'    '---'   '---'
Flower Power font, courtesy FIGlet .

Fig. 16 The Teletype Model 33, one of the first terminals to employ ASCII. 

Introduced in 1963 and updated in 1967 and 1986, ASCII  (aka ISO 646) is the American Standard Code for Information Interchange, a ubiquitous standard for character encoding. Developed from automatic telegraph codes  (which, unlike Morse code, required no trained human operators) and printing teletype  (tty) technology, it was defined as a seven-bit code, giving it the ability to encode 128 different values. ASCII was carefully designed, as was common in the days of acute resource constraints—each bit cost money in storage and in transit.

Centered around English text, it contains the twenty-six letter basic Latin alphabet without diacritics, in upper and lower case (a shiny new feature), the numerals 0-9, the most common punctuation characters, and thirty-two now mostly obsolete control characters for printers and transmission. Its alphanumeric codes are also ordered, for convenient sorting.

Diving in!

Let’s take a closer look. Below, ASCII is rendered as 8 columns of 16 rows (4 bits each), with indexes shown in decimal, then hex:

 0 00 NUL  16 10 DLE  32 20    48 30 0  64 40 @  80 50 P   96 60 `  112 70 p
 1 01 SOH  17 11 DC1  33 21 !  49 31 1  65 41 A  81 51 Q   97 61 a  113 71 q
 2 02 STX  18 12 DC2  34 22 "  50 32 2  66 42 B  82 52 R   98 62 b  114 72 r
 3 03 ETX  19 13 DC3  35 23 #  51 33 3  67 43 C  83 53 S   99 63 c  115 73 s
 4 04 EOT  20 14 DC4  36 24 $  52 34 4  68 44 D  84 54 T  100 64 d  116 74 t
 5 05 ENQ  21 15 NAK  37 25 %  53 35 5  69 45 E  85 55 U  101 65 e  117 75 u
 6 06 ACK  22 16 SYN  38 26 &  54 36 6  70 46 F  86 56 V  102 66 f  118 76 v
 7 07 BEL  23 17 ETB  39 27 '  55 37 7  71 47 G  87 57 W  103 67 g  119 77 w
 8 08 BS   24 18 CAN  40 28 (  56 38 8  72 48 H  88 58 X  104 68 h  120 78 x
 9 09 HT   25 19 EM   41 29 )  57 39 9  73 49 I  89 59 Y  105 69 i  121 79 y
10 0A LF   26 1A SUB  42 2A *  58 3A :  74 4A J  90 5A Z  106 6A j  122 7A z
11 0B VT   27 1B ESC  43 2B +  59 3B ;  75 4B K  91 5B [  107 6B k  123 7B {
12 0C FF   28 1C FS   44 2C ,  60 3C <  76 4C L  92 5C \  108 6C l  124 7C |
13 0D CR   29 1D GS   45 2D -  61 3D =  77 4D M  93 5D ]  109 6D m  125 7D }
14 0E SO   30 1E RS   46 2E .  62 3E >  78 4E N  94 5E ^  110 6E n  126 7E ~
15 0F SI   31 1F US   47 2F /  63 3F ?  79 4F O  95 5F _  111 6F o  127 7F DEL

These eight columns were called “sticks” in the official docs.

The 4-bit columns were meaningful in the design of ASCII. The original influence was binary-coded decimal and one of the major design choices involved which column should contain the decimal digits. ASCII was very carefully designed; essentially no character has its code by accident. Everything had a reason, although some of those reasons are long obsolete.
— kps on HN 

Below is a different view that makes a few relationships more obvious. Four columns of thirty-two (5 bits at a time) are shown in binary, with the first two bits set off by a hyphen:

00-00000  NUL     01-00000        10-00000  @     11-00000  `
00-00001  SOH     01-00001  !     10-00001  A     11-00001  a
00-00010  STX     01-00010  "     10-00010  B     11-00010  b
00-00011  ETX     01-00011  #     10-00011  C     11-00011  c
00-00100  EOT     01-00100  $     10-00100  D     11-00100  d
00-00101  ENQ     01-00101  %     10-00101  E     11-00101  e
00-00110  ACK     01-00110  &     10-00110  F     11-00110  f
00-00111  BEL     01-00111  '     10-00111  G     11-00111  g
00-01000  BS      01-01000  (     10-01000  H     11-01000  h
00-01001  HT      01-01001  )     10-01001  I     11-01001  i
00-01010  LF      01-01010  *     10-01010  J     11-01010  j
00-01011  VT      01-01011  +     10-01011  K     11-01011  k
00-01100  FF      01-01100  ,     10-01100  L     11-01100  l
00-01101  CR      01-01101  -     10-01101  M     11-01101  m
00-01110  SO      01-01110  .     10-01110  N     11-01110  n
00-01111  SI      01-01111  /     10-01111  O     11-01111  o
00-10000  DLE     01-10000  0     10-10000  P     11-10000  p
00-10001  DC1     01-10001  1     10-10001  Q     11-10001  q
00-10010  DC2     01-10010  2     10-10010  R     11-10010  r
00-10011  DC3     01-10011  3     10-10011  S     11-10011  s
00-10100  DC4     01-10100  4     10-10100  T     11-10100  t
00-10101  NAK     01-10101  5     10-10101  U     11-10101  u
00-10110  SYN     01-10110  6     10-10110  V     11-10110  v
00-10111  ETB     01-10111  7     10-10111  W     11-10111  w
00-11000  CAN     01-11000  8     10-11000  X     11-11000  x
00-11001  EM      01-11001  9     10-11001  Y     11-11001  y
00-11010  SUB     01-11010  :     10-11010  Z     11-11010  z
00-11011  ESC     01-11011  ;     10-11011  [     11-11011  {
00-11100  FS      01-11100  <     10-11100  \     11-11100  |
00-11101  GS      01-11101  =     10-11101  ]     11-11101  }
00-11110  RS      01-11110  >     10-11110  ^     11-11110  ~
00-11111  US      01-11111  ?     10-11111  _     11-11111  DEL

Each row above differs only in the first two bits (from the left). This form makes the relationships between the control characters and letters more apparent. By default (an unmodified key press), the first two bits of each character are set to one, in most cases producing a lowercase letter (last column). Note:

  • The Ctrl key clears the first two bits, producing a “control character.”

  • The Shift key clears the second bit producing an uppercase character (in most cases), simplifying “case-insensitive character matching and construction of keyboards/printers.”  This is why the cases have a difference of thirty-two. For example in Python:

    >>> ord('a') - ord('A')
    32
    >>> ord('z') - ord('Z')
    32
    
  • Note the Delete key is all ones, which effectively “rubs out” a character on paper tape or punch card.  In fact, early keyboards labeled it as “Rubout.”
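
Continuing the Python example above, here’s a quick interactive sketch of those bit relationships (the masks simply apply the rules described in this list):

    >>> bin(ord('a'))                # 'a' with both high bits set
    '0b1100001'
    >>> chr(ord('a') & 0b0011111)    # Ctrl clears the first two bits -> SOH (Ctrl+A)
    '\x01'
    >>> chr(ord('a') & 0b1011111)    # Shift clears the second bit -> uppercase
    'A'
    >>> bin(0x7F)                    # Delete is all seven bits set
    '0b1111111'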

Modern keyboards are a bit more complicated, but this explains several design cues of ASCII. One final view is up next: if we factor out the last five bits each row has in common, the relationships become even clearer:

00    01   10   11     ← First 2 bits
                       ↓ Last 5 bits
NUL   SP    @    `     00000
SOH    !    A    a     00001
STX    "    B    b     00010
ETX    #    C    c     00011
EOT    $    D    d     00100    <-- End transmission
ENQ    %    E    e     00101
ACK    &    F    f     00110
BEL    '    G    g     00111    <-- Bell
BS     (    H    h     01000    <-- Backspace
TAB    )    I    i     01001    <-- Tab
LF     *    J    j     01010
VT     +    K    k     01011
FF     ,    L    l     01100    <-- Form feed
CR     -    M    m     01101    <-- Carriage return
SO     .    N    n     01110
SI     /    O    o     01111
DLE    0    P    p     10000
DC1    1    Q    q     10001
DC2    2    R    r     10010
DC3    3    S    s     10011
DC4    4    T    t     10100
NAK    5    U    u     10101
SYN    6    V    v     10110
ETB    7    W    w     10111
CAN    8    X    x     11000
EM     9    Y    y     11001
SUB    :    Z    z     11010
ESC    ;    [    {     11011    <-- Escape
FS     <    \    |     11100
GS     =    ]    }     11101
RS     >    ^    ~     11110
US     ?    _    DEL   11111

Courtesy of user soneil at HN :

If you look at each byte as being 2 bits of “group” and 5 bits of “character”:

00 11011 is Escape
10 11011 is [

All of a sudden these keyboard combinations make a lot more sense, don’t they?

  • Ctrl+D - EOT - End transmission
  • Ctrl+G - BEL - Ring bell
  • Ctrl+H - BS  - Backspace
  • Ctrl+I - TAB - Tab
  • Ctrl+L - FF  - Form feed/Clear screen
  • Ctrl+M - CR  - Carriage return
  • Ctrl+[ - ESC - Escape

Mind, blown. ;-)

See also: History

Lots of additional juicy details may be found in the works below:

13.3.   Eight-bit Encodings

As desktop computers increasingly outnumbered those in laboratories during the 1980s (the Personal Computer revolution  woot!), demand for characters in languages other than English skyrocketed. It wasn’t long before extended  8-bit variants based on ASCII (such as the legendary IBM code page 437) were being whipped up left and right to support the plethora of languages used around the world. Quite a bit of space was also allocated to box-drawing characters (used for text-mode dialogs) on IBM and Commodore computers of the era.

While the strategy mostly worked, there were some significant drawbacks.

On some PCs the character code 130 would display as é, but on computers sold in Israel it was the Hebrew letter Gimel (ג), so when Americans would send their résumés to Israel they would arrive as rגsumגs.
— Joel Spolsky, Blogger/co-founder StackExchange 
  • Only one language could be used in a particular document, and often per system at a time.
  • Document interchange was difficult, as different “code pages” needed to be swapped in and out to view them.
  • Lack of industry-wide standards made documents non-interoperable across operating systems, and often across versions of the same system.
  • Fonts often needed to be installed, without fallbacks.

While there were hundreds of these encodings, two of the most common eight-bit character encodings are detailed below, as you might still encounter them in the wild. (Older encodings particular to your locale may be found as well.)

ISO 8859-## (aka latin1)


The most popular text encodings used on the early web, “Web 1.0,” during the interim between ASCII and Unicode were the ISO/IEC 8859  series. Separated into numbered parts, ISO 8859-1 through 8859-16, they support many European and Middle Eastern languages written in the Latin (including Celtic languages), Cyrillic, Greek, Arabic, Hebrew, and Thai scripts.

Windows-1252 and Others

Sometimes erroneously called ANSI format , Windows-1252  and other “code pages”  substantially overlap with the ISO 8859 series mentioned above.

They differ mainly in the removal of a block of rarely-used extended control characters, called C1 . Microsoft decided to replace them with a few perhaps more useful characters, such as curly quotes and later the Euro (€) symbol, though this led to further incompatibilities.

Of course these 8-bit encodings (with space for up to 256 characters) were never able to handle the Asian writing systems (CJKV ) that have thousands of symbols . When the Internet (specifically the world-wide web) exploded it soon became clear that swapping code-pages was a glaring dead-end.

Luckily, Unicode had already been invented and folks started noticing. With the success of Windows XP  in 2001, Unicode had hit its stride and was finally being used by “the masses.”

13.4.   DBCS

…or double-byte character sets  are traditionally fixed-length encodings taking up 16 bits of space per character (with room for about 65 thousand glyphs), necessary for the CJKV  (East Asian) writing systems. Notable examples include Shift JIS (Japanese), Big5 (Traditional Chinese), GB 2312 (Simplified Chinese), and EUC-KR (Korean).

The term DBCS is now somewhat obsolete, with “double” being swapped for “multiple.” These encodings, while once popular, are in decline due to the emergence of Unicode.

13.5.   Unicode

It’s hard to say anything pithy about Unicode that is entirely correct.
— John D. Cook, Why Unicode is subtle 

Unicode is an ambitious attempt to catalog all the symbols used by humans, currently and throughout history, into a single reference, called the Universal Character Set.  Why everything?  So that all text written in any language can be exchanged anywhere. The goal is a complete solution to symbolic communication between humans. This is a much larger job than often realized, due to the complexities of the world’s writing systems, and why it is still being augmented regularly over twenty-five years after its first release in 1991.

Universal Character Set (UCS)

Unicode is like a big database of characters and properties and rules, and on its own says nothing about how to encode a sequence of characters into a sequence of bytes for actual use by a computer.
— James Bennett 

Note that the UCS is not concerned with storage formats; we shall deal with those shortly. Rather, the catalog of symbols is indexed by what are called code points—simply non-negative integers with matching metadata such as a standard name and type attributes (letter, number, symbol…), lined up in space. One might think of them like the number-line  you learned in math class in elementary or middle school.
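
In Python, for instance, you can peek at a code point and its metadata via the standard unicodedata module (a quick sketch):

    >>> import unicodedata
    >>> ord('ñ')                     # the code point, just an integer
    241
    >>> hex(ord('ñ'))
    '0xf1'
    >>> unicodedata.name('ñ')        # its standard name
    'LATIN SMALL LETTER N WITH TILDE'
    >>> unicodedata.category('ñ')    # type attribute: Ll, a lowercase letter
    'Ll'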

“Sixteen Is Not Enough”

Despite a common early misconception that Unicode was limited to 16 bits or 65k characters, it is not. Currently Unicode allows for 17 planes of 65,536 possible characters, for a total of just over 1.1 million (1,114,112) possible code points.  The current standard as of this writing, Unicode version 9.0 (2016), has a total of 128,172 characters assigned.

13.5.1.   Planes and Blocks

As mentioned, Unicode consists of seventeen planes  i.e., of existence. Each is further divided into blocks, containing the following (highlights):

  1. Basic Multilingual Plane (BMP)

    The first plane is designed around backwards compatibility and tries to pack in everything practical that’s possible to fit. Blocks enshrined in history:

    1. ASCII, as “Basic Latin”
    2. ISO 8859-1 (top half), as “Latin 1 Supplement”

    It also contains:

    • Most modern languages
    • Almost all common symbols, punctuation, currency
    • Mathematical operators and most symbols
    • Dingbats, geometric, and box-drawing symbols
  2. Supplementary Multilingual Plane (SMP)

    The second plane contains:

    • Historical scripts:
      • Ancient Greek
      • Egyptian Hieroglyphs
      • South, Central, East Asian scripts.
    • Additional math and music notation (some historical)
    • Game symbols (cards, dominoes)
    • Emoji and other picto-graphics (food, sports)

  3. Supplementary Ideographic Plane
    Additional and historic Asian (CJKV) ideographs.

  4. – 13. Unassigned Planes
  14. Special-Purpose Plane

    Currently contains non-graphical characters.

  15. & 16. Private Use Area

    Available for character assignment by third parties. Icon fonts such as FontAwesome use these (though FontAwesome actually uses the smaller Private Use Area block within plane 0, the BMP).

13.5.2.   Encodings


Fig. 17 Shit just got real–Dora the Exploder version.

Encodings are where the “rubber hits the road,” so to speak, moving from the theoretical to the concrete. In other words, we need to convert our symbols in space into bits in memory, on disk, or on the net. How do we do that?  With encodings.
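
In Python, for example, each direction of the conversion is explicit (a quick sketch):

    >>> 'Ω'.encode('utf-8')          # symbol in space -> concrete bytes
    b'\xce\xa9'
    >>> b'\xce\xa9'.decode('utf-8')  # and back again
    'Ω'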

Early on it was assumed that since eight bits wasn’t enough (it wasn’t), we should double that to sixteen. As any self-respecting geek knows, that does a lot more than double the number of slots available; it makes enough room for 65 thousand plus characters! (2¹⁶ = 65,536) To someone in the West that probably sounds like an incredible amount of space for symbols, perhaps even too much? I mean, we’ve been limping along with 256 at a time so far, right gramps?

Well… no. Not long afterwards someone realized that Chinese (actually CJKV) alone has on the order of 100,000 characters 😲 ((boggle))!  Though most are obscure, historical, variants, and/or not regularly used, sixteen bits is still not enough when we consider all the other things we’d want in a universal character set.

Definitions

The names for Unicode encodings are lame if you ask me, and not very helpful in describing what they are. Let’s clarify a couple of concepts first.

Fixed-length encodings:

So far we’ve studied only fixed-length encodings. That means each character is defined with a fixed number of bits, and that never changes, period. e.g.:

  • Baudot code is 5 bits per character.
  • ASCII is 7 bits per character.
  • ISO 8859 is 8 bits per character.

This design has some helpful properties:

  • It’s easy to measure the length of a string.
  • You can start reading anywhere in the string and get a correct sub-string.
  • One can slice strings without worry.

The main drawback is that if we increase the number space by blindly adding more bits, a lot of it will be wasted on simple strings that didn’t need it before. That is, frequently-used text now takes two bytes of space per character, where it used to take only one.

Variable-length encodings:

Save space, hooray! Variable-length encoding is accomplished by using only one byte of space for frequently-used text, but two or more bytes when dealing with the less frequently-used. Using special values reserved as escape codes, an additional byte can be added each time we run out of space, as the encoding moves from the essential to the obscure.

This design has helpful properties as well:

  • Frequently-used chars are compact.
  • Medium-usage chars are medium.
  • Less frequently-used chars are possible, the sky is the limit.

Great, but variable length encodings have drawbacks as well—losing the helpful features of the fixed-length encoding:

  • Counting length requires inspecting the whole string.
  • Slicing the string or reading a sub-string becomes problematic.

There are ways to program around these issues, but they require additional processor time and perhaps memory too.
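
A quick Python sketch of the trade-off, counting code points versus encoded bytes:

    >>> s = 'naïve 😀'
    >>> len(s)                       # seven code points
    7
    >>> len(s.encode('utf-8'))       # but eleven bytes: each char takes 1-4
    11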

With those concepts out of the way, let’s look at the types of Unicode encodings.

UCS (fixed, deprecated):

Confusingly named after the Universal Character Set (when they thought these would be the only encodings), these are the original fixed-length encodings of Unicode in the tradition of ASCII and ISO 8859. They are given a suffix denoting the number of bytes allocated per symbol, e.g.:

  • UCS-2 (2 bytes, 16 bits, ~65k chars)
  • UCS-4 (4 bytes, 32 bits [though Unicode needs only 21], ~1.1M chars)
UTF (variable):

Designated the “Unicode Transformation Format,” (huh?) these encodings came later and are variable-length (with the exception of UTF-32). They are given a suffix denoting the number of bits (sigh) allocated per symbol, e.g.:

  • UTF-8 (1-4 bytes)
  • UTF-16 (2-4 bytes)
  • UTF-32 (4 bytes, 32 bits [actually limited to 21])

All three can encode the full ~1.1M character space.

The reason the encodings offer less than their full bit space is that the “escape sequences” described earlier (used to add additional bytes) make several blocks of code points unusable, each block larger than the last.

Variable encodings have come to dominate over time, as the drawbacks in processing speed have been outweighed by storage and transfer concerns, especially as computers have increased in processing power  every year. Bottom line—UCSs are out and deprecated, while UTFs are in! Let’s take a look at how that happened.

13.5.2.1.   UCS-2 → UTF-16

The “First Stab” at the Problem, aka UCS-2

The naive assumption discussed previously led to the first encoding of Unicode, UCS-2, in 1991. This is where the misconception started that Unicode was sixteen-bit only. Indeed UCS-2 is limited to sixteen bits, capable of encoding only the Basic Multilingual Plane (BMP), and was later deprecated for this reason.

Early implementors of Unicode that implemented UCS-2 included:

  • Windows NT
    • Visual Studio, Visual Basic
    • COM, etc.
  • Joliet file system for CD-ROM media (an ISO 9660 extension)
  • Java, The Programming language
  • JavaScript, The Programming language

Windows, Java, and JavaScript were later upgraded to use…

UTF-16

… the variable-length version of UCS-2, created for Unicode 2.0 in 1996. It uses 16-bit values by default but can expand to 32 bits to express the upper planes when needed, utilizing what are called “surrogate pairs.” This makes it the most efficient Unicode encoding of Asian languages such as Chinese, Japanese Kanji, and Korean Hanja (CJKV), as common characters take only two bytes.

Unfortunately, this encoding doubles the size of formerly eight-bit strings that fit in ISO 8859. Also, the need for UTF-16’s surrogate pairs for characters above the first 16 bits is what limits the Unicode standard to ~1.1 million code points.
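
A quick Python sketch of surrogate pairs in action, with the encoded bytes shown as hex:

    >>> '漢'.encode('utf-16-be').hex()   # a BMP character: one 16-bit unit
    '6f22'
    >>> '😀'.encode('utf-16-be').hex()   # above the BMP: a surrogate pair, two units
    'd83dde00'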

Endianness and BOM

Due to the fact that each character uses more than one byte, endianness  becomes an issue for both encoders and decoders. Byte order may be specified by prepending a mark as a tip to the decoder:

  • Byte Order Mark (BOM), U+FEFF
  • Read with the wrong byte order, the BOM appears as the non-character value U+FFFE, telling the decoder to swap.

It may also be stated explicitly as the encoding type: UTF-16BE or UTF-16LE. When using these encodings, do not prepend a BOM.
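
This is easy to observe from Python; a quick sketch (assuming a little-endian machine for the first result):

    >>> 'A'.encode('utf-16').hex()       # generic UTF-16 prepends a BOM (ff fe here)
    'fffe4100'
    >>> 'A'.encode('utf-16-be').hex()    # explicit byte order: no BOM
    '0041'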

13.5.2.2.   UCS-4 → UTF-32

UCS-4 was never officially released, but was repackaged as UTF-32 and introduced with Unicode 2.0 as well. If you thought UTF-16 was wasteful, you ain’t seen nothin’ yet. UTF-32 is markedly wasteful of space, utilizing a full four bytes of memory for each character, whether needed or not, mostly not (since due to other compatibility rules twenty-one bits are the most it could potentially use anyway).

However, when memory usage is not an issue, UTF-32 can be useful to boost text-processing speed. Remember fixed-length encodings? Every character is the same size and directly addressable, so it can be accessed efficiently (in constant time), as you would an element of an array. The presence of combining characters can reduce these advantages, however. Endianness is a factor here as well.

Analysis

We’ve got choices of UTF-16 and UTF-32 now from Unicode 2.0, but they’re still not so compelling, are they? As this amusing take by Joel on Software illustrates:

For a while it seemed like that might be good enough, but programmers were complaining. “Look at all those zeros!” they said, since they were Americans and they were looking at English text which rarely used code points above U+00FF. Also they were liberal hippies in California who wanted to conserve (sneer). If they were Texans they wouldn’t have minded guzzling twice the number of bytes. But those Californian wimps couldn’t bear the idea of doubling the amount of storage it took for strings. 

There must be a better way, hah. Moving on.

13.5.2.3.   UTF-8 FTW, Baby!

UTF-8 was designed, in front of my eyes, on a placemat in a New Jersey diner one night in September or so, 1992.
— Rob Pike 

Fortunately some smart people were thinking the same thing. Finally the problem landed in the lap of Ken Thompson , who designed it with Rob Pike  “cheering him on.”  Yes, that Ken and Rob:

What happened was this. We had used the original UTF (16) from ISO 10646 to make Plan 9  support 16-bit characters, but we hated it. We were close to shipping the system when, late one afternoon, I received a call from some folks, I think at IBM, who were in an X/Open committee meeting. They wanted Ken and me to vet their file-system safe UTF design. We understood why they were introducing a new design, and Ken and I suddenly realized there was an opportunity to use our experience to design a really good standard and get the X/Open guys to push it out. We suggested this and the deal was, if we could do it fast, OK. So we went to dinner, Ken figured out the bit-packing, and when we came back to the lab after dinner we called the X/Open guys and explained our scheme. We mailed them an outline of our spec, and they replied saying that it was better than theirs and how fast could we implement it? 

Backward Compatibility

UTF-8 is a variable-length encoding that can store a character using one to four bytes (shortened from six when Unicode was limited to twenty-one bits in 2003). It is designed for backward compatibility with ASCII, as every code point from 0-127 is stored in a single byte. Additionally, it’s compatible with program code that uses a single null byte (value 0) to terminate strings.

¡Every ASCII file is also valid UTF-8!  ¡Arriba! ¡Arriba! 😍

How does it work? Like this, each “ˍ” representing a slot for a payload bit:

Bits   First      Last       Byte 1     Byte 2     Byte 3     Byte 4
  7    U+0000     U+007F     0ˍˍˍˍˍˍˍ
 11    U+0080     U+07FF     110ˍˍˍˍˍ   10ˍˍˍˍˍˍ
 16    U+0800     U+FFFF     1110ˍˍˍˍ   10ˍˍˍˍˍˍ   10ˍˍˍˍˍˍ
 21    U+10000    U+10FFFF   11110ˍˍˍ   10ˍˍˍˍˍˍ   10ˍˍˍˍˍˍ   10ˍˍˍˍˍˍ

As Wikipedia describes well:

The first 128 characters (US-ASCII) need one byte. The next 1,920 characters need two bytes to encode, which covers the remainder of almost all Latin-script alphabets, and also Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac, as well as Combining Diacritical Marks. Three bytes are needed for characters in the rest of the Basic Multilingual Plane, which contains virtually all characters in common use, including most Chinese, Japanese and Korean characters. Four bytes are needed for characters in the other planes of Unicode, which include less common CJKV characters, various historic scripts, mathematical symbols, and emoji (pictographic symbols).

Auto-detection and Fallback

In addition to complete ASCII support, there is partial ISO 8859/Windows-1252 support as well. When given such “extended” text by mistake, a UTF-8 decoder can recognize the problem in many cases, because most higher character values form byte sequences that are impossible in UTF-8. Those bytes can therefore be mapped to the Latin-1 Supplement block  instead.
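
A quick Python sketch of such a fallback, assuming the bytes arrived in a legacy 8-bit encoding:

    >>> data = 'café'.encode('latin-1')   # bytes from an 8-bit legacy encoding
    >>> data.decode('utf-8')              # invalid as UTF-8, so the mistake is caught...
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 3: ...
    >>> data.decode('latin-1')            # ...and a legacy decoding can be tried instead
    'café'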

Prefix Code

The length of the multi-byte sequences is easily determined by humans.
— Wikipedia 

Note in the table above that the number of leading ones in the first byte gives the total number of bytes in the character (when the byte doesn’t start with 0).
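
Seen from Python, with each byte printed in binary (a quick sketch):

    >>> ['{:08b}'.format(b) for b in 'é'.encode('utf-8')]    # two leading ones: two bytes
    ['11000011', '10101001']
    >>> ['{:08b}'.format(b) for b in '€'.encode('utf-8')]    # three leading ones: three bytes
    ['11100010', '10000010', '10101100']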

Self-Synchronization

Continuation bytes do not share a prefix with the initial byte of a character sequence, which means a search will never find one character starting inside another:

  • Initial byte prefix: 0 or 11
  • Continuation byte prefix: 10

This property makes slicing and dicing UTF-8 strings much less worrisome. Last but not least, UTF-8 is not byte-order dependent, due to the fact that it is byte-oriented.

Comparison

                              UTF-8     UTF-16    UTF-32
Highest code point            10FFFF    10FFFF    10FFFF
Code unit size                8 bit     16 bit    32 bit
Byte order dependent          no        yes       yes
Fewest bytes per character    1         2         4
Most bytes per character      4         4         4
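
To make the comparison concrete, here is a quick Python sketch encoding the same short string three ways:

    >>> s = 'Hi ☺'                      # three ASCII chars plus one BMP symbol
    >>> [len(s.encode(enc)) for enc in ('utf-8', 'utf-16-be', 'utf-32-be')]
    [6, 8, 16]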

13.5.3.   Complexity and Foot-Dragging

Unicode is a fascinating mess.
— Armin Ronacher 

Fig. 18 Right to Left Override Character (U+202E) 

Using Unicode day to day is not hard on modern systems, but when one digs in a bit things get a little scarier, which tends to turn off the “graybeards who grew up on ASCII.” Unicode has:

  • Planes and Blocks

  • Full characters, and “combining characters”

    • e.g.,  É or (E and ´) in combination.
    • Normalization , or conversion to a standard representation. For example compare the character “n” and a combining tilde “~”, with the composed version: “ñ”.
  • Bi-directional text display.

    • Oh, up and down too!
  • Collation rules

  • Regular expression rules 

  • Historical scripts, dead languages

    Phoenician and Aramaic anyone? Cuneiform, Runes?

  • Duplicate code points with different semantics. e.g.:

    • Roman Letter V vs. the Roman Numeral Ⅴ, which is used as a number.
    • Greek letter Omega Ω, and the symbol for electrical resistance in Ohms, Ω.
  • Font availability across operating systems.

  • Encoding issues (as detailed above).

Language Issues

The real craziness with Unicode isn't in the sheer number of characters that have been assigned. The real fun starts when you look at how all these characters interact with one another.
— Ben Frederickson, Unicode is Kind of Insane 

A tiny slice of what the Unicode designers had to deal with, questions and trivia to ponder:

  • Greek has two forms of the lower-case sigma: “ς” used only as the final letter of a word, and the more familiar “σ” used everywhere else. Is that one letter, or two?

  • Is the German Eszett  “ß” a separate letter or a decorative way of writing “ss?” Germans themselves have disagreed over the years.

  • Does a diaeresis/umlaut mark over a letter e.g.: ö, make it a different letter? German says yes, English/Spanish/Portuguese say no.

    WorldMaker on HN:

    Language is messy. At some point you have to start getting into the raw philosophy of language and it’s not just a technical problem at that point but a political problem and an emotional problem.

    Which language is “right”? The one that thinks that diaresis is merely a modifier or the one that thinks of an accented letter as a different letter from the unmodified? There isn’t a right and wrong here, there’s just different perspectives, different philosophies, huge histories of language evolution and divergence, and lots of people reusing similar looking concepts for vastly different needs.  

  • Joel chimes in:

    If a letter’s shape changes at the end of the word, is that a different letter? Hebrew says yes, Arabic says no.

    Anyway, the smart people at the Unicode consortium have been figuring this out for the last decade or so, accompanied by a great deal of highly political debate, and you don’t have to worry about it. They’ve figured it all out already. 

Lazy Transition

Back in my day, we didn't even have uppercase, and we liked it!
— Gramps on Computing (in the voice of Wilford Brimley) 

Fig. 19 I beg to differ… 

You have to understand (as a newbie) what it was like around the turn of the thentury, cough, get my teeth! Most of the hand-wringing over Unicode back then had to do with the fact that the existing docs with eight-bit code-pages were a mess (who would convert them?), Unicode was still immature and transition incomplete, UTF-8 wasn’t invented/publicized yet, 16-bit encodings equaled wasted space, and lots of legacy systems were still around that couldn’t handle it.

I think it's also worth remembering the combination of arrogance and laziness which was not uncommon in the field, especially in the 90s.
— acdha on HN 

Meanwhile Americans, who by cosmic luck had one of the simplest alphabets on the planet, and coincidentally happened to own the big computing, IT, and operating systems companies thought, “what’s the problem? Things look fine from here! Hell, ASCII was good enough for Gramps, why not me?” Well yeah, not very forward thinking, but the issues listed previously were real.

What’s the Verdict?

An interesting judgement from etrnloptimist at reddit :

The question isn’t whether Unicode is complicated or not. Unicode is complicated because languages are complicated.

The real question is whether it is more complicated than it needs to be. I would say that it is not.

Agreed, though if we were to start over, yes, a few things might be simplified. But let me ask you: how many times has a complex project landed a perfectly designed 1.0?

13.5.4.   Unicode and Growth of UTF-8

Somewhere around 2010, perhaps 2012 (not sure exactly, because it happened so gradually this author wasn’t paying attention), things changed. After:

  • UTF-8 was invented, solving:
    • Storage concerns
    • Backwards compatibility
    • Self-synchronization
  • The Web took it and ran!
  • Out with the old Windows, in with the new, w/ better support
  • iPhone released, better support
  • Emoji became a sought after feature

Suddenly everything was Unicode! No more hurdles to cross and very little for user or developer to do; it’s just handled by the operating system except in obscure circumstances. Most if not all of the issues preventing Unicode’s success had been solved. Just use UTF-8 everywhere!  That led to the explosive growth seen in the chart below.


Fig. 20 The explosive growth of UTF-8 on the web.

13.5.5.   Working with Unicode

🙈 🙉 🙊

One thing that’s bugged me in all the Unicode docs, tutorials, and blog posts I’ve read over the years is that they teach you all the facts about it (like this chapter so far) and then just walk away. Okay ((brain filled with tons of facts about Unicode)), so now I’m sitting in front of my editor and it’s time to start taking advantage of Unicode! What now? When do I encode? How do I decode? What about the encode and decode errors that pop up in various (now older) languages? Somehow that part never got explained. The writers were just too tired to write part two, I guess. 🤷

So, even armed with copious Unicode knowledge, as a developer you’d still find yourself patching programs here and there in a piecemeal fashion, catching errors and encoding/decoding data with no rhyme or reason. You’d never be sure you’d handled every case, and sometimes a fix in one part of a program would break something in another.

Cue THX sound effects - SCHWIIIIIIiiinnnggg… 

Sheesh. Years, and I mean years later, as in pushing a decade of fumbling around in the dark, the answer was finally stumbled upon—Grrrrr…

Finally, a mental model for what to do, one that solves these issues once and for all.

The Unicode Sandwich

Decode from Input
100% All-Unicode Patty
Encode to Output
  1. Top Bun - Read encoded bytes from input (file, network) and decode them.
  2. Meat - Process in memory as Unicode text.
  3. Bottom Bun - Encode to bytes on output as close to the edge as possible then:
    • Save to disk
    • Write to network

That’s it for the sandwich ((burp)). Achievement unlocked, enlightenment followed.
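
In Python, the sandwich might look something like this minimal sketch (the file names are hypothetical):

    # Top bun - read bytes from input and decode them at the edge.
    with open('input.txt', 'rb') as infile:
        text = infile.read().decode('utf-8')

    # Meat - process as Unicode text (str) only, no bytes in the middle.
    text = text.upper()

    # Bottom bun - encode back to bytes as close to the output edge as possible.
    with open('output.txt', 'wb') as outfile:
        outfile.write(text.encode('utf-8'))

In practice, passing encoding='utf-8' to open() lets Python toast the buns for you; the explicit version above just makes each layer visible.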

String Indexing

Unfortunately, as long as you believe that you can index into a Unicode string, your code is going to break. The only question is how soon.
— ekidd on HN 

Next up are the problems with finding lengths and offsets in strings—otherwise known as indexing—when those strings may be composed not only of full characters but also of combining characters. It’s important to remember with Unicode that:

  • One code point != one character
  • Code points are not the same as characters/graphemes/glyphs

As ekidd mentions above, this may work fine until someone slips in an obscure character. Unfortunately, there is no complete solution to the problem. Strings may need to be normalized  first and/or inspected programmatically.
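
A quick Python sketch of the pitfall, and of normalization, reusing the “ñ” example from earlier:

    >>> import unicodedata
    >>> composed = '\u00f1'              # 'ñ' as a single code point
    >>> combining = 'n\u0303'            # 'n' plus a combining tilde: two code points
    >>> composed == combining, len(composed), len(combining)
    (False, 1, 2)
    >>> unicodedata.normalize('NFC', combining) == composed
    True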

Regular expressions are another area where Unicode complexities should be anticipated.

13.5.6.   Fonts and Errors

Did we happen to mention you’ll need comprehensive fonts to see these amazing characters? Don’t despair: operating system vendors have been improving Unicode support every year. While they’ve been good at providing fonts for various languages, there still seem to be symbols missing here and there. Perhaps you’ve got a copy of an older OS that can’t be updated.

For those reasons, you may find the Symbola font quite useful. It’s a symbol font that contains a fully up-to-date set of the upper blocks and planes implemented, from math to music to food to emoji! The latest version (up to Unicode v11) can be found here.  Fonts for “ancient scripts” are found there as well. Enjoy.

It’s also worth mentioning that modern OSs tend to replace the emoji block with full-colored icons. This is somewhat of a mixed blessing, but a common practice now, so you’ve been warned.

Hint:  Mo-ji-ba-ke:  æ–‡å—化ã‘




What is this Mojibake?  Not pronounced “bake” like in an oven, but “bah-keh,” and means “character transform” in Japanese.

Simply put, it is jumbled text that has been decoded using the wrong character encoding for display to the user. While you’ll often see boxes, question marks, and hexadecimal numbers (similar to the lack of font support mentioned above), additional corruption may occur.

This may result in a loss of sync over where one character starts and ends, display in an incorrect writing system, or worse.
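
A quick Python sketch of how the classic garbage gets made, decoding perfectly good UTF-8 bytes with the wrong (Windows-1252) table:

    >>> 'é'.encode('utf-8').decode('windows-1252')   # right bytes, wrong decoder
    'Ã©'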


¡d∩ ƃuᴉddɐɹM 

Godspeed, Unicode. We await your frowning pile of poo with open arms. 

  • “There ain’t no such thing as plain text.”
  • “Always declare an encoding when creating or handling textual data.” (Professionals do.)
  • The history of telecommunications is interesting and a precursor to computing and networking.
  • The “Control” key on your keyboard maps keys to ASCII control values by clearing the top two bits of the seven from the corresponding letter.
  • The Delete key once was called Rubout because it punched all the holes in paper tape or punch card, directing the reader to ignore it.
  • There were hundreds of 8-bit ASCII extensions that are now considered a dead end. Two of the most famous that live on are ISO 8859 and Windows-1252.
  • Unicode is an ambitious attempt to catalog all the symbols used by humans, currently and throughout history into a single reference, called the Universal Character Set.
  • Unicode is not limited to 16-bit encodings, which are not enough to solve the full problem anyway.
  • “Unicode is a fascinating mess.”
  • UTF-16 is the most compact encoding for modern texts in the CJKV languages.
  • UTF-8 is the best all around encoding for other uses, helping propel it to incredible popularity.
  • Use the “Unicode Sammich” model to remember how to handle Unicode and its encodings properly.
  • The Symbola font is useful when your OS vendor has let you down.