12.   Regional Concerns

There's no such thing as "international software," just software.
— A 'Disgruntled' Swedish Guy (I used to work with)

You’ve no doubt noticed the world is not getting any less connected; globalization marches on. Consequently, it’s been recognized that most of the money in the software industry is not in the author’s home market, but rather abroad.   Indeed, why target a product to one country (even a large one), when you could instead target a hundred, raking in several times more cash by following a few guidelines amounting to just a bit more work?

I went right on biggering… selling more Thneeds. And I biggered my money, which everyone needs.
— Dr. Seuss, The Lorax 

Thankfully, due to the magic of der Internet, additional software sales are practically free from a distribution angle and don’t require the destruction of additional natural resources, such as chopping down Truffula  trees. Support costs could be a factor however…

Unintended Consequences, Re: Politics

Though largely a simple, even mundane process, occasionally this area of software development bumps up against the messy world of humans and their international politics. What you name certain countries in interfaces or marketing materials, how you label maps, timezones and their borders, and what information you choose to display (or not) can bruise egos or ruffle feathers at the clients’ end. Such offenses may prevent sales to large organizations such as governments or large enterprises. In extreme circumstances, websites might be blocked, or worse.

Law Enforcement vs. Privacy

Customer privacy is a concern in many countries as are general law compliance and law-enforcement activities. This book can give no specific advice on this part of the topic, but simply notes that awareness of regional law where you are doing business and sensitivity to customer needs is always good policy. </understatement>

12.1.   Internationalization, aka I18N

I'm not crazy - Internationalized
You're the one that's crazy - Internationalized
You're driving me crazy - Internationalized
They stuck me in an institution,
Said it was the only solution,
to give me the needed professional help
to protect me from the enemy - Myself
— Suicidal Tendencies, Institutionalized (with apologies) 

That brings us to the potential expansion of the userbase of our software across the globe. Internationalization: it’s one hell of a word, isn’t it? Twenty letters in length (only fourteen fewer than Mary Poppins’ famous adjective). That’s why you’ll often see it abbreviated as i18n, with the eighteen letters in the center replaced by the number 18 (a form known as a numeronym). ‘Cuz, who wants to type out that monotony every time?

So, what is it exactly?

Internationalization is the process of designing a software application so that it can *potentially* be adapted to various languages and regions without engineering changes.
— Wikipedia (Emphasis added) 

So, internationalization is the technique and process that enables localization of a software product to different countries or communities. In this manner, we’ll be able to provide interfaces in languages and data views in formats that the end-user understands best.

In more concrete terms, utilizing i18n frameworks and libraries we further modularize the software, separating visible strings and/or icons in the interface into separate files organized by region or language, loadable at runtime. Addresses, dates, numbers and/or currency are formatted and displayed according to the preferences of the user. This infrastructure provided by developers is what is known as i18n, the actual details of which are filled in under the next section.
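To make the idea concrete, here is a minimal sketch of that separation, with the visible strings held in per-language tables rather than in the program logic. The CATALOGS table and the translate helper are hypothetical names for illustration only; a real project would load such tables from files through an i18n library rather than hard-code them.

```python
# Hypothetical in-memory catalogs; a real application would load these
# from per-language files at runtime.
CATALOGS = {
    "es": {"Save": "Guardar", "Cancel": "Cancelar"},
    "de": {"Save": "Speichern", "Cancel": "Abbrechen"},
}


def translate(message, language):
    """Return the translation of message, or message itself if none exists."""
    return CATALOGS.get(language, {}).get(message, message)


print(translate("Save", "es"))  # found in the Spanish table
print(translate("Save", "fr"))  # no French table, so the original is returned
```

Note that the calling code never mentions a specific language; only the tables do.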

A benefit to this separation of concerns is that translators in your employ won’t need to be familiar with the implementation of the application. They also won’t need to be developers or technical folks (within reason). They’ll need a bit of geeky aptitude however, as software translation often requires use of technical and/or industry jargon, depending on application domain.

12.2.   Localization, aka L10N

Localization is the process of adapting internationalized software for a specific region or language by adding locale-specific components and translating text.
— Wikipedia 

During the localization process, we’ll perform translations and configure the following items, and possibly others for a particular locale:

  • Interface language(s)
  • Writing system direction 
  • Number formats
  • Date/time formats
    • AM/PM or 24 hour clock
  • Timezones
  • Currency
  • Weights and measures

This stage is where the bulk of the work is done.

I18n vs. L10n - What’s the Difference?

Localization (which is potentially performed multiple times, for different locales) uses the infrastructure or flexibility provided by internationalization (which is ideally performed only once, or as an integral part of ongoing development).
— Wikipedia 

To reiterate, internationalization libraries enable and support the process of localization that occurs afterward. From the gettext manual:

Roughly speaking, when it comes to multi-lingual messages, internationalization is usually taken care of by programmers, and localization is usually taken care of by translators. 

Keep these differences in mind when speaking or writing on the subject.

Hint:  Continuous I18N

I18n and l10n are ongoing processes in the lifecycle of a software package, project, or product, and not a one-time job. As each new version of the software is prepared and interface updated, new translation work will need to be done. While a significant portion of the work is done prior to the first public release, there will be a proportional amount needed for each new version or update.

12.3.   Native Language Support (NLS)

བཀྲ་ཤིས་བདེ་ལེགས

Tashi Delek! Overview

The first step of internationalizing an application from a language perspective is marking those strings visible in the interface for translation. This is done by replacing the strings with a function that will perform the translation at runtime. That entails:

  • Marking user visible strings in code and/or templates.
  • Running a tool to extract the marked strings.
  • Translating the strings for each supported language.
  • Displaying the correct translation for the user at runtime.

Tip:  Country vs. Language

Languages should not be hard-coded by country. First, there may be large communities of non-official language speakers in a country, whether immigrant or expat.

Second, believe it or not people travel! For business, education, or leisure. Often these folks are disappointed in the Google website, whose search interface and results are forced to the locale of the current country, rather than the language setting configured in the browser. WTF?   
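For web applications, one way to respect the user's configured language is to read the browser's Accept-Language header instead of guessing from location. The function below is a simplified sketch: preferred_language and its parameters are hypothetical names, the supported codes are assumed to be lower-case, and the parsing covers only the common "tag;q=value" form rather than the full HTTP grammar.

```python
def preferred_language(accept_language, supported, default="en"):
    """Pick the best supported language from an Accept-Language header value."""
    candidates = []
    for position, item in enumerate(accept_language.split(",")):
        tag, _sep, params = item.strip().partition(";")
        tag = tag.strip().lower()
        if not tag or tag == "*":
            continue
        quality = 1.0                      # no q-value means full preference
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                continue                   # skip malformed entries
        if quality > 0:                    # q=0 means "not acceptable"
            candidates.append((-quality, position, tag))
    # Highest quality first; header order breaks ties.
    for _neg_quality, _position, tag in sorted(candidates):
        for choice in (tag, tag.split("-")[0]):   # try "sv-se", then "sv"
            if choice in supported:
                return choice
    return default


print(preferred_language("sv-SE,sv;q=0.9,en;q=0.5", {"en", "de", "sv"}))
```

A traveler in Germany with a Swedish-configured browser would get the Swedish interface here, regardless of where the request came from.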

12.3.1.   Tools

Sometimes I try to do things and it just doesn't work out the way I wanted to
And I get real frustrated, and its like
And I try hard to do it and take my time
And it just doesn't work out the way I want it to
It's like I concentrate on it real hard but it just doesn't work out
And everything I do and everything I try, it never turns out
It's like I need time to figure these things out.
— Suicidal Tendencies, Institutionalized 

The following sections are meant to give a general overview of the process and are not a complete tutorial. The i18n process for a professional application or platform will likely require additional steps.

The main i18n framework in the FLOSS world is GNU gettext. We’ll use it as an example due to its ubiquity and its extensive documentation:

These tools include:

  • A set of conventions about how programs should be written to support message catalogs.
  • A directory and file naming organization for the message catalogs themselves.
  • A runtime library supporting the retrieval of translated messages.
  • A few stand-alone programs to massage in various ways the sets of translatable strings, or already translated strings.
  • A library supporting the parsing and creation of files containing translated messages.

GNU gettext is designed to minimize the impact of internationalization on program sources, keeping this impact as small and hardly noticeable as possible. Internationalization has better chances of succeeding if it is very light-weight, when looking at program sources. 

12.3.2.   Preparing Strings

The GNU manual gives many examples of preparing the strings beforehand, so that there is a greater chance of a correct translation later on:

  • Decent English style

  • Entire sentences preferred (gives context to translator)

  • Split at paragraphs

  • Translatable strings should be limited to one paragraph; don’t let a single message be longer than ten lines.

  • Use format strings instead of string concatenation.

    • When formatting, named placeholders are preferred over positional ones. This allows the translator to change the order of the sentence when needed, e.g.:

      '{name}' instead of '{}', or

      '%(name)s' instead of '%s'.

  • Avoid unusual markup and control characters that might confuse non-technical translators.

The subject is covered in greater depth in the manual .
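As a small illustration of the named-placeholder guideline, the sketch below formats the same message from two templates whose word order differs. The TEMPLATES table and the render helper are hypothetical; in practice the translated template would come from a message catalog.

```python
# The same message in two languages. Because the fields are named, the
# German template can put the count first without any change to the code.
TEMPLATES = {
    "en": "{user} uploaded {count} photos.",
    "de": "{count} Fotos wurden von {user} hochgeladen.",
}


def render(language, **values):
    """Fill the template for the given language with the named values."""
    return TEMPLATES[language].format(**values)


print(render("en", user="Ana", count=3))
print(render("de", user="Ana", count=3))
```

With positional fields such as '{}' the code would dictate the order of the values, and the translator could not rearrange them.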

Tip: 

Define message strings in one go when possible, as sentence order is often locale dependent.

12.3.3.   Marking Strings

Here we simply replace a given string with a translation function. Though not a requirement, by convention the function is spelled “_”, as in the underscore character. This keeps it as short and non-visually distracting as possible.

Here’s an example, using a minimal script (in Python 3 for clarity):

print('Hello World!')

Now, let’s mark the strings:

from gettext import gettext as _

print(_('Hello World!'))  # internationalized!

The string 'Hello World!' is now a key with which a suitable translation may be found at runtime. This is the simplest case.

Plural Forms

But what about common cases where messages include quantities that must be matched by other words in the message, such as plurals? Most languages have different word forms for singular and plural quantities, and some have multiple forms, depending on number and occasionally gender. Did ya know Portuguese has a feminine form of the number two (dois/duas)? And Polish has at least five forms? ((boggle))

Here’s an example of how the ngettext function might be used:

from gettext import ngettext

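def i18ned_function(num_files):
    # ngettext only selects the right form; the number still has to be substituted.
    message = ngettext('There is %s file.', 'There are %s files.', num_files)
    return message % num_files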
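Behind the scenes, the form chosen by ngettext depends on a rule stored in each catalog's Plural-Forms header. The sketch below re-expresses in Python the three-form rule commonly published for Polish, to show why two forms are not enough. The function name and the FORMS tuple are hypothetical, and a real application would let the library evaluate the rule from the catalog instead.

```python
# Polish forms of "file": one, a few (2-4, 22-24, ...), and many.
FORMS = ("plik", "pliki", "plików")


def polish_plural_index(n):
    """Return which of the three Polish plural forms applies to n."""
    if n == 1:
        return 0
    # Numbers ending in 2-4 take the second form, except 12, 13, and 14.
    if 2 <= n % 10 <= 4 and not 12 <= n % 100 <= 14:
        return 1
    return 2


for count in (1, 2, 5, 12, 22):
    print(count, FORMS[polish_plural_index(count)])
```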

12.3.4.   Extracting Strings

Strings to be translated are known as messages in i18n terminology. Once the strings are marked as messages and extracted, they are collected into a template for later use, a text file called a Portable Object Template (POT). This file will be used to create the multiple translation files described below.

For C sources and a number of other supported languages, the command-line tool used to extract strings is called xgettext (while the Python language is often packaged with pygettext3). Consult the i18n tools for your chosen platform and language for alternatives, if desired.

By convention, during development locale files are put into subdirectories with the general form of ./locale/{language_code}/LC_MESSAGES/{app_name}.po (and, once compiled, {app_name}.mo), while at install time they typically end up under /usr/share/locale.
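To demystify what an extraction tool does, here is a toy extractor built on Python's standard ast module. It is only a sketch and not how xgettext is implemented: extract_messages is a hypothetical helper, hello.py is a made-up file name, and it handles only simple calls of the form _('text') without escaping or plural support.

```python
import ast


def extract_messages(source, marker="_"):
    """Return sorted (line_number, message) pairs for calls like _('text')."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == marker
                and node.args
                and isinstance(node.args[0], ast.Constant)
                and isinstance(node.args[0].value, str)):
            found.append((node.lineno, node.args[0].value))
    return sorted(found)


sample = "from gettext import gettext as _\n\nprint(_('Hello World!'))\n"

# Emit template-style entries with empty translations.
for lineno, message in extract_messages(sample):
    print(f"#: hello.py:{lineno}")
    print(f'msgid "{message}"')
    print('msgstr ""')
```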

12.3.5.   Translation

Brains! Brains! I want your brain!
— Return of the Living Dead II 

This is the point where tasty human brains need to get involved.

Translations using the gettext system are placed into “Portable Object” files, named with a .po extension. These are plain text files, derived from the main template (.pot), suitable for editing by translators. Each portable object file supports a single target language, and is placed into the directory tree mentioned above.

The files will contain entries separated by blank lines that look similar to this, each headed by a comment giving the file path and line number where the message was found:

#: project/folder/hello.py:6
msgid "Hello World!"
msgstr "¡Hola Mundo!"

# next entry…
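Because the format is plain text, simple entries are easy to inspect with a script, for instance to list messages still awaiting translation. The sketch below is an assumption-laden toy: parse_po is a hypothetical helper that understands only single-line msgid/msgstr pairs (no plurals, contexts, escapes, or multi-line strings), and the second sample entry is invented for illustration.

```python
import re

# Matches a single-line msgid immediately followed by a single-line msgstr.
ENTRY = re.compile(r'msgid "(?P<id>.*)"\nmsgstr "(?P<str>.*)"')


def parse_po(text):
    """Return a {msgid: msgstr} dict for the simple entries found in text."""
    return {match["id"]: match["str"] for match in ENTRY.finditer(text)}


sample = '''\
#: project/folder/hello.py:6
msgid "Hello World!"
msgstr "¡Hola Mundo!"

#: project/folder/hello.py:9
msgid "Goodbye!"
msgstr ""
'''

catalog = parse_po(sample)
untranslated = [msgid for msgid, msgstr in catalog.items() if not msgstr]
print(untranslated)
```

For real work, a dedicated editor or library is a safer bet than a regular expression.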

Tools

_images/poedit.png

Fig. 13 Poedit screen shot, courtesy 

The MIT-licensed Poedit   is a good choice for this kind of work, as it maintains the correct file format, checks for consistency, and has a simple, convenient interface for translators, which can reduce errors. There are many alternatives however.

See also:  Participate in Translation

12.3.6.   Compilation

When the completed translation files in .po format are returned or checked in, they are converted into binary message catalogs in Machine Object (.mo) format, the purpose of which is to speed loading and parsing at runtime.

With gettext this is done using the msgfmt command. 
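For the curious, the binary layout is simple enough to sketch by hand: a small header, two index tables of (length, offset) pairs, then the strings themselves. The build_mo function below is a hypothetical, stripped-down illustration of that layout (little-endian, no hash table, no plural entries). It is not a replacement for msgfmt, and the result is read back with Python's standard gettext module only as a sanity check.

```python
import gettext
import io
import struct


def build_mo(messages):
    """Pack a {msgid: msgstr} dict into the bytes of a minimal .mo catalog."""
    # The entry with an empty msgid is the catalog header; it declares the charset.
    entries = {"": "Content-Type: text/plain; charset=UTF-8\n", **messages}
    pairs = sorted((k.encode("utf-8"), v.encode("utf-8")) for k, v in entries.items())
    count = len(pairs)
    id_table = 7 * 4                     # the header is seven 32-bit fields
    str_table = id_table + 8 * count     # each index row is (length, offset)
    data_start = str_table + 8 * count   # string data follows both tables

    id_index, str_index, data = b"", b"", b""
    for msgid, _msgstr in pairs:         # originals first, NUL-terminated
        id_index += struct.pack("<II", len(msgid), data_start + len(data))
        data += msgid + b"\0"
    for _msgid, msgstr in pairs:         # then the translations
        str_index += struct.pack("<II", len(msgstr), data_start + len(data))
        data += msgstr + b"\0"

    # magic number, format revision, entry count, table offsets, empty hash table
    header = struct.pack("<7I", 0x950412DE, 0, count, id_table, str_table, 0, 0)
    return header + id_index + str_index + data


mo_bytes = build_mo({"Hello World!": "¡Hola Mundo!"})
loaded = gettext.GNUTranslations(io.BytesIO(mo_bytes))
print(loaded.gettext("Hello World!"))
```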

Packaging

Finally, the compiled machine object files are packaged up (using tools of choice), in order to be installed on the user’s computer.

12.3.7.   Runtime

At runtime, the internationalized application inspects its environment to determine its locale and loads the appropriate translations if available, otherwise falling back to the original language (most likely English). It then configures itself appropriately for use.

Locale Environment Variables

In order of precedence:

  1. LANGUAGE, a list of locale preferences.

  2. LC_ALL, sets all categories to the preferred locale.

  3. Variables to configure categories separately:

    • LC_CTYPE
    • LC_NUMERIC
    • LC_TIME
    • LC_COLLATE
    • LC_MONETARY
    • LC_MESSAGES
  4. LANG
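The lookup order for message translations can be sketched as a simple loop. The message_languages function is a hypothetical helper that roughly mirrors what Python's gettext module does when no language is given explicitly. GNU gettext applies some additional rules not shown here, and only LC_MESSAGES is consulted among the category variables since the others govern formatting rather than messages.

```python
def message_languages(environ):
    """Return the list of locales to try for messages, in preference order."""
    for name in ("LANGUAGE", "LC_ALL", "LC_MESSAGES", "LANG"):
        value = environ.get(name)
        if value:
            # LANGUAGE may hold several colon-separated preferences.
            return value.split(":")
    return ["C"]   # nothing set: the untranslated default locale


print(message_languages({"LANG": "en_US.UTF-8", "LANGUAGE": "sv:de"}))
print(message_languages({"LANG": "de_DE.UTF-8"}))
```

In a real program, pass os.environ as the argument.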

The variables above are set to a locale name, which takes the following form: ll_CC. ll is a lower-case ISO 639 two-letter language code, and CC is an upper-case ISO 3166 two-letter country code.

For example, German in Germany is de_DE. A codeset such as .UTF-8 and/or a modifier such as @latin may follow to specify a variant, e.g., en_US.UTF-8.
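The pieces of a locale name can be pulled apart with a short pattern, which is handy for logging or for choosing fallbacks. This is a sketch under assumptions: parse_locale and LOCALE_NAME are hypothetical names, and the pattern covers the common language_TERRITORY.codeset@modifier shape only, not special names such as C or POSIX.

```python
import re

LOCALE_NAME = re.compile(
    r"^(?P<language>[a-z]{2,3})"        # language code, e.g. "de"
    r"(?:_(?P<territory>[A-Z]{2}))?"    # optional country code, e.g. "DE"
    r"(?:\.(?P<codeset>[^@]+))?"        # optional codeset, e.g. "UTF-8"
    r"(?:@(?P<modifier>.+))?$"          # optional modifier, e.g. "latin"
)


def parse_locale(name):
    """Split a locale name into its components; missing parts are None."""
    match = LOCALE_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized locale name: {name!r}")
    return match.groupdict()


print(parse_locale("en_US.UTF-8"))
print(parse_locale("sr_RS@latin"))
```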

Desktop Apps

The locale of GUI desktop programs is often configured globally in the control panel. Look for options described with the terms such as “regional” or “language” settings.

Tip:  localepurge

So now you (as an end-user), have dozens of apps and hundreds of packages installed that have been lovingly translated into a hundred different languages. But you can only read one or two? Are the rest wasting a lot of disk space? Unless you are administering hot-desking workstations at U.N. headquarters, the answer is very likely yes.

On Linux, the remedy is one of localepurge, localedef, or the now-infamous bleachbit.

With the Native Language Support (NLS) section now concluded, you should have a good idea how multiple language support is achieved during modern application development.

12.4.   Additional Matters

Mom just get me a Pepsi, please
All I want is a Pepsi, and she wouldn't give it to me
All I wanted was a Pepsi, just one Pepsi, and she wouldn't give it to me
Just a Pepsi
— Suicidal Tendencies, Institutionalized 

When internationalizing your project, other concerns  you’ll often need to be aware of are listed below:

  • Postal address format, postal codes,
    and choice of delivery services
  • Telephone number format
  • Currency (symbols, positions of currency markers, and reasonable amounts due to differing inflation histories)
  • Systems of measurement
  • Electricity standards, battery sizes
  • Printing - paper sizes
  • Broadcast television systems

12.4.1.   Number and Currency Formatting

Numbers unfortunately have differing representations in different locales and will need to be formatted properly. For example:

12,345.67       English
12.345,67       Germany and other European countries.
12 345,67       French
1,2345.67       Asia

Additional variations are measurement systems such as Metric or Imperial, and numbers spelled out in various languages. This is typically handled by calling a function similar to:

locale.format_string('%d', value, grouping=True)

before outputting a numerical value.
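To show what such a function does under the hood without depending on which locales happen to be installed, here is a hand-rolled sketch. The format_number helper and its parameters are hypothetical; production code would normally rely on the locale data supplied by the platform or an i18n library rather than passing separators around.

```python
def format_number(value, group_sep=",", decimal_sep=".", decimals=2):
    """Format value with the given grouping and decimal separators."""
    text = f"{value:,.{decimals}f}"      # English-style baseline
    # Swap through a placeholder so the separators don't clobber each other.
    return (text.replace(",", "\0")
                .replace(".", decimal_sep)
                .replace("\0", group_sep))


print(format_number(12345.67))                                   # English style
print(format_number(12345.67, group_sep=".", decimal_sep=","))   # German style
print(format_number(12345.67, group_sep=" ", decimal_sep=","))   # French style
```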

12.4.2.   Time and Timezones

There’s not a lot to be said about this sizable subject that couldn’t be looked up elsewhere. However, there are two main guidelines to follow when developing with dates and times across timezones.

Tip:  UTC FTW, Baby!

  1. Store times as UTC .

    The first rule of “timezone club” is that all times that will be needed later should be stored as Coordinated Universal Time (aka UTC).

  2. Convert to local time for display.

    Only when a time is needed to be shown to a user should we convert it back to local time for display.
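The two rules can be sketched with the standard datetime module. The helper names to_storage and to_display are hypothetical, and fixed UTC offsets are used only to keep the example self-contained; real applications would typically use named timezones so that daylight-saving rules are applied.

```python
from datetime import datetime, timedelta, timezone


def to_storage(aware_dt):
    """Rule 1: convert a timezone-aware datetime to a UTC ISO 8601 string."""
    if aware_dt.tzinfo is None:
        raise ValueError("naive datetimes are ambiguous; attach a timezone first")
    return aware_dt.astimezone(timezone.utc).isoformat()


def to_display(stored, display_tz):
    """Rule 2: convert a stored UTC string to the viewer's timezone."""
    return datetime.fromisoformat(stored).astimezone(display_tz)


# Fixed offsets standing in for Tokyo and New York (winter time).
tokyo = timezone(timedelta(hours=9), "JST")
new_york = timezone(timedelta(hours=-5), "EST")

created = datetime(2024, 1, 15, 9, 30, tzinfo=tokyo)
stored = to_storage(created)
shown = to_display(stored, new_york)
print(stored)
print(shown.strftime("%Y-%m-%d %H:%M %Z"))
```

The stored value is the same no matter who wrote it or who reads it; only the final conversion depends on the viewer.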

The use of UTC handles default use cases well. There are a few minor exceptions to the guidelines above, however. Namely, it may be better to use local time when storing times for:

  • Future time scheduling or repetitive events (alarms & scheduled tasks), so that they are not affected by DST / timezone rule changes, or
  • Date-only values

If local times are chosen after all, storing the UTC offset alongside them allows the time to be converted unambiguously at a later date.

12.4.3.   Keyboard Layouts & IMEs

Using an input method is obligatory for any language that has more graphemes than there are keys on the keyboard.
— Wikipedia 
_images/ime.png

Fig. 14 Chinese Pinyin-based IME 

While this subject is handled for you under modern operating systems, it can be helpful to know this is a thing.

Ever wonder how they type in Chinese on a ~100-or-so-key keyboard? Input Method Editors (IMEs) are how. Symbols may be typed in phonetically, drawn with a mouse or by touch, or potentially entered in other ways. IMEs may also be used to type the less-common diacritics of Latin-based scripts.


Wrap Up

It doesn't matter, I'll probably get hit by a car anyway.
— Suicidal Tendencies, Institutionalized 

Hopefully, you’ve now got a good head-start on the regional concerns of software development.

  • Software products are likely to make more money abroad than in the author’s home country.
  • International politics, law enforcement, privacy rules, and user support may be factors; be aware.
  • Internationalization (I18N) is the process of preparing your application to be localized to specific locales.
  • Localization (L10N) is the process of translating and configuring an application or platform for a locale (place and/or community).
  • “I18n is usually taken care of by programmers, and l10n is usually taken care of by translators.”
  • I18n/L10n are ongoing processes that occur until a product is discontinued.
  • GNU gettext is a ubiquitous and free tool for i18n/l10n.
  • Use UTC when storing and calculating dates, converting to local time at display-time.
  • Input Method Editors (IMEs) are a thing.
  • Get your son a Pepsi if he asks.

There’s a lot more to cover about development from the regional angle in the following chapter about character sets and text encodings. Onward!