Category Archives: localization

glibc 2.28: New and Updated Locales

Version 2.28 of the glibc library was released on schedule, on August 1, 2018. This time the changes in locale support are not revolutionary. Most of them continue work which was started and partially completed in previous versions.

Alphabetic Collation

Shortly after the previous release, the work on polishing alphabetic sorting was finished. It applies to all languages because the collation algorithm must handle not only the characters of the current locale but foreign characters as well. The collation rules of the ISO 14651:2016 standard have been fully imported. The standard itself also evolves, although the changes matter little from the end users’ point of view: they apply, for example, to punctuation marks or to rare letters used in little-known languages.

Automatic collation tests for over 50 languages have been added to glibc itself. In the future they should detect any upstream errors immediately.

Regular Expressions

The corrected sort order has some problematic consequences. It turns out that range expressions in regular expressions take the collation rules of the current locale into account. As a result:

  • The [a-z] expression matches not only the lowercase letters but also the uppercase ones, because they are interlaced with the lowercase letters in the collation order (e.g., a, A, b, B, …). It does not match Z, however, because Z is collated after z.
  • Source code manipulation systems which assumed that all source file names start with a lowercase letter, so that the regular expression [a-z]* matches all of them except Makefile, have stopped working correctly.
  • The [0-9] expression now matches not just the digits from 0 to 9 but also all symbols which can be interpreted as numbers: for example fractions, superscript and subscript digits, and digits from other numeral systems (Eastern Arabic, Indian, etc.).

There are many good reasons why ranges in regular expressions should be based on Unicode codepoint order rather than locale dependent collation order.
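The difference between the two kinds of ranges can be illustrated without glibc. The Python sketch below is only a model: INTERLACED is a hypothetical stand-in for an interlaced collation order (a, A, b, B, …), not data taken from any real locale, and both helper functions are illustrative names.

```python
import string

# Hypothetical interlaced collation order: a, A, b, B, ..., z, Z.
# It models the behaviour described above; it is not real locale data.
INTERLACED = "".join(
    lower + upper
    for lower, upper in zip(string.ascii_lowercase, string.ascii_uppercase)
)


def in_range_by_collation(ch, first, last, order=INTERLACED):
    """Range test based on positions in a collation sequence."""
    if ch not in order:
        return False
    return order.index(first) <= order.index(ch) <= order.index(last)


def in_range_by_codepoint(ch, first, last):
    """Range test based on raw Unicode code points."""
    return ord(first) <= ord(ch) <= ord(last)


for ch in "aBzZ":
    print(ch,
          in_range_by_collation(ch, "a", "z"),
          in_range_by_codepoint(ch, "a", "z"))
```

With the interlaced order, B falls inside the a to z range and Z falls outside it, while the code point test accepts the lowercase letters only.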

Since the error was spotted late in the development cycle, at the beginning of July, a quick workaround has been introduced which deinterlaces the collation order of the lowercase and uppercase letters of the Latin alphabet. This workaround is temporary and will be reverted as soon as a correct implementation of range expressions is available.

Unicode 11

Full support for the Unicode 11 standard has been introduced. This means that the rules assigning the new characters and alphabets to their proper categories (letters, digits, punctuation marks, etc.) and the new transliteration rules have been added, along with the new emoji characters. Of course, those changes mostly apply to alphabets other than those commonly used in Europe. For example, single characters have been added to the Armenian, Hebrew, Arabic, and some Indian scripts. Whole new sets of characters have been added as well: the Mtavruli block in the Georgian script; the Hanifi Rohingya, Sogdian and Old Sogdian, Dogri, Gondi, and Makasar scripts; Maya and Siyaq numerals; and so on. Many of these characters and scripts are only of historic interest.

New Locales

This time only two new locales have been added: Lower Sorbian and Yakut. Lower Sorbian is a Slavic language closely related to Polish, used in Lower Lusatia, a region of Germany around Cottbus (Lower Sorbian: Chóśebuz). Sadly, the language is heavily endangered: it is used by only 6–7 thousand people. Yakut (also known as Sakha) belongs to the Turkic family and is used by approx. 450 thousand people in the Sakha Republic (Yakutia), which is part of the Russian Federation. They make up nearly half of the population of the region.

It’s worth mentioning that both of these languages are inflected and require a genitive case of a month name when formatting a date.

Correct Date Formats in Inflected Languages

Speaking of which, 2.28 is the second glibc release, after 2.27, which supports two grammatical forms of month names. The previous work can be called successful; the subsequent changes only add languages which were not covered in the previous release due to lack of time.

Two grammatical forms of month names (usually nominative and genitive) are now also supported in the following languages: Armenian, Asturian, Catalan, Czech, Kashubian, Occitan, Ossetian, Scottish Gaelic, Upper Sorbian, and Walloon. Together with the two newly added locales, this makes a total of 19 languages using grammatically correct forms in dates.

It turned out that the difference between the nominative and genitive case is visible in abbreviated month names not just in Russian and Belarusian, whose word for May is so short that it cannot be abbreviated (nominative: май, pronounced may; genitive: мая, pronounced maya), but also in Greek, in multiple month names (e.g., July, nominative: Ιούλιος, genitive: Ιουλίου, abbreviated forms: Ιούλ and Ιουλ, respectively).

In Kashubian the difference between the nominative and genitive case of May also turned out to be visible in the abbreviated form (nominative: môj, genitive: maja, abbreviated: môj and maj, respectively). Catalan translators asked to add the prefixes de and d’ to the abbreviated forms too, in line with CLDR. As a reminder, a request to add support for two grammatical cases of abbreviated month names to the POSIX standard was filed more than a year ago.

Minor Changes

Names of the weekdays and months in Aragonese have been corrected. Abbreviated month names in Lithuanian have been corrected to match the current implementation in the GLib library (part of the GNOME project) and CLDR, which, by the way, soon caused the automatic GLib tests to fail with older versions of glibc. Minor typos have been fixed in Kashubian and Scottish Gaelic.

glibc 2.27: New and Updated Locales

See also: glibc 2.26: New and Updated Locales.

The new version, glibc 2.27, was released on February 1, 2018 (or February 2, depending on your time zone). This is a much belated report of the changes in locale support.


Major rework has been started on correct alphabetic sorting based on the ISO 14651:2016 standard (a publicly available version exists). It was finished only after the glibc 2.27 release, but the work in progress has already fixed the collation rules of many languages, including Mandarin Chinese (Taiwan), Croatian, Czech, Estonian, Canadian French, Icelandic, Latvian, Lithuanian, Polish, Turkish, and Upper Sorbian. Much of this work was completed or at least started during the Internationalization FAD and was therefore sponsored by the Fedora Project. Big thanks to Mike Fabian for his great contribution!

Correct Date Formats

Another major change which must be mentioned here is the introduction of date formats using the correct grammatical forms in inflected languages. This feature needs a separate article, which will be written later. In short: from now on the glibc functions nl_langinfo() and strftime() can support not just two forms of month names (full and abbreviated) but four: full and abbreviated names as used in dates, which in inflected languages often means the genitive case, and full and abbreviated names as used standalone, which often means the nominative case. For example, in Polish the month May is maj, but to express a date it is obligatory to use the genitive case: 29 maja. The feature is optional, which means that languages which don’t need it will not see any change.
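The idea of keeping two forms of each month name can be modelled in a few lines of Python. This is a sketch of the concept only, not the glibc implementation: the tables MON and ALTMON and both functions are hypothetical names chosen for illustration. To my knowledge glibc exposes the two forms as the nl_langinfo() items MON_1…MON_12 and ALTMON_1…ALTMON_12 and as the strftime() conversions %B and %OB, but treat these names as an assumption and check the glibc manual before relying on them.

```python
# Polish month names in the form used inside a date (genitive) and in the
# standalone form (nominative). Index 0 is unused so that month numbers
# map directly to list positions. Both tables are illustrative only.
MON = [None, "stycznia", "lutego", "marca", "kwietnia", "maja", "czerwca",
       "lipca", "sierpnia", "września", "października", "listopada",
       "grudnia"]
ALTMON = [None, "styczeń", "luty", "marzec", "kwiecień", "maj", "czerwiec",
          "lipiec", "sierpień", "wrzesień", "październik", "listopad",
          "grudzień"]


def month_name(month, standalone=False):
    """Return the standalone form on request, otherwise the in-date form."""
    table = ALTMON if standalone else MON
    return table[month]


def format_day_month(day, month):
    """Format a day and a month the way a full date requires."""
    return f"{day} {month_name(month)}"


print(format_day_month(29, 5))          # 29 maja
print(month_name(5, standalone=True))   # maj
```

A language which does not inflect month names would simply carry identical entries in both tables, which is why the feature is invisible there.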

Introducing a software feature does not change anything until locale data using it is provided. The Polish locale data was updated first, shortly followed by Ukrainian, and then Russian, Greek, Belarusian, Lithuanian, and finally Croatian. The Ukrainian locale data had been using the alternative digits feature to provide month names in the genitive case for the last 11 years. This solution has been recognized as a dirty hack and removed; it also seems that it was not widely known and therefore not widely used.

The change appeared in the upstream repository only 10 days before the final release, so there was not enough time to add more languages. The next release will include updated locale data for Catalan, Czech, and a few other languages.

New Locales

As every release, this adds new locales. There are 6 new languages: Kabyle, Karbi, Mauritian Creole (Morisyen), Miskito, Shan, and Yau (also called Uruwa), also 3 new variants: Bhojpuri for Nepal, English for the Seychelles, and Valencian (dialect of Catalan).

Kabyle is spoken by about 5 million people in Algeria, which makes it the third most spoken language of the country. Karbi is a minority language spoken by about 400,000 people in north-eastern India and north-eastern Bangladesh. Morisyen is the most spoken language of Mauritius (about 1 million speakers). Miskito is a native language spoken by about 150,000 people in Nicaragua and Honduras. Shan is spoken by more than 3 million people in Myanmar, which makes it the second most spoken language of the country. Yau is the smallest language added in this release, spoken by about 1,700 people in Papua New Guinea.

Bhojpuri is the third most spoken language of Nepal (6% of total population). It is also spoken in India and as such has been supported by glibc previously. Valencian Catalan language (ca_ES@valencia) is spoken by about 2.3 million people in Valencia, a community in Spain. It has been supported by some Linux distributions as a downstream patch for many years. From now it is officially in glibc. English does not need its introduction: of course, it has been present in computer industry since forever. It is also an official language of Seychelles along with French and Seychellois Creole.

Lots of Minor Fixes

There are also many other minor bug fixes in this release. The localized messages for yes and no and single-letter answers have been updated in many locales. Chinese, Japanese, and Korean accept full-width Y and N characters as valid answers. Some redundant data have been removed, for example all monetary data for all locales of India are now dynamically copied from Hindi. If there are bugs detected or changes are introduced in future it will be easy to change only one file. More updates include monetary and numerical formats, also less used data like phone number formats, address data, or ISBN numbers have been updated in many locales.

Finally, most of the Unicode sequences (like: <Uxxxx> where each x means a hexadecimal digit) in a source code of locale data have been replaced with ASCII characters, wherever possible. Nowadays nobody remembers why these sequences were required but plain ASCII turned out to be working perfectly. Of course, the characters from outside the basic ASCII range still remain encoded as the Unicode sequences.
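A conversion of this kind could be sketched in Python as follows. The sketch is an illustration, not the script actually used for glibc. The function name simplify and the set KEEP_ENCODED are hypothetical, and the choice of characters left encoded (because they might have a special meaning in a locale source file) is an assumption.

```python
import re

# Characters this sketch leaves encoded because they might have a special
# meaning in a locale source file. The exact set is an assumption.
KEEP_ENCODED = set('"<>/%')

UCS_SEQUENCE = re.compile(r"<U([0-9A-Fa-f]{4,8})>")


def simplify(text):
    """Replace <Uxxxx> with the character itself if it is printable ASCII."""
    def replace(match):
        code = int(match.group(1), 16)
        if 0x20 <= code <= 0x7E and chr(code) not in KEEP_ENCODED:
            return chr(code)
        return match.group(0)   # outside ASCII or special: keep as is
    return UCS_SEQUENCE.sub(replace, text)


print(simplify("<U006D><U0061><U006A>"))   # maj
print(simplify("<U006D><U00F4><U006A>"))   # m<U00F4>j
```

The second call shows that a character from outside the basic ASCII range stays encoded, as described above.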

How Polish Plurals in MATE Went Broken

On March 13, 2017 the new version 1.18 of MATE Desktop was released. One of the last minute changes in the project was pulling the most recent translations from Transifex. Usually this is a good thing but apparently for the Polish language this turned out to be a little disaster because the plural rules have been (incorrectly) changed.

Plural rules

Foreign readers deserve an explanation here. Polish plural rules (as well as of several other Slavic languages) are a little more complex than English. There are three forms required:

  • 1 – singular – that’s obvious and similar to English and other Indo-European languages.
  • 2, 3, 4, and anything ending with 2, 3, 4 except 12, 13, 14 (for example: 22, 23, 24, 32, 33, 34 and so on). This group is sometimes referred to as few in some internationalization toolkits.
  • everything else (5 and greater except the numbers mentioned above). This group is sometimes referred to as many.

Plurals support in the gettext package is good and complete. All we need is to write the correct rules in the header of a *.po file. This task should be done once, and the rules can be reused for every translation into the same language: grammar rules don’t change often, so we can safely assume that they never change. Usually for Polish translations we use this formula:

"Plural-Forms: nplurals=3; plural=(n==1 ? 0 : n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2);\n"

This expression is neither simple nor complex. It is just sufficient to describe what the language needs.
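For readers who prefer code to grammar, the three-form rule can be written as a small Python function. It is a sketch which mirrors the usual Polish formula term by term; the function name and the sample word are chosen for illustration only.

```python
def polish_plural(n):
    """Index of the plural form (0, 1 or 2) for a non-negative integer n."""
    if n == 1:
        return 0    # singular
    if 2 <= n % 10 <= 4 and (n % 100 < 10 or n % 100 >= 20):
        return 1    # "few": 2, 3, 4, 22, 23, 24, ...
    return 2        # "many": 0, 5..21, 25..31, ...


# The three forms of the Polish word for "file".
FORMS = ("plik", "pliki", "plików")

for n in (1, 2, 5, 12, 14, 22, 112):
    print(n, FORMS[polish_plural(n)])
```

The numbers 12, 14 and 112 end with 2 or 4 but still take the third form, which is exactly the exception described in the list above.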

Here comes the disaster

On March 13 the commit synchronizing translations from Transifex changed the plural rules for Polish language. The new formula is:

"Plural-Forms: nplurals=4; plural=(n==1 ? 0 : (n%10>=2 && n%10<=4) && (n%100<12 || n%100>=14) ? 1 : n!=1 && (n%10>=0 && n%10<=1) || (n%10>=5 && n%10<=9) || (n%100>=12 && n%100<=14) ? 2 : 3);\n"

Now this is complex, isn’t it? What’s wrong with this expression:

  • it states that Polish language needs 4 forms to support plurals which is not true;
  • it is unnecessarily complex: if the expression states that n==1 belongs to the group 0 there is no need to make sure that n!=1 in the further part;
  • the complexity leads to one actual bug: the second group includes all numbers which end with 2, 3, 4 (correct), except 12 and 13 (incorrect, 14 must be excluded as well);
  • the result 3 is unreachable which is correct but confusing for translators.
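These claims are easy to check mechanically. The Python sketch below translates the four-form expression quoted above term by term, keeping the C operator precedence (&& binds tighter than ||), and compares it with the usual three-form rule. The function names and the tested range 0 to 199 are arbitrary choices for illustration.

```python
def correct_rule(n):
    """The usual three-form Polish rule."""
    if n == 1:
        return 0
    if 2 <= n % 10 <= 4 and (n % 100 < 10 or n % 100 >= 20):
        return 1
    return 2


def four_form_rule(n):
    """The four-form expression quoted above, translated term by term."""
    if n == 1:
        return 0
    if (2 <= n % 10 <= 4) and (n % 100 < 12 or n % 100 >= 14):
        return 1
    if ((n != 1 and 0 <= n % 10 <= 1)
            or 5 <= n % 10 <= 9
            or 12 <= n % 100 <= 14):
        return 2
    return 3


numbers = range(0, 200)
mismatches = [n for n in numbers if correct_rule(n) != four_form_rule(n)]
reached = sorted({four_form_rule(n) for n in numbers})

print(mismatches)   # [14, 114]
print(reached)      # [0, 1, 2]
```

Within the tested range the two rules disagree only for numbers ending with 14, and the result 3 never appears.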

As MATE Desktop is a large project consisting of multiple applications (like Caja file manager, Pluma text editor etc.) the same happened to every single application of the project.

Difficult to fix

The bug was reported upstream immediately. The MATE project maintainers responded that the bug came from Transifex, so it is pointless to fix it in the MATE source code repository: the next pull would overwrite the fix.

Unfortunately, it is not so easy to file a ticket with Transifex: it has neither Bugzilla nor any other ticket system. Still, some people managed to contact the Transifex team. They responded that they had pulled the plural rules from CLDR, which lists 4 plural forms for Polish, although they admitted that assigning the number 14 to the few group was their fault and fixed it. As the MATE project continues pulling translations from Transifex, more and more of its applications will start handling the number 14 correctly. Some of the applications have been updated recently; the update is part of the 1.19 development release.

What CLDR says

Let’s look at what the CLDR database says about the Polish plural rules. Indeed, it lists 4 groups, and there is a mysterious v parameter which has something to do with fractions, because the sample values show fractional forms. But as gettext supports integer values only, we should drop the fractional cases entirely.

The documentation of the v parameter is difficult to find, but once you find it you can read that it means “number of visible fraction digits in n, with trailing zeros”. In this sentence, n is the number controlling the plural form itself.
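A minimal Python sketch makes this definition concrete. The function name is hypothetical, and the number is passed as text on purpose, because a floating point value would not preserve trailing zeros.

```python
def visible_fraction_digits(number_text):
    """Count the digits after the decimal point, trailing zeros included."""
    _, _, fraction = number_text.partition(".")
    return len(fraction)


for text in ("1", "22", "1.0", "1.50"):
    print(text, visible_fraction_digits(text))
```

Every integer has v equal to 0, so a rule which handles integers only, as gettext does, never needs the cases where v is greater than 0.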

Other languages

CLDR provides additional forms for fractions for other languages as well: Czech, Manx, Russian, Slovak, Ukrainian. For some other languages (Bosnian, Croatian, Filipino, Macedonian, Serbian, Lower and Upper Sorbian) the rules seem to be even more complex: fractional values belong to multiple integer groups.

This should be a warning for other languages that their rules might have been broken in Transifex as well. However, the further investigation of MATE Desktop source code does not reveal any recent changes in plural rules of other languages.


It seems that pulling plural rules from CLDR automatically is not a good idea.

Translators and language coordinators: please make sure that your plural rules are correct.

Transifex and other translation platforms: please don’t pull the translation rules from CLDR without a thorough analysis. Better ask the language communities and reuse the existing rules.

CLDR: please simplify your plural expressions and make the documentation of fractions support easier to access.

glibc 2.26: New and Updated Locales

On August 2, 2017, version 2.26 of glibc (the GNU C Library) was released. Among other things, many issues related to the supported locales have been addressed, most of them shortly before the release. Let’s see what has changed.

New locales

Compared to the previous version, this release introduces the support of 6 new languages: Aguaruna, Bislama, Fiji Hindi, Samoan, Tok Pisin, and Tongan as well as 2 new variants: South Azerbaijani for Iran, and Maithili for Nepal.

Aguaruna is spoken by about 38,000–45,000 indigenous people in Peru. Bislama is an official language of Vanuatu, although it is spoken by only about 10,000 people. Fiji Hindi descends from Hindi but differs from it. It is spoken by about 300,000 citizens of Fiji, which makes up about ⅓ of the total population, and is one of the official languages of the country. It is written in both the Latin and the Devanagari script. This release introduces the Latin script only, but adding Devanagari is being considered for the future. Tok Pisin is one of the official languages of Papua New Guinea. Although it has only 120,000 native speakers, which is 1.7% of the total population, it is the most widely used language of the country. No wonder, since Papua New Guinea has about 850 native languages.

South Azerbaijani is a variant of Azerbaijani language spoken by about 13 million people (16% of total population) in Iran and Maithili is spoken by about 3 million people (11.5% of total population) in Nepal. Both have been previously represented by their variants for Azerbaijan and India, respectively. Now their users may enjoy more granularity.


Bugs in alphabetic sorting in Hungarian and Malayalam have been fixed. Many other fixes have been made in date and time elements, mostly in month names. Typos in full names, abbreviated names, or both have been fixed in, among others, Arabic (many variants), Belarusian, Breton, Friulian, Hindi, Kannada, Konkani, Malayalam, Marathi, Mongolian, Northern Sami, Serbian (Latin only), Spanish (Peru and Uruguay), Uzbek, Yoruba, and Zulu; a total of 55 languages have been updated to the content of CLDR version 31. Weekday names have been updated in Arabic, Chechen, and Kashmiri; Saudi Arabian users had them displayed in English so far. Translated yes and no strings have been added or fixed in many languages.

Incorrectly appended trailing spaces have been removed in several locales, usually from weekday names. They mainly include languages of India but also Albanian (where the issue has been first spotted), Haitian, Maltese, and more. This change will polish date formatting in these locales.

Unicode 10.0

This version also introduces the full support of Unicode 10.0. The changes are mainly focused on new emoji characters.

It’s worth mentioning that the full Unicode 10.0 support has been added to glibc only 2 days after its official release by the Unicode Consortium.