Category Archives: localization

glibc 2.27: New and Updated Locales

See also: glibc 2.26: New and Updated Locales.

glibc 2.27 was released on February 1, 2018 (or February 2, depending on your time zone). This is a much belated report on the changes in locale support.

Collation

A major rework of alphabetic sorting has been started, based on the ISO 14651:2016 standard (a publicly available version exists). It was finished only after the glibc 2.27 release, but the work in progress already fixed the collation rules for many languages, including Mandarin Chinese (Taiwan), Croatian, Czech, Estonian, Canadian French, Icelandic, Latvian, Lithuanian, Polish, Turkish, and Upper Sorbian. Much of this work was completed or at least started during the Internationalization FAD and was therefore sponsored by the Fedora Project. Big thanks to Mike Fabian for his great contribution!

Correct Date Formats

Another major change which must be mentioned here is the introduction of date formats using the correct grammatical forms in inflected languages. This feature needs a separate article which will be written later. In short: the glibc functions nl_langinfo() and strftime() can now support not only two forms of month names (full and abbreviated) but four: each of them exists both in the form used in dates, which often means the genitive case in inflected languages, and in the form used standalone, which often means the nominative case. For example, in Polish the month of May is maj, but to express a date the genitive case is obligatory: 29 maja. The feature is optional, which means that languages which don’t need it will not see any change.

Introducing a software feature does not change anything until locale data using it is provided. The Polish locale data was updated first, shortly followed by Ukrainian, and then Russian, Greek, Belarusian, Lithuanian, and finally Croatian. For the last 11 years the Ukrainian locale data had been using the alternative digits feature to provide month names in the genitive case. This solution was recognized as a dirty hack and removed; it also seems it was not widely known and therefore not widely used.
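A minimal sketch of how the two forms could be queried from Python, which hands the format string over to the C library’s strftime(). It assumes a system with glibc 2.27 or newer and the pl_PL.UTF-8 locale installed; the helper name month_forms is made up for this example. %B requests the form used inside a date and %OB the standalone form.

```python
import locale
import time


def month_forms(locale_name="pl_PL.UTF-8", month=5):
    """Return (form used in dates, standalone form), or None if unavailable.

    month_forms is a hypothetical helper, not part of glibc or Python.
    """
    previous = locale.setlocale(locale.LC_TIME)  # remember the current setting
    try:
        locale.setlocale(locale.LC_TIME, locale_name)
    except locale.Error:
        return None  # the requested locale is not installed
    try:
        # Only the month field matters here; the other fields are placeholders.
        stamp = (2018, month, 1, 0, 0, 0, 0, 1, -1)
        in_date = time.strftime("%B", stamp)      # form used in a full date
        standalone = time.strftime("%OB", stamp)  # form used on its own
        return in_date, standalone
    except ValueError:
        return None  # this platform rejects the %OB specifier
    finally:
        locale.setlocale(locale.LC_TIME, previous)  # restore the old setting
```

With the updated Polish data, the call would be expected to return the genitive and the nominative name (maja and maj for May). On older glibc versions or other C libraries %OB may be unsupported, so the result should be treated as platform dependent.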

The change appeared in the upstream repository only 10 days before the final release, so there was not enough time to add more languages. The next release will include updated locale data for Catalan, Czech, and a few other languages.

New Locales

As with every release, this one adds new locales. There are 6 new languages: Kabyle, Karbi, Mauritian Creole (Morisyen), Miskito, Shan, and Yau (also called Uruwa), as well as 3 new variants: Bhojpuri for Nepal, English for the Seychelles, and Valencian (a dialect of Catalan).

Kabyle is spoken by about 5 million people in Algeria, which makes it the third most spoken language of the country. Karbi is a minority language spoken by about 400,000 people in north-eastern India and north-eastern Bangladesh. Morisyen is the most spoken language of Mauritius (about 1 million speakers). Miskito is a native language spoken by about 150,000 people in Nicaragua and Honduras. Shan is spoken by more than 3 million people in Myanmar, which makes it the second most spoken language of the country. Yau is the smallest language added in this release, spoken by about 1,700 people in Papua New Guinea.

Bhojpuri is the third most spoken language of Nepal (6% of the total population). It is also spoken in India and as such was already supported by glibc. Valencian (ca_ES@valencia) is spoken by about 2.3 million people in Valencia, a community in Spain. For many years it was supported by some Linux distributions as a downstream patch. Now it is officially in glibc. English needs no introduction: of course, it has been present in the computer industry since forever. It is also an official language of the Seychelles, along with French and Seychellois Creole.

Lots of Minor Fixes

There are also many other minor bug fixes in this release. The localized messages for yes and no, as well as the single-letter answers, have been updated in many locales. Chinese, Japanese, and Korean now accept the full-width Y and N characters as valid answers. Some redundant data has been removed; for example, the monetary data for all locales of India is now copied dynamically from Hindi. If bugs are detected or changes are introduced in the future, it will be enough to change only one file. Further updates cover monetary and numerical formats, and also less used data like phone number formats, address data, and ISBN numbers, in many locales.
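To illustrate what accepting full-width answers means in practice, here is a small sketch that matches user input against a yes or no expression. The patterns are simplified assumptions written for this example, not the actual data shipped in any glibc locale, and classify_answer is a made-up name; a real program would obtain the expressions from the locale, for instance with nl_langinfo(YESEXPR).

```python
import re

# Hypothetical expressions: the ASCII letters plus their full-width forms,
# U+FF59/U+FF39 (small and capital Y) and U+FF4E/U+FF2E (small and capital N).
YES_EXPR = re.compile("^[yY\uFF59\uFF39]")
NO_EXPR = re.compile("^[nN\uFF4E\uFF2E]")


def classify_answer(text):
    """Return "yes", "no", or None when the answer is not recognized."""
    if YES_EXPR.match(text):
        return "yes"
    if NO_EXPR.match(text):
        return "no"
    return None
```

Only the first character is checked here, which mirrors how such expressions are usually anchored at the start of the answer.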

Finally, most of the Unicode sequences (like <Uxxxx>, where each x is a hexadecimal digit) in the source code of the locale data have been replaced with ASCII characters wherever possible. Nowadays nobody remembers why these sequences were required, and plain ASCII turned out to work perfectly. Of course, the characters outside the basic ASCII range remain encoded as Unicode sequences.
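The kind of conversion described here can be sketched as follows. This is not the tool used by the glibc maintainers; the function name simplify_ucs is hypothetical, and restricting the replacement to ASCII letters, digits, and the space is a conservative assumption, since some ASCII punctuation has a special meaning in locale source files.

```python
import re

# Matches sequences like <U0041>: four hexadecimal digits in angle brackets.
_UCS_SEQUENCE = re.compile(r"<U([0-9A-Fa-f]{4})>")


def simplify_ucs(line):
    """Replace <Uxxxx> sequences with plain ASCII where that is clearly safe."""
    def replace(match):
        char = chr(int(match.group(1), 16))
        # Assumption: only letters, digits, and the space are safe to inline.
        if char.isascii() and (char.isalnum() or char == " "):
            return char
        return match.group(0)  # keep everything else encoded
    return _UCS_SEQUENCE.sub(replace, line)
```

For example, simplify_ucs("<U006D><U0061><U006A>") yields "maj", while a non-ASCII letter such as <U0142> is left untouched.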

How Polish Plurals in MATE Went Broken

On March 13, 2017 the new version 1.18 of MATE Desktop was released. One of the last-minute changes in the project was pulling the most recent translations from Transifex. Usually this is a good thing, but for the Polish language it turned out to be a small disaster because the plural rules were (incorrectly) changed.

Plural rules

Foreign readers deserve an explanation here. Polish plural rules (as well as those of several other Slavic languages) are a little more complex than the English ones. Three forms are required:

  • 1 – singular – that’s obvious and similar to English and other Indo-European languages.
  • 2, 3, 4, and anything ending with 2, 3, 4 except 12, 13, 14 (for example: 22, 23, 24, 32, 33, 34 and so on). This group is sometimes referred to as few in some internationalization toolkits.
  • everything else (5 and greater except the numbers mentioned above). This group is sometimes referred to as many.

Plural support in the gettext package is good and complete. All we need to do is write the correct rules in the header of a *.po file. This task should be done once, and the rules can be reused for every translation into the same language: grammar rules change so rarely that we can safely assume they never change. Usually for Polish translations we use this formula:

"Plural-Forms: nplurals=3; plural=(n==1 ? 0 : n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2);\n"

This expression is neither simple nor complex, just sufficient to describe what the language needs.
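To see the rule at work, the formula can be hand-translated into Python as a rough sketch. The function names are made up for this example, and the three forms of the noun plik (file) serve only as sample data; gettext itself evaluates the C-like expression from the header directly.

```python
def polish_plural_index(n):
    """Index of the plural form for n, following the 3-form Polish formula."""
    if n == 1:
        return 0  # singular
    if 2 <= n % 10 <= 4 and (n % 100 < 10 or n % 100 >= 20):
        return 1  # "few": 2, 3, 4, 22, 23, 24, ... but not 12, 13, 14
    return 2      # "many": everything else


# Sample forms of the Polish noun "plik" (file): singular, few, many.
FORMS = ("plik", "pliki", "plików")


def count_files(n):
    """Format a count with the matching noun form."""
    return f"{n} {FORMS[polish_plural_index(n)]}"
```

For instance, count_files(22) selects the second form (pliki) while count_files(12) selects the third (plików).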

Here comes the disaster

On March 13 the commit synchronizing translations from Transifex changed the plural rules for the Polish language. The new formula is:

"Plural-Forms: nplurals=4; plural=(n==1 ? 0 : (n%10>=2 && n%10<=4) && (n%100<12 || n%100>=14) ? 1 : n!=1 && (n%10>=0 && n%10<=1) || (n%10>=5 && n%10<=9) || (n%100>=12 && n%100<=14) ? 2 : 3);\n"

Now this is complex, isn’t it? What’s wrong with this expression:

  • it states that Polish language needs 4 forms to support plurals which is not true;
  • it is unnecessarily complex: if the expression states that n==1 belongs to the group 0 there is no need to make sure that n!=1 in the further part;
  • the complexity leads to one actual bug: the second group includes all numbers which end with 2, 3, 4 (correct), except 12 and 13 (incorrect, 14 must be excluded as well);
  • the result 3 is unreachable which is correct but confusing for translators.
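The points above can be checked mechanically. The sketch below hand-translates both header expressions into Python (the function names are made up for this example) and compares their results for a range of non-negative integers.

```python
def plural_three_forms(n):
    """The usual Polish formula with nplurals=3."""
    if n == 1:
        return 0
    if 2 <= n % 10 <= 4 and (n % 100 < 10 or n % 100 >= 20):
        return 1
    return 2


def plural_four_forms(n):
    """The formula with nplurals=4 pulled from Transifex."""
    if n == 1:
        return 0
    if (2 <= n % 10 <= 4) and (n % 100 < 12 or n % 100 >= 14):
        return 1
    if ((n != 1 and 0 <= n % 10 <= 1)
            or (5 <= n % 10 <= 9)
            or (12 <= n % 100 <= 14)):
        return 2
    return 3


# Numbers for which the two formulas disagree.
mismatches = [n for n in range(200)
              if plural_three_forms(n) != plural_four_forms(n)]

# Whether the fourth form (index 3) is ever selected for an integer.
reaches_three = any(plural_four_forms(n) == 3 for n in range(10000))
```

In the range checked, the formulas disagree only for numbers ending in 14 (14 and 114), and index 3 is never returned.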

As MATE Desktop is a large project consisting of multiple applications (like Caja file manager, Pluma text editor etc.) the same happened to every single application of the project.

Difficult to fix

The bug was reported upstream immediately. The MATE project maintainers responded that the bug came from Transifex: it is pointless to fix it in the MATE source code repository because the next pull will overwrite the fix.

Unfortunately, it is not so easy to file a ticket with Transifex. It has neither Bugzilla nor any other ticket system. However, some people managed to contact the Transifex team. They responded that they had pulled the plural rules from CLDR, which lists 4 plural forms for the Polish language, although they admitted that assigning the number 14 to the few plural group was their fault and fixed it. As the MATE project continues pulling translations from Transifex, more and more of its applications will start handling the number 14 correctly. Some of the applications have been updated recently; the update is part of the 1.19 development release.

What CLDR says

Let’s look at what the CLDR database says about the Polish plural rules. Indeed, it lists 4 groups, and there is a mysterious v parameter which has something to do with fractions because the sample expressions display fractional forms. But as gettext supports integer values only, we should drop the fractional cases entirely.

The documentation of the v parameter is difficult to find, but once you find it you can read that it means the number of visible fraction digits in n, with trailing zeros. Here n is the number which controls the plural form.
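A rough sketch of how v works and why the fourth group never applies to integers. The category rules below are paraphrased from my reading of the CLDR chart for Polish and should be treated as an assumption to verify against CLDR itself; the function names are made up for this example.

```python
def operands(number_text):
    """Split a decimal string into the CLDR operands i and v.

    i is the integer part; v is the number of visible fraction digits,
    trailing zeros included.
    """
    integer_part, _, fraction = number_text.partition(".")
    return int(integer_part), len(fraction)


def cldr_polish_category(number_text):
    """Plural category for a non-negative decimal string (assumed rules)."""
    i, v = operands(number_text)
    if v != 0:
        return "other"  # any visible fraction digits: the fourth group
    if i == 1:
        return "one"
    if 2 <= i % 10 <= 4 and not 12 <= i % 100 <= 14:
        return "few"
    return "many"       # the remaining integers
```

Note that "1.0" has v equal to 1 and therefore falls into the fourth group, while the integer "1" is singular; with integers only, as in gettext, three groups remain.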

Other languages

CLDR provides additional forms for fractions for other languages as well: Czech, Manx, Russian, Slovak, Ukrainian. For some other languages (Bosnian, Croatian, Filipino, Macedonian, Serbian, Lower and Upper Sorbian) the rules seem to be even more complex: fractional values belong to multiple integer groups.

This should be a warning for other languages that their rules might have been broken in Transifex as well. However, further investigation of the MATE Desktop source code did not reveal any recent changes in the plural rules of other languages.

Conclusions

It seems that pulling plural rules from CLDR automatically is not a good idea.

Translators and language coordinators: please make sure that your plural rules are correct.

Transifex and other translation platforms: please don’t pull the translation rules from CLDR without a thorough analysis. Better ask the language communities and reuse the existing rules.

CLDR: please simplify your plural expressions and make the documentation of fractions support easier to access.

glibc 2.26: New and Updated Locales

On August 2, 2017 glibc (the GNU C Library) version 2.26 was released. Among other things, many issues related to the supported locales were addressed, most of them shortly before the release. Let’s see what has changed.

New locales

Compared to the previous version, this release introduces support for 6 new languages: Aguaruna, Bislama, Fiji Hindi, Samoan, Tok Pisin, and Tongan, as well as 2 new variants: South Azerbaijani for Iran and Maithili for Nepal.

Aguaruna is spoken by about 38,000–45,000 indigenous people in Peru. Bislama is an official language of Vanuatu, although it is spoken by only about 10,000 people. Fiji Hindi is descended from Hindi but differs from it. It is spoken by about 300,000 citizens of Fiji, which is about ⅓ of the total population, and it is one of the official languages of the country. It is written in both the Latin and the Devanagari scripts. This release introduces the Latin script only, but Devanagari may be added in the future. Tok Pisin is one of the official languages of Papua New Guinea. Although it has only 120,000 native speakers, which is 1.7% of the total population, it is the most widely used language of the country. No wonder, since Papua New Guinea has about 850 native languages.

South Azerbaijani is a variant of the Azerbaijani language spoken by about 13 million people (16% of the total population) in Iran, and Maithili is spoken by about 3 million people (11.5% of the total population) in Nepal. Both were previously represented by their variants for Azerbaijan and India, respectively. Now their users may enjoy more granularity.

Updates

Bugs in alphabetic sorting in Hungarian and Malayalam have been fixed. Many other fixes have been introduced in date and time elements, mostly in month names. Typos in the full names, the abbreviated names, or both have been fixed in, among others, Arabic (many variants), Belarusian, Breton, Friulian, Hindi, Kannada, Konkani, Malayalam, Marathi, Mongolian, Northern Sami, Serbian (Latin only), Spanish (Peru and Uruguay), Uzbek, Yoruba, and Zulu. In total, 55 languages have been updated to the content of CLDR version 31. Weekday names have been updated in Arabic, Chechen, and Kashmiri; Saudi Arabian users had them displayed in English so far. The translated yes and no strings have been added or fixed in many languages.

Incorrectly appended trailing spaces have been removed in several locales, usually from weekday names. The affected locales are mainly languages of India but also Albanian (where the issue was first spotted), Haitian, Maltese, and more. This change will polish date formatting in these locales.

Unicode 10.0

This version also introduces full support for Unicode 10.0. The changes are mainly focused on new emoji characters.

It’s worth mentioning that the full Unicode 10.0 support has been added to glibc only 2 days after its official release by the Unicode Consortium.