
How Polish Plurals in MATE Went Broken

On March 13, 2017, version 1.18 of MATE Desktop was released. One of the last-minute changes in the project was pulling the most recent translations from Transifex. Usually this is a good thing, but for the Polish language it turned out to be a small disaster because the plural rules were changed, and changed incorrectly.

Plural rules

Foreign readers deserve an explanation here. Polish plural rules (like those of several other Slavic languages) are a little more complex than the English ones. Three forms are required:

  • 1 – singular – that’s obvious and similar to English and other Indo-European languages.
  • 2, 3, 4, and anything ending with 2, 3, 4 except 12, 13, 14 (for example: 22, 23, 24, 32, 33, 34 and so on). Some internationalization toolkits refer to this group as few.
  • everything else (0, and 5 and greater except the numbers mentioned above). This group is sometimes referred to as many.

Plural support in the gettext package is good and complete. All we need to do is write the correct rules in the header of a *.po file. This has to be done only once, and the rules can be reused for every translation into the same language: grammar rules change so rarely that we can safely assume they never change. For Polish translations we usually use this formula:

"Plural-Forms: nplurals=3; plural=(n==1 ? 0 : n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2);\n"

This expression is neither simple nor complex; it is just sufficient to describe what the language needs.
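To see how the formula sorts numbers into the three groups, here is a minimal Python sketch that transcribes it by hand. The function name `polish_plural_index` and the sample word forms are only illustrative; they are not part of gettext.

```python
def polish_plural_index(n: int) -> int:
    """Hand transcription of the usual Polish Plural-Forms expression.

    Returns 0 (singular), 1 ("few") or 2 ("many") for a non-negative integer.
    """
    if n == 1:
        return 0
    if 2 <= n % 10 <= 4 and (n % 100 < 10 or n % 100 >= 20):
        return 1
    return 2


# Illustrative forms of the word "plik" (file), indexed like msgstr[0..2].
FORMS = ["plik", "pliki", "plików"]

for n in (0, 1, 2, 5, 12, 14, 22, 25, 112):
    print(n, FORMS[polish_plural_index(n)])
```

The index returned by the function plays the same role as the index of msgstr[] in a *.po entry with plural forms.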

Here comes the disaster

On March 13 the commit synchronizing translations from Transifex changed the plural rules for the Polish language. The new formula is:

"Plural-Forms: nplurals=4; plural=(n==1 ? 0 : (n%10>=2 && n%10<=4) && (n%100<12 || n%100>=14) ? 1 : n!=1 && (n%10>=0 && n%10<=1) || (n%10>=5 && n%10<=9) || (n%100>=12 && n%100<=14) ? 2 : 3);\n"

Now this is complex, isn't it? Here is what's wrong with this expression:

  • it states that the Polish language needs 4 forms to support plurals, which is not true;
  • it is unnecessarily complex: if the expression states that n==1 belongs to the group 0 there is no need to make sure that n!=1 in the further part;
  • the complexity leads to one actual bug: the second group includes all numbers which end with 2, 3, 4 (correct), except 12 and 13 (incorrect, 14 must be excluded as well);
  • the result 3 is unreachable which is correct but confusing for translators.
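The points above can be checked mechanically. The following Python sketch transcribes both expressions by hand (the function names `usual_rule` and `transifex_rule` are made up for this example) and compares them over a range of integers. It is a quick check under the assumption that the transcription is faithful, not a proof.

```python
def usual_rule(n: int) -> int:
    """The three-form expression traditionally used for Polish."""
    if n == 1:
        return 0
    if 2 <= n % 10 <= 4 and (n % 100 < 10 or n % 100 >= 20):
        return 1
    return 2


def transifex_rule(n: int) -> int:
    """Hand transcription of the four-form expression quoted above."""
    if n == 1:
        return 0
    if (2 <= n % 10 <= 4) and (n % 100 < 12 or n % 100 >= 14):
        return 1
    # In C, && binds tighter than ||, so n != 1 guards only the first term.
    if (n != 1 and 0 <= n % 10 <= 1) or (5 <= n % 10 <= 9) or (12 <= n % 100 <= 14):
        return 2
    return 3


NUMBERS = range(0, 1000)

# Numbers for which the two expressions pick a different form.
mismatches = [n for n in NUMBERS if usual_rule(n) != transifex_rule(n)]

# Set of indices the four-form expression actually produces.
reached = sorted({transifex_rule(n) for n in NUMBERS})

print("differs for:", mismatches)
print("indices reached:", reached)
```

Within this range the two expressions disagree only for numbers ending in 14, and the index 3 never appears.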

As MATE Desktop is a large project consisting of multiple applications (like the Caja file manager, the Pluma text editor, etc.), the same happened to every single application of the project.

Difficult to fix

The bug was reported upstream immediately. The MATE project maintainers responded that the bug came from Transifex, so fixing it in the MATE source code repository would be pointless: the next pull would overwrite the fix.

Unfortunately, it is not so easy to file a ticket with Transifex: it has neither Bugzilla nor any other ticket system. However, some people managed to contact the Transifex team. They responded that they had pulled the plural rules from CLDR, which lists 4 plural forms for the Polish language. They did admit that assigning the number 14 to the few plural group was their fault, and they fixed it. As the MATE project continues pulling translations from Transifex, more and more of its applications will start handling the number 14 correctly. Some of the applications have been updated recently; the update is part of the 1.19 development release.

What CLDR says

Let’s look at what the CLDR database says about the Polish plural rules. Indeed, it lists 4 groups, and there is a mysterious v parameter that has something to do with fractions, because the samples display fractional values. But since gettext supports integer values only, we should drop the fractional cases entirely.

The documentation of the v parameter is difficult to find, but once you find it you can read that v means the number of visible fraction digits in n, with trailing zeros. In this sentence, n is the number controlling the plural form itself.
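A small Python sketch may make the role of v clearer. It is a hand transcription of the CLDR rules for Polish as I read them, with the integer branches collapsed where they overlap. The function name `cldr_polish_category` is invented for this example, and the input is assumed to be a plain non-negative decimal string.

```python
def cldr_polish_category(number: str) -> str:
    """Plural category for a decimal string, following the CLDR Polish rules.

    i is the integer part and v the count of visible fraction digits,
    trailing zeros included, so "1.0" has v == 1 while "1" has v == 0.
    Assumes plain non-negative input such as "5" or "1.50".
    """
    integer_part, _, fraction = number.partition(".")
    i = int(integer_part)
    v = len(fraction)
    if v != 0:
        return "other"  # every value with visible fraction digits
    if i == 1:
        return "one"
    if 2 <= i % 10 <= 4 and not (12 <= i % 100 <= 14):
        return "few"
    return "many"


for sample in ("1", "1.0", "2", "2.5", "5", "14", "22"):
    print(sample, cldr_polish_category(sample))
```

With v fixed at 0, which is the only case gettext can express, the fourth category is never selected, and the remaining three match the traditional three-form rule.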

Other languages

CLDR provides additional forms for fractions for other languages as well: Czech, Manx, Russian, Slovak, Ukrainian. For some other languages (Bosnian, Croatian, Filipino, Macedonian, Serbian, Lower and Upper Sorbian) the rules seem to be even more complex: fractional values belong to multiple integer groups.

This should be a warning for other languages that their rules might have been broken in Transifex as well. However, further investigation of the MATE Desktop source code does not reveal any recent changes in the plural rules of other languages.

Conclusions

It seems that pulling plural rules from CLDR automatically is not a good idea.

Translators and language coordinators: please make sure that your plural rules are correct.

Transifex and other translation platforms: please don’t pull the plural rules from CLDR without a thorough analysis. Better to ask the language communities and reuse the existing rules.

CLDR: please simplify your plural expressions and make the documentation of fractions support easier to access.

glibc 2.26: New and Updated Locales

On August 2, 2017, version 2.26 of glibc (the GNU C Library) was released. Among other changes, many issues related to supported locales were addressed, most of them shortly before the release. Let’s see what has changed.

New locales

Compared to the previous version, this release introduces support for 6 new languages: Aguaruna, Bislama, Fiji Hindi, Samoan, Tok Pisin, and Tongan, as well as 2 new variants: South Azerbaijani for Iran, and Maithili for Nepal.

Aguaruna is a language spoken by about 38,000–45,000 indigenous people in Peru. Bislama is an official language of Vanuatu, although it is spoken by only about 10,000 people. Fiji Hindi descends from Hindi but differs from it. It is spoken by about 300,000 citizens of Fiji, which is about ⅓ of the total population, and it is one of the official languages of the country. It is written in both the Latin and the Devanagari scripts. This release introduces the Latin script only, but adding Devanagari is being considered for the future. Tok Pisin is one of the official languages of Papua New Guinea. Although it has only 120,000 native speakers, which is 1.7% of the total population, it is the most widely used language of the country. No wonder, since Papua New Guinea has about 850 native languages.

South Azerbaijani is a variant of the Azerbaijani language spoken by about 13 million people (16% of the total population) in Iran, and Maithili is spoken by about 3 million people (11.5% of the total population) in Nepal. Both were previously represented only by their variants for Azerbaijan and India, respectively. Now their users can enjoy more granularity.

Updates

Bugs in alphabetic sorting in Hungarian and Malayalam have been fixed. Many other fixes concern date and time elements, mostly month names. Typos in full names, abbreviated names, or both have been fixed in, among others, Arabic (many variants), Belarusian, Breton, Friulian, Hindi, Kannada, Konkani, Malayalam, Marathi, Mongolian, Northern Sami, Serbian (Latin only), Spanish (Peru and Uruguay), Uzbek, Yoruba, and Zulu. In total, 55 languages have been updated to the content of CLDR version 31. Weekday names have been updated in Arabic, Chechen, and Kashmiri; until now, Saudi Arabian users had them displayed in English. Translations of yes and no have been added or fixed in many languages.

Incorrectly appended trailing spaces have been removed in several locales, usually from weekday names. The affected locales are mainly languages of India but also Albanian (where the issue was first spotted), Haitian, Maltese, and more. This change will polish date formatting in these locales.

Unicode 10.0

This version also introduces full support for Unicode 10.0. The changes mainly concern new emoji characters.

It’s worth mentioning that full Unicode 10.0 support was added to glibc only 2 days after the standard’s official release by the Unicode Consortium.