glibc 2.28: New and Updated Locales

Alphabetic Collation

Shortly after the previous release, the work on polishing alphabetic sorting was completed. This affects all languages, because the collation algorithm must handle not only the characters of the current locale but also foreign characters. The collation rules of the ISO 14651:2016 standard have been fully imported. The standard itself continues to evolve, although the changes matter little from the end users’ point of view: they concern, for example, punctuation marks or rare letters used in little-known languages.

Within glibc itself, automatic collation tests for over 50 languages have been added. In the future they will detect any upstream errors immediately.

Regular Expressions

The corrected sort order has had some unwelcome consequences. It turns out that range expressions in regular expressions take the collation rules of the current locale into account. As a result:

The [a-z] expression matches not only lowercase letters but also uppercase ones, because they are interleaved with the lowercase letters in the collation order (e.g., a, A, b, B, …). It does not match Z, however, because Z collates after z.
Source code manipulation systems that assumed all source file names start with a lowercase letter, that is, that the regular expression [a-z]* matches them all except Makefile, have stopped working correctly.
The [0-9] expression now matches not just the digits from 0 to 9 but also all symbols which can be interpreted as numbers: for example, fractions, superscript and subscript digits, and digits from other numeral systems (Eastern Arabic, Indian, etc.).

There are many good reasons why ranges in regular expressions should be based on Unicode codepoint order rather than locale dependent collation order.
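The difference between the two orderings can be sketched with a toy model. The interleaved order below (a, A, b, B, …, z, Z) is a simplified assumption taken from the description above, not the real locale data, and the helper names are hypothetical. The code only models how a range test behaves under each ordering and does not call glibc’s regex engine.

```python
import string

# Toy collation order, assumed for illustration only: lowercase and uppercase
# ASCII letters interleaved as a, A, b, B, ..., z, Z.
INTERLEAVED = [
    ch
    for pair in zip(string.ascii_lowercase, string.ascii_uppercase)
    for ch in pair
]
RANK = {ch: i for i, ch in enumerate(INTERLEAVED)}


def in_range_by_collation(ch, lo, hi):
    """True if ch sorts between lo and hi in the toy interleaved order."""
    return ch in RANK and RANK[lo] <= RANK[ch] <= RANK[hi]


def in_range_by_codepoint(ch, lo, hi):
    """True if ch lies between lo and hi by Unicode code point."""
    return ord(lo) <= ord(ch) <= ord(hi)


# Does the first character of each file name fall inside the range a-z?
for name in ("main.c", "Makefile", "Zlib.h"):
    first = name[0]
    print(
        name,
        in_range_by_collation(first, "a", "z"),
        in_range_by_codepoint(first, "a", "z"),
    )
# main.c True True
# Makefile True False
# Zlib.h False False
```

Under the interleaved order, M falls inside a-z while Z does not, which is exactly the surprising behavior described in the list above. Under code point order the range contains the lowercase letters only.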

Since the error was spotted late in the development cycle, at the beginning of July, a quick workaround was introduced which de-interleaves the collation order of the lowercase and uppercase letters of the Latin alphabet. This workaround is temporary, however, and will be reverted as soon as a correct implementation of regular expressions is available.

Unicode 11

Full support for the Unicode 11 standard has been introduced. This means that the rules assigning the new characters and scripts to their proper categories (letters, digits, punctuation marks, etc.) and the new transliteration rules have been added. New emoji characters have been added as well. Of course, these changes mostly apply to scripts other than those commonly used in Europe. For example, individual characters have been added to Armenian, Hebrew, Arabic, and some Indian scripts. Larger additions include the whole Mtavruli block in the Georgian script; the Hanifi Rohingya, Sogdian, Old Sogdian, Dogri, Gondi, and Makasar scripts; Maya and Siyaq numerals; and others. Many of these characters and scripts are of historic interest only.

New Locales

This time only two new locales have been added: Lower Sorbian and Yakut. Lower Sorbian is a Slavic language, closely related to Polish, used in Lower Lusatia, a part of Germany near Cottbus (Lower Sorbian: Chóśebuz). Sadly, this language is severely endangered: it is used by only 6–7 thousand people. The Yakut language (also known as Sakha) belongs to the Turkic family and is used by approximately 450 thousand people in the Sakha Republic (Yakutia), which is part of the Russian Federation. They make up nearly half of the population of the region.

It’s worth mentioning that both of these languages are inflected and require the genitive case of a month name when formatting a date.

Correct Date Formats in Inflected Languages

On that subject, 2.28 is the second glibc release, after 2.27, to support two grammatical forms of month names. The earlier work can be considered a success, and the subsequent changes merely add support for languages that were left out of the previous release for lack of time.

Two grammatical forms (usually nominative and genitive) of month names are now also supported in the following languages: Armenian, Asturian, Catalan, Czech, Kashubian, Occitan, Ossetian, Scottish Gaelic, Upper Sorbian, and Walloon. Together with the two newly added locales, this brings the total to 19 languages using grammatically correct forms in dates.
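As a rough way to see both forms from a script, the sketch below formats a month name twice through Python’s time.strftime, which on Linux hands the format string to the C library. It relies on my understanding of glibc 2.27 and later, which is not stated in this article: that %B yields the form used inside a full date (typically genitive) and %OB the standalone form (typically nominative). The helper name month_forms is hypothetical, and the result depends on which locales are installed.

```python
import locale
import time


def month_forms(locale_name, month=5):
    """Return (full_date_form, standalone_form) of a month name.

    Returns None when the requested locale is not installed.
    Assumption: the C library understands %OB (glibc 2.27 or later does).
    """
    saved = locale.setlocale(locale.LC_TIME)  # remember the current setting
    try:
        locale.setlocale(locale.LC_TIME, locale_name)
    except locale.Error:
        return None
    try:
        # Only the month field matters here; the other fields are placeholders.
        t = (2018, month, 1, 0, 0, 0, 0, 1, -1)
        return time.strftime("%B", t), time.strftime("%OB", t)
    finally:
        locale.setlocale(locale.LC_TIME, saved)  # restore the previous locale


# cs_CZ.UTF-8 is just an example name; it may not exist on a given system.
print(month_forms("cs_CZ.UTF-8"))
```

With an older C library, or with a locale that defines a single form, both values would be expected to be identical. In the C locale both are simply the English month name.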

It turned out that the difference between the nominative and genitive case is visible in abbreviated month names not just in Russian and Belarusian, whose word for May is too short to be abbreviated (nominative: май, pronounced may; genitive: мая, pronounced maya), but also in Greek, in multiple month names (e.g., July, nominative: Ιούλιος, genitive: Ιουλίου, abbreviated forms: Ιούλ and Ιουλ, respectively).

In Kashubian, the difference between the nominative and genitive case of the month of May also turned out to be visible in the abbreviated form (nominative: môj, genitive: maja, abbreviated: môj and maj, respectively), and the Catalan translators asked for the prefixes de and d’ to be added to the abbreviated forms as well, in line with CLDR. As a reminder, a request to add support for two grammatical cases of abbreviated month names to the POSIX standard was filed more than a year ago.

Minor Changes

The names of weekdays and months in Aragonese have been corrected. Abbreviated month names in Lithuanian have been corrected to match the current implementation in the GLib library (part of the GNOME project) and CLDR, which, incidentally, soon caused the automatic GLib tests to fail with older versions of glibc. Minor typos have been fixed in Kashubian and Scottish Gaelic.