glibc 2.30: New and Updated Locales

glibc 2.30 was released on schedule on August 1, 2019. The changes to the locale data are modest; most of them, though not all, concern calendar support. Let's look at the details, and also at some changes from version 2.29.

Cyrillic transliteration

Transliteration from Cyrillic to Latin, following the GOST 7.79-2000 System B standard, has been added to the built-in C pseudo-locale (the one used when the user selects an invalid locale or does not select any). The output text uses only ASCII characters. Here is how it can be used:

$ echo "Спутник" | LC_CTYPE=C iconv -f UTF-8 -t ASCII//TRANSLIT
Sputnik

A similar feature already existed in other locales, e.g. Serbian and Ukrainian, but the implementation has always been imperfect and incomplete. The newly added feature only works if the current locale is set to C. The patch was contributed by Egor Kobylkin. Thank you!

Why this transliteration standard, and why has it not been implemented for other languages? It would be good to add at least support for the ISO 9 standard in Russian. We are working on this, but we have encountered many difficulties. The main problem is that the transliteration of some letters may depend on the context (i.e. on the surrounding letters), while the entire transliteration algorithm in glibc is context-free: it transliterates each letter separately, without taking its neighbors into account.
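To make the limitation concrete, here is a minimal Python sketch of a context-free transliterator. It is not glibc's code: the function name `translit`, the `"?"` fallback, and the deliberately partial table (only the letters of the example word) are assumptions made for illustration. The point is that each character is looked up on its own, so a rule of the form "use a different output before certain letters" cannot be expressed.

```python
# Illustrative, partial table: only the letters needed for the example word.
# A real table would cover the whole alphabet.
TABLE = {
    "С": "S", "п": "p", "у": "u", "т": "t",
    "н": "n", "и": "i", "к": "k",
}

def translit(text, table=TABLE, fallback="?"):
    """Map every character independently; neighbors are never inspected."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)                       # ASCII passes through unchanged
        else:
            out.append(table.get(ch, fallback))  # one lookup per character
    return "".join(out)

print(translit("Спутник"))  # Sputnik
```

A context-dependent scheme would need the loop to look at the previous or next character as well, which is exactly what a per-character table lookup does not provide.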

Support for Unicode 12.1

When new characters are added to the Unicode standard, glibc has to be updated so that they are correctly classified as letters, digits, etc., as well as sorted and sometimes transliterated. Version 12.0 of Unicode, published on March 5, 2019, added the entire Nyiakeng Puachue Hmong script, the entire Wancho script (developed in 2001–2012), several dozen additional characters for the Tamil script, several letters of the Lao script, several new characters for the Pollard script (Miao), Latin vowels with a glottal stop, the ancient Elymaic script, the ancient Nandinagari script, several ancient Egyptian hieroglyphs, and Ottoman Siyaq numbers. Dozens of new emoji and symbols have also been added, including chess symbols. The update was handled by Mike Fabian; thanks to him, glibc began to support the new standard 3 days after its publication.

But the title of this section says 12.1, so why not 12.0? Version 12.1 of Unicode was published on May 7, 2019, and it contains only one change from 12.0: a single character was added, the sign for the new Reiwa era in the Japanese calendar. A separate article has been written about this topic. This is an important change for Japanese users, which is why it was published so quickly. Again, it was Mike Fabian who made the changes in glibc.
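For readers who want to inspect the new sign, the following Python sketch looks it up in Python's own bundled Unicode database. This is separate from glibc's locale data and only shows what the character is. The helper name `describe` is hypothetical, and the Reiwa sign (U+32FF) is only known to interpreters whose database is at version 12.1 or later, so the sketch falls back to `None` for the name on older ones.

```python
import unicodedata

def describe(ch):
    """Return (character name or None, NFKC compatibility form)."""
    name = unicodedata.name(ch, None)  # None if this Python does not know ch
    return name, unicodedata.normalize("NFKC", ch)

print("Unicode database version:", unicodedata.unidata_version)
print(describe("\u337b"))  # the older Heisei era sign, for comparison
print(describe("\u32ff"))  # the Reiwa era sign added in Unicode 12.1
```

With a recent enough database, the compatibility form of each sign is the two-character era name it abbreviates.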

Japanese and Chinese calendars…

Of course, the new era in the Japanese calendar has been supported since the beginning of April 2019. However, 2.30 is the first version of glibc that supports this change from the first day of the release. Support for the new Japanese era has also been backported to older versions of glibc, but there it comes as a patch: distributions must download the patch, apply it, and publish the update themselves.

While on the subject, it is worth mentioning that version 2.29 improved the display of dates in the Japanese and other traditional calendars: the year number generated by the "%Ey" format now has two digits by default. If you want years 1 to 9 to be printed without a leading zero, you must explicitly add the '-' flag to the format. In addition, flags such as '_' and '-' can be added to the "%EY" format, so you can decide whether the number in the full year name (which also includes the era name) is preceded by a zero, a space, or nothing.

The display of the first year of the Taisho era (which began on July 30, 1912 in the Gregorian calendar) was corrected: previously it had been displayed as year 2.
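The era-year arithmetic and the padding choices described above can be sketched in a few lines of Python. This emulates the described behavior and is not glibc's implementation: the names `ERAS`, `era_year` and `pad_era_year` are hypothetical, the era names are romanized, and the table only covers the eras from Taisho onward.

```python
from datetime import date

# First day of each era in the Gregorian calendar, newest first.
ERAS = [
    (date(2019, 5, 1), "Reiwa"),
    (date(1989, 1, 8), "Heisei"),
    (date(1926, 12, 25), "Showa"),
    (date(1912, 7, 30), "Taisho"),
]

def era_year(d):
    """Return (era name, year within the era); the first year is 1."""
    for start, name in ERAS:
        if d >= start:
            return name, d.year - start.year + 1
    raise ValueError("date precedes the eras in this table")

def pad_era_year(year, flag=""):
    """Emulate the padding choices: default zero, '_' space, '-' nothing."""
    if flag == "-":
        return str(year)
    if flag == "_":
        return "%2d" % year
    return "%02d" % year

name, year = era_year(date(1912, 7, 30))
print(name, pad_era_year(year), pad_era_year(year, "-"))  # Taisho 01 1
```

Note that the first day of an era already belongs to year 1, which is the case the Taisho correction addresses.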

Not many locales in glibc support traditional calendars. As of version 2.30, the locales used in Taiwan support the traditional Chinese calendar (Minguo).

…and other changes

The spelling of the names of months and days of the week in the Tatar, Afar, and Silesian locales was corrected. In version 2.29, similar corrections were made to the Greenlandic locale (weekday names, month names, date formats).

Speaking of the Silesian language, dates are now formatted according to its grammar rules (the month name in the genitive case, as in most other Slavic languages). However, it turned out that CLDR does not support the Silesian language, so a request was filed to add it. This task was taken up by the Silesian linguist Grzegorz Kulik. Thanks to this, we can expect that the Silesian language will soon be properly supported by other operating systems as well.

The first day of the week setting in the Irish locales (English and Gaelic) was corrected. From now on, the first day of the week is Monday.

In the previous version, 2.29, the default date and time display formats were improved for 80 different languages and variants. Most often this meant choosing the right 12-hour or 24-hour clock. According to CLDR, a 12-hour clock is used in most countries of northern Africa, India and Hong Kong, while Morocco, Malta, Kenya, and Sri Lanka now use a 24-hour clock.

Interestingly, many locales do not define the date_fmt field, which the date command uses by default to display the current time, and as a result they fall back to the default format from the built-in C pseudo-locale. A few locales (including US English) have the correct format; the others are still waiting.

Rounding errors in locale data

Finally, it is worth mentioning one curiosity. When generating binary locale archives, it turned out that the archives generated for different but compatible architectures (e.g. i686 and x86_64) differ, although it would be better if they did not, because the binary files would then be interchangeable between them. The reason was the use of multiplication by 1.4 in a hashing algorithm. This number, although it looks quite inconspicuous, is an infinite repeating fraction in binary, so it cannot be stored exactly, and different processor architectures round the result differently. The bug was fixed upstream, and the fix was applied to Fedora, by DJ Delorie. Thank you!
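The effect is easy to reproduce in miniature. The Python sketch below is not the glibc code: the three function names are hypothetical, and exact rational arithmetic is used here as a stand-in for hardware that keeps more intermediate precision than a 64-bit double (an assumption about how the two architectures came to disagree). It shows the same product truncating to two different integers, and an integer-only formulation that cannot diverge.

```python
import math
from fractions import Fraction

def size_double(n):
    """Multiply in 64-bit double precision, then truncate."""
    return int(n * 1.4)

def size_wide(n):
    """Multiply the stored double nearest to 1.4 exactly, then truncate.
    Stands in for arithmetic with extra intermediate precision."""
    return math.floor(n * Fraction(1.4))

def size_integer(n):
    """Integer-only equivalent of multiplying by 1.4."""
    return n * 14 // 10

print(float.hex(1.4))  # 0x1.6666666666666p+0, slightly below 1.4
print(size_double(5), size_wide(5), size_integer(5))  # 7 6 7
```

In double precision the product 5 * 1.4 happens to round up to exactly 7.0, while the unrounded product is just below 7 and truncates to 6. Integer arithmetic gives the same answer everywhere.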