Gloss city names in the local language #592

Merged
merged 10 commits into from
Dec 5, 2022
Member

@1ec5 1ec5 commented Nov 29, 2022

If a city’s name in the local language differs from its name in the user-preferred language (#578), the label now includes a gloss on a second line: the local name, in smaller type and in parentheses. The names are compared case-insensitively, but without diacritic folding, according to the user-preferred language’s collation rules.
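As a rough illustration of the comparison described above, here is a minimal sketch in plain JavaScript using `Intl.Collator`. It is not the style's actual `text-field` expression. The function name `glossFor` and the `foldDiacritics` flag are hypothetical; the flag only models the diacritic folding that the review thread mentions for English.

```javascript
// Hypothetical helper, not taken from the style itself.
// Returns the local name to show as a gloss, or null when no gloss is needed.
function glossFor(preferredName, localName, locale, foldDiacritics = false) {
  if (!localName) return null;
  // "accent": case-insensitive but diacritic-sensitive.
  // "base": case-insensitive and diacritic-insensitive.
  const collator = new Intl.Collator(locale, {
    sensitivity: foldDiacritics ? "base" : "accent",
  });
  return collator.compare(preferredName, localName) === 0 ? null : localName;
}

// Names differ, so the local name becomes the gloss.
console.log(glossFor("Mexico City", "Ciudad de M\u00e9xico", "en"));
// Names differ only in case, so no gloss is produced.
console.log(glossFor("PARIS", "Paris", "fr"));
```

The collator is built from the preferred language's locale, so whatever that locale treats as equal at the chosen sensitivity suppresses the gloss.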

The choice of cities, as opposed to other kinds of places, is mainly based on prior art, as seen in #471 (comment). This approach minimizes clutter in countries that use a different writing system than the user does, important because we don’t currently have reliable access to a transliterated name when a translated name is available in a given language. We could tighten up the filter further, so that only cities above a certain rank get a gloss, but I think the reason for glossing a label would become less intuitive to users.

English

Mexico City Milan Osaka Athens

Spanish

New York City Monterey Hanoi Jakarta

Korean

San Diego Toronto New Delhi Moscow

Arabic

Lansing Paris Samarkand Dar es-Salaam

Esperanto

This map has reached Peak Internationalization:

Chicago Brussels Johannesburg Colombo

🤖

The gloss makes it easy to spot typos, vandalism, and labels needing import cleanup in Wikidata:

Córdobaaa Beckley

Under the hood, this PR replaces the metadata-based approach to localizing text-field in #581 with a top-level expression variable that can be replaced safely each time the language preference changes. This is a purer style specification–based solution that’s less likely to force a full style reload when the language preference changes. The code can easily be extended to vary arbitrary layout and paint properties and even filters based on any number of environment variables, not just the language preference. For example, if we label elevations in the future, we could dynamically update them to reflect changes in the preferred measurement system.
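A minimal sketch of the variable-based idea, assuming the style specification's `let` and `var` expressions. The helper names (`wrapWithVariables`, `updateVariable`, `nameKeys`) and the variable name `localizedName` are hypothetical and may not match the code in this PR.

```javascript
// Wrap an expression body in a top-level "let" that binds each variable.
function wrapWithVariables(variables, body) {
  // Object.entries yields [name, value] pairs; flat() interleaves them.
  const bindings = Object.entries(variables).flat();
  return ["let", ...bindings, body];
}

// Return a copy of a "let" expression with one variable's value replaced.
function updateVariable(letExpression, name, value) {
  if (!Array.isArray(letExpression) || letExpression[0] !== "let") {
    throw new TypeError("expected a top-level let expression");
  }
  const updated = [...letExpression];
  // Bindings are name/value pairs; the final element is the body.
  for (let i = 1; i < updated.length - 1; i += 2) {
    if (updated[i] === name) updated[i + 1] = value;
  }
  return updated;
}

// Build a coalesce expression over name:* keys for the preferred languages.
const nameKeys = (languages) => [
  "coalesce",
  ...languages.map((code) => ["get", `name:${code}`]),
  ["get", "name"],
];

const textField = wrapWithVariables(
  { localizedName: nameKeys(["en"]) },
  ["var", "localizedName"]
);
// When the language preference changes, only the binding is swapped out.
const spanish = updateVariable(textField, "localizedName", nameKeys(["es"]));
```

The updated expression could then be applied to a layer with something like `map.setLayoutProperty(layerId, "text-field", spanish)`, leaving the rest of the style untouched.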

Fixes #471.

@1ec5 1ec5 added enhancement New feature or request internationalization labels Nov 29, 2022
@1ec5 1ec5 self-assigned this Nov 29, 2022
Comment on lines -149 to -151
metadata: {
"americana:text-field-localized": true,
},
Member Author

@1ec5 1ec5 Nov 30, 2022

So far, I’ve limited the glosses to cities to avoid clutter at mid zoom levels, leaving other places labeled by the name in the preferred language only. This is already more generous than American print atlases, which limit the glosses to “world-class” cities with well-known English names. However, we could explore factoring out the giant expression below and reusing it on the place=town and place=village layers at much higher zoom levels. It might require some fine-tuning to maintain the desired label density. The Wikidata items on these smaller settlements are less likely to have gotten cleaned up since an import, so this would be a nice way to surface data in need of attention. I’m thinking we could save that for tail work once we’ve proven out glosses in the city layer.

Member Author

@1ec5 1ec5 left a comment

I think this PR is getting close in terms of the visual output. However, I still need to factor out some common parts of the massive text-field expression for readability. I may also look into adding unit tests, either in this PR or separately, because there are a lot of edge cases to juggle.

@ZeLonewolf
Member

I noticed this oddity in Italy:

image

@1ec5
Member Author

1ec5 commented Nov 30, 2022

I noticed this oddity in Italy:

Someone changed the English and Italian labels on Wikidata to include a curly apostrophe instead of a straight one. The OSM node has a straight apostrophe. One way to look at it is that the gloss is really good at catching things like this. 😅

Wikidata’s label guidelines are silent about which kind of punctuation to use in labels. In some subject areas, Wikidata has bots going around switching straight apostrophes to curly apostrophes, but otherwise, most labels are based on Wikipedia article titles. The English Wikipedia’s house style, which has been influential in other Wikipedia language editions, insists on straight quotation marks. However, some language editions like the Russian Wikipedia standardize on curly apostrophes instead: osmlab/name-suggestion-index#2969 (comment). So far, I haven’t come across any discrepancies in Russian-speaking countries when setting language=ru.

In OSM, mappers have disagreed on whether to use straight apostrophes,[1] forcing data consumers like Mapbox Streets to replace straight apostrophes with curly apostrophes in postprocessing. Unfortunately, any postprocessing is potentially unreliable without certainty about which language is stored in name. A while back, @claysmalley caused a stir by correcting the punctuation on Amtrak stations, apparently because people were concerned about breaking Nominatim and bespoke Overpass queries. I personally prefer curly apostrophes and en dashes, but you’ll only find them on unbranded POIs, because the name-suggestion-index has standardized on straight apostrophes and hyphens.

Unfortunately, GL JS doesn’t support the ignorePunctuation collator option in collator option objects, so there’s no way to account for these discrepancies on the client side.
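For comparison, here is a sketch of what such a punctuation-insensitive comparison looks like outside the style expression language, using the standard `Intl.Collator` `ignorePunctuation` option directly. The place name is a made-up example, not the one from the screenshot, and `sameName` is a hypothetical helper.

```javascript
// Hypothetical example name written with two different apostrophes.
const straight = "Sant'Angelo"; // U+0027 APOSTROPHE
const curly = "Sant\u2019Angelo"; // U+2019 RIGHT SINGLE QUOTATION MARK

// Case-insensitive, diacritic-sensitive, punctuation-sensitive.
const strict = new Intl.Collator("en", { sensitivity: "accent" });
// Same, but punctuation is skipped during comparison.
const lenient = new Intl.Collator("en", {
  sensitivity: "accent",
  ignorePunctuation: true,
});

// Hypothetical helper: true when the collator considers the names equal.
function sameName(a, b, collator) {
  return collator.compare(a, b) === 0;
}

console.log(sameName(straight, curly, strict));
console.log(sameName(straight, curly, lenient));
```

With the lenient collator the two spellings should compare as equal, so a gloss would be suppressed. Note that this skips all punctuation, not just apostrophes, which may conflate more names than intended.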

Footnotes

  1. On the bright side, in Hawaiʻi, we did replace GNIS- and TIGER-imported straight apostrophes with proper ʻokina and that has been uncontroversial so far.

@ZeLonewolf
Member

One way to look at it is that the gloss is really good at catching things like this.

Good explanation. I'm cool with that :)

Overall I'm pretty happy with how this looks!

@jleedev
Member

jleedev commented Dec 1, 2022

The English Wikipedia’s house style […] insists on straight quotation marks.

They also think en dashes in URLs are a cute idea, so I take them with a grain of salt.

@1ec5 1ec5 mentioned this pull request Dec 1, 2022
@1ec5
Member Author

1ec5 commented Dec 1, 2022

I may also look into adding unit tests, either in this PR or separately, because there are a lot of edge cases to juggle.

Once a test harness lands in #594, we can refactor the ginormous city name label text-field expression to be testable and add test cases for all the cities we examined above in code review.

@1ec5 1ec5 force-pushed the 1ec5-place-gloss-471 branch 2 times, most recently from 5d76fd3 to 090b407 Compare December 3, 2022 12:43
@1ec5 1ec5 marked this pull request as ready for review December 3, 2022 12:43
Member Author

@1ec5 1ec5 left a comment

This feature is now ready for review.

expectGloss("en", "Montreal", "Montréal", "Montréal", null);
expectGloss("en", "Quebec City", "Québec", "Québec City", null);
expectGloss("en", "Da Nang", "Đà Nẵng", "Đà Nẵng", null);
expectGloss("en", "Nūll Island", "Ñüłl Íşlåńđ", "Ñüłl Íşlåńđ", null);
Member Author

I’ve implemented a compact syntax for testing that, given a name in the preferred language and a name in the local language both in the properties of a feature, the massive expression will evaluate to a format expression with the expected primary label and gloss. In fact, just about any expression in this style can now be tested in this manner, independently of the style, by providing the necessary bits of data.

Collaborator

@zekefarwell zekefarwell left a comment

This looks good to me and ready to merge.

Here are some comments for future improvement, but as discussed at length on this PR, we don't have the necessary data in the tiles to do anything about them right now.

It's a bit cluttered having a gloss on every city name in areas where the local name uses a different writing system than the user's language. Hopefully we can find a way to use transliterated names in the future.
Screen Shot 2022-12-04 at 10 00 23 PM

The diacritic folding when the language is set to English cuts down on redundancy very well. It would be nice if we could do the same for other languages, but I understand it's difficult since diacritics are much more important in languages other than English. I could be wrong, but I imagine a Spanish speaker would find this kind of thing somewhat silly looking:
Screen Shot 2022-12-04 at 10 03 03 PM
Seems like this is the best we can do with the available data, but maybe a better solution will be possible with changes to the tiles.

St. vs Saint is an issue. I hope this can be resolved on the data side by standardizing on one form. It sure would feel dumb to have to add a special case to treat them as equivalent for gloss purposes.
Screen Shot 2022-12-04 at 10 15 53 PM Screen Shot 2022-12-04 at 10 15 45 PM Screen Shot 2022-12-04 at 10 15 36 PM

@1ec5
Member Author

1ec5 commented Dec 5, 2022

The diacritic folding when the language is set to English cuts down on redundancy very well. It would be nice if we could do the same for other languages, but I understand it's difficult since diacritics are much more important in languages other than English. I could be wrong, but I imagine a Spanish speaker would find this kind of thing somewhat silly looking:

This is specifically an issue when a Spanish speaker looks at an English-speaking country, as opposed to say a French-speaking country as in #592 (comment). Similarly, there was a request in OSMUS Slack to enable diacritic folding in Polish too, but it turned out to be counterproductive: Latvian diacritics overriding Polish diacritics on one city label (Jurmała → Jūrmala) right next to another label where the two names remained separate due to a base letter difference (Ryga + Rīga, Jełgawa + Jelgava).

What makes your Spanish example particularly ironic is that the English names are derived from Spanish to begin with, but we don’t have etymological data at our disposal. Perhaps we could conflate more aggressively if we had an expression operator that could search for a character class or regular expression. Then we could check if the local name matches name:en and is composed of only ASCII characters.
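The check proposed above cannot be written as a style expression today, but its logic can be sketched in plain JavaScript. The function names are hypothetical, and "Los Ángeles" is only an illustrative input rather than a case taken from the screenshots.

```javascript
// True when the local name is the English name and contains only ASCII.
function isAsciiEnglishName(localName, englishName) {
  return localName === englishName && /^[\x00-\x7F]+$/.test(localName);
}

// Hypothetical gloss rule: fold diacritics only when the local name is
// plain ASCII English; otherwise keep the diacritic-sensitive comparison.
function glossWithAsciiRule(preferredName, localName, englishName, locale) {
  const sensitivity = isAsciiEnglishName(localName, englishName)
    ? "base" // ignore case and diacritics
    : "accent"; // ignore case only
  const collator = new Intl.Collator(locale, { sensitivity });
  return collator.compare(preferredName, localName) === 0 ? null : localName;
}

// A Spanish speaker looking at an English-speaking country: no gloss.
console.log(glossWithAsciiRule("Los \u00c1ngeles", "Los Angeles", "Los Angeles", "es"));
// A Spanish speaker looking at a French-speaking city: gloss is kept.
console.log(glossWithAsciiRule("Montreal", "Montr\u00e9al", "Montreal", "es"));
```

The ASCII test keeps folding away from names whose diacritics are meaningful in the local language, which is the failure mode described for Polish above.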

St. vs Saint is an issue. I hope this can be resolved on the data side by standardizing on one form. It sure would feel dumb to have to add a special case to treat them as equivalent for gloss purposes.

This is a consequence of OpenStreetMap’s policy of expanding abbreviations in names,[1] even when a word would typically be abbreviated in prose. These particular cases will be fixed by onthegomap/planetiler#403. However, there may be features in OSM that inconsistently expand the word in name but abbreviate it in name:en.

Footnotes

  1. American English considers “St.” to be an abbreviation, whereas evidently British English does not. However, “St.” can be a non-abbreviation in American English too: the family name St. Clair comes from Sinclair, so it’s incorrect to spell it out as “Saint Clair”.

Replaced the metadata-based approach to localizing text-field with a top-level expression variable that can be replaced safely each time the language preference changes. This approach can easily be extended to support additional variables that depend on the environment.
When the name in the preferred language matches the name in the local language except for the addition of a prefix or suffix, which is likely to be an insignificant word like “City”, splice the local-language name into the preferred-language name to reduce redundancy.
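The splicing rule can be sketched in plain JavaScript as follows. This is an approximation under stated assumptions, not the style expression from this PR: `spliceLocalName` is a hypothetical name, it always folds diacritics (which the review thread suggests is only appropriate for English), and it treats a space as the only word boundary.

```javascript
// Hypothetical helper: returns the primary label and an optional gloss.
function spliceLocalName(preferredName, localName, locale) {
  const preferred = preferredName.normalize("NFC");
  const local = (localName || "").normalize("NFC");
  if (!local) return { label: preferred, gloss: null };

  // "base" sensitivity ignores both case and diacritics.
  const collator = new Intl.Collator(locale, { sensitivity: "base" });
  const same = (a, b) => collator.compare(a, b) === 0;

  // Identical apart from case and diacritics: show the local spelling.
  if (same(preferred, local)) return { label: local, gloss: null };

  if (preferred.length > local.length) {
    // Preferred name is the local name plus a trailing word.
    const head = preferred.slice(0, local.length);
    const suffix = preferred.slice(local.length);
    if (suffix.startsWith(" ") && same(head, local)) {
      return { label: local + suffix, gloss: null };
    }
    // Preferred name is a leading word plus the local name.
    const prefix = preferred.slice(0, preferred.length - local.length);
    const tail = preferred.slice(preferred.length - local.length);
    if (prefix.endsWith(" ") && same(tail, local)) {
      return { label: prefix + local, gloss: null };
    }
  }
  // Otherwise keep both names: primary label plus gloss.
  return { label: preferred, gloss: local };
}

console.log(spliceLocalName("Quebec City", "Qu\u00e9bec", "en"));
console.log(spliceLocalName("Milan", "Milano", "en"));
```

Slicing by string length assumes both names have the same number of code units after NFC normalization, which holds for precomposed characters but is not guaranteed in general.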
Development

Successfully merging this pull request may close these issues.

Gloss names of places in the local language
6 participants