Street names are automatically title-cased in locales where they shouldn't be #4784

Closed
balsoft opened this issue Jan 31, 2023 · 35 comments · Fixed by #4785 or #5105

Comments

@balsoft
Contributor

balsoft commented Jan 31, 2023

Because of this line, whenever one types in a name for a street or a place, the first letter of each word is always capitalized. While this is great for the many languages where each word in a toponym starts with an uppercase letter, it is a problem in languages where that's not the case. For example, in the Georgian language, capital letters are included in the Unicode standard but are not used for title casing. This makes it really annoying to input street names, as one has to edit the words afterwards to remove the capitalization.

I'm not sure what the best solution for this would be. Perhaps add a list of "exception" languages which are not to be titlecased, or even a simple toggle to disable the feature.
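To make the two suggested options concrete, here is a minimal sketch of an exception list combined with a user toggle. The names `noAutoTitleCaseLanguages`, `shouldAutoTitleCase` and `userEnabled` are hypothetical and not taken from the StreetComplete code base, and the list holds only Georgian because that is the only language established so far; a real list would need research.

```kotlin
import java.util.Locale

// Hypothetical exception list: languages whose names should not be title-cased
// automatically. Only Georgian ("ka") is known at this point in the discussion.
val noAutoTitleCaseLanguages = setOf("ka")

// Hypothetical check: auto title-casing applies only if the user has not
// disabled it and the input language is not on the exception list.
fun shouldAutoTitleCase(locale: Locale, userEnabled: Boolean = true): Boolean =
    userEnabled && locale.language !in noAutoTitleCaseLanguages

fun main() {
    println(shouldAutoTitleCase(Locale.forLanguageTag("ka")))     // false
    println(shouldAutoTitleCase(Locale.forLanguageTag("en-US")))  // true
    println(shouldAutoTitleCase(Locale.forLanguageTag("en-US"), userEnabled = false)) // false
}
```

The obvious weakness of this design is that the list has to be maintained by hand for every language.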

How to Reproduce

Start solving a "Street Name" quest somewhere in Georgia and type in a street name in Georgian. The first letters are automatically capitalized, even though they shouldn't be.

Here's what it looks like:

23-01-31-22-32-25.mp4

Expected Behavior

No title-casing occurs.

Versions affected

v50.2 (latest from F-Droid)

@westnordost
Member

So this first needs some research in which languages (or countries?) a title case is not used. Would you care to do this research?

@westnordost
Member

By the way, you can tap on the street it belongs to instead of writing it each time manually.

@balsoft
Contributor Author

balsoft commented Jan 31, 2023

By the way, you can tap on the street it belongs to instead of writing it each time manually.

Yep, I'm aware of that. However, this also happens for place names (which are quite common here, in the form of microdistricts), and I can't use that feature with place names.

@balsoft
Contributor Author

balsoft commented Jan 31, 2023

So this first needs some research in which languages (or countries?) a title case is not used. Would you care to do this research?

I think a good first approximation would be to look at endonyms of countries, to see which ones are and aren't using title-casing: https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_and_their_capitals_in_native_languages

@matkoniecz
Member

I found https://bugs.ruby-lang.org/issues/14839 which is quite related

MTAVRULI may not be used in the same contexts as Upper Case in other scripts. One very clear case is that MTAVRULI is only used for ALL CAPS. But this is covered by Unicode data, which means that .capitalize will be a no-op. The main area I can see where this can create problems is "Convention over Configuration" situations where all of lowercase, Uppercase, and ALLCAPS are used. If only lowercase and Uppercase are used, Georgian can be treated as an unicameral (only one case) script, similar to e.g. Hiragana. If only lowercase and ALLCAPS are used, then Georgian can be treated as a bicameral (two cases) script.

The current implementation of String.capitalize is not just a problem in Georgian, but also in other languages like Dutch. Words beginning with „ij“ must be titlecased with a leading „IJ“, e. g. „IJsbeer“ (polar bear).

hunting down that "Unicode data" part of Unicode may be helpful

@balsoft
Contributor Author

balsoft commented Jan 31, 2023

Also, this might be somewhat useful: https://en.wikipedia.org/wiki/Capitalization

@matkoniecz
Member

matkoniecz commented Jan 31, 2023

Given the situation with the Georgian alphabet, this looks like a bug in the uppercase library function.

Curiously

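import java.util.Locale

fun main() {
    val locale = Locale.forLanguageTag("ka")
    // Uppercase only the first letter, using the Georgian locale.
    val capital = "თირაკლი".substring(0, 1).uppercase(locale)
    println(capital)
    // capitalize() is deprecated since Kotlin 1.5; kept to reproduce the original experiment.
    println("თირაკლი აბაშიძის ქუჩა".capitalize())
    println("თირაკლი აბაშიძის ქუჩა".uppercase())
}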

runs on the Kotlin Playground without triggering this bug.

@balsoft
Contributor Author

balsoft commented Jan 31, 2023

So, I've found the titlecase function which seems to do exactly what's needed in our case. #4785 should make use of that function. That said, I have never written Kotlin before, and I couldn't be bothered to build it locally (I'm sorry, I don't have the best PC with me currently).
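As a rough illustration of why a title-case mapping behaves differently from an uppercase mapping here, the sketch below uses Kotlin's `Char.titlecase()` through a hypothetical helper `titleCaseFirst`. It assumes Kotlin 1.5 or newer and is not the code from #4785. The Georgian word is taken from the street name used earlier in the thread.

```kotlin
// Hypothetical helper: title-case only the first character of a word.
fun titleCaseFirst(word: String): String =
    word.replaceFirstChar { it.titlecase() }

fun main() {
    val word = "ქუჩა"
    // Georgian letters map to themselves under title-casing,
    // so the word is printed unchanged.
    println(titleCaseFirst(word))
    // Latin letters do have a title-case form: prints "Street".
    println(titleCaseFirst("street"))
    // uppercase() may yield Mtavruli capitals on runtimes with recent Unicode
    // data and leave the text unchanged on older ones, so no output is assumed.
    println(word.uppercase())
}
```

The dependence of `uppercase()` on the runtime's Unicode data may also explain why the experiment above did not reproduce the bug in every environment.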

@matkoniecz
Member

https://github.com/streetcomplete/StreetComplete/blob/master/app/src/main/java/de/westnordost/streetcomplete/screens/user/statistics/CircularFlagView.kt#L154 looks like internal use

But there is also https://github.com/streetcomplete/StreetComplete/blob/master/app/src/main/java/de/westnordost/streetcomplete/data/meta/Abbreviations.kt#L77 used when abbreviations are expanded, I guess that it also should be fixed (maybe it is not triggered right now if no abbreviations are defined for Georgia)

@rhhsm

rhhsm commented Apr 19, 2023

It's an issue for Bulgarian too (Стара планина becomes Стара Планина but it shouldn't). My wife, who's fluent in Russian, says it would be wrong in Russian too. So maybe for all languages using Cyrillic script?

@matkoniecz
Member

@rhhsm Can you open a new issue? Here it can become lost.

@westnordost
Member

Uh, we use the title case function now. If that function returns wrong results, it is not in our hands to fix it. The fix for Georgian should have been the fix for any language.

@mnalis
Member

mnalis commented Apr 19, 2023

Um, it behaves very strangely for me too. Is that titlecase() supposed to behave differently depending on the language used (because it works the same for me in English and in Croatian)?

I mean, generally (outside of android), titlecase simply means that "all words are capitalized, except for minor words (typically articles, short prepositions, and some conjunctions) that are not the first or last word of the title".

And the Android titlecase() function seems to provide a simplified (see below) version of that Wikipedia definition.

There are several issues with such approach:

  • instead of using correct wordlists of prepositions, conjunctions etc. (which would need per-language databases etc.), it seems to assume all words with 3 or fewer letters are "minor" and all other words should be capitalized (e.g. video 0:42-0:51)
  • it seems to have an annoying bug where it capitalizes a word only if it is followed by a space (so the last word in a name will never be capitalized, even in cases where it should be, at least on my EMUI 12, e.g. video 0:16-0:20)
  • (biggest issue) Title case is generally mostly used only in English (specifically, in Northern American variant of English). In most other languages (at least in Croatian), "title case" is definitely not a correct way to capitalize street names (or anything else, for that matter). Specifically in Croatian, proper way of entering street names should be using Sentence case, i.e. "The first word, proper nouns and some specified words capitalised" (e.g. rest of the video like 0:28-0:38 gets basically everything wrong).

So I would suggest using sentence case in all countries except USA and maybe a few others (where title case might be more proper).

small_SVID_20230419_223637_1.mp4

@westnordost westnordost reopened this Apr 20, 2023
@westnordost
Member

Do you know a source where one can find a list of languages where title case should be used as opposed to sentence case? For street names, place names?

@mnalis
Member

mnalis commented Apr 20, 2023

Do you know a source where one can find a list of languages where title case should be used as opposed to sentence case? For street names, place names?

Uh, unfortunately not, and a quick search does not reveal one. Wikipedia seems to imply that title case is used mostly in English-speaking countries:

Title case is widely used in many English-language publications, especially in the United States. However, its conventions are sometimes not followed strictly – especially in informal writing.

Quick checking of most common suspects would indicate those do use titlecase:

But of course there are likely more.

@mnalis
Member

mnalis commented Apr 20, 2023

Now, if one were feeling adventurous, chatGPT says those countries might be using title case for street or place names:

[  "AG",  "AI",  "AU",  "BB",  "BD",  "BE",  "BM",  "BS",  "BZ",  "CA",  "CK",  "CY",  "FJ",  "FK",  "GB",  "GD",  "GG",  "GI",  "GM",  "GY",  "HK",  "IE",  "IM",  "IN",  "JE",  "JM",  "KN",  "KY",  "LC",  "LK",  "MS",  "MT",  "MU",  "MV",  "MW",  "MY",  "NA",  "NF",  "NG",  "NP",  "NR",  "NZ",  "PG",  "PH",  "PK",  "PN",  "SB",  "SC",  "SG",  "SH",  "SL",  "SS",  "SZ",  "TC",  "TO",  "TT",  "TV",  "VC",  "VG",  "VU",  "WS",  "ZA" ]

But I wouldn't trust (for anything important) that language model predictor farther than I could throw it 😄

@matkoniecz
Member

"title case" is definitely not a correct way to capitalize street names (or anything else, for that matter). Specifically in Croatian, proper way of entering street names should be using Sentence case, i.e. "The first word, proper nouns and some specified words capitalised"

What about street names named after people?

In Warsaw we have for example street name=Mordechaja Anielewicza named after a person (Name Surname form).

@rhhsm

rhhsm commented Apr 20, 2023

This source http://www.bibnet.be/files/download/525b23b4-d008-433b-bbc2-e24b32872d2b/Regelgeving/vlacc_bronnen/vlacc_bronnenTitelbeschrijven/Hoofdlettergebruik%20per%20taal.html states that compound geographic names (Samengestelde geografische namen) should be capitalised in Dutch, English, Spanish and German, while French is more complicated.
The keyboard app I'm using (Microsoft SwiftKey) does a pretty good job suggesting words with the correct capitalisation. If other keyboard apps do the same, I think we should not use the title case function by default unless we're absolutely sure it is correct for the current input language.

@mnalis
Member

mnalis commented Apr 20, 2023

"title case" is definitely not a correct way to capitalize street names (or anything else, for that matter). Specifically in Croatian, proper way of entering street names should be using Sentence case, i.e. "The first word, proper nouns and some specified words capitalized"

What about street names named after people?

I believe the definition of "Sentence case" mentioned above covers that. Note that it does not say that only the first letter of the sentence should be capitalized, but also proper nouns (which include personal names as such, I believe).

Of course, an actual implementation of Sentence case could range from the most basic (only the first letter of the sentence is capitalized) to more advanced (databases of words / names to capitalize or keep lowercase, local common usage, common cloud usage, etc.).

In Warsaw we have for example street name=Mordechaja Anielewicza named after a person (Name Surname form).

Yes, in Croatia we also have names like "Trg bana Josipa Jelačića" (where "Josip Jelačić" is a name of the person, and "ban" was his title) but in majority of the cases only the first letter would be capitalized "Zagrebačka cesta", "Glogov put", "Taborska ulica", "Vukovarska avenija".

The keyboard app I'm using (Microsoft SwiftKey) does a pretty good job suggesting words with the correct capitalisation. If other keyboard apps do the same, I think we should not use the title case function by default unless we're absolutely sure it is correct for the current input language.

That sounds like a good idea! It can be argued that it should be the keyboard's job to do proper capitalization / word correction (if it is advanced enough to offer such functionality). The SC might as well just make a first letter uppercase (as that sounds pretty common mostly everywhere?) and call it a day, and if the user is unhappy with how it works in their current keyboard, they can choose from a plethora of other keyboards.
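For comparison, a small sketch of the two behaviours discussed here, using one of the Croatian street names from this comment. Both function names are hypothetical, and "sentence case" is reduced to its most basic form (first letter only), which does not handle proper nouns inside a name.

```kotlin
// Hypothetical: title-case the first letter of every space-separated word.
fun titleCaseEveryWord(name: String): String =
    name.split(" ").joinToString(" ") { word ->
        word.replaceFirstChar { it.titlecase() }
    }

// Hypothetical: most basic sentence case, touching only the first letter.
fun sentenceCase(name: String): String =
    name.replaceFirstChar { it.titlecase() }

fun main() {
    val typed = "zagrebačka cesta"
    println(titleCaseEveryWord(typed)) // Zagrebačka Cesta (wrong for Croatian)
    println(sentenceCase(typed))       // Zagrebačka cesta
}
```

A name such as "Trg bana Josipa Jelačića" would still need manual capitalization of the personal name, or help from the keyboard app.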

@balsoft
Contributor Author

balsoft commented Apr 20, 2023

The SC might as well just make a first letter uppercase (as that sounds pretty common mostly everywhere?)

Well, this issue is specifically about the case where it's not true: in Georgian, upper-case letters are only used for ALL CAPS and not for title-case or sentence-case.

@mnalis
Member

mnalis commented Apr 20, 2023

Well, this issue is specifically about the case where it's not true: in Georgian, upper-case letters are only used for ALL CAPS and not for title-case or sentence-case.

Interesting. But I suspect that if titlecase() does not use upper-case letters in Georgian (as seems to be the case, since the previous solution seems to have worked?), it probably would not do so in sentence case either (e.g. capitalize()? or whatever else is used these days), as it says it will follow locale settings.

It would need testing to confirm that, of course.

@westnordost westnordost self-assigned this May 1, 2023
@westnordost
Member

westnordost commented May 2, 2023

So, I've searched extensively on OpenStreetMap. In these countries/languages, street names are (sometimes/often) not title-cased:

(Bulgaria is not amongst them. Omitted are countries with scripts that do not even have uppercase letters, such as Thai, Chinese etc.)

@westnordost westnordost removed their assignment May 2, 2023
@rhhsm

rhhsm commented May 3, 2023

What is the source of info that made you conclude not to include Bulgaria?

@westnordost
Member

westnordost commented May 3, 2023

I scrolled through the map. E.g. here, ~all road names are title cased: https://www.openstreetmap.org/#map=17/45.66213/25.60465

And here too: https://www.openstreetmap.org/#map=17/44.44219/26.10019

@Helium314
Collaborator

That's Romania. But still looks similar in actual Bulgaria, e.g. https://www.openstreetmap.org/#map=17/42.12727/24.74976

@westnordost
Member

Whoops, sorry. But I definitely looked in real Bulgaria too yesterday

@rhhsm

rhhsm commented May 3, 2023

Why look on the map when you could have asked me :)

That's Romania. But still looks similar in actual Bulgaria, e.g. https://www.openstreetmap.org/#map=17/42.12727/24.74976

All street names in this example are "title case" because they're all named after people (very, very common in Bulgaria). The only 2-word name that's an exception is Шар Планина, which is wrongly capitalised (it should be Шар планина, int_name=Shar planina; I just corrected it).

See here https://wiki.openstreetmap.org/wiki/Multilingual_names#Bulgaria for transliteration guidelines for Bulgarian: the examples are copy-pasted from the quoted law.

Maybe the map of the Sofia Metro would be more convincing? The cases where second or later words are capitalised are because they themselves are names.
[image: map of the Sofia Metro]

@westnordost
Member

Apart from Bulgaria, are there more countries besides those I already found where title case should not be used? Apparently I did not look closely enough in my swipe.

@jmontane

jmontane commented May 4, 2023

In Spain, it depends on the language used. Note that the value of "name" can be in any of the following languages.

  • Spanish (es): titlecase, but few words are lowercase (articles and prepositions: el, la, los, las, de, del, de la, de las, de los...)
  • Catalan (ca): titlecase, but few words are lowercase (articles and prepositions: el, la, l', els, las, de, del, de la, de les, dels...) + middledot U+00B7 is used as innerword char (titlecase functions sometimes capitalize wrongly letter after middledot) + way type isn't always capitalized (carrer Ample, carrer del Pinter Joan Miró...)
  • Occitan (oc): I don't know. I suspect it follows criteria similar to Spanish
  • Galician (gl): I don't know. I suspect it follows criteria similar to Spanish
  • Basque (eu): I don't know. I suspect it follows criteria similar to Spanish
  • Asturian (ast): I don't know. I suspect it follows criteria similar to Spanish
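The middle-dot problem mentioned for Catalan can be reproduced with a naive word splitter that treats every non-letter as a word boundary. The sketch below is only an illustration; `titleCaseWords` and its `innerWordChars` parameter are hypothetical and do not describe how any particular title-case function is implemented.

```kotlin
// Hypothetical title-casing routine: every letter that follows a non-letter
// starts a new word, unless that non-letter is listed in innerWordChars.
fun titleCaseWords(text: String, innerWordChars: Set<Char> = emptySet()): String {
    val result = StringBuilder()
    var atWordStart = true
    for (ch in text) {
        when {
            ch.isLetter() -> {
                result.append(if (atWordStart) ch.titlecase() else ch.toString())
                atWordStart = false
            }
            ch in innerWordChars -> result.append(ch) // stays inside the word
            else -> {
                result.append(ch)
                atWordStart = true
            }
        }
    }
    return result.toString()
}

fun main() {
    val typed = "avinguda del paral·lel"
    // Naive: the middle dot (U+00B7) is not a letter, so "lel" is capitalized.
    println(titleCaseWords(typed))                // Avinguda Del Paral·Lel
    // Treating the middle dot as an inner-word character avoids that.
    println(titleCaseWords(typed, setOf('·')))    // Avinguda Del Paral·lel
}
```

Even the second variant still capitalizes "del", which is the separate short-word problem from the list above.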

@westnordost
Member

westnordost commented May 4, 2023

Note that words are currently only automatically titlecased if the word has more than 3 characters. That should take care of all the el, la, las, de, del, van, von, etc...

(but misses out on "vía", but well 🤷 )
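A minimal sketch of that length-based heuristic, assuming words are separated by single spaces. The function name `titleCaseLongWords` and the `minLength` parameter are hypothetical; this is not the app's actual code.

```kotlin
// Hypothetical reconstruction of the heuristic: only words with more than
// three characters get their first letter title-cased.
fun titleCaseLongWords(name: String, minLength: Int = 4): String =
    name.split(" ").joinToString(" ") { word ->
        if (word.length >= minLength) word.replaceFirstChar { it.titlecase() } else word
    }

fun main() {
    println(titleCaseLongWords("calle de las flores")) // Calle de las Flores
    println(titleCaseLongWords("vía augusta"))         // vía Augusta ("vía" is missed)
    println(titleCaseLongWords("стара планина"))       // Стара Планина (wrong for Bulgarian)
}
```

The last two lines show both failure modes raised in this thread: short words that should be capitalized are skipped, and long words are capitalized in languages where they should not be.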

@westnordost
Member

I started implementing this with a huge list of languages in which no title case should be applied automatically, but stopped after several hours because I realized this automatism is just not worth the extra complexity. It will probably never be 100% correct (there are many languages, after all), and it was just a small convenience for regions where everything is in title case. The user might as well just press shift then 🤷

@rhhsm

rhhsm commented Jun 29, 2023

If it's a matter of already having done the basic work but not yet completed the list of languages, then it's worth implementing what has been done so far. Applying title case to languages that shouldn't do that generates wrong data, and that should be prevented as much as possible. If the list of languages that don't apply title case is longer than the one for languages that do, then maybe sentence case should be the default?

@westnordost
Member

westnordost commented Jun 29, 2023

No, I am telling you, it is not worth implementing. That's what I wrote in my previous comment. Did you read it? #5105 100% solves this issue by not using titlecase automatically at all, something that cannot be achieved by either an inclusive or an exclusive list of languages (because languages are plenty). Which is fine, because it is not exactly a major inconvenience for the user to press shift, plus for names ("Avenue Lenin"), the keyboard app usually suggests to titlecase it anyway.

@rhhsm

rhhsm commented Jun 29, 2023

Sorry for the misunderstanding, I don't even understand what a pull request is :) I agree that leaving the capitalisation to the keyboard app is the best option.

@westnordost
Member

It is a technical term for a suggested change that can be viewed, discussed and commented on on github.

westnordost added a commit that referenced this issue Jul 4, 2023
* do not automatically titlecase words for names (fixes #4784)

* remove unnecessary import