Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should the Segmenter types accept a locale? #3284

Closed
sffc opened this issue Apr 11, 2023 · 29 comments · Fixed by #5565
Closed

Should the Segmenter types accept a locale? #3284

sffc opened this issue Apr 11, 2023 · 29 comments · Fixed by #5565
Assignees
Labels
C-segmentation Component: Segmentation S-medium Size: Less than a week (larger bug fix or enhancement) T-core Type: Required functionality

Comments

@sffc
Copy link
Member

sffc commented Apr 11, 2023

In the API review, @markusicu pointed out that ICU takes a locale in the segmenter, and the locale affects the behavior in certain cases, such as those in the data files below:

Why don't we support these in ICU4X Segmenter, and should we add them?

For 1.2 purposes, we have a few choices:

  1. Add the locale parameter now and don't use it for anything yet
  2. Don't add the locale parameter but add something like _invariant to the constructor names, so that in the future try_new_auto_invariant() creates the locale-invariant segmenter and try_new_auto(locale!("el")) creates the locale-specific segmenter
  3. Keep things the way they are and add locale constructors later, possibly adopting the style above in 2.0
  4. Add the parameter to Word and Sentence, but not Line or Grapheme

Thoughts?

@aethanyc @makotokato @Manishearth

@sffc sffc added discuss Discuss at a future ICU4X-SC meeting C-segmentation Component: Segmentation labels Apr 11, 2023
@Manishearth
Copy link
Member

General preference for #3

Do we plan to provide these locale-ish APIs in the near term? I actually think future try_new_auto_with_locale() would be fine

@zbraniecki
Copy link
Member

General preference for #2

I'd prefer the default constructor names to be consistent in behavior as much as makes sense. If we believe Segmenter constructors will want to take locale just like all others, lets keep the names for those constructors.
If, in the future, we decide to not add those names, we can always alias the default constructor names to the _invariant ones.

@makotokato
Copy link
Member

Why this ICU4C rule isn't merged/requested to UAX#29? Does ICU4C have a plan to file/merge an issue to UAX#29? After merging this change to UAX#29, then #3.

@Manishearth
Copy link
Member

UAX #29 in general doesn't really want to include locale-specific stuff because it wants to leave that up to CLDR.

@sffc
Copy link
Member Author

sffc commented Apr 11, 2023

Suffix suggestions:

  • _invariant
  • _root (wrong: we don't have CLDR root tailorings)
  • _uax (too restrictive: we want to add CLDR tailorings)
  • _untailored (wrong: LineBreakOptions has tailorings)
  • _default (could be restrictive with regard to default data)

@macchiati
Copy link
Member

Since we will want to take locales as parameters (even for segmenters where that isn't implemented yet), IMO we should make that the "normal" case.

@sffc
Copy link
Member Author

sffc commented Apr 11, 2023

From discussion with @aethanyc @makotokato @Manishearth @nordzilla: It is an enhancement to consume CLDR root.xml tailorings, but not necessarily a bug. We would like to see it done in a timely fashion.

Conclusion: use _invariant

@robertbastian
Copy link
Member

For the record we didn't use _invariant in #3294

@sffc sffc removed the discuss Discuss at a future ICU4X-SC meeting label May 11, 2023
@sffc sffc added this to the 1.x Priority ⟨P2⟩ milestone May 11, 2023
@sffc sffc added T-core Type: Required functionality S-large Size: A few weeks (larger feature, major refactoring) labels May 11, 2023
@hsivonen
Copy link
Member

Data bundles that contain some language-specific data for sentence segmentation: https://github.com/unicode-org/icu/tree/main/icu4c/source/data/brkitr

These seem to be lists of abbreviations that contain a period that doesn't end a sentence. How bad would it be to merge the lists and use the merged lists across languages?

fi_sv override for word break: https://github.com/unicode-org/icu/blob/main/icu4c/source/data/brkitr/rules/word_fi_sv.txt

It's a bit sad that treating letter, colon, letter as having a word break opportunity after the colon is a case of giving computer syntax needs precedence over natural-language needs. If accommodating computer syntaxes wasn't given priority, the Finnish/Swedish requirement of not treating letter, colon, letter as containing a word break opportunity could be hoisted to root.

el override for sentence break: https://github.com/unicode-org/icu/blob/main/icu4c/source/data/brkitr/rules/sent_el.txt

This seems to be about ASCII semicolon having sentence-ending question mark semantics. Could this be accommodated in the root by triggering the rule on the most recent letter being from the Greek script?

@robertbastian
Copy link
Member

It's a bit sad that treating letter, colon, letter as having a word break opportunity after the colon is a case of giving computer syntax needs precedence over natural-language needs. If accommodating computer syntaxes wasn't given priority, the Finnish/Swedish requirement of not treating letter, colon, letter as containing a word break opportunity could be hoisted to root.

I think German needs this tailoring as well. I don't know why Finnish and Swedish do, but in German a colon is commonly used to form gender-neutral nouns, like Lehrer:in, which should not contain any word breaks.

What's the current process for updating the tailorings? ICU or CLDR?

@Manishearth
Copy link
Member

@robertbastian this was recently discussed in the CLDR design meeting, CLDR issue: https://unicode-org.atlassian.net/browse/CLDR-15910 / PAG issue: https://github.com/unicode-org/properties/issues/187 (internal)

There's thought that this should actually be made to apply to all languages, since colons without spaces on either side are not really a thing in regular text anyway, and if the space has been removed there's a good chance it's on purpose.

@srl295
Copy link
Member

srl295 commented Jan 24, 2024

Data bundles that contain some language-specific data for sentence segmentation: https://github.com/unicode-org/icu/tree/main/icu4c/source/data/brkitr

These seem to be lists of abbreviations that contain a period that doesn't end a sentence. How bad would it be to merge the lists and use the merged lists across languages?

That's exactly what they are. https://www.unicode.org/reports/tr35/tr35-general.html#Segmentation_Exceptions for more details.

The lists for one language may not be applicable for others. But you could probably calculate a list that's likely to be generally useful, it might be less useful for any particular language.

@hsivonen
Copy link
Member

I think German needs this tailoring as well.

I think https://unicode-org.atlassian.net/browse/CLDR-15910 should be reverted on the root level so that we don't need tailorings to accommodate natural languages.

I don't know why Finnish and Swedish do

For Finnish, the use case is marking where a sufficiently unusual word body (e.g. acronym) ends and the case suffix starts. For example, English Henri’s would be Henrin in Finnish but English ICU4X’s would be ICU4X:n in Finnish. The use case for Swedish seems to be also about applying suffixes (though not case suffixes) to sufficiently unusual word bodies. (Consider an analog English Londoner but with with the suffix applied to e.g. a sports team acronym.)

@sffc
Copy link
Member Author

sffc commented Jan 26, 2024

Okay, for 2.0 purposes, which of the four segmenters requires a locale parameter?

  1. Grapheme: Are there locale-specific CLDR tailorings for graphemes?
  2. Word: It sounds like people want to move the fi_sv tailorings to the root, which would obviate the need for RBBI tailoring. However, locale info could still help with complex language segmentation, although we need to know the language of the text, not of the user.
  3. Sentence: Seems like this is the biggest use case, although it is still about the language of the text and not of the user.
  4. Line: I think this one is invariant.

The "language of the text" would be more appropriate to provide in the terminal segment function since it is an attribute of the text, but since that requires data loading, it might be more appropriate to specify it in the constructor. Alternatively, we could stuff all sentence language tailorings into a single data key which is always loaded when making a sentence segmenter, as we do for the word break segmenter.

@sffc sffc added the needs-approval One or more stakeholders need to approve proposal label Jan 26, 2024
@srl295
Copy link
Member

srl295 commented Jan 26, 2024

Grapheme: Are there locale-specific CLDR tailorings for graphemes?

Not yet. Please put it into the API. I was doing planning on a work item to move this forward. This is for example languages that want to keep "ch" together etc.

@hsivonen
Copy link
Member

hsivonen commented Jan 26, 2024

Grapheme

Please put it into the API.

On the flip side, putting this in the API really requires making ECMA-402 have a way to explicitly ask for root and to default to root.

Some users getting a different definition of extended grapheme clusters based on the browser UI locale would likely be bad, after developers having assumed for years that extended grapheme clusters are a Unicode-level concept and not a locale-level concept. Also, it would be bad to have to assume that English is always going to be the untailored language and to teach every developer to ask for a grapheme segmenter for English in order to get behavior on a similar level of stability that one would expect of e.g Swift strings.

This is for example languages that want to keep "ch" together etc.

What languages do you mean and why do they want to keep "ch" together for the kind of purposes that extended grapheme clusters are used for, such as denying the selection of only "c" or only "h"? Czech treats "ch" as a collation unit, but do users of the language expect not to be able to select "c" and "h" individually?

@markusicu
Copy link
Member

CLDR has decided that they're putting the : thing back into root.

Yes, but the Apple rep in the meeting, who is originally from Sweden, insisted that CLDR keep the fi/sv word break tailorings, because he thinks that even the future keyword selecting technical usage should keep fi/sv words together across colon.

No language parameter for grapheme cluster segmenter

+1

Language parameter for the other three segmenters

+1

Plan to put the parameter on the .segment() function

That seems weird both from looking at usage and thinking about data loading.

  • I strongly expect that someone should be able to get a Segmenter object and just use it to find/iterate over segments without knowing about additional options.
  • I expect tailorings to require some different data that should be loaded in the "constructor". Depending on the implementation, there may be a totally different blob or a small delta, but probably generally some non-zero tailoring-specific data.

@sffc sffc added S-medium Size: Less than a week (larger bug fix or enhancement) and removed discuss Discuss at a future ICU4X-SC meeting needs-approval One or more stakeholders need to approve proposal discuss-triaged The stakeholders for this issue have been identified and it can be discussed out-of-band S-large Size: A few weeks (larger feature, major refactoring) labels Mar 14, 2024
@sffc
Copy link
Member Author

sffc commented Apr 1, 2024

The conclusions from the discussion of this issue with the CLDR design group:

  • Grapheme clusters should not be language-specific; baked into much low-level processing (e.g., Swift, font mappings) which we don’t want to be language-specific
  • Content locale/text language parameter (not UI locale): Potential for accuracy; make it optional, name it well
  • Ok to leave the locale on the constructor; benefit: more specific data loading even for existing dictionaries & models

My suggested path forward for this issue, then, is to add an options bag to the WordSegmenter, LineSegmenter, and SentenceSegmenter constructors with an optional content_locale field of type &LanguageIdentifier.

@sffc sffc modified the milestones: ICU4X 2.0, 1.5 Blocking ⟨P1⟩ Apr 1, 2024
@sffc
Copy link
Member Author

sffc commented Apr 1, 2024

I'm moving this back into 1.5 because the constructor can be drafted and bikeshed ahead of time, and then in 2.0 we can do the minimal change of making the new constructor the default one.

@srl295
Copy link
Member

srl295 commented Apr 2, 2024

Grapheme clusters should not be language-specific; baked into much low-level processing (e.g., Swift, font mappings) which we don’t want to be language-specific

This makes no sense and contradicts the long standing requests. ( https://unicode-org.atlassian.net/browse/CLDR-2992 which I am working on scheduling ) I would have joined, did not realize this was coming up today.

Perusing the notes it's not clear that the previous requirements and recent discussion from the segmentation summary last year were included here.

@sffc
Copy link
Member Author

sffc commented May 16, 2024

Based on additional discussion in the email thread, I would like to move forward with the recommendation in #3284 (comment), with the additional understanding that we may add support for locale-based grapheme segmentation in the future if CLDR adds data for this, but it might take the form of another (fifth) segmenter type.

Concretely:

  • All segmenters retain a new or try_new function without an options bag
  • Word, Sentence, and Line segmenters get a try_new_with_options function that includes a content_locale option

@makotokato
Copy link
Member

When looking ICU4C brkiter rule files for word and sentence, UAX#29's property of this isn't same each locale. But rules seem to be same. So if we modify datagen (with a few changes of toml data file), we can generate rules data per locale.

makotokato added a commit that referenced this issue Sep 3, 2024
Add LocaleData parameter for word/sentence segmenter 

This is a part of #3284.

ICU4C has some language break rules for word and sentence segmenter, so
this fix adds some rules to ICU4X per locale.

This adds LocaleData argument to all constructors. Also, locale
difference is small and 2 data only, I add the override table data
marker for machine state property.
@makotokato
Copy link
Member

If we support auto-phase line-break property, we should have locale paremeter to line segmenter instead of ja_zh flag.

@sffc
Copy link
Member Author

sffc commented Sep 18, 2024

Currently, we have the optional Content Locale on WordSegmenter and SentenceSegmenter.

@makotokato will create a pull request to replace ja_zh with content_locale in LineSegmenterOptions.

Once that PR lands, we can close this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-segmentation Component: Segmentation S-medium Size: Less than a week (larger bug fix or enhancement) T-core Type: Required functionality
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.