Latin script: Segmenter should split camelCased words #129

ManyTheFish · 2022-09-28T07:56:12Z

Today, Meilisearch splits snake_case, SCREAMING_CASE and kebab-case properly, but splits neither PascalCase nor camelCase.

drawback

Meilisearch doesn't completely support code documentation.

enhancement

Make the Latin Segmenter split camelCase/PascalCase words:

"camelCase" -> ["camel", "Case"]
"PascalCase" -> ["Pascal", "Case"]
"IJsland" -> ["IJsland"] (Language trap)
"CASE" -> ["CASE"] (another trap)

Files expected to be modified

/src/segmenter/latin.rs

Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the recurrent rules, you can find some guides to easily implement a Segmenter or a Normalizer.
Thanks a lot for your Contribution! 🤝

adithyaakrishna · 2022-10-03T21:40:12Z

@ManyTheFish Can I take this up?

curquiza · 2022-10-04T07:48:51Z

Hello @adithyaakrishna

Thanks for your interest in this project 🔥 You are definitely more than welcome to open a PR for this!

FYI, we prefer not assigning people to our issues because sometimes people ask to be assigned and never come back, which discourages the volunteer contributors from opening a PR to fix this issue.
We will accept and merge the first PR that correctly fixes the issue and follows our contributing guidelines.

We are looking forward to reviewing your PR 😊

yenwel · 2022-10-07T15:00:06Z

@adithyaakrishna are you still working on this? Otherwise, I'd like to try it next week.

adithyaakrishna · 2022-10-10T18:51:48Z

@yenwel I am working on this. As I am very new to Rust, I have some doubts 😶

@curquiza Regarding this issue, what if we convert the string into a form that Meilisearch already handles?

ManyTheFish · 2022-10-11T16:23:58Z

Hey @adithyaakrishna,
you can't convert the String in a segmenter because you don't have ownership of it at this step, so you'll have to find another way to split it.
To help you a bit more: split_word_bounds returns an Iterator over &str. From this Iterator, you should be able to re-split each item and then flatten the results.
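
The "re-split each item, then flatten" idea can be sketched roughly as follows. This is only an illustration, not the code that was eventually merged: `split_camel` and `segment` are hypothetical names, and `split_whitespace` from the standard library stands in for `split_word_bounds` so that the snippet needs no external crate. The point is that every returned piece is a sub-slice of the borrowed input, so no ownership of the String is required.

```rust
/// Hypothetical helper: split one borrowed word at every position where a
/// lowercase character is directly followed by an uppercase one.
/// The returned pieces are sub-slices of `word`; nothing is copied.
fn split_camel(word: &str) -> Vec<&str> {
    let mut parts = Vec::new();
    let mut start = 0;
    let mut prev: Option<char> = None;
    for (idx, ch) in word.char_indices() {
        if let Some(p) = prev {
            if p.is_lowercase() && ch.is_uppercase() {
                // Boundary found: emit everything before the uppercase char.
                parts.push(&word[start..idx]);
                start = idx;
            }
        }
        prev = Some(ch);
    }
    if start < word.len() {
        parts.push(&word[start..]);
    }
    parts
}

/// Re-split every item of a word iterator and flatten the result.
/// ASSUMPTION: `split_whitespace` is only a std stand-in for
/// `split_word_bounds`, which also yields `&str` items.
fn segment(text: &str) -> impl Iterator<Item = &str> {
    text.split_whitespace().flat_map(split_camel)
}
```

With this sketch, `segment("camelCase PascalCase CASE")` yields `camel`, `Case`, `Pascal`, `Case`, `CASE`, all borrowed from the original text.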

ManyTheFish · 2022-10-18T12:15:34Z

Hey @adithyaakrishna,
any news on this?
Do you need any help? 😄

160: Latin Segmenter split camelCased words r=ManyTheFish a=rashmibharambe

# Pull Request

## Related issue

Fixes #129

## What does this PR do?

- Latin Segmenter - to split camelCased words

## PR checklist

Please check if your PR fulfills the following requirements:

- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!

Co-authored-by: rashmibharambe <93034034+rashmibharambe@users.noreply.github.com>

goodhoko · 2023-01-26T11:52:45Z

Hi, @ManyTheFish!

I have a working patch for this. What's not clear to me is whether we want to keep the whole word or not. I.e. if camelCase should be segmented into ["camelCase", "camel", "Case"] or just ["camel", "Case"].

I'd expect the former because if a user searches for camelCase we probably want to prioritize documents containing camelCase as a whole over documents containing camel and case. But I'm not familiar with how the ranking at search-time works in Meili so maybe this is handled in some other way.

Let me know your stance on this and I can submit a PR.

ManyTheFish · 2023-01-31T09:49:49Z

The goal is to split "camelCase" into ["camel", "Case"] and even "snake_case" into ["snake", "_", "case"].
I agree with you that we will lose precision in favor of recall, but, in this case, I'd prefer to focus on accent/diacritics precision rather than on case. 🤔

To improve recall and to be consistent with the snake_case and kebab-case splitting that's already in place, make the Latin Segmenter split words on camelCase boundaries.

Define a camelCase boundary as a lowercase letter directly followed by an uppercase one. (Or the position between them, to be precise.) This covers most cases and avoids common pitfalls such as ALL_CAPS.

Leverage the Unicode General Categories https://en.wikipedia.org/wiki/Unicode_character_property#General_Category and their support in the Regex crate for matching lowercase and uppercase letters.

Put the logic into a separate module and expose an API similar to UnicodeSegmentation's to keep the call-site in latin.rs clean and concise.

Caveats:
- Especially in code, it's common to write abbreviations in all caps, for instance "meiliAPIClient". With this implementation it's split into ["meili", "APIClient"].
- The implementation is not grapheme-cluster aware. E.g. it won't split on boundaries that are interrupted by combining characters like u{0306}.

Closes meilisearch#129.
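
A rough sketch of the boundary rule and the two caveats described in this commit message, under some loud assumptions: the type and trait names (`CamelCaseParts`, `CamelCaseSplit`, `split_camel_case`) are hypothetical, and instead of the Regex crate this uses std's `char::is_lowercase` / `char::is_uppercase`, which follow the Unicode `Lowercase` / `Uppercase` properties and are therefore slightly broader than the general categories mentioned above. It only illustrates the shape of a lazy, extension-trait API in the spirit of UnicodeSegmentation.

```rust
/// Hypothetical lazy iterator over the camelCase segments of a borrowed string.
pub struct CamelCaseParts<'a> {
    rest: &'a str,
}

impl<'a> Iterator for CamelCaseParts<'a> {
    type Item = &'a str;

    fn next(&mut self) -> Option<&'a str> {
        if self.rest.is_empty() {
            return None;
        }
        // Look for the first uppercase char that directly follows a lowercase one.
        let mut prev_is_lower = false;
        let mut cut = self.rest.len();
        for (idx, ch) in self.rest.char_indices() {
            if prev_is_lower && ch.is_uppercase() {
                cut = idx;
                break;
            }
            // ASSUMPTION: std's Lowercase property stands in for a regex class.
            prev_is_lower = ch.is_lowercase();
        }
        let (head, tail) = self.rest.split_at(cut);
        self.rest = tail;
        Some(head)
    }
}

/// Hypothetical extension trait so the call-site can stay a one-liner.
pub trait CamelCaseSplit {
    fn split_camel_case(&self) -> CamelCaseParts<'_>;
}

impl CamelCaseSplit for str {
    fn split_camel_case(&self) -> CamelCaseParts<'_> {
        CamelCaseParts { rest: self }
    }
}
```

Under this sketch, `"meiliAPIClient".split_camel_case()` yields `meili` and `APIClient` (the abbreviation caveat), `"ALL_CAPS"` and `"IJsland"` stay whole, and `"a\u{0306}B"` is not split because the combining character sits between the lowercase and the uppercase letter.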

181: Split camelCase in Latin segmenter r=ManyTheFish a=goodhoko

## Related issue

#129

## What does this PR do?

To improve recall and to be consistent with the snake_case and kebab-case splitting that's already in place, make the Latin Segmenter split words on camelCase boundaries.

Define a camelCase boundary as a lowercase letter directly followed by an uppercase one. (The position between them, to be precise.) This covers most cases and avoids common pitfalls such as ALL_CAPS.

What is not handled, though, are abbreviations within a longer word. Especially in code, it's common to write e.g. "meiliAPIClient". With this implementation it's split into ["meili", "APIClient"]. Let me know if that's a blocker.

Leverage the Unicode General Categories https://en.wikipedia.org/wiki/Unicode_character_property#General_Category and their support in the Regex crate for matching lowercase and uppercase letters.

Put the logic into a separate module and expose an API similar to UnicodeSegmentation's to keep the call-site in latin.rs clean and concise.

Fixes #129.

## Performance

I benchmarked this locally on my Apple-silicon MB Air with `cargo bench` and saw some significant regressions. The `Latin/Fra` and `Latin/Eng` benchmarks (in both `132` and `363` variants) regressed by some 40% compared to a3eab30. (Here's the full [report](https://github.com/meilisearch/charabia/files/10559955/report.zip))

I'm not sure how big of a problem this is. There definitely are many optimization opportunities, but I wanted to propose the simplest solution first and eventually iterate from there.

---

PS: This is my very first Rust contribution. Sorry for missing any conventions or leaving in rough edges.

Co-authored-by: Jen Tak <goodhoko@gmail.com>

ManyTheFish added the good first issue and hacktoberfest labels Sep 28, 2022

ManyTheFish changed the title ~~Latin Segmenter should split CamelCased words~~ Latin Segmenter should split camelCased words Sep 28, 2022

curquiza transferred this issue from meilisearch/engine-team Sep 29, 2022

rashmibharambe mentioned this issue Oct 31, 2022

Latin Segmenter split camelCased words #160

Closed

3 tasks

curquiza removed the hacktoberfest label Nov 15, 2022

oluademola mentioned this issue Jan 10, 2023

The soft separators ("_" and "-") should behave exactly the same meilisearch/meilisearch#3320

Closed

ManyTheFish mentioned this issue Jan 31, 2023

Latin script: segmenter should support word segmentation #175

Open

ManyTheFish changed the title ~~Latin Segmenter should split camelCased words~~ Latin script: Segmenter should split camelCased words Jan 31, 2023

goodhoko mentioned this issue Feb 1, 2023

Split camelCase in Latin segmenter #181

Merged

bors bot closed this as completed in 8c63180 Feb 23, 2023
