Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Latin script: Segmenter should split camelCased words #129

Closed
ManyTheFish opened this issue Sep 28, 2022 · 8 comments · Fixed by #181
Closed

Latin script: Segmenter should split camelCased words #129

ManyTheFish opened this issue Sep 28, 2022 · 8 comments · Fixed by #181
Labels
good first issue Good for newcomers

Comments

@ManyTheFish
Copy link
Member

ManyTheFish commented Sep 28, 2022

Today, Meilisearch is splitting snake_case, SCREAMING_CASE and kebab-case properly but doesn't split PascalCase nor camelCase.

drawback

Meilisearch doesn't completely support code documentation.

enhancement

Make Latin Segmenter split camelCased/PascalCase words:

  • "camelCase" -> ["camel", "Case"]
  • "PascalCase" -> ["Pascal", "Case"]
  • "IJsland" -> ["IJsland"] (Language trap)
  • "CASE" -> ["CASE"] (another trap)

Files expected to be modified

Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the recurrent rules, you can find some guides to easily implement a Segmenter or a Normalizer.
Thanks a lot for your Contribution! 🤝

@ManyTheFish ManyTheFish changed the title Latin Segmenter should split CamelCased words Latin Segmenter should split camelCased words Sep 28, 2022
@curquiza curquiza transferred this issue from meilisearch/engine-team Sep 29, 2022
@adithyaakrishna
Copy link

adithyaakrishna commented Oct 3, 2022

@ManyTheFish Can I take this up?

@curquiza
Copy link
Member

curquiza commented Oct 4, 2022

Hello @adithyaakrishna

Thanks for your interest in this project 🔥 You are definitely more than welcome to open a PR for this!

FYI, we prefer not assigning people to our issues because sometimes people ask to be assigned and never come back, which discourages the volunteer contributors from opening a PR to fix this issue.
We will accept and merge the first PR that fixes correctly and well implements the issue following our contributing guidelines.

We are looking forward to reviewing your PR 😊

@yenwel
Copy link
Contributor

yenwel commented Oct 7, 2022

@adithyaakrishna you still doing this otherwise I'd try this next week

@adithyaakrishna
Copy link

@yenwel I am working on this, as I am very new to rust I have some doubts 😶

@curquiza Regarding this issue, how about if we convert the string to something which meilisearch already does?

@ManyTheFish
Copy link
Member Author

ManyTheFish commented Oct 11, 2022

Hey @adithyaakrishna,
you can't convert the String in a segmenter because you don't have ownership of it at this step. So you'll have to find another way to split it.
To help you a bit more, split_word_bounds returns an Iterator over &str, from this Iterator you should be able to re-split each Item then flatten them.

@ManyTheFish
Copy link
Member Author

Hey @adithyaakrishna,
any news on this?
Do you need any help? 😄

bors bot added a commit that referenced this issue Nov 1, 2022
160: Latin Segmenter split camelCased words r=ManyTheFish a=rashmibharambe

# Pull Request

## Related issue
Fixes #129 

## What does this PR do?
- 
Latin Segmenter - to split camelCased words

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: rashmibharambe <93034034+rashmibharambe@users.noreply.github.com>
@goodhoko
Copy link
Contributor

goodhoko commented Jan 26, 2023

Hi, @ManyTheFish!

I have a working patch for this. What's not clear to me is whether we want to keep the whole word or not. I.e. if camelCase should be segmented into ["camelCase", "camel", "Case"] or just ["camel", "Case"].

I'd expect the former because if a user searches for camelCase we probably want to prioritize documents containing camelCase as a whole over documents containing camel and case. But I'm not familiar with how the ranking at search-time works in Meili so maybe this is handled in some other way.

Let me know what's your stance on this and I can submit a PR.

@ManyTheFish
Copy link
Member Author

The goal is to split "camelCase" into ["camel", "Case"] and even "snake_case" into ["snake", "_", "case"].
I agree with you that we will lose precision in favor of the recall, but, in this case, I'd prefer to focus on accent/diacritics precision than on the case. 🤔

@ManyTheFish ManyTheFish changed the title Latin Segmenter should split camelCased words Latin script: Segmenter should split camelCased words Jan 31, 2023
goodhoko added a commit to goodhoko/charabia that referenced this issue Feb 1, 2023
To improve recall and to be consistent with snake_case and kebab-case
splitting that's already in place, make the Latin Segmenter split words
on camelCase boundaries.

Define camelCase boundary as a lowercase letter directly followed by an
uppercase one. (Or the position between them, to be precise.) This
treats most cases and avoids the common pitfalls like eg. ALL_CAPS.
What is not handled though, are abbreviations within a longer word.
Especially in code it's common to write eg. "meiliAPIClient". With this
implementation it's split into ["meili", "APIClient"].

Leverage the Unicode General Categories
https://en.wikipedia.org/wiki/Unicode_character_property#General_Category
and their support in the Regex crate for matching lowercase and
uppercase letters.

Closes meilisearch#129.
goodhoko added a commit to goodhoko/charabia that referenced this issue Feb 1, 2023
To improve recall and to be consistent with snake_case and kebab-case
splitting that's already in place, make the Latin Segmenter split words
on camelCase boundaries.

Define camelCase boundary as a lowercase letter directly followed by an
uppercase one. (Or the position between them, to be precise.) This
treats most cases and avoids the common pitfalls like eg. ALL_CAPS.
What is not handled though, are abbreviations within a longer word.
Especially in code it's common to write eg. "meiliAPIClient". With this
implementation it's split into ["meili", "APIClient"].

Leverage the Unicode General Categories
https://en.wikipedia.org/wiki/Unicode_character_property#General_Category
and their support in the Regex crate for matching lowercase and
uppercase letters.

Put the logic into a separate module and expose API similar to
UnicodeSegmentation's to keep the call-side in latin.rs clean and
concise.

Closes meilisearch#129.
goodhoko added a commit to goodhoko/charabia that referenced this issue Feb 1, 2023
To improve recall and to be consistent with snake_case and kebab-case
splitting that's already in place, make the Latin Segmenter split words
on camelCase boundaries.

Define camelCase boundary as a lowercase letter directly followed by an
uppercase one. (Or the position between them, to be precise.) This
treats most cases and avoids the common pitfalls like eg. ALL_CAPS.
What is not handled though, are abbreviations within a longer word.
Especially in code it's common to write eg. "meiliAPIClient". With this
implementation it's split into ["meili", "APIClient"].

Leverage the Unicode General Categories
https://en.wikipedia.org/wiki/Unicode_character_property#General_Category
and their support in the Regex crate for matching lowercase and
uppercase letters.

Put the logic into a separate module and expose API similar to
UnicodeSegmentation's to keep the call-side in latin.rs clean and
concise.

Closes meilisearch#129.
goodhoko added a commit to goodhoko/charabia that referenced this issue Feb 1, 2023
To improve recall and to be consistent with snake_case and kebab-case
splitting that's already in place, make the Latin Segmenter split words
on camelCase boundaries.

Define camelCase boundary as a lowercase letter directly followed by an
uppercase one. (Or the position between them, to be precise.) This
treats most cases and avoids common pitfalls like eg. ALL_CAPS. What is
not handled though, are abbreviations within a longer word. Especially
in code, it's common to write eg. "meiliAPIClient". With this
implementation it's split into ["meili", "APIClient"].

Leverage the Unicode General Categories
https://en.wikipedia.org/wiki/Unicode_character_property#General_Category
and their support in the Regex crate for matching lowercase and
uppercase letters.

Put the logic into a separate module and expose API similar to
UnicodeSegmentation's to keep the call-site in latin.rs clean and
concise.

Closes meilisearch#129.
goodhoko added a commit to goodhoko/charabia that referenced this issue Feb 7, 2023
To improve recall and to be consistent with snake_case and kebab-case
splitting that's already in place, make the Latin Segmenter split words
on camelCase boundaries.

Define camelCase boundary as a lowercase letter directly followed by an
uppercase one. (Or the position between them, to be precise.) This
treats most cases and avoids common pitfalls like eg. ALL_CAPS. What is
not handled though, are abbreviations within a longer word. Especially
in code, it's common to write eg. "meiliAPIClient". With this
implementation it's split into ["meili", "APIClient"].

Leverage the Unicode General Categories
https://en.wikipedia.org/wiki/Unicode_character_property#General_Category
and their support in the Regex crate for matching lowercase and
uppercase letters.

Put the logic into a separate module and expose API similar to
UnicodeSegmentation's to keep the call-site in latin.rs clean and
concise.

Closes meilisearch#129.
goodhoko added a commit to goodhoko/charabia that referenced this issue Feb 7, 2023
To improve recall and to be consistent with snake_case and kebab-case
splitting that's already in place, make the Latin Segmenter split words
on camelCase boundaries.

Define camelCase boundary as a lowercase letter directly followed by an
uppercase one. (Or the position between them, to be precise.) This
treats most cases and avoids common pitfalls like eg. ALL_CAPS.

Leverage the Unicode General Categories
https://en.wikipedia.org/wiki/Unicode_character_property#General_Category
and their support in the Regex crate for matching lowercase and
uppercase letters.

Put the logic into a separate module and expose API similar to
UnicodeSegmentation's to keep the call-site in latin.rs clean and
concise.

Caveats:
- Especially in code, it's common to write abbreviations in all caps.
For instance "meiliAPIClient". With this implementation it's split into
["meili", "APIClient"].
- The implementation is not grapheme-cluster aware. E.g. it won't split
on boundaries that are intermitted by combining characters like u{0306}.

Closes meilisearch#129.
goodhoko added a commit to goodhoko/charabia that referenced this issue Feb 15, 2023
To improve recall and to be consistent with snake_case and kebab-case
splitting that's already in place, make the Latin Segmenter split words
on camelCase boundaries.

Define camelCase boundary as a lowercase letter directly followed by an
uppercase one. (Or the position between them, to be precise.) This
treats most cases and avoids common pitfalls like eg. ALL_CAPS.

Leverage the Unicode General Categories
https://en.wikipedia.org/wiki/Unicode_character_property#General_Category
and their support in the Regex crate for matching lowercase and
uppercase letters.

Put the logic into a separate module and expose API similar to
UnicodeSegmentation's to keep the call-site in latin.rs clean and
concise.

Caveats:
- Especially in code, it's common to write abbreviations in all caps.
For instance "meiliAPIClient". With this implementation it's split into
["meili", "APIClient"].
- The implementation is not grapheme-cluster aware. E.g. it won't split
on boundaries that are intermitted by combining characters like u{0306}.

Closes meilisearch#129.
bors bot added a commit that referenced this issue Feb 23, 2023
181: Split camelCase in Latin segmenter r=ManyTheFish a=goodhoko

## Related issue
#129

## What does this PR do?

To improve recall and to be consistent with snake_case and kebab-case splitting that's already in place, make the Latin Segmenter split words on camelCase boundaries.

Define camelCase boundary as a lowercase letter directly followed by an uppercase one. (The position between them, to be precise.) This treats most cases and avoids the common pitfalls like eg. ALL_CAPS. What is not handled though, are abbreviations within a longer word. Especially in code it's common to write eg. "meiliAPIClient". With this implementation it's split into ["meili", "APIClient"]. Let me know if that's a blocker.

Leverage the Unicode General Categories
https://en.wikipedia.org/wiki/Unicode_character_property#General_Category
and their support in the Regex crate for matching lowercase and uppercase letters.

Put the logic into a separate module and expose API similar to UnicodeSegmentation's to keep the call-site in latin.rs clean and concise.

Fixes #129.

## Performance
I benchmarked this locally on my Apple-silicon MB Air with `cargo bench` and saw some significant regressions. The  `Latin/Fra` and `Latin/Eng` (in both `132` and `363` variants) regressed some 40% compared to a3eab30. (Here's the full [report](https://github.com/meilisearch/charabia/files/10559955/report.zip)) I'm not sure how big of a problem this is. There definitely are many optimization opportunities but I wanted to propose the simplest solution first and eventually iterate from there.

---

PS: This is my very first rust contribution. Sorry for missing any conventions or leaving in rough edges.



Co-authored-by: Jen Tak <goodhoko@gmail.com>
bors bot added a commit that referenced this issue Feb 23, 2023
181: Split camelCase in Latin segmenter r=ManyTheFish a=goodhoko

## Related issue
#129

## What does this PR do?

To improve recall and to be consistent with snake_case and kebab-case splitting that's already in place, make the Latin Segmenter split words on camelCase boundaries.

Define camelCase boundary as a lowercase letter directly followed by an uppercase one. (The position between them, to be precise.) This treats most cases and avoids the common pitfalls like eg. ALL_CAPS. What is not handled though, are abbreviations within a longer word. Especially in code it's common to write eg. "meiliAPIClient". With this implementation it's split into ["meili", "APIClient"]. Let me know if that's a blocker.

Leverage the Unicode General Categories
https://en.wikipedia.org/wiki/Unicode_character_property#General_Category
and their support in the Regex crate for matching lowercase and uppercase letters.

Put the logic into a separate module and expose API similar to UnicodeSegmentation's to keep the call-site in latin.rs clean and concise.

Fixes #129.

## Performance
I benchmarked this locally on my Apple-silicon MB Air with `cargo bench` and saw some significant regressions. The  `Latin/Fra` and `Latin/Eng` (in both `132` and `363` variants) regressed some 40% compared to a3eab30. (Here's the full [report](https://github.com/meilisearch/charabia/files/10559955/report.zip)) I'm not sure how big of a problem this is. There definitely are many optimization opportunities but I wanted to propose the simplest solution first and eventually iterate from there.

---

PS: This is my very first rust contribution. Sorry for missing any conventions or leaving in rough edges.



Co-authored-by: Jen Tak <goodhoko@gmail.com>
@bors bors bot closed this as completed in 8c63180 Feb 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for newcomers
Projects
None yet
5 participants