feat(stemmer): adds english stemmer #109

Merged
merged 4 commits into from
Aug 29, 2022

Member

@micheleriva micheleriva commented Aug 28, 2022

This PR aims to add an English stemmer to Lyra.
The API is designed in such a way that if a stemmer is not available for a selected language, the stemming process will be skipped.
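The fallback behaviour described above could be sketched roughly as follows. This is only an illustration, not Lyra's actual code: `Stemmer`, `stemmers`, and `stemToken` are hypothetical names, and the "english" entry is a toy stand-in rather than the Porter algorithm.

```typescript
// Hypothetical names throughout; this is not Lyra's real API.
type Stemmer = (word: string) => string;

// Only languages that actually ship a stemmer are registered here.
const stemmers = new Map<string, Stemmer>([
  // Toy stand-in for a real English stemmer: strips a single trailing "s".
  [
    "english",
    (word: string): string =>
      word.length > 3 && word.endsWith("s") ? word.slice(0, -1) : word,
  ],
]);

function stemToken(token: string, language: string): string {
  const stemmer = stemmers.get(language);
  // No stemmer registered for this language: skip stemming entirely.
  return stemmer ? stemmer(token) : token;
}
```

With this shape, adding a language later only means registering one more entry; callers do not need to know whether a stemmer exists.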

The stemming algorithm is based on the following paper: https://tartarus.org/martin/PorterStemmer/def.txt.

A clear example of how stemming works can be found in this file: tap-snapshots/tests/tokenizer.test.ts.test.cjs

Benefits:

  1. Reduces the index size even further
  2. Lucky, Luck, and Luckily share the same meaning and will give the same results

Question: should we add an option to disable stemming?

Update

This PR also adds a list of English stop-words, which get removed because they carry very little meaning in the search context. I will add the possibility to merge, replace, and delete stop-words with a simple API.
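As a rough sketch of what stop-word removal means at tokenization time (the word list and the `removeStopWords` helper below are hypothetical and much shorter than a real list; the customization API mentioned above is not shown because it does not exist yet):

```typescript
// Hypothetical, heavily abbreviated stop-word list for illustration only.
const englishStopWords: ReadonlySet<string> = new Set([
  "a", "an", "and", "the", "of", "to", "is",
]);

// Drops every token that appears in the stop-word set (case-insensitive).
function removeStopWords(
  tokens: string[],
  stopWords: ReadonlySet<string> = englishStopWords,
): string[] {
  return tokens.filter((token) => !stopWords.has(token.toLowerCase()));
}
```

Passing the set as a parameter is one way a caller could later supply a merged or replaced list without touching the default.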

Contributor

@ShogunPanda ShogunPanda left a comment

LGTM!

@@ -0,0 +1,189 @@
const step2List = {
Contributor

Non-blocking comment on this file.

If you can, it would be great to extract all constant regular expressions to the module level and declare them as RegExp objects where possible, so we might get a small performance improvement.
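To illustrate the suggestion, here is a minimal before/after sketch. The pattern and function names are hypothetical and are not taken from the file under review.

```typescript
// Hypothetical example; the pattern is illustrative, not the PR's code.

// Before: the pattern is compiled from a string on every call.
function endsWithDoubleConsonantSlow(word: string): boolean {
  return new RegExp("([^aeiou])\\1$").test(word);
}

// After: the constant pattern is a RegExp literal created once at module level.
const DOUBLE_CONSONANT_RE = /([^aeiou])\1$/;

function endsWithDoubleConsonant(word: string): boolean {
  // No "g" flag, so the shared RegExp carries no lastIndex state between calls.
  return DOUBLE_CONSONANT_RE.test(word);
}
```

Both functions return the same results; the second simply avoids rebuilding the expression each time, which is where any (likely small) gain would come from.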

Contributor

@ShogunPanda ShogunPanda left a comment

A few minor things.

stemmingFn?: (word: string) => string;
};

export type TokenizerConfigExec = {
Contributor

You can streamline this as export type TokenizerConfigExec = Required<TokenizerConfig>

Member Author

Nice catch!

Member Author

Oh, one problem is that stemmingFn can be undefined (to fall back to the default stemming function). Required makes every property mandatory, so it is not compatible with undefined values.
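One possible way to reconcile the two comments, sketched under assumptions: only `stemmingFn` comes from the diff above, while `enableStemming`, `defaultStemmer`, and `resolveStemmer` are hypothetical names used for illustration. The idea is to apply Required to every key except `stemmingFn`.

```typescript
// Hypothetical shape: only `stemmingFn` appears in the diff; the rest is illustrative.
type TokenizerConfig = {
  enableStemming?: boolean;
  stemmingFn?: (word: string) => string;
};

// Every key becomes mandatory except `stemmingFn`, which stays optional.
type TokenizerConfigExec = Required<Omit<TokenizerConfig, "stemmingFn">> &
  Pick<TokenizerConfig, "stemmingFn">;

// Stand-in for the default stemming function (identity, for illustration).
const defaultStemmer = (word: string): string => word;

function resolveStemmer(config: TokenizerConfigExec): (word: string) => string {
  // An undefined stemmingFn falls back to the default stemmer.
  return config.stemmingFn ?? defaultStemmer;
}
```

Whether this is tidier than spelling the type out by hand is a judgment call; it mainly helps if the config grows more optional fields.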

@micheleriva micheleriva merged commit 9f5995d into main Aug 29, 2022
@micheleriva micheleriva deleted the feat/add-english-stemmer branch August 29, 2022 10:24