feat(stemmer): adds english stemmer #109
LGTM!
@@ -0,0 +1,189 @@
const step2List = { |
Non-blocking comment on this file.
If you can, it would be great to extract all constant RegExp expressions to the module level and declare them as RegExp literals where possible, so that we might get a small performance improvement.
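A minimal sketch of what this suggestion could look like. The constant names, the `step1a` function, and the patterns are hypothetical and only loosely follow step 1a of the Porter algorithm; they are not taken from the PR's file.

```typescript
// Hypothetical example: names and patterns are illustrative, not the PR's code.

// Declared once at module level as RegExp literals, so they are
// created when the module loads rather than on every call.
const ENDS_WITH_SSES = /sses$/;
const ENDS_WITH_IES = /ies$/;

function step1a(word: string): string {
  // "sses" -> "ss", e.g. caresses -> caress
  if (ENDS_WITH_SSES.test(word)) return word.slice(0, -2);
  // "ies" -> "i", e.g. ponies -> poni
  if (ENDS_WITH_IES.test(word)) return word.slice(0, -2);
  return word;
}
```

The patterns carry no `g` flag, so sharing them across calls does not leak `lastIndex` state between invocations.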
A few minor things.
stemmingFn?: (word: string) => string;
};

export type TokenizerConfigExec = {
You can streamline this as export type TokenizerConfigExec = Required<TokenizerConfig>
Nice catch!
Oh, one problem is that stemmingFn can be undefined (to fall back to the default stemming function). Required is not compatible with undefined values.
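One possible way around this, sketched under assumptions: only `stemmingFn` comes from the diff, while the `enableStemming` field, the `defaultStem` placeholder, and the `resolveStemmer` helper are hypothetical. The idea is to require every field except `stemmingFn`, which stays optional so the default can still be used.

```typescript
// Assumed shape: only `stemmingFn` appears in the diff; `enableStemming` is hypothetical.
type TokenizerConfig = {
  enableStemming?: boolean;
  stemmingFn?: (word: string) => string;
};

// Required<T> removes the `?` modifier from every property, so
// Required<TokenizerConfig> would force callers to always pass `stemmingFn`.
// Here everything is required except `stemmingFn`, which keeps its `?`.
type TokenizerConfigExec = Required<Omit<TokenizerConfig, "stemmingFn">> &
  Pick<TokenizerConfig, "stemmingFn">;

// Placeholder default, not the real stemmer.
const defaultStem = (word: string): string => word.toLowerCase();

// Hypothetical helper: fall back to the default when no function is given.
function resolveStemmer(config: TokenizerConfigExec): (word: string) => string {
  return config.stemmingFn ?? defaultStem;
}
```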
This PR aims to add an English stemmer to Lyra.
The API is designed in such a way that if a stemmer is not available for a selected language, the stemming process will be skipped.
The stemming algorithm is based on the following paper: https://tartarus.org/martin/PorterStemmer/def.txt.
A clear example of how stemming works can be found in this file: tap-snapshots/tests/tokenizer.test.ts.test.cjs
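To illustrate the "skip stemming when no stemmer is available" behaviour described above, here is a rough sketch. The `stemmers` registry, the `normalizeToken` function, and the toy English rule are all hypothetical; the toy rule is not the Porter algorithm and this is not Lyra's actual API.

```typescript
type Stemmer = (word: string) => string;

// Hypothetical registry of stemmers keyed by language name.
const stemmers: Partial<Record<string, Stemmer>> = {
  // Toy placeholder that strips a few common suffixes; NOT the Porter stemmer.
  english: (word) => word.replace(/(ing|ed|s)$/, ""),
};

// Stems the token if a stemmer exists for the language;
// otherwise returns the token unchanged.
function normalizeToken(token: string, language: string): string {
  const stem = stemmers[language];
  return stem ? stem(token) : token;
}
```

With this shape, adding a language later only means registering one more entry; callers do not need to know which languages are supported.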
Benefits:
Lucky, Luck, and Luckily share the same meaning and will give the same results.
Question: should we add an option to disable stemming?
Update
This PR also adds a list of English stop-words, which get removed because they carry very little meaning in the search context. I will add the ability to merge, replace, and delete stop-words through a simple API.
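As a rough illustration of stop-word removal, assuming tokens arrive as an array of strings: the `STOP_WORDS` set below is a tiny made-up subset and `removeStopWords` is a hypothetical name, not the list or function shipped in this PR.

```typescript
// Tiny illustrative subset; the real list in the PR is assumed to be much longer.
const STOP_WORDS: ReadonlySet<string> = new Set([
  "a", "an", "and", "the", "of", "to", "in", "is",
]);

// Drops every token that appears in the stop-word set (case-insensitive).
// A custom set can be passed in, which is one way a replace option could work.
function removeStopWords(
  tokens: string[],
  stopWords: ReadonlySet<string> = STOP_WORDS,
): string[] {
  return tokens.filter((token) => !stopWords.has(token.toLowerCase()));
}
```

Using a `Set` keeps each lookup constant-time, which matters when the list grows to a few hundred entries.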