-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
EstNLTK analyzer #818
EstNLTK analyzer #818
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #818 +/- ##
=======================================
Coverage 99.63% 99.63%
=======================================
Files 93 95 +2
Lines 7141 7170 +29
=======================================
+ Hits 7115 7144 +29
Misses 26 26 ☔ View full report in Codecov by Sentry. |
The test coverage is not 100%. I think that's due to how optional backends are handled in the initialization code, that I copied from the spaCy analyzer. The same problem is already there for spaCy, so I think it would make sense to fix that first in main, then apply the same solution in this PR. |
d2a0051
to
66f577d
Compare
The initialization problem was fixed for spaCy in PR #820, already merged to main. This still needs wiki documentation, maybe also a mention in the Annif tutorial. |
I added a brief section about this analyzer on the wiki page for Analyzers. I don't see any mention of specific analyzers in the Annif tutorial, so I don't think this needs to be mentioned there. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
I added an estnltk section to the Optional features page of wiki too (and updated the whole page to use poetry instead of pip for installing dependencies when using dev installation). |
I realized that EstNLTK is GPL licensed, so probably should be mentioned alongside YAKE when we talk about licensing. Need to fix that before merging this PR. |
Quality Gate passedIssues Measures |
I asked the EstNLTK developers about their thoughts on license compatibility. They were very positive about the Annif integration. Now that the licensing situation is at least somewhat clarified in the README, I don't think there are obstacles for merging this, so I will do it now. |
This PR adds a new analyzer to support lemmatization using EstNLTK, a natural language analysis toolkit for the Estonian language.
Note that the indirect dependencies of EstNLTK are quite large, with around ~500MB of libraries.