EstNLTK analyzer #818

Merged
merged 6 commits into from
Dec 20, 2024


osma
Member

@osma osma commented Nov 12, 2024

This PR adds a new analyzer to support lemmatization using EstNLTK, a natural language analysis toolkit for the Estonian language.

Note that the indirect dependencies of EstNLTK are quite large: around 500 MB of libraries.
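For readers unfamiliar with EstNLTK, here is a minimal sketch of what lemmatization with it can look like. It is not the code from this PR. The helper names `first_lemmas` and `lemmatize` are hypothetical, and the sketch assumes an EstNLTK 1.6+ style API in which `Text.tag_layer` adds a `morph_analysis` layer whose `lemma` attribute holds one list of candidate lemmas per word.

```python
from __future__ import annotations


def first_lemmas(ambiguous_lemmas) -> list[str]:
    """Pick the first candidate lemma for each word.

    Morphological analysis can be ambiguous, so each word may come
    with several candidate lemmas; this sketch simply keeps the first.
    """
    return [candidates[0] for candidates in ambiguous_lemmas]


def lemmatize(text: str) -> list[str]:
    """Lemmatize Estonian text (hypothetical helper, requires EstNLTK)."""
    # Imported lazily so that this module can be loaded even when the
    # large optional dependency is not installed.
    from estnltk import Text

    doc = Text(text.strip())
    # Assumption: tagging "morph_analysis" also creates the layers it
    # depends on (words, sentences, ...).
    doc.tag_layer(["morph_analysis"])
    return first_lemmas(doc.morph_analysis.lemma)
```

Keeping only the first candidate is a simplification; a real analyzer might also filter out punctuation and very short tokens.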

@osma osma self-assigned this Nov 12, 2024

codecov bot commented Nov 12, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.63%. Comparing base (d907024) to head (407a318).
Report is 10 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #818   +/-   ##
=======================================
  Coverage   99.63%   99.63%           
=======================================
  Files          93       95    +2     
  Lines        7141     7170   +29     
=======================================
+ Hits         7115     7144   +29     
  Misses         26       26           


@osma
Member Author

osma commented Nov 21, 2024

The test coverage is not 100%. I think that's due to how optional backends are handled in the initialization code, which I copied from the spaCy analyzer. The same problem already exists for spaCy, so I think it would make sense to fix that first in main, then apply the same solution in this PR.
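As an illustration of the kind of problem described here, one common way to handle an optional dependency without a `try`/`except ImportError` branch (which is only partly exercised in any single test environment) is to check availability at runtime. This is a sketch under that assumption and not necessarily what Annif does; `is_available` and `available_analyzers` are hypothetical names.

```python
import importlib.util


def is_available(module_name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def available_analyzers(candidates: dict[str, str]) -> list[str]:
    """Return the names of analyzers whose required module is installed.

    `candidates` maps an analyzer name to the top-level module it needs.
    """
    return sorted(
        name for name, module in candidates.items() if is_available(module)
    )
```

Because the same lines run whether or not the optional package is installed, both outcomes can be tested by passing in module names that are known to exist or known to be missing.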

@osma osma force-pushed the feature-estnltk-analyzer branch from d2a0051 to 66f577d Compare November 22, 2024 13:23
@osma osma marked this pull request as ready for review November 22, 2024 13:28
@osma osma changed the title [WIP] EstNLTK analyzer EstNLTK analyzer Nov 22, 2024
@osma osma requested a review from juhoinkinen November 22, 2024 13:28
@osma
Member Author

osma commented Nov 22, 2024

The initialization problem was fixed for spaCy in PR #820, already merged to main.
I rebased this PR branch and adapted it accordingly. I think all is now well.

This still needs wiki documentation, maybe also a mention in the Annif tutorial.

@osma
Member Author

osma commented Nov 22, 2024

I added a brief section about this analyzer on the wiki page for Analyzers.

I don't see any mention of specific analyzers in the Annif tutorial, so I don't think this needs to be mentioned there.

Member

@juhoinkinen juhoinkinen left a comment


👍

@juhoinkinen
Member

I added an estnltk section to the Optional features page of the wiki too (and updated the whole page to use poetry instead of pip for installing dependencies in a dev installation).

@osma
Member Author

osma commented Nov 25, 2024

I realized that EstNLTK is GPL licensed, so it should probably be mentioned alongside YAKE when we talk about licensing. I need to fix that before merging this PR.

@osma
Member Author

osma commented Dec 20, 2024

I asked the EstNLTK developers about their thoughts on license compatibility. They were very positive about the Annif integration. Now that the licensing situation is at least somewhat clarified in the README, I don't think there are obstacles to merging this, so I will do it now.

@osma osma merged commit 8f13d7d into main Dec 20, 2024
17 checks passed
@osma osma deleted the feature-estnltk-analyzer branch December 20, 2024 10:31