Update keyterms module #257

Merged: bdewilde merged 41 commits into develop from update-keyterms-module on Jul 14, 2019

bdewilde (Collaborator) commented Jul 7, 2019

Description

  • Move keyterm extraction functionality from the top-level keyterms module into a ke sub-package, and refactor and standardize its contents
    • all methods have similar args/options, and share code for selecting candidates, normalizing terms to strings, filtering to just the top-N key terms, and building term graphs
  • Add new unsupervised keyterm extraction algorithms
    • YAKE: statistical method, implemented in ke.yake()
    • sCAKE: graph-based method, implemented in ke.scake()
    • PositionRank: graph-based method, implemented in ke.textrank() with parameter values given in the docstring
  • Add new functionality for selecting candidate keyterms (in addition to n-grams method)
    • longest matching subsequence candidates: implemented in ke.utils.get_longest_subsequence_candidates()
    • pattern-matching candidates: implemented in ke.utils.get_pattern_matching_candidates()
  • Significantly improve speed of SGRank and generally optimize all of these algorithms
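
The shared steps named in the first bullet (selecting candidates, normalizing terms to strings, keeping only the top-N) can be sketched in plain Python. This is not the code from this PR: every name here (`ngram_candidates`, `longest_subsequence_candidates`, `normalize`, `top_n_terms`, `STOP_WORDS`) is hypothetical, tokens are plain strings rather than spaCy tokens, and a simple frequency count stands in for a real scoring algorithm.

```python
from collections import Counter
from itertools import groupby

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "and", "in", "for", "to", "is"}


def is_content_word(token):
    """Return True if the token is not a stop word."""
    return token.lower() not in STOP_WORDS


def ngram_candidates(tokens, ns=(1, 2, 3)):
    """Yield n-grams (as tuples) that neither start nor end with a stop word."""
    for n in ns:
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i : i + n])
            if is_content_word(gram[0]) and is_content_word(gram[-1]):
                yield gram


def longest_subsequence_candidates(tokens, match_func):
    """Yield maximal runs of consecutive tokens for which match_func is True."""
    for is_match, run in groupby(tokens, key=match_func):
        if is_match:
            yield tuple(run)


def normalize(candidate):
    """Turn a candidate (tuple of tokens) into a lowercased string."""
    return " ".join(token.lower() for token in candidate)


def top_n_terms(scores, topn=10):
    """Return the topn (term, score) pairs, highest score first, ties broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:topn]


tokens = "Key terms summarize a document and key terms aid search".split()

# Stand-in scoring: count how often each normalized candidate occurs.
counts = Counter(normalize(c) for c in ngram_candidates(tokens, ns=(1, 2)))
print(top_n_terms(counts, topn=3))
print(list(longest_subsequence_candidates(tokens, is_content_word)))
```

Keeping candidate selection, normalization, and top-N filtering as separate functions is what lets several scoring algorithms share them, which is the kind of standardization the description refers to.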

Motivation and Context

Still hunting for the "perfect" unsupervised keyterm extraction algorithm, although all of these methods have pros/cons. A lit review of recent results pointed me towards YAKE and sCAKE.

How Has This Been Tested?

Added lots of tests, and they all pass.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation, and I have updated it accordingly.

bdewilde added 30 commits July 5, 2019 11:08
now easier to read _and_ slightly faster
users don't need to worry about this, just make it sensible
also fix bug in calculation: sum rather than average constituent word scores, resulting in a preference for longer key terms, which are usually more interpretable
we'll figure out what to do with it later...
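
The commit note above about summing rather than averaging constituent word scores can be illustrated with a small sketch. The word scores, candidate terms, and function names below are made up for illustration, and the note does not say which algorithm the fix applies to; the point is only that averaging never rewards added words, while summing does.

```python
def score_term(term, word_scores, how="sum"):
    """Combine the scores of a term's constituent words by summing or averaging."""
    scores = [word_scores[word] for word in term.split()]
    total = sum(scores)
    return total if how == "sum" else total / len(scores)


def best_term(candidates, word_scores, how):
    """Return the candidate with the highest combined score."""
    return max(candidates, key=lambda term: score_term(term, word_scores, how))


# Hypothetical per-word scores, e.g. from a graph-ranking step.
word_scores = {"keyterm": 0.4, "extraction": 0.3, "algorithm": 0.2}
candidates = ["keyterm", "keyterm extraction", "keyterm extraction algorithm"]

for how in ("average", "sum"):
    print(how, "->", best_term(candidates, word_scores, how))
```

With averaging, the single highest-scoring word wins because every lower-scoring word pulls the mean down; with summing, each extra word adds to the total, so the longer term ranks first.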
bdewilde marked this pull request as ready for review July 14, 2019 19:57
bdewilde merged commit 3888753 into develop Jul 14, 2019
bdewilde deleted the update-keyterms-module branch July 14, 2019 20:42