extract Scorer #85

Open
sam-writer opened this issue Sep 25, 2020 · 1 comment
Labels
enhancement New feature or request

Comments

@sam-writer
Contributor

KenLMScorer is fantastic. Just so useful. However, it isn't core to replaCy and should be a separately installable custom pipeline component. We'd still expect most people to use it: think of what en_core_web_sm is for spaCy, a separate installation that nonetheless appears in all the docs.

After extraction, I think using our current pipeline should look like this:

import en_core_web_sm
from replacy import ReplaceMatcher
from replacy.components import MaxCountFilter
# replacy_kenlm_scorer would be the new, separately installed package
from replacy_kenlm_scorer import KenLMScorer
from spacy.util import filter_spans


replaCy = ReplaceMatcher(en_core_web_sm.load(), etc...)
# order: drop overlapping spans, then score, then cap the number of suggestions
replaCy.add_pipe("span_filter", filter_spans, first=True)
replaCy.add_pipe("scorer", KenLMScorer(model_or_path), after="span_filter")
replaCy.add_pipe("max_count_filter", MaxCountFilter(defaults...), after="scorer")
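
For a feel of how small the extracted component could be, here is a rough sketch. It assumes a pipeline component is just a callable that takes a list of candidates and returns a list; `Suggestion` and `SketchScorer` are hypothetical names used only for illustration, not the real replaCy or KenLMScorer API. The scoring function is injected so the sketch runs without KenLM installed.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Suggestion:
    """Hypothetical stand-in for one replacement candidate."""
    text: str
    score: float = 0.0


class SketchScorer:
    """Hypothetical scorer component; not the real KenLMScorer."""

    def __init__(self, score_fn: Callable[[str], float]):
        # With KenLM installed, score_fn could be kenlm.Model(path).score,
        # which returns a log10 probability (higher means more fluent).
        self.score_fn = score_fn

    def __call__(self, suggestions: List[Suggestion]) -> List[Suggestion]:
        for s in suggestions:
            s.score = self.score_fn(s.text)
        # Best-scoring candidates first, so a later max-count filter can truncate.
        return sorted(suggestions, key=lambda s: s.score, reverse=True)


# Toy scoring function for illustration only: shorter text scores higher.
scorer = SketchScorer(lambda text: -float(len(text)))
ranked = scorer([Suggestion("a long one"), Suggestion("short")])
print([s.text for s in ranked])
```

Keeping the language model behind a plain callable like this is one way the core package could avoid depending on KenLM at all.
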
@sam-writer sam-writer added the enhancement New feature or request label Sep 25, 2020
@sam-writer
Contributor Author

sam-writer commented Sep 25, 2020

This component should ship with the biggest KenLM model we can fit in while still being allowed on PyPI... but we could also include instructions to curl -O 'https://master.dl.sourceforge.net/project/openccg/data/gigaword4.5g.kenlm.bin', or even wrap that in a helper:

from replacy_kenlm_scorer import KenLMScorer

klm = KenLMScorer.download_gigaword()
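
A minimal sketch of what such a helper might do under the hood, assuming it caches the file locally and skips the download when the file is already there. The standalone function, the cache directory, and the `fetch` parameter are assumptions made for illustration; unlike the call above, this sketch returns the path to the model file rather than a scorer object.

```python
import os
import urllib.request
from pathlib import Path

GIGAWORD_URL = (
    "https://master.dl.sourceforge.net/project/openccg/data/"
    "gigaword4.5g.kenlm.bin"
)


def download_gigaword(cache_dir=None, fetch=urllib.request.urlretrieve):
    """Download the gigaword KenLM binary once and return its local path.

    cache_dir and fetch are hypothetical parameters: the default cache
    location is an assumption, and fetch is injectable so the function
    can be exercised without pulling a multi-gigabyte file.
    """
    cache_dir = Path(cache_dir or Path.home() / ".cache" / "replacy_kenlm_scorer")
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / GIGAWORD_URL.rsplit("/", 1)[-1]
    if not target.exists():
        # Download to a temporary name first so an interrupted transfer
        # is never mistaken for a complete model file.
        tmp = target.with_suffix(".part")
        fetch(GIGAWORD_URL, str(tmp))
        os.replace(tmp, target)
    return target
```

A classmethod like KenLMScorer.download_gigaword() could then call this and pass the returned path to the scorer's constructor.
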
