Update LatinDefaults for lang 'la' #12538

Merged 10 commits on Apr 20, 2023


diyclassics
Contributor

Description

Building on the work of issue #11349, this PR makes the following changes:

  • Adds noun chunking to syntax iterators (with test)
  • Expands list of numeral/ordinal words
  • Expands list of abbreviations in tokenizer exceptions
  • Adds example sentences
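
For readers unfamiliar with how a numeral/ordinal word list is typically used, here is a minimal, self-contained sketch of a `like_num`-style check. It is not the code from this PR: the word sets are short hypothetical samples, and the Roman-numeral regex and the function body are assumptions made for illustration only.

```python
import re

# Hypothetical, heavily abbreviated samples; the lists in the PR are longer.
NUM_WORDS = {"unus", "duo", "tres", "quattuor", "quinque", "decem", "centum", "mille"}
ORDINAL_WORDS = {"primus", "secundus", "tertius", "quartus", "quintus", "decimus"}

# Strict Roman numerals (lowercased input); the lookahead rejects the empty string.
ROMAN_RE = re.compile(
    r"^(?=[ivxlcdm])m*(c[md]|d?c{0,3})(x[cl]|l?x{0,3})(i[xv]|v?i{0,3})$"
)


def like_num(text: str) -> bool:
    """Return True if the token text looks like a number (illustrative only)."""
    text = text.lower()
    if text.isdigit():
        return True
    if ROMAN_RE.match(text):
        return True
    return text in NUM_WORDS or text in ORDINAL_WORDS
```

Note that a surface check like this cannot tell the Roman numeral "vi" (6) from the Latin word "vi" ("by force"); resolving that kind of ambiguity needs context, which a lexical attribute does not have.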

Types of change

Enhancement to lang defaults

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

NB: Noun chunking requires a trained Latin pipeline with POS tagging and dependency parsing, e.g. this pipeline, which I am preparing to submit to the general spaCy trained pipeline offerings: https://huggingface.co/diyclassics/la_core_cltk_md
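
To illustrate why noun chunking depends on POS tags and a dependency parse, here is a rough, library-free sketch of the general idea. It is not the iterator added in this PR: the `Tok` class, the `NP_DEPS` and `MODIFIER_DEPS` label sets, and the hand-annotated sentence are all hypothetical, chosen only to show how chunks fall out of the annotations.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass
class Tok:
    text: str
    pos: str   # coarse POS tag
    dep: str   # dependency label
    head: int  # index of the syntactic head


# Assumed label sets for this sketch only.
NP_DEPS = {"nsubj", "nsubj:pass", "obj", "iobj", "obl", "nmod", "appos"}
MODIFIER_DEPS = {"amod", "det", "nummod"}


def noun_chunks(tokens: List[Tok]) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) spans: a nominal head plus its direct modifiers."""
    prev_end = 0
    for i, tok in enumerate(tokens):
        if tok.pos not in {"NOUN", "PROPN", "PRON"} or tok.dep not in NP_DEPS:
            continue
        members = [i] + [
            j for j, t in enumerate(tokens)
            if t.head == i and t.dep in MODIFIER_DEPS
        ]
        start, end = min(members), max(members) + 1
        if start < prev_end:  # skip spans that overlap an earlier chunk
            continue
        prev_end = end
        yield start, end


# Hand-annotated toy parse of "Caesar magnam urbem cepit".
sentence = [
    Tok("Caesar", "PROPN", "nsubj", 3),
    Tok("magnam", "ADJ", "amod", 2),
    Tok("urbem", "NOUN", "obj", 3),
    Tok("cepit", "VERB", "ROOT", 3),
]
chunks = [" ".join(t.text for t in sentence[s:e]) for s, e in noun_chunks(sentence)]
```

Without tags and heads, none of the conditions above can be evaluated, which is why a blank pipeline yields no chunks. One caveat of this sketch: because Latin word order is free, a modifier can be separated from its head, and a contiguous span would then also cover the intervening tokens.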

@adrianeboyd added the "lang / la" (Latin language data and models) and "v3.6" (Related to v3.6) labels on Apr 19, 2023
Member

@svlandeg left a comment

Thank you so much for these additions, it all looks really good and extensive!

I only had a few small comments. Let me know if you'd like to address them or if you'd like us to do so.

  • spacy/lang/la/syntax_iterators.py (outdated, resolved)
  • spacy/lang/la/syntax_iterators.py (outdated, resolved)
  • spacy/lang/la/tokenizer_exceptions.py (resolved)
  • spacy/lang/la/lex_attrs.py (resolved)

Reorganize la syntax iterators

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
@svlandeg merged commit ab4ba04 into explosion:master on Apr 20, 2023
@diyclassics deleted the la-refactor branch on April 21, 2023 at 12:12
3 participants