Releases · explosion/spaCy

30 Mar 14:15

adrianeboyd

v3.1.6

e147a52

v3.1.6: Workaround for Click/Typer issues

🔴 Bug fixes

Fix issue #10564: Restrict supported Click versions as a workaround for incompatibilities between Click v8.1.0 and Typer v0.4.0.
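
As a rough illustration of what this restriction guards against, the sketch below checks whether a Click version falls in the range named above. The threshold (Click 8.1.0 and later) is an assumption read off this note rather than the exact pin used in spaCy's requirements, and the helper names are hypothetical.

```python
from importlib import metadata
from typing import Optional, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a release string such as '8.0.4' into a 3-part tuple of ints.

    Pre-release suffixes (for example 'rc1') are ignored in this sketch.
    """
    parts = []
    for piece in version.split("."):
        digits = ""
        for char in piece:
            if not char.isdigit():
                break
            digits += char
        if not digits:
            break
        parts.append(int(digits))
    # Pad so that '8.1' compares the same as '8.1.0'.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def click_needs_pin(click_version: str) -> bool:
    """Assumed rule: Click 8.1.0 and later clash with Typer 0.4.0."""
    return parse_version(click_version) >= (8, 1, 0)


def installed_click_version() -> Optional[str]:
    """Return the installed Click version, or None if Click is missing."""
    try:
        return metadata.version("click")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_click_version()
    if found is None:
        print("Click is not installed")
    elif click_needs_pin(found):
        print(f"Click {found} may be incompatible with Typer 0.4.0")
    else:
        print(f"Click {found} is below the assumed problem range")
```

Upgrading to this release is the intended fix; a check like this is only useful for diagnosing an existing environment.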

👥 Contributors

@adrianeboyd, @honnibal, @ines

29 Mar 18:34

adrianeboyd

v3.2.4

b50fe5e

v3.2.4: Workaround for Click/Typer issues

🔴 Bug fixes

Fix issue #10564: Restrict supported Click versions as a workaround for incompatibilities between Click v8.1.0 and Typer v0.4.0.

👥 Contributors

@adrianeboyd, @honnibal, @ines

01 Mar 12:13

adrianeboyd

v3.2.3

99425de

v3.2.3: Fix Tok2Vec for empty batches

🔴 Bug fixes

Fix issue #10324: Fix Tok2Vec for empty batches.

👥 Contributors

@adrianeboyd, @honnibal, @ines

01 Mar 12:13

adrianeboyd

v3.1.5

1355396

v3.1.5: Bug fixes for Tok2Vec, SpanCategorizer, and more

🔴 Bug fixes

Fix issue #9593: Use metaclass to subclass errors for easier pickling.
Fix issue #9654: Fix spancat for empty docs and zero suggestions.
Fix issue #9979: Fix type of Lexeme.rank.
Fix issue #10324: Fix Tok2Vec for empty batches.

👥 Contributors

@adrianeboyd, @BramVanroy, @brucewlee, @danieldk, @honnibal, @ines, @ljvmiranda921, @polm, @svlandeg, @vgautam, @xxyzz

01 Mar 12:12

adrianeboyd

v3.0.8

f55b876

v3.0.8: Fix Tok2Vec for empty batches

🔴 Bug fixes

Fix issue #10324: Fix Tok2Vec for empty batches.

👥 Contributors

@adrianeboyd, @danieldk, @honnibal, @ines

11 Feb 13:12

adrianeboyd

v3.2.2

bbaf41f

v3.2.2: Improved NER and parser speeds, bug fixes and more

✨ New features and improvements

Improved parser and ner speeds on long documents (see technical details in #10019).
Support for spancat components in debug data.
Support for ENT_IOB as a Matcher token pattern key.
Extended and improved types for many classes.
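
A minimal sketch of the new ENT_IOB pattern key, assuming spaCy v3.2.2 or later is installed. The example sentence, entity labels and the rule name "ENT_START" are made up for illustration; the string value "B" relies on the string IOB support listed under the bug fixes for this release.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc, Span

nlp = spacy.blank("en")

# Build a Doc by hand and annotate two entities (illustrative data).
doc = Doc(nlp.vocab, words=["Alice", "Smith", "visited", "Paris"])
doc.ents = [Span(doc, 0, 2, label="PERSON"), Span(doc, 3, 4, label="GPE")]

# Match every token that begins an entity (IOB tag "B").
matcher = Matcher(nlp.vocab)
matcher.add("ENT_START", [[{"ENT_IOB": "B"}]])

starts = [doc[start:end].text for _, start, end in matcher(doc)]
print(starts)
```

The same key can be combined with other token attributes in one pattern, for example to find entity-initial tokens with a particular shape.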

🔴 Bug fixes

Fix issue #9735: Make floret murmurhash endian-neutral.
Fix issue #9738: Support string IOB values for ENT_IOB.
Fix issue #9746: Updates to avoid "dictionary size changed during iteration" runtime errors.
Fix issue #9960: Warn about entities that cross sentence boundaries in debug data.
Fix issue #9979: Fix type for Lexeme.rank.
Fix issue #10026: Check for 0-size assets in spacy project.
Fix issue #10051: Consistently return scalars from similarity methods.
Fix issue #10052: Fix spaces in Doc.from_docs() for empty docs.
Fix issue #10079: Fix label detection in debug data for components with custom names.
Fix issue #10109: Add types to Underscore and DependencyMatcher and improve types in Language, Matcher and PhraseMatcher.
Fix issue #10130: Fix Tokenizer.explain when infixes appear as prefixes.
Fix issue #10143: Use simple suggester in spancat initialization.
Fix issue #10164: Support IS_SENT_END in Doc.has_annotation.
Fix issue #10192: Detect invalid package names in spacy package.
Fix issue #10223: Support mixed case in package names.
Fix issue #10234: Fix type in PhraseMatcher.

📖 Documentation and examples

Various documentation updates.
New spaCy version tags in spaCy universe.
New Dockerfile for repeatable website builds and easier local development.
New additions to spaCy universe:
- Augmenty: a text augmentation library
- Healthsea: an end-to-end spaCy pipeline for exploring health supplement effects
- spacy-wrap: wrap fine-tuned transformers in spaCy pipelines
- spacypdfreader: easy PDF to text to spaCy text extraction
- textnets: text analysis with networks

👥 Contributors

@adrianeboyd, @antonpibm, @ColleterVi, @danieldk, @DuyguA, @ezorita, @HaakonME, @honnibal, @ines, @jboynyc, @KennethEnevoldsen, @ljvmiranda921, @mrshu, @pmbaumgartner, @polm, @ramonziai, @richardpaulhudson, @ryndaniels, @svlandeg, @thiippal, @thomashacker, @yoavxyoav

07 Dec 16:30

adrianeboyd

v3.2.1

800737b

v3.2.1: doc_cleaner component, new Matcher attributes, bug fixes and more

✨ New features and improvements

NEW: doc_cleaner component for removing doc.tensor, doc._.trf_data or other Doc attributes at the end of the pipeline to reduce the size of output docs.
NEW: ENT_ID and ENT_KB_ID added to the Matcher pattern attributes.
Support kb_id for entities in displaCy from Doc input.
Add Span.sents property for spans spanning over more than one sentence.
Add EntityRuler.remove to remove patterns by id.
Make the Tagger neg_prefix configurable.
Use Language.pipe in Language.evaluate for more efficient processing.
Test suite updates: move regression tests into core test modules with pytest markers for issue numbers, extend tests for languages with alpha support.
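
A short sketch of how the doc_cleaner component might be used, assuming spaCy v3.2.1 or later. The "fake_tensor" component is hypothetical and only stands in for a real component (such as a tok2vec or transformer layer) that would normally populate doc.tensor.

```python
import numpy
import spacy
from spacy.language import Language


@Language.component("fake_tensor")  # hypothetical stand-in component
def fake_tensor(doc):
    # Pretend an upstream component stored a per-token tensor.
    doc.tensor = numpy.zeros((len(doc), 4), dtype="float32")
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("fake_tensor")
# Reset doc.tensor to None as the last step of the pipeline.
nlp.add_pipe("doc_cleaner", config={"attrs": {"tensor": None}})

doc = nlp("A short example.")
print(doc.tensor)
```

Placing the cleaner last matters: any component that runs after it could set the attribute again.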

🔴 Bug fixes

Fix issue #9638: Make JsonlCorpus path optional again.
Fix issue #9654: Fix spancat for empty docs and zero suggestions.
Fix issue #9658: Improve error message for incorrect .jsonl paths in EntityRuler.
Fix issue #9674: Fix language-specific factory handling in package CLI.
Fix issue #9694: Convert labels to strings for README in package CLI.
Fix issue #9697: Exclude strings from source vector checks.
Fix issue #9701: Allow Scorer.score_spans to handle predicted docs with missing annotation.
Fix issue #9722: Initialize parser from reference parse rather than aligned example.
Fix issue #9764: Set annotations more efficiently in tagger and morphologizer.

📖 Documentation and examples

Various documentation updates: init_tok2vec after pretraining, batch contract for listeners.
New additions to the spaCy universe:
- eng-spacysentiment: Sentiment analysis for English.
- Applied Language Technology course: NLP for newcomers using spaCy and Stanza.

👥 Contributors

@adrianeboyd, @danieldk, @DuyguA, @honnibal, @ines, @ljvmiranda921, @narayanacharya6, @nrodnova, @Pantalaymon, @polm, @richardpaulhudson, @svlandeg, @thiippal, @Vishnunkumar

05 Nov 15:54

adrianeboyd

v3.2.0

0fc3dee

v3.2.0: Registered scoring functions, Doc input, floret vectors and more

✨ New features and improvements

NEW: Registered scoring functions for each component in the config.
NEW: nlp() and nlp.pipe() accept Doc input, which simplifies setting custom tokenization or extensions before processing.
NEW: Support for floret vectors, which combine fastText subwords with Bloom embeddings for compact, full-coverage vectors.
New overwrite config settings for entity_linker, morphologizer, tagger, sentencizer and senter.
New extend config setting for morphologizer that controls whether existing feature types are preserved.
Support for a wider range of language codes in spacy.blank() including IETF language tags, for example fra for French and zh-Hans for Chinese.
New package spacy-loggers for additional loggers.
New Irish lemmatizer.
New Portuguese noun chunks and updated Spanish noun chunks.
Language updates for Bulgarian, Catalan, Sinhala, Tagalog, Tigrinya and Vietnamese.
Japanese reading and inflection from sudachipy are annotated as Token.morph features.
Additional morph_micro_p/r/f scores for morphological features from Scorer.score_morph_per_feat().
LIKE_URL attribute includes the tokenizer URL pattern.
--n-save-epoch option for spacy pretrain.
Trained pipelines:
- New transformer pipeline for Japanese ja_core_news_trf, thanks to @hiroshi-matsuda-rit and the spaCy Japanese community!
- Updates for Catalan data, tokenizer and lemmatizer, thanks to @cayorodriguez, Carme Armentano and @TeMU-BSC!
- Transformer pipelines are trained using spacy-transformers v1.1, with improved IO and more options for model config and output.
- Universal Dependencies corpora updated to v2.8.
- Trailing space added as a tok2vec feature, improving the performance for many components, especially fine-grained tagging and sentence segmentation.
- English attribute ruler patterns updated to improve Token.pos and Token.morph.

For more details, see the New in v3.2 usage guide.
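
Two of the items above can be sketched briefly, assuming spaCy v3.2.0 or later: passing a pre-built Doc to nlp() to keep custom tokenization, and creating a blank pipeline from a three-letter language code. The example words are made up for illustration.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Custom tokenization: "New York" is kept as a single token.
words = ["New York", "is", "big"]
doc = Doc(nlp.vocab, words=words)

# nlp() accepts the Doc directly and runs the pipeline components on it
# without re-tokenizing.
processed = nlp(doc)
print([token.text for token in processed])

# spacy.blank() resolves the code "fra" to the French language class.
french = spacy.blank("fra")
print(french.lang)
```

The same Doc input works with nlp.pipe(), which is convenient when extensions need to be set on each Doc before processing.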

🔴 Bug fixes

Fix issue #8972: Fix pickling for Japanese, Korean and Vietnamese tokenizers.
Fix issue #9032: Retain alignment between doc and context for Language.pipe(as_tuples=True) for multiprocessing with custom error handlers.
Fix issue #9136: Ignore prefixes when applying suffix patterns in Tokenizer.
Fix issue #9584: Use metaclass to subclass errors to allow better pickling.

⚠️ Backwards incompatibilities

In the Tokenizer, prefixes are now removed before suffix matches are applied, which may lead to minor differences in the output. In particular, the default tokenization of °[cfk]. is now ° c . instead of ° c. for most languages.
The tokenizer classes ChineseTokenizer, JapaneseTokenizer, KoreanTokenizer, ThaiTokenizer and VietnameseTokenizer require Vocab rather than Language in __init__.
In DocBin, user data is now always serialized according to the store_user_data option, see #9190.
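
To illustrate the DocBin change, here is a minimal round trip that opts in to user data on both the writing and the reading side, assuming spaCy v3.2.0 or later. The "source" key and its value are arbitrary illustrative metadata.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc = nlp("Store me with my metadata.")
doc.user_data["source"] = "example.txt"  # arbitrary illustrative metadata

# Serialize with user data enabled.
doc_bin = DocBin(store_user_data=True)
doc_bin.add(doc)
data = doc_bin.to_bytes()

# Enable the same option when reading so the user data is restored.
restored_bin = DocBin(store_user_data=True).from_bytes(data)
restored = list(restored_bin.get_docs(nlp.vocab))[0]
print(restored.user_data.get("source"))
```

Setting store_user_data explicitly on both sides avoids relying on defaults that differ between versions.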

📖 Documentation and examples

Demo projects for floret vectors:
- pipelines/floret_vectors_demo: basic floret vector training and importing.
- pipelines/floret_fi_core_demo: Finnish UD+NER vector and pipeline training, comparing standard vs. floret vectors.
- pipelines/floret_ko_ud_demo: Korean UD vector and pipeline training, comparing standard vs. floret vectors.

👥 Contributors

@adrianeboyd, @Avi197, @baxtree, @BramVanroy, @cayorodriguez, @DuyguA, @fgaim, @honnibal, @ines, @Jette16, @jimregan, @polm, @rspeer, @rumeshmadhusanka, @svlandeg, @syrull, @thomashacker

29 Oct 14:14

svlandeg

v3.1.4

006df1a

v3.1.4: Python 3.10 wheels and support for AppleOps

✨ New features and improvements

NEW: Binary wheels for Python 3.10.
NEW: Improve performance on Apple M1 with AppleOps: pip install spacy[apple].
GPU profiling with spacy.models_with_nvtx_range.v1.
Full mypy integration in the CI and many type fixes across the code base.
Added custom Protocol classes in ty.py to define behavior of pipeline components.
Support for entity linking visualization in displacy.
Allow overriding vars in spacy project assets.
Standalone train function to run the training from Python scripts just like the spacy train CLI.
Support for spacy-transformers>=1.1.0 with improved IO.
Support for thinc>=8.0.11 with improved gradient clipping.

🔴 Bug fixes

Fix issue #5507: Improve UX for multiprocessing on GPU.
Fix issue #9137: Fix serialization for KnowledgeBase.set_entities.
Fix issue #9244: Fix vectors for 0-length spans.
Fix issue #9247: Improve UX for the DocBin constructor.
Fix issue #9254: Allow unicode in a spacy project title.
Fix issue #9263: Make added patterns consistent in the DependencyMatcher.
Fix issue #9305: Restore tokenization timing during evaluation.
Fix issue #9335: Sync vocab in vectors and sourced components.
Fix issue #9387: Ensure lemmas are consistent for Catalan, Dutch, French, Russian and Ukrainian.
Fix issue #9404: Create consistent default textcat and textcat_multilabel configurations.
Fix issue #9437: Improve UX around Doc object creation.
Fix issue #9465: Fix minor issues with convert CLI.
Fix issue #9500: Include .pyi files in the distributed package.

📖 Documentation and examples

Various updates to the documentation.
New additions to the spaCy universe:
- deplacy: CUI-based dependency visualizer
- ipymarkup: Visualizations for NER and syntax trees
- PhruzzMatcher: Find fuzzy matches
- spacy-huggingface-hub: Push spaCy pipelines to the Hugging Face Hub
- spaCyOpenTapioca: Entity Linking on Wikidata
- spacy-clausie: Clause-based information extraction system
- "Applied Natural Language Processing in the Enterprise": Book by Ankur A. Patel
- "Introduction to spaCy 3": Free course by Dr. W.J.B. Mattingly

👥 Contributors

@adrianeboyd, @connorbrinton, @danieldk, @DuyguA, @honnibal, @ines, @Jette16, @ljvmiranda921, @mjvallone, @philipvollet, @polm, @rspeer, @ryndaniels, @shigapov, @svlandeg, @thomashacker

20 Sep 12:06

svlandeg

v3.1.3

8bda39f

v3.1.3: Bug fixes and UX updates

✨ New features and improvements

v3 of the WandbLogger now supports optional run_name and entity parameters.
Improved UX when providing invalid pos values for a Doc or Token.

🔴 Bug fixes

Fix issue #9001: Pass alignments to Matcher callbacks.
Fix issue #9009: Include component factories in third-party dependencies resolver.
Fix issue #9012: Correct type of config in create_pipe.
Fix issue #9014: Allow typer 0.4 to provide support for both Click 7 and Click 8.
Fix issue #9033: Fix verbs list for French tokenizer exceptions.
Fix issue #9059: Pass overrides to subcommands in spacy project workflows.
Fix issue #9074: Improve UX around repo and path arguments in spacy project.
Fix issue #9084: Fix inference of epoch_resume in spacy pretrain.
Fix issue #9163: Handle spacy-legacy in spacy package dependency detection.
Fix issue #9211: Include only runtime-relevant dependencies in spacy package.

📖 Documentation and examples

Various updates to the documentation.
A few additions and updates to the spaCy universe.
Extended the developer documentation with information about the listener pattern, the StringStore and the Vocab.

👥 Contributors

@adrianeboyd, @davidefiocco, @davidstrouk, @filipematos95, @honnibal, @ines, @j-frei, @Joozty, @kwhumphreys, @mjhajharia, @mylibrar, @polm, @rspeer, @shigapov, @svlandeg, @thomashacker
