Releases: explosion/spaCy
v3.1.6: Workaround for Click/Typer issues
🔴 Bug fixes
- Fix issue #10564: Restrict supported Click versions as a workaround for incompatibilities between Click v8.1.0 and Typer v0.4.0.
👥 Contributors
v3.2.4: Workaround for Click/Typer issues
🔴 Bug fixes
- Fix issue #10564: Restrict supported Click versions as a workaround for incompatibilities between Click v8.1.0 and Typer v0.4.0.
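The workaround for #10564 comes down to a version bound on Click. A minimal sketch of checking a version string against such a bound follows. It assumes the bound is "anything before 8.1.0" (the note names 8.1.0 as the incompatible release but does not give the exact specifier), and `parse_release` and `click_is_supported` are hypothetical helpers, not part of spaCy, Click or Typer.

```python
from typing import Tuple


def parse_release(version: str) -> Tuple[int, ...]:
    """Turn a plain release string such as "8.0.4" into (8, 0, 4).

    Only purely numeric dotted versions are handled; pre-release
    suffixes are out of scope for this sketch.
    """
    parts = [int(part) for part in version.split(".")]
    # Pad short versions ("8.1" -> 8.1.0) so tuple comparison is reliable.
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def click_is_supported(version: str) -> bool:
    # Assumed bound: every release before 8.1.0 counts as compatible.
    return parse_release(version) < (8, 1, 0)


for candidate in ("7.1.2", "8.0.4", "8.1.0"):
    print(candidate, click_is_supported(candidate))
```

In a real project a bound like this would normally be declared in the dependency metadata rather than checked at runtime.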
👥 Contributors
v3.2.3: Fix Tok2Vec for empty batches
v3.1.5: Bug fixes for Tok2Vec, SpanCategorizer, and more
🔴 Bug fixes
- Fix issue #9593: Use metaclass to subclass errors for easier pickling.
- Fix issue #9654: Fix `spancat` for empty docs and zero suggestions.
- Fix issue #9979: Fix type of `Lexeme.rank`.
- Fix issue #10324: Fix `Tok2Vec` for empty batches.
👥 Contributors
@adrianeboyd, @BramVanroy, @brucewlee, @danieldk, @honnibal, @ines, @ljvmiranda921, @polm, @svlandeg, @vgautam, @xxyzz
v3.0.8: Fix Tok2Vec for empty batches
v3.2.2: Improved NER and parser speeds, bug fixes and more
✨ New features and improvements
- Improved `parser` and `ner` speeds on long documents (see technical details in #10019).
- Support for `spancat` components in `debug data`.
- Support for `ENT_IOB` as a `Matcher` token pattern key.
- Extended and improved types for many classes.
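The `ENT_IOB` pattern key can be sketched as follows. The example assumes a spaCy version that includes this release, builds a `Doc` by hand with IOB tags so that no trained pipeline is needed, and uses the string value `"B"` (string values are covered by the fix for #9738). The rule name `ENTITY_START` and the sample sentence are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-annotated document: one IOB tag per token.
words = ["Ada", "Lovelace", "visited", "London"]
ents = ["B-PERSON", "I-PERSON", "O", "B-GPE"]
doc = Doc(nlp.vocab, words=words, ents=ents)

# Match every token that begins an entity.
matcher = Matcher(nlp.vocab)
matcher.add("ENTITY_START", [[{"ENT_IOB": "B"}]])

# Sort by start offset so the result order does not depend on the matcher.
matches = sorted(matcher(doc), key=lambda match: match[1])
entity_starts = [doc[start:end].text for _, start, end in matches]
print(entity_starts)
```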
🔴 Bug fixes
- Fix issue #9735: Make floret murmurhash endian-neutral.
- Fix issue #9738: Support string IOB values for `ENT_IOB`.
- Fix issue #9746: Updates to avoid "dictionary size changed during iteration" runtime errors.
- Fix issue #9960: Warn about entities that cross sentence boundaries in `debug data`.
- Fix issue #9979: Fix type for `Lexeme.rank`.
- Fix issue #10026: Check for 0-size assets in `spacy project`.
- Fix issue #10051: Consistently return scalars from similarity methods.
- Fix issue #10052: Fix spaces in `Doc.from_docs()` for empty docs.
- Fix issue #10079: Fix label detection in `debug data` for components with custom names.
- Fix issue #10109: Add types to `Underscore` and `DependencyMatcher` and improve types in `Language`, `Matcher` and `PhraseMatcher`.
- Fix issue #10130: Fix `Tokenizer.explain` when infixes appear as prefixes.
- Fix issue #10143: Use simple suggester in `spancat` initialization.
- Fix issue #10164: Support `IS_SENT_END` in `Doc.has_annotation`.
- Fix issue #10192: Detect invalid package names in `spacy package`.
- Fix issue #10223: Support mixed case in package names.
- Fix issue #10234: Fix type in `PhraseMatcher`.
📖 Documentation and examples
- Various documentation updates.
- New spaCy version tags in spaCy universe.
- New `Dockerfile` for repeatable website builds and easier local development.
- New additions to spaCy universe:
  - Augmenty: a text augmentation library
  - Healthsea: an end-to-end spaCy pipeline for exploring health supplement effects
  - spacy-wrap: wrap fine-tuned transformers in spaCy pipelines
  - spacypdfreader: easy PDF to text to spaCy text extraction
  - textnets: text analysis with networks
👥 Contributors
@adrianeboyd, @antonpibm, @ColleterVi, @danieldk, @DuyguA, @ezorita, @HaakonME, @honnibal, @ines, @jboynyc, @KennethEnevoldsen, @ljvmiranda921, @mrshu, @pmbaumgartner, @polm, @ramonziai, @richardpaulhudson, @ryndaniels, @svlandeg, @thiippal, @thomashacker, @yoavxyoav
v3.2.1: doc_cleaner component, new Matcher attributes, bug fixes and more
✨ New features and improvements
- NEW: `doc_cleaner` component for removing `doc.tensor`, `doc._.trf_data` or other `Doc` attributes at the end of the pipeline to reduce size of output docs.
- NEW: `ENT_ID` and `ENT_KB_ID` to `Matcher` pattern attributes.
- Support `kb_id` for entities in displaCy from `Doc` input.
- Add `Span.sents` property for spans spanning over more than one sentence.
- Add `EntityRuler.remove` to remove patterns by `id`.
- Make the `Tagger` `neg_prefix` configurable.
- Use `Language.pipe` in `Language.evaluate` for more efficient processing.
- Test suite updates: move regression tests into core test modules with pytest markers for issue numbers, extend tests for languages with alpha support.
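As an illustration of `EntityRuler.remove`, the sketch below adds two patterns with IDs to a blank English pipeline and then removes one of them. It assumes a spaCy version that includes this release; the labels, pattern IDs and sample sentence are invented for the example.

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(
    [
        {"label": "ORG", "pattern": "Apple", "id": "apple"},
        {"label": "GPE", "pattern": "Berlin", "id": "berlin"},
    ]
)

text = "Apple opened an office in Berlin"
before = [(ent.text, ent.label_) for ent in nlp(text).ents]

# Drop every pattern that was registered under the ID "apple".
ruler.remove("apple")
after = [(ent.text, ent.label_) for ent in nlp(text).ents]

print(before)
print(after)
```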
🔴 Bug fixes
- Fix issue #9638: Make `JsonlCorpus` path optional again.
- Fix issue #9654: Fix `spancat` for empty docs and zero suggestions.
- Fix issue #9658: Improve error message for incorrect `.jsonl` paths in `EntityRuler`.
- Fix issue #9674: Fix language-specific factory handling in package CLI.
- Fix issue #9694: Convert labels to strings for README in package CLI.
- Fix issue #9697: Exclude strings from source vector checks.
- Fix issue #9701: Allow `Scorer.score_spans` to handle predicted docs with missing annotation.
- Fix issue #9722: Initialize `parser` from reference parse rather than aligned example.
- Fix issue #9764: Set annotations more efficiently in `tagger` and `morphologizer`.
📖 Documentation and examples
- Various documentation updates: `init_tok2vec` after pretraining, batch contract for listeners.
- New additions to the spaCy universe:
  - `eng-spacysentiment`: Sentiment analysis for English.
  - Applied Language Technology course: NLP for newcomers using spaCy and Stanza.
👥 Contributors
@adrianeboyd, @danieldk, @DuyguA, @honnibal, @ines, @ljvmiranda921, @narayanacharya6, @nrodnova, @Pantalaymon, @polm, @richardpaulhudson, @svlandeg, @thiippal, @Vishnunkumar
v3.2.0: Registered scoring functions, Doc input, floret vectors and more
✨ New features and improvements
- NEW: Registered scoring functions for each component in the config.
- NEW: `nlp()` and `nlp.pipe()` accept `Doc` input, which simplifies setting custom tokenization or extensions before processing.
- NEW: Support for floret vectors, which combine fastText subwords with Bloom embeddings for compact, full-coverage vectors.
- `overwrite` config settings for `entity_linker`, `morphologizer`, `tagger`, `sentencizer` and `senter`.
- `extend` config setting for `morphologizer` for whether existing feature types are preserved.
- Support for a wider range of language codes in `spacy.blank()` including IETF language tags, for example `fra` for `French` and `zh-Hans` for `Chinese`.
- New package `spacy-loggers` for additional loggers.
- New Irish lemmatizer.
- New Portuguese noun chunks and updated Spanish noun chunks.
- Language updates for Bulgarian, Catalan, Sinhala, Tagalog, Tigrinya and Vietnamese.
- Japanese reading and inflection from `sudachipy` are annotated as `Token.morph` features.
- Additional `morph_micro_p/r/f` scores for morphological features from `Scorer.score_morph_per_feat()`.
- `LIKE_URL` attribute includes the tokenizer URL pattern.
- `--n-save-epoch` option for `spacy pretrain`.
- Trained pipelines:
  - New transformer pipeline for Japanese `ja_core_news_trf`, thanks to @hiroshi-matsuda-rit and the spaCy Japanese community!
  - Updates for Catalan data, tokenizer and lemmatizer, thanks to @cayorodriguez, Carme Armentano and @TeMU-BSC!
  - Transformer pipelines are trained using spacy-transformers v1.1, with improved IO and more options for model config and output.
  - Universal Dependencies corpora updated to v2.8.
  - Trailing space added as a `tok2vec` feature, improving the performance for many components, especially fine-grained tagging and sentence segmentation.
  - English attribute ruler patterns updated to improve `Token.pos` and `Token.morph`.
For more details, see the New in v3.2 usage guide.
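Two of the features above, `Doc` input to `nlp()` and the wider range of language codes in `spacy.blank()`, can be sketched like this. The example assumes spaCy v3.2 or later and uses a blank pipeline, so no trained model is required. The sample words are arbitrary.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Build a Doc with custom tokenization, then hand it to the pipeline.
# Passing a string instead would run the default tokenizer.
words = ["New", "York-based", "start-up"]
doc = Doc(nlp.vocab, words=words)
processed = nlp(doc)
tokens = [token.text for token in processed]
print(tokens)

# A three-letter code is resolved to the matching spaCy language class.
nlp_fr = spacy.blank("fra")
print(nlp_fr.lang)
```

Because the pipeline receives a ready-made `Doc`, the tokenizer is skipped and the custom token boundaries are preserved.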
🔴 Bug fixes
- Fix issue #8972: Fix pickling for Japanese, Korean and Vietnamese tokenizers.
- Fix issue #9032: Retain alignment between doc and context for `Language.pipe(as_tuples=True)` for multiprocessing with custom error handlers.
- Fix issue #9136: Ignore prefixes when applying suffix patterns in `Tokenizer`.
- Fix issue #9584: Use metaclass to subclass errors to allow better pickling.
⚠️ Backwards incompatibilities
- In the `Tokenizer`, prefixes are now removed before suffix matches are applied, which may lead to minor differences in the output. In particular, the default tokenization of `°[cfk].` is now `° c .` instead of `° c.` for most languages.
- The tokenizer classes `ChineseTokenizer`, `JapaneseTokenizer`, `KoreanTokenizer`, `ThaiTokenizer` and `VietnameseTokenizer` require `Vocab` rather than `Language` in `__init__`.
- In `DocBin`, user data is now always serialized according to the `store_user_data` option, see #9190.
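The `DocBin` change can be illustrated with a small round trip. This is a sketch assuming spaCy v3.2 or later. `round_trip` is a hypothetical helper, and the `"source"` key is an arbitrary piece of user data chosen for the example.

```python
import spacy
from spacy.tokens import Doc, DocBin

nlp = spacy.blank("en")


def round_trip(store_user_data: bool) -> Doc:
    """Serialize one Doc through a DocBin and load it back."""
    doc = Doc(nlp.vocab, words=["hello", "world"])
    doc.user_data["source"] = "example"
    doc_bin = DocBin(store_user_data=store_user_data)
    doc_bin.add(doc)
    data = doc_bin.to_bytes()
    restored = DocBin(store_user_data=store_user_data).from_bytes(data)
    return list(restored.get_docs(nlp.vocab))[0]


kept = round_trip(store_user_data=True)
dropped = round_trip(store_user_data=False)
print(kept.user_data.get("source"))
print("source" in dropped.user_data)
```

User data survives only when `store_user_data=True` is set, which is the behaviour the note above describes.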
📖 Documentation and examples
- Demo projects for floret vectors:
  - `pipelines/floret_vectors_demo`: basic floret vector training and importing.
  - `pipelines/floret_fi_core_demo`: Finnish UD+NER vector and pipeline training, comparing standard vs. floret vectors.
  - `pipelines/floret_ko_ud_demo`: Korean UD vector and pipeline training, comparing standard vs. floret vectors.
👥 Contributors
@adrianeboyd, @Avi197, @baxtree, @BramVanroy, @cayorodriguez, @DuyguA, @fgaim, @honnibal, @ines, @Jette16, @jimregan, @polm, @rspeer, @rumeshmadhusanka, @svlandeg, @syrull, @thomashacker
v3.1.4: Python 3.10 wheels and support for AppleOps
✨ New features and improvements
- NEW: Binary wheels for Python 3.10.
- NEW: Improve performance on Apple M1 with `AppleOps`: `pip install spacy[apple]`.
- GPU profiling with `spacy.models_with_nvtx_range.v1`.
- Full `mypy` integration in the CI and many type fixes across the code base.
- Added custom `Protocol` classes in `ty.py` to define behavior of pipeline components.
- Support for entity linking visualization in `displacy`.
- Allow overriding vars in `spacy project assets`.
- Standalone `train` function to run the training from Python scripts just like the `spacy train` CLI.
- Support for `spacy-transformers>=1.1.0` with improved IO.
- Support for `thinc>=8.0.11` with improved gradient clipping.
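The standalone `train` function mentioned above can be called from a script roughly as follows. This is a sketch assuming spaCy v3.1.4 or later. `run_training` is a hypothetical wrapper, and the config and corpus paths are placeholders that would need to exist before the function is called.

```python
from pathlib import Path

from spacy.cli.train import train


def run_training(
    config_path: str,
    output_dir: str,
    train_path: str,
    dev_path: str,
    use_gpu: int = -1,
) -> None:
    """Run training from Python, mirroring `spacy train` with overrides."""
    train(
        Path(config_path),
        Path(output_dir),
        use_gpu=use_gpu,  # -1 means CPU
        # Same dotted keys as the `--paths.train` / `--paths.dev` CLI overrides.
        overrides={"paths.train": train_path, "paths.dev": dev_path},
    )
```

A call such as `run_training("config.cfg", "output", "train.spacy", "dev.spacy")` would then correspond to running `spacy train config.cfg --output output --paths.train train.spacy --paths.dev dev.spacy`, with all file names here being placeholders.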
🔴 Bug fixes
- Fix issue #5507: Improve UX for multiprocessing on GPU.
- Fix issue #9137: Fix serialization for `KnowledgeBase.set_entities`.
- Fix issue #9244: Fix vectors for 0-length spans.
- Fix issue #9247: Improve UX for the `DocBin` constructor.
- Fix issue #9254: Allow unicode in a `spacy project` title.
- Fix issue #9263: Make added patterns consistent in the `DependencyMatcher`.
- Fix issue #9305: Restore tokenization timing during evaluation.
- Fix issue #9335: Sync vocab in vectors and sourced components.
- Fix issue #9387: Ensure lemmas are consistent for Catalan, Dutch, French, Russian and Ukrainian.
- Fix issue #9404: Create consistent default `textcat` and `textcat_multilabel` configurations.
- Fix issue #9437: Improve UX around `Doc` object creation.
- Fix issue #9465: Fix minor issues with `convert` CLI.
- Fix issue #9500: Include `.pyi` files in the distributed package.
📖 Documentation and examples
- Various updates to the documentation.
- New additions to the spaCy universe:
  - `deplacy`: CUI-based dependency visualizer
  - `ipymarkup`: Visualizations for NER and syntax trees
  - `PhruzzMatcher`: Find fuzzy matches
  - `spacy-huggingface-hub`: Push spaCy pipelines to the Hugging Face Hub
  - `spaCyOpenTapioca`: Entity Linking on Wikidata
  - `spacy-clausie`: Clause-based information extraction system
  - "Applied Natural Language Processing in the Enterprise": Book by Ankur A. Patel
  - "Introduction to spaCy 3": Free course by Dr. W.J.B. Mattingly
👥 Contributors
@adrianeboyd, @connorbrinton, @danieldk, @DuyguA, @honnibal, @ines, @Jette16, @ljvmiranda921, @mjvallone, @philipvollet, @polm, @rspeer, @ryndaniels, @shigapov, @svlandeg, @thomashacker
v3.1.3: Bug fixes and UX updates
✨ New features and improvements
- The `v3` of `WandbLogger` now supports optional `run_name` and `entity` parameters.
- Improved UX when providing invalid `pos` values for a `Doc` or `Token`.
🔴 Bug fixes
- Fix issue #9001: Pass alignments to `Matcher` callbacks.
- Fix issue #9009: Include component factories in third-party dependencies resolver.
- Fix issue #9012: Correct type of `config` in `create_pipe`.
- Fix issue #9014: Allow `typer` 0.4 to provide support for both Click 7 and Click 8.
- Fix issue #9033: Fix verbs list for French tokenizer exceptions.
- Fix issue #9059: Pass overrides to subcommands in `spacy project` workflows.
- Fix issue #9074: Improve UX around `repo` and `path` arguments in `spacy project`.
- Fix issue #9084: Fix inference of `epoch_resume` in `spacy pretrain`.
- Fix issue #9163: Handle `spacy-legacy` in `spacy package` dependency detection.
- Fix issue #9211: Include only runtime-relevant dependencies in `spacy package`.
📖 Documentation and examples
- Various updates to the documentation.
- A few additions and updates to the spaCy universe.
- Extended the developer documentation with information about the listener pattern, the `StringStore` and the `Vocab`.
👥 Contributors
@adrianeboyd, @davidefiocco, @davidstrouk, @filipematos95, @honnibal, @ines, @j-frei, @Joozty, @kwhumphreys, @mjhajharia, @mylibrar, @polm, @rspeer, @shigapov, @svlandeg, @thomashacker