Introducing spaCy v3.1


It’s been great to see the adoption of spaCy v3, which
introduced transformer-based pipelines, a new config and training system and
many other features. Version 3.1 adds more on top of it, including the ability
to use predicted annotations during training, a component for predicting
arbitrary and overlapping spans and new trained pipelines for Catalan and
Danish.

For a full overview of what’s new in spaCy v3.1 and notes on upgrading, check
out the release notes
and usage guide. Here are some of the most relevant
additions:

By default, components are updated in isolation during training, which means
that they don’t see the predictions of any earlier components in the pipeline.
The new
[training.annotating_components]
config setting lets you specify pipeline components that should set annotations
on the predicted docs during training. This makes it easy to use the predictions
of a previous component in the pipeline as features for a subsequent component,
e.g. using the parser's dependency labels as features in the tagger:

config.cfg (excerpt)

[nlp]
pipeline = ["parser", "tagger"]

[components.tagger.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = ${components.tagger.model.tok2vec.encode.width}
attrs = ["NORM", "DEP"]
rows = [5000, 2500]
include_static_vectors = false

[training]
annotating_components = ["parser"]

For an end-to-end example of how to use the token.dep attribute predicted by
the parser as a feature for a subsequent tagger component in the pipeline, check
out
this project template.

A common task in applied NLP is extracting spans
of text from documents, including longer phrases or nested expressions. Named
entity recognition isn’t the right tool for this problem, since an entity
recognizer typically predicts single token-based tags that are very sensitive to
boundaries. This is effective for proper nouns and self-contained expressions,
but less useful for other types of phrases or overlapping spans. The new
experimental SpanCategorizer component
and architecture let you
label arbitrary and potentially overlapping spans of text.
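Unlike doc.ents, the Doc.spans group that this component reads and writes can hold overlapping and nested spans. A minimal sketch (the "sc" key and the labels here are illustrative assumptions):

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
# Before training, the trainable component would be added like this:
# nlp.add_pipe("spancat", config={"spans_key": "sc"})

doc = nlp("The new Apple Watch Series 6 measures blood oxygen.")
# A span group may hold overlapping and nested spans, which doc.ents cannot.
doc.spans["sc"] = [
    Span(doc, 2, 6, label="PRODUCT"),  # "Apple Watch Series 6"
    Span(doc, 4, 6, label="VERSION"),  # "Series 6", nested inside the first
]
print([(span.text, span.label_) for span in doc.spans["sc"]])
# → [('Apple Watch Series 6', 'PRODUCT'), ('Series 6', 'VERSION')]
```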

The upcoming version of our annotation tool Prodigy
(currently available as a pre-release for all
users) will also feature a
new workflow and UI for annotating
overlapping and nested spans, which you can use to create training data for
spaCy’s SpanCategorizer component.

The EntityRecognizer can now be
updated with known incorrect annotations, which lets you take advantage of
partial and sparse data. For example, you’ll be able to use the information that
certain spans of text are definitely not PERSON entities, without having
to provide the complete gold-standard annotations for the given example. The
incorrect span annotations can be added via Doc.spans in the training data,
under the key defined as incorrect_spans_key in the component config.

Annotate incorrect spans

train_doc = nlp.make_doc("Barack Obama was born in Hawaii.")
train_doc.spans["incorrect_spans"] = [
    Span(train_doc, 0, 2, label="ORG"),
    Span(train_doc, 5, 6, label="PRODUCT"),
]

config.cfg (excerpt)

[components.ner]
factory = "ner"
incorrect_spans_key = "incorrect_spans"
moves = null
update_with_oracle_cut_size = 100
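When assembling the pipeline in Python rather than from a config file, the same setting can be passed to add_pipe. A sketch mirroring the excerpt above (the PERSON label is added only so the component has something to predict):

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
# Tell the ner component which span group holds known-incorrect annotations.
ner = nlp.add_pipe("ner", config={"incorrect_spans_key": "incorrect_spans"})
ner.add_label("PERSON")

train_doc = nlp.make_doc("Barack Obama was born in Hawaii.")
train_doc.spans["incorrect_spans"] = [
    Span(train_doc, 0, 2, label="ORG"),      # "Barack Obama" is not an ORG
    Span(train_doc, 5, 6, label="PRODUCT"),  # "Hawaii" is not a PRODUCT
]
```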

spaCy v3.1 adds 5 new pipeline packages, including a new core family for Catalan
and a new transformer-based pipeline for Danish using the
danish-bert-botxo weights.
See the models directory for an overview of all
available trained pipelines and the
training guide for details on how to train
your own.

Upload your pipelines to the Hugging Face Hub

The Hugging Face Hub lets you upload models and share
them with others, and it now supports spaCy pipelines out-of-the-box. The
spacy-huggingface-hub extension package
automatically adds a huggingface-hub command to your spacy CLI, lets you upload pipelines
packaged with spacy package and takes care of
auto-generating all required meta information.

Upload a trained pipeline to the hub

pip install spacy-huggingface-hub
huggingface-cli login
python -m spacy package ./en_ner_fashion ./output --build wheel
cd ./output/en_ner_fashion-0.0.0/dist
python -m spacy huggingface-hub push en_ner_fashion-0.0.0-py3-none-any.whl

After uploading, you’ll get a live URL for your model page, as well as a direct
URL to the wheel file that you can install via pip install. You can also
integrate the upload command into your
project template to
automatically upload your packaged pipelines after training.
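A push step in project.yml might look like this (the command name and paths are assumptions for illustration, not part of the release):

```yaml
commands:
  - name: "push_to_hub"
    help: "Upload the packaged pipeline to the Hugging Face Hub"
    script:
      - "python -m spacy huggingface-hub push output/en_ner_fashion-0.0.0/dist/en_ner_fashion-0.0.0-py3-none-any.whl"
    deps:
      - "output/en_ner_fashion-0.0.0/dist/en_ner_fashion-0.0.0-py3-none-any.whl"
```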

View spaCy pipelines on the
Hub

New in the spaCy universe

The spaCy universe has seen a lot of awesome
additions since the last release! Here’s a selection of new plugins and
extensions you can use to add more power to your spaCy projects:

🐭 skweak Toolkit for weak supervision applied to NLP tasks
👯 coreferee Coreference resolution for English, German and Polish
🐇 tokenwiser Connect vowpal-wabbit & scikit-learn models to spaCy
🏺 hmrb Python rule processing engine with readable syntax
🧮 numerizer Convert natural language numerics into ints and floats
🌕 spikex Pipeline components for knowledge extraction
📘 trunajod Text complexity library for text analysis
🧠 emfdscore Extended Moral Foundation Dictionary Scoring
📇 denomme Extension for extracting multilingual names
💎 ruby-spacy Wrapper to use spaCy in Ruby

Many existing universe packages have also been updated with support for
spaCy v3.

View the spaCy universe

We have a lot more planned for upcoming releases so stay tuned! Some of our
current work in progress includes a native component for
coreference resolution, new
ecosystem integrations and end-to-end project templates for using PyTorch models
to power spaCy components and training pipelines using the new span categorizer.
