Introducing spaCy v3.0

spaCy v3.0 is a huge release! It features new
transformer-based pipelines that get spaCy’s accuracy right up to the current
state-of-the-art, and a new workflow system to help you take projects from
prototype to production. It’s much easier to configure and train your pipeline,
and there are lots of new and improved integrations with the rest of the NLP
ecosystem.

We’ve been working on spaCy v3.0 for over a year now, and
almost two years if you count all the work that’s gone into
Thinc. Our main aim with the release is to make it easier to
bring your own models into spaCy, especially state-of-the-art models like
transformers. You can write models powering spaCy components in frameworks like
PyTorch or TensorFlow, using our awesome new configuration system to describe
all of your settings. And since modern NLP workflows often consist of multiple
steps, there’s a new workflow system to help you keep your work organized.

For detailed installation instructions for your platform and setup, check out
the installation quickstart widget.

pip install -U spacy

spaCy v3.0 features all new transformer-based pipelines that bring spaCy’s
accuracy right up to the current state-of-the-art. You can use any
pretrained transformer to train your own pipelines, and even share one
transformer between multiple components with multi-task learning. spaCy’s
transformer support interoperates with PyTorch and the
HuggingFace transformers library,
giving you access to thousands of pretrained models for your pipelines. See
below for an overview of the new pipelines.
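Once a transformer-based pipeline package such as en_core_web_trf is
installed, you can load and use it like any other spaCy pipeline. A minimal
sketch:

import spacy

# Requires the package to be installed first, e.g. via
# python -m spacy download en_core_web_trf
nlp = spacy.load("en_core_web_trf")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")
print([(ent.text, ent.label_) for ent in doc.ents])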

Accuracy on the OntoNotes 5.0 corpus (reported on the development set):

Named Entity Recognition System    OntoNotes    CoNLL ’03
spaCy RoBERTa (2020)               89.7         91.6
Stanza (StanfordNLP) [1]           88.8         92.1
Flair [2]                          89.7         93.1

Named entity recognition accuracy on the OntoNotes 5.0 and CoNLL-2003
corpora. See NLP-progress for more results. Project template:
benchmarks/ner_conll03. [1] Qi et al. (2020). [2] Akbik et al. (2018).

spaCy lets you share a single transformer or other token-to-vector (“tok2vec”)
embedding layer between multiple components. You can even update the shared
layer, performing multi-task learning. Reusing the embedding layer between
components can make your pipeline run a lot faster and result in much smaller
models.

You can share a single transformer or other token-to-vector model between
multiple components by adding a Transformer or Tok2Vec component near the
start of your pipeline. Components later in the pipeline can “connect” to it by
including a listener layer within their model.
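In the training config, the shared setup looks roughly like this. This is a
minimal sketch, assuming the default tok2vec and tagger factories; the
remaining model blocks would be filled in with spacy init fill-config, and
the listener’s upstream name must match the shared component:

[components.tok2vec]
factory = "tok2vec"

[components.tagger]
factory = "tagger"

[components.tagger.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "tok2vec"

This replaces the tagger’s internal embedding layer with a listener that
reads the output of the shared tok2vec component.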

Read more
Benchmarks
Download trained pipelines

spaCy v3.0 provides retrained model families for 18
languages and 59 trained pipelines in total, including 5 new
transformer-based pipelines. You can also train your own transformer-based
pipelines using your own data and transformer weights of your choice.

The models are each trained with a single transformer shared across the
pipeline, which requires the whole pipeline to be trained on a single corpus.
For English and Chinese, we used the OntoNotes 5 corpus, which has
annotations across several tasks. For French, Spanish and German, we didn’t
have a suitable corpus with both syntactic and entity annotations, so the
transformer models for those languages do not include NER.
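To get one of the trained pipelines, you can use the spacy download command
with the package name, for example:

python -m spacy download en_core_web_trf
python -m spacy download de_dep_news_trf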

Download pipelines

spaCy v3.0 introduces a comprehensive and extensible system for configuring
your training runs. A single configuration file describes every detail of
your training run, with no hidden defaults, making it easy to rerun your
experiments and track changes.

You can use the quickstart widget
or the init config command to get
started. Instead of providing lots of arguments on the command line, you only
need to pass your config.cfg file to
spacy train.
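For example, to generate a starter config for an English NER pipeline and
train it (the .spacy paths below are placeholders for your own data files):

python -m spacy init config config.cfg --lang en --pipeline ner
python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy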

Training config files include all settings and hyperparameters for training
your pipeline. Some settings can also be registered functions that you can
swap out and customize, making it easy to implement your own custom models and
architectures.

config.cfg

[training]
accumulate_gradient = 3

[training.optimizer]
@optimizers = "Adam.v1"

[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
initial_rate = 0.01

Some of the main advantages and features of spaCy’s training config are:

  • Structured sections. The config is grouped into sections, and nested
    sections are defined using the . notation. For example, [components.ner]
    defines the settings for the pipeline’s named entity recognizer. The config
    can be loaded as a Python dict.
  • References to registered functions. Sections can refer to registered
    functions like model architectures,
    optimizers or
    schedules and define arguments that are
    passed into them. You can also
    register your own functions
    to define custom architectures or methods, reference them in your config and
    tweak their parameters.
  • Interpolation. If you have hyperparameters or other settings used by
    multiple components, define them once and reference them as variables
    (see the sketch after this list).
  • Reproducibility with no hidden defaults. The config file is the “single
    source of truth” and includes all settings.
  • Automated checks and validation. When you load a config, spaCy checks if
    the settings are complete and if all values have the correct types. This lets
    you catch potential mistakes early. In your custom architectures, you can use
    Python type hints to tell the
    config which types of data to expect.
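As a small illustration of interpolation, a value defined once can be
referenced anywhere else in the config with the ${...} syntax. A minimal
sketch with made-up values:

[paths]
train = "corpus/train.spacy"

[system]
seed = 42

[training]
seed = ${system.seed}

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}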

Read more

spaCy’s new configuration system makes
it easy to customize the neural network models used by the different pipeline
components. You can also implement your own architectures via spaCy’s machine
learning library Thinc, which provides various layers and
utilities, as well as thin wrappers around frameworks like PyTorch,
TensorFlow and MXNet. Component models all follow the same unified
Model API and each Model can also be used
as a sublayer of a larger network, allowing you to freely combine
implementations from different frameworks into a single model.


Wrapping a PyTorch model

from torch import nn
from thinc.api import PyTorchWrapper

# Define a regular PyTorch model ...
torch_model = nn.Sequential(
    nn.Linear(32, 32),
    nn.ReLU(),
    nn.Softmax(dim=1),
)
# ... and wrap it so it can be used as a Thinc Model
model = PyTorchWrapper(torch_model)
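The wrapped model then behaves like any other Thinc Model: you can
initialize it and run it on array data, with Thinc handling the conversion
between NumPy arrays and torch tensors. A small usage sketch:

import numpy

X = numpy.zeros((2, 32), dtype="f")
model.initialize(X=X)
Y = model.predict(X)  # a (2, 32) array of softmax outputs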

Read more

spaCy projects let you manage and share
end-to-end spaCy workflows for different use cases and domains, and
orchestrate training, packaging and serving your custom pipelines. You can start
off by cloning a pre-defined project template, adjust it to fit your needs, load
in your data, train a pipeline, export it as a Python package, upload your
outputs to remote storage and share your results with your team.

spaCy projects also make it easy to integrate with other tools in the data
science and machine learning ecosystem, including
DVC for data version control,
Prodigy for creating labelled data,
Streamlit for building interactive
apps, FastAPI for serving models in
production, Ray for parallel training,
Weights & Biases for experiment
tracking, and more!

Using spaCy projects

python -m spacy project clone pipelines/tagger_parser_ud
cd tagger_parser_ud
python -m spacy project assets
python -m spacy project run all

Selected example templates

To clone a template, you can run the spacy project clone command with its
relative path, e.g. python -m spacy project clone pipelines/ner_wikiner.

Read more
Project templates

Track your results with Weights & Biases

Weights & Biases is a popular platform for experiment
tracking. spaCy integrates with it out-of-the-box via the
WandbLogger, which you can add
as the [training.logger] block of your training
config.

The results of each step are then logged in your project, together with the full
training config. This means that every hyperparameter, registered function
name and argument will be tracked and you’ll be able to see the impact it has on
your results.

config.cfg

[training.logger]
@loggers = "spacy.WandbLogger.v1"
project_name = "monitor_spacy_training"
remove_config_values = ["paths.train", "paths.dev", "training.dev_corpus.path", "training.train_corpus.path"]

Ray is a fast and simple framework for building and running
distributed applications. You can use Ray to train spaCy on one or more
remote machines, potentially speeding up your training process.

The Ray integration is powered by a lightweight extension package,
spacy-ray, that automatically adds
the ray command to your spaCy CLI if it’s
installed in the same environment. You can then run
spacy ray train for parallel training.

Parallel training with Ray

pip install spacy-ray --pre
python -m spacy ray --help
python -m spacy ray train config.cfg --n-workers 2

Read more

spacy-ray

spaCy v3.0 includes several new trainable and rule-based components that you can
add to your pipeline and customize for your use case.

Defining, configuring, reusing, training and analyzing
pipeline components
is now easier and more convenient. The
@Language.component and
@Language.factory decorators let you
register your component and define its default configuration and metadata, like
the attribute values it assigns and requires. Any custom component can be
included during training, and sourcing components from existing trained
pipelines lets you mix and match custom pipelines. The
nlp.analyze_pipes method
outputs structured information about the current pipeline and its components,
including the attributes they assign, the scores they compute during training
and whether any required attributes aren’t set.

import spacy
from spacy.language import Language

# Register a custom stateless component
@Language.component("my_component")
def my_component(doc):
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("my_component")

# Source a component from an existing trained pipeline
other_nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("ner", source=other_nlp)

# Print structured information about the current pipeline
nlp.analyze_pipes(pretty=True)

Read more

The new DependencyMatcher lets you
match patterns within the dependency parse using
Semgrex
operators. It follows the same API as the token-based
Matcher. A pattern added to the dependency
matcher consists of a list of dictionaries, with each dictionary describing
a token to match and its relation to an existing token in the pattern.

[Illustration: part of the dependency match pattern]

import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_sm")
matcher = DependencyMatcher(nlp.vocab)

pattern = [
    # Anchor token: "founded"
    {"RIGHT_ID": "anchor_founded", "RIGHT_ATTRS": {"ORTH": "founded"}},
    # Subject of "founded"
    {"LEFT_ID": "anchor_founded", "REL_OP": ">", "RIGHT_ID": "subject", "RIGHT_ATTRS": {"DEP": "nsubj"}},
    # Direct object of "founded"
    {"LEFT_ID": "anchor_founded", "REL_OP": ">", "RIGHT_ID": "founded_object", "RIGHT_ATTRS": {"DEP": "dobj"}},
    # Adjectival or compound modifier of the object
    {"LEFT_ID": "founded_object", "REL_OP": ">", "RIGHT_ID": "founded_object_modifier", "RIGHT_ATTRS": {"DEP": {"IN": ["amod", "compound"]}}},
]

matcher.add("FOUNDED", [pattern])
doc = nlp("Lee, an experienced CEO, has founded two AI startups.")
matches = matcher(doc)
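Each match is a (match_id, token_ids) tuple, where the token IDs line up with
the order of the dictionaries in the pattern. A small usage sketch:

for match_id, token_ids in matches:
    # token_ids follow the pattern order: anchor, subject, object, modifier
    print([doc[token_id].text for token_id in token_ids])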

Read more

spaCy v3.0 officially drops support for Python 2 and now requires Python
3.6+. This also means that the code base can take full advantage of
type hints. spaCy’s user-facing
API that’s implemented in pure Python (as opposed to Cython) now comes with type
hints. The new version of spaCy’s machine learning library
Thinc also features extensive
type support, including custom
types for models and arrays, and a custom mypy plugin that can be used to
type-check model definitions.

For data validation, spaCy v3.0 adopts
pydantic. It also powers the data
validation of Thinc’s config system, which
lets you register custom functions with typed arguments, reference them in
your config and see validation errors if the argument values don’t match.

Argument validation with type hints

from spacy.language import Language
from pydantic import StrictBool

@Language.factory("my_component")
def create_component(nlp: Language, name: str, custom: StrictBool):
    ...
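If a config value doesn’t match the declared type, spaCy raises a validation
error when the component is created. A sketch, assuming the factory above;
StrictBool won’t coerce a string like "yes":

import spacy

nlp = spacy.blank("en")
# Raises a config validation error instead of silently accepting "yes"
nlp.add_pipe("my_component", config={"custom": "yes"})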

Read more
