Our Year in Review · Explosion


It’s been another exciting year at Explosion! We’ve developed a new end-to-end
neural coref component for spaCy, improved the speed of our
CNN pipelines by up to 60%, and published new pre-trained pipelines for Finnish,
Korean, Swedish, Croatian and Ukrainian. We’ve also released several updates to
Prodigy and introduced new recipes to kickstart annotation
with zero- or few-shot learning.

During 2022, we also launched two popular new services –
spaCy Tailored Pipelines and
spaCy Tailored Analysis. We’ve published several
technical blog posts and reports, and created a bunch of new videos covering
many tips and tricks to get the most out of our developer tools. We can’t wait
to show you what we’re building in 2023 for the next chapters of spaCy and
Prodigy, but for now, here’s our look back at 2022. Happy reading!

[Image: edit-tree lemmatizer diagram and coref video thumbnail]

As part of our spaCy v3.3 release in
April, we’ve added a trainable lemmatizer to spaCy. It uses
edit trees to transform tokens
into lemmas and it’s included in the new Finnish, Korean and Swedish pipelines
introduced with v3.3, as well as in the new Croatian and Ukrainian pipelines
released for spaCy v3.4 in July. We’ve
also updated the pipelines for Danish, Dutch, German, Greek, Italian,
Lithuanian, Norwegian Bokmål, Polish, Portuguese and Romanian to switch from the
lookup or rule-based lemmatizers to the new trainable one.
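Here's a minimal sketch of trying out the new lemmatizer, assuming the Finnish pipeline has been downloaded with python -m spacy download fi_core_news_sm:

```python
import spacy

# Assumes the Finnish pipeline introduced with v3.3 is installed:
# python -m spacy download fi_core_news_sm
nlp = spacy.load("fi_core_news_sm")

doc = nlp("Helsingin yliopisto sijaitsee Suomessa.")
for token in doc:
    print(token.text, token.lemma_)

# To train your own, add the edit-tree lemmatizer to a blank pipeline;
# its factory is registered as "trainable_lemmatizer" in spaCy v3.3+.
nlp_blank = spacy.blank("fi")
nlp_blank.add_pipe("trainable_lemmatizer")
```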

Furthermore, we’ve implemented a new end-to-end neural
coreference resolution component in
spacy-experimental’s
v0.6.0 release.
The release includes an
experimental English coref pipeline
and a
sample project
that shows how to train a coref model for spaCy. You can read all about this new
coref component in our blog post by Ákos,
Paul and team that outlines why you want to do coreference resolution in the
first place and explains some of the crucial architecture choices of our
end-to-end neural system in detail. Finally, Edi recorded
a video on our coref component showing how to
train a coreference resolution model with spaCy projects and then apply the
trained pipeline to resolve references in a text. Use these resources to
jump-start your experiments with coref, and let us know how you get on over at
the discussion forum!
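As a quick taste, here's a minimal sketch of running the experimental pipeline, assuming the en_coreference_web_trf wheel from the spacy-experimental v0.6.0 release has been installed:

```python
import spacy

# Assumes the experimental English coref pipeline has been installed
# from the spacy-experimental v0.6.0 release.
nlp = spacy.load("en_coreference_web_trf")

doc = nlp("John called Sarah because he needed her notes.")

# Predicted coreference chains are stored as span groups on doc.spans,
# under keys such as "coref_clusters_1", "coref_clusters_2", ...
for key, cluster in doc.spans.items():
    if key.startswith("coref_clusters"):
        print(key, [span.text for span in cluster])
```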

[Image: how spancat works and span labeling in Prodigy]

We’ve also spent some time running various experiments and implementing
extensions to our new SpanCategorizer
component. The spancat is a spaCy component for predicting arbitrary, potentially
overlapping spans, which makes it a good fit for long phrases,
non-named entities and nested or overlapping annotations. We added some useful span
suggesters to
spacy-experimental v0.5.0
that identify candidate spans by inspecting annotations from the tagger and
parser, and then marking relevant subtrees, noun chunks, or sentences. Edi, Lj
and team have written a comprehensive
blog post covering full details of the
spancat implementation as well as an architecture case study on nested NER. In
his most recent video, Edi shows how to use
Prodigy for spaCy’s spancat component, annotating food recipes and sharing best
practices around annotation consistency and efficiency.
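To make the "arbitrary and overlapping" part concrete, here's a small sketch with made-up labels showing how span groups differ from entities, plus how the component is added to a pipeline:

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Add 200 g of dark chocolate.")

# Unlike doc.ents, the span groups in doc.spans may overlap. A trained
# spancat component writes its predictions to a span group under a
# configurable key ("sc" by default); here we fill one in by hand.
doc.spans["sc"] = [
    Span(doc, 1, 3, label="QUANTITY"),    # "200 g"
    Span(doc, 1, 6, label="INGREDIENT"),  # overlaps the first span
]
for span in doc.spans["sc"]:
    print(span.text, span.label_)

# Adding the component itself, ready for training:
nlp.add_pipe("spancat", config={"spans_key": "sc"})
```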

We’ve been focusing heavily on speed improvements across our open-source stack
for two years now, including spaCy and
Thinc. We fixed a lot of the low-hanging fruit in 2021,
improving transformer training performance by up to 62%. We’ve achieved further
improvements in 2022 by systematically profiling training and inference and
eliminating bottlenecks where we could. There were too many improvements to
list them all here, so we’ll highlight three changes:

  1. The Thinc Softmax layer is
    used by many models to compute a probability distribution over classes. This
    function is quite expensive due to its use of the exponentiation function.
    During inference, we usually do not care about the actual class probabilities
    but rather what the most probable class is. Since softmax is a monotonic
    function, we can find the most probable class from the raw inputs to the
    softmax function (the so-called logits). In spaCy v3.3, we started using
    logits during inference, which resulted in speedups of 27% when using a
    tagging + parsing pipeline. The first sketch after this list illustrates
    why this works.
  2. The transition-based parser extracts features for the transition model to
    predict the next transition. One function that is used in feature extraction
    looks up the n-th most recent left-arc of a head. In order to do so, it
    would first extract all arcs with that particular head from a table of all
    left-arcs. Since the number of left-arcs correlates with the document length,
    doing this for each transition unfortunately degraded the complexity of the
    parser to quadratic time. In spaCy v3.3, we rewrote this function to perform
    the lookup in constant time, restoring the parser’s overall complexity to
    linear time again. This resulted in
    large speedups
    on long documents.
  3. One of the operations involved in the training of a pipeline component is the
    calculation of the loss between the model’s predictions and the gold-standard
    labels, which requires computing the alignment between the two. Originally,
    the alignment function manually iterated through arrays using a for-loop
    and compared the entries individually. In spaCy v3.4, we vectorized those
    operations, which increased GPU throughput and
    reduced training time by 20%; the second sketch after this list shows the idea.
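The first sketch below illustrates the monotonicity argument from point 1: because softmax never changes the ordering of its inputs, the argmax over the raw logits already identifies the most probable class.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    exps = np.exp(logits - logits.max())
    return exps / exps.sum()

logits = np.array([1.5, -0.2, 3.1, 0.7])

# Softmax is monotonic, so the most probable class can be read directly
# off the logits, skipping the costly exponentiation during inference.
assert np.argmax(softmax(logits)) == np.argmax(logits)
```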
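The second sketch illustrates the vectorization idea from point 3 in miniature. It isn't spaCy's actual alignment code, just the general pattern: replacing a Python-level loop with a single elementwise array operation, which maps onto fast CPU and GPU kernels.

```python
import numpy as np

predicted = np.random.randint(0, 50, size=100_000)
gold = np.random.randint(0, 50, size=100_000)

# Scalar version: a Python loop comparing entries one at a time.
matches_loop = sum(1 for p, g in zip(predicted, gold) if p == g)

# Vectorized version: one elementwise comparison plus a reduction.
matches_vec = int((predicted == gold).sum())

assert matches_loop == matches_vec
```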

Thanks to the aforementioned changes and a myriad of other, smaller optimizations,
we’ve been able to squeeze out significant improvements in both inference and
training performance. In the tables below, we compare the inference and
training performance
of spaCy on January 1, 2022 and January 1, 2023 for a
German pipeline with the tagger, morphologizer, parser and attribute ruler
components. The results show improvements across the board, but are most visible
in pipelines that are not dominated by matrix multiplication.

Inference performance on Ryzen 5950X/GeForce RTX 3090

| Pipeline    | Device | January 2022 (words/s) | January 2023 (words/s) | Delta  |
|-------------|--------|------------------------|------------------------|--------|
| Convolution | CPU    | 25,421                 | 25,573                 | +0.6%  |
| Convolution | GPU    | 96,291                 | 121,623                | +26.3% |
| Transformer | CPU    | 1,743                  | 1,779                  | +2.0%  |
| Transformer | GPU    | 20,381                 | 20,297                 | -0.4%  |

Training performance on Ryzen 5950X/GeForce RTX 3090

| Pipeline    | Device | January 2022 (words/s) | January 2023 (words/s) | Delta  |
|-------------|--------|------------------------|------------------------|--------|
| Convolution | CPU    | 5,139                  | 6,359                  | +23.7% |
| Convolution | GPU    | 4,667                  | 5,139                  | +10.0% |
| Transformer | GPU    | 3,327                  | 3,575                  | +7.5%  |

We also made two large optimizations that primarily benefit Apple Silicon Macs.
In 2021, we released
thinc-apple-ops.
With this add-on package, Thinc uses Apple’s Accelerate framework for matrix
multiplication. Accelerate uses special matrix multiplication units (AMX) on
Apple Silicon Macs, resulting in large speedups. However, spaCy’s dependency
parser did not use Thinc for matrix multiplication in low-level Cython code. The
first optimization was to define a C BLAS interface in Thinc and use this in
the dependency parser to leverage the AMX units. This led to large
improvements in training and inference speed, as shown in the tables below.

The second optimization was to leverage the support for Metal Performance
Shaders
that was added to PyTorch to speed up transformer models. Madeesh and
Daniël have written a
blog post about fast
transformer inference using Metal Performance Shaders. The performance impact
can also be seen in the results below.
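spaCy wires this up internally through spacy-transformers, but the underlying mechanism is PyTorch's MPS device. As a generic, spaCy-agnostic sketch:

```python
import torch

# On Apple Silicon with a recent PyTorch build, tensors and models can
# be placed on the GPU via the Metal Performance Shaders backend.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

model = torch.nn.Linear(768, 768).to(device)
x = torch.randn(32, 768, device=device)
with torch.no_grad():
    y = model(x)
print(y.device)  # mps:0 on Apple Silicon, cpu elsewhere
```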

Inference performance on M1 Max

| Pipeline    | Device | January 2022 (words/s) | January 2023 (words/s) | Delta   |
|-------------|--------|------------------------|------------------------|---------|
| Convolution | CPU    | 35,818                 | 57,376                 | +60.1%  |
| Transformer | CPU    | 1,883                  | 1,887                  | 0.0%    |
| Transformer | GPU    | see CPU                | 7,660                  | +406.9% |

Training performance on M1 Max

| Pipeline    | Device | January 2022 (words/s) | January 2023 (words/s) | Delta  |
|-------------|--------|------------------------|------------------------|--------|
| Convolution | CPU    | 7,593                  | 9,975                  | +31.4% |

All in all, it has been a great year for performance! Nevertheless, we have more
improvements in the works – particularly with respect to transformer models –
that we hope to show you in the coming months.

We released Prodigy v1.11.7 and
Prodigy v1.11.8. These releases
include various bug fixes, usability improvements and extended support for the
latest spaCy versions, along with many other small refinements.

Prodigy annotation with OpenAI

Further, we’ve been working on new
Prodigy workflows that use the OpenAI API
to kickstart your annotations via zero- or few-shot learning. We published
the
first recipe,
for NER annotation, at the end of December. Keep an eye on this repo as more
exciting recipes will be published soon!
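To give a flavor of the zero-shot idea behind these recipes, here's an illustrative sketch. This is not the actual recipe code: the prompt, labels and output parsing are all made up, and it uses the OpenAI Python client as it existed at the time.

```python
import openai  # the 2022-era client; pip install "openai<1.0"

openai.api_key = "sk-..."  # your API key

PROMPT = """Extract the named entities from the text below.
Labels: DISH, INGREDIENT
Text: "{text}"
Answer with one line per entity, formatted as LABEL: entity"""

def zero_shot_ner(text: str):
    # Ask the model for entities without any task-specific training,
    # then parse its free-text answer into (span, label) pairs.
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=PROMPT.format(text=text),
        temperature=0.0,
        max_tokens=200,
    )
    entities = []
    for line in response["choices"][0]["text"].strip().splitlines():
        label, _, span = line.partition(":")
        if span:
            entities.append((span.strip(), label.strip()))
    return entities

print(zero_shot_ner("Whisk two eggs, then fold in the flour."))
```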

In 2022, we launched two brand new consulting services! February saw the launch
of spaCy Tailored Pipelines, where we offer
custom-made solutions for your NLP problems, built by spaCy’s core developers. By the
summer we had already engaged with several companies on a variety of interesting
use cases, including Patent Bots’ legal
information extraction pipeline, which now handles training, packaging and
deployment in a spaCy project structure that is easy to maintain and update in
the future.

[Image: spaCy Tailored Pipelines and spaCy Tailored Analysis]

In November, we followed up with the launch of our second new service:
spaCy Tailored Analysis. People often ask us for
help with problem solving, strategy and analysis for their applied NLP projects,
so we designed this new service to help with exactly these types of problems.

In August, we released the config system used by spaCy and Thinc as its own
lightweight package: confection!
Confection, our battle-tested config system for Python, can now easily be
included in any Python project without having to install Thinc.
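Here's a minimal sketch of what that looks like in practice, including the variable interpolation that spaCy's configs rely on:

```python
from confection import Config

# Parse a config string with sections and variable interpolation,
# then read the values back as a plain nested dict.
CONFIG_STR = """
[paths]
train = "corpus/train.spacy"

[training]
train_corpus = ${paths.train}
dropout = 0.2
"""

config = Config().from_str(CONFIG_STR)
print(config["training"]["train_corpus"])  # corpus/train.spacy
print(config["training"]["dropout"])       # 0.2
```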

[Image: Holmes structure diagram and Confection code example]

We’ve also added spaCy v3.4 support for English, German and Polish in
the v1.3.0 release
of the Coreferee library. Holmes, an information extraction component
based on predicate logic, was likewise updated to support spaCy v3.4 in its
v4.1.0 release.

We’ve worked hard on creating more resources that explain spaCy’s
implementation and architecture choices in further detail. On top of the
content produced for coref and spancat, Adriane has written an
interesting
blog post explaining floret, which
combines fastText and Bloom embeddings to create compact vector tables with both
word and subword information, enabling vectors that are up to 10× smaller than
traditional word vectors.
a technical report that benchmarks spaCy’s
hashing trick on various NER datasets in different scenarios. Finally, this
LinkedIn thread
by Vincent explains all you need to know about spaCy’s Vocab object and
its vectors.
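The core trick behind Bloom embeddings is easy to sketch in a few lines of numpy (with made-up sizes; floret's real implementation lives in C++): each key is hashed into several rows of a small table, and its vector is the sum of those rows, so even a huge vocabulary can share a compact table.

```python
import numpy as np

# Toy Bloom-embedding sketch: each key is hashed into num_hashes rows of
# a small table and embedded as the sum of those rows. Collisions are
# tolerated because each key gets a distinct *combination* of rows.
num_rows, dim, num_hashes = 1000, 32, 4
table = np.random.randn(num_rows, dim).astype("float32")

def embed(word: str) -> np.ndarray:
    rows = [hash((word, seed)) % num_rows for seed in range(num_hashes)]
    return table[rows].sum(axis=0)

# Millions of distinct words can map into just 1,000 rows.
print(embed("chocolate").shape)  # (32,)
```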

[Image: Multihash embeddings in spaCy paper and spaCy cheat sheet]

Only just getting started with spaCy or Prodigy? Our ever-popular “Advanced
NLP with spaCy” course has got you covered, and is now
available en français in addition to our existing
languages: English, German, Spanish, Portuguese, Japanese and Chinese. We’ve
also created a
spaCy cheat sheet,
packed with great features and practical tips so you can create projects at
lightning speed, and we revamped
Ines’ flowchart
containing our best practices for annotating and training Named Entity
Recognition models with Prodigy. The PDF version includes clickable links for
context and additional information.

[Image: Prodigy with PDFs and The Guardian blog image]

Already a pro? You might be pretty interested in The
Guardian case study report that Ryan and
team wrote. In order to modularize content for reuse, The Guardian’s data
science team developed a spaCy-Prodigy NER workflow for quote extraction. We
talked with The Guardian’s lead data scientist Anna Vissens about the project
for a fascinating blog post. And on the topic of expert content, our machine
learning engineer Lj
shows
how to integrate Hugging Face’s LayoutLMv3 model with Prodigy to tackle the
challenge of extracting information from PDFs.

[Image: bulk labeling and spaCy Shorts video thumbnails]

We’ve expanded our YouTube channel with two new playlists:
spaCy Shorts and Prodigy Shorts. As part of the
spaCy Shorts
series, Vincent walks you through various quick lessons on how to
speed up your pipeline execution via nlp.pipe,
how to
leverage linguistic features in a rule-based approach,
and much more. The bite-sized videos in the
Prodigy Shorts
playlist demonstrate how to configure the Prodigy UI for efficient annotations,
and
how to take advantage of Prodigy’s scriptable design.
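The nlp.pipe lesson boils down to a one-line change, sketched below (assuming en_core_web_sm is installed):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this pipeline is installed
texts = ["First document.", "Second document."] * 1000

# Slow: one call per text, paying the per-call overhead every time.
docs_slow = [nlp(text) for text in texts]

# Faster: nlp.pipe streams the texts and processes them in batches.
docs_fast = list(nlp.pipe(texts, batch_size=256))
```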

Interested in more nitty gritty details? In
one of his first videos with Explosion, Vincent
explains how to use Prodigy to train a named entity recognition model from
scratch by taking advantage of semi-automatic annotation and modern transfer
learning techniques. On the topic of efficient labeling,
this recent Prodigy video shows how you can use
a bulk labeling technique to prepare data for Prodigy and illustrates how a
pre-trained language model can help you annotate data. Finally,
this Prodigy video
shows how you might be able to improve the annotation experience by leveraging
sense2vec to pre-fill named entities.

We are always excited to talk about our vision for developer tools, our
design choices and our newly released features. In January, Ines
appeared on ZenML’s podcast Pipeline Conversations and
talked about creating tools that spark joy. She
also gave
the keynote
at the New Languages for NLP conference at
Princeton in May. Her talk covered the challenges for non-English NLP and how
spaCy allows you to develop advanced NLP pipelines, including for typologically
diverse languages. In June, she presented a nice recap of spaCy’s changes over
time on
Deepak John Reji’s D4 Data Podcast.

[Image: “Creating tools that spark joy” (Ines) and spancat (Victoria) talks]

Over at the
Data-aware Distributed Computing (DADC)
conference in July, Damian and Magda gave
a talk
on collecting high-quality adversarial data for machine reading comprehension
tasks with humans and models in the loop. Victoria and Damian also both gave
talks at PyData Global at the beginning of
December.

If you were ever curious about what some of us get up to at Explosion, as of
December we’ve added an events page to our website where you can
see our upcoming and past talks. If you want to meet us in person, learn
about our tools and maybe grab some stickers, check it out!

With the community and the team continuing to grow, we look forward to making 2023 even better. Thanks for all your support!


