March 29, 2025


Introducing spaCy v3.3 · Explosion


We’re pleased to present v3.3 of the spaCy Natural Language
Processing library. spaCy v3.3 improves the speed of nearly all statistical
pipeline components, adds a trainable lemmatizer and includes new trained
pipelines for Finnish, Korean and Swedish.

spaCy v3.3 includes a range of optimizations that speed up all core pipeline
components in both training and inference. For longer texts, trained pipelines
are 15% or more faster at prediction. Detailed benchmarks for en_core_web_md
compare spaCy v3.2 and v3.3:

Speed Benchmarks: en_core_web_md

CPU                Avg. Words/Doc  v3.2 Words/Sec  v3.3 Words/Sec  Diff
Intel Xeon W-2265  100             17292           17441           0.86%
Intel Xeon W-2265  1000            15408           16024           4.00%
Intel Xeon W-2265  10000           12798           15346           19.91%
Apple M1           100             18272           18408           0.74%
Apple M1           1000            18794           19248           2.42%
Apple M1           10000           15144           17513           15.64%
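
The Diff column is consistent with the relative change in words per second
from v3.2 to v3.3. As a quick check, the short sketch below recomputes it from
the figures in the benchmark table; the helper name relative_speedup is ours
and is not part of spaCy.

```python
# Words/sec figures copied from the benchmark table: (v3.2, v3.3).
BENCHMARKS = {
    ("Intel Xeon W-2265", 100): (17292, 17441),
    ("Intel Xeon W-2265", 1000): (15408, 16024),
    ("Intel Xeon W-2265", 10000): (12798, 15346),
    ("Apple M1", 100): (18272, 18408),
    ("Apple M1", 1000): (18794, 19248),
    ("Apple M1", 10000): (15144, 17513),
}


def relative_speedup(old_wps: float, new_wps: float) -> float:
    """Return the percentage change in throughput from old to new."""
    return (new_wps - old_wps) / old_wps * 100


for (cpu, words_per_doc), (v32, v33) in BENCHMARKS.items():
    print(f"{cpu:<18} {words_per_doc:>6} {relative_speedup(v32, v33):6.2f}%")
```

The gain grows with document length on both CPUs, which matches the claim
about longer texts.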

The new trainable lemmatizer component uses edit trees to transform tokens
into lemmas. Try out the trainable lemmatizer with the training quickstart!
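
To give a feel for what an edit tree is, here is a toy sketch in plain Python.
It is not spaCy's implementation: the class and function names (ReplaceNode,
MatchNode, build_tree, apply_tree) are made up for illustration, and the
longest common substring is found with difflib from the standard library. The
idea is that a tree learned from one form/lemma pair records only what changes,
so it can be reused on other tokens that change in the same way.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional, Union


@dataclass(frozen=True)
class ReplaceNode:
    """Leaf: replace the exact string `old` with `new`."""
    old: str
    new: str


@dataclass(frozen=True)
class MatchNode:
    """Inner node: keep the middle of the string, recurse on both ends."""
    prefix_len: int  # characters before the kept substring
    suffix_len: int  # characters after the kept substring
    left: "Optional[Tree]"
    right: "Optional[Tree]"


Tree = Union[ReplaceNode, MatchNode]


def build_tree(form: str, lemma: str) -> Optional[Tree]:
    """Build a toy edit tree that rewrites `form` into `lemma`."""
    if not form and not lemma:
        return None  # nothing to edit
    matcher = SequenceMatcher(None, form, lemma, autojunk=False)
    a, b, size = matcher.find_longest_match(0, len(form), 0, len(lemma))
    if size == 0:
        return ReplaceNode(form, lemma)  # no shared characters left
    return MatchNode(
        prefix_len=a,
        suffix_len=len(form) - a - size,
        left=build_tree(form[:a], lemma[:b]),
        right=build_tree(form[a + size:], lemma[b + size:]),
    )


def apply_tree(tree: Optional[Tree], form: str) -> Optional[str]:
    """Apply a tree to `form`; return None if the tree does not fit."""
    if tree is None:
        return "" if form == "" else None
    if isinstance(tree, ReplaceNode):
        return tree.new if form == tree.old else None
    if tree.prefix_len + tree.suffix_len > len(form):
        return None
    cut = len(form) - tree.suffix_len
    left = apply_tree(tree.left, form[:tree.prefix_len])
    right = apply_tree(tree.right, form[cut:])
    if left is None or right is None:
        return None
    return left + form[tree.prefix_len:cut] + right


tree = build_tree("walked", "walk")
print(apply_tree(tree, "jumped"))   # the same "-ed" edit fits
print(apply_tree(tree, "running"))  # the tree does not fit this form
```

In this sketch the tree built from "walked" and "walk" only encodes "strip a
trailing ed", so it also lemmatizes "jumped" but refuses "running". A trainable
lemmatizer can then treat lemmatization as choosing the right tree for each
token in context.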

displaCy now supports overlapping span annotation from Doc.spans:

displaCy for overlapping spans
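
A minimal sketch of rendering overlapping spans, assuming spaCy v3.3 or later
is installed. The example text and labels are illustrative; the spans are
stored under the "sc" key, which is the key displaCy's span style reads by
default.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

# A blank English pipeline is enough: only the tokenizer is needed here.
nlp = spacy.blank("en")
doc = nlp("Welcome to the Bank of China.")

# Two overlapping spans: "Bank of China" and, inside it, "China".
doc.spans["sc"] = [
    Span(doc, 3, 6, "ORG"),
    Span(doc, 5, 6, "GPE"),
]

# jupyter=False forces render() to return the HTML markup as a string.
html = displacy.render(doc, style="span", jupyter=False)
print(html[:200])
```

To view the result in a browser instead, displacy.serve(doc, style="span")
starts a small local web server.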

v3.3 introduces new CPU/CNN pipelines for Finnish, Korean and Swedish, which
use the new trainable lemmatizer and floret vectors. Due to the use of Bloom
embeddings and subwords, the pipelines have compact vectors with no
out-of-vocabulary words.
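
The sketch below illustrates why hashing subwords into a small table leaves no
word without a vector. It is a toy model in plain Python, not floret's code:
the table size, dimensions, n-gram range, the BLAKE2 hash and the random table
values are all assumptions chosen for illustration, and the function names are
hypothetical.

```python
import hashlib
import random

NUM_ROWS = 1000      # toy table size, far smaller than a real vector table
DIM = 8              # toy vector width
NUM_HASHES = 2       # each subword is mapped to this many rows
MIN_N, MAX_N = 3, 5  # character n-gram lengths

# Stand-in for a trained table: fixed pseudo-random values.
_rng = random.Random(0)
TABLE = [[_rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_ROWS)]


def subwords(word: str) -> list:
    """Return the boundary-marked word plus its character n-grams."""
    marked = f"<{word}>"
    grams = [marked]
    for n in range(MIN_N, MAX_N + 1):
        grams.extend(marked[i:i + n] for i in range(len(marked) - n + 1))
    return grams


def rows_for(gram: str) -> list:
    """Hash one subword into NUM_HASHES row indices of the table."""
    rows = []
    for seed in range(NUM_HASHES):
        digest = hashlib.blake2b(
            gram.encode("utf8"), digest_size=8, salt=bytes([seed])
        ).digest()
        rows.append(int.from_bytes(digest, "big") % NUM_ROWS)
    return rows


def vector(word: str) -> list:
    """Average the table rows selected by all subwords of `word`."""
    rows = [r for gram in subwords(word) for r in rows_for(gram)]
    return [sum(TABLE[r][d] for r in rows) / len(rows) for d in range(DIM)]


# Any string maps to rows in the table, including words never seen before.
print(vector("epäjärjestelmällisyys"))
```

Because every word is looked up through hashed subwords rather than a
vocabulary entry, the table can stay small and there is no separate
out-of-vocabulary case to handle.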

The trained pipelines for the following languages switch from lookup or
rule-based lemmatizers to the new trainable lemmatizer:

Lemmatizer Accuracy (md Pipeline)

Many cool new plugins, extensions, pipelines and tutorials have been added to
the spaCy universe since v3.2:

View the spaCy universe
