Introducing spaCy v3.2


We’re pleased to present v3.2 of the spaCy Natural Language
Processing library. Since v3.1 we’ve added usability improvements for custom
training and scoring, improved performance on Apple M1 and Nvidia GPU hardware,
and support for space-efficient vectors using
floret, our new hash embedding
extension to fastText.

The spaCy team has gotten a lot bigger this year, and we’ve got lots
of exciting features and examples coming up, including example projects for data
augmentation and model distillation, more examples of transformer-based
pipelines, and new components for coreference resolution and graph-based
parsing.

spaCy is now up to 8× faster on M1 Macs by calling into Apple’s
native Accelerate library for matrix multiplication. For more details, check out
thinc-apple-ops.

pip install spacy[apple]

Prediction speed of the de_core_news_lg pipeline on an M1 Mac Mini, an Intel
MacBook Air, and an AMD Ryzen 5900X, with and without thinc-apple-ops. Results
are in words per second.

| CPU | BLIS (words/s) | thinc-apple-ops (words/s) | Package power (W) |
| --- | --- | --- | --- |
| Mac Mini (M1) | 6,492 | 27,676 | 5 |
| MacBook Air Core i5 2020 | 9,790 | 10,983 | 9 |
| AMD Ryzen 5900X | 22,568 | n/a | 52 |

nlp and
nlp.pipe accept
Doc input, skipping the tokenizer if a Doc is
provided instead of a string. This makes it easier to create a Doc with custom
tokenization or to set custom extensions before processing:

Process a Doc object

from spacy.tokens import Doc

# Register the custom extension once, then set it before processing
Doc.set_extension("text_id", default=None)
doc = nlp.make_doc("This is text 500.")  # tokenize only
doc._.text_id = 500
doc = nlp(doc)  # the tokenizer is skipped, the remaining components run
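
The same works for batches: nlp.pipe also accepts Doc objects. A minimal
sketch, assuming nlp is loaded and the text_id extension has been registered as
above:

Process a batch of Doc objects (sketch)

texts = ["This is text 500.", "This is text 501."]
docs = [nlp.make_doc(text) for text in texts]
for text_id, doc in enumerate(docs, start=500):
    doc._.text_id = text_id
docs = list(nlp.pipe(docs))  # tokenizer skipped for Doc input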

To customize scoring, you can specify a scoring function for each component in
your config, drawn from the new scorers registry:

config.cfg (excerpt)

[components.tagger]
factory = "tagger"
scorer = {"@scorers": "spacy.tagger_scorer.v1"}
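
To plug in your own metric, register a scorer function in the scorers registry
and reference its name in the config. Here’s a minimal sketch; the name
my_tagger_scorer.v1 and the accuracy logic are illustrative assumptions, not
part of spaCy:

Register a custom scorer (sketch)

from typing import Any, Dict, Iterable
import spacy
from spacy.training import Example

@spacy.registry.scorers("my_tagger_scorer.v1")  # hypothetical name
def make_my_tagger_scorer():
    def score(examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
        # Plain tag accuracy over gold tags aligned to the predicted tokens
        correct = total = 0
        for example in examples:
            gold_tags = example.get_aligned("TAG", as_string=True)
            for token, gold in zip(example.predicted, gold_tags):
                if gold is not None:
                    total += 1
                    correct += int(token.tag_ == gold)
        return {"tag_acc": correct / total if total else 0.0}
    return score

The component would then reference it with
scorer = {"@scorers": "my_tagger_scorer.v1"} in its config block.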

We recently published floret, an
extended version of fastText that combines fastText’s subwords with Bloom
embeddings for compact, full-coverage vectors. The use of subwords means there
are no out-of-vocabulary (OOV) words, and thanks to Bloom embeddings the vector
table can be kept very small, at under 100K entries. Bloom embeddings are
already used by HashEmbed in tok2vec for compact spaCy models. For easy
integration, floret includes a Python wrapper:

pip install floret
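
The workflow, sketched below with placeholder file names: train floret vectors
on tokenized text, export the vector table, and import it into spaCy with init
vectors in floret mode.

Train and export floret vectors (sketch)

import floret

# Train on whitespace-tokenized text; the corpus file name is a placeholder
model = floret.train_unsupervised(
    "tokenized_corpus.txt",
    mode="floret",   # use Bloom (hash) embeddings instead of a full table
    hash_count=2,    # number of hashes per entry
    bucket=50000,    # size of the vector table
    minn=4,          # shortest subword n-gram
    maxn=5,          # longest subword n-gram
)
model.save_vectors("vectors.floret")

The exported table can then be loaded into a pipeline with
python -m spacy init vectors en vectors.floret ./my_vectors --mode floret.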

To get started, check out the pipelines/floret_vectors_demo project, which
trains toy English floret vectors and imports them into a spaCy pipeline. For
agglutinative languages like Finnish or Korean, subwords bring large
improvements in performance (no OOV words!) with a vector table of just 50K
entries.
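
In practice, “no OOV” means every string receives a vector composed from its
subword hashes rather than an all-zeros fallback. A quick check, assuming
you’ve trained and loaded a pipeline with floret vectors (the path below is a
placeholder):

Check vector coverage (sketch)

import spacy

nlp = spacy.load("training/model-best")  # placeholder path to a floret pipeline
doc = nlp("epätodennäköisyyksilläänkään")  # rare inflected Finnish form
# With floret vectors the norm is nonzero even for forms unseen in training
print(doc[0].vector_norm)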

Finnish example project with benchmarks

To try it out, clone the
pipelines/floret_fi_core_demo
project:

python -m spacy project clone pipelines/floret_fi_core_demo
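
From there, the standard spaCy projects workflow applies. The workflow name
all below is an assumption based on common project templates; check the
project’s project.yml for the exact commands:

cd floret_fi_core_demo
python -m spacy project assets   # fetch the project's source data
python -m spacy project run all  # run the workflow defined in project.yml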

Finnish UD+NER vector and pipeline training, comparing standard fastText vs.
floret vectors. Results use the default project settings: 1M tokenized training
texts (2.6GB) and 50K 300-dim vectors (~300K keys for the standard vectors):

| Vectors | TAG | POS | DEP UAS | DEP LAS | NER F |
| --- | --- | --- | --- | --- | --- |
| none | 93.5 | 92.4 | 80.1 | 73.0 | 61.6 |
| standard (pruned: 50K vectors for 300K keys) | 95.9 | 95.0 | 83.1 | 77.4 | 68.1 |
| standard (unpruned: 300K vectors/keys) | 96.4 | 95.0 | 82.8 | 78.4 | 70.4 |
| floret (minn 4, maxn 5; 50K vectors, no OOV) | 96.9 | 95.9 | 84.5 | 79.9 | 70.1 |

Results updated on Nov. 22, 2021 for floret v0.10.1.

Korean example project with benchmarks

To try it out, clone the
pipelines/floret_ko_ud_demo
project:

python -m spacy project clone pipelines/floret_ko_ud_demo

Korean UD vector and pipeline training, comparing standard fastText vs. floret
vectors. Results use the default project settings: 1M tokenized training texts
(3.3GB) and 50K 300-dim vectors (~800K keys for the standard vectors):

| Vectors | TAG | POS | DEP UAS | DEP LAS |
| --- | --- | --- | --- | --- |
| none | 72.5 | 85.3 | 74.0 | 65.0 |
| standard (pruned: 50K vectors for 800K keys) | 77.3 | 89.1 | 78.2 | 72.2 |
| standard (unpruned: 800K vectors/keys) | 79.0 | 90.3 | 79.4 | 73.9 |
| floret (minn 2, maxn 3; 50K vectors, no OOV) | 82.8 | 94.1 | 83.5 | 80.5 |

Results updated on Nov. 22, 2021 for floret v0.10.1.

spaCy v3.2 adds a new transformer pipeline package for Japanese,
ja_core_news_trf, which uses
the basic pretokenizer instead of MeCab to limit the number of dependencies
required for the pipeline. Thanks to Hiroshi Matsuda and the spaCy Japanese
community for their contributions!

The spaCy universe has seen some cool additions
since the last release! Here’s a selection of new plugins and extensions you can
install to add more power to your spaCy projects:

💬 spacy-clausie Implementation of the ClausIE information extraction system
🎨 ipymarkup Collection of NLP visualizations for NER and syntax tree markup
🌳 deplacy Tree visualizer for Universal Dependencies and Immediate Catena Analysis

The following packages have been updated with support for
spaCy v3:

🕵️‍♂️ holmes Information extraction from English and German based on predicate logic
🌐 spaCyOpenTapioca OpenTapioca wrapper for named entity linking on Wikidata
🇩🇰 DaCy State of the Art Danish NLP pipelines

View the spaCy universe
