Introducing spaCy v3.2


We’re pleased to present v3.2 of the spaCy Natural Language
Processing library. Since v3.1 we’ve added usability improvements for custom
training and scoring, improved performance on Apple M1 and Nvidia GPU hardware,
and support for space-efficient vectors using
floret, our new hash embedding
extension to fastText.

The spaCy team has gotten a lot bigger this year, and we’ve got lots
of exciting features and examples coming up, including example projects for data
augmentation and model distillation, more examples of transformer-based
pipelines, and new components for coreference resolution and graph-based
parsing.

spaCy is now up to 8× faster on M1 Macs by calling into Apple’s
native Accelerate library for matrix multiplication. For more details, check out
thinc-apple-ops.

pip install spacy[apple]

Prediction speed of the de_core_news_lg pipeline on an M1 Mac Mini, an Intel
MacBook Air, and an AMD Ryzen 5900X, with and without thinc-apple-ops. Results
are in words per second.

| CPU | BLIS (words/s) | thinc-apple-ops (words/s) | Package power (W) |
| --- | --- | --- | --- |
| Mac Mini (M1) | 6,492 | 27,676 | 5 |
| MacBook Air Core i5 2020 | 9,790 | 10,983 | 9 |
| AMD Ryzen 5900X | 22,568 | n/a | 52 |

nlp and
nlp.pipe accept
Doc input, skipping the tokenizer if a Doc is
provided instead of a string. This makes it easier to create a Doc with custom
tokenization or to set custom extensions before processing:

Process a Doc object

from spacy.tokens import Doc

# Register the custom extension once, then set it before processing
Doc.set_extension("text_id", default=None)
doc = nlp.make_doc("This is text 500.")  # tokenize only
doc._.text_id = 500
doc = nlp(doc)  # the tokenizer is skipped, the remaining components run
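
The same works for batches: nlp.pipe also accepts Doc objects. A minimal
sketch, assuming nlp is loaded and the text_id extension has been registered as
above:

Process a batch of Doc objects (sketch)

texts = ["This is text 500.", "This is text 501."]
docs = [nlp.make_doc(text) for text in texts]
for text_id, doc in enumerate(docs, start=500):
    doc._.text_id = text_id
docs = list(nlp.pipe(docs))  # tokenizer skipped for Doc input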

To customize scoring, you can specify a scoring function for each component in
your config, drawn from the new scorers registry:

config.cfg (excerpt)

[components.tagger]
factory = "tagger"
scorer = {"@scorers": "spacy.tagger_scorer.v1"}
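
To plug in your own metric, register a scorer function in the scorers registry
and reference its name in the config. Here’s a minimal sketch; the name
my_tagger_scorer.v1 and the accuracy logic are illustrative assumptions, not
part of spaCy:

Register a custom scorer (sketch)

from typing import Any, Dict, Iterable
import spacy
from spacy.training import Example

@spacy.registry.scorers("my_tagger_scorer.v1")  # hypothetical name
def make_my_tagger_scorer():
    def score(examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
        # Plain tag accuracy over gold tags aligned to the predicted tokens
        correct = total = 0
        for example in examples:
            gold_tags = example.get_aligned("TAG", as_string=True)
            for token, gold in zip(example.predicted, gold_tags):
                if gold is not None:
                    total += 1
                    correct += int(token.tag_ == gold)
        return {"tag_acc": correct / total if total else 0.0}
    return score

The component would then reference it with
scorer = {"@scorers": "my_tagger_scorer.v1"} in its config block.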

We recently published floret, an
extended version of fastText that combines fastText’s subwords with Bloom
embeddings for compact, full-coverage vectors. The use of subwords means there
are no out-of-vocabulary (OOV) words, and thanks to Bloom embeddings the vector
table can be kept very small, at under 100K entries. Bloom embeddings are
already used by HashEmbed in tok2vec for compact spaCy models. For easy
integration, floret includes a Python wrapper:

pip install floret
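
The workflow, sketched below with placeholder file names: train floret vectors
on tokenized text, export the vector table, and import it into spaCy with init
vectors in floret mode.

Train and export floret vectors (sketch)

import floret

# Train on whitespace-tokenized text; the corpus file name is a placeholder
model = floret.train_unsupervised(
    "tokenized_corpus.txt",
    mode="floret",   # use Bloom (hash) embeddings instead of a full table
    hash_count=2,    # number of hashes per entry
    bucket=50000,    # size of the vector table
    minn=4,          # shortest subword n-gram
    maxn=5,          # longest subword n-gram
)
model.save_vectors("vectors.floret")

The exported table can then be loaded into a pipeline with
python -m spacy init vectors en vectors.floret ./my_vectors --mode floret.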

To get started, check out the pipelines/floret_vectors_demo project, which
trains toy English floret vectors and imports them into a spaCy pipeline. For
agglutinative languages like Finnish or Korean, subwords bring large
improvements in performance (no OOV words!) with a vector table of just 50K
entries.
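
In practice, “no OOV” means every string receives a vector composed from its
subword hashes rather than an all-zeros fallback. A quick check, assuming
you’ve trained and loaded a pipeline with floret vectors (the path below is a
placeholder):

Check vector coverage (sketch)

import spacy

nlp = spacy.load("training/model-best")  # placeholder path to a floret pipeline
doc = nlp("epätodennäköisyyksilläänkään")  # rare inflected Finnish form
# With floret vectors the norm is nonzero even for forms unseen in training
print(doc[0].vector_norm)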

Finnish example project with benchmarks

To try it out, clone the
pipelines/floret_fi_core_demo
project:

python -m spacy project clone pipelines/floret_fi_core_demo
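
From there, the standard spaCy projects workflow applies. The workflow name
all below is an assumption based on common project templates; check the
project’s project.yml for the exact commands:

cd floret_fi_core_demo
python -m spacy project assets   # fetch the project's source data
python -m spacy project run all  # run the workflow defined in project.yml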

Finnish UD+NER vector and pipeline training, comparing standard fastText vs.
floret vectors. Results use the default project settings: 1M tokenized training
texts (2.6GB) and 50K 300-dim vectors (~300K keys for the standard vectors):

| Vectors | TAG | POS | DEP UAS | DEP LAS | NER F |
| --- | --- | --- | --- | --- | --- |
| none | 93.5 | 92.4 | 80.1 | 73.0 | 61.6 |
| standard (pruned: 50K vectors for 300K keys) | 95.9 | 95.0 | 83.1 | 77.4 | 68.1 |
| standard (unpruned: 300K vectors/keys) | 96.4 | 95.0 | 82.8 | 78.4 | 70.4 |
| floret (minn 4, maxn 5; 50K vectors, no OOV) | 96.9 | 95.9 | 84.5 | 79.9 | 70.1 |

Results updated on Nov. 22, 2021 for floret v0.10.1.

Korean example project with benchmarks

To try it out, clone the
pipelines/floret_ko_ud_demo
project:

python -m spacy project clone pipelines/floret_ko_ud_demo

Korean UD vector and pipeline training, comparing standard fastText vs. floret
vectors. Results use the default project settings: 1M tokenized training texts
(3.3GB) and 50K 300-dim vectors (~800K keys for the standard vectors):

| Vectors | TAG | POS | DEP UAS | DEP LAS |
| --- | --- | --- | --- | --- |
| none | 72.5 | 85.3 | 74.0 | 65.0 |
| standard (pruned: 50K vectors for 800K keys) | 77.3 | 89.1 | 78.2 | 72.2 |
| standard (unpruned: 800K vectors/keys) | 79.0 | 90.3 | 79.4 | 73.9 |
| floret (minn 2, maxn 3; 50K vectors, no OOV) | 82.8 | 94.1 | 83.5 | 80.5 |

Results updated on Nov. 22, 2021 for floret v0.10.1.

spaCy v3.2 adds a new transformer pipeline package for Japanese,
ja_core_news_trf, which uses
the basic pretokenizer instead of MeCab to limit the number of dependencies
required for the pipeline. Thanks to Hiroshi Matsuda and the spaCy Japanese
community for their contributions!

The spaCy universe has seen some cool additions
since the last release! Here’s a selection of new plugins and extensions you can
install to add more power to your spaCy projects:

💬 spacy-clausie Implementation of the ClausIE information extraction system
🎨 ipymarkup Collection of NLP visualizations for NER and syntax tree markup
🌳 deplacy Tree visualizer for Universal Dependencies and Immediate Catena Analysis

The following packages have been updated with support for
spaCy v3:

🕵️‍♂️ holmes Information extraction from English and German based on predicate logic
🌐 spaCyOpenTapioca OpenTapioca wrapper for named entity linking on Wikidata
🇩🇰 DaCy State of the Art Danish NLP pipelines

View the spaCy universe
