To demonstrate the performance of spaCy v3.2, we present a series of UD
benchmarks comparable to the
Stanza and
Trankit evaluations
on Universal Dependencies v2.5, using the
evaluation from the
CoNLL 2018 Shared Task.
The benchmarks show the competitive performance of spaCy's core components for
tagging, parsing and sentence segmentation, and also let us highlight and
evaluate the new edit tree lemmatizer. The trained
pipelines in the benchmarks are made available for download on
Explosion’s Hugging Face Hub repo and a
UD benchmark project lets you run the full training and evaluation
for any Universal Dependencies corpus.
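If you want to score output yourself, the official scorer from the shared task
is a standalone script (conll18_ud_eval.py, available from the shared task
site) that compares a system CoNLL-U file against the gold file and reports the
metrics used below. A sketch of its usage, with placeholder file names:
Score a system CoNLL-U file against the gold annotation
python conll18_ud_eval.py gold.conllu system.conllu --verbose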
The core syntactic annotation is performed by built-in spaCy components: the
tagger, morphologizer, parser and senter. Experimental components are used for
tokenization and lemmatization: a trainable character-based tokenizer and the
edit tree lemmatizer.
Aside from the tokenizer, the pipeline components are trained with a single
transformer component using xlm-roberta-base, similar to Trankit Base.
While many spaCy pipelines are trained on Universal Dependencies corpora, we
haven't published full Universal Dependencies benchmarks in the past because
spaCy v2 and v3 pipelines have primarily relied on rule-based components for
tokenization and lemmatization. These components are good for speed in
production, but not for training from scratch for a language that only has
partial support in spaCy, or where spaCy's defaults don't align well with the
corpus annotation scheme.
Tokenization presents a particular problem, since every single error lowers the
ceiling for the performance of all the following components. In order to give
spaCy’s core components a fair shake in comparison with other libraries, we
switch from a fast rule-based tokenizer to a slower trainable tokenizer that
doesn’t require any manual customization. This new
experimental tokenizer
uses spaCy’s built-in NER component under the hood to segment on the character
level, following the idea behind the
Elephant tokenizer (Evang et al. 2013).
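To make that concrete, here is a minimal sketch of how token boundaries can be
recast as a character-level tagging problem. It only illustrates the label
encoding that a character-level NER model can learn to predict, not the actual
implementation of the experimental component:
def char_tags(text, token_spans):
    # Encode gold token boundaries as per-character B/I/O tags.
    # token_spans: list of (start, end) character offsets.
    tags = ["O"] * len(text)  # "O" = outside any token (e.g. spaces)
    for start, end in token_spans:
        tags[start] = "B-TOKEN"  # first character of a token
        for i in range(start + 1, end):
            tags[i] = "I-TOKEN"  # token-internal character
    return tags

# "Don't!" tokenized as Do | n't | !
print(char_tags("Don't!", [(0, 2), (2, 5), (5, 6)]))
# ['B-TOKEN', 'I-TOKEN', 'B-TOKEN', 'I-TOKEN', 'I-TOKEN', 'B-TOKEN']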
For lemmatization, we use the new experimental
edit tree lemmatizer, which we recently added
along with the experimental tokenizer to our new
spacy-experimental
package,
where we plan to provide in-progress features and components while we refine and
evaluate them for inclusion in the core spaCy library.
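The intuition behind edit trees is easy to see in a simplified form. Roughly
speaking, the component builds trees by recursively splitting form and lemma on
their longest common substring, and a classifier picks one of the trees
collected from the training data. In the common case the learned transformation
reduces to a suffix rule, which this rough sketch (not the component's actual
implementation) captures:
def suffix_rule(form, lemma):
    # Derive a (chars_to_cut, suffix_to_append) rule mapping form -> lemma.
    i = 0
    while i < min(len(form), len(lemma)) and form[i] == lemma[i]:
        i += 1  # length of the common prefix
    return len(form) - i, lemma[i:]

def apply_rule(form, rule):
    cut, append = rule
    return form[:len(form) - cut] + append

rule = suffix_rule("walking", "walk")  # (3, "")
print(apply_rule("talking", rule))     # "talk"
rule = suffix_rule("geht", "gehen")    # (1, "en")
print(apply_rule("steht", rule))       # "stehen"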
Multi-word tokens
The other remaining issue for spaCy and Universal Dependencies is multi-word
tokens (MWTs), which don't fit well into spaCy's Doc objects. A spaCy Doc
aligns each token directly with a series of characters in the input text, and
it doesn't support multiple token texts or multiple tokenizations within the
same document. As a result, it's difficult to implement an MWT expander for
spaCy, especially for cases where the UD word forms don't correspond to the
token form in the text, because the annotation can't be stored easily on the
text-based tokens in the Doc.
For now, we side-step this mismatch and focus on UD corpora with no or few MWTs,
since this gives a more accurate impression of the performance of spaCy’s
pipeline components. For the corpora with a small number of MWTs, we use spaCy’s
CoNLL-U converter to merge MWTs into single
tokens that have the text of the original token with linguistic features merged
from the word annotations. The lower “Words” scores do cascade into the
remaining evaluation metrics, but you can get a better impression of the
performance of spaCy itself from the aligned accuracy scores.
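If you want to reproduce this preprocessing step, the merging is exposed as a
flag on spaCy's convert CLI; the input file and output directory below are just
examples:
Convert a CoNLL-U file, merging multi-word tokens into single tokens
python -m spacy convert en_ewt-ud-train.conllu ./corpus --converter conllu --merge-subtokens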
The pipeline has the following configuration, with the relevant UD evaluation
metrics noted for each component:
- experimental_char_ner_tokenizer: Tokens, Sentences
- transformer: shared by all components below
- tagger: XPOS
- morphologizer: UPOS, UFeats
- parser: UAS, LAS
- senter: Sentences
- experimental_edit_tree_lemmatizer: Lemmas
The tokenizer is trained separately and the remaining components are trained
sharing the same transformer component using multi-task learning. The final
pipeline is assembled with the senter disabled by default so that sentence
boundaries are set by the parser, which is the same design used in spaCy's
trained pipelines. We'll see in the evaluation where it makes sense to use the
senter vs. the parser for sentence segmentation.
We selected 28 UD v2.5 corpora to benchmark using this configuration. The
corpora share the following characteristics:
- 20K+ training tokens
- whitespace is used to separate tokens
- no or few multi-word tokens
- license permitting commercial use
The CoNLL 2018 evaluation metrics for UD v2.5 are shown for Stanza, Trankit and
spaCy in the following table.
- The Stanza and Trankit numbers are copied from Trankit's model performance
overview.
- spaCy's CoNLL-U converter copies UPOS values to XPOS if XPOS is missing, so
XPOS and AllFeats are omitted in the averages and in the evaluations for
several corpora: Danish-DDT, French-Sequoia, Norwegian-Bokmaal,
Norwegian-Nynorsk, Portuguese-Bosque.
In general, spaCy’s performance is very close to Trankit Base for larger corpora
and solidly in between Stanza and Trankit Base for smaller corpora. The
part-of-speech tags and morphological features are on par with Trankit Base
while UAS/LAS are slightly lower.
For smaller corpora and languages with rich morphology, spaCy’s edit tree
lemmatizer is slightly worse than Stanza’s seq2seq lemmatizer. For corpora where
lemmatization is primarily a segmentation task rather than a generation task
(Korean-GSD, Korean-Kaist), the edit tree lemmatizer outperforms Stanza.
For most UD corpora and especially for smaller corpora, spaCy's separate senter
sentence segmenter performs better than the default parser-based segmentation.
In general, if sentence boundaries are marked by punctuation, the senter
component performs well, requiring much less training data than the parser.
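Switching a downloaded pipeline from parser-based to senter-based segmentation
uses spaCy's standard pipe management. A minimal sketch, using the English
pipeline from the next section as an example and assuming you only need
sentence boundaries rather than dependency parses:
import spacy

nlp = spacy.load("en_udv25_englishewt_trf")
nlp.enable_pipe("senter")   # the senter ships disabled by default
nlp.disable_pipe("parser")  # don't let the parser set boundaries

doc = nlp("Punctuation marks the end. The senter picks up on that.")
print([sent.text for sent in doc.sents])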
Install any udv25 pipeline from Explosion's Hugging Face Hub repo. Find the
link to install the pipeline at the top right under "Use in spaCy".
Install the model from the Hugging Face Hub link
python -m pip install https://huggingface.co/explosion/en_udv25_englishewt_trf/resolve/main/en_udv25_englishewt_trf-any-py3-none-any.whl
Be aware that this will additionally install spacy-experimental to provide the
experimental tokenizer and lemmatizer. If you haven't already installed
transformers, you might want to have a look at our recommended installation
steps.
Once the pipeline is installed, load it like any other spaCy pipeline:
import spacy
spacy.prefer_gpu()
nlp = spacy.load("en_udv25_englishewt_trf")
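The annotations then work like any other spaCy pipeline's output; for example:
doc = nlp("The quick brown fox jumps over the lazy dog.")
for token in doc:
    # Surface form, lemma, UPOS, XPOS and dependency label
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_)
print([sent.text for sent in doc.sents])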
If you would like to train and evaluate the same pipelines yourself, start with
the UD benchmark project:
Clone the project
python -m spacy project clone benchmarks/ud_benchmark
cd ud_benchmark
By default, this project trains a pipeline on UD_English-EWT:
Download data, train, assemble and evaluate
python -m spacy project assets
python -m spacy project run all
You can edit project.yml to switch to a different UD corpus or edit the configs
to try out different pipeline and training settings. See the full spaCy project
docs for more information on working with the project assets, templates and
remote storage.
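For example, switching the corpus comes down to changing the vars section at
the top of project.yml. The variable names below are purely illustrative, so
check the actual file for the real ones:
Hypothetical vars section in project.yml
vars:
  treebank: "UD_Danish-DDT"
  train_name: "da_ddt-ud-train"
  dev_name: "da_ddt-ud-dev"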
In addition, you can use spacy-huggingface-hub to upload spaCy pipelines to
your own repo, complete with model cards generated from the spaCy pipeline
metadata.
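Once you've packaged a pipeline as a wheel with spacy package --build wheel,
the upload is a single command; the wheel name below is a placeholder:
Push a packaged pipeline to the Hugging Face Hub
python -m pip install spacy-huggingface-hub
python -m spacy huggingface-hub push en_my_pipeline-0.0.1-py3-none-any.whl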
These pipelines are published for benchmarking purposes and are not intended
for production use. In production, a rule-based tokenizer for languages that
separate tokens with whitespace, or a language-specific word segmenter such as
SudachiPy for Japanese, is a better choice than the experimental tokenizer,
which is not optimized for speed or memory use.
If you’re working with a specific language, you may be able to train a better,
smaller model with a language-specific transformer model in place of
xlm-roberta-base
. spaCy can provide language-specific transformer
recommendations with spacy init config --lang lang --gpu config.cfg
or in the
training quickstart with the GPU
option.
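For example, to generate a transformer-based config for Danish (the pipeline
components here are only an illustration):
Generate a GPU config with a language-specific transformer recommendation
python -m spacy init config config.cfg --lang da --pipeline tagger,morphologizer,parser --gpu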