Universal Dependencies v2.5 Benchmarks for spaCy


To demonstrate the performance of spaCy v3.2, we present a series of UD
benchmarks comparable to the
Stanza and
Trankit evaluations
on Universal Dependencies v2.5, using the
evaluation from the
CoNLL 2018 Shared Task.

The benchmarks show the competitive performance of spaCy’s core components for
tagging, parsing and sentence segmentation and also let us highlight and
evaluate the new edit tree lemmatizer. The trained
pipelines in the benchmarks are made available for download on
Explosion’s Hugging Face Hub repo and a
UD benchmark project lets you run the full training and evaluation
for any Universal Dependencies corpus.

The core syntactic annotation is performed by built-in spaCy components:

  • the tagger for fine-grained part-of-speech tags (XPOS)
  • the morphologizer for UPOS tags and morphological features
  • the parser for dependency parsing
  • the senter for sentence segmentation

Experimental components are used for tokenization and lemmatization:

  • a trainable character-based tokenizer
  • the edit tree lemmatizer

Aside from the tokenizer, the pipeline components are trained with a single
transformer component using xlm-roberta-base, similar to Trankit Base.
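In spaCy v3, this kind of weight sharing is configured with listener layers in the training config. As a rough sketch (the actual benchmark configs ship with the project linked below), the shared transformer and one listening component look something like this:

[components.transformer]
factory = "transformer"

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "xlm-roberta-base"

[components.tagger]
factory = "tagger"

[components.tagger.model]
@architectures = "spacy.Tagger.v1"

# The listener receives the shared transformer output instead of
# running its own embedding layer.
[components.tagger.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0

[components.tagger.model.tok2vec.pooling]
@layers = "reduce_mean.v1"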

While many spaCy pipelines are trained on Universal Dependencies corpora, we
haven’t published full Universal Dependencies benchmarks in the past because
spaCy v2 and v3 pipelines have primarily relied on rule-based components for
tokenization and lemmatization. These are good for speed in production, but
not for training from scratch for a language that only has partial support in
spaCy, or where spaCy’s defaults don’t align well with the corpus annotation
scheme.

Tokenization presents a particular problem, since every tokenization error
lowers the performance ceiling for all subsequent components. In order to give
spaCy’s core components a fair shake in comparison with other libraries, we
switch from a fast rule-based tokenizer to a slower trainable tokenizer that
doesn’t require any manual customization. This new
experimental tokenizer
uses spaCy’s built-in NER component under the hood to segment on the character
level, following the idea behind the
Elephant tokenizer (Evang et al. 2013).
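To make the framing concrete: each character in the text is treated as a unit, and every gold token becomes a labeled span over those characters, exactly like an entity span in NER. Here is a minimal illustration of the data transformation, not the spacy-experimental implementation itself:

# Sketch: recasting tokenization as character-level span prediction.
# Each gold token becomes a labeled span over character offsets.
text = "Let's go!"
gold_tokens = ["Let", "'s", "go", "!"]

spans, offset = [], 0
for tok in gold_tokens:
    start = text.index(tok, offset)
    spans.append((start, start + len(tok), "TOKEN"))
    offset = start + len(tok)

print(spans)  # [(0, 3, 'TOKEN'), (3, 5, 'TOKEN'), (6, 8, 'TOKEN'), (8, 9, 'TOKEN')]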

For lemmatization, we use the new experimental
edit tree lemmatizer, which we recently added
along with the experimental tokenizer to our new
spacy-experimental package,
where we plan to provide in-progress features and components while we refine and
evaluate them for inclusion in the core spaCy library.
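The idea behind edit trees is that a lemma can usually be derived from the surface form by copying a shared substring and rewriting the prefix and suffix around it; the lemmatizer then classifies each token into one of the edit trees induced from the training data. Here is a heavily simplified, depth-1 sketch of how such a rule can be induced (real edit trees recurse into the prefix and suffix):

from difflib import SequenceMatcher

def edit_rule(form: str, lemma: str) -> dict:
    # Find the longest substring shared by form and lemma ...
    m = SequenceMatcher(a=form, b=lemma).find_longest_match(0, len(form), 0, len(lemma))
    # ... and record how the surrounding prefix and suffix are rewritten.
    return {
        "prefix": (form[:m.a], lemma[:m.b]),
        "keep": form[m.a:m.a + m.size],
        "suffix": (form[m.a + m.size:], lemma[m.b + m.size:]),
    }

print(edit_rule("sang", "sing"))    # {'prefix': ('sa', 'si'), 'keep': 'ng', 'suffix': ('', '')}
print(edit_rule("running", "run"))  # {'prefix': ('', ''), 'keep': 'run', 'suffix': ('ning', '')}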

Multi-word tokens

The other remaining issue for spaCy and Universal Dependencies is multi-word
tokens (MWTs), which don’t fit well into spaCy’s Doc objects. A spaCy
Doc aligns each token directly with a series of
characters in the input text and it doesn’t support multiple token texts or
multiple tokenizations within the same document. As a result, it’s difficult
to implement an MWT expander for spaCy, especially for cases where the UD word
forms don’t correspond to the token form in the text, because the annotation
can’t easily be stored on the text-based tokens in the Doc.

Figure: UD vs. spaCy MWTs

For now, we side-step this mismatch and focus on UD corpora with no or few MWTs,
since this gives a more accurate impression of the performance of spaCy’s
pipeline components. For the corpora with a small number of MWTs, we use spaCy’s
CoNLL-U converter to merge MWTs into single
tokens that have the text of the original token with linguistic features merged
from the word annotations. The lower “Words” scores do cascade into the
remaining evaluation metrics, but you can get a better impression of the
performance of spaCy itself from the aligned accuracy scores.
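This merging is handled by the merge-subtokens option of spaCy’s convert command. If you’re converting a corpus yourself, the call looks roughly like this (the file paths are placeholders; the benchmark project below runs the conversion for you):

python -m spacy convert ./train.conllu ./corpus/ --converter conllu --merge-subtokens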

The pipeline has the following configuration, with the relevant UD evaluation
metrics noted for each component:

  • trainable character-based tokenizer (Tokens, Words)
  • tagger (XPOS)
  • morphologizer (UPOS, UFeats)
  • edit tree lemmatizer (Lemmas)
  • parser (UAS, LAS)
  • senter (Sentences)

The tokenizer is trained separately and the remaining components are trained
sharing the same transformer component using multi-task learning. The final
pipeline is assembled with the senter disabled by default so that sentence
boundaries are set by the parser, which is the same design used in
spaCy’s trained pipelines. We’ll see in the
evaluation where it makes sense to use the senter vs. the parser for sentence
segmentation.

We selected 28 UD v2.5 corpora to benchmark using this configuration. The
corpora share the following characteristics:

  • 20K+ training tokens
  • whitespace is used to separate tokens
  • no or few multi-word tokens
  • license permitting commercial use

The CoNLL 2018 evaluation metrics for UD v2.5 are shown for Stanza, Trankit and
spaCy in the following table.

  • The Stanza and Trankit numbers are copied from
    Trankit’s model performance overview.
  • spaCy’s CoNLL-U converter copies UPOS values to XPOS if XPOS is missing,
    so XPOS and AllFeats are omitted in the averages and in the evaluations
    for several corpora: Danish-DDT, French-Sequoia, Norwegian-Bokmaal,
    Norwegian-Nynorsk, Portuguese-Bosque.

In general, spaCy’s performance is very close to Trankit Base for larger corpora
and solidly in between Stanza and Trankit Base for smaller corpora. The
part-of-speech tags and morphological features are on par with Trankit Base
while UAS/LAS are slightly lower.

For smaller corpora and languages with rich morphology, spaCy’s edit tree
lemmatizer is slightly worse than Stanza’s seq2seq lemmatizer. For corpora where
lemmatization is primarily a segmentation task rather than a generation task
(Korean-GSD, Korean-Kaist), the edit tree lemmatizer outperforms Stanza.

For most UD corpora and especially for smaller corpora, spaCy’s separate
senter sentence segmenter performs better than the default parser-based
segmentation. In general, if sentence boundaries are marked by punctuation, the
senter component performs well, requiring much less training data than the
parser.
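Since the pipelines ship with the senter present but disabled, switching to senter-based segmentation only takes a couple of calls. A sketch, using the English pipeline installed below (note that disabling the parser also means you get no dependency parses):

import spacy

nlp = spacy.load("en_udv25_englishewt_trf")

# Use the senter for sentence boundaries instead of the parser.
nlp.disable_pipe("parser")
nlp.enable_pipe("senter")

doc = nlp("Punctuation marks the boundary. The senter picks it up.")
print([sent.text for sent in doc.sents])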

Install any udv25 pipeline from
Explosion’s Hugging Face Hub repo. Find the
link to install the pipeline at the top right under “Use in spaCy”.

Install the model from the Hugging Face Hub link

python -m pip install https://huggingface.co/explosion/en_udv25_englishewt_trf/resolve/main/en_udv25_englishewt_trf-any-py3-none-any.whl

Be aware that this will additionally install spacy-experimental to provide the
experimental tokenizer and lemmatizer. If you haven’t already installed
transformers, you might want to have a look at our recommended
installation steps.

Once the pipeline is installed, load it like any other spaCy pipeline:

import spacy

# Run the pipeline on the GPU if one is available (recommended for
# transformer-based pipelines).
spacy.prefer_gpu()

nlp = spacy.load("en_udv25_englishewt_trf")
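The UD annotations are then available through the usual token attributes:

doc = nlp("The fox jumped over the lazy dog.")

for token in doc:
    # lemma, UPOS tag, morphological features and dependency relation
    print(token.text, token.lemma_, token.pos_, token.morph, token.dep_)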

If you would like to train and evaluate the same pipelines yourself, start with
the
UD benchmark project:

Clone the project

python -m spacy project clone benchmarks/ud_benchmark

cd ud_benchmark

By default, this project trains a pipeline on UD_English-EWT:

Download data, train, assemble and evaluate

python -m spacy project assets

python -m spacy project run all

You can edit project.yml to switch to a different UD corpus or edit the
configs to try out different pipeline and training settings. See the full
spaCy project docs for more information on
working with the project assets, templates and remote storage.
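Switching corpora typically only means changing the variables at the top of project.yml. The variable names in this excerpt are hypothetical; check the project’s own project.yml for the actual ones:

vars:
  # Hypothetical variable names; see the project's project.yml.
  treebank: "UD_Danish-DDT"
  lang: "da"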

In addition, you can use
spacy-huggingface-hub to
upload spaCy pipelines to your own repo, complete with model cards generated
from the spaCy pipeline metadata.
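For example, after packaging a trained pipeline as a wheel, the upload is a single command (the paths and version numbers here are placeholders):

python -m pip install spacy-huggingface-hub

python -m spacy package ./pipeline ./output --build wheel

python -m spacy huggingface-hub push ./output/en_pipeline-0.0.1/dist/en_pipeline-0.0.1-py3-none-any.whl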

These pipelines are published for benchmarking purposes and are not intended for
production use. In production, a rule-based tokenizer for languages with
whitespace, or a language-specific word segmenter such as
SudachiPy for Japanese, is a
better choice than the experimental tokenizer, which is not optimized for speed
or memory use.

If you’re working with a specific language, you may be able to train a better,
smaller model with a language-specific transformer model in place of
xlm-roberta-base. spaCy can provide language-specific transformer
recommendations with spacy init config --lang lang --gpu config.cfg or in the
training quickstart with the GPU
option.
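For example, to generate a config with the recommended transformer for German:

python -m spacy init config config.cfg --lang de --gpu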


