Neural edit-tree lemmatization for spaCy


We are happy to introduce a new, experimental, machine learning-based lemmatizer
that posts accuracies above 95% for many languages. This lemmatizer learns to
predict lemmatization rules from a corpus of examples and removes the need to
write an exhaustive set of per-language lemmatization rules.

spaCy provides a Lemmatizer component for
assigning base forms (lemmas) to tokens. For example, it lemmatizes the
sentence

The kids bought treats from various stores.

to its base forms:

the kid buy treat from various store.
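
To make this concrete, here is a minimal sketch that reproduces the example above, assuming the small English pipeline en_core_web_sm has been downloaded; the exact lemmas may vary slightly between model versions:

Lemmatizing a sentence with spaCy

import spacy

# Assumes the pipeline was installed with: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("The kids bought treats from various stores.")
print([token.lemma_ for token in doc])
# Roughly: ['the', 'kid', 'buy', 'treat', 'from', 'various', 'store', '.']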

Lemmas are useful in many applications. For example, a search engine could use
lemmas to match all inflections of a base form. In this way, a query like buy
could match its inflections buy, buys, buying, and bought.

Lemma-based query example

English is one of the few languages with a relatively simple inflectional
morphology, so ignoring morphology or using a crude approximation like stemming
can work decently well for many applications, such as search engines. But for
most languages, you need good lemmatization just to get a sensible list of term
frequencies.

The spaCy lemmatizer uses two mechanisms for lemmatization for most languages:

  1. A lookup table that maps inflections to their lemmas. For example, the
    table could specify that buys is lemmatized as buy. The Lemmatizer
    component also supports lookup tables that are indexed by form and
    part-of-speech. This allows for different lemmatization of the same
    orthographic forms that have different word classes. For example, the verbal
    form chartered in they chartered a plane should be lemmatized as
    charter, whereas the adjective chartered in a chartered plane should be
    lemmatized as chartered.
  2. A rule set that rewrites a token to its lemma in certain constrained
    ways. For example, one specific rule could specify that a token that ends
    with the suffix -ed and has the part-of-speech tag VERB is lemmatized by
    removing the suffix -ed. The rules can only operate on the suffix of the
    token, so they are only suitable for simple morphological systems that are
    mostly concatenative, such as English.

These mechanisms can also be combined. For instance, a lookup table could be used
for irregular forms and a set of rules for regular forms.
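
To illustrate how these two mechanisms fit together, here is a toy Python sketch. It is purely illustrative and is not spaCy's actual implementation: it combines a part-of-speech-indexed lookup table for irregular forms with a single suffix rule for regular verbs.

A toy lookup-plus-rules lemmatizer

# Illustrative only: a toy lemmatizer combining a lookup table and a suffix rule.
# The lookup table is indexed by (form, part-of-speech) for irregular or
# ambiguous forms.
LOOKUP = {
    ("bought", "VERB"): "buy",
    ("chartered", "VERB"): "charter",
    ("chartered", "ADJ"): "chartered",
}

def lemmatize(form: str, pos: str) -> str:
    # 1. Irregular forms: consult the lookup table first.
    if (form, pos) in LOOKUP:
        return LOOKUP[(form, pos)]
    # 2. Regular forms: apply a simple rule, e.g. VERB ending in -ed -> drop -ed.
    if pos == "VERB" and form.endswith("ed"):
        return form[:-2]
    # Otherwise, return the form unchanged.
    return form

print(lemmatize("bought", "VERB"))    # buy (lookup)
print(lemmatize("walked", "VERB"))    # walk (suffix rule)
print(lemmatize("chartered", "ADJ"))  # chartered (lookup, adjective reading)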

The accuracy of the Lemmatizer component on a particular language depends on
how comprehensive the lookup table and rule set for that language are. Developing
a comprehensive rule set requires a fair amount of labor, even for linguists who
are familiar with the language.

Since corpora with lemma annotations are available for many languages, it would
be more convenient if a lemmatizer could infer lemmatization rules automatically
from a set of examples. Consider for example the Dutch past participle form
gepakt and its lemma pakken (to take). It is fairly straightforward to
come up with a rule for lemmatizing gepakt:

  1. Find the longest common substring (LCS) of the inflected form and its
    lemma. For gepakt and pakken, the LCS is pak. The longest common substring
    often captures the stem of the words.
  2. Split the inflected form and the lemma into three parts: the prefix, the
    LCS, and the suffix.
  3. Find the changes that need to be made to the prefix and suffix to go from
    the inflected form to the lemma:
    a. Replace the prefix ge- with the empty string ε
    b. Replace the suffix -t with the string -ken

3a and 3b would then together form a single lemmatization rule that works for
(most) regularly-inflected Dutch past participles that have the general form:
ge- [stem-ending-in-k] -t, such as gepakt or gelekt.
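
These steps are easy to sketch in Python. The following is an illustrative, non-recursive version that uses difflib to find the longest common substring; the function names extract_rule and apply_rule are just for this example:

Extracting a prefix/suffix rule from a form-lemma pair

from difflib import SequenceMatcher

def extract_rule(form: str, lemma: str):
    # Step 1: find the longest common substring (LCS) of the form and the lemma.
    m = SequenceMatcher(None, form, lemma).find_longest_match(0, len(form), 0, len(lemma))
    # Step 2: split both strings into prefix, LCS, and suffix.
    form_prefix, form_suffix = form[: m.a], form[m.a + m.size :]
    lemma_prefix, lemma_suffix = lemma[: m.b], lemma[m.b + m.size :]
    # Step 3: the rule replaces the form's prefix/suffix by the lemma's prefix/suffix.
    return (form_prefix, lemma_prefix), (form_suffix, lemma_suffix)

def apply_rule(form: str, rule):
    (form_prefix, lemma_prefix), (form_suffix, lemma_suffix) = rule
    assert form.startswith(form_prefix) and form.endswith(form_suffix)
    stem = form[len(form_prefix) : len(form) - len(form_suffix)]
    return lemma_prefix + stem + lemma_suffix

rule = extract_rule("gepakt", "pakken")
print(rule)                        # (('ge', ''), ('t', 'ken'))
print(apply_rule("gelekt", rule))  # lekken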

In practice, the rule-finding algorithm is a bit more complex, since there may
be multiple shared substrings. For example, the Dutch verb afpakken (to take
away)
contains a separable verb prefix af-. Its past participle is
afgepakt, so the past participle and the lemma have two shared substrings,
af and pak: afgepakt → afpakken. This is accounted for by using a
recursive version of the algorithm above. Rather than simply replacing the
string afge by af, we apply the algorithm to these two substrings as well.

This recursive algorithm and the corresponding rule representation were proposed
in
Joint Lemmatization and Morphological Tagging with Lemming
(Thomas Müller et al., 2015). The recursive data structure that the algorithm
produces is a so-called edit tree. Edit trees have two types of nodes:

Edit tree node legend

You could see these two types of nodes as small functions:

  • Interior node: splits a string into three parts: 1. a prefix of length
    n; 2. an infix; and 3. a suffix of length m. Then it applies its left
    child to the prefix and its right child to the suffix. Finally, it returns the
    concatenation of the transformed prefix, the infix, and the transformed
    suffix.
  • Leaf node: checks that the input string is s (otherwise, the tree is not
    applicable) and if so, returns t.

These two node types can be combined into a tree, which recursively rewrites
string prefixes and suffixes, while retaining infixes (which are substrings
shared by the token and its lemma). Below, you will find the edit tree that is
the result of applying the rule construction algorithm to the pair afgepakt
and afpakken.

Edit tree example

The grey nodes represent the edit tree itself. The purple and orange edges show
the prefixes and suffixes that are the inputs to the tree nodes when the tree is
applied to afgepakt. The black edges show the outputs of the tree nodes.

One nice property of edit trees is that they leave out as much of the surface
form as possible. For this reason, the edit tree also generalizes to other verbs
with the same inflectional pattern, such as afgeplakt (taped) or even
opgepakt (picked up or arrested).
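
To make the edit tree representation concrete, here is a self-contained Python sketch of the recursive construction and application. The class and function names are purely illustrative and do not match spaCy's internal implementation.

A sketch of edit tree construction and application

from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional, Union

@dataclass
class Leaf:
    s: str  # the substring this leaf expects as input
    t: str  # the replacement string it returns

@dataclass
class Interior:
    prefix_len: int
    suffix_len: int
    left: "EditTree"   # applied to the prefix
    right: "EditTree"  # applied to the suffix

EditTree = Union[Leaf, Interior]

def build_tree(form: str, lemma: str) -> EditTree:
    # Recursively construct an edit tree for a (form, lemma) pair.
    m = SequenceMatcher(None, form, lemma).find_longest_match(0, len(form), 0, len(lemma))
    if m.size == 0:
        # No shared substring left: record a literal replacement.
        return Leaf(form, lemma)
    # Recurse into the parts before and after the longest common substring.
    left = build_tree(form[: m.a], lemma[: m.b])
    right = build_tree(form[m.a + m.size :], lemma[m.b + m.size :])
    return Interior(m.a, len(form) - (m.a + m.size), left, right)

def apply_tree(tree: EditTree, s: str) -> Optional[str]:
    # Apply an edit tree to a string; return None if the tree is not applicable.
    if isinstance(tree, Leaf):
        return tree.t if s == tree.s else None
    prefix = s[: tree.prefix_len]
    suffix = s[len(s) - tree.suffix_len :] if tree.suffix_len else ""
    infix = s[tree.prefix_len : len(s) - tree.suffix_len]
    new_prefix = apply_tree(tree.left, prefix)
    new_suffix = apply_tree(tree.right, suffix)
    if new_prefix is None or new_suffix is None:
        return None
    return new_prefix + infix + new_suffix

tree = build_tree("afgepakt", "afpakken")
print(apply_tree(tree, "afgepakt"))   # afpakken
print(apply_tree(tree, "afgeplakt"))  # afplakken
print(apply_tree(tree, "opgepakt"))   # oppakken
print(apply_tree(tree, "huis"))       # None: the tree is not applicable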

Given a large corpus where tokens are annotated with their lemmas, we can use
the algorithm discussed earlier to extract an edit tree for each token–lemma
pair. This typically results in hundreds or thousands of unique edit trees for a
reasonably-sized corpus. The number of edit trees is much smaller than the
number of types (unique words), since most words are inflected following regular
patterns. However, how do we know which edit tree to apply when we are asked to
lemmatize a token?
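
As a toy illustration of this extraction step, the sketch below reuses the flat prefix/suffix rule from earlier as a stand-in for a full edit tree and counts the distinct rules in a tiny hand-made corpus:

Counting distinct edit trees in a toy corpus

from collections import Counter
from difflib import SequenceMatcher

def rule_for(form: str, lemma: str):
    # Flat stand-in for an edit tree: one prefix rewrite and one suffix rewrite.
    m = SequenceMatcher(None, form, lemma).find_longest_match(0, len(form), 0, len(lemma))
    return (form[: m.a], lemma[: m.b]), (form[m.a + m.size :], lemma[m.b + m.size :])

# A toy "corpus" of (token, lemma) pairs.
corpus = [
    ("gepakt", "pakken"), ("gelekt", "lekken"), ("geplakt", "plakken"),
    ("kids", "kid"), ("stores", "store"), ("bought", "buy"),
]

tree_counts = Counter(rule_for(form, lemma) for form, lemma in corpus)
print(tree_counts.most_common())
# The three Dutch participles share one rule and the two plurals share another,
# so the number of distinct rules is smaller than the number of types.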

Treating the task of picking the right edit tree as a classification task turns
out to work surprisingly well. In this approach, each edit tree is considered to
be a class and we use a Softmax layer to compute a probability distribution
over all trees for a particular token. We can then apply the most-probable edit
tree to lemmatize the token. If the most probable tree cannot be applied, there
is the option to back off to the next most probable tree.
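
In code, this prediction step looks roughly like the following hypothetical sketch, where each candidate edit tree is assumed to behave like the apply_tree function above and return None when it is not applicable:

Picking the most probable applicable edit tree

from typing import Callable, Optional, Sequence, Tuple

# Hypothetical: a tree is modeled as a callable that returns the lemma,
# or None if it is not applicable to the token.
Tree = Callable[[str], Optional[str]]

def predict_lemma(token: str, candidates: Sequence[Tuple[float, Tree]], top_k: int = 1) -> str:
    # Try the k most probable edit trees in order of decreasing probability.
    ranked = sorted(candidates, key=lambda c: c[0], reverse=True)[:top_k]
    for _prob, tree in ranked:
        lemma = tree(token)
        if lemma is not None:
            return lemma
    # If no tree applies, back off to the token's surface form.
    return token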

The quality of the predictions is largely determined by the hidden
representations that are provided to the softmax layer. These representations
should encode both subword and contextual information:

  • Subword information is relevant for choosing a tree that is applicable to
    the surface form. For instance, it does not make sense to apply the edit tree
    that was discussed above to tokens without the infix -ge-, the suffix -t,
    or a two-letter separable verb particle such as af.
  • Contextual information is needed to disambiguate surface forms. In many
    languages, the inflectional affixes are specific to a part-of-speech. So, in
    order to pick the correct edit tree for a token, a model also needs to infer
    its part-of-speech. For instance, walking in She is walking should be
    lemmatized as walk, whereas walking in I bought new walking shoes has
    walking as its lemma.
  • Sometimes it is also necessary to disambiguate the word sense in order to
    choose the correct edit tree. For example, axes can either be the plural of
    the noun axis or the plural of the noun axe. In order to pick the correct
    lemmatization, a model would first need to infer from the context which sense
    of axes was used.

Luckily, the venerable
HashEmbedCNN layer provides
both types of information to the classifier: word and subword representations
through the
MultiHashEmbed layer and
contextual information through the
MaxoutWindowEncoder
layer. Another good option for integrating both types of information is to use
transformer models provided through
spacy-transformers.

We have created a new experimental_edit_tree_lemmatizer component that
combines the techniques discussed in this post. We have also done experiments on
several languages to gauge how well this lemmatizer works. In these experiments,
we trained pipelines with the tok2vec, tagger (where applicable), and
morphologizer components, combined with either the default spaCy lemmatizer or
the new edit tree lemmatizer. The accuracies, as well as the CPU prediction
speeds in words per
second (WPS), are shown in the table below:

Language  Vectors                         Lemmatizer           Edit tree lemmatizer
                                          Accuracy   Speed¹    Accuracy   Speed¹
de        de_core_news_lg                 0.70       39,567    0.97       31,043
es        es_core_news_lg                 0.98       46,388    0.99       39,018
it        it_core_news_lg                 0.86       43,397    0.97       33,419
nl        nl_core_news_lg                 0.86       51,395    0.96       40,421
pl        pl_core_news_lg                 0.87       17,920    0.94       15,429
pt        pt_core_news_lg                 0.76       45,097    0.97       39,783
nl        xlm-roberta-base (transformer)  0.86        1,772    0.98        1,712
pl        xlm-roberta-base (transformer)  0.88        1,631    0.97        1,554

  ¹ Speeds are in words per second (WPS), measured on the test set using three
    evaluation passes as warmup.

For the tested languages, the edit tree lemmatizer provides considerable
improvements, generally posting accuracies above 95%.

We configured the edit tree lemmatizer to share the same token representations
as the other components in the pipeline, which means the benefits of the edit
tree lemmatizer are especially clear if you’re using a transformer model.
Transformers take longer to run, so the edit tree lemmatizer adds
proportionally less to the total runtime of the pipeline. Transformers
also supply more informative token representations, increasing the edit tree
lemmatizer’s accuracy advantage over the rule-based lemmatizer.

We should emphasize that the edit tree lemmatizer component is currently still
experimental. However, thanks to the
function registry support in
spaCy v3, it is easy to try out the new lemmatizer in your own pipelines. First
install the
spacy-experimental Python
package:

Installing the spacy-experimental package

pip install -U pip setuptools wheel
pip install spacy-experimental==0.4.0

You can then use the experimental_edit_tree_lemmatizer component factory:

Basic edit tree lemmatizer configuration

[components.experimental_edit_tree_lemmatizer]
factory = "experimental_edit_tree_lemmatizer"

That’s all! Of course, we encourage you to experiment with more than the default
model. First of all, you can change the behavior of the edit tree lemmatizer
using the options described in the table below:

backoff: The token attribute to use when the lemmatizer fails to find an applicable edit tree. The default is the orth attribute, i.e. the orthographic form.
min_tree_freq: The minimum frequency an edit tree must have in the training data to be included in the model.
top_k: The number of most probable trees to try for lemmatization before resorting to the backoff attribute.
overwrite: If enabled, the lemmatizer overwrites lemmas set by previous components in the pipeline.
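
If you prefer to set these options from Python rather than in the training config, a minimal sketch looks like this. It assumes spacy-experimental 0.4.0 is installed, which makes the component factory available; like any trainable component, it still has to be trained before it produces useful lemmas.

Adding the component from Python

import spacy

nlp = spacy.blank("nl")
nlp.add_pipe(
    "experimental_edit_tree_lemmatizer",
    config={
        "backoff": "orth",   # fall back to the surface form
        "min_tree_freq": 3,  # prune rare edit trees
        "top_k": 1,          # only try the most probable tree
        "overwrite": False,  # keep lemmas set by earlier components
    },
)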

Secondly, you can also share hidden representations between the edit tree
lemmatizer and other components by using
Tok2VecListener, as
shown in the example below. In many cases, joint training with components that
perform morphosyntactic annotation, such as
Tagger or
Morphologizer, can improve the accuracy
of the lemmatizer.

Edit tree lemmatizer configuration that uses a shared tok2vec component

[components.experimental_edit_tree_lemmatizer]
factory = "experimental_edit_tree_lemmatizer"
backoff = "orth"
min_tree_freq = 3
overwrite = false
top_k = 1

[components.experimental_edit_tree_lemmatizer.model]
@architectures = "spacy.Tagger.v1"
nO = null

[components.experimental_edit_tree_lemmatizer.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "tok2vec"

If you would rather start out with a ready-to-use example project, you can use
the
example project
for the edit tree lemmatizer. You can fetch this project with spaCy’s
project command and install the necessary
dependencies:

Get the sample project and install dependencies

python -m spacy project clone projects/edit_tree_lemmatizer \
  --repo https://github.com/explosion/spacy-experimental \
  --branch v0.4.0
cd edit_tree_lemmatizer
pip install spacy-experimental==0.4.0

The training and evaluation data can be downloaded with the
project assets command. The
lemmatizer can then be trained and evaluated using the run all workflow. The
project uses the Dutch Alpino treebank as provided by the
Universal Dependencies project by default.
So, the following commands will train and evaluate a Dutch lemmatizer:

Fetch data and train a lemmatization model

python -m spacy project assets
python -m spacy project run all

You can edit the config to try out different settings or change the pipeline to
your requirements, edit the project.yml file to use different data or add
preprocessing steps, and use spacy project push and spacy project pull to
persist intermediate results to a remote storage and share them amongst your
team.

We have made this new lemmatizer available through
spacy-experimental, our
package with experimental spaCy components. In the future, we would like to move
the functionality of the edit tree lemmatizer into spaCy. You can help make this
happen by trying out the edit tree lemmatizer and posting your experiences and
feedback to the
spaCy discussion forums.

Update: May 2022

As of spaCy v3.3, the edit tree lemmatizer is
a standard pipeline component called trainable_lemmatizer. Many of the
trained pipelines now use the trainable lemmatizer
instead of lookup-based lemmatizers.

You can try training your own lemmatizer with the
training quickstart or
spacy init config -p trainable_lemmatizer.
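
For example, with spaCy v3.3 or later the component can be added directly under its built-in name; as before, it has to be trained on lemma-annotated data before it produces lemmas:

Adding the built-in trainable lemmatizer

import spacy

# Requires spaCy v3.3 or later, where the edit tree lemmatizer is registered
# as the "trainable_lemmatizer" factory.
nlp = spacy.blank("nl")
nlp.add_pipe("trainable_lemmatizer")
# Train with `spacy train` on data that includes lemma annotations before use.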


