Towards a Tagalog NLP pipeline


Update (2023-12-06): I am happy to announce that two papers came out from this project: “Developing a Named Entity Recognition Dataset for Tagalog” and “calamanCy: A Tagalog Natural Language Processing Toolkit”.

Tagalog is my native language. It’s spoken
by 76 million Filipinos and has been the country’s official language since the
30s. It’s a text-rich language, but unfortunately, a
low-resource one. In the age of big data and large language models, building
NLP pipelines for Tagalog is still difficult.

In this blog post, I’ll talk about how I built a named-entity recognition (NER)
pipeline for Tagalog. I’ll discuss how I came up with a gold-standard dataset,
my benchmarking results, and my hopes for the future of Tagalog NLP.

I don’t recommend using this pipeline for production purposes yet. See caveats.


Tagalog NER data is scarce

Even if Tagalog is text-rich, the amount of annotated data for it is scarce. We usually call such languages low-resource. This problem isn't unique to Tagalog: out of the approximately 7,000 languages worldwide, only around 10 have adequate NLP resources (Mortensen, 2017; Tsvetkov, 2017). However, we can circumvent the data scarcity problem by bootstrapping the data we have.

We can circumvent the data scarcity problem…

There are many clever ways to circumvent the data scarcity problem. They usually
involve taking advantage of a high-resource language and transferring its
capacity to a low-resource one. The table below outlines some techniques:

| Approach | Data* | Prerequisites | Description |
| --- | --- | --- | --- |
| Supervised learning | High | Large amount of labeled data. | Train a model as usual using feature-label pairs. |
| Cross-lingual transfer learning | None to a few | Understanding of the similarities between source and target languages. | Transfer resources and models from a resource-rich source language to a resource-poor target language. |
| Zero-shot learning | None | High similarity between source and target domains. It often helps if the latter is a "subset" of the former. | Train a model in one domain and assume it generalizes out of the box to another domain. |
| Few-shot learning | A few to high | Similarity between source and target domains and a task-specific finetuning dataset. | Use a model pretrained on a large corpus, then finetune it on a specific task. |
| Polyglot learning | A few to high | A mixed corpus or a dataset with languages converted to a shared representation. | Combine resource-rich and resource-poor languages and train them together. |

* Data: amount of gold-labeled annotations required.

Table 1: List of techniques for low-resource NLP (Mortensen, 2017; Tsvetkov, 2017).

My work focuses on supervised and few-shot learning. Because these
methods require a substantial amount of data, I need to take advantage of
existing corpora. One way is to use silver-standard data. Silver-standard
annotations are usually generated by a statistical model trained from a similar
language or a knowledge base. They may not be accurate or trustworthy, but
they’re faster and cheaper to create.

…by bootstrapping the data we have

The best way to work with silver-standard data is to use it to bootstrap the annotations of a much larger and more diverse dataset, producing gold-standard annotations. Through this method, we reduce the cognitive load of labeling and focus more on correcting the model's outputs rather than labeling from scratch. The figure below illustrates the workflow I'm following:

By bootstrapping the annotations, we reduce the cognitive load of labeling
and focus more on correcting the model’s outputs rather than labeling from scratch.

The only major NER dataset for Tagalog is WikiANN. It is a silver-standard dataset based on an English knowledge base (KB). Pan, Zhang, et al. (2017) created a framework for tagging entities based on Wikipedia and extended it to 282 languages, including Tagalog. However, it's not perfect. For example, the first few entries of the validation set have glaring errors:

Example: Hinlalato ng paa is the middle toe, not an ORG.

Example: Ninoy Aquino should be tagged as PER, while Sultan Kudarat should be LOC.
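If you want to inspect these examples yourself, here's a minimal sketch using the Hugging Face datasets library. I'm assuming the dataset is published on the Hub under the wikiann id with a tl configuration and the usual tokens/ner_tags fields; adjust if your version differs.

```python
from datasets import load_dataset

# WikiANN on the Hugging Face Hub has a Tagalog ("tl") configuration.
wikiann_tl = load_dataset("wikiann", "tl", split="validation")

# Map the integer NER tags back to their IOB2 label names (B-PER, I-ORG, ...).
label_names = wikiann_tl.features["ner_tags"].feature.names

# Print the first few validation entries to eyeball the silver annotations.
for example in wikiann_tl.select(range(3)):
    tags = [label_names[t] for t in example["ner_tags"]]
    print(list(zip(example["tokens"], tags)))
```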

Also, the texts themselves aren't complete sentences. A model trained on this data might translate poorly to longer documents as the context of an entity is lost. We can't rely solely on a model trained from WikiANN. However, it can still be useful: we can use it to train a model that bootstraps our annotations.

…the texts [in WikiANN] aren’t complete sentences. A model trained on this
data might translate poorly to longer documents…so we can’t [just] rely [on it].

Fortunately, we have a lot of unannotated datasets that represent the diversity of the Filipino language. For example, the CommonCrawl repository contains web-crawled data for many languages, including Tagalog. We also have the more recent TLUnified (Cruz and Cheng, 2022) and WikiText TL-39 (Cruz and Cheng, 2019). For my experiments, I will use the TLUnified dataset because one of its subdomains (news) resembles that of standard NER benchmarks like CoNLL. It contains reports from 2009 to 2020 scraped from major Filipino-language news sites in the Philippines, both broadsheets and their associated tabloids (e.g., CNN Philippines, ABS-CBN, Manila Times).

I will be using the TLUnified dataset as it's more recent, and one of its subdomains resembles that of standard NER benchmarks like CoNLL.

My process goes like this: I will train a model from WikiANN and have it predict entities in TLUnified. Then, I will correct the predictions using Prodigy, an annotation tool, to produce gold-standard annotations. Piece of cake, right?
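Concretely, the bootstrapping step can look something like the sketch below: run the WikiANN-trained (silver) model over raw TLUnified texts and write its predictions out as pre-annotations. The model path and file names are placeholders, and the span format follows Prodigy's JSONL conventions as I understand them; Prodigy's built-in ner.correct recipe bundles this pre-annotate-then-correct loop if you prefer to let it drive the model directly.

```python
import json

import spacy

# Placeholder path: a spaCy NER model trained on the WikiANN (tl) silver data.
nlp = spacy.load("./models/wikiann_tl_silver")

with open("tlunified_raw.txt", encoding="utf-8") as f_in, \
        open("tlunified_preannotated.jsonl", "w", encoding="utf-8") as f_out:
    texts = (line.strip() for line in f_in if line.strip())
    for doc in nlp.pipe(texts, batch_size=64):
        task = {
            "text": doc.text,
            # Pre-annotated spans to accept, reject, or correct in the annotation UI.
            "spans": [
                {"start": ent.start_char, "end": ent.end_char, "label": ent.label_}
                for ent in doc.ents
            ],
        }
        f_out.write(json.dumps(task, ensure_ascii=False) + "\n")
```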

However, labeling thousands of samples is not the hardest part. As the sole annotator, I can easily imprint my own biases and errors on the dataset. In practice, you'd want three or more annotators and a measure of inter-annotator agreement. Unfortunately, this is a limitation of this work. In the next section, I'll outline some of my attempts to be more objective when annotating. Of course, the ideal case is to have multiple annotators, so let me know if you want to help out!

As the sole annotator, I can easily imprint my own biases and errors on the dataset. This is a limitation of this work.

We still want gold-annotated data

This section will discuss how I annotated TLUnified to produce gold-standard data. I'll also introduce my benchmarking experiments to see how baseline models perform on this dataset. For clarity, I'll call the annotated TLUnified tl_tlunified_gold (tl – language code, tlunified – data source, gold – dataset type).

I corrected annotations from a silver model…

For the past three months, I corrected annotations produced by the WikiANN model. I learned that, as an annotator, it's easier to fix annotations than to label them from scratch. I also devised annotation guidelines (Artstein, 2017) to make the annotation process more objective. Professor Nils Reiter has an excellent guide for developing these. Lastly, I also took inspiration from The Guardian's work, which uses Prodigy for quotation detection.

For TLUnified, I used three labels for annotation:

  • PER: person entities, limited to humans. May be a single individual or a group (e.g., Juan de la Cruz, Nene, mga abogado).
  • ORG: organization entities, limited to corporations, agencies, and other groups of people defined by an organizational structure (e.g., United Nations, DPWH, Meralco).
  • LOC: location entities, covering geographical regions, areas, and landmasses. Subtypes of geo-political entities (GPE) are also included in this group (e.g., Pilipinas, Manila, Luzon).

Again, you can check the full annotation guidelines to see the nuances in labeling these entities.
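To make the labels concrete, here's a rough sketch of how corrected annotations with these three labels can be converted into spaCy's binary training format. The JSONL file name and span keys are assumptions based on the export format described above, not the exact script in the project repository.

```python
import json

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("tl")  # spaCy ships a blank Tagalog ("tl") language class
db = DocBin()

with open("tl_tlunified_gold_train.jsonl", encoding="utf-8") as f:  # placeholder name
    for line in f:
        record = json.loads(line)
        doc = nlp.make_doc(record["text"])
        entities = []
        for span in record.get("spans", []):
            # Keep only the three labels defined in the annotation guidelines.
            char_span = doc.char_span(
                span["start"], span["end"], label=span["label"], alignment_mode="contract"
            )
            if char_span is not None and char_span.label_ in {"PER", "ORG", "LOC"}:
                entities.append(char_span)
        doc.ents = entities
        db.add(doc)

db.to_disk("corpus/train.spacy")
```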

Since there are still gaps in my annotation process, the annotations produced in
v1.0 of tl_tlunified_gold are not ready for production. Getting multiple
annotations and developing an inter-annotator agreement for several iterations
is the ideal case.

Nevertheless, I produced annotations for around 7,000 documents. I split them into training, development, and test partitions and uploaded v1.0 of the raw annotations to the cloud. You can access the raw annotations and replicate the preprocessing step by checking out the GitHub repository of this project. The table below shows some dataset statistics:

| Tagalog Data | Documents | Tokens | PER | ORG | LOC |
| --- | --- | --- | --- | --- | --- |
| Training Set | \(6252\) | \(198588\) | \(6418\) | \(3121\) | \(3296\) |
| Development Set | \(782\) | \(25007\) | \(793\) | \(392\) | \(409\) |
| Test Set | \(782\) | \(25153\) | \(818\) | \(423\) | \(438\) |

Table 2: Dataset statistics for v1.0 of tl_tlunified_gold
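The counts above can be recomputed from the converted .spacy files with a few lines like the sketch below (the file name is a placeholder):

```python
from collections import Counter

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("tl")
docs = list(DocBin().from_disk("corpus/train.spacy").get_docs(nlp.vocab))

n_tokens = sum(len(doc) for doc in docs)
label_counts = Counter(ent.label_ for doc in docs for ent in doc.ents)

print(f"Documents: {len(docs)}, Tokens: {n_tokens}")
print({label: label_counts[label] for label in ("PER", "ORG", "LOC")})
```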

…then benchmarked it with baseline NER approaches

I want to see how standard NER approaches fare with tl_tlunified_gold. My eventual goal is to set up training pipelines that produce decent Tagalog models from this dataset. I ran two sets of experiments, one involving word vectors and the other using transformers, to identify the best training setup for this Tagalog corpus. I'm not pitting one against the other; I want to set up training pipelines for both in the future.

My eventual goal is to identify the best training setup for this Tagalog corpus. I'm not pitting one against the other; I want to set up training pipelines for both in the future.

First, I want to benchmark several word vector settings for NER. The baseline approach simply trains a model from the training and dev data, nothing more. Then, I will examine whether adding word vectors (also called static vectors in spaCy) can improve performance. Finally, I will investigate whether pretraining can push performance further:

| Approach | Setup | Description |
| --- | --- | --- |
| Supervised learning | Baseline | Train an NER model from scratch. No tricks, just the annotated data. |
| Supervised learning | Baseline + fastText | Source fastText vectors for the downstream NER task. |
| Supervised learning | Baseline + fastText + pretraining | Pretrain spaCy's token-to-vector layer while sourcing fastText vectors. |

Table 3: Experimental setup for word vectors. I will use the default spaCy hyperparameters for all these experiments.

The figure below shows the default setup for our word vector pipeline. The baseline approach won't have any word vectors or pretraining (Baseline). Then, we will use word vectors as additional features for training our statistical model (Baseline + fastText). Lastly, we will use pretraining to initialize the weights of our components (Baseline + fastText + pretraining).
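As a sketch, the three setups differ only in what gets passed to the training config. Assuming a quickstart-style config with the usual paths.vectors and paths.init_tok2vec variables, and a spaCy version that exposes the training loop through spacy.cli.train (all file paths below are placeholders), the comparison looks roughly like this:

```python
from spacy.cli.train import train

data = {"paths.train": "corpus/train.spacy", "paths.dev": "corpus/dev.spacy"}

# Baseline: no static vectors, no pretrained weights.
train("configs/ner.cfg", "training/baseline", overrides=dict(data))

# Baseline + fastText: add static vectors (prepared earlier with `spacy init vectors`).
train(
    "configs/ner.cfg",
    "training/fasttext",
    overrides={**data, "paths.vectors": "vectors/fasttext_tl"},
)

# Baseline + fastText + pretraining: also initialize tok2vec from `spacy pretrain` weights.
train(
    "configs/ner.cfg",
    "training/fasttext_pretrain",
    overrides={
        **data,
        "paths.vectors": "vectors/fasttext_tl",
        "paths.init_tok2vec": "pretraining/weights.bin",  # placeholder checkpoint path
    },
)
```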

Next, I will measure the performance of a monolingual and a multilingual language model. I'm using transformer models as a drop-in replacement for the representation layer to achieve higher accuracy:

| Approach | Language Model | Description |
| --- | --- | --- |
| Few-shot learning | roberta-tagalog | Monolingual experiment with a RoBERTa model trained on TLUnified. I will test both base and large variants. |
| Few-shot learning | xlm-roberta | Multilingual experiment with XLM-RoBERTa, trained on 100 different languages. I will test both base and large variants. |

Table 4: Experimental setup for language models

Again, I want to use what I learned from these experiments to set up a training scheme for a Tagalog pipeline down the road, something akin to en_core_web_lg or en_core_web_trf in spaCy, but for Tagalog. For all the experiments above, I will use spaCy's transition-based parser for sequence labeling.
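As a rough sketch of what that drop-in replacement looks like in code, assuming spacy-transformers is installed and that roberta-tagalog lives on the Hugging Face Hub under jcblaise/roberta-tagalog-base (the real project wires this up through the training config, which has more knobs than shown here):

```python
import spacy
import spacy_transformers  # noqa: F401  (provides the "transformer" factory)

nlp = spacy.blank("tl")

# Replace the usual tok2vec embedding layer with a transformer component.
nlp.add_pipe(
    "transformer",
    config={"model": {"name": "jcblaise/roberta-tagalog-base"}},
)

# Point the NER model's embedding sub-layer at the transformer's output
# via a listener, instead of a standalone tok2vec layer.
nlp.add_pipe(
    "ner",
    config={
        "model": {
            "tok2vec": {
                "@architectures": "spacy-transformers.TransformerListener.v1",
                "grad_factor": 1.0,
                "pooling": {"@layers": "reduce_mean.v1"},
            }
        }
    },
)
```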

Experimental Results

The results below aim to inform eventual design decisions for building NLP pipelines for Tagalog. I plan to create a word vector-based and a language model-based training setup. If you're interested in replicating my results, check out the spaCy project on GitHub!

Lastly, because we’re doing a bit of hyperparameter tuning here (choosing the
proper config, etc.), I will report the results as evaluated on the dev set to
avoid overfitting.

The results below aim to inform eventual design decisions for building NLP pipelines for Tagalog. I plan to create a word vector-based and a language model-based training setup.

I ran each experiment for three trials, and I will report their mean and
standard deviation.

Finding the best word vector training setup

To find the best word vector training setup, I designed an experiment to test
how static vectors and pretraining can improve performance. The baseline
approach has none of these; it simply trains a model from scratch. Then, I sourced
static vectors and, eventually, a set of pretrained weights. Within these two, there
are still design choices left to be made:

  • On static vectors: by default, I'm using the vectors available from the fastText website, which were trained on CommonCrawl and Wikipedia. I'd like to know whether it matters if I train my own fastText vectors from TLUnified, and whether there are efficiency gains from using floret vectors.

  • On pretraining: by default, my pretraining objective is based on characters, i.e., the model is tasked to predict some number of leading and trailing UTF-8 bytes for the words. However, spaCy also provides another pretraining objective based on a static embeddings table. I'd like to know which of the two is more performant.

The table below shows the performance of the three training scenarios using default settings. The results suggest that combining static vectors and pretraining improves the F1-score by around 2 percentage points (pp).

| Setup | Precision | Recall | F1-score |
| --- | --- | --- | --- |
| Baseline | \(0.87\pm0.01\) | \(0.87\pm0.01\) | \(0.87\pm0.00\) |
| Baseline + fastText* | \(\mathbf{0.89\pm0.01}\) | \(0.86\pm0.01\) | \(0.88\pm0.00\) |
| Baseline + fastText* + pretraining | \(\mathbf{0.89\pm0.01}\) | \(\mathbf{0.89\pm0.01}\) | \(\mathbf{0.89\pm0.00}\) |

* 714k keys and unique vectors. Vectors were sourced from the fastText website.

Table 5: Pipeline performance using default hyperparameters. Evaluated on the development set.

It’s also worth noting that the baseline result (87% F-score) where we trained a
model without any static vectors nor pretrained weights performs well! Of
course, we want to push these scores further so in the next sections, I’ll be
exploring alternative configurations for our word vector pipeline.

On static vectors: it is worth training floret vectors

In the previous experiment, I used the vectors provided by the fastText website, which were trained on CommonCrawl and Wikipedia. I'm curious whether I can achieve better performance if I train my own vectors.

I trained two sets of word vectors from TLUnified, one based on fastText and the other on floret. Both were trained using the skipgram model, with a dimension of \(200\) and subword minimum (minn) and maximum (maxn) sizes of \(3\) and \(5\), respectively.
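For reference, here's a minimal sketch of that vector training with the floret Python bindings. The corpus path, the hash count, and the name of the saving helper are assumptions based on my reading of the floret README; the commented CLI line mirrors spaCy's documented init vectors --mode floret step.

```python
import floret

# Skipgram vectors: dim 200, subword n-grams of length 3-5, 200k hash buckets.
model = floret.train_unsupervised(
    "tlunified_for_vectors.txt",  # placeholder: TLUnified minus the annotated texts
    model="skipgram",
    mode="floret",    # floret's hash-table mode instead of plain fastText
    hashCount=2,      # assumed: two hashes per entry
    bucket=200000,
    dim=200,
    minn=3,
    maxn=5,
)
model.save_vectors("tl_vectors.floret")

# Load into a spaCy vocab afterwards, e.g.:
#   spacy init vectors tl tl_vectors.floret ./vectors_tl_floret --mode floret
```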

Lastly, I also removed the annotated texts from TLUnified before training the vectors to ensure that no overlaps would influence the benchmark results. The results can be seen in the table below:

| Word Vectors | Unique Vectors* | Precision | Recall | F1-score |
| --- | --- | --- | --- | --- |
| fastText (default: CommonCrawl + Wikipedia) | \(714k\) | \(\mathbf{0.89\pm0.01}\) | \(0.86\pm0.01\) | \(\mathbf{0.88\pm0.00}\) |
| fastText (TLUnified) | \(566k\) | \(\mathbf{0.89\pm0.01}\) | \(\mathbf{0.88\pm0.00}\) | \(\mathbf{0.88\pm0.01}\) |
| floret (TLUnified) | \(\mathbf{200k}\) | \(0.88\pm0.01\) | \(\mathbf{0.88\pm0.01}\) | \(\mathbf{0.88\pm0.00}\) |

* This time, we’re talking about unique vectors, not keys. Several keys can map to the same vectors, and floret doesn’t use the keys table.

Table 6: Even with a smaller vector table size, floret is competitive with the default fastText vectors. Evaluated on the development set.

We can also inspect the cosine similarity of subword pairs between related and unrelated terms. Here, I'm using the vectors from fastText and floret, both trained on TLUnified:
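A quick way to run this kind of spot check, assuming the vectors have already been loaded into a blank tl pipeline with spacy init vectors (the pipeline path and word pairs below are just illustrative):

```python
import spacy

# Placeholder path: a blank "tl" pipeline initialized with the trained vectors.
nlp = spacy.load("./vectors_tl_floret")

pairs = [
    ("aso", "asong"),      # related: "dog" and its linker form
    ("barangay", "brgy"),  # related: full word and its common abbreviation
    ("aso", "ospital"),    # unrelated: "dog" vs. "hospital"
]
for word_a, word_b in pairs:
    similarity = nlp.vocab[word_a].similarity(nlp.vocab[word_b])
    print(f"{word_a!r} vs {word_b!r}: {similarity:.2f}")
```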

The results suggest that the floret vectors keep the correlation between subtokens intact despite the smaller table size (\(700k\) → \(200k\)). So perhaps training my own fastText vectors isn't worth it, but exploring floret is. In addition, this efficiency gain comes with little to no performance penalty, as seen in Table 6. So let's see what happens if I train floret with bucket sizes of 100k, 50k, and 25k:

| Unique Vectors | Precision | Recall | F1-score |
| --- | --- | --- | --- |
| \(200k\) | \(\mathbf{0.88\pm0.01}\) | \(\mathbf{0.88\pm0.01}\) | \(\mathbf{0.88\pm0.00}\) |
| \(100k\) | \(\mathbf{0.88\pm0.00}\) | \(0.87\pm0.00\) | \(\mathbf{0.88\pm0.01}\) |
| \(50k\) | \(0.86\pm0.01\) | \(0.85\pm0.01\) | \(0.85\pm0.00\) |
| \(25k\) | \(0.82\pm0.02\) | \(0.82\pm0.01\) | \(0.81\pm0.00\) |

Table 7: floret vector performance at different vector table sizes. All vectors were trained using TLUnified. Evaluated on the development set.

There was a slight degradation in performance when I reduced the bucket size from \(200k\) to \(25k\). It's not as drastic as I expected, but the pattern is interesting to see. There's even a case for using \(100k\) rows in floret, but for now, I'll stick with \(200k\).

On pretraining: there is no significant difference between the two pretraining objectives

spaCy provides two optimization objectives for pretraining the token-to-vector weights from raw data: PretrainCharacters and PretrainVectors. Both use a trick called language modeling with approximate outputs (LMAO), in which we force the network to model something about word co-occurrence.

Using their default values, I ran an experiment that compares the two. Similar
to the previous experiment, I also removed overlaps between the final dataset
and the pretraining corpus to ensure that they won’t affect the results:

| Pretraining objective | Precision | Recall | F1-score |
| --- | --- | --- | --- |
| PretrainCharacters | \(0.89\pm0.01\) | \(\mathbf{0.89\pm0.01}\) | \(0.89\pm0.00\) |
| PretrainVectors | \(\mathbf{0.90\pm0.01}\) | \(\mathbf{0.89\pm0.00}\) | \(\mathbf{0.90\pm0.00}\) |

Table 8: Performance comparison between the two pretraining objectives (characters vs. vectors). Evaluated on the development set.

The results suggest that there is no significant difference between the two. PretrainVectors has a slight edge in precision, but the difference is marginal. However, for an agglutinative language like Tagalog, the pipeline might benefit from a model with some knowledge of a word's affixes, so I'll use PretrainCharacters for the final pipeline.

Some notes for the final word vector pipeline

In the future, I hope to create pipelines akin to spaCy's en_core_web_md or en_core_web_lg, but for Tagalog. For now, I'll settle on the following setup:

  • Train floret vectors from TLUnified. spaCy's floret vectors provide efficiency at no cost to performance. The F1-scores are competitive even with the vectors sourced from fastText (trained on CommonCrawl and Wikipedia). The size of the model (i.e., whether it's medium (*_md) or large (*_lg)) can then be a function of the hash table size.
  • Pretrain token-to-vector weights using the ‘characters’ objective.
    Pretraining gives a decent boost to performance. I also think the “characters”
    objective can better model the structure of Tagalog words. Lastly, I only
    pretrained for \(5\) epochs (\(12\) hours!), so I might push it to \(10\) or
    more in the final pipeline.
  • Hyperparameter tuning. I've been using the default training and NER parameters throughout my experiments. I prefer starting from a crude pipeline and moving on to its finer points. I'll spend some time doing a hyperparameter search using WandB to see if there are more optimizations I can make.

Finding the best language model training setup

With spaCy’s Huggingface
integration
, finding a decent
language model as a drop-in replacement for our token-to-vector embedding layer
is much faster. Previously, we slotted a tok2vec
embedding layer that downstream components like
ner use. Here, we effectively replace
that with a transformer model. For example, the English transformer model
en_core_web_trf uses RoBERTa
(Liu, et al., 2019) as its base. We want transformers because
of their dense and context-sensitive representations, even if they have higher
training and runtime costs.

Luckily, Tagalog has a RoBERTa-based model. roberta-tagalog was trained using TLUnified and benchmarked on multilabel text classification tasks (Cruz and Cheng, 2022). The large model has 330M parameters, whereas the base model has 110M. I'll use both variants throughout the experiments as my monolingual language models of choice.

The only limitation of this setup is that roberta-tagalog was also trained on parts of TLUnified that I annotated, so it may have seen some of my test set. However, I don't want to pretrain my own transformer model on a corpus without the overlapping texts for now, so I'll use this language model as it is. I just want to flag that this setup may have inflated my reported scores.

On the other hand, I also want to benchmark a multilingual transformer model. The XLM-RoBERTa (XLM-R) model is a good fit (Conneau, et al., 2019). It was pretrained on text in 100 languages, including Tagalog. Most of its data came from a cleaned version of CommonCrawl, with the Tagalog portion containing 556 million tokens (3.1 GiB). My hope is that XLM-R can take advantage of representations learned from morphologically similar languages for our downstream task.

I finetuned these language models for three trials on different random seeds.
The results can be seen in the table below:

| Language Model | Precision | Recall | F1-score |
| --- | --- | --- | --- |
| roberta-tagalog-large | \(\mathbf{0.91\pm0.01}\) | \(\mathbf{0.91\pm0.02}\) | \(\mathbf{0.91\pm0.01}\) |
| roberta-tagalog-base | \(0.90\pm0.01\) | \(0.89\pm0.01\) | \(0.90\pm0.00\) |
| xlm-roberta-large | \(0.88\pm0.00\) | \(0.88\pm0.00\) | \(0.89\pm0.01\) |
| xlm-roberta-base | \(0.87\pm0.02\) | \(0.87\pm0.01\) | \(0.88\pm0.01\) |

Table 9: Performance comparison between monolingual and multilingual language models. Evaluated on the development set.

I didn’t expect the two models to be on par with one another. In addition, the
performance of our word vector pipeline (floret + pretraining) is competitive
with our transformer approach. Training for the base models took around four
hours in an NVIDIA V100 GPU (I’m using Google Colab Pro+) and twelve to fifteen
hours for the larger ones.

Some notes for the final transformer pipeline

In the future, I hope to create a transformer-based model similar to spaCy's en_core_web_trf, but for Tagalog. For now, I'll settle on the following setup:

  • Use roberta-tagalog-* as the transformer model. I hypothesize that a model trained specifically for a given language should outperform a "generalist" language model. I will keep tabs on XLM-R but shift my focus to building upon roberta-tagalog.
  • Hyperparameter tuning. Like the word vector pipeline, I need to conduct a hyperparameter search for my transformer pipeline. Playing around with the training and span-getter parameters might be a good starting point.

Evaluating our pipelines

We now have a word vector and a transformer-based pipeline. The former uses floret vectors with pretraining, while the latter takes advantage of the roberta-tagalog-* language models. Let's do a few more evaluations to wrap things up. For now, I'll call them tl_tlunified_lg and tl_tlunified_trf to be consistent with spaCy's model naming convention.
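Using either pipeline is plain spaCy once it's packaged or pointed at the training output directory. A small usage sketch (the package name and the example sentence are placeholders):

```python
import spacy

# Placeholder: a packaged pipeline, or e.g. spacy.load("./training/lg/model-best")
nlp = spacy.load("tl_tlunified_lg")

doc = nlp("Pumunta si Juan dela Cruz sa Maynila para sa pulong ng DPWH.")
for ent in doc.ents:
    print(ent.text, ent.label_)
# Expected output (roughly): "Juan dela Cruz" PER, "Maynila" LOC, "DPWH" ORG
```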

Let’s evaluate their performance on the test set:

| Pipeline | Precision | Recall | F1-score |
| --- | --- | --- | --- |
| tl_tlunified_lg | \(0.85\pm0.01\) | \(0.86\pm0.02\) | \(0.86\pm0.02\) |
| tl_tlunified_trf (base) | \(0.87\pm0.02\) | \(0.87\pm0.01\) | \(0.87\pm0.01\) |
| tl_tlunified_trf (large) | \(0.89\pm0.01\) | \(0.89\pm0.00\) | \(0.90\pm0.02\) |

Table 10: Performance comparison between the word vector and transformer-based pipelines. Evaluated on the test set.

We can see a performance difference of around 4 pp between the transformer and word vector pipelines. Let's look at the per-entity results:

| Pipeline | PER | ORG | LOC |
| --- | --- | --- | --- |
| tl_tlunified_lg | \(0.88\pm0.02\) | \(0.77\pm0.02\) | \(0.86\pm0.00\) |
| tl_tlunified_trf (base) | \(0.90\pm0.01\) | \(0.80\pm0.02\) | \(0.87\pm0.01\) |
| tl_tlunified_trf (large) | \(0.92\pm0.01\) | \(0.81\pm0.02\) | \(0.87\pm0.00\) |

Table 11: Per-entity F1-score comparison between the word vector and transformer-based pipelines. Evaluated on the test set.

Lastly, let’s look at our pipelines’ performance on unseen entities. Here, we
define an unseen test set that contains entities not seen by the model during
training. This evaluation allows us to check how a model responds to entities it
newly encounters. The way I split the texts is naive: I based them on the
orthographic representation of the words. For example, entities like “Makati
City” and “Lungsod ng Makati” will be treated as separate entities even if they
point to the same location. For the gold-annotated TLUnified, our unseen test
set has 784 documents.
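Here is a rough sketch of that orthographic split, using the .spacy files from earlier. The exact criterion (keeping documents whose entities all have surface forms unseen during training) and the file names are my assumptions about the procedure:

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("tl")

def load_docs(path):
    """Load all documents from a serialized DocBin file."""
    return list(DocBin().from_disk(path).get_docs(nlp.vocab))

# Entities are compared by surface form only, so "Makati City" and
# "Lungsod ng Makati" count as different entities even if they co-refer.
train_entities = {ent.text for doc in load_docs("corpus/train.spacy") for ent in doc.ents}

unseen_test = [
    doc
    for doc in load_docs("corpus/test.spacy")
    if doc.ents and all(ent.text not in train_entities for ent in doc.ents)
]
print(f"Unseen test set: {len(unseen_test)} documents")
```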

| Pipeline | Precision | Recall | F1-score |
| --- | --- | --- | --- |
| tl_tlunified_lg | \(0.75\pm0.01\) | \(0.83\pm0.02\) | \(0.79\pm0.01\) |
| tl_tlunified_trf (base) | \(0.85\pm0.00\) | \(0.84\pm0.02\) | \(0.84\pm0.02\) |
| tl_tlunified_trf (large) | \(0.88\pm0.02\) | \(0.88\pm0.00\) | \(0.88\pm0.00\) |

Table 12: Performance comparison between the word vector and transformer-based pipelines. Evaluated on the unseen test set.

The scores are what we'd expect: all pipelines dropped in performance when they encountered previously unseen entities. However, note that the transformer pipelines may have seen parts of the test set during their pretraining, which could inflate their scores. On the other hand, I have better control over the word vector pipeline and can run its pretraining without the unseen test set.

Conclusion

In this blog post, I outlined my progress in building an NLP pipeline for
Tagalog. I started with the named-entity recognition (NER) task because it is a
crucial problem with many applications. I talked about…

  • …how I created a gold-annotated corpus for Tagalog. I used a larger dataset called TLUnified and pre-annotated it with predictions from a silver-standard model trained on WikiANN. I then corrected its annotations using Prodigy, producing around 7,000 annotated documents.
  • …how I built a word vector-based and a language model-based pipeline. I ran tests to decide the best setup for the two pipelines. The word vector pipeline consists of a hash table trained with floret and a pretrained token-to-vector weight matrix. The language model pipeline is based on roberta-tagalog. I also tested how a multilingual model like XLM-R fares on the dataset.
  • …how I evaluated each pipeline. Aside from the benchmarking tests, I evaluated each pipeline on the held-out test set and on a test set with previously unseen entities. The transformer-based models worked well, while the word vector-based model remained competitive.

To summarize the process above, we have the figure below. I hope it simplifies the whole procedure of building an NLP pipeline:

Next steps

There are still a lot of things I want to do. Let’s use the same figure to talk
about the potential next steps for this work:

  • On annotation. Right now, all the annotations come from me. Although I wrote myself an annotation guideline, it isn't enough to make the labels objective and error-free. Having multiple annotators and computing inter-annotator agreement would be the best option.
  • On building the pipeline. I want to spend some time performing a hyperparameter search to optimize the NLP pipelines. For the word vector pipeline, it may also be good to run degradation tests to see how the pretrained weights respond to dataset size. In the future, I see myself adding more components to this pipeline, not just ner. For example, some folks at the University of the Philippines are creating a larger Tagalog UD treebank. I can tap into their corpus to train a dependency parser and part-of-speech (POS) tagger for my existing pipelines.
  • On evaluation. Aside from evaluating on a held-out test set and previously unseen entities, I want to improve the evaluation scheme to include perturbations and irregular train-test splits (Vajjala and Balasubramaniam, 2022; Søgaard, et al., 2021).

Caveats

As I’ve prefaced at the beginning of this blog post, I do not recommend using
this for production. This NLP pipeline is still a work in progress, and I’m
developing ways to make the model more robust. Here are the limitations of this
work:

  • The gold-annotated dataset was labeled by a single annotator. Ideally, this would be done by multiple
    annotators to limit bias and errors. You can get the annotated dataset now but use it with caution.
  • The pipelines to date still need to be optimized. I still need to perform a more thorough parameter search to
    improve the results. This applies to both the word vector and language model-based pipelines.
  • The evaluation still needs a bit of work. Although evaluating on the test set and unseen entities may
    be standard practice, I’d like to investigate more on which instances the model works.

The Tagalog corpus is the biggest bottleneck. Increasing the amount of gold-annotated data or improving the reliability of silver-standard annotations should be a priority. The former is a bit of a brute-force approach, but I think sometimes we just need to sit down, tackle the problem, and start annotating.

Final thoughts

In Tagalog, we have this word called diskarte. There is no direct translation
in English, but I can describe it loosely as resourcefulness and creativity.
It’s not a highly-cognitive trait: smart people may be bookish, but not
madiskarte. It’s more practical, a form of street smarts, even. Diskarte is
a highly-Filipino trait, borne from our need to solve things creatively in the
presence of constraints (Morales, 2017). I mention this
because working in Tagalog, or any low-resource language, requires a little
diskarte, and I enjoy it!

There are many exciting ways to tackle Tagalog NLP. Right now, I'm taking the standard labeling, training, and evaluation approach. However, I'm interested in exploring model-based techniques like cross-lingual transfer learning and multilingual NLP to "get around" the data bottleneck. After three months (twelve weekends, to be specific) of labeling, I realized how long and costly the process is. I still believe in getting gold-standard annotations, and that having enough of them to support a structured evaluation is worth it, but I also want to balance this approach with model-based solutions.

I wish we had more consolidated efforts to work on Tagalog NLP. Right now, research progress at each institution seems disconnected from the others. I definitely like what's happening with Masakhane for African languages and IndoNLP for Indonesian; I think they are good community models to follow. Lastly, Tagalog is not the only language in the Philippines, and being able to tackle the other Philippine languages, one at a time, would be nice.

Right now, I’m working on
calamanCy, my attempt to create
spaCy pipelines for Tagalog. Its name is based on
kalamansi, a citrus fruit common in
the Philippines. Unfortunately, it’s something that I’ve been working on in my
spare time, so progress is slower than usual! This blog post contains my
experiments on building the NER part of the pipeline. I plan to add a dependency
parser and POS tagger from Universal Dependencies in the future.

That’s all for now. Feel free to hit me up if you have any questions and want to
collaborate! Maraming salamat!

References


