Update (2023-12-06): I am happy to announce that two papers came out from this project: “Developing a Named Entity Recognition Dataset for Tagalog” and “calamanCy: A Tagalog Natural Language Processing Toolkit”.
Tagalog is my native language. It’s spoken by 76 million Filipinos and has been
the country’s official language since the 1930s. It’s a text-rich language but,
unfortunately, a low-resource one. In the age of big data and large language
models, building NLP pipelines for Tagalog is still difficult.
In this blog post, I’ll talk about how I built a named-entity recognition (NER)
pipeline for Tagalog. I’ll discuss how I came up with a gold-standard dataset,
my benchmarking results, and my hopes for the future of Tagalog NLP.
I don’t recommend using this pipeline for production purposes yet. See caveats.
Table of Contents (click ↵ to jump back to the TOC)
Tagalog NER data is scarce ↵
Even if Tagalog is text-rich, the amount of annotated data is scarce. We
usually call these languages low-resource. This problem isn’t unique to
Tagalog. Out of the approximately 7,000 languages worldwide, only 10 have
adequate NLP resources (Mortensen, 2017; Tsvetkov, 2017). However, we can
circumvent the data scarcity problem by bootstrapping the data we have.
We can circumvent the data scarcity problem… ↵
There are many clever ways to circumvent the data scarcity problem. They usually
involve taking advantage of a high-resource language and transferring its
capacity to a low-resource one. The table below outlines some techniques:
Approach | Data* | Prerequisites | Description |
---|---|---|---|
Supervised learning | High | Large amount of labeled data. | Train a model as usual using feature-label pairs. |
Cross-lingual transfer learning | None to a few | Understanding of the similarities between source and target languages. | Transfer resources and models from resource-rich source to resource-poor target language. |
Zero-shot learning | None | High similarity between source and target domains. It often helps if the latter is a “subset” of the former. | Train a model in a domain and assume it generalizes out-of-the-box in another domain. |
Few-shot learning | A few to high | Similarity between source and target domains and a task-specific finetuning dataset. | Use a pretrained model from a large corpus, and then finetune on a specific task. |
Polyglot learning | A few to high | A mixed corpus or a dataset with languages converted to a shared representation. | Combine resource-rich and resource-poor languages and train them together. |
* Data: amount of gold-labeled annotations required.
Table 1: List of techniques for low-resource NLP (Mortensen, Tsvetkov, 2017).
My work focuses on supervised and few-shot learning. Because these
methods require a substantial amount of data, I need to take advantage of
existing corpora. One way is to use silver-standard data. Silver-standard
annotations are usually generated by a statistical model trained from a similar
language or a knowledge base. They may not be accurate or trustworthy, but
they’re faster and cheaper to create.
…by bootstrapping the data we have ↵
The best way to work with silver-standard data is to use it to bootstrap the
annotations of a much larger and more diverse dataset, producing gold-standard
annotations. This way, we reduce the cognitive load of labeling: we focus on
correcting the model’s outputs rather than labeling everything from scratch.
The figure below illustrates the workflow I’m following:
By bootstrapping the annotations, we reduce the cognitive load of labeling
and focus more on correcting the model’s outputs rather than labeling from scratch.
The only major NER dataset for Tagalog is WikiANN, a silver-standard dataset
based on an English knowledge base (KB). Pan et al. (2017) created a framework
for tagging entities based on Wikipedia and extended it to 282 languages,
including Tagalog. However, it’s not perfect. For example, the first few
entries of the validation set have glaring errors:
- Example: Hinlalato ng paa is the middle toe, not an ORG.
- Example: Ninoy Aquino should be tagged as PER, while Sultan Kudarat should be LOC.
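If you want to inspect these entries yourself, here’s a minimal sketch using the Hugging Face datasets library (the `wikiann` dataset id with the `tl` config and its field names are my assumptions of how the silver annotations are hosted):

```python
# Peek at the first few validation entries of WikiANN (tl).
# Assumes the `datasets` library; the "wikiann"/"tl" id and field names
# are my assumptions of how the silver annotations are hosted.
from datasets import load_dataset

wikiann_tl = load_dataset("wikiann", "tl", split="validation")
label_names = wikiann_tl.features["ner_tags"].feature.names  # ["O", "B-PER", ...]

for example in wikiann_tl.select(range(5)):
    tags = [label_names[tag_id] for tag_id in example["ner_tags"]]
    print(list(zip(example["tokens"], tags)))
```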
Also, the texts themselves aren’t complete sentences. A model trained on this
data may generalize poorly to longer documents because the context of an entity
is lost. We can’t rely solely on a model trained on WikiANN. However, the
dataset can still be useful: we can use it to train a model that bootstraps our
annotations.
…the texts [in WikiANN] aren’t complete sentences. A model trained on this
data may generalize poorly to longer documents…so we can’t [just] rely [on it].
Fortunately, we have several unannotated datasets that represent the diversity
of the Filipino language. For example, there is the CommonCrawl repository that
contains web-crawled data for any language. We also have TLUnified (Cruz and
Cheng, 2022) and WikiText TL-39 (Cruz and Cheng, 2019), which are much more
recent. For my experiments, I will use the TLUnified dataset because one of its
subdomains (news) resembles that of standard NER benchmarks like CoNLL. It
contains reports from 2009 to 2020 scraped from major Filipino-language news
sites in the Philippines, both broadsheets and their associated tabloids (e.g.,
CNN Philippines, ABS-CBN, Manila Times).
I will be using the TLUnified dataset as it’s more recent, and one of its
subdomains resembles that of standard NER benchmarks like CoNLL.
My process goes like this: I will train a model from WikiANN and have it
predict entities for TLUnified. Then, I will correct the predictions using
Prodigy, an annotation software, to produce gold-standard
annotations. Piece of cake, right?
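Concretely, the pre-annotation step looks roughly like the sketch below. The model path and file names are placeholders; the output JSONL uses the `text` + `spans` format that Prodigy’s NER recipes accept:

```python
# Pre-annotate raw TLUnified texts with a silver model trained on WikiANN,
# writing Prodigy-style JSONL that I can then correct by hand.
# The model path and file names below are placeholders.
import json
import spacy

nlp = spacy.load("training/wikiann-silver/model-best")

with open("assets/tlunified_sample.txt", encoding="utf8") as f_in, \
     open("assets/tlunified_preannotated.jsonl", "w", encoding="utf8") as f_out:
    texts = (line.strip() for line in f_in if line.strip())
    for doc in nlp.pipe(texts, batch_size=64):
        spans = [
            {"start": ent.start_char, "end": ent.end_char, "label": ent.label_}
            for ent in doc.ents
        ]
        f_out.write(json.dumps({"text": doc.text, "spans": spans}, ensure_ascii=False) + "\n")
```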
However, labeling thousands of samples is not the hardest part. As the sole
annotator, I can easily introduce my own biases and errors into the dataset. In
practice, you’d want three or more annotators and a measure of inter-annotator
agreement. Unfortunately, this is the main limitation of this work. In the next
section, I’ll outline some of my attempts to be more objective when annotating.
Of course, the ideal case is to have multiple annotators, so let me know if you
want to help out!
As the sole annotator, [I] can easily introduce [my own] biases and errors into the dataset.
This is the limitation of this work.
We still want gold-annotated data ↵
This section will discuss how I annotated TLUnified to produce gold-standard
data. I’ll also introduce my benchmarking experiments to see how baseline
models perform on this dataset. For clarity, I’ll call the annotated TLUnified
`tl_tlunified_gold` (`tl`: language code, `tlunified`: data source, `gold`:
dataset type).
I corrected annotations from a silver model… ↵
For the past three months, I corrected annotations produced by the WikiANN model.
I learned that as an annotator, it’s easier to fix annotations than label them
from scratch. I also devised
annotation guidelines (Artstein, 2017) to make the annotation process more objective. Professor Nils
Reiter has an excellent guide for
developing these. Lastly, I also took inspiration from The Guardian’s
work,
which uses Prodigy for quotation
detection.
For TLUnified, I used three labels for annotation:
- PER: person entities limited to humans. It may be a single individual or group (e.g., Juan de la Cruz, Nene, mga abogado).
- ORG: organization entities are limited to corporations, agencies, and other groups of people defined by an organizational structure (e.g., United Nations, DPWH, Meralco).
- LOC: location entities are geographical regions, areas, and landmasses. Subtypes of geo-political entities (GPE) are also included within this group (e.g., Pilipinas, Manila, Luzon).
Again, you can check the full annotation
guidelines
to see the nuances in labeling these entities.
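To make the label scheme concrete, here’s a small hand-constructed example in spaCy (the sentence is made up; the entities mirror the examples above):

```python
# A toy gold-standard example following the annotation guidelines above.
import spacy

nlp = spacy.blank("tl")
text = "Nagsalita si Juan de la Cruz sa harap ng Meralco sa Maynila."
doc = nlp.make_doc(text)

spans = [
    doc.char_span(13, 28, label="PER"),  # Juan de la Cruz
    doc.char_span(41, 48, label="ORG"),  # Meralco
    doc.char_span(52, 59, label="LOC"),  # Maynila
]
doc.ents = [span for span in spans if span is not None]
print([(ent.text, ent.label_) for ent in doc.ents])
```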
Since there are still gaps in my annotation process, the annotations produced
in v1.0 of `tl_tlunified_gold` are not ready for production. Getting multiple
annotations and measuring inter-annotator agreement over several iterations is
still the ideal setup.
Nevertheless, I produced annotations for around 7,000 documents. I split them
into training, development, and test partitions and uploaded v1.0 of the raw
annotations to the cloud. You can access the raw annotations and replicate the
preprocessing step by checking out the GitHub repository of this project. The
table below shows some dataset statistics:
Tagalog Data | Documents | Tokens | PER | ORG | LOC |
---|---|---|---|---|---|
Training Set | \(6252\) | \(198588\) | \(6418\) | \(3121\) | \(3296\) |
Development Set | \(782\) | \(25007\) | \(793\) | \(392\) | \(409\) |
Test Set | \(782\) | \(25153\) | \(818\) | \(423\) | \(438\) |
Table 2: Dataset statistics for v1.0 of `tl_tlunified_gold`.
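If you want to recompute these counts from the serialized `.spacy` files, a short sketch (the `corpus/` file names are assumptions based on the project layout):

```python
# Recompute the per-split statistics in Table 2 from the serialized corpora.
# The corpus/*.spacy paths are assumptions; adjust them to the repo layout.
from collections import Counter

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("tl")

for split in ("train", "dev", "test"):
    docs = list(DocBin().from_disk(f"corpus/{split}.spacy").get_docs(nlp.vocab))
    ents = Counter(ent.label_ for doc in docs for ent in doc.ents)
    n_tokens = sum(len(doc) for doc in docs)
    print(split, len(docs), n_tokens, ents["PER"], ents["ORG"], ents["LOC"])
```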
…then benchmarked it with baseline NER approaches ↵
I want to see how standard NER approaches fare with `tl_tlunified_gold`. My
eventual goal is to set up training pipelines that produce decent Tagalog
models from this dataset. I ran two sets of experiments, one involving word
vectors and the other using transformers. I aim to identify the best training
setup for this Tagalog corpus. I’m not pitting one against the other; I want to
set up training pipelines for both in the future.
My eventual goal is to identify the best training setup for this Tagalog corpus.
I’m not pitting one against the other; I want to set up
training pipelines for both in the future.
First, I want to benchmark several word vector settings for NER. The baseline
approach simply trains a model from the training and dev data, nothing more.
Then, I will examine if adding word vectors (also called static
vectors in
spaCy) can improve performance. Finally, I will investigate if
pretraining
can help push performance further:
Approach | Setup | Description |
---|---|---|
Supervised learning | Baseline | Train a NER model from scratch. No tricks, just the annotated data. |
Supervised learning | Baseline + fastText | Source fastText vectors for the downstream NER task. |
Supervised learning | Baseline + fastText + pretraining | Pretrain spaCy’s token-to-vector layer while sourcing fastText vectors. |
Table 3: Experimental setup for word vectors. I will use the default spaCy
hyperparameters for all these experiments.
The figure below shows the default setup for our word vector pipeline. The
baseline approach won’t have any word vectors or pretraining (Baseline).
Then, we will use word vectors as additional features for training our
statistical model (Baseline + fastText). Lastly, we will use pretraining to
initialize the weights of our components (Baseline + fastText + pretraining).
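Mechanically, “adding word vectors” just means attaching a vectors table to the pipeline so the token-to-vector layer can use it as features. A minimal sketch (the `.vec` path is a placeholder; the `spacy init vectors` CLI command does the same job):

```python
# Attach fastText static vectors to a blank Tagalog pipeline so downstream
# components (tok2vec, ner) can use them as features.
# The .vec path is a placeholder for the downloaded fastText vectors.
import numpy
import spacy

nlp = spacy.blank("tl")

with open("vectors/cc.tl.300.vec", encoding="utf8") as f:
    f.readline()  # skip the "<n_rows> <n_dims>" header
    for line in f:
        pieces = line.rstrip().split(" ")
        word, vector = pieces[0], numpy.asarray(pieces[1:], dtype="float32")
        nlp.vocab.set_vector(word, vector)

print(nlp.vocab.vectors.shape)  # (rows, dims)
nlp.to_disk("vectors/tl_fasttext")  # point [initialize.vectors] in the config here
```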
Next, I will measure the performance of a monolingual and multilingual language
model. I’m using transformer models as a drop-in
replacement for
the representation layer to achieve higher accuracy:
Approach | Language Models | Description |
---|---|---|
Few-shot learning | roberta-tagalog | Monolingual experiment with a RoBERTa model trained on TLUnified. I will test both base and large variants. |
Few-shot learning | xlm-roberta | Multilingual experiment with XLM-RoBERTa, trained on 100 different languages. I will test both base and large variants. |
Table 4: Experimental setup for language models
Again, I want to use what I learned from these experiments to set up a training
scheme for a Tagalog pipeline down the road: something akin to `en_core_web_lg`
or `en_core_web_trf` in spaCy, but for Tagalog. For all the experiments above, I
will use spaCy’s transition-based parser for sequence labeling.
Experimental Results ↵
The results below aim to inform eventual design decisions for building NLP
pipelines for Tagalog. I plan to create a word vector-based and a language
model-based training setup. If you’re interested in replicating my results,
check out the spaCy project on GitHub!
Lastly, because we’re doing a bit of hyperparameter tuning here (choosing the
proper config, etc.), I will report the results as evaluated on the dev set to
avoid overfitting to the test set.
The results below aim to inform eventual design decisions for building
NLP pipelines for Tagalog. I plan to create a word vector-based and language model-based
training setup.
I ran each experiment for three trials, and I will report their mean and
standard deviation.
Finding the best word vector training setup ↵
To find the best word vector training setup, I designed an experiment to test
how static vectors and pretraining can improve performance. The baseline
approach has none of these; it simply trains a model from scratch. Then, I sourced
static vectors and, eventually, a set of pretrained weights. Within these two, there
are still design choices to be made:
- On static vectors: by default, I’m using the vectors available from the fastText website. These were trained on CommonCrawl and Wikipedia. I’d like to know whether it matters if I train my own fastText vectors from TLUnified, and whether there are efficiency gains from using floret vectors.
- On pretraining: by default, my pretraining objective is based on characters, i.e., the model is tasked to predict some number of leading and trailing UTF-8 bytes of each word. However, spaCy also provides another pretraining objective based on a static embeddings table. I’d like to know which of the two is more performant.
The table below shows the performance, using default settings, for our three
training scenarios. The results suggest that combining static vectors and
pretraining improves the F1-score by around 2 percentage points (pp).
Setup | Precision | Recall | F1-score |
---|---|---|---|
Baseline | \(0.87\pm0.01\) | \(0.87\pm0.01\) | \(0.87\pm0.00\) |
Baseline + fastText* | \(\mathbf{0.89\pm0.01}\) | \(0.86\pm0.01\) | \(0.88\pm0.00\) |
Baseline + fastText* + pretraining | \(\mathbf{0.89\pm0.01}\) | \(\mathbf{0.89\pm0.01}\) | \(\mathbf{0.89\pm0.00}\) |
* 714k keys and unique vectors. Vectors were sourced from the fastText website.
Table 5: Pipeline performance, using default hyperparameters. Evaluated on the
development set.
It’s also worth noting that the baseline result (87% F1-score), where we trained
a model without any static vectors or pretrained weights, already performs
well! Of course, we want to push these scores further, so in the next sections
I’ll explore alternative configurations for our word vector pipeline.
On static vectors: it is worth training floret vectors ↵
In the previous experiment, I used the fastText vectors provided by the
fastText website. These vectors were trained on CommonCrawl and Wikipedia. I’m
curious if I can achieve better performance if I train my own vectors.
I trained two sets of word vectors from TLUnified, one based on fastText and the
other on floret. Both were trained using the skipgram model, with a dimension of
\(200\) and subword minimum (`minn`) and maximum (`maxn`) sizes of \(3\) and
\(5\), respectively. Lastly, I removed the annotated texts from TLUnified before
training the vectors to ensure that no overlaps influence our benchmark results.
These results can be seen in the table below:
Word Vectors | Unique Vectors* | Precision | Recall | F1-score |
---|---|---|---|---|
fastText (default: CommonCrawl + Wikipedia) | \(714k\) | \(\mathbf{0.89\pm0.01}\) | \(0.86\pm0.01\) | \(\mathbf{0.88\pm0.00}\) |
fastText (TLUnified) | \(566k\) | \(\mathbf{0.89\pm0.01}\) | \(\mathbf{0.88\pm0.00}\) | \(\mathbf{0.88\pm0.01}\) |
floret (TLUnified) | \(\mathbf{200k}\) | \(0.88\pm0.01\) | \(\mathbf{0.88\pm0.01}\) | \(\mathbf{0.88\pm0.00}\) |
* This time, we’re talking about unique vectors, not keys. Several keys can map to the same vectors, and floret doesn’t use the keys table.
Table 6: Even with a smaller vector table size, floret is competitive with the default fastText vectors. Evaluated on the development set.
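For reference, training such vectors looks roughly like the sketch below with the fastText Python bindings; floret’s bindings follow the same pattern, with extra floret-specific options such as the bucket size. The corpus path is a placeholder:

```python
# Train skipgram vectors on the (deduplicated) TLUnified text with the same
# hyperparameters as above: dim=200, minn=3, maxn=5.
# The corpus path is a placeholder; annotated texts were removed beforehand.
import fasttext

model = fasttext.train_unsupervised(
    "corpus/tlunified_unannotated.txt",
    model="skipgram",
    dim=200,
    minn=3,
    maxn=5,
)

model.save_model("vectors/fasttext.tl.200.bin")
print(model.get_word_vector("kalsada")[:5])  # subword-aware vector for any token
```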
We can also inspect the cosine similarity of subword pairs between related and
unrelated terms. Here I’m using the vectors from fastText and floret, both trained
on TLUnified:
The results suggest that the floret vectors kept the correlation between
subtokens intact despite the smaller table size (\(700k\) → \(200k\)).
So perhaps training my own fastText vectors isn’t worth it, but exploring floret
is. In addition, this efficiency gain comes with little to no performance
penalty, as seen in Table 6. Let’s see what happens if I train floret with
bucket sizes of 100k, 50k, and 25k:
Unique Vectors | Precision | Recall | F1-score |
---|---|---|---|
\(200k\) | \(\mathbf{0.88\pm0.01}\) | \(\mathbf{0.88\pm0.01}\) | \(\mathbf{0.88\pm0.00}\) |
\(100k\) | \(\mathbf{0.88\pm0.00}\) | \(0.87\pm0.00\) | \(\mathbf{0.88\pm0.01}\) |
\(50k\) | \(0.86\pm0.01\) | \(0.85\pm0.01\) | \(0.85\pm0.00\) |
\(25k\) | \(0.82\pm0.02\) | \(0.82\pm0.01\) | \(0.81\pm0.00\) |
Table 7: `floret` vector performance on different vector table sizes. All
vectors were trained using TLUnified. Evaluated on the development set.
There was a slight degradation in performance when I adjusted the bucket size
from \(200k\) to \(25k\). It’s not as drastic as I expected, but it’s
interesting to see the pattern. There’s even a case for using \(100k\) rows in
floret, but for now, I’ll stick to \(200k\).
On pretraining: there is no significant difference between the two pretraining objectives ↵
spaCy provides two optimization objectives to pretrain the token-to-vector
weights from raw data: `PretrainCharacters` and `PretrainVectors`. Both use a
trick called language modeling with approximate outputs (LMAO), in which we
force the network to model something about word co-occurrence.
Using their default values, I ran an experiment that compares the two. Similar
to the previous experiment, I also removed overlaps between the final dataset
and the pretraining corpus to ensure that they won’t affect the results:
Pretraining objective | Precision | Recall | F1-score |
---|---|---|---|
`PretrainCharacters` | \(0.89\pm0.01\) | \(\mathbf{0.89\pm0.01}\) | \(0.89\pm0.00\) |
`PretrainVectors` | \(\mathbf{0.90\pm0.01}\) | \(\mathbf{0.89\pm0.00}\) | \(\mathbf{0.90\pm0.00}\) |
Table 8: Performance comparison between different pretraining objectives (characters vs. vectors). Evaluated on the development set.
The results suggest that there is no significant difference between the two.
`PretrainVectors` has a slight edge in precision, but the difference is within
the margin of error. However, for an agglutinative language like Tagalog, our
pipeline might benefit from a model with some knowledge of a word’s affixes, so
I’ll use `PretrainCharacters` for the final pipeline.
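To give a feel for why the characters objective might suit Tagalog affixes, here’s a toy illustration of the kind of target it constructs. This is a simplification, not spaCy’s actual implementation:

```python
# A rough illustration of the "characters" objective: for each word, the
# model predicts some number of leading and trailing UTF-8 bytes.
def leading_trailing_bytes(word, n=4):
    encoded = word.encode("utf8")
    return encoded[:n], encoded[-n:]

# Affixed forms of the same root differ mostly at the edges, which is
# exactly the part of the word this objective focuses on.
for word in ["trabaho", "nagtatrabaho", "pagtatrabaho"]:
    print(word, leading_trailing_bytes(word))
```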
Some notes for the final word vector pipeline ↵
In the future, I hope to create pipelines akin to spaCy’s `en_core_web_md` or
`en_core_web_lg`, but for Tagalog. I’ll settle for the following setup:
- Train floret vectors from TLUnified. spaCy’s floret vectors provide efficiency with no cost to performance. The F1-scores are competitive even with the vectors sourced from fastText (trained on CommonCrawl and Wikipedia). The size of the model (i.e., whether it’s medium (`*_md`) or large (`*_lg`)) can then be a function of the hash table size.
- Pretrain token-to-vector weights using the “characters” objective. Pretraining gives a decent boost to performance. I also think the “characters” objective can better model the structure of Tagalog words. Lastly, I only pretrained for \(5\) epochs (\(12\) hours!), so I might push it to \(10\) or more in the final pipeline.
- Hyperparameter tuning. I’ve been using the default training and NER parameters throughout my experiments. I prefer starting from a crude pipeline and moving on to its finer points. I’ll spend some time doing a hyperparameter search using WandB and see if there are more optimizations I can do.
Finding the best language model training setup ↵
With spaCy’s Huggingface integration, finding a decent language model as a
drop-in replacement for our token-to-vector embedding layer is much faster.
Previously, we slotted in a `tok2vec` embedding layer that downstream components
like `ner` use. Here, we effectively replace that with a transformer model. For
example, the English transformer model `en_core_web_trf` uses RoBERTa (Liu, et
al., 2019) as its base. We want transformers because of their dense and
context-sensitive representations, even if they come with higher training and
runtime costs.
Luckily, Tagalog has a RoBERTa-based model. The `roberta-tagalog` models were
trained using TLUnified and benchmarked on multilabel text classification tasks
(Cruz and Cheng, 2022). The large model has 330M parameters, whereas the base
model has 110M. I’ll use both variants throughout the experiment as my
monolingual language models of choice.
The only limitation of this setup is that `roberta-tagalog` was also trained on
parts of TLUnified that I annotated, so it may have seen some of my test set.
However, I don’t want to pretrain my own transformer model without the
overlapping texts for now, so I’ll use this language model as is. The caveat is
that this setup may have inflated my reported scores.
On the other hand, I also want to benchmark a multilingual transformer model.
The XLM-RoBERTa (XLM-R) model may be a good fit (Conneau, et al., 2019). It was
pretrained on text in 100 languages, including Tagalog. Most of its training
data came from a cleaned version of CommonCrawl, with the Tagalog portion
containing 556 million tokens (3.1 GiB). My hope is that the XLM-R model can
take advantage of representations learned from morphologically similar
languages for our downstream task.
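As a quick sanity check before finetuning, we can compare how the monolingual and multilingual tokenizers segment Tagalog text. A small sketch (the Hugging Face model ids are my assumptions of where these checkpoints live):

```python
# Compare subword segmentation of Tagalog text across the two model families.
# The Hugging Face model ids below are assumptions on my part.
from transformers import AutoTokenizer

text = "Nagpatupad ng mas mahigpit na quarantine ang pamahalaan sa Maynila."

for model_id in ["jcblaise/roberta-tagalog-base", "xlm-roberta-base"]:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    print(model_id, tokenizer.tokenize(text))
```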
I finetuned these language models for three trials on different random seeds.
The results can be seen in the table below:
Language Model | Precision | Recall | F1-score |
---|---|---|---|
roberta-tagalog-large | \(\mathbf{0.91\pm0.01}\) | \(\mathbf{0.91\pm0.02}\) | \(\mathbf{0.91\pm0.01}\) |
roberta-tagalog-base | \(0.90\pm0.01\) | \(0.89\pm0.01\) | \(0.90\pm0.00\) |
xlm-roberta-large | \(0.88\pm0.00\) | \(0.88\pm0.00\) | \(0.89\pm0.01\) |
xlm-roberta-base | \(0.87\pm0.02\) | \(0.87\pm0.01\) | \(0.88\pm0.01\) |
Table 9: Performance comparison between a monolingual and a multilingual language model. Evaluated on the development set.
I didn’t expect the two model types to be this close to one another. In
addition, the performance of our word vector pipeline (floret + pretraining) is
competitive with our transformer approach. Training the base models took around
four hours on an NVIDIA V100 GPU (I’m using Google Colab Pro+), and twelve to
fifteen hours for the large ones.
Some notes for the final transformer pipeline ↵
In the future, I hope to create a transformer-based model similar to spaCy’s
`en_core_web_trf` for Tagalog. I’ll settle for the following setup:
- Use `roberta-tagalog-*` as the transformer model. I hypothesize that a model trained specifically for a given language should outperform a “generalist” language model. I will keep tabs on XLM-R but shift focus to building upon `roberta-tagalog`.
- Hyperparameter tuning. Like the word vector pipeline, I need to conduct a hyperparameter search for my transformer pipeline. Playing around with the training and span-getter parameters might be a good starting point.
Evaluating our pipelines ↵
We now have a word vector and a transformer-based pipeline. The former uses
floret vectors with pretraining, while the latter takes advantage of the
`roberta-tagalog-*` language models. Let’s do a few more evaluations to wrap
things up. For now, I’ll call them `tl_tlunified_lg` and `tl_tlunified_trf` to
be consistent with spaCy’s model naming convention.
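Scoring each pipeline follows the usual spaCy evaluation pattern; roughly, with the model path as a placeholder for wherever `spacy train` saved `model-best`:

```python
# Score a trained pipeline on the held-out test set.
# The model and corpus paths are placeholders.
import spacy
from spacy.training import Corpus

nlp = spacy.load("training/tl_tlunified_lg/model-best")
corpus = Corpus("corpus/test.spacy")
scores = nlp.evaluate(list(corpus(nlp)))

print(scores["ents_p"], scores["ents_r"], scores["ents_f"])
print(scores["ents_per_type"])  # per-entity breakdown (PER, ORG, LOC)
```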
Let’s evaluate their performance on the test set:
Pipeline | Precision | Recall | F1-score |
---|---|---|---|
tl_tlunified_lg | \(0.85\pm0.01\) | \(0.86\pm0.02\) | \(0.86\pm0.02\) |
tl_tlunified_trf (base) | \(0.87\pm0.02\) | \(0.87\pm0.01\) | \(0.87\pm0.01\) |
tl_tlunified_trf (large) | \(0.89\pm0.01\) | \(0.89\pm0.00\) | \(0.90\pm0.02\) |
Table 10: Performance comparison for the word vector and transformer-based
pipelines. Evaluated on the test set.
We can see a performance difference between the transformer and word vector
pipelines (around 4pp). Let’s see the per-entity results:
Pipeline | PER | ORG | LOC |
---|---|---|---|
tl_tlunified_lg | \(0.88\pm0.02\) | \(0.77\pm0.02\) | \(0.86\pm0.00\) |
tl_tlunified_trf (base) | \(0.90\pm0.01\) | \(0.80\pm0.02\) | \(0.87\pm0.01\) |
tl_tlunified_trf (large) | \(0.92\pm0.01\) | \(0.81\pm0.02\) | \(0.87\pm0.00\) |
Table 11: F1-score comparison for the word vector and transformer-based
pipelines (per-entity). Evaluated on the test set.
Lastly, let’s look at our pipelines’ performance on unseen entities. Here, we
define an unseen test set that contains entities not seen by the model during
training. This evaluation lets us check how a model responds to entities it
encounters for the first time. The way I split the texts is naive: I based it
on the orthographic representation of the words. For example, entities like
“Makati City” and “Lungsod ng Makati” are treated as separate entities even if
they point to the same location. For the gold-annotated TLUnified, our unseen
test set has 784 documents.
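Roughly, the idea behind the split looks like the sketch below. This is a simplified version; paths are placeholders, and it filters on exact entity surface forms, which is where the naivety comes in:

```python
# A naive "unseen entities" split: keep only evaluation documents whose
# entity surface forms never appear as entities in the training set.
# As noted above, "Makati City" and "Lungsod ng Makati" count as different.
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("tl")
train_docs = DocBin().from_disk("corpus/train.spacy").get_docs(nlp.vocab)
eval_docs = list(DocBin().from_disk("corpus/test.spacy").get_docs(nlp.vocab))

seen = {ent.text for doc in train_docs for ent in doc.ents}
unseen_docs = [
    doc for doc in eval_docs
    if doc.ents and all(ent.text not in seen for ent in doc.ents)
]

DocBin(docs=unseen_docs).to_disk("corpus/test_unseen.spacy")
print(f"{len(unseen_docs)} documents with only unseen entities")
```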
Pipeline | Precision | Recall | F1-score |
---|---|---|---|
tl_tlunified_lg | \(0.75\pm0.01\) | \(0.83\pm0.02\) | \(0.79\pm0.01\) |
tl_tlunified_trf (base) | \(0.85\pm0.00\) | \(0.84\pm0.02\) | \(0.84\pm0.02\) |
tl_tlunified_trf (large) | \(0.88\pm0.02\) | \(0.88\pm0.00\) | \(0.88\pm0.00\) |
Table 12: Performance comparison for the word vector and transformer-based
pipelines. Evaluated on the unseen test set.
The scores are what we expect: all pipelines dropped in performance when they
encountered previously unseen entities. However, note that the transformer
pipelines may have seen parts of the test set during their pretraining, which
may inflate their scores. The word vector pipeline, on the other hand, gives me
better control: I can run pretraining without the unseen test set.
Conclusion ↵
In this blog post, I outlined my progress in building an NLP pipeline for
Tagalog. I started with the named-entity recognition (NER) task because it is a
crucial problem with many applications. I talked about…
- …how I created a gold-annotated corpus for Tagalog. I used a larger dataset called TLUnified and pre-annotated it with predictions from a silver-standard model trained on WikiANN. I then corrected its annotations using Prodigy, thereby producing around 7,000 annotated documents.
- …how I built a word vector-based and a language model-based pipeline. I ran some tests to decide the best setup for my two pipelines. The word vector pipeline consists of a hash table trained with floret and a pretrained token-to-vector weight matrix. The language model pipeline, on the other hand, is based on `roberta-tagalog`. I also tested how a multilingual model like XLM-R fares on the dataset.
- …how I evaluated each pipeline. Aside from the benchmarking tests, I evaluated each pipeline on the held-out test set and on a test set with previously unseen entities. The transformer-based models worked well, while the word vector-based model remained competitive.
To summarize the process above, we have the figure below. I hope it simplifies
the whole procedure of building an NLP pipeline:
Next steps ↵
There are still a lot of things I want to do. Let’s use the same figure to talk
about the potential next steps for this work:
- On annotation. Right now, all the annotations came from me. Although I wrote myself an annotation guideline, it isn’t enough to make the labels objective and error-free. Having multiple annotators and computing inter-annotator agreement may be the best option.
- On building the pipeline. I want to spend some time performing a hyperparameter search to optimize the NLP pipelines. For the word vector pipeline, it may also be good to do degradation tests to see how the pretrained weights respond to dataset size. In the future, I see myself adding more components to this pipeline, not just `ner`. For example, some folks at the University of the Philippines are creating a larger Tagalog UD treebank. I can tap into their corpus to train a dependency parser and a part-of-speech (POS) tagger for my existing pipelines.
- On evaluation. Aside from evaluating on a held-out test set and previously unseen entities, I want to improve the evaluation scheme and include perturbations and irregular train-test splits (Vajjala and Balasubramaniam, 2022; Søgaard et al., 2021).
Caveats ↵
As I’ve prefaced at the beginning of this blog post, I do not recommend using
this for production. This NLP pipeline is still a work in progress, and I’m
developing ways to make the model more robust. Here are the limitations of this
work:
- The gold-annotated dataset was labeled by a single annotator. Ideally, this would be done by multiple annotators to limit bias and errors. You can get the annotated dataset now, but use it with caution.
- The pipelines still need to be optimized. I still need to perform a more thorough hyperparameter search to improve the results. This applies to both the word vector and the language model-based pipelines.
- The evaluation still needs a bit of work. Although evaluating on the test set and on unseen entities may be standard practice, I’d like to investigate more closely on which instances the model works and where it fails.
The Tagalog corpus is the biggest bottleneck. Increasing the amount of
gold-annotated data or increasing the reliability of silver-standard annotations
should be a priority. The former is a bit of a brute-force approach, but I think
that sometimes we just need to sit down, tackle the problem, and start
annotating.
Final thoughts ↵
In Tagalog, we have this word called diskarte. There is no direct translation
in English, but I can describe it loosely as resourcefulness and creativity.
It’s not a highly-cognitive trait: smart people may be bookish, but not
madiskarte. It’s more practical, a form of street smarts, even. Diskarte is
a highly-Filipino trait, borne from our need to solve things creatively in the
presence of constraints (Morales, 2017). I mention this
because working in Tagalog, or any low-resource language, requires a little
diskarte, and I enjoy it!
There are many exciting ways to tackle Tagalog NLP. Right now, I’m taking the
standard labeling, training, and evaluation approach. However, I’m also
interested in exploring model-based techniques like cross-lingual transfer
learning and multilingual NLP to “get around” the data bottleneck. After three
months (twelve weekends, to be specific) of labeling, I realized how long and
costly the process is. I still believe in getting gold-standard annotations,
and that having enough of them to support a structured dataset evaluation is
worth it, but I want to balance this approach with model-based solutions.
I wish we had more consolidated efforts to work on Tagalog NLP. Right now,
research progress at each institution seems disconnected from the others. I
definitely like what’s happening in Masakhane for African languages and IndoNLP
for Indonesian; I think they are good community models to follow. Lastly,
Tagalog is not the only language in the Philippines, and being able to tackle
the other Philippine languages, even one at a time, would be nice.
Right now, I’m working on
calamanCy, my attempt to create
spaCy pipelines for Tagalog. Its name is based on
kalamansi, a citrus fruit common in
the Philippines. Unfortunately, it’s something that I’ve been working on in my
spare time, so progress is slower than usual! This blog post contains my
experiments on building the NER part of the pipeline. I plan to add a dependency
parser and POS tagger from Universal Dependencies in the future.
That’s all for now. Feel free to hit me up if you have any questions and want to
collaborate! Maraming salamat!
References
- Conneau, Alexis, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. “Unsupervised cross-lingual representation learning at scale.” arXiv preprint arXiv:1911.02116 (2019).
- David Mortensen. Undated. Low-Resource NLP. Algorithms for Natural Language Processing. [Slides]
- Kyle Gorman and Steven Bedrick. 2019. We Need to Talk about Standard Splits. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2786–2791, Florence, Italy. Association for Computational Linguistics.
- Jan Christian Blaise Cruz and Charibeth Cheng. 2022. Improving Large-scale Language Models and Resources for Filipino. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 6548–6555, Marseille, France. European Language Resources Association.
- Jan Christian Blaise Cruz and Charibeth Cheng. 2019. Evaluating Language Model Finetuning Techniques for Low-resource Languages arXiv:1907.00409.
- Liu, Yinhan, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. “Roberta: A robustly optimized bert pretraining approach.” arXiv preprint arXiv:1907.11692 (2019).
- Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2020. The State and Fate of Linguistic Diversity and Inclusion in the NLP World. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6282–6293, Online. Association for Computational Linguistics.
- Ron Artstein. Inter-annotator Agreement. In: Ide Nancy & Pustejovsky James (eds.) Handbook of Linguistic Annotation. Springer, Dordrecht, 2017. DOI 10.1007/978-94-024-0881-2.
- Anders Søgaard, Sebastian Ebert, Jasmijn Bastings, and Katja Filippova. 2021. We Need To Talk About Random Splits. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1823–1832, Online. Association for Computational Linguistics.
- Sowmya Vajjala, and Ramya Balasubramaniam. “What do we Really Know about State of the Art NER?.” arXiv preprint arXiv:2205.00034 (2022).
- Xiaoman Pan, Boliang Zhang, Jonathan May, Joel Nothman, Kevin Knight, and Heng Ji. 2017. Cross-lingual Name Tagging and Linking for 282 Languages. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1946–1958, Vancouver, Canada. Association for Computational Linguistics.
- Yulia Tsvetkov, 2017. Opportunities and Challenges in Working with Low-Resource Languages. Language Technologies Institute, Carnegie Mellon University. [Slides].
- Marie Rose H. Morales, 2017. Defining Diskarte: Exploring Cognitive Processes, Personality Traits, and Social Constraints in Creative Problem Solving, Philippine Journal of Psychology vol. 50, no. 2.