You could learn how an oven works and still not know how to cook. Similarly, you could understand the syntax of a machine learning tool and still not be able to apply the technology in a meaningful way. That's why in this blog post I'd like to describe some topics around the creation of a spaCy project that aren't directly about syntax, but rather about "the act" of doing an NLP project in general.
As an example use-case to focus on, we’ll be predicting tags for GitHub issues.
The goal isn’t to discuss the syntax or the commands that you’ll need to run.
Instead, this blog post will describe how a project might start and evolve.
We’ll start with a public dataset, but while working on the project we’ll also
build a custom labelling interface, improve model performance by selectively
ignoring parts of the data and even build a model reporting tool for spaCy as a
by-product.
Motivating example
Having recently joined Explosion, I noticed the manual effort involved in
labelling
issues on the spaCy GitHub repository.
When I talked to colleagues who maintain the tracker on a daily basis, they mentioned that some sort of automated label suggester could help reduce the manual load and enforce more consistent labelling.
This repository has over 5000 issues, most of which have one or more tags
attached by the project's maintainers. Some of these tags indicate that an issue is about a bug, for example, while others show that it concerns the documentation.
I discussed the idea of predicting tags with Sofie, one of the core developers
of spaCy. She was excited by the idea and agreed to support the project as the "domain expert" who could explain the details whenever I was missing relevant context.
After a short discussion, I verified some important project properties:
- There was a valid business case to explore. Even if it was unclear what we should expect from a model, we did recognize that a model that could predict a subset of tags would be helpful.
- There was a labelled dataset available with about 5,000 examples that could easily be downloaded from the GitHub API. While the labels may not be perfectly consistent, they should certainly suffice as a starting point.
- The problem was well defined in the sense that we could translate it into a text categorization task: the contents of a GitHub issue contain text that needs to be classified into a set of non-exclusive classes that are known upfront.
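As a rough illustration of that last point, this is how such a task could be framed in spaCy as a multilabel text classifier. It's a minimal sketch, not the project's actual setup (the real pipeline is generated from a config file later on), and the label names are just examples from the issue tracker:
```python
import spacy

# Minimal sketch: an issue can carry several tags at once, so the problem maps
# to spaCy's multilabel text categorizer rather than the mutually exclusive one.
nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat_multilabel")
for label in ["bug", "docs", "usage", "feat / matcher"]:
    textcat.add_label(label)
```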
This information was enough for me to get started.
Step 1: project setup
To get an overview of the steps needed in my pipeline, I typically start out by
drawing on a digital whiteboard. Here’s the first drawing I made.

To describe each step in more detail:
- First, a script downloads the relevant data from the GitHub API.
- Next, this data would need to be cleaned and processed. Between the sentences describing the issue, there would also be markdown and code blocks, so some sort of data cleaning step is required here. Eventually, this data needed to be turned into the binary .spacy format so that I could use it to train a spaCy model (a sketch of this conversion follows after this list).
- The final step would be to train a model. The hyperparameters would need to be defined upfront in a configuration file and the trained spaCy model would then be saved to disk.
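To make that conversion concrete, here's a minimal sketch of turning a .jsonl file into the binary .spacy format. The field names ("text", "labels") are assumptions about how the downloaded data is stored, not the exact script from the project:
```python
import spacy
import srsly
from spacy.tokens import DocBin

def convert(lang: str, input_path: str, output_path: str, labels: list):
    # Build Doc objects with category annotations and store them in a DocBin,
    # which can be written to disk as a .spacy file for `spacy train`.
    nlp = spacy.blank(lang)
    doc_bin = DocBin()
    for example in srsly.read_jsonl(input_path):
        doc = nlp.make_doc(example["text"])
        doc.cats = {label: float(label in example["labels"]) for label in labels}
        doc_bin.add(doc)
    doc_bin.to_disk(output_path)
```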
This looked simple enough, but the “preprocess” step felt a bit vague. So I
expanded that step.

I decided to distinguish between a couple of phases in my preprocess step.
- First, I decided that I needed a clean step (a sketch follows after this list). I wanted to be able to debug that cleaning step, which meant I also needed an inspectable file with clean data on disk. I usually also end up re-labelling some of the data with Prodigy when I'm working on a project, and a cleaned .jsonl file would allow me to update my data at the start of the pipeline.
- Next, I needed a split step. I figured I might want to run some manual analytics on the performance on the train and validation sets. That meant that I needed the .jsonl variants of these files on disk as well. The reason I couldn't use the .spacy files for this is that the .jsonl files can contain extra metadata. For example, the raw data had the date when the issue was published, which would be very useful for sanity checks.
- Finally, I'd need a convert step. With the intermediate files ready for potential investigation, the final set of files I'd need is the .spacy versions of the training and validation sets.
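As an example of what the clean step might involve, here's a small sketch that strips fenced code blocks out of the issue body before writing the cleaned .jsonl file. The field name "text" is an assumption about the raw data:
```python
import re
import srsly

# Fenced code blocks add a lot of noise for a text classifier,
# so a first cleaning pass simply removes them from the issue body.
CODE_BLOCK = re.compile(r"`{3}.*?`{3}", re.DOTALL)

def clean(input_path: str, output_path: str):
    examples = []
    for example in srsly.read_jsonl(input_path):
        example["text"] = CODE_BLOCK.sub(" ", example["text"]).strip()
        examples.append(example)
    srsly.write_jsonl(output_path, examples)
```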
Reflection
Implementing the code I need for a project is a lot easier when I have a
big-picture idea of what features are required. That’s why I love doing the
“solve it on paper first” exercise when I’m starting. The drawn diagrams don’t
just help me think about what I need; they also make for great documentation
pieces, especially when working with a remote team.
When the drawing phase was done, I worked on a project.yml
file that defined
all the steps I’d need.
What did the project.yml file look like?
A project.yml
file in spaCy contains a description of all the steps, with
associated scripts, that one wants to use in a spaCy project. The snippet below
omits some details, but the most important commands I started with were:
workflows:
  all:
    - download
    - clean
    - split
    - convert
    - train
  preprocess:
    - clean
    - split
    - convert

commands:
  - name: 'download'
    help: 'Scrapes the spaCy issues from the Github repository'
    script:
      - 'python scripts/download.py raw/github.jsonl'
  - name: 'clean'
    help: 'Cleans the raw data for inspection.'
    script:
      - 'python scripts/clean.py raw/github.jsonl raw/github_clean.jsonl'
  - name: 'split'
    help: 'Splits the downloaded data into a train/dev set.'
    script:
      - 'python scripts/split.py raw/github_clean.jsonl assets/train.jsonl assets/valid.jsonl'
  - name: 'convert'
    help: "Convert the data to spaCy's binary format"
    script:
      - 'python scripts/convert.py en assets/train.jsonl corpus/train.spacy'
      - 'python scripts/convert.py en assets/valid.jsonl corpus/dev.spacy'
  - name: 'train'
    help: 'Train the textcat model'
    script:
      - 'python -m spacy train configs/config.cfg --output training/ --paths.train corpus/train.spacy --paths.dev corpus/dev.spacy --nlp.lang en'
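With this file in place, spaCy's project CLI can run the whole workflow, or just the preprocessing subset, with a single command:
```
python -m spacy project run all          # download, clean, split, convert, train
python -m spacy project run preprocess   # only the clean/split/convert steps
```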
What did the folder structure look like?
While working on the project file, I also made a folder structure.
📂 spacy-github-issues
┣━━ 📂 assets
┃ ┣━━ 📄 github-dev.jsonl (7.7 MB)
┃ ┗━━ 📄 github-train.jsonl (12.5 MB)
┣━━ 📂 configs
┃ ┗━━ 📄 config.cfg (2.6 kB)
┣━━ 📂 corpus
┃ ┣━━ 📄 dev.spacy (4.0 MB)
┃ ┗━━ 📄 train.spacy (6.7 MB)
┣━━ 📂 raw
┃ ┣━━ 📄 github.jsonl (10.6 MB)
┃ ┗━━ 📄 github_clean.jsonl (20.3 MB)
┣━━ 📂 recipes
┣━━ 📂 scripts
┣━━ 📂 training
┣━━ 📄 project.yml (4.1 kB)
┣━━ 📄 README.md (1.9 kB)
┗━━ 📄 requirements.txt (95 bytes)
The training folder would contain the trained spaCy models. The recipes folder
would contain any custom recipes that I might add for Prodigy and the scripts
folder would contain all the scripts that handle the logic that I drew on the
whiteboard.
Step 2: learning from a first run
With everything in place, I spent a few hours implementing the scripts that I
needed. The goal was to build a fully functional loop from data to model first.
It’s much easier to iterate on an approach when it’s available from start to
finish.
That meant that I also did the bare minimum for data cleaning. The model
received the raw markdown text from the GitHub issues, which included the raw
code blocks. I knew this was sub-optimal, but I really wanted to have a working
pipeline before worrying about any details.
Once the first model had been trained, I started digging into the model and the data. From that exercise, I learned a few important lessons.
- First, there are 113 tags in the spaCy project, many of which aren't used much. Especially the tags that relate to specific natural languages might only have a few examples.
- Next, some of the listed tags were never going to be relevant, no matter how many training examples there were. For example, the v1 tag indicates that the issue is related to a spaCy version that's no longer maintained. Similarly, the wontfix tag indicates a deliberate choice by the project maintainers, and the reasoning behind that choice will typically not be present in the first post of the issue. Realistically, no algorithm has a way of predicting this kind of "meta" tag.
- Finally, the predictions coming out of spaCy deserved to be investigated further. A spaCy classifier predicts a dictionary with label/confidence pairs, but the confidence values tend to differ significantly between tags.
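For reference, this is roughly what inspecting those predictions looks like once a model has been trained; the model path and example text here are illustrative:
```python
import spacy

nlp = spacy.load("training/model-best")
doc = nlp("The Matcher raises an error when I add a pattern with an empty list.")

# doc.cats maps every label to a confidence score, and those scores are not
# calibrated the same way for every tag.
for label, score in sorted(doc.cats.items(), key=lambda item: -item[1])[:5]:
    print(f"{label:<20} {score:.3f}")
```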
I wanted to prepare for my next meeting with Sofie, so the next step was to make an inventory of what we might be able to expect from the current setup.
Step 3: build your own tools
To better understand the model performance, I decided to build a small dashboard that would allow me to inspect the performance of each tag prediction individually. If I understood the relation between a tag's threshold and its precision/recall performance, I could use that in my conversation with Sofie to confirm whether the model was useful.
I proceeded to add a script to my project.yml
file that generated a few
interactive charts in a static HTML file. That way, whenever I’d retrain my
model, I’d be able to automatically generate an interactive index.html file that
allowed me to play around with threshold values.
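The core of such a report boils down to sweeping over threshold values and computing precision/recall for one tag at a time. A simplified version of that calculation, assuming you have the model's confidence scores and the ground-truth labels for the validation set, could look like this:
```python
import numpy as np

def precision_recall_at(threshold: float, scores, truths):
    # `scores` holds the model confidences for a single tag on the validation
    # set, `truths` the matching 0/1 ground-truth labels.
    preds = np.asarray(scores) >= threshold
    truths = np.asarray(truths).astype(bool)
    true_pos = (preds & truths).sum()
    precision = true_pos / preds.sum() if preds.sum() else 0.0
    recall = true_pos / truths.sum() if truths.sum() else 0.0
    return precision, recall

# One point per threshold gives the curve shown in the dashboard, e.g.:
# curve = [(t, *precision_recall_at(t, scores, truths)) for t in np.linspace(0, 1, 101)]
```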
Here’s what the dashboard would show me for the docs tag.


With this dashboard available, it was time to check back in with Sofie to
discuss which tags might be most interesting to explore further.
Step 4: reporting back
The meeting that followed with Sofie was exciting. As a maintainer, she had a
lot of implicit knowledge that I was unaware of, but because I was a bit closer
to the data at this point, I also knew things that she didn’t. The exchange was
very fruitful, and a couple of key decisions got made.
- First, Sofie pointed out that I was splitting my train/test sets randomly, which wasn't ideal: I should take the most recent issue data as my test set instead (a date-based split, sketched just after this list). The main reason was that the way the issue tracker is used has changed a few times over the years. The project gained new tags, conventions shifted as new maintainers joined, and the repository also recently adopted GitHub discussions, which caused a lot of issues to become discussion items instead.
- Next, Sofie agreed that we should only focus on a subset of tags. We decided to only look at tags that appear at least 180 times.
- Finally, I asked Sofie how much I could trust the training data. After a small discussion, we agreed it would be good to double-check some examples, because it's possible that some of the tags weren't assigned consistently. While the spaCy core team has been very stable over the years, the specific set of people doing support has varied a bit over time. I had also spotted some issues that didn't have any tags attached, and Sofie agreed that these were good candidates to check first.
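A date-based split like the one Sofie suggested only takes a few lines. This sketch assumes the cleaned .jsonl data still carries the GitHub created_at timestamp; the cut-off date is something you'd pick based on the data:
```python
import datetime as dt
import srsly

def split_by_date(input_path: str, train_path: str, dev_path: str, cutoff: str):
    # Everything created on or after the cut-off date becomes the dev/test set,
    # so we always evaluate on the most recent issues.
    cutoff_date = dt.date.fromisoformat(cutoff)
    train, dev = [], []
    for example in srsly.read_jsonl(input_path):
        created = dt.date.fromisoformat(example["created_at"][:10])
        (dev if created >= cutoff_date else train).append(example)
    srsly.write_jsonl(train_path, train)
    srsly.write_jsonl(dev_path, dev)
```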
I agreed to pick up all of these items as the next steps.
Step 5: lessons from labelling
I adapted the scripts that I had and moved on to create a custom labelling recipe for Prodigy. I wanted the issues to render just like they would on GitHub, so I went through the effort of integrating the CSS that GitHub uses to render markdown.

I was happy with how the issue was rendered, but I quickly noticed that some of these issues are very long. I was mindful of my screen real estate, which is why I did some extra CSS work to get the entire interface loaded in a two-column layout.

This two-column layout was just right for my screen. The images would render
too, which was a nice bonus of the setup.
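To give an idea of what such a recipe could look like, here's a rough sketch. It assumes each task in the stream already carries a pre-rendered "html" field and that the GitHub stylesheet has been saved locally; the tag list is just a subset, and the exact loader imports and config options may differ a bit between Prodigy versions:
```python
from pathlib import Path

import prodigy
from prodigy.components.loaders import JSONL

TAGS = ["bug", "docs", "usage", "feat / matcher"]  # illustrative subset

@prodigy.recipe("textcat.github-issues")
def github_issues(dataset: str, source: str):
    def add_options(stream):
        for task in stream:
            # Each task is expected to contain a pre-rendered "html" field.
            task["options"] = [{"id": tag, "text": tag} for tag in TAGS]
            yield task

    return {
        "dataset": dataset,
        "stream": add_options(JSONL(source)),
        "view_id": "blocks",
        "config": {
            "choice_style": "multiple",
            "blocks": [{"view_id": "html"}, {"view_id": "choice"}],
            # The GitHub markdown stylesheet, saved to a local file.
            "global_css": Path("recipes/github.css").read_text(),
        },
    }
```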
With that in place, it was ready for labelling, so I checked the data by trying out a few tactics.
- I started by looking at examples that didn't have any tags attached. Many of these turned out to be short questions related to the usage (and non-usage) of the library, things like "can I get spaCy to run on mobile?". They were often about an example where the tokenizer or the part-of-speech tagger did something unexpected. Many of these examples didn't describe an actual bug but rather a user's expectation, which is why they belong under the "usage" tag.
- Next, I decided to check examples by sampling them randomly. By labelling this way, I noticed that many examples with the "bug" tag missed an associated tag that would highlight the relevant part of the codebase. After doing some digging, I learned that, for example, the feat / matcher tag was introduced much later than the bug label. That meant that many relevant tags could be missing from the dataset if the issue appeared before 2018.
- Finally, I figured I'd try one more thing. Given the previous exercise, I felt that the presence of a tag was more reliable than the absence of one. So I used the model that I had trained and had it try to predict the feat / matcher tag. If this tag was predicted while it was missing from an example, it was worth double-checking. There were 41 such examples, compared to 186 labelled instances that did have a feat / matcher tag. After labelling, I confirmed that 37/41 were wrongly missing the tag. It also turned out that 29/37 of these examples predate 2018-02, which was when the feat / matcher label was introduced.
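That third tactic is easy to automate. A sketch of the model-assisted check, with illustrative paths, field names and cut-off:
```python
import spacy
import srsly

nlp = spacy.load("training/model-best")
candidates = []
for example in srsly.read_jsonl("assets/train.jsonl"):
    if "feat / matcher" in example.get("labels", []):
        continue  # tag already present, nothing to check
    doc = nlp(example["text"])
    # Confident prediction but no tag in the data: worth a second look in Prodigy.
    if doc.cats.get("feat / matcher", 0.0) >= 0.5:
        candidates.append(example)
```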
Given these lessons, I decided to reduce my train set: I'd only take examples from after 2018-02 to improve consistency. Just doing this had a noticeable effect on my model performance.
| Epoch | Step | Score Before | Score After |
|-------|------|--------------|-------------|
| 0     | 500  | 57.97        | 61.31       |
| 0     | 1000 | 63.10        | 66.61       |
| 0     | 1500 | 64.85        | 70.99       |
| 0     | 2000 | 67.47        | 73.65       |
| 0     | 2500 | 71.53        | 75.77       |
| 1     | 3000 | 71.79        | 77.93       |
| 1     | 3500 | 73.20        | 79.11       |
This was an interesting observation: I only modified my training data and didn't change the test data used for evaluation. That means I improved the model's performance by iterating on the data instead of the model.
I decided to label some more examples that might be related to the feat / matcher tag before training the model one more time. I made subsets by looking for issues that had the term "matcher" in the body. This gave me another 50 examples.
Step 6: another progress report
After training the model again, and after inspecting the threshold reports, I figured I had hit a nice milestone. I zoomed in on the feat / matcher tag and learned that I could achieve:
- a 75% precision / 75% recall rate when selecting a threshold of 0.5
- a 91% precision / 62% recall rate when selecting a threshold of 0.84
These metrics weren’t by any means “state of the art” results, but they were
tangible enough to make a decision.
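In practice, turning the raw confidences into tag suggestions is a small helper; the threshold dictionary below is just an example of how the 0.84 cut-off could be plugged in:
```python
def suggest_tags(doc, thresholds, default=0.5):
    # Keep every tag whose confidence clears its (per-tag) threshold.
    return [tag for tag, score in doc.cats.items() if score >= thresholds.get(tag, default)]

# e.g. favour precision for the matcher tag:
# suggest_tags(doc, {"feat / matcher": 0.84})
```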
I presented these results to Sofie and she appreciated being able to consider the threshold options. It meant that we could tune the predictions without having to retrain a new model, even though the numbers were looking pretty good at the original 0.5 threshold. As far as the feat / matcher tag was concerned, Sofie considered the exercise a success.
The project could be taken further. We wondered about what other tags to
prioritize next and also started thinking about how we might want to run the
model in production. But the first exercise of building a model was complete,
which meant that we could look back and reflect on some lessons learned along
the way.
Conclusion
In this blog post, I’ve described an example of how a spaCy project might
evolve. While we started with a problem that was well defined, it was pretty
hard to predict what steps we’d take to improve the model and get to where we
are now. In particular:
- We analyzed the tags dataset, which taught us that issues from before a certain date could be excluded for data consistency.
- We created a custom labelling interface, using the CSS from GitHub, which made it convenient to improve and relabel examples in our training data.
- We made a report that was specific to our classification task, which allowed us to pick threshold values to suit our needs.
All of these developments were directly inspired by the classification problem we tried to tackle. For some subproblems, we were able to use pre-existing tools, but it was completely valid to put in the effort to make tools tailored to the specific task. You could even argue that taking the time to do this made all the difference.
Imagine if, instead, we had put all of that effort into the model. Is it really likely that we would have ended up in a better state just by trying out more hyperparameters?
The reason I like the current milestone is that the problem is now much better understood. That's also why my interactions with Sofie were so valuable! It's much easier to customize a solution if you can discuss milestones with a domain expert.
This lesson also mirrors some of the lessons we’ve learned while working on
client projects via our
tailored pipelines offering. It
really helps to take a step back to consider an approach that is less general in
favor of something a bit more bespoke. Many problems in NLP won't be solved with general tools; they'll need a tailored solution instead.
Oh, and one more thing …
The custom dashboard I made in this project turned out to be very useful. I also figured that the tool is general enough to be useful for other spaCy users. That's why I decided to open-source it. You can pip install spacy-report today to explore threshold values for your own projects!