March 28, 2025

ikayaniaamirshahzad@gmail.com

Reflections on a year of spaCy consulting at Explosion


We’ve been offering consulting services through spaCy Tailored Pipelines for almost a year! I thought I’d post some lessons I’ve learned from chatting with practitioners about their NLP challenges, developing production-ready NLP pipelines for clients, and working with an open-source development team.

First, it’s always worth the time to understand a client’s problem as fully as possible. We try to be thorough yet expeditious in our interactions – it can take a lot of time and energy to get everyone to a shared understanding of whether a challenge is a fit for our services. Not everything results in a project, and that’s a good thing – we want everyone to spend their energy where it makes sense, and we try to point clients in the right direction even if that isn’t a direct engagement with us.

When we see demand in the market from conversations with practitioners, we do our best to respond. That’s why we launched spaCy Tailored Analysis in December. Many discussions we had with clients were fruitful in terms of scoping a problem, but often that problem was not developing an NLP pipeline. Instead, it was a problem of annotating data, or improving an existing model and codebase, or developing a strategy to collect and organize data for a project. These pipeline-adjacent problems are really important, and we have an impressive team that helps with these as well.

Personally, I feel very well supported by the skilled and capable team of NLP and spaCy experts behind our consulting projects. I’m not an expert in everything, but for any NLP problem I am confident we have someone on the team who has experience with it and can help everyone understand the details. It’s also nice that these experts are the core spaCy developers, available to answer questions and assist with implementing best practices. As an example, I was using spancat seriously for the first time on a project, and it was so helpful to have the people who built spancat just a message away, able to provide their own insight and suggestions.

Moving on to the projects we completed, a few trends are noticeable. Most critically: garbage in, garbage out still holds true. Processing, cleaning up and validating data is still a top priority. You can’t hyperparameter-tune your way out of a dumpster. This doesn’t mean bad data is the end of the world, but you need to have a way to manage data quality. We spend a lot of time thinking about the data schema up front, as well as how to enforce it with data validation, to help make pipelines more robust. This is especially important with data gathered from the web!
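To make the schema-first idea a bit more concrete, here is a minimal sketch using only the Python standard library. The `Record` fields (`url`, `text`, `lang`) and the `load_records` helper are hypothetical names chosen for illustration, not something from our client projects; in practice you might reach for a dedicated validation library instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical schema for one scraped document."""

    url: str
    text: str
    lang: str

    def __post_init__(self):
        # Enforce the schema at construction time so bad rows fail early.
        for name in ("url", "text", "lang"):
            if not isinstance(getattr(self, name), str):
                raise TypeError(f"{name} must be a string")
        if not self.url.startswith(("http://", "https://")):
            raise ValueError(f"url must be absolute: {self.url!r}")
        if not self.text.strip():
            raise ValueError("text must not be empty")
        if len(self.lang) != 2 or not self.lang.isalpha():
            raise ValueError(f"lang must be a two-letter code: {self.lang!r}")


def load_records(rows):
    """Split raw dict rows into valid records and (row, reason) rejections."""
    valid, rejected = [], []
    for row in rows:
        try:
            valid.append(Record(**row))
        except (TypeError, ValueError) as err:
            # Missing or unexpected keys raise TypeError; bad values raise ValueError.
            rejected.append((row, str(err)))
    return valid, rejected
```

Keeping the rejected rows alongside the reason, rather than silently dropping them, is what lets you see how messy a web-scraped dataset really is before any model gets involved.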

We’ve had great success with hybrid ML and rules-based approaches for real-world problems. I think it’s unrealistic to expect a single model or component to solve your entire problem. Spancat is a nice example of a flexible hybrid tool: you can narrow spans of text down to viable candidates with a suggester function and then classify those candidate spans. Then, you can combine the spans that spancat identifies with rules to do complex information extraction tasks. This span cascading workflow is really nice to work with and a total breeze to maintain with modular spaCy components.
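The shape of that cascade (suggest, classify, then apply rules) can be sketched in plain Python. To be clear, this is not spaCy’s API: the `LEXICON` lookup is a stand-in for a trained span classifier, and every function name here is hypothetical. In a real pipeline the first two steps would be handled by the spancat component, which writes its predictions to `doc.spans`, and the rule step would be a custom component reading from there.

```python
import re

# Stand-in for a trained span classifier: maps lowercased token tuples to labels.
LEXICON = {("acme", "corp"): "ORG", ("berlin",): "CITY"}


def ngram_suggester(tokens, sizes=(1, 2)):
    """Step 1: propose every n-gram of the given sizes as a (start, end) candidate."""
    for size in sizes:
        for start in range(len(tokens) - size + 1):
            yield start, start + size


def classify(tokens, start, end):
    """Step 2: label a candidate span, or return None to reject it."""
    return LEXICON.get(tuple(t.lower() for t in tokens[start:end]))


def office_locations(tokens, spans):
    """Step 3: rule pairing a CITY preceded by 'in' with the nearest earlier ORG."""
    results = []
    for start, end, label in spans:
        if label != "CITY" or start == 0 or tokens[start - 1].lower() != "in":
            continue
        orgs = [s for s in spans if s[2] == "ORG" and s[1] <= start]
        if orgs:
            o_start, o_end, _ = max(orgs, key=lambda s: s[1])
            results.append(
                (" ".join(tokens[o_start:o_end]), " ".join(tokens[start:end]))
            )
    return results


def extract(text):
    """Run the full cascade on one text and return (org, city) pairs."""
    tokens = re.findall(r"\w+", text)
    spans = []
    for start, end in ngram_suggester(tokens):
        label = classify(tokens, start, end)
        if label is not None:
            spans.append((start, end, label))
    spans.sort()
    return office_locations(tokens, spans)
```

Because each step only depends on the output of the previous one, you can swap the classifier or add a rule without touching the rest, which is the maintainability benefit described above.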

This brings me to one technical area I’d like us to work on a bit more: making more robust information extraction pipelines. With rules-based approaches and messy data, you might have new cases that don’t fit into your schema and fail validation. I think we can improve the workflow for doing data validation within a pipeline, so you can track validation errors more precisely and make practical use of that information in improving your system.
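As a rough illustration of what tracking validation errors could look like, here is a small sketch that logs failures instead of raising, so one odd document doesn’t stop a batch and the failures can be counted afterwards. The `ValidationLog` class, the `amount`/`currency` fields and the allowed currency set are all assumptions made up for this example; this is not an existing spaCy feature.

```python
from collections import Counter

ALLOWED_CURRENCIES = {"EUR", "USD", "GBP"}  # assumed schema constraint


class ValidationLog:
    """Collects validation failures so they can be inspected after a run."""

    def __init__(self):
        self.errors = []

    def record(self, doc_id, field, reason):
        self.errors.append({"doc_id": doc_id, "field": field, "reason": reason})

    def summary(self):
        # Count failures per (field, reason) to see which rule breaks most often.
        return Counter((e["field"], e["reason"]) for e in self.errors)


def validate_extraction(doc_id, extraction, log):
    """Check one extracted record; log every failure and return True if clean."""
    clean = True
    try:
        float(extraction.get("amount"))
    except (TypeError, ValueError):
        log.record(doc_id, "amount", "not_a_number")
        clean = False
    if extraction.get("currency") not in ALLOWED_CURRENCIES:
        log.record(doc_id, "currency", "unknown_currency")
        clean = False
    return clean
```

Calling `log.summary().most_common()` after a batch gives a ranked list of failure types, which is one way to decide whether the next fix belongs in the rules, the schema, or the annotation guidelines.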

Finally, consulting projects improve spaCy and the larger Explosion ecosystem, which is so cool! Lots of the spaCy contributions I worked on came from ideas we had or stumbling blocks we encountered while completing a project. We’re also doing some battle testing of Prodigy, allowing us to develop more custom recipes and round out some annotation edge cases.

As a final note, my colleague Sofie, who is also working on some interesting consulting projects, wrote a nice post on running successful data science projects. If you’re interested in improving the success rate of your projects, I’d strongly encourage you to check out her post as well.


