March 28, 2025


Introducing Holmes 4.0 · Explosion


A few weeks ago we released version 4.0 of Holmes, which we are now able to offer under a permissive MIT license. Holmes is a library in the spaCy Universe that runs on top of spaCy and enables information extraction and intelligent search, currently for English and German. Holmes goes beyond simple matching algorithms and allows you to look for a specified idea or ideas in a corpus of documents.

Holmes offers two main search mechanisms. The first, structural matching, aims to find text snippets in a corpus that express a given idea exactly and is useful for extracting structured information, for example into a relational database. The second, topic matching, is fuzzier and forms the basis for a real-time search engine. Structural matching is the more fundamental of the two mechanisms, so I shall explain it first, and then go on to discuss topic matching, which builds upon it.

The history of Holmes

I wrote the original version of Holmes while working at msg systems, a large, international IT consultancy with its headquarters near Munich. Holmes was partly based on concepts that were developed at an earlier employer of mine and that are described in a U.S. patent. I now work at Explosion, and the patent is now controlled by AstraZeneca. Thanks to the goodwill and openness of both AstraZeneca and msg systems, we are able to continue maintaining the library at Explosion and to offer it for the first time under a permissive MIT license. This means that people can now use it, and expand on it if they wish, without having to worry about the patent or other legal issues.

1. Structural matching

You tell Holmes the idea you are looking for, specifying a phrase and strategies for recognizing the individual words, and leave it to the library to find complex examples.

1.1 Recognizing different ways of saying the same thing

Tools like spaCy’s Matcher are an effective way of performing information extraction: the Matcher lets you specify both lexical and grammatical features with which to find phrases within a large body of documents. However, capturing a single idea typically requires many rules, because the same thing can be said in many different ways (see Figure 1). The variation occurs on two levels. First, the four examples have different surface grammatical structures. Second, groups of words like acquire, buy and take over are used synonymously, and specific instances of an entity type such as company have names like MaxLinear and Datto.

Figure 1: Different headlines announcing company takeovers

The aim of Holmes structural matching is to abstract away both these types of variation so that the user can concentrate on the information they want to extract. You can tell Holmes that a company takes over a company is the idea you are looking for, specify strategies for recognizing company names and synonyms of take over, and leave it to the library to find complex examples without having to write a large number of extra rules.
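The lexical half of this abstraction can be illustrated with a small, self-contained sketch. It is a toy and not Holmes's implementation or API: the triple representation, the `normalize` and `matches` helpers and the company names `AcmeCorp` and `Globex` are all hypothetical, and the triples are assumed to have been extracted already.

```python
# Toy illustration only: NOT how Holmes works internally.
# Verbs from the article that are used synonymously for a takeover.
TAKEOVER_SYNONYMS = {"acquire", "buy", "take over"}
# Hypothetical company names standing in for a real entity recognizer.
COMPANIES = {"AcmeCorp", "Globex"}

# The idea being searched for: "a company takes over a company".
SEARCH_PHRASE = ("company", "take over", "company")


def normalize(triple):
    """Map a (subject, verb, object) triple of lemmas onto generic terms."""
    subject, verb, obj = triple

    def entity(word):
        # Replace any known company name with the generic word 'company'.
        return "company" if word in COMPANIES else word

    canonical_verb = "take over" if verb in TAKEOVER_SYNONYMS else verb
    return (entity(subject), canonical_verb, entity(obj))


def matches(triple):
    """True if the triple expresses the search phrase after normalization."""
    return normalize(triple) == SEARCH_PHRASE


print(matches(("AcmeCorp", "buy", "Globex")))      # True
print(matches(("AcmeCorp", "acquire", "Globex")))  # True
print(matches(("AcmeCorp", "sue", "Globex")))      # False
```

One search phrase covers every synonym and every company name, so the number of rules no longer grows with the vocabulary.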

1.2 Deriving the meanings of sentence structures

The grammatical relationships between the words within a phrase determine how the individual meanings of those words combine to form an overall meaning for the phrase. The rules that drive this apply across any phrases that share a given structure, regardless of the specific words involved (see Figure 2). Central to Holmes are rules that transform the syntactic surface structures output by the standard spaCy models into corresponding underlying semantic structures. Unlike in a typical rule-based system, where rules are developed for a specific task and handle words and phrases specific to that task, these rules, which we refer to as meta-rules:

  • describe the basic grammatical and semantic structures of a language
  • are valid for any task involving texts written in that language
  • are maintained as a standard, static part of the core library

Figure 2: Parallel grammatical structures

For example, the meta-rules required to derive the correct semantic structure from the sentences in the Structure 2 row of Figure 2 would handle recognizing the passive construction is … by and assigning the correct semantic roles to the arguments of passive verbs, while the meta-rules required for the Structure 3 and Structure 4 rows would process compound words formed from nouns and participles.
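The passive case can be sketched with a toy meta-rule. This is a minimal illustration under stated assumptions, not Holmes's actual rule set: the `Arc` type and `to_semantic` function are hypothetical, the arcs are written by hand rather than produced by a parser, the words are lemmas, and the labels (`nsubj`, `dobj`, `nsubjpass`, `agent`, `pobj`) are modelled on those used by spaCy's English models.

```python
from typing import List, NamedTuple, Optional, Tuple


class Arc(NamedTuple):
    """One hand-written dependency arc: head lemma, label, child lemma."""
    head: str
    label: str
    child: str


def to_semantic(arcs: List[Arc]) -> Tuple[Optional[str], Optional[str], Optional[str]]:
    """Toy meta-rule: reduce active and passive arcs to (predicate, agent, patient)."""
    roles = {}
    for arc in arcs:
        if arc.label == "nsubj":
            # Active subject: the one performing the action.
            roles["predicate"] = arc.head
            roles["agent"] = arc.child
        elif arc.label in ("dobj", "nsubjpass"):
            # Active object and passive subject both undergo the action.
            roles["predicate"] = arc.head
            roles["patient"] = arc.child
        elif arc.label == "agent":
            # Passive 'by' phrase: the object of 'by' performs the action.
            for inner in arcs:
                if inner.head == arc.child and inner.label == "pobj":
                    roles["agent"] = inner.child
    return (roles.get("predicate"), roles.get("agent"), roles.get("patient"))


# 'The dog chases the cat'
active = [Arc("chase", "nsubj", "dog"), Arc("chase", "dobj", "cat")]
# 'The cat is chased by the dog'
passive = [
    Arc("chase", "nsubjpass", "cat"),
    Arc("chase", "auxpass", "be"),
    Arc("chase", "agent", "by"),
    Arc("by", "pobj", "dog"),
]

print(to_semantic(active))   # ('chase', 'dog', 'cat')
print(to_semantic(passive))  # ('chase', 'dog', 'cat')
```

Both surface structures reduce to the same triple, which is what allows a single search phrase to match either sentence.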

Predicate logic

The meaning communicated by any sentence can be captured using predicate logic. For example, the sentence The child gave the dog a bone expresses a first-order predication give(child, dog, bone) linking the predicate give with the arguments child, dog and bone. Actually being able to derive correct logical structures for every sentence in a corpus could be seen as the Holy Grail of natural language understanding: a machine that genuinely understood the meanings of texts would probably be well beyond passing the Turing Test!

Holmes stops far short of such an ambitious goal, instead using its meta-rules to transform syntactic parse trees in such a way that sentences that express identical meanings emerge with matching semantic structures. Meta-rules and the structures they generate, while heavily inspired by predicate logic, are not intended to correspond to any strict formal, logical or linguistic-theoretical representation: they just do whatever enables meanings to be matched effectively.
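The predication notation used above, give(child, dog, bone), maps naturally onto a small data structure. The following is a minimal sketch: the `Predication` class is hypothetical and is not part of Holmes, and both values are built by hand rather than derived from parsed text.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Predication:
    """A predicate linked to an ordered tuple of arguments."""
    predicate: str
    arguments: Tuple[str, ...]

    def __str__(self) -> str:
        return f"{self.predicate}({', '.join(self.arguments)})"


# Hand-built for 'The child gave the dog a bone'
active = Predication("give", ("child", "dog", "bone"))
# Hand-built for the paraphrase 'The dog was given a bone by the child'
passive = Predication("give", ("child", "dog", "bone"))

print(active)             # give(child, dog, bone)
print(active == passive)  # True
```

Once two sentences have been reduced to values of this kind, checking whether they express the same meaning is a plain equality test. The hard part, which the sketch skips, is deriving the structure from the text.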

Figure 3: Different sentences with a common semantic structure


