textacy is a Python library for performing a variety of natural language processing (NLP)
tasks, built on the high-performance spaCy library. With the fundamentals (tokenization,
part-of-speech tagging, dependency parsing, etc.) delegated to another library,
textacy focuses primarily on the tasks that come before and follow after.
features
- Access and extend spaCy’s core functionality for working with one or many documents through convenient methods and custom extensions
- Load prepared datasets with both text content and metadata, from Congressional speeches to historical literature to Reddit comments
- Clean, normalize, and explore raw text before processing it with spaCy
- Extract structured information from processed documents, including n-grams, entities, acronyms, keyterms, and SVO triples
- Compare strings and sequences using a variety of similarity metrics
- Tokenize and vectorize documents then train, interpret, and visualize topic models
- Compute text readability and lexical diversity statistics, including Flesch-Kincaid grade level, multilingual Flesch Reading Ease, and Type-Token Ratio
… and much more!
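To make two of the statistics named above concrete, here is a minimal plain-Python sketch of Type-Token Ratio and the Flesch-Kincaid grade level formula. It deliberately does not call textacy or spaCy: the helper names (`tokenize`, `type_token_ratio`, `flesch_kincaid_grade`) are hypothetical, the regex tokenizer is a crude stand-in for spaCy's, and the syllable count is passed in by hand rather than computed. Treat it as an illustration of the definitions, not as how textacy implements them.

```python
import re


def tokenize(text):
    """Lowercase the text and pull out word-like tokens (crude stand-in for spaCy)."""
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(tokens):
    """Type-Token Ratio: number of unique tokens divided by total tokens."""
    if not tokens:
        raise ValueError("need at least one token")
    return len(set(tokens)) / len(tokens)


def flesch_kincaid_grade(n_words, n_sentences, n_syllables):
    """Flesch-Kincaid grade level from pre-computed counts."""
    return 0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59


tokens = tokenize("The cat sat on the mat. The dog sat too.")
print(type_token_ratio(tokens))  # 7 unique tokens out of 10

# Counts supplied by hand for the same text: 10 words, 2 sentences, 10 syllables.
print(flesch_kincaid_grade(n_words=10, n_sentences=2, n_syllables=10))
```

Passing counts into `flesch_kincaid_grade` keeps the formula separate from sentence splitting and syllable counting, which are the language-dependent parts that a real pipeline would get from a processed document.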
maintainer
Howdy, y’all. 👋