NLP & Text Processing

This session explores how traditional NLP techniques function as a vital foundation for modern agentic AI systems rather than being replaced by them.

It outlines a hybrid approach where deterministic tools like regular expressions, HTML parsers, and grammatical rules handle the initial cleaning and structuring of messy data. By using lexical processing and named entity recognition, developers can create high-speed, cost-effective pipelines that provide reliable inputs for large language models.
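A minimal sketch of that kind of deterministic pre-processing, assuming a hypothetical invoice snippet (the HTML, patterns, and field names here are illustrative, not taken from the session materials):

```python
import re
from bs4 import BeautifulSoup

HTML = """
<html><body>
<nav>Home | About</nav>
<script>track();</script>
<p>Invoice INV-2024-0042 for $1,250.00 is due 2024-07-15.
Contact billing@example.com with questions.</p>
</body></html>
"""

def clean_html(html: str) -> str:
    """Strip nav/script/style noise and return the visible text."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["nav", "script", "style"]):
        tag.decompose()  # remove the tag and its contents in place
    return soup.get_text(" ", strip=True)

def extract_fields(text: str) -> dict:
    """Pull structured fields with deterministic regex patterns."""
    return {
        "invoice_id": re.findall(r"\bINV-\d{4}-\d{4}\b", text),
        "amounts": re.findall(r"\$[\d,]+\.\d{2}", text),
        "dates": re.findall(r"\b\d{4}-\d{2}-\d{2}\b", text),
        "emails": re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text),
    }

text = clean_html(HTML)
fields = extract_fields(text)
```

The LLM never sees the raw HTML: the parser and patterns produce clean, predictable inputs first, at negligible cost.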

The session emphasizes that while LLMs excel at complex reasoning and synthesis, classical methods ensure data integrity and enforce business logic. Ultimately, the material advocates for an orchestrated architecture that combines the precision of symbolic programming with the flexible understanding of generative AI.

Why AI Agents Need Classical NLP
Classical NLP Guardrails for LLM Agents

Presentation

Notebooks

Resources

| Package | Documentation | Description |
| --- | --- | --- |
| `re` | re — Regular expression operations | Python standard library module for pattern matching with regular expressions. Used for extracting dates, amounts, emails, and IDs from text. |
| `beautifulsoup4` | Beautiful Soup Documentation | HTML/XML parser for navigating, searching, and extracting content from web pages. Used to strip noise (nav, ads, scripts) and extract clean text. |
| `nltk` | NLTK Documentation | Comprehensive natural language processing library. Used across notebooks for tokenization, stemming, lemmatization, POS tagging, grammars, WordNet, stop words, and VADER sentiment. |
| `nltk.tokenize` | nltk.tokenize API | Word and sentence tokenizers (`word_tokenize`, `sent_tokenize`) that handle contractions, abbreviations, and punctuation correctly. |
| `nltk.stem.PorterStemmer` | nltk.stem API | Rule-based suffix-stripping stemmer. Fast, moderately aggressive. Used for keyword matching and alert triggers. |
| `nltk.stem.SnowballStemmer` | nltk.stem API | Improved Porter variant with multi-language support. |
| `nltk.stem.LancasterStemmer` | nltk.stem API | Aggressive stemmer that strips more suffixes than Porter or Snowball. |
| `nltk.stem.WordNetLemmatizer` | nltk.stem API | Dictionary-based lemmatizer that reduces words to valid base forms (e.g., "geese" → "goose"). Requires POS tags for best results. |
| `nltk.corpus.wordnet` | WordNet Interface | Lexical database of English providing synsets, synonyms, antonyms, hypernyms, and hyponyms. Used to build synonym lexicons for keyword expansion. |
| `nltk.corpus.stopwords` | NLTK Corpora | Curated lists of high-frequency, low-information words (179 English stop words) used to filter noise from text. |
| `nltk.sentiment.SentimentIntensityAnalyzer` (VADER) | VADER Sentiment | Lexicon-based sentiment analyzer tuned for social media. Returns compound, positive, neutral, and negative scores. No training required. |
| `nltk.CFG` / `nltk.ChartParser` | nltk.parse API | Context-free grammar definition and chart parsing. Used to build deterministic command interpreters with parse trees. |
| `nltk.pos_tag` | nltk.tag API | Penn Treebank POS tagger using the averaged perceptron model. Labels words as NNP, VBD, JJ, etc. |
| `spacy` | spaCy Documentation | Industrial-strength NLP library for tokenization, POS tagging, dependency parsing, NER, and lemmatization in a single pipeline call. |
| `en_core_web_sm` | spaCy English Models | Small English pipeline model for spaCy (~12 MB). Includes tok2vec, tagger, parser, NER, and lemmatizer. Install with `python -m spacy download en_core_web_sm`. |
| `spacy.displacy` | displaCy Visualizer | Built-in entity and dependency visualizer that renders inline in Jupyter notebooks. |
| `scikit-learn` (`sklearn`) | scikit-learn Documentation | Machine learning library. Used for TF-IDF vectorization, logistic regression classification, LDA topic modeling, and evaluation metrics. |
| `sklearn.feature_extraction.text.TfidfVectorizer` | TfidfVectorizer API | Converts text to TF-IDF feature matrices. Supports stop words, n-grams, and min/max document-frequency thresholds. |
| `sklearn.feature_extraction.text.CountVectorizer` | CountVectorizer API | Converts text to raw word-count matrices (bag of words). Used as input for LDA topic modeling. |
| `sklearn.linear_model.LogisticRegression` | LogisticRegression API | Linear classifier for text classification. Supports multi-class, outputs probabilities, and has inspectable coefficients for explainability. |
| `sklearn.decomposition.LatentDirichletAllocation` | LDA API | Unsupervised topic model that discovers latent themes from a document-term matrix. |
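The three NLTK stemmers listed above differ mainly in how aggressively they strip suffixes. A quick side-by-side comparison (all three ship with `nltk` and need no corpus downloads; the word list is illustrative):

```python
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

# Three rule-based stemmers, roughly in order of increasing aggressiveness.
porter = PorterStemmer()
snowball = SnowballStemmer("english")
lancaster = LancasterStemmer()

words = ["running", "flies", "maximum", "organization"]
for w in words:
    # Each stemmer applies its own suffix-stripping rules; outputs
    # need not be dictionary words (that is the lemmatizer's job).
    print(f"{w:14} {porter.stem(w):12} {snowball.stem(w):12} {lancaster.stem(w)}")
```

Because stemmer output is deterministic, it works well for keyword matching and alert triggers; when a valid dictionary form matters, the table's `WordNetLemmatizer` (which additionally requires the WordNet corpus download) is the better fit.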
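The deterministic command interpreters mentioned for `nltk.CFG` / `nltk.ChartParser` can be sketched with a toy grammar (the command vocabulary here is invented for the example):

```python
import nltk

# A tiny context-free grammar for a command language.
grammar = nltk.CFG.fromstring("""
    S   -> V NP
    V   -> 'show' | 'delete'
    NP  -> Det N
    Det -> 'the' | 'all'
    N   -> 'invoices' | 'alerts'
""")

parser = nltk.ChartParser(grammar)

# A well-formed command yields an explicit parse tree.
trees = list(parser.parse("show all invoices".split()))
print(trees[0])

# Out-of-grammar input fails loudly instead of being guessed at.
try:
    list(parser.parse("delete everything".split()))
except ValueError:
    print("rejected: token not covered by the grammar")
```

This is the guardrail pattern in miniature: the grammar either produces an unambiguous parse tree or refuses the input, so an agent's action layer never executes a command the symbolic layer could not fully account for.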
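The `TfidfVectorizer` and `LogisticRegression` rows combine naturally into a small, explainable text classifier. A sketch with an invented six-document training set (labels and texts are hypothetical, chosen only to make the example self-contained):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus: which messages should page someone?
texts = [
    "urgent: server down, customers affected",
    "outage in production, please escalate",
    "critical failure in payment service",
    "lunch menu for friday attached",
    "monthly newsletter and team photos",
    "reminder: holiday party next week",
]
labels = ["alert", "alert", "alert", "routine", "routine", "routine"]

# TF-IDF features feeding a linear classifier in one pipeline.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

prediction = clf.predict(["production outage, escalate now"])[0]
print(prediction)
```

Unlike an LLM call, the fitted model's coefficients (`clf.named_steps["logisticregression"].coef_`) can be inspected to see exactly which terms drive each decision, which is the explainability point the table makes.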