NLP & Text Processing
This session explores how traditional NLP techniques function as a vital foundation for modern agentic AI systems rather than being replaced by them.
It outlines a hybrid approach where deterministic tools like regular expressions, HTML parsers, and grammatical rules handle the initial cleaning and structuring of messy data. By using lexical processing and named entity recognition, developers can create high-speed, cost-effective pipelines that provide reliable inputs for large language models.
The session emphasizes that while LLMs excel at complex reasoning and synthesis, classical methods ensure data integrity and enforce business logic. Ultimately, the material advocates for an orchestrated architecture that combines the precision of symbolic programming with the flexible understanding of generative AI.
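The deterministic-first pipeline described above can be sketched in a few lines. This is a minimal illustration, not code from the session: the field names and patterns (invoice IDs, dollar amounts) are hypothetical examples of the kind of structured data a regex layer might extract and validate before any text is handed to an LLM.

```python
import re

# Hypothetical guardrail layer: extract structured fields deterministically
# so the LLM receives clean, validated inputs instead of raw text.
INVOICE_RE = re.compile(r"INV-\d{4}")          # e.g. "INV-2041"
AMOUNT_RE = re.compile(r"\$([\d,]+\.\d{2})")   # e.g. "$1,250.00"

def extract_fields(text: str) -> dict:
    """Pull invoice IDs and dollar amounts with plain regular expressions."""
    return {
        "invoice_ids": INVOICE_RE.findall(text),
        "amounts": [float(m.replace(",", "")) for m in AMOUNT_RE.findall(text)],
    }

fields = extract_fields("Invoice INV-2041 for $1,250.00 is overdue.")
# fields == {"invoice_ids": ["INV-2041"], "amounts": [1250.0]}
```

Because these patterns are exact and cheap to run, they can also serve as business-logic checks (e.g. rejecting a record with no invoice ID) before any model call is made.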
Why AI Agents Need Classical NLP
Presentation
- Classical NLP Guardrails for LLM Agents

Notebooks
- 01_From_Raw_Text_to_Structured_Inputs
- 02_Lexical_Processing
- 03_Grammars_and_Parsing
- 04_NER_and_POS_Tagging
- 05_Text_Classification_Sentiment_Topic_Modeling
Resources
| Package | Documentation | Description |
|---|---|---|
| re | re — Regular expression operations | Python standard library module for pattern matching with regular expressions. Used for extracting dates, amounts, emails, and IDs from text. |
| beautifulsoup4 | Beautiful Soup Documentation | HTML/XML parser for navigating, searching, and extracting content from web pages. Used to strip noise (nav, ads, scripts) and extract clean text. |
| nltk | NLTK Documentation | Comprehensive natural language processing library. Used across notebooks for tokenization, stemming, lemmatization, POS tagging, grammars, WordNet, stop words, and VADER sentiment. |
| nltk.tokenize | nltk.tokenize API | Word and sentence tokenizers (word_tokenize, sent_tokenize) that handle contractions, abbreviations, and punctuation correctly. |
| nltk.stem.PorterStemmer | nltk.stem API | Rule-based suffix-stripping stemmer. Fast and moderately aggressive. Used for keyword matching and alert triggers. |
| nltk.stem.SnowballStemmer | nltk.stem API | Improved Porter variant with multi-language support. |
| nltk.stem.LancasterStemmer | nltk.stem API | Aggressive stemmer that strips more suffixes than Porter or Snowball. |
| nltk.stem.WordNetLemmatizer | nltk.stem API | Dictionary-based lemmatizer that reduces words to valid base forms (e.g., “geese” → “goose”). Requires POS tags for best results. |
| nltk.corpus.wordnet | WordNet Interface | Lexical database of English providing synsets, synonyms, antonyms, hypernyms, and hyponyms. Used to build synonym lexicons for keyword expansion. |
| nltk.corpus.stopwords | NLTK Corpora | Curated lists of high-frequency, low-information words (179 English stop words) used to filter noise from text. |
| nltk.sentiment.SentimentIntensityAnalyzer (VADER) | VADER Sentiment | Lexicon-based sentiment analyzer tuned for social media. Returns compound, positive, neutral, and negative scores. No training required. |
| nltk.CFG / nltk.ChartParser | nltk.parse API | Context-Free Grammar definition and chart parsing. Used to build deterministic command interpreters with parse trees. |
| nltk.pos_tag | nltk.tag API | Penn Treebank POS tagger using the averaged perceptron model. Labels words as NNP, VBD, JJ, etc. |
| spacy | spaCy Documentation | Industrial-strength NLP library for tokenization, POS tagging, dependency parsing, NER, and lemmatization in a single pipeline call. |
| en_core_web_sm | spaCy English Models | Small English pipeline model for spaCy (~12 MB). Includes tok2vec, tagger, parser, NER, and lemmatizer. Install with python -m spacy download en_core_web_sm. |
| spacy.displacy | displaCy Visualizer | Built-in entity and dependency visualizer that renders inline in Jupyter notebooks. |
| scikit-learn (sklearn) | scikit-learn Documentation | Machine learning library. Used for TF-IDF vectorization, logistic regression classification, LDA topic modeling, and evaluation metrics. |
| sklearn.feature_extraction.text.TfidfVectorizer | TfidfVectorizer API | Converts text to TF-IDF feature matrices. Supports stop words, n-grams, and min/max document frequency thresholds. |
| sklearn.feature_extraction.text.CountVectorizer | CountVectorizer API | Converts text to raw word-count matrices (bag of words). Used as input for LDA topic modeling. |
| sklearn.linear_model.LogisticRegression | LogisticRegression API | Linear classifier for text classification. Supports multi-class, outputs probabilities, and has inspectable coefficients for explainability. |
| sklearn.decomposition.LatentDirichletAllocation | LDA API | Unsupervised topic model that discovers latent themes from a document-term matrix. |