NLP & Text Processing
This session explores how traditional NLP techniques function as a vital foundation for modern agentic AI systems rather than being replaced by them.
It outlines a hybrid approach where deterministic tools like regular expressions, HTML parsers, and grammatical rules handle the initial cleaning and structuring of messy data. By using lexical processing and named entity recognition, developers can create high-speed, cost-effective pipelines that provide reliable inputs for large language models.
The session emphasizes that while LLMs excel at complex reasoning and synthesis, classical methods ensure data integrity and enforce business logic. Ultimately, the material advocates for an orchestrated architecture that combines the precision of symbolic programming with the flexible understanding of generative AI.
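The deterministic-first pipeline described above can be sketched in a few lines. This is a minimal illustration, not code from the session: the field names and patterns (invoice IDs, dollar amounts) are hypothetical examples of the kind of structured data a regex layer might extract and validate before any text is handed to an LLM.

```python
import re

# Hypothetical guardrail layer: extract structured fields deterministically
# so the LLM receives clean, validated inputs instead of raw text.
INVOICE_RE = re.compile(r"INV-\d{4}")          # e.g. "INV-2041"
AMOUNT_RE = re.compile(r"\$([\d,]+\.\d{2})")   # e.g. "$1,250.00"

def extract_fields(text: str) -> dict:
    """Pull invoice IDs and dollar amounts with plain regular expressions."""
    return {
        "invoice_ids": INVOICE_RE.findall(text),
        "amounts": [float(m.replace(",", "")) for m in AMOUNT_RE.findall(text)],
    }

fields = extract_fields("Invoice INV-2041 for $1,250.00 is overdue.")
# fields == {"invoice_ids": ["INV-2041"], "amounts": [1250.0]}
```

Because these patterns are exact and cheap to run, they can also serve as business-logic checks (e.g. rejecting a record with no invoice ID) before any model call is made.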
Why AI Agents Need Classical NLP
Presentation
- Classical NLP Guardrails for LLM Agents

Notebooks
- 01_From_Raw_Text_to_Structured_Inputs
- 02_Lexical_Processing
- 03_Grammars_and_Parsing
- 04_NER_and_POS_Tagging
- 05_Text_Classification_Sentiment_Topic_Modeling
Resources
| Package | Documentation | Description |
|---|---|---|
| re | re — Regular expression operations | Python standard library module for pattern matching with regular expressions. Used for extracting dates, amounts, emails, and IDs from text. |
| beautifulsoup4 | Beautiful Soup Documentation | HTML/XML parser for navigating, searching, and extracting content from web pages. Used to strip noise (nav, ads, scripts) and extract clean text. |
| nltk | NLTK Documentation | Comprehensive natural language processing library. Used across notebooks for tokenization, stemming, lemmatization, POS tagging, grammars, WordNet, stop words, and VADER sentiment. |
| nltk.tokenize | nltk.tokenize API | Word and sentence tokenizers (word_tokenize, sent_tokenize) that handle contractions, abbreviations, and punctuation correctly. |
| nltk.stem.PorterStemmer | nltk.stem API | Rule-based suffix-stripping stemmer. Fast and moderately aggressive. Used for keyword matching and alert triggers. |
| nltk.stem.SnowballStemmer | nltk.stem API | Improved Porter variant with multi-language support. |
| nltk.stem.LancasterStemmer | nltk.stem API | Aggressive stemmer that strips more suffixes than Porter or Snowball. |
| nltk.stem.WordNetLemmatizer | nltk.stem API | Dictionary-based lemmatizer that reduces words to valid base forms (e.g., “geese” → “goose”). Requires POS tags for best results. |
| nltk.corpus.wordnet | WordNet Interface | Lexical database of English providing synsets, synonyms, antonyms, hypernyms, and hyponyms. Used to build synonym lexicons for keyword expansion. |
| nltk.corpus.stopwords | NLTK Corpora | Curated lists of high-frequency, low-information words (179 English stop words) used to filter noise from text. |
| nltk.sentiment.SentimentIntensityAnalyzer (VADER) | VADER Sentiment | Lexicon-based sentiment analyzer tuned for social media. Returns compound, positive, neutral, and negative scores. No training required. |
| nltk.CFG / nltk.ChartParser | nltk.parse API | Context-Free Grammar definition and chart parsing. Used to build deterministic command interpreters with parse trees. |
| nltk.pos_tag | nltk.tag API | Penn Treebank POS tagger using the averaged perceptron model. Labels words as NNP, VBD, JJ, etc. |
| spacy | spaCy Documentation | Industrial-strength NLP library for tokenization, POS tagging, dependency parsing, NER, and lemmatization in a single pipeline call. |
| en_core_web_sm | spaCy English Models | Small English pipeline model for spaCy (~12 MB). Includes tok2vec, tagger, parser, NER, and lemmatizer. Install with python -m spacy download en_core_web_sm. |
| spacy.displacy | displaCy Visualizer | Built-in entity and dependency visualizer that renders inline in Jupyter notebooks. |
| scikit-learn (sklearn) | scikit-learn Documentation | Machine learning library. Used for TF-IDF vectorization, logistic regression classification, LDA topic modeling, and evaluation metrics. |
| sklearn.feature_extraction.text.TfidfVectorizer | TfidfVectorizer API | Converts text to TF-IDF feature matrices. Supports stop words, n-grams, and min/max document frequency thresholds. |
| sklearn.feature_extraction.text.CountVectorizer | CountVectorizer API | Converts text to raw word-count matrices (bag of words). Used as input for LDA topic modeling. |
| sklearn.linear_model.LogisticRegression | LogisticRegression API | Linear classifier for text classification. Supports multi-class, outputs probabilities, and has inspectable coefficients for explainability. |
| sklearn.decomposition.LatentDirichletAllocation | LDA API | Unsupervised topic model that discovers latent themes from a document-term matrix. |