MSA 8700 — Module 8: NLP and Text Processing
A deal intelligence agent monitors news feeds, emails, and web pages to keep a venture capital firm informed about competitors, acquisitions, and market moves.
Every day, hundreds of raw documents arrive — HTML pages, plain-text emails, and press releases. The firm wants structured, actionable intelligence, not piles of unread text.
Each step maps to a technique we studied in this module.
Raw HTML / Email
↓
① Beautiful Soup + Regex → clean text + structured fields
↓
② Tokenization + Lemmatization → normalized tokens
↓
③ POS Tagging + NER (spaCy) → entities + descriptors
↓
④ TF-IDF Classifier → topic routing
VADER Sentiment → sentiment score
↓
⑤ LDA Topic Model → emerging themes
↓
⑥ LLM Reasoning → narrative summary + recommendations
Imagine the agent fetches this HTML from a news site:
<html>
<body>
<nav><a href="/">Home</a> | <a href="/news">News</a></nav>
<div class="ad">Buy premium analytics tools!</div>
<article>
<h1>Acme Corp Acquires DataFlow for $240M</h1>
<p class="meta">Published 2026-03-01 | Business News</p>
<p>Acme Corp announced on March 1, 2026 that it would
acquire DataFlow Inc., a San Francisco-based data
startup, for $240 million in cash.</p>
<p>CEO Jane Rivera said the deal strengthens Acme's
position in the real-time analytics market.</p>
</article>
<script>trackPageView();</script>
</body>
</html>
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

# Remove noise: nav, ads, scripts
for tag in soup.find_all(["nav", "script"]):
    tag.decompose()
for ad in soup.find_all("div", class_="ad"):
    ad.decompose()

# Extract the article content
article = soup.find("article")
title = article.find("h1").get_text(strip=True)
paragraphs = [p.get_text(strip=True)
              for p in article.find_all("p")]
Result: Clean title and paragraphs — no HTML, no ads, no scripts.
import re
full_text = " ".join(paragraphs)
# Extract date (ISO format)
dates = re.findall(r"\d{4}-\d{2}-\d{2}", full_text)
# Extract dollar amounts
amounts = re.findall(r"\$[\d,]+(?:\.\d{2})?\s*(?:million|billion)?",
full_text, re.IGNORECASE)
# Extract the article's publication date from meta
meta_text = article.find("p", class_="meta").get_text()
pub_date = re.search(r"\d{4}-\d{2}-\d{2}", meta_text).group()
Result: dates = ["2026-03-01"], amounts = ["$240 million"]
| What We Did | Why |
|---|---|
| Beautiful Soup removed noise | Saves LLM tokens, improves focus |
| Regex extracted dates + amounts | Reliable on known, fixed formats |
| Deterministic, microseconds | No API cost, no hallucination |
Rule: Extract structured fields deterministically. Reserve the LLM for reasoning.
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.corpus import stopwords, wordnet

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def get_wordnet_pos(tag):
    # Map a Penn Treebank tag to a WordNet POS constant
    if tag.startswith('J'):
        return wordnet.ADJ
    if tag.startswith('V'):
        return wordnet.VERB
    if tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

tokens = word_tokenize(full_text.lower())
tagged = pos_tag(tokens)

# Lemmatize with the correct POS, drop stop words and non-alphabetic tokens
clean_tokens = []
for word, tag in tagged:
    if word.isalpha() and word not in stop_words:
        clean_tokens.append(lemmatizer.lemmatize(word, get_wordnet_pos(tag)))
| Raw Token | Lemma |
|---|---|
| acquired | acquire |
| strengthens | strengthen |
| announced | announce |
| analytics | analytics |
Every inflected form normalizes to the lemma acquire, so a single lemma query catches all inflections.

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(full_text)
entities = {}
for ent in doc.ents:
    entities.setdefault(ent.label_, []).append(ent.text)
Extracted entities:
| Type | Entities |
|---|---|
| ORG | Acme Corp, DataFlow Inc. |
| PERSON | Jane Rivera |
| GPE | San Francisco |
| MONEY | $240 million |
| DATE | March 1, 2026 |
# Regex found: $240 million
# spaCy found: $240 million ✓ Match
# Regex found date: 2026-03-01
# spaCy found date: March 1, 2026 ✓ Consistent
If classical tools and LLM disagree on critical fields → escalate to a human or retry.
This cross-validation pattern is a key safety mechanism in agentic systems.
# Extract adjective-noun pairs for market descriptors
descriptors = []
for i in range(len(doc) - 1):
    if doc[i].pos_ == 'ADJ' and doc[i+1].pos_ == 'NOUN':
        descriptors.append(f"{doc[i].text} {doc[i+1].text}")
# Result: ["real-time analytics"]
These descriptors characterize the market context of the deal — useful for the LLM reasoning step.
A pre-trained TF-IDF + Logistic Regression classifier routes the article:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
# Pre-trained on labeled articles
X_new = vectorizer.transform([full_text])
category = clf.predict(X_new)[0]
confidence = clf.predict_proba(X_new).max()
# Result: category = "M&A", confidence = 94%
Fast, free, deterministic — the classifier routes hundreds of articles per second.
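The `vectorizer` and `clf` objects above are assumed to be pre-trained. A minimal training sketch, using a toy labeled corpus as a stand-in for the firm's real training set:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy labeled corpus — stands in for the firm's real training data
train_texts = [
    "Acme announced deal to acquire startup for millions in cash",
    "Merger agreement signed, acquisition deal closes next quarter",
    "Quarterly revenue grows, earnings beat profit estimates",
    "Company reports strong quarterly earnings and revenue growth",
]
train_labels = ["M&A", "M&A", "Earnings", "Earnings"]

# Fit the TF-IDF vocabulary and the classifier together
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)

clf = LogisticRegression()
clf.fit(X_train, train_labels)

# Route a new article exactly as in the pipeline
X_new = vectorizer.transform(["deal to acquire a data startup for cash"])
category = clf.predict(X_new)[0]
confidence = clf.predict_proba(X_new).max()
```

In production the training corpus would be hundreds of hand-labeled articles; the routing call itself is identical to the snippet above.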
from nltk.sentiment import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()
scores = sia.polarity_scores(full_text)
# Result:
# compound = +0.72 → POSITIVE
# The language ("strengthens", "position") is positive
| Score | Value | Interpretation |
|---|---|---|
| Positive | 0.28 | Moderate positive language |
| Neutral | 0.72 | Most of the text is factual |
| Negative | 0.00 | No negative language |
| Compound | +0.72 | Overall positive |
# Agentic alert rules
if category == "M&A" and any(a for a in amounts
                             if "million" in a or "billion" in a):
    alert = "HIGH PRIORITY — M&A deal detected"
    if scores['compound'] < -0.3:
        alert += " — NEGATIVE SENTIMENT"
The classical NLP layer generates structured signals — topic, sentiment, priority — that feed into the LLM reasoning layer.
The classifier from Step ④ handles known categories (M&A, earnings, product launch).
But what about emerging themes nobody anticipated?
LDA runs over the full corpus of recent articles — unsupervised, no labels — and discovers latent topics.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
# Build document-term matrix from all recent articles
count_vec = CountVectorizer(stop_words='english',
min_df=2, max_df=0.9)
dtm = count_vec.fit_transform(all_articles)
# Fit LDA with 5 topics
lda = LatentDirichletAllocation(n_components=5,
random_state=42)
lda.fit(dtm)
| Topic | Top Words | Suggested Label |
|---|---|---|
| 1 | acquire, deal, million, startup, funding | M&A Activity |
| 2 | data, privacy, regulation, compliance, GDPR | Data Privacy |
| 3 | revenue, quarter, growth, earnings, profit | Earnings Reports |
| 4 | AI, model, research, launch, platform | AI Product Launches |
| 5 | layoff, restructure, cut, workforce, hire | Workforce Changes |
LDA discovered these without any labels — the topics emerge from word co-occurrence patterns in the corpus.
The LDA model gives you word lists. The LLM turns them into human-readable labels:
Prompt: "Given these top words for a discovered topic:
acquire, deal, million, startup, funding
Suggest a concise, descriptive label."
LLM Response: "M&A Activity and Startup Funding"
Unsupervised discovery (LDA) + natural language labeling (LLM) — each does what it’s best at.
After Steps ①–⑤, the agent has built this structured record:
{
"title": "Acme Corp Acquires DataFlow for $240M",
"pub_date": "2026-03-01",
"organizations": ["Acme Corp", "DataFlow Inc."],
"people": ["Jane Rivera"],
"locations": ["San Francisco"],
"amounts": ["$240 million"],
"dates": ["March 1, 2026"],
"category": "M&A",
"category_confidence": 0.94,
"sentiment_compound": 0.72,
"lda_topic": "M&A Activity",
"descriptors": ["real-time analytics"]
}
The LLM receives the structured record plus the clean text and reasons over it:
Prompt: "You are a deal intelligence analyst. Given:
- Acme Corp acquired DataFlow Inc. for $240M
- Location: San Francisco
- Sector: real-time analytics
- Sentiment: positive
- CEO Jane Rivera described it as strategic
1. How does this affect our portfolio company X?
2. Is Acme Corp now a competitor in our space?
3. What follow-up actions should we take?"
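The prompt above can be assembled mechanically from the structured record. A sketch, where the field names match the record shown earlier and the 0.05 cutoff is VADER's conventional positive threshold:

```python
# Subset of the structured record built in Steps ①–⑤
record = {
    "organizations": ["Acme Corp", "DataFlow Inc."],
    "locations": ["San Francisco"],
    "amounts": ["$240 million"],
    "descriptors": ["real-time analytics"],
    "sentiment_compound": 0.72,
}

# 0.05 is the conventional VADER threshold for "positive"
sentiment = "positive" if record["sentiment_compound"] > 0.05 else "neutral/negative"

prompt = (
    "You are a deal intelligence analyst. Given:\n"
    f"- {record['organizations'][0]} acquired {record['organizations'][1]} "
    f"for {record['amounts'][0]}\n"
    f"- Location: {record['locations'][0]}\n"
    f"- Sector: {record['descriptors'][0]}\n"
    f"- Sentiment: {sentiment}\n"
    "1. How does this affect our portfolio company X?\n"
    "2. Is Acme Corp now a competitor in our space?\n"
    "3. What follow-up actions should we take?"
)
```

Because every fact in the prompt comes from a deterministic extractor, the LLM reasons over verified inputs rather than re-extracting them.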
| Approach | Problem |
|---|---|
| Send raw HTML to LLM | Wastes tokens on ads, nav, scripts |
| Ask LLM to extract entities | May hallucinate amounts or dates |
| Ask LLM to classify 1000 articles | $100+ per run, minutes of latency |
| Our Approach | Benefit |
|---|---|
| Classical NLP extracts + classifies | Free, fast, deterministic |
| LLM reasons over structured data | Focused, accurate, cost-efficient |
The LLM sees clean, verified, structured input — not raw noise.
┌─────────────────────────────────────────────────┐
│ DEAL INTELLIGENCE AGENT │
├─────────────────────────────────────────────────┤
│ │
│ ① Beautiful Soup + Regex │
│ → Clean text, dates, amounts │
│ │
│ ② NLTK Tokenization + Lemmatization │
│ → Normalized tokens for downstream use │
│ │
│ ③ spaCy NER + POS Tagging │
│ → ORG, PERSON, GPE, MONEY, DATE │
│ → Adjective-noun descriptors │
│ → Cross-validation with regex │
│ │
│ ④ TF-IDF Classifier + VADER Sentiment │
│ → Topic routing + sentiment score │
│ │
│ ⑤ LDA Topic Model │
│ → Emerging theme discovery │
│ │
│ ⑥ LLM Reasoning │
│ → Narrative summary + recommendations │
│ │
└─────────────────────────────────────────────────┘
| Step | Technique | Role | Notebook |
|---|---|---|---|
| ① | Beautiful Soup, Regex | Deterministic extraction | 01 |
| ② | Tokenization, Lemmatization | Text normalization | 02 |
| ③ | spaCy NER, POS tagging | Entity + descriptor extraction | 04 |
| ④ | TF-IDF + LogReg, VADER | Classification + sentiment | 05 |
| ⑤ | LDA | Unsupervised topic discovery | 05 |
| ⑥ | LLM | Reasoning + generation | — |
Each technique handles what it does best. Together they form a system that is reliable, fast, cheap, and intelligent.
Use deterministic methods for what they do best. Use LLMs for what they do best.
Most agent tasks operate over large volumes — millions of log lines, reviews, or emails.
The classical NLP layer handles the bulk:
1,000 incoming articles per day
↓
Classical classifier routes 80% with high confidence
→ 400 M&A → 200 Earnings → 200 Product
↓
VADER flags 50 with negative sentiment
↓
Only 50 high-priority items sent to LLM
→ Narrative summaries
→ Risk assessments
→ Recommended actions
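The funnel above reduces to a simple routing function over the classical-NLP signals. The thresholds (0.8 confidence, −0.3 sentiment) are illustrative assumptions, not values from the module:

```python
def route(article):
    """Decide whether an article needs LLM attention.

    `article` carries precomputed classical-NLP signals:
    category, confidence, sentiment_compound.
    """
    if article["confidence"] < 0.8:
        return "llm"            # classifier unsure -> let the LLM decide
    if article["sentiment_compound"] < -0.3:
        return "llm"            # negative sentiment -> LLM risk assessment
    return article["category"]  # confident and non-negative -> classical route

articles = [
    {"category": "M&A", "confidence": 0.94, "sentiment_compound": 0.72},
    {"category": "Earnings", "confidence": 0.55, "sentiment_compound": 0.10},
    {"category": "Product", "confidence": 0.91, "sentiment_compound": -0.60},
]
routes = [route(a) for a in articles]
```

Only the uncertain or negative items reach the LLM; everything else is handled for free.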
| Approach | Articles/Day | LLM Calls | Est. Cost |
|---|---|---|---|
| Send everything to LLM | 1,000 | 1,000 | ~$50/day |
| Classical first pass | 1,000 | 50 | ~$2.50/day |
95% cost reduction — same intelligence quality on the items that matter.
The classical layer is free after training. The LLM budget is focused on high-value reasoning.
Regex extracted: $240 million
spaCy NER found: $240 million ✓ AGREE
Regex extracted: 2026-03-01
spaCy NER found: March 1, 2026 ✓ CONSISTENT
LLM summary says: March 2026 ✗ IMPRECISE
→ Flag for review
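The comparison above requires normalization first: the regex emits ISO dates while spaCy returns natural-language dates. A stdlib sketch of the agreement check:

```python
from datetime import datetime

def normalize_date(text):
    """Parse either ISO ('2026-03-01') or spaCy-style ('March 1, 2026') dates."""
    for fmt in ("%Y-%m-%d", "%B %d, %Y"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None  # unparseable -> treat as disagreement downstream

def dates_agree(regex_date, ner_date):
    a, b = normalize_date(regex_date), normalize_date(ner_date)
    return a is not None and a == b

# The agent flags a record for review only when the extractors disagree
agree = dates_agree("2026-03-01", "March 1, 2026")
```

The same pattern applies to money fields, with a normalizer that strips formatting and expands "million"/"billion" before comparing.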
If classical tools and the LLM disagree on critical fields, the agent can retry the extraction, fall back to the deterministic value, or escalate to a human reviewer.
Classical NLP outputs are deterministic for a fixed model: the same input always produces the same output, so extraction results are reproducible and testable.
LLM outputs are stochastic: the same prompt can produce different completions from run to run.
Combining both gives you reliability where it matters (extraction) and flexibility where it matters (reasoning).
We built a deal intelligence agent that processes raw news articles through a six-step pipeline.
Classical NLP and LLMs are not competitors — they are collaborators.
| Classical NLP | LLM |
|---|---|
| Fast (microseconds) | Slow (seconds) |
| Free at inference | Per-token cost |
| Deterministic | Stochastic |
| Structured extraction | Narrative reasoning |
| High volume | High value |
Use each for what it does best. Together, they form agent systems that are reliable, efficient, and intelligent.

