Document Processing Pipeline with Python
This tutorial walks through two complementary approaches for transforming raw documents—PDFs, Word files, PowerPoints, images, and plain text—into structured, machine-readable formats suitable for AI systems such as Retrieval-Augmented Generation (RAG), knowledge bases, and document summarization engines.
Both implementations are available in the document-processing-pipeline repository[^1].
| Speaker | Text |
|---|---|
| Alex | You know, there’s this, uh, very specific kind of dread that I think every data scientist feels at some point. It’s usually like a Tuesday morning right after stand-up. You’ve just promised this beautiful shiny new RAG system. Everyone’s excited about the massive model you’re using, the new agentic workflow, and then you open |
| Sam | the data folder, |
| Alex | and then you open the data folder, and it’s not a CSV. It’s not clean JSON. It’s just a digital landfill. PDFs, scanned contracts from like 1998. |
| Sam | PowerPoints with nothing on the actual slides, a few corrupt Word |
| Alex | docs. Yeah. Welcome to the, uh, the unsexy reality of Enterprise AI. |
| Sam | It’s the part no one ever puts in the pitch deck, but it’s the part that determines if any of this actually works. |
| Alex | Exactly. We spend all this time on the brain, right, the LLM, and we totally forget about the eyes, the ingestion part. If your AI can’t read, it doesn’t matter how smart it is. |
| Sam | So that’s what we’re doing today. We’re going deep on that ground floor. Document processing. |
| Alex | Specifically through a Python lens, because let’s be honest, that’s where most RAG systems and agents are being built. |
| Sam | And we’ve got two very different philosophies to look at from the source material. There’s the fine-grained DIY approach and then there’s the unified framework approach, right? |
| Alex | And before we even get into the libraries, the PyMuPDF and the Unstructured, I just want to set the stage on why this is so hard. To a person, a PDF is a document. To a computer, it’s a disaster. |
| Sam | Well, it’s worse than a disaster. It’s, it’s kind of a lie. |
| Alex | A lie. |
| Sam | How so? A PDF was never meant to hold data. It was designed to hold visual instructions for a printer. The file is literally a script that says go to coordinate x y and put a tiny black mark here. |
| Alex | It doesn’t know that’s the letter A. It |
| Sam | has no idea, and it definitely doesn’t know that a big bold line of text at the top of the page is a header. To the machine, it’s all just pixels, |
| Alex | and that’s the core friction, isn’t it? We’re trying to pull semantic meaning out of a visual layout. |
| Sam | Exactly. |
| Alex | And if you get that wrong, I mean, walk us through the garbage bin scenario for a vector database. |
| Sam | Sure. So let’s say you just scrape all the text from a PDF, no layout analysis. You just grab everything: headers, footers, page numbers, the main content, all in one giant string. Then you chunk it. |
| Alex | OK, so now my vector database has a chunk that’s just the legal disclaimer from the footer, right? |
| Sam | And if that footer is on every page of a 500-page document, you’ve just polluted your vector space with 500 nearly identical, useless chunks. So when a user asks a real question like what’s the warranty policy, the retriever might pull up page 42 just because the footer text was a vague match and completely miss the real answer on page 12 because your chunking split the paragraph in half, and |
| Alex | then the LLM gets this context of page numbers and disclaimers and just hallucinates an answer. |
| Sam | Precisely. Your retrieval quality is capped by your parsing quality. You can’t prompt engineer your way out of bad data. OK, |
| Alex | so let’s fix it. The source material gives us two paths. Let’s start with what they call the fine-grained approach. This feels like the one for the control freak engineer. The DIY pipeline. It is. |
| Sam | This is for when you want to build a custom parser for each file type. You’re not trusting a black box. You’re picking the tools yourself. And for PDFs, the weapon of choice is almost always PyMuPDF. |
| Alex | I see import fitz everywhere. Why that one specifically? There are others. PyPDF2, |
| Sam | it’s all about performance and depth. PyMuPDF is just a Python binding for the MuPDF engine, which is written in C, so it’s ridiculously fast. I mean, if you’re processing 10 documents, who cares? But if you’re processing 50,000, the difference between a pure Python parser and a C-based one is, well, it’s the difference between running for hours versus days. OK. |
| Alex | Speed is great, but we need structure. Does fitz just give me a text string? It |
| Sam | can, but that’s the rookie move. The real power, and why it’s in this fine-grained category, is using its dictionary mode. get_text("dict"). Exactly. That doesn’t just give you the text, it gives you the metadata. The bounding box coordinates, the font name, the font size, the color for every single span of text. |
| Alex | OK, hold on, font size. Why? Why do I, as a data scientist, care about the font size? I’m not a designer. |
| Sam | Because it’s a proxy. It’s a proxy for semantic hierarchy. If you want to know where a new section starts, you can just write a rule. If font size is bigger than 12 pt and the font weight is bold, well, that’s a header. |
| Alex | I see. So you’re reverse engineering the document structure using its visual cues. |
| Sam | Exactly. You can write your own logic. Anything in the bottom 5% of the page’s Y coordinates is a footer. Just delete it. You’ve just solved the noise problem before it even gets close to your embedding model. That feels |
| Alex | powerful, but also really fragile. Like, if the marketing department changes the template next month, my whole parser breaks. |
| Sam | That is the trade-off. You get total precision, but you pay for it with maintenance. |
| Alex | OK, so what about other formats? The source mentions python-docx and python-pptx. PowerPoint is an interesting one. Most slides are just fluff, bullet points and stock photos. The |
| Sam | slides themselves are, yeah, but the source makes a great point here. The real gold in a PowerPoint for a RAG system is almost always in the speaker notes. The presenter’s script, right? The slide says Q3 goals. The speaker notes explain why those are the goals, what the risks are, the whole strategy. The library python-pptx lets you target those notes directly. If you just grab the slide text, you get nothing. You grab the notes, you get the actual |
| Alex | insight. OK, so we’ve got PDFs, we’ve got Office docs. But what about the true nightmare, the scanned document, the PDF that’s just an image of |
| Sam | text. For that you need eyes. You need OCR. So this pipeline integrates pytesseract and Pillow. The logic is pretty simple. It tries to extract text with fitz. If it gets back nothing, it assumes it’s an image, renders the page as a JPEG, and hands it off to Tesseract to read the pixels. A fallback, a crucial one, yeah. But here’s where this pipeline gets really interesting. It doesn’t just stop at extraction. The source describes integrating it directly with a local LLM using Ollama. So, the enrichment step. Yes, and for agentic workflows, this is huge. Imagine you’re ingesting invoices. You don’t just want the text, you want the vendor name, the total amount, the date, specific entities, right? So you can pipe the raw text from PyMuPDF right into a local model running on Ollama, like Llama 3 or Mistral. |
| Alex | But this is where I always get stuck. I ask the LLM for the vendor name and I get back a whole sentence. “Sure, I found the vendor name for you. It is Acme Corp.” I can’t put that in a database field. |
| Sam | Exactly. And this is why the source highlights structured outputs. It’s a feature that completely changes the game for data engineering. So what is it? In Ollama, you can pass a JSON schema along with your prompt. You literally tell the model, your output must be a JSON object with a key called vendor and another key called amount. And it |
| Alex | Just does it. It forces it. It |
| Sam | forces it at the decoding level. It’s not a suggestion in the prompt. It uses constrained decoding to literally mask out any token that would break the JSON syntax. It turns a chatty, probabilistic LLM into a deterministic data extraction tool. |
| Alex | That is huge. So you extract, clean, and then immediately structure into JSON before it’s even saved. |
| Sam | Right. Now you have a clean, queryable data set from the get-go. |
| Alex | OK, so that’s approach one. The DIY, high-control, heavy-maintenance approach. Let’s switch gears to approach two. The unified approach built around the Unstructured |
| Sam | library. This is a totally different philosophy. If DIY is like building the car engine yourself, Unstructured is just driving the car. The goal is to completely abstract away the file types. |
| Alex | So I don’t need an if statement for PDF, another for Word, another for HTML. |
| Sam | Nope, you just point the API at a file and it figures it out. But the real magic isn’t that it can open the file, it’s how it understands the content inside. It uses a concept called partitioning. |
| Alex | Partitioning. Walk me through that. |
| Sam | When Unstructured reads a document, it doesn’t see a string of text. It breaks it down into a list of typed elements. So it labels |
| Alex | them. |
| Sam | It labels them. It says, this block of text is a title. This part is narrative text. Oh, this looks like a table. This is a list |
| Alex | item. We’re back to that semantic understanding again. We are, |
| Sam | but instead of you writing the font size rules, Unstructured has pre-built models, some rules-based, some vision-based, to do that detection for you. |
| Alex | And why is that element distinction so critical for RAG? |
| Sam | Filtering. Remember the footer problem? With Unstructured, you can just say exclude elements: Header, Footer, and boom, they’re gone. You don’t have to write a single regular expression. You just filter by |
| Alex | type. That’s a massive time saver. But the real headline here, the thing that I think matters most for retrieval quality, is how it handles chunking. Oh, |
| Sam | absolutely. Chunking is the dark art of RAG. Everyone thinks it’s simple until their system can’t answer a basic question. |
| Alex | The naive approach is just cut the text every 500 characters, |
| Sam | which is awful. You’re slicing sentences in half. You’re breaking up ideas. Unstructured has this strategy called by-title chunking. It’s semantic chunking. |
| Alex | So how does it know where to make the cut? |
| Sam | It uses the hierarchy it found during partitioning. It knows that a title element probably means the topic is changing. So when it sees a new title, it starts a new chunk, period. Even if the last chunk was tiny. So it |
| Alex | won’t just merge the end of chapter one with the start of chapter two to fill up the character count. |
| Sam | Never. It respects the document’s own boundaries, and it creates these CompositeElement objects that wrap the text, but also carry metadata. So a paragraph knows, I belong to the section called Risk Factors on page 15. |
| Alex | That metadata is gold for an agentic system. An agent needs to be able to cite its sources, right? It can’t just make stuff up. It needs to say, according to Section 3.1, correct. |
| Sam | If you strip that metadata out during chunking, you’ve essentially lobotomized your agent. It has facts, but it has no provenance. Unstructured keeps that link intact. |
| Alex | It really sounds like the unified approach is the clear winner. It’s easier. It handles noise. It chunks intelligently. Is there a catch? |
| Sam | The catch is the infrastructure, and this brings us to a really unsexy but totally necessary topic, Docker. Right? |
| Alex | I noticed both approaches in the source material were containerized. |
| Sam | They have to be. And if you’ve ever tried to pip install this stuff on a Windows laptop or a bare metal server, you know why. It’s dependency hell. |
| Alex | It’s never just Python, is it? It’s |
| Sam | never just Python. Tesseract is C++, so you need libtesseract-dev. For PDF rendering, you need system libraries like Poppler or MuPDF. For image processing, you need libmagic and a bunch of GL libraries for OpenCV. pip doesn’t handle any of that. |
| Alex | So the whole it works on my machine problem is a |
| Sam | huge |
| Alex | risk. |
| Sam | It’s a guarantee of failure in production. You deploy your script to the cloud, forget to install Poppler, and the whole thing just crashes. That’s why Docker is so essential. Unstructured provides a pre-built Docker image. It’s huge, yes, several gigs, but it has every single dependency pre-installed and |
| Alex | configured. You’re trading disk space for |
| Sam | sanity. I will make that trade every single day of the week, and there’s isolation too. Document parsing is risky. You’re taking in files from the outside world. PDFs can have JavaScript. Word docs have macros. Do you really want to be parsing some random resume on your host OS? |
| Alex | No, you put it in a sandbox. |
| Sam | Exactly. Let the container blow up, not your server. |
| Alex | OK, so we’ve got the DIY approach for precision control and the unified approach for scale and semantic structure. How do we synthesize this? If I’m listening right now planning a sprint. How do I choose? |
| Sam | I think we can build a pretty simple decision matrix. Let’s do it. If you’re working with a very specific, very uniform set of documents, like you’re only processing invoices from one vendor, and they always look the same, go DIY. Use PyMuPDF. It’s fast. It’s lightweight. And you can tune your heuristics to be absolutely perfect for that one |
| Alex | layout. And if you need that immediate LLM enrichment |
| Sam | loop, right, that extract-to-JSON flow with Ollama is very easy to build in a custom pipeline. But |
| Alex | if you’re building a real enterprise RAG system, |
| Sam | if you’re building a knowledge base with thousands of different manuals, white papers, PowerPoints, you have to go with a unified approach like Unstructured. You simply cannot maintain custom parsers for 10 different file formats. It’s |
| Alex | impossible. And the chunking is the clincher. |
| Sam | It really is. That by-title semantic chunking will almost certainly give you better retrieval results out of the box than a custom script you could write in a week. Keeping Section 1 and Section 2 separate is just fundamental to not confusing the RAG system. |
| Alex | The source material also mentioned a hybrid approach, which I thought was pretty smart, |
| Sam | the best of both worlds idea. You use Unstructured for the heavy lifting, the partitioning, the smart semantic chunking, so you get these beautiful, clean, context-aware chunks. Then you feed those clean chunks to Ollama, |
| Alex | so not for cleaning, but for summarizing or |
| Sam | for tagging. You can ask the LLM, for this chunk, generate five keywords and a one-sentence summary. Then you embed the summary and the keywords right alongside the original text. It’s like a lightweight GraphRAG approach. You’re enriching the vector space with synthesized metadata. |
| Alex | That feels like the sweet spot. You’re not making the expensive LLM do the janitorial work of fixing formatting. You let the parser handle the layout and you let the LLM do the thinking. |
| Sam | Use the right tool for the job, geometric analysis for layout, a neural network for reasoning. Don’t get them mixed up. |
| Alex | It all comes back to that first idea, doesn’t it? The quality of your AI is defined by its ability to read. |
| Sam | It is. You know, we talk about hallucinations as if they’re some weird magical property of the model, but so often the model isn’t hallucinating. It’s just confused because we fed it a header, a footer, and half a sentence and asked it to explain quantum physics. |
| Alex | We set it up for failure. We |
| Sam | did. A PDF treated like a text file is noise. A PDF treated as a structured hierarchy of information, that’s knowledge. And that transformation happens in the parser, not the prompt. |
| Alex | So a final thought for everyone listening to take back to their IDE. Think about your vector database, the one running right now. How much garbage, how many headers, footers, fragmented sentences is sitting in there dragging your scores down just because you treated a PDF like a .txt file? And |
| Sam | how much smarter could your agent be tomorrow if it actually understood the structure of the documents it was trained on today? A little |
| Alex | bit of plumbing goes a very long way. Thanks for unpacking all this. |
| Sam | Anytime. It was fun. |
| Alex | We’ll see you on the next deep dive. |
Why Documents Need to Be Processed
Modern enterprises store vast amounts of knowledge in heterogeneous document formats—PDF reports, Word memos, PowerPoint decks, scanned images, and plain text files. These documents are designed for human consumption, with rich formatting, embedded images, multi-column layouts, headers, footers, and tables that serve a visual purpose but are largely meaningless to an AI model[^1].
Large Language Models and embedding models operate on plain text. They cannot natively interpret the binary structure of a .pdf or .docx file, nor can they understand that a bold heading signals a new section or that a table cell maps to a particular column header. Without preprocessing, feeding raw document bytes into an AI pipeline produces garbage—or nothing at all[^1].
Document processing solves this by:
- Extracting text from format-specific binary containers (PDF streams, OOXML archives, image pixels via OCR)[^1]
- Preserving metadata such as titles, authors, page counts, and creation dates that provide context for downstream analysis[^2]
- Structuring content into semantically meaningful units—paragraphs, headings, list items, tables—so retrieval systems can match user queries against coherent passages rather than arbitrary character windows[^7]
- Enabling AI analysis by producing clean text that can be summarized, classified, embedded into vector databases, or used for entity extraction[^1]
The document processing pipelines in this project showcase how to handle these heterogeneous document types and transform them into structured, machine-readable formats suitable for content analysis and indexing, RAG systems, knowledge base creation, document summarization, and information extraction[^1].
Approach 1: Simple PDF Extraction
The first pipeline, simple-pdf-extraction, takes a traditional, format-specific approach. It uses a dedicated Python library for each supported document type, giving you fine-grained control over the extraction process[^1]. The pipeline supports PDF documents, Microsoft Word (.docx), Microsoft PowerPoint (.pptx), plain text files (.txt), and images with OCR support (.jpg, .jpeg, .png, .tiff)[^2].
Extraction and Conversion Packages
The pipeline relies on five key packages, each handling a distinct extraction task[^1]:
PyMuPDF (fitz) is the workhorse for PDF processing. Traditionally imported in Python as fitz (recent releases also accept import pymupdf), PyMuPDF is a high-performance wrapper around the MuPDF rendering engine that extracts text, metadata (title, author, page count), and even image data from PDF files[^15]. It supports multiple extraction modes: plain text via page.get_text("text"), block-level extraction via page.get_text("blocks") that preserves spatial layout, and dictionary mode via page.get_text("dict") that includes font properties and bounding box coordinates[^9]. PyMuPDF is significantly faster than many alternatives because it operates at the C level through MuPDF rather than parsing PDF structure in pure Python[^15]. For best performance, open each document only once and prefer "text" mode unless you need coordinates or font data[^15].
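To illustrate why dictionary mode is worth its extra cost, here is a minimal sketch that labels spans as headings, footers, or body text from font size, the bold flag, and vertical position. It runs on a hand-written dictionary shaped like the output of page.get_text("dict"), so it needs no PDF. The function name classify_spans and the thresholds (12 pt, bottom 5% of the page) are illustrative assumptions, not values taken from the repository.

```python
BOLD_FLAG = 16  # bit 4 of a PyMuPDF span's "flags" field marks bold text


def classify_spans(page_dict, heading_size=12.0, footer_zone=0.05):
    """Label each text span as 'heading', 'footer', or 'body'.

    page_dict is assumed to have the shape returned by
    page.get_text("dict"): blocks -> lines -> spans.
    The thresholds are illustrative defaults, not library values.
    """
    # PyMuPDF's y axis grows downward, so the footer zone has the largest y.
    footer_top = page_dict["height"] * (1.0 - footer_zone)
    labeled = []
    for block in page_dict["blocks"]:
        if block.get("type") != 0:  # type 0 is text; image blocks are skipped
            continue
        for line in block["lines"]:
            for span in line["spans"]:
                text = span["text"].strip()
                if not text:
                    continue
                y_top = span["bbox"][1]  # bbox is (x0, y0, x1, y1)
                is_bold = bool(span["flags"] & BOLD_FLAG)
                if y_top >= footer_top:
                    label = "footer"
                elif span["size"] > heading_size and is_bold:
                    label = "heading"
                else:
                    label = "body"
                labeled.append((label, text))
    return labeled


# Hand-written sample standing in for page.get_text("dict") on a Letter page.
sample_page = {
    "width": 612.0,
    "height": 792.0,
    "blocks": [
        {"type": 0, "lines": [{"spans": [
            {"text": "Warranty Policy", "size": 16.0, "flags": 16,
             "bbox": (72, 72, 300, 90)}]}]},
        {"type": 0, "lines": [{"spans": [
            {"text": "Coverage lasts two years.", "size": 11.0, "flags": 0,
             "bbox": (72, 100, 400, 112)}]}]},
        {"type": 0, "lines": [{"spans": [
            {"text": "Confidential - Page 12", "size": 8.0, "flags": 0,
             "bbox": (72, 770, 200, 780)}]}]},
    ],
}

labeled_spans = classify_spans(sample_page)
# Drop footer noise before the text ever reaches chunking or embedding.
body_text = " ".join(text for label, text in labeled_spans if label != "footer")
```

Rules like these are precise for a known template and brittle across templates, which is the maintenance trade-off of the format-specific approach.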
python-docx handles Microsoft Word .docx files. Word documents are stored as ZIP archives containing XML files (the Office Open XML format). python-docx parses this structure and exposes paragraphs, tables, headers, styles, and document properties through a clean Python API[^1]. The pipeline uses it to iterate through document paragraphs and extract their text content along with formatting metadata.
python-pptx performs an analogous role for PowerPoint .pptx presentations. Presentations contain slides, and each slide contains shapes (text boxes, titles, content placeholders, tables, images). python-pptx traverses this hierarchy, extracting text from each shape and optionally from slide notes[^1]. This is essential because much of the valuable content in presentations resides in speaker notes rather than bullet points.
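A minimal sketch of pulling both slide text and speaker notes from a slide follows. The helper names slide_text_and_notes and extract_presentation are hypothetical and not taken from the repository. The attributes used (shapes, has_text_frame, text_frame, has_notes_slide, notes_slide.notes_text_frame) are python-pptx's, and the demo runs against stand-in objects so it works without a .pptx file.

```python
from types import SimpleNamespace


def slide_text_and_notes(slide):
    """Collect visible shape text and speaker notes from one slide.

    `slide` is expected to behave like a python-pptx Slide object.
    """
    shape_texts = [
        shape.text_frame.text.strip()
        for shape in slide.shapes
        if shape.has_text_frame and shape.text_frame.text.strip()
    ]
    notes = ""
    if slide.has_notes_slide:  # checking first avoids creating an empty notes slide
        frame = slide.notes_slide.notes_text_frame
        notes = frame.text.strip() if frame is not None else ""
    return {"slide_text": shape_texts, "notes": notes}


def extract_presentation(path):
    """Run the helper over a real .pptx file (requires python-pptx)."""
    from pptx import Presentation  # lazy import: only needed for real files

    return [slide_text_and_notes(slide) for slide in Presentation(path).slides]


# Stand-in objects that mimic the python-pptx attributes used above.
fake_slide = SimpleNamespace(
    shapes=[
        SimpleNamespace(has_text_frame=True,
                        text_frame=SimpleNamespace(text="Q3 Goals")),
        SimpleNamespace(has_text_frame=False),  # e.g. a picture shape
    ],
    has_notes_slide=True,
    notes_slide=SimpleNamespace(
        notes_text_frame=SimpleNamespace(
            text="We chose these goals because churn rose in Q2.")),
)

slide_record = slide_text_and_notes(fake_slide)
```

Keeping slide text and notes in separate fields lets a downstream step decide whether to index one, the other, or both.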
pytesseract + Pillow provide OCR (Optical Character Recognition) capabilities for scanned documents and images. pytesseract is a Python wrapper around Google’s Tesseract OCR engine[^1]. When the pipeline encounters an image file or a scanned PDF page that contains no extractable text layer, it renders the page to a bitmap, loads it with Pillow (the maintained fork of the Python Imaging Library), and then passes it to Tesseract for text recognition[^18]. Tesseract supports over 100 languages and can be configured with different page segmentation modes (--psm) and OCR engine modes (--oem) for optimal accuracy[^6].
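The fallback described above can be sketched as a small decision function: use the embedded text layer when there is one, and only pay for OCR when the page looks scanned. The names extract_page_text and tesseract_ocr, the 20-character threshold, and the "ocr" method label are assumptions for illustration; pytesseract.image_to_string is the real pytesseract call.

```python
def extract_page_text(text_layer, run_ocr, min_chars=20):
    """Return (text, method), preferring the embedded text layer.

    text_layer: string extracted directly from the page (may be empty).
    run_ocr:    zero-argument callable that performs OCR and returns a string;
                it is only invoked when the text layer looks empty.
    min_chars:  assumed threshold below which a page is treated as scanned.
    """
    cleaned = text_layer.strip()
    if len(cleaned) >= min_chars:
        return cleaned, "pdf_text_extraction"
    return run_ocr().strip(), "ocr"


def tesseract_ocr(image_path, lang="eng", psm=6):
    """OCR an image file (needs pytesseract, Pillow, and the tesseract binary)."""
    import pytesseract  # lazy imports keep the fallback logic dependency-free
    from PIL import Image

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image, lang=lang, config=f"--psm {psm}")


# Demo with a fake OCR callable so the logic can be exercised anywhere.
ocr_calls = []


def fake_ocr():
    ocr_calls.append("called")
    return "Scanned contract, 1998\n"


digital = extract_page_text("This page has a real text layer.", fake_ocr)
scanned = extract_page_text("  \n", fake_ocr)
```

In a real run, run_ocr would be something like lambda: tesseract_ocr("page.png"), with the image produced by whatever renderer the pipeline uses.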
requests serves as the HTTP client for communicating with the Ollama API. After text is extracted from a document, the pipeline sends it to an Ollama-hosted LLM for AI-powered analysis—generating summaries, identifying topics, classifying document types, and extracting entities and keywords[^1][^2].
Source Code Architecture
The simple-pdf-extraction/src/ directory contains four modules that form a clean separation of concerns[^2]:
- main.py: The pipeline entry point that orchestrates the overall workflow: scanning the input directory, dispatching documents to the extractor, sending extracted text to the processor, and writing JSON output
- extractor.py: Contains format-specific extraction logic, dispatching to PyMuPDF, python-docx, python-pptx, or pytesseract based on MIME type detection
- processor.py: Handles AI-powered content analysis by sending extracted text to Ollama and parsing structured responses
- ollama_lister.py: Utility for querying available models on the Ollama server
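The MIME-based dispatch in the extractor can be pictured roughly as a lookup table. This is a sketch under assumptions: it uses the standard library's extension-based mimetypes module, whereas the repository may detect types differently (for example by inspecting file contents). The extractor labels and the function name pick_extractor are hypothetical.

```python
import mimetypes

DOCX = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
PPTX = "application/vnd.openxmlformats-officedocument.presentationml.presentation"

# Register the Office types explicitly so the lookup does not depend on
# which MIME tables the host system happens to provide.
mimetypes.add_type(DOCX, ".docx")
mimetypes.add_type(PPTX, ".pptx")

# MIME type -> label of the extractor that should handle it (labels are illustrative).
EXTRACTORS = {
    "application/pdf": "pymupdf",
    DOCX: "python-docx",
    PPTX: "python-pptx",
    "text/plain": "plain_text",
    "image/jpeg": "pytesseract",
    "image/png": "pytesseract",
    "image/tiff": "pytesseract",
}


def pick_extractor(filename):
    """Return the extractor label for a file, or 'unsupported'."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    return EXTRACTORS.get(mime_type, "unsupported")


choices = {name: pick_extractor(name)
           for name in ("report.pdf", "memo.docx", "deck.pptx", "scan.tiff", "archive.zip")}
```

A table like this keeps the per-format branching in one place, so adding a format means adding one entry and one extractor function.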
Output Structure
The pipeline produces one JSON file per input document in data/output/simple-pdf-extraction/, containing the full extracted content, document metadata, extraction method used, processing timestamp, word count, and AI analysis results[^2]. A sample output structure looks like:
{
  "filename": "document.pdf",
  "filepath": "data/input/document.pdf",
  "mime_type": "application/pdf",
  "content": "Extracted text content...",
  "metadata": {
    "title": "Document Title",
    "author": "Author Name",
    "page_count": 10
  },
  "extraction_method": "pdf_text_extraction",
  "text_length": 5000,
  "word_count": 800,
  "ai_analysis": {
    "summary": "Document summary...",
    "topics": ["topic1", "topic2"],
    "document_type": "report",
    "language": "english",
    "confidence": 0.95
  },
  "entities": ["Entity1", "Entity2"],
  "keywords": ["keyword1", "keyword2"]
}
After all documents are processed, a processing_report.json is generated with aggregate statistics including total files processed, success/failure counts, document type distribution, and average word counts[^2].
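As a rough picture of how such a report could be aggregated from the per-document results, consider the sketch below. The report field names, the convention that a failed document carries an "error" key, and the function name build_report are assumptions; only mime_type and word_count come from the sample output above.

```python
from collections import Counter


def build_report(results):
    """Aggregate per-document result dicts into summary statistics.

    A result containing an "error" key is counted as a failure
    (an assumed convention for this sketch).
    """
    succeeded = [r for r in results if "error" not in r]
    word_counts = [r["word_count"] for r in succeeded]
    return {
        "total_files": len(results),
        "successful": len(succeeded),
        "failed": len(results) - len(succeeded),
        "document_types": dict(Counter(r["mime_type"] for r in succeeded)),
        "average_word_count": (
            sum(word_counts) / len(word_counts) if word_counts else 0
        ),
    }


sample_results = [
    {"filename": "a.pdf", "mime_type": "application/pdf", "word_count": 800},
    {"filename": "b.pdf", "mime_type": "application/pdf", "word_count": 1200},
    {"filename": "c.png", "mime_type": "image/png", "word_count": 100},
    {"filename": "broken.docx", "error": "could not open file"},
]

report = build_report(sample_results)
```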
Connecting to LLMs via Ollama
The simple-pdf-extraction pipeline integrates with LLMs through Ollama, an open-source tool for running language models locally or connecting to cloud-hosted inference endpoints[^1][^2]. This architecture provides flexibility: users with GPU hardware (NVIDIA CUDA, Apple Metal) can run models locally for data privacy, while others can point to a remote Ollama-compatible API endpoint for cloud-hosted inference.
Configuration
Connection is configured through environment variables in a .env file placed in the simple-pdf-extraction directory[^1][^2]:
# Ollama Configuration
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_API_KEY=your-api-key-if-needed
OLLAMA_GENERATION_MODEL_NAME=gpt-oss20b-cloud
OLLAMA_EMBEDDING_MODEL_NAME=mxbai-embed-large
# Logging
LOG_LEVEL=INFO
| Variable | Required | Default | Description |
|---|---|---|---|
| OLLAMA_BASE_URL | No | http://localhost:11434 | URL of your Ollama instance[^2] |
| OLLAMA_API_KEY | Yes (if using Ollama) | (none) | API key for authentication[^2] |
| OLLAMA_GENERATION_MODEL | No | gpt-oss20b-cloud | LLM for chat/generation tasks[^2] |
| OLLAMA_EMBEDDING_MODEL | No | mxbai-embed-large | Model for vectorization[^2] |
The OLLAMA_BASE_URL can point to localhost for a local Ollama installation, a remote server on your network, or a cloud endpoint that exposes an Ollama-compatible API[^2]. This means the same pipeline code works whether you’re running a quantized 7B model on a laptop GPU or routing requests to a hosted inference service.
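A minimal sketch of reading this configuration with defaults is shown below. The variable names follow the .env listing above. The OllamaSettings class and load_settings function are hypothetical, and the repository may load its settings differently (for example through a dotenv library).

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OllamaSettings:
    base_url: str
    api_key: Optional[str]
    generation_model: str
    embedding_model: str


def load_settings(env=None):
    """Build settings from a mapping of environment variables.

    Passing a plain dict instead of os.environ makes the function easy to test.
    """
    env = os.environ if env is None else env
    return OllamaSettings(
        # Strip a trailing slash so f"{base_url}/api/chat" never doubles it.
        base_url=env.get("OLLAMA_BASE_URL", "http://localhost:11434").rstrip("/"),
        api_key=env.get("OLLAMA_API_KEY") or None,
        generation_model=env.get("OLLAMA_GENERATION_MODEL_NAME", "gpt-oss20b-cloud"),
        embedding_model=env.get("OLLAMA_EMBEDDING_MODEL_NAME", "mxbai-embed-large"),
    )


local = load_settings({})
remote = load_settings({"OLLAMA_BASE_URL": "http://gpu-box:11434/",
                        "OLLAMA_API_KEY": "secret"})
```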
Structured Outputs
A key capability that Ollama provides is structured outputs—the ability to constrain a model’s response to conform to a specific JSON schema[^8][^14]. Instead of parsing free-form natural language responses (which is error-prone), the pipeline can request that Ollama return results in a predefined structure by passing a JSON schema to the format parameter of the API call[^14].
For example, to extract structured analysis from a document, you can define a schema and pass it in the API request:
curl -X POST http://localhost:11434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss",
    "messages": [{"role": "user", "content": "Analyze this document..."}],
    "stream": false,
    "format": {
      "type": "object",
      "properties": {
        "summary": {"type": "string"},
        "topics": {"type": "array", "items": {"type": "string"}},
        "document_type": {"type": "string"},
        "confidence": {"type": "number"}
      },
      "required": ["summary", "topics"]
    }
  }'
This constrains the model’s output to valid JSON matching the specified schema, making downstream processing reliable and repeatable[^8][^14]. Under the hood, Ollama enforces the schema at generation time through constrained decoding; on the client side, the schema is often generated from a Pydantic model rather than written by hand[^5]. Use cases include parsing data from documents, extracting data from images, and structuring language model responses with more reliability and consistency than basic JSON mode[^8].
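The same request can be sketched in Python with requests, the HTTP client the pipeline already depends on. The helper names (build_chat_payload, parse_analysis, analyze) and the prompt wording are assumptions, not the repository's processor.py. The sketch also re-checks required keys after decoding, since a response cut short by a token limit may not be complete JSON.

```python
import json

ANALYSIS_SCHEMA = {
    "type": "object",
    "properties": {
        "summary": {"type": "string"},
        "topics": {"type": "array", "items": {"type": "string"}},
        "document_type": {"type": "string"},
        "confidence": {"type": "number"},
    },
    "required": ["summary", "topics"],
}


def build_chat_payload(text, model="gpt-oss"):
    """Build the /api/chat request body with the schema in "format"."""
    return {
        "model": model,
        "messages": [{"role": "user",
                      "content": f"Analyze this document:\n\n{text}"}],
        "stream": False,
        "format": ANALYSIS_SCHEMA,
    }


def parse_analysis(response_body):
    """Decode message.content (a JSON string) and check the required keys."""
    analysis = json.loads(response_body["message"]["content"])
    missing = [key for key in ANALYSIS_SCHEMA["required"] if key not in analysis]
    if missing:
        raise ValueError(f"response is missing required keys: {missing}")
    return analysis


def analyze(text, base_url="http://localhost:11434", timeout=120):
    """Send the request to a running Ollama server (requires requests)."""
    import requests  # lazy import: only needed for a live call

    response = requests.post(f"{base_url}/api/chat",
                             json=build_chat_payload(text), timeout=timeout)
    response.raise_for_status()
    return parse_analysis(response.json())


# Offline demo using a hand-written response in Ollama's non-streaming shape.
payload = build_chat_payload("Invoice 1042 from Acme Corp")
fake_response = {"message": {"role": "assistant",
                             "content": '{"summary": "An invoice.", "topics": ["billing"]}'}}
analysis = parse_analysis(fake_response)
```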
Approach 2: Chunking with Unstructured.io
The second pipeline, simple-chunking-with-unstructured, takes a fundamentally different approach using the Unstructured library (unstructured[all-docs])[^1]. Rather than using format-specific libraries and writing custom extraction logic for each document type, Unstructured provides a unified API that handles PDFs, Word, PowerPoint, HTML, images, and more through a single interface[^1].
Partitioning: From Documents to Elements
The core concept in Unstructured is partitioning—the process of breaking a document down into a list of typed Element objects that represent the semantic components of the source file[^7][^4]. Instead of treating a document as a wall of plain text, Unstructured preserves the document’s semantic structure, giving you control over how each component is used downstream[^7].
When you call a partition function (e.g., partition_pdf, partition_docx, or the universal partition), Unstructured analyzes the document and returns a list of elements, each with a type, a unique element_id, the extracted text, and rich metadata[^7]. Here is an example element:
{
  "type": "NarrativeText",
  "element_id": "5ef1d1117721f0472c1ad825991d7d37",
  "text": "The Unstructured API documentation covers the following API services:",
  "metadata": {
    "last_modified": "2024-05-01T14:15:22",
    "page_number": 1,
    "languages": ["eng"],
    "parent_id": "56f24319ae258b735cac3ec2a271b1d9",
    "filename": "Unstructured.html",
    "filetype": "text/html"
  }
}
Element Types
Unstructured defines a rich taxonomy of element types that capture the semantic role of each piece of content[^7]:
| Element Type | Description |
|---|---|
| Title | Text element for capturing headings and titles[^7] |
| NarrativeText | Multiple well-formulated sentences; excludes headers, footers, captions[^7] |
| ListItem | A NarrativeText element that is part of a list[^7] |
| Table | Captured with raw text and optional HTML representation in metadata[^7] |
| Image | Image metadata and optionally Base64-encoded image data[^7] |
| Header / Footer | Document headers and footers[^7] |
| FigureCaption | Text associated with figure captions[^7] |
| Formula | Mathematical formulas in the document[^7] |
| Address / EmailAddress | Physical and email addresses[^7] |
| CodeSnippet | Code blocks within documents[^7] |
| PageBreak / PageNumber | Page structure markers[^7] |
| UncategorizedText | Free text not matching other categories[^7] |
| CompositeElement | Produced only by chunking; combines sequential elements into a single chunk[^7] |
This typed element system is powerful because it lets you filter and route content based on its semantic role. For instance, if you’re building a summarization pipeline, you might only process NarrativeText elements while ignoring Header, Footer, and PageNumber elements[^7]. Each element also carries a parent_id that establishes hierarchy—a NarrativeText element might have a Title as its parent, enabling reconstruction of the document’s outline[^7].
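Filtering by type and rebuilding a simple outline from parent_id can be sketched on plain dictionaries shaped like the example element above. The function names and the choice of which types count as noise are assumptions for illustration. With the library itself, you would work on Element objects or their dictionary form instead of hand-written data.

```python
NOISE_TYPES = {"Header", "Footer", "PageNumber", "PageBreak"}


def drop_noise(elements):
    """Keep only elements whose type is not page furniture."""
    return [el for el in elements if el["type"] not in NOISE_TYPES]


def group_under_titles(elements):
    """Map each Title's text to the texts of elements whose parent_id points at it."""
    titles = {el["element_id"]: el["text"]
              for el in elements if el["type"] == "Title"}
    sections = {text: [] for text in titles.values()}
    for el in elements:
        parent = el["metadata"].get("parent_id")
        if parent in titles:
            sections[titles[parent]].append(el["text"])
    return sections


# Hand-written elements in the same shape as the JSON example (IDs shortened).
elements = [
    {"type": "Header", "element_id": "h1",
     "text": "ACME Corp Confidential", "metadata": {}},
    {"type": "Title", "element_id": "t1",
     "text": "Risk Factors", "metadata": {}},
    {"type": "NarrativeText", "element_id": "n1",
     "text": "Supply chain delays may affect revenue.",
     "metadata": {"parent_id": "t1"}},
    {"type": "ListItem", "element_id": "l1",
     "text": "Currency fluctuations", "metadata": {"parent_id": "t1"}},
    {"type": "Footer", "element_id": "f1",
     "text": "Page 15", "metadata": {}},
]

clean = drop_noise(elements)
outline = group_under_titles(clean)
```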
Chunking Strategies
After partitioning, Unstructured can apply chunking to rearrange elements into appropriately sized passages for embedding models and retrieval systems[^19][^21]. The pipeline supports three modes via command-line arguments[^1]:
No chunking (default): Elements are output as-is from partitioning. Each element becomes its own unit. This is useful when you want maximum granularity.
Basic chunking (--basic): A simple size-based strategy that fills each chunk with whole elements up to a maximum character limit. When adding the next element would exceed --max-chunk-size, the current chunk is closed and a new one begins[^21].
By-title chunking (--by-title): A semantic strategy that preserves section boundaries. When a Title element is encountered, the current chunk is closed and a new section begins, even if there is room remaining in the current chunk[^19][^21]. This ensures that chunks respect the document’s topical structure—content from two different sections never bleeds into a single chunk. The --max-chunk-overlap parameter enables a sliding window where trailing characters from one chunk are prepended to the next, providing context continuity for retrieval[^19].
Both strategies respect the --max-chunk-size hard limit. When chunking is applied, individual elements are combined into CompositeElement objects that preserve references to their original constituent elements via the orig_elements metadata field[^7].
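To make the by-title rule concrete, here is a deliberately simplified re-implementation on plain dictionaries. It is not Unstructured's code: it ignores overlap and does not split a single element that is longer than the limit, both of which the library handles. The function name sketch_chunk_by_title is hypothetical.

```python
def sketch_chunk_by_title(elements, max_chunk_size=500):
    """Group element texts into chunks, starting a new chunk at every Title.

    Simplifications: no overlap, and an element longer than max_chunk_size
    becomes its own oversized chunk instead of being split.
    """
    chunks = []
    current = []  # texts in the chunk being built
    size = 0      # character count of the current chunk, separators included

    def flush():
        nonlocal current, size
        if current:
            chunks.append("\n\n".join(current))
        current, size = [], 0

    for el in elements:
        text = el["text"]
        added = len(text) + (2 if current else 0)  # 2 chars for the "\n\n" separator
        if el["type"] == "Title" or size + added > max_chunk_size:
            flush()
            added = len(text)  # first text in a fresh chunk needs no separator
        current.append(text)
        size += added
    flush()
    return chunks


sections = [
    {"type": "Title", "text": "Chapter 1"},
    {"type": "NarrativeText", "text": "End of chapter one."},
    {"type": "Title", "text": "Chapter 2"},
    {"type": "NarrativeText", "text": "Start of chapter two."},
]
by_title = sketch_chunk_by_title(sections, max_chunk_size=500)

untitled = [{"type": "NarrativeText", "text": t}
            for t in ("a" * 10, "b" * 10, "c" * 10)]
by_size = sketch_chunk_by_title(untitled, max_chunk_size=25)
```

Even with room to spare, the end of Chapter 1 is never merged with the start of Chapter 2. In the library, the corresponding function is chunk_by_title in unstructured.chunking.title, which takes max_characters and overlap among its parameters.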
Benefits of Containerization
Both pipelines are packaged as Docker containers, and this is not merely a convenience—it addresses fundamental challenges in document processing[^1].
System-Level Dependencies Beyond pip
Document processing libraries depend on native system packages that cannot be installed via pip alone[^1]:
- Tesseract OCR (tesseract-ocr, libtesseract-dev): The C++ engine that pytesseract calls. Must be installed via the system package manager (apt install tesseract-ocr on Debian/Ubuntu)[^20][^27]. Without it, any OCR call will fail with “tesseract is not installed or it’s not in your path”[^22].
- Tesseract language packs (tesseract-ocr-eng, tesseract-ocr-fra, etc.): Additional packages for each supported language[^20]
- Leptonica (libleptonica-dev): Image processing library required by Tesseract[^18]
- Poppler or MuPDF system libraries: Low-level PDF rendering engines that some Python packages depend on
- LibreOffice or system font packages: Required for accurate rendering of Office documents
- OpenCV system dependencies (libgl1-mesa-glx, libsm6, libxext6): Required if image preprocessing is applied before OCR[^27]
The Unstructured pipeline adds even more dependencies. Its unstructured[all-docs] package bundles parsers for dozens of formats, each potentially requiring its own native library. The Unstructured team provides a pre-built Docker image (downloads.unstructured.io/unstructured-io/unstructured:latest) that includes all of these[^2].
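When running outside the container, a quick preflight check can surface missing native tools before the first OCR call fails. The following is a minimal sketch that only looks for executables on PATH; the executable names (tesseract, pdftoppm from Poppler, soffice from LibreOffice) are assumptions chosen for illustration and may differ on your system. It does not detect missing shared libraries or language packs.

```python
import shutil


def missing_binaries(required, which=shutil.which):
    """Return the names in `required` that are not found on PATH."""
    return [name for name in required if which(name) is None]


# Assumed executable names, for illustration only:
# tesseract (OCR engine), pdftoppm (ships with Poppler), soffice (LibreOffice).
REQUIRED = ["tesseract", "pdftoppm", "soffice"]

if __name__ == "__main__":
    missing = missing_binaries(REQUIRED)
    if missing:
        print("Missing system tools:", ", ".join(missing))
    else:
        print("All required system tools were found.")
```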
Why Docker
The project uses Docker for six specific reasons[^1]:
- Dependency Management — Containers encapsulate all system dependencies (Tesseract, system libraries for PDF processing) in a single, reproducible image, eliminating version conflicts with the host system[^1]
- Reproducibility — A container image guarantees identical behavior across different machines and environments, eliminating “it works on my machine” problems[^1]
- Isolation — Document processing can be CPU- and memory-intensive. Containers allow resource limits to be applied without affecting the host[^1]
- Portability — The same image runs on Windows, macOS, and Linux without modification[^1]
- Scalability — Containerized pipelines integrate naturally with Docker Compose or Kubernetes for batch processing of large document collections[^1]
- Security — Documents often contain sensitive information; containers add an isolation layer and can be run with restricted permissions[^1]
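The isolation and security points can be made concrete with standard docker run options. The sketch below only assembles an argument list; the function name, the limit values, and the choice to drop all Linux capabilities are assumptions for illustration, not settings taken from the repository.

```python
def docker_run_command(image, data_dir, memory="2g", cpus="2.0"):
    """Build a `docker run` argument list with resource and permission limits."""
    return [
        "docker", "run", "--rm",
        "--memory", memory,        # cap the container's memory
        "--cpus", cpus,            # cap the CPU time available to the container
        "--cap-drop", "ALL",       # drop Linux capabilities the pipeline should not need
        "-v", f"{data_dir}:/data", # mount the document folder at /data
        image,
    ]


cmd = docker_run_command("simple-pdf-extraction", "/home/me/project/data")
```

The resulting list can be passed to subprocess.run(cmd, check=True). If a pipeline step fails under these restrictions, relax them one at a time to find which one it depends on.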
Running the Pipelines
Prerequisites
- Docker installed on your system (Docker Desktop for Windows/macOS, or docker.io and docker-compose packages for Linux)[^1]
- For AI features in simple-pdf-extraction: Ollama running locally or accessible via a URL[^1]
Preparing Your Documents
On all platforms, start by creating the data directories and placing your documents in the input folder[^1]:
mkdir -p data/input
mkdir -p data/output
# Copy your documents to data/input
Running Simple PDF Extraction
macOS/Linux:
cd simple-pdf-extraction
docker build -t simple-pdf-extraction .
docker run -v $(pwd)/../data:/data simple-pdf-extraction
Windows:
cd simple-pdf-extraction
docker build -t simple-pdf-extraction .
docker run -v %cd%\..\data:/data simple-pdf-extraction
Alternatively, use Docker Compose from the simple-pdf-extraction directory[^2]:
docker-compose build
docker-compose run --rm pdf-extractor ./run.sh
Or use the convenience scripts from the repository root: run-simple-pdf-extraction.sh (Linux/macOS) or run-simple-pdf-extraction.ps1 (Windows)[^1].
Results are saved in data/output/simple-pdf-extraction/ as JSON files containing extracted content, metadata, and AI analysis[^1].
Running Chunking with Unstructured
First, pull the Unstructured base image if needed[^2]:
# AMD64 (default)
docker pull downloads.unstructured.io/unstructured-io/unstructured:latest
# Apple Silicon / ARM64
docker pull --platform=linux/arm64 downloads.unstructured.io/unstructured-io/unstructured:latest
Then build and run[^1]:
cd simple-chunking-with-unstructured
docker build -t simple-chunking-unstructured .
# No chunking (elements as-is)
docker run -v $(pwd)/../data:/data simple-chunking-unstructured
# By-title chunking
docker run -v $(pwd)/../data:/data simple-chunking-unstructured \
--by-title --max-chunk-size 1000 --max-chunk-overlap 100
# Basic chunking
docker run -v $(pwd)/../data:/data simple-chunking-unstructured \
--basic --max-chunk-size 1500
Or use the convenience scripts: run-simple-chunking-with-unstructed.sh / .ps1 from the repository root[^1].
Results are saved in data/output/unstructured-*/ with each element or chunk as a separate JSON file in document-specific subdirectories[^1].
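Because each element or chunk lands in its own JSON file, a small helper is handy for inspecting a run. The sketch below assumes only the layout described above (one subdirectory per document, one JSON file per item); the function names are hypothetical, and nothing is assumed about the keys inside each file.

```python
import json
from collections import Counter
from pathlib import Path


def count_items_per_document(output_root):
    """Count JSON files in each document subdirectory under output_root."""
    counts = Counter()
    for json_file in Path(output_root).glob("*/*.json"):
        counts[json_file.parent.name] += 1
    return dict(counts)


def load_items(document_dir):
    """Load every JSON file in one document subdirectory, sorted by file name."""
    return [
        json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(Path(document_dir).glob("*.json"))
    ]
```

Pass the concrete output directory of a run to count_items_per_document to see how many elements or chunks each document produced.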
Chunking Configuration
The Unstructured pipeline accepts these command-line arguments[^1]:
| Argument | Description |
|---|---|
| --by-title | Use title-based chunking (groups content under headings)[^1] |
| --basic | Use basic size-based chunking[^1] |
| --max-chunk-size N | Maximum characters per chunk (default: 1000)[^1] |
| --max-chunk-overlap N | Overlap between chunks (default: 0)[^1] |
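For readers adapting the pipeline, these arguments map naturally onto argparse. The parser below is a sketch built from the table, not the repository's actual code; in particular, treating --by-title and --basic as mutually exclusive is an assumption.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented chunking arguments."""
    parser = argparse.ArgumentParser(description="Chunking options (illustrative sketch)")
    # Assumption: only one strategy may be chosen; omitting both means no chunking.
    strategy = parser.add_mutually_exclusive_group()
    strategy.add_argument("--by-title", action="store_true",
                          help="Use title-based chunking")
    strategy.add_argument("--basic", action="store_true",
                          help="Use basic size-based chunking")
    parser.add_argument("--max-chunk-size", type=int, default=1000, metavar="N",
                        help="Maximum characters per chunk")
    parser.add_argument("--max-chunk-overlap", type=int, default=0, metavar="N",
                        help="Overlap between chunks")
    return parser


args = build_parser().parse_args(
    ["--by-title", "--max-chunk-size", "1000", "--max-chunk-overlap", "100"]
)
```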
Choosing Between the Two Approaches
Use Simple PDF Extraction when[^1]:
- You need AI-powered content analysis (summaries, topic extraction, entity recognition)
- You want direct control over the extraction process for each format
- You’re working with a smaller number of documents
- You need integration with Ollama for LLM-based analysis
Use Chunking with Unstructured when[^1]:
- You’re building RAG or retrieval systems that need semantic chunking
- You need uniform handling across many document formats through a single API
- You want to preserve document structure as typed, hierarchical elements
- You’re processing large volumes of diverse document types
Both pipelines serve as reference implementations. They can be adapted, extended, or combined—for example, using Unstructured for partitioning and chunking, then feeding chunks through Ollama for AI-enriched metadata[^1].
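One way the combination could look is sketched below: each chunk's text is sent to a local Ollama server, which is asked to return JSON. The endpoint is Ollama's default local generate API; the model name, the prompt wording, and the 'summary' and 'topics' keys are assumptions for illustration. A model can still return malformed JSON, so production code would need error handling around the final parse.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(chunk_text, model="llama3.2"):
    """Build the JSON payload asking the model for a summary and topics.

    The model name and the requested keys are illustrative assumptions.
    """
    prompt = (
        "Return a JSON object with keys 'summary' and 'topics' "
        "for the following text:\n\n" + chunk_text
    )
    return {"model": model, "prompt": prompt, "format": "json", "stream": False}


def enrich_chunk(chunk_text, model="llama3.2", url=OLLAMA_URL, timeout=60):
    """Send one chunk to Ollama and return the parsed JSON it produced."""
    body = json.dumps(build_request(chunk_text, model)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.load(response)
    # With stream=False, the generated text arrives in the "response" field.
    return json.loads(reply["response"])
```

The returned dictionary can then be stored alongside the chunk's own metadata before indexing.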
References
README.md - This project provides two example implementations of document processing pipelines that demonstrate …
README.md - Pull docker image first. The AMD64 platform is the default. docker pull downloads.unstructured.iounstr…
Partitioning - Unstructured Documentation - Partitioning functions in unstructured allow users to extract structured content from a raw unstruct…
Structured LLM Output Using Ollama - Towards Data Science - By introducing structured outputs, Ollama now makes it possible to constrain a model’s output to a s…
python 3.x - i am building code to extact text from image if the pdf … - If you PyMuPDF, you do not need pytesseract, because there is a native Tesseract-OCR built into PyMu…
Document elements and metadata - Unstructured - When you partition a document with Unstructured, the result is a list of document Element objects. T…
Structured outputs · Ollama Blog - Ollama now supports structured outputs making it possible to constrain a model’s output to a specifi…
Advanced PyMuPDF Text Extraction Techniques | Full Tutorial - learnpython #programming #pdfautomation Learn how to extract and structure text from PDF documents u…
Structured Outputs - Ollama’s documentation - Structured outputs let you enforce a JSON schema on model responses so you can reliably extract stru…
How to extract text from a PDF using PyMuPDF and Python - PyMuPDF is fast for basic PDF text extraction, while Nutrient DWS Processor API handles complex docu…
PyMuPDF with tesseract OCR as External Content Extraction Engine - Example of External Content Extraction Engine. PyMuPDF with tesseract OCR as External Content Extrac…
Chunking - Unstructured Documentation - Chunking rearranges the resulting document elements into manageable “chunks” to stay within the limi…
From Image to Text in Seconds — Tesseract OCR in a Docker … - In this tutorial, we’ll containerize Tesseract so you can run OCR anywhere — no OS dependencies, no …
Chunking - Unstructured Documentation - The by_title chunking strategy preserves section boundaries and optionally page boundaries as well. …
Pytesseract in a docker container cannot find Tesseract OCR - Reddit - I am working on OCR related project where I need to use pytesseract to do some OCR. The project is c…
How do I add tesseract to my Docker container so i can use … - I am working on a project that requires me to run pytesseract on a docker container, but am unable t…