Knowledge Graphs for Agents
Modern AI agents often generate fluent but shallow responses. Knowledge graphs (KGs) introduce structured, semantically grounded knowledge that enables agents to reason, check consistency, and retrieve facts — all grounded in explicit relationships. In agentic systems, KGs play a role similar to a semantic memory: a persistent, queryable structure that captures entities, relationships, and context over time.
2. Core Graph Concepts
Graphs are mathematical structures used to model relationships among objects.
- Vertices (Nodes): Represent entities or objects — e.g., Person, Company, Location.
- Edges (Links): Represent relationships between entities — e.g., works_at, founded, is_located_in.
Graph types:
- Directed graphs: Edges have direction (A → B). Example: Alice → works_at → Google.
- Undirected graphs: Relationships are symmetrical (A—B). Example: Alice — married_to — Bob.
- Weighted graphs: Edges have numerical weights, indicating strength or confidence. Useful for ranking facts or encoding probabilities in uncertain data.
Example illustration:
If we take the sentence “Alan Turing worked at Bletchley Park during WWII,” a simple graph representation would be:
- Vertices: Alan Turing, Bletchley Park, WWII
- Edges:
- (Alan Turing) —worked_at→ (Bletchley Park)
- (Alan Turing) —active_during→ (WWII)
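The two edges above can be sketched directly as Python tuples (a minimal, framework-free representation; the `neighbors` helper name is invented here for illustration):

```python
# Minimal sketch: the sentence above as subject–predicate–object triples.
triples = [
    ("Alan Turing", "worked_at", "Bletchley Park"),
    ("Alan Turing", "active_during", "WWII"),
]

def neighbors(graph, subject):
    """Return the (predicate, object) pairs leaving `subject`."""
    return [(p, o) for s, p, o in graph if s == subject]

turing_edges = neighbors(triples, "Alan Turing")
```

Even this tiny structure already supports the directed-graph semantics described above: edges carry labels and point from subject to object.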
3. Building a Knowledge Graph
There are several approaches to constructing knowledge graphs, depending on the source and structure of available data.
(a) From Structured Data
- Use databases, CSVs, or APIs that already have well-defined schema.
- Convert tabular or relational data into triples: (entity, relation, entity/value).
Example: a row of an SQL employees table maps to the triple (Alice, works_at, Google).
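A hypothetical sketch of this tabular-to-triple mapping, assuming a toy employees CSV with `name` and `employer` columns (the column names and the `works_at` relation are illustrative):

```python
import csv
import io

# Toy stand-in for an exported employees table.
raw = "name,employer\nAlice,Google\nBob,DeepMind\n"

def rows_to_triples(rows, relation="works_at"):
    """Map each tabular row onto an (entity, relation, value) triple."""
    return [(row["name"], relation, row["employer"]) for row in rows]

triples = rows_to_triples(csv.DictReader(io.StringIO(raw)))
```

The same pattern generalizes: each non-key column of a relational table becomes a candidate predicate.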
(b) From Semi-structured Data
- Extract information from sources like JSON, XML, or infoboxes (e.g., Wikipedia).
- Use schema mapping to align fields with ontology terms (e.g., “founder” → hasFounder).
(c) From Unstructured Text (Document-Derived KGs)
- Apply NLP pipelines to extract:
- Named entities (NER): people, organizations, locations.
- Relations (RE): verbs or prepositional phrases linking entities.
- Coreference resolution: unify references like “he” and “Turing”.
- LLMs can generate semantic triples or structured JSON via prompt templates such as: “Extract entities and their relationships as subject–predicate–object triples.”
- Store extracted triples in a graph database such as Neo4j, GraphDB, or RDF-based store.
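Before triples reach the graph store, the LLM's raw output has to be parsed defensively. A sketch under simple assumptions: the model returns JSON shaped like `{"triples": [...]}` (as the prompt requests), and `parse_llm_triples` is an illustrative helper that skips malformed records rather than failing:

```python
import json

# Hypothetical raw LLM response to the extraction prompt above.
llm_output = '{"triples": [{"s": "Turing", "p": "worked_at", "o": "Bletchley Park"}, {"s": "Turing", "p": "", "o": "WWII"}]}'

def parse_llm_triples(text):
    """Parse the model's JSON into (s, p, o) tuples, dropping
    records with missing or empty fields."""
    payload = json.loads(text)
    triples = []
    for t in payload.get("triples", []):
        if all(t.get(k) for k in ("s", "p", "o")):
            triples.append((t["s"], t["p"], t["o"]))
    return triples
```

In a production pipeline, the surviving tuples would then be written to Neo4j, GraphDB, or an RDF store.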
4. Ontologies and Schema Design
A knowledge graph is not just a collection of triples — it also has a schema defining:
- Classes (Types): What kind of entities exist (Person, Organization, Event).
- Relations (Predicates): Allowed edges and their semantics.
- Constraints: Domain/range typing, cardinalities, inheritance.
This allows semantic reasoning — inferring new facts from known ones (e.g., via RDF Schema or OWL reasoning).
5. Using Knowledge Graphs in Agentic AI Systems
Agentic AI systems orchestrate multiple tools or reasoning steps. KGs empower them in several key ways:
(a) Retrieval-Augmented Reasoning
- The agent uses the KG as a fact repository to ground responses or validate LLM output.
- Example: Before answering “Who founded DeepMind?”, the agent queries KG → returns “Demis Hassabis”.
(b) Contextual Memory and State Tracking
- The KG maintains long-term memory across multiple interactions.
- Example: In a conversational system, nodes track “user preferences” or “past tasks”.
(c) Multi-agent Collaboration
- Agents share structured knowledge via a shared KG workspace.
- Example: A “research agent” adds verified facts; a “summary agent” consumes them to generate reports.
(d) Reasoning and Inference
- Graph algorithms (shortest path, PageRank, neighborhood clustering) help determine relevance or influence.
- Logical reasoners can infer new relationships: If (Turing, worked_at, Bletchley Park) and (Bletchley Park, located_in, UK) ⇒ (Turing, worked_in, UK).
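The Turing inference above can be implemented as a single hand-written composition rule over a triple list; `infer_worked_in` is an illustrative name, not a reasoner API:

```python
def infer_worked_in(triples):
    """One composition rule:
    (X, worked_at, P) and (P, located_in, C)  =>  (X, worked_in, C)."""
    located = {s: o for s, p, o in triples if p == "located_in"}
    inferred = set()
    for s, p, o in triples:
        if p == "worked_at" and o in located:
            inferred.add((s, "worked_in", located[o]))
    return inferred

kg = [("Turing", "worked_at", "Bletchley Park"),
      ("Bletchley Park", "located_in", "UK")]
new_facts = infer_worked_in(kg)
```

Production systems express such rules declaratively (RDFS/OWL, Datalog) and let a reasoner apply them to fixpoint, but the mechanics are the same join over shared entities.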
6. Integrating LLMs and Knowledge Graphs
The modern fusion of LLMs and KGs can take several patterns:
- KG-Augmented RAG: LLM queries KG to retrieve nodes relevant to a question, grounding answers with factual context.
- LLM-to-KG Extraction: LLM parses new documents, meetings, or chats to expand the KG dynamically.
- Graph Embeddings: Use algorithms like Node2Vec, TransE, or Graph Neural Networks to represent nodes and edges in vector space for similarity search.
- KG as Context Builder: LLM injects KG-derived relational context into prompts to improve coherence and factuality.
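As a toy illustration of the TransE idea mentioned above (true facts satisfy head + relation ≈ tail in vector space), with hand-picked 2-d vectors rather than trained embeddings:

```python
import math

def transe_score(h, r, t):
    """TransE plausibility: negative L2 distance ||h + r - t||.
    Scores closer to 0 mean the triple fits the embedding space better."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

# Toy embeddings, chosen so (Alice, works_at, Google) holds by construction.
emb = {"Alice": (1.0, 0.0), "Google": (1.5, 1.0), "Paris": (9.0, 9.0)}
works_at = (0.5, 1.0)  # relation vector

good = transe_score(emb["Alice"], works_at, emb["Google"])
bad = transe_score(emb["Alice"], works_at, emb["Paris"])
```

Real systems learn these vectors from the graph by gradient descent; the scoring function is then used to rank candidate edges for completion or similarity search.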
7. Example: Agentic Workflow with a Knowledge Graph
Scenario: “A research assistant agent helps summarize academic papers.”
- Document ingestion: LLM extracts concepts, authors, institutions, and relationships.
- KG update: Stores triples like (Paper X, written_by, Author Y), (Author Y, affiliated_with, University Z).
- Query phase: When asked “Which universities collaborate on graph representation learning?”, the agent traverses relationships in the KG to find co-authorship or shared projects.
- Answer generation: The agent cites results with grounding retrieved from the graph.
8. Tools and Frameworks
- Graph Databases: Neo4j, GraphDB, ArangoDB, JanusGraph.
- RDF/Linked Data: Apache Jena, Stardog, Blazegraph.
- Extraction Pipelines: spaCy, OpenIE, Stanford CoreNLP, LLM-based wrappers (LangChain, Haystack).
- Visualization: Graphistry, Gephi, Neo4j Bloom, or custom D3.js tools.
9. Key Takeaways
- Knowledge graphs make implicit relationships explicit — forming a durable foundation for reasoning, retrieval, and memory in agentic systems.
- Combine LLM-driven unstructured information extraction with graph databases for persistent, explainable knowledge.
- In multi-agent environments, KGs serve as shared world models — enabling collaboration, grounding, and continuous learning.
Creating Triples from Text
Triples from text can be created with fairly different pipelines depending on whether you rely on “classic” NLP (e.g., spaCy + RE models) or prompt an LLM directly; in practice, the best systems often hybridize both.123
What a “triple extraction” pipeline does
Goal: from raw text, produce triples of the form $(\text{subject}, \text{predicate}, \text{object})$ plus optional attributes (time, source span, confidence).
Canonical stages:
- Sentence segmentation and tokenization.
- Entity detection and normalization.
- Relation extraction between entity pairs.
- Optional: entity linking to a canonical ID and ontology alignment.34
These stages are implemented differently in an NLP-centric vs LLM-centric stack.
Traditional NLP / spaCy-based pipelines
Typical architecture
- NER with spaCy (possibly custom-trained)
- Dependency- or pattern-based relation extraction
- Entity linking and ontology mapping
- Triple construction
- For each recognized relation between entity spans, emit: $(\text{subject\_ID}, \text{relation\_URI}, \text{object\_ID})$ plus provenance (sentence, offsets, confidence).4
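One way to carry that provenance alongside each emitted triple is a small record type; the field names below are illustrative, not a standard schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    """An extracted fact plus its provenance and confidence."""
    subject_id: str
    relation_uri: str
    object_id: str
    sentence: str      # source sentence
    start: int         # character offsets of the evidence span
    end: int
    confidence: float

t = Triple("ex:Alice", "ex:works_at", "ex:Google",
           "Alice works at Google.", 0, 22, 0.92)
```

Keeping offsets and confidence on every triple is what later makes validation, auditing, and analyst review cheap.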
Strengths
- Predictability and control
- High precision in narrow domains
- Cheaper and easier to run at scale
Weaknesses
- Coverage and recall
- Annotation cost and rigidity
- Limited abstraction
LLM-centric triple extraction
Core patterns
- Single-pass instruction prompting
- Two-step or multi-step extraction
- Step 1: extract entities and type them.
- Step 2: given the entities and text, extract relations between named entities.
- This has been used in systems like KGGen to keep entities consistent across triples.1
- LLM-based clustering and refinement
- After initial triples, an LLM is used again to cluster nodes referring to the same entity, collapse duplicates, and clean predicates.1
- Specialized “text-to-triple” models
- Some frameworks (e.g., Triplex, KG-LLM) train or tune models specifically to convert text into triples or to complete triples (knowledge graph completion).10
Strengths
- High recall and semantic coverage
- Low setup cost and rapid prototyping
- Flexible schema and ontology handling
- Rich attributes and event structures
Weaknesses
- Hallucinations and factual drift
- Stability and reproducibility
- Cost and latency
- Schema enforcement is indirect
Direct comparison: spaCy-style NLP vs LLMs
Methodological differences
| Aspect | spaCy / traditional NLP | LLM-based extraction |
|---|---|---|
| Entity detection | NER models over tokens and spans, trained per domain 23 | Prompted recognition with implicit type knowledge 19 |
| Relation extraction | Dependency patterns, rule-based grammars, supervised RE 53 | Natural-language instructions; model infers relations end-to-end 18 |
| Schema integration | Explicit: predicates hard-coded or trained; strict types 43 | Implicit: prompt describes ontology; LLM maps raw relations to it 110 |
| Data requirements | Needs labeled examples for NER/RE/linking 34 | Can work zero/few-shot with prompts and a few exemplars 111 |
| Precision vs recall | High precision, lower recall outside training domain 46 | Higher recall and coverage, but risk of hallucination 1106 |
| Scalability & cost | Fast, cheap, good for large corpora 27 | Slower, more expensive per token 813 |
| Interpretability | Transparent rules and limited models 53 | Black-box; behavior guided by prompts but hard to fully predict 612 |
Hybrid pipelines (often best for agentic systems)
Many recent practical recipes combine both, leveraging spaCy for robust low-level structure and LLMs for semantic flexibility.23
Common hybrid patterns:
- spaCy for entities, LLM for relations
- spaCy + RE model, LLM for cleaning and ontology alignment
- LLM-first, spaCy/regex for verification
- Annotation bootstrapping
For agentic systems, the hybrid pattern also lets different agents specialize: a parsing agent using spaCy, an extraction agent using an LLM, and a curation agent validating triples.
How choice of method impacts agentic QA over a KG
When the resulting KG is used in an agentic QA or planning pipeline:
- spaCy-first graphs
- LLM-first graphs
A typical agentic QA loop with a KG:
- Retrieval agent: queries KG for candidate entities/paths.
- Validation/curation agent: checks key triples using an LLM as a truthiness filter.12
- Answer-generation agent: conditions on both KG evidence and raw text snippets (RAG + KG).
The more “surgical” and high-precision your triple extraction (spaCy-heavy), the more you’ll depend on fallback text search; the more “wide” and LLM-heavy you go, the more you must invest in validation agents and confidence scoring.
Ontologies Deep-dive
An ontology is the formal semantic blueprint of a domain: it defines the classes of things that exist, their properties, and the admissible relationships and constraints between them, providing shared meaning for a knowledge graph.1617
What an ontology is (and is not)
- Definition
- Ontology vs. knowledge graph
- Role in agentic systems
Mini example (e‑commerce):
- Classes: Product, Customer, Order.
- Object properties: placesOrder (Customer → Order), containsProduct (Order → Product).
- Data properties: hasPrice (Product → decimal), orderDate (Order → date).1619
Examples of ontologies
- FOAF (Friend of a Friend)
- Models people, organizations, online accounts, and their social links, with classes like foaf:Person and properties such as foaf:knows, foaf:homepage.20
- SNOMED CT / biomedical ontologies
- Large, hierarchically structured ontologies of diseases, procedures, and findings, enabling semantic interoperability in EHRs.22
- Enterprise product ontology
These ontologies are usually expressed in OWL/RDF or similar formalisms to support machine reasoning.2016
Process of creating an ontology
A practical ontology engineering process usually follows these steps.2320
- Scope and purpose
- Identify why you need the ontology: search, analytics, integration, regulatory reporting, etc.
- Define boundaries: what is in scope vs. out of scope.23
- Gather requirements (competency questions)
- Concept and relation elicitation
- Design the class hierarchy (TBox)
- Define properties and constraints
- Align with existing ontologies
- Populate with sample instances and test competency questions
- Create example individuals and assert facts.
- Use a reasoner to check if the ontology answers the competency questions and is logically consistent.23
- Iterative refinement and governance
What to focus on (and common pitfalls)
Key design focuses
- Clarity and shared meaning
- Modularity
- Prefer smaller, “plug‑and‑play” ontologies that can be composed, rather than one monolith; this improves maintainability and governance.23
- Balance expressivity vs. complexity
- Only use as much OWL expressivity as you need; overly complex axioms can make reasoning slow and the ontology fragile.23
- Alignment with real-world use cases
Typical pitfalls (empirical studies)
Analyses of ontologies and tools like OOPS! highlight recurring mistakes.2526
- Lack of annotations and documentation
- Classes and properties without clear labels, comments, or definitions; this undermines reuse and governance.25
- Missing or incorrect domain/range
- Unconnected or orphaned elements
- Classes or properties not integrated into the main hierarchy or graph, reducing coherence.25
- Misuse of inheritance and logical constructors
- Using intersection instead of union for domain/range, overusing “miscellaneous” catch‑all classes, or recursive/self-referential definitions.26
- Inconsistencies and contradictions
- Over-modeling / ontological bloat
Creating ontologies with LLMs
LLMs are increasingly used to reduce the manual burden of ontology design and curation.282224
LLM roles in ontology engineering
- From requirements to draft ontology
- Incremental ontology generation
- Methods such as CQ-by-CQ or Ontogenia feed competency questions one by one or in batches, letting the LLM produce and iteratively refine an ontology module.24
- Ontology learning from text
- Curation and quality control
Concrete patterns from recent work
- Memoryless CQ-by-CQ vs incremental generation
- Treat each competency question independently to generate local ontology fragments, then merge; or prompt with all questions and grow a unified ontology.24
- End-to-end ontology learning (OLLM)
- Approaches that fine-tune LLMs to produce taxonomic backbones from scratch, with regularizers that avoid overfitting to frequent concepts.28
- LLM-assisted OWL axiom drafting (e.g., SPIRES-style frameworks)
- LLM drafts axioms which are then validated by experts or automated reasoners.22
Practical caveats
- Hallucinated or inconsistent axioms
- Schema drift and lack of global coherence
A pragmatic pattern is: experts define scope and core concepts, LLMs propose draft hierarchies and axioms, and automated tools plus experts refine and validate.2422
Optimizing ontologies with graph algorithms and other techniques
Once an ontology is represented as a graph (classes and properties as nodes/edges), you can apply graph-theoretic methods for analysis and optimization.27
Structural optimization
- Normalization and redundancy removal
- Remove duplicate vertices and parallel edges; ensure each concept is uniquely and consistently represented.27
- Minimal spanning trees for pruning
- Use weighted graphs (weights reflect importance or usage) and minimum spanning tree or related optimization to reduce structural complexity while preserving coverage.27
- Backpack/knapsack-style optimization
- Treat ontology refinement as selecting a subset of concepts/relations that maximizes overall value (coverage, relevance) under constraints like size or reasoning time.27
Graph-based diagnostics
- Centrality measures
- Degree, betweenness, and eigenvector centrality identify highly connected or critical concepts that may need clearer definitions or modularization.27
- Community detection / clustering
- Detect groups of concepts that form natural modules; can guide modularization or reveal domains that should be split into separate ontologies.27
- Alignment and mapping via graph similarity
- Graph mapping and similarity measures help align two ontologies, measuring precision/recall of mappings and spotting structural mismatch.27
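Degree centrality, the simplest of these diagnostics, can be sketched over an edge list with the standard library; the subclass edges below are invented for illustration:

```python
from collections import Counter

def degree_centrality(edges):
    """Count in+out degree per node; high-degree concepts are candidates
    for clearer definitions or for splitting into modules."""
    deg = Counter()
    for s, o in edges:
        deg[s] += 1
        deg[o] += 1
    return deg

# Toy ontology backbone: (subclass, superclass) edges.
ontology_edges = [("Person", "Agent"), ("Organization", "Agent"),
                  ("Employee", "Person"), ("Agent", "Thing")]
hub, hub_degree = degree_centrality(ontology_edges).most_common(1)[0]
```

Here `Agent` surfaces as the structural hub; betweenness and eigenvector variants refine the same idea by weighting paths and neighbors.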
Semantic optimization and learning
- Usage-driven refinement
- Ontology-enhanced KG completion and reasoning
- Recent methods integrate ontological constraints into LLM-based KG completion, improving reasoning performance and revealing missing or mis-specified ontology elements.29
- Iterative learning loop
- Automatically weight concepts and relations, expand the ontology from data, then run optimization steps to keep integrity, thematic balance, and performance.27
Graph Databases: RDF/SPARQL (Apache Jena) vs LPG/Cypher (Neo4j)
RDF/SPARQL (e.g., Apache Jena) and labeled‑property graphs with Cypher (e.g., Neo4j) both store nodes and edges, but they optimize for different things: RDF for semantic interoperability and reasoning, LPG for developer ergonomics and high‑performance traversals.313233
RDF + SPARQL + Apache Jena: overview
- Data model (RDF)
- Query language (SPARQL)
- Apache Jena
Agentic KG example (RDF/SPARQL)
Triples (Turtle):
@prefix ex: <http://example.org/> .
ex:Task123 a ex:ResearchTask ;
ex:assignedTo ex:AgentSummarizer ;
ex:hasInput ex:Doc456 ;
ex:status "pending" .
ex:AgentSummarizer a ex:LLMAgent ;
ex:capability "summarization" .
SPARQL:
SELECT ?task ?doc
WHERE {
?task a ex:ResearchTask ;
ex:assignedTo ex:AgentSummarizer ;
ex:hasInput ?doc ;
ex:status "pending" .
}
This can drive an agent scheduler that picks pending tasks for the summarization agent.
LPG + Cypher + Neo4j: overview
- Data model (labeled property graph)
- Query language (Cypher)
- Neo4j characteristics
Agentic KG example (LPG/Cypher)
Graph:
CREATE (t:Task {id: "Task123", type: "ResearchTask", status: "pending"})
CREATE (a:Agent {id: "AgentSummarizer", type: "LLMAgent", capability: "summarization"})
CREATE (d:Document {id: "Doc456"})
CREATE (a)-[:ASSIGNED_TO]->(t)
CREATE (t)-[:HAS_INPUT]->(d);
Query:
MATCH (a:Agent {id: "AgentSummarizer"})-[:ASSIGNED_TO]->(t:Task {status: "pending"})-[:HAS_INPUT]->(d:Document)
RETURN t.id AS taskId, d.id AS docId;
Side‑by‑side comparison
Conceptual & data‑model differences
| Dimension | RDF/SPARQL (Jena) | LPG/Cypher (Neo4j) |
|---|---|---|
| Core model | Global triples $(s,p,o)$ only 3334 | Nodes, relationships, properties; properties on both nodes and edges 3339 |
| Semantics | Strong: IRIs, ontologies (RDFS/OWL), formal reasoning 3138 | Implicit: labels and conventions; limited formal semantics 3932 |
| Edge attributes | Modeled via extra triples or reification 33 | Native properties on relationships (e.g., since, confidence) 3339 |
| Schema | Optional but often explicit via ontologies 4243 | Optional, implicit in labels and property usage 4033 |
| Identity | Global IRIs, good for linked data 3433 | Local identifiers; global semantics by convention or mapping 4044 |
Querying and reasoning
| Dimension | RDF/SPARQL (Jena) | LPG/Cypher (Neo4j) |
|---|---|---|
| Query style | Triple patterns, joins via shared vars 36 | Graph pattern matching with ASCII syntax 3941 |
| Reasoning | Built‑in RDFS/OWL rule engines, entailment 3138 | No native OWL; some constraints via APOC/procedures, basic semantics only 32 |
| Federation | SPARQL 1.1 federation between endpoints 3634 | External federation via tooling/APIs, not part of Cypher spec |
| Multi‑hop traversal performance | Joins over triples; can be slower on dense graphs 3233 | Optimized for deep traversals and path queries 3932 |
Performance, tooling, and use cases
| Dimension | RDF/SPARQL (Jena) | LPG/Cypher (Neo4j) |
|---|---|---|
| Performance focus | Semantic precision, interoperability 3244 | High‑performance traversals and analytics 3239 |
| Typical use cases | Linked data, ontology‑driven KGs, regulatory/semantic integration 4245 | Recommendation, fraud detection, operational reasoning, context graphs 3240 |
| Tooling | Jena APIs, Fuseki SPARQL server, OWL reasoners 353437 | Neo4j Browser, Bloom, GDS (graph data science) library 3932 |
| AI/agent integration | Strong with ontologies and explicit constraints 4542 | Strong with traversal‑based context retrieval and graph analytics 32 |
Agentic AI: when to use which
RDF/SPARQL/Jena in agentic systems
Best when your agentic AI needs formal semantics, interoperability, and rule‑based reasoning:
- Ontology‑driven task typing
- Policy and compliance checks
Example: find all tasks involving PHI that require a privacy‑compliant agent:
PREFIX ex: <http://example.org/>
SELECT ?task ?agent
WHERE {
?task a ex:Task ;
ex:requiresDataType ex:ProtectedHealthInfo .
?agent a ex:Agent ;
ex:hasClearance ex:PHI_Compliant .
?agent ex:canHandle ?task .
}
Reasoning can infer ex:ProtectedHealthInfo ⊑ ex:SensitiveData, automatically including tasks modeled at different abstraction levels.3138
LPG/Cypher/Neo4j in agentic systems
Best when your agentic AI needs fast traversals, graph algorithms, and operational context:
- Context graph for RAG and multi‑agent planning
Example: dynamically pick the best agent based on recent performance:
MATCH (a:Agent)-[r:HANDLED]->(t:Task)
WHERE t.type = "Summarization"
WITH a, avg(r.qualityScore) AS avgScore, count(*) AS n
WHERE n > 20
RETURN a.id, avgScore
ORDER BY avgScore DESC
LIMIT 3;
- Graph algorithms for routing and coordination
Similarities and complementarity for agentic KG
- Both can represent: agents, tasks, tools, documents, user goals, and the relationships between them.4033
- Both allow pattern‑matching queries to support:
Hybrid patterns (increasingly common):
- Keep ontologies and long‑term semantic metadata in RDF/Jena, while mirroring operational facts into a property graph for fast analytics and traversal.4432
- Example:
- RDF layer: classes LLMAgent, RetrievalAgent, constraints, capabilities.
- LPG layer: concrete agent instances, real‑time interactions, performance edges.
- Agent orchestration:
- Use SPARQL to determine which type of agent is semantically appropriate for a request.
- Use Cypher and graph algorithms to pick which instance is best given load, success rates, and graph‑local context.32
This combination lets you give your agentic system both a formal world model (RDF/Jena) and a high‑speed operational memory (LPG/Neo4j).
Inductive Logic Programming and Graph Neural Networks
Inductive logic programming (ILP) and graph neural networks (GNNs) both enrich a knowledge graph, but in complementary ways: ILP adds explicit, symbolic rules, while GNNs add learned, distributed representations and powerful pattern completion. Together they support more accurate, explainable, and adaptive agentic KGs.495051
Inductive Logic Programming over knowledge graphs
What ILP does for a KG
- ILP learns first‑order logic rules from example facts in the KG plus background knowledge, e.g. $\text{collaboratesWith}(X,Y) \leftarrow \text{coAuthor}(X,Z) \wedge \text{coAuthor}(Y,Z)$.49
- On a KG, this means: induce definitions of new relations or constraints from existing triples and use them to derive additional (explained) edges.5249
Enhancements to the KG
- Rule-based KG completion
- Explainability and constraints
- Labeling functions and weak supervision
- ILP can automatically discover labeling functions over the graph, which then provide cheap labels for downstream models (including GNNs).57
Agentic AI example
- From an agent‑tool KG, ILP might learn: $\text{suitableForQuery}(A,Q) \leftarrow \text{hasSkill}(A,S) \wedge \text{requiresSkill}(Q,S)$.
- The agent orchestrator uses these rules to propose candidate agents for new tasks, even when there is no direct historical link.
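The learned clause above can be applied to a triple store with a couple of set comprehensions; the predicate, agent, and task names are illustrative:

```python
def suitable_agents(kg, query):
    """Apply the induced rule:
    suitableForQuery(A, Q) <- hasSkill(A, S) and requiresSkill(Q, S)."""
    required = {o for s, p, o in kg if p == "requiresSkill" and s == query}
    return sorted({s for s, p, o in kg if p == "hasSkill" and o in required})

kg = [("summarizer", "hasSkill", "summarization"),
      ("coder", "hasSkill", "codegen"),
      ("task42", "requiresSkill", "summarization")]
candidates = suitable_agents(kg, "task42")
```

The point of ILP is that this clause is learned from example assignments, not hand-written; once extracted, applying it is just a join like the one above.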
Graph neural networks over knowledge graphs
What GNNs do for a KG
- GNNs learn vector embeddings for entities and relations by propagating information over the graph structure, then use these embeddings to score candidate edges or classify nodes.585951
- They can operate in transductive (closed‑world) or inductive settings where new nodes and edges appear at test time.6061
Enhancements to the KG
- Knowledge graph completion
- Inductive generalization to new entities
- Graph-aware reasoning modules
Agentic AI example
- A GNN trained on a KG of past tasks predicts ASSISTS or DELEGATES_TO edges between agents, enabling automatic discovery of cooperation patterns for complex tasks.
- A completion model fills in HAS_CAPABILITY or RELEVANT_TO_TOPIC edges, which are then used to guide retrieval and planning.5859
How ILP and GNNs complement each other
1. Symbolic rules + neural embeddings
- ILP provides discrete, human-readable rules that reflect high‑level regularities; GNNs provide continuous embeddings that capture subtle statistical patterns and noise‑robust similarity.5049
- Neuro‑symbolic frameworks like KeGNN stack “knowledge enhancement layers” on top of a GNN to refine predictions using prior logical rules.50
2. Rule learning from GNNs and vice versa
- GNN-based models over KGs can be designed so that their transformations correspond to logic rules; from such monotonic GNNs, rules can be extracted and used in a symbolic solver.62
- Differentiable ILP systems (e.g., GLIDR, DFORL) blend gradient-based optimization with rule induction, achieving competitive KG completion performance while keeping rules extractable.5453
3. Better data efficiency and robustness
- ILP can provide weak labels or constraints that regularize GNN training (e.g., enforcing that predictions obey certain logical implications).5750
- GNNs, in turn, can denoise or complete sparse KGs before ILP runs, giving ILP richer structure to learn from.5150
4. For agentic KGs specifically
- Use GNNs to propose candidate new edges (e.g., potential collaborations, tool applicability, topic relevance).5958
- Use ILP to:
This gives you a KG that is both richer (thanks to GNN completion and inductive embeddings) and more interpretable and controllable (thanks to ILP rules and constraints)—a strong foundation for robust agentic AI.
Solution Architecture: Agentic AI M&A Target Identification System
System Overview
The system is a seven-layer agentic AI pipeline that continuously ingests and processes multi-source text, builds and enriches a domain knowledge graph, and deploys specialized agents to discover acquisition targets or score known candidates. The architecture integrates traditional NLP, LLMs, formal ontologies, graph algorithms, ILP, and GNNs into a unified, human-supervised loop.646566
┌──────────────────────────────────────────────────────────┐
│ ① DATA INGESTION ② NLP / EXTRACTION │
│ SEC Filings, News, →→→ spaCy NER, Relation │
│ Emails, Social Media, Extraction, LLM Triples, │
│ Reports, PDFs Coreference, Sentiment │
└────────────────────────────────────┬─────────────────────┘
↓
┌──────────────────────────────────────────────────────────┐
│ ③ ONTOLOGY LAYER + HUMAN REVIEW │
│ M&A OWL Ontology, Competency Questions, Analyst Curation│
└────────────────────────────────────┬─────────────────────┘
↓
┌──────────────────────────────────────────────────────────┐
│ ④ KNOWLEDGE GRAPH STORE │
│ RDF / Apache Jena (Semantic) + Neo4j (Operational) │
│ + Vector Store (RAG) │
└─────────────────────┬────────────────────────────────────┘
↕ continuous enrichment
┌─────────────────────┴────────────────────────────────────┐
│ ⑤ KG ENHANCEMENT ENGINE │
│ Graph Algorithms + ILP + GNNs + LLM-as-Judge Validator │
└────────────────────────────────────┬─────────────────────┘
↓
┌──────────────────────────────────────────────────────────┐
│ ⑥ AGENTIC REASONING LAYER │
│ Orchestrator | Signal | Scorer | Due Diligence | Report │
└────────────────────────────────────┬─────────────────────┘
↓
┌──────────────────────────────────────────────────────────┐
│ ⑦ ANALYST INTERFACE │
│ Target Dashboard → Explainability → Feedback Loop →→→ │
│ (feeds back to Ontology + KG) │
└──────────────────────────────────────────────────────────┘
Layer ① — Data Ingestion
All sources are normalized and ingested through a unified document store before NLP processing begins.6766
- SEC Filings (EDGAR): 10-K, 10-Q, 8-K filings; structured via XBRL where available and parsed for narrative text using tools like Docling or Apache Tika.6667
- News Articles: Web scrapers or licensed feeds (Bloomberg, Reuters APIs); timestamped and source-tagged.
- Emails / Internal Docs: IMAP-based ingestion with content stripped for NLP; high sensitivity data must be tagged for access control.
- Social Media: Twitter/X, LinkedIn mentions, Reddit; include metadata (author, timestamp, engagement) critical for signal detection.
- Financial Reports / Analyst PDFs: OCR via Docling or Tesseract; table extraction is critical since financial KPIs live in structured tables.66
All documents receive:
- sourceType, timestamp, entityMention metadata tags.
- A provenance hash linking every triple to its exact source sentence — critical for analyst trust and compliance.
Layer ② — NLP & Extraction Pipeline
A hybrid NLP + LLM pipeline processes each document to produce candidate triples.6766
Step 1: Document preprocessing
- PDF/HTML parsing → plain text + structured tables.
- Sentence segmentation, tokenization.
- Language detection and section identification (e.g., “Risk Factors” vs. “MD&A” in SEC filings).67
Step 2: Traditional NLP (spaCy)
- Custom NER trained on M&A domain labels: Company, Executive, Investor, Regulator, Market, Technology, FinancialMetric, Jurisdiction, Event.
- Dependency-based relation extraction: Subject–verb–object traversal over parse trees to extract verb-grounded triples.
- Coreference resolution: Link “the company”, “they”, “its CEO” back to canonical entity nodes.
- Sentiment + signal extraction: Classify sentences as AcquisitionSignal, RegulatoryRisk, FinancialDistress, GrowthIndicator using text classifiers.68
Step 3: LLM triple extraction
- For each sentence or short passage (with entity spans injected), prompt an LLM:
Given these named entities: [TechCorp, Alice Chen, Google],
extract all subject–predicate–object triples from the text.
Output JSON: {"triples": [{"s": ..., "p": ..., "o": ...}]}
Text: "Alice Chen, CEO of TechCorp, agreed to explore a merger
with Google's cloud division last quarter."
- Output: (Alice Chen, isCEOOf, TechCorp), (TechCorp, exploringMergerWith, Google Cloud), (event, occurredIn, Q3-2024).
Step 4: Triple validation and deduplication
- Span check: Verify subject and object both appear in the source text (prevents hallucination).
- Confidence scoring: Each triple receives a score from the extraction model.
- Low-confidence triples are flagged for human review rather than auto-committed.66
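The span check and confidence gate described above can be combined into one validation function; the 0.7 threshold and the three outcome labels are illustrative choices, not fixed by the pipeline:

```python
def validate_triple(triple, source_text, confidence, threshold=0.7):
    """Span check + confidence gate.
    Reject triples whose subject or object never appears in the source
    (likely hallucinated); route low-confidence survivors to human review."""
    s, _, o = triple
    if s not in source_text or o not in source_text:
        return "reject"
    return "commit" if confidence >= threshold else "review"

text = "Alice Chen, CEO of TechCorp, agreed to explore a merger."
decision = validate_triple(("Alice Chen", "isCEOOf", "TechCorp"), text, 0.9)
```

Only "commit" decisions go straight into the KG; "review" items land in the analyst curation queue described in the next layer.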
Layer ③ — M&A Domain Ontology and Human Review
The M&A Ontology
The ontology defines the semantic schema for the entire KG. It answers questions like: what is an Acquirer, what does hasStrategicOverlap mean, what constraints apply to hasRevenue?6970
Core classes:
ex:Company rdfs:subClassOf ex:Entity
ex:PublicCompany rdfs:subClassOf ex:Company
ex:PrivateCompany rdfs:subClassOf ex:Company
ex:Executive rdfs:subClassOf ex:Person
ex:InvestmentFirm rdfs:subClassOf ex:Entity
ex:Market rdfs:subClassOf ex:Domain
ex:AcquisitionEvent rdfs:subClassOf ex:Event
Key object properties:
| Property | Domain | Range | Notes |
|---|---|---|---|
| hasCompetitor | Company | Company | Symmetric |
| operatesIn | Company | Market | Multi-valued |
| hasExecutive | Company | Executive | |
| hasInvestor | Company | InvestmentFirm | Weighted (ownership %) |
| partnersWith | Company | Company | Undirected |
| acquiredBy | Company | Company | |
| hasRegulatoryFlag | Company | RegulatoryRisk | |
| hasFinancialSignal | Company | FinancialMetric | Timestamped |
Competency questions driving the schema:
- “Which companies in the SaaS market have declining revenue but strong IP portfolios?”
- “Which private companies share key executives or investors with known acquisition targets?”
- “Which companies have recently lost market share to our client’s key competitor?”
- “Are there regulatory flags that would block an acquisition in this jurisdiction?”
Human Analyst Involvement
Human involvement is a core architectural component, not an afterthought.
- Ontology governance board: Domain experts (M&A analysts, legal, finance) own the ontology schema, approve new classes and properties, and resolve ambiguities.
- Active learning review queue: Low-confidence triples (score < threshold) surface to analysts via a curation UI; analysts approve, reject, or correct them — these decisions retrain the extraction models.
- Feedback-driven KG updates: When an analyst adds intelligence (“I know that Company X is actively looking to divest this division”), it is added as a high-provenance triple with the analyst as source.
- Red-teaming and bias review: Periodic review to check whether the KG over-represents certain geographies, sectors, or data sources, which would skew agent recommendations.
Layer ④ — Knowledge Graph Store (Dual + Vector)
A hybrid tri-store architecture balances semantic rigor with operational speed and unstructured retrieval.
RDF / Apache Jena (semantic layer)
- Stores: ontology definitions (TBox) + long-term verified facts (ABox).
- Enables RDFS/OWL reasoning: e.g., infer `isSectorPeer(A, B)` from shared `operatesIn` and `hasCompetitor` axioms.
- SPARQL endpoint for semantic queries: “Find all companies with revenue < $50M in HealthTech that have not been acquired.”
LPG / Neo4j (operational layer)
- Mirrors key operational entities for fast multi-hop traversal.
- Stores signal events and temporal edges (`hasSignal` with timestamps and confidence scores).
- The Graph Data Science (GDS) library enables in-database PageRank, Louvain community detection, and shortest paths.
- Cypher query example for signal detection:
MATCH (c:Company)-[:HAS_SIGNAL]->(s:AcquisitionSignal)
WHERE s.type = "ExecutiveDeparture" AND s.date > date() - duration("P90D")
WITH c, count(s) AS signals
WHERE signals >= 2
RETURN c.name, signals ORDER BY signals DESC;
Vector Store (RAG layer)
- Embeds document chunks using a sentence transformer; indexed in Weaviate or pgvector.
- Used by agents when structured KG traversal does not find an answer — semantic fallback.
- Bridges the gap between structured and unstructured knowledge.
Layer ⑤ — KG Enhancement Engine
This is the analytical “brain” that continuously improves the graph’s quality and density.
Graph algorithms (Neo4j GDS / NetworkX)
| Algorithm | Purpose in M&A context |
|---|---|
| PageRank / HITS | Rank companies by influence in their sector network |
| Betweenness centrality | Find “bridge” companies connecting two industry clusters — prime acquisition targets for market access |
| Louvain community detection | Identify industry clusters and cross-cluster outliers |
| Shortest path (Dijkstra) | Discover hidden connection chains between two companies |
| Jaccard / cosine similarity | Find companies structurally similar to known past acquisition targets |
| Weakly connected components | Detect isolated subgraphs — companies with thin coverage may need more data collection |
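The Jaccard row in the table can be illustrated with plain Python over an in-memory adjacency map. This is a sketch; the company and neighbor names are hypothetical, and the production system would compute this in Neo4j GDS.

```python
def jaccard(graph: dict, a: str, b: str) -> float:
    """Jaccard similarity of two companies' neighborhoods: |N(a) ∩ N(b)| / |N(a) ∪ N(b)|."""
    na, nb = graph.get(a, set()), graph.get(b, set())
    union = na | nb
    return len(na & nb) / len(union) if union else 0.0

def most_similar(graph: dict, target: str, candidates: list) -> str:
    # Rank candidates by structural similarity to a known acquisition target
    return max(candidates, key=lambda c: jaccard(graph, target, c))
```

For example, a candidate sharing two of a known target's three graph neighbors scores 2/3 and ranks above one sharing none.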
Inductive Logic Programming (ILP)
ILP mines the KG for human-interpretable rules that characterize acquisition patterns.
Example rules an ILP system (e.g., AMIE+, RLogic) might learn:
% Companies that share investors with an acquirer tend to be acquired
acquisitionTarget(X) ← hasInvestor(X, I) ∧ hasInvestor(Acquirer, I)
% Companies whose competitors are being acquired are themselves at risk
acquisitionTarget(X) ← hasCompetitor(X, Y) ∧ isBeingAcquired(Y)
% Market concentration rule
acquisitionTarget(X) ← operatesIn(X, M) ∧ decliningMarketShare(X, M)
∧ hasStrategicAsset(X, A)
These rules:
- Explain predictions to analysts in plain language.
- Constrain GNN training with logical regularizers.
- Auto-extend the KG by inferring new candidate edges.
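The shared-investor rule, for instance, can be evaluated directly against an in-memory triple list. This is a minimal sketch with hypothetical entity names; a real system would run the learned rules inside the graph store.

```python
def shared_investor_targets(triples: list, acquirer: str) -> set:
    """acquisitionTarget(X) ← hasInvestor(X, I) ∧ hasInvestor(acquirer, I)."""
    # Investors of the acquirer
    acquirer_investors = {o for s, p, o in triples
                          if s == acquirer and p == "hasInvestor"}
    # Any other company sharing one of those investors is a candidate target
    return {s for s, p, o in triples
            if p == "hasInvestor" and o in acquirer_investors and s != acquirer}
```

Each rule becomes a function of this shape; firing a rule yields candidate `acquisitionTarget` edges plus a plain-language justification for analysts.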
Graph Neural Networks (GNNs)
A GraphSAGE or INDIGO model is trained on the KG to predict missing or future links.
- Training signal: Historical M&A deals (company pairs that merged) = positive edges; random non-merging pairs = negative edges.
- Node features: Revenue, growth rate, headcount, patent count, sentiment score, sector embedding.
- Edge features: Confidence, source count, recency, edge type.
- The GNN predicts `acquisitionLikelihood(CompanyA, CompanyB)` scores as a link prediction task.
- Research using GraphSAGE on M&A data achieved ~81.8% accuracy in identifying acquisition targets.
ILP + GNN synergy:
- ILP rules provide logical constraints and labeling functions to weakly supervise GNN training on unlabeled pairs.
- The GNN handles noisy, implicit signals (social media, news sentiment); ILP handles structural, rule-based patterns (shared investors, board overlaps).
- An LLM-as-Judge validator receives candidate triples from both and assigns a final confidence score before committing to the KG.
Layer ⑥ — Agentic Reasoning Layer
Five specialized agents operate over the shared KG, coordinated by an orchestrator.
Orchestrator Agent
- Monitors trigger conditions (new document ingested, analyst request, scheduled scan).
- Decomposes goals into subtasks: “Evaluate TargetCo” → assign to Scorer + Due Diligence + Report agents in sequence.
- Reads and writes agent coordination state to Neo4j (task nodes, handoff edges).
Signal Detection Agent
- Runs continuously on the ingestion stream.
- Queries Neo4j for newly extracted `AcquisitionSignal` nodes meeting threshold criteria:
- Leadership change signals (executive departures, new M&A-experienced board members).
- Financial distress signals (negative sentiment on earnings, debt downgrades).
- Strategic pivot signals (“exploring strategic alternatives” in SEC language).
- Cross-references signals against ILP rules to elevate high-confidence alerts.
- Pushes confirmed alerts into the KG as `HighAlertTarget` nodes for the Scorer to pick up.
Target Scoring Agent
- Produces a composite acquisition score for each candidate using:
- Financial features (from SEC/financial KG nodes): revenue growth, EBITDA margins, leverage ratios.
- Graph topology features: degree centrality, community membership, bridge-node status.
- GNN link prediction score (probability of acquisition given graph neighborhood).
- Sentiment trajectory: rolling sentiment score from news/social data over 90 days.
- Final score is a weighted combination, with weights tunable by analysts.
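The weighted combination might look like the following sketch. The feature names and default weights here are illustrative assumptions, not the production values; in practice the weights come from analyst tuning.

```python
from typing import Optional

def composite_score(features: dict, weights: Optional[dict] = None) -> float:
    """Weighted combination of normalized 0-1 feature scores; weights are analyst-tunable."""
    weights = weights or {
        "financial": 0.35,   # revenue growth, margins, leverage
        "topology": 0.20,    # centrality, community membership, bridge-node status
        "gnn_link": 0.30,    # GNN link prediction probability
        "sentiment": 0.15,   # 90-day rolling sentiment trajectory
    }
    total = sum(weights.values()) or 1.0  # normalize so custom weights needn't sum to 1
    return sum(w * features.get(name, 0.0) for name, w in weights.items()) / total
```

Missing features default to 0.0, so a thinly covered company is penalized rather than crashing the scorer.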
Due Diligence Agent
- Activated when a target is elevated above a threshold score.
- Executes multi-hop KG traversal to build a structured evidence dossier:
- Uses SPARQL (RDF layer) for formal reasoning over ontology-constrained queries.
- Uses Cypher (LPG layer) for fast traversal of relationship chains.
- Falls back to vector RAG when KG gaps exist.
Report Generation Agent
- Compiles a grounded, explainable M&A brief from the evidence dossier assembled by the Due Diligence Agent.
Layer ⑦ — Analyst Interface and Feedback Loop
The interface is not just an output screen; it is a knowledge refinement engine.
- Target Dashboard: Ranked list of acquisition candidates with composite score, signal count, sector, and graph centrality rank.
- Explainability View: Per-target evidence trail showing which KG paths, ILP rules, and GNN predictions contributed to the score. Analysts can inspect source documents for any supporting triple.
- Feedback Loop (critical):
- Analyst marks a target as “not viable” → that judgment is stored as a negative training example.
- Analyst corrects a bad triple → triggers ontology or extraction model update.
- Analyst adds domain knowledge (“This company is quietly shopping itself”) → high-provenance triple injected directly into KG.
- Aggregate feedback retrains: spaCy NER/RE models, GNN link prediction weights, ILP rule thresholds, and LLM extraction prompts — closing the learning loop.
Key Design Principles
- Provenance first: Every triple carries source document, extraction method, timestamp, and confidence; essential for analyst trust in a high-stakes domain.
- Human-in-the-loop at every layer: Ontology governance, curation queues, feedback retraining, not just at output.
- Defense in depth for quality: spaCy precision → LLM coverage → ILP rule validation → GNN completion → LLM-as-Judge. Multiple layers prevent hallucinated facts from reaching agents.
- Explainability as a requirement: In M&A advisory, every recommendation must be defensible; ILP rules and KG evidence trails make the system auditable.
- Modular and composable: Each layer can be upgraded independently: swap GraphSAGE for a newer GNN, add a new data source, retrain spaCy without rebuilding the entire system.
Implementation steps for building the knowledge graph from SEC 10-K filings
A real-world implementation of a 10-K-to-knowledge-graph pipeline requires six concrete phases: EDGAR access and document parsing, section decomposition, NLP/LLM triple extraction, entity linking and ontology alignment, KG loading, and validation. Recent open-source work on all 101 S&P 100 companies produced nearly 600,000 triples from 2024 10-K filings using 24 entity types and 27 relation types — a useful benchmark for what the pipeline should produce.
Step 1 — Acquire filings from EDGAR
EDGAR provides machine-readable access to all SEC filings via its full-text search and bulk download APIs.
import requests, json
# SEC EDGAR full-text search API
def get_10k_filings(cik: str, count: int = 5):
url = f"https://data.sec.gov/submissions/CIK{cik.zfill(10)}.json"
r = requests.get(url, headers={"User-Agent": "yourname@email.com"})
data = r.json()
filings = data["filings"]["recent"]
results = []
for i, form in enumerate(filings["form"]):
if form == "10-K":
results.append({
"accession": filings["accessionNumber"][i],
"date": filings["filingDate"][i],
"doc": filings["primaryDocument"][i],
})
if len(results) >= count:
break
return results
# Example: Apple CIK = 0000320193
filings = get_10k_filings("0000320193")
Key points:
- Use the `User-Agent` header with a real email address; EDGAR rate-limits anonymous requests.
- Filings are in HTML (iXBRL) format post-2020; older ones are ASCII/SGML.
- The EDGAR bulk download endpoint `https://efts.sec.gov/LATEST/search-index?q=...` lets you pull by form type, date range, or SIC code to target specific sectors.
Step 2 — Parse and section-split documents
10-K filings have a standardized but messy structure. Robust parsing requires handling iXBRL tags, inline XBRL, and scanned PDFs for older filings.
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
def parse_10k(file_path: str) -> dict:
result = converter.convert(file_path)
doc = result.document
# Docling preserves tables, headings, and text blocks separately
return {
"text_blocks": [block.text for block in doc.body],
"tables": doc.tables,
"metadata": doc.metadata,
}
Section identification
10-K filings follow a standardized Item structure (Items 1–15). Each section has distinct M&A relevance:
| Item | Section Name | M&A Relevance |
|---|---|---|
| Item 1 | Business | Core business, products, competitors |
| Item 1A | Risk Factors | Regulatory, operational, market risks |
| Item 1B | Unresolved Staff Comments | Potential legal flags |
| Item 2 | Properties | Physical assets, real estate |
| Item 7 | MD&A | Management narrative, forward guidance |
| Item 7A | Quantitative Risk Disclosures | Interest rate, FX, commodity exposure |
| Item 8 | Financial Statements | Revenue, EBITDA, debt structures |
| Item 13 | Related Transactions | Executive relationships, conflicts |
| Item 14 | Principal Accountant Fees | Audit firm relationships |
import re
SECTION_PATTERNS = {
"business": r"ITEM\s+1[.\s]+BUSINESS",
"risk_factors": r"ITEM\s+1A[.\s]+RISK\s+FACTORS",
"mda": r"ITEM\s+7[.\s]+MANAGEMENT",
"financials": r"ITEM\s+8[.\s]+FINANCIAL\s+STATEMENTS",
"transactions": r"ITEM\s+13[.\s]+CERTAIN\s+RELATIONSHIPS",
}
def split_sections(text: str) -> dict:
sections = {}
anchors = {}
for name, pattern in SECTION_PATTERNS.items():
match = re.search(pattern, text, re.IGNORECASE)
if match:
anchors[name] = match.start()
sorted_keys = sorted(anchors, key=lambda k: anchors[k])
for i, key in enumerate(sorted_keys):
start = anchors[key]
end = anchors[sorted_keys[i+1]] if i+1 < len(sorted_keys) else len(text)
sections[key] = text[start:end]
return sections
Step 3 — spaCy NER on each section
Train or fine-tune a domain-specific spaCy NER model for 10-K text. Existing custom models (e.g., the Jodie NER-10K model) use transfer learning on 5,000+ annotated SEC examples and significantly outperform generic en_core_web_lg on financial text.
Custom entity labels for M&A context:
# Labels to add beyond spaCy defaults
M_AND_A_LABELS = [
"COMPANY", # organization names
"EXECUTIVE", # named officers/directors
"COMPETITOR", # named competitors (Item 1)
"PRODUCT_LINE", # product/service families
"MARKET_SEGMENT", # target customer verticals
"FINANCIAL_METRIC",# revenue, EBITDA, etc.
"REGULATOR", # SEC, FTC, DOJ, CFIUS
"LEGAL_RISK", # litigation, regulatory events
"GEOGRAPHY", # countries/regions of operation
"INVESTOR", # named shareholders > 5%
]
import spacy
nlp = spacy.load("en_core_web_lg") # or fine-tuned Jodie SEC model
def extract_entities(section_text: str, section_name: str) -> list:
doc = nlp(section_text)
entities = []
for ent in doc.ents:
entities.append({
"text": ent.text,
"label": ent.label_,
"start": ent.start_char,
"end": ent.end_char,
"section": section_name,
})
return entities
Dependency-based relation extraction
For each sentence containing two or more entities, extract verb-anchored relations via dependency parse:
def match_entity(text: str, entities_in_sent: list):
    """Return the entity dict whose surface form contains this token, if any (simple heuristic)."""
    for e in entities_in_sent:
        if text and text in e["text"]:
            return e
    return None

def extract_relations_from_sentence(sent, entities_in_sent: list) -> list:
    triples = []
    for token in sent:
        if token.pos_ == "VERB":
            subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
            # Direct objects and attributes hang off the verb; prepositional
            # objects (pobj) hang off a prep child, not the verb itself
            objects = [c for c in token.children if c.dep_ in ("dobj", "attr")]
            for prep in (c for c in token.children if c.dep_ == "prep"):
                objects.extend(c for c in prep.children if c.dep_ == "pobj")
            for subj in subjects:
                for obj in objects:
                    # Only keep if both ends resolve to known entities
                    subj_ent = match_entity(subj.text, entities_in_sent)
                    obj_ent = match_entity(obj.text, entities_in_sent)
                    if subj_ent and obj_ent:
                        triples.append({
                            "subject": subj_ent["text"],
                            "predicate": token.lemma_,
                            "object": obj_ent["text"],
                            "sentence": sent.text,
                        })
    return triples
Step 4 — LLM triple extraction (complementary pass)
Run an LLM extraction pass on the same sections, constrained to the entities already found by spaCy. This dramatically reduces hallucinated nodes while getting semantic coverage spaCy misses.
import json
EXTRACTION_PROMPT = """
You are extracting knowledge graph triples from a SEC 10-K filing.
Known entities in this text: {entities}
Extract subject-predicate-object triples relevant to M&A analysis.
Focus on: competitors, partnerships, risks, financial signals,
executive roles, regulatory exposure, acquisitions.
Return JSON: {{"triples": [{{"s": "...", "p": "...", "o": "...",
"confidence": 0.0-1.0,
"evidence": "exact quote"}}]}}
Text:
{text}
"""
def llm_extract_triples(text: str, entities: list, llm_client) -> list:
entity_names = [e["text"] for e in entities]
prompt = EXTRACTION_PROMPT.format(
entities=json.dumps(entity_names),
text=text[:3000] # chunk to fit context window
)
response = llm_client.chat(prompt)
result = json.loads(response)
return result.get("triples", [])
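Merging the spaCy dependency-based triples with the LLM pass can be done by keying on (subject, predicate, object) and keeping the highest-confidence copy. This is a sketch: it assumes the spaCy triples carry string subject/predicate/object fields and assigns them a fixed baseline confidence, since the dependency extractor produces no score of its own.

```python
def merge_triples(spacy_triples: list, llm_triples: list,
                  spacy_baseline: float = 0.6) -> list:
    """Union the two extraction passes, deduplicating on (s, p, o)."""
    merged = {}
    for t in spacy_triples:
        key = (t["subject"].lower(), t["predicate"].lower(), t["object"].lower())
        merged[key] = {**t, "confidence": spacy_baseline, "method": "spacy"}
    for t in llm_triples:
        key = (t["s"].lower(), t["p"].lower(), t["o"].lower())
        # Keep whichever pass is more confident about this fact
        if key not in merged or t.get("confidence", 0.0) > merged[key]["confidence"]:
            merged[key] = {"subject": t["s"], "predicate": t["p"], "object": t["o"],
                           "confidence": t.get("confidence", 0.0), "method": "llm"}
    return list(merged.values())
```

Triples found by both passes thus carry the LLM's (usually higher) confidence, while spaCy-only triples survive with the baseline score and flow into the same validation gate.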
Chunking strategy for long sections
Item 7 (MD&A) and Item 8 (financials) can exceed 50,000 tokens. Use a sliding window with overlap to avoid cutting mid-sentence context:
def chunk_section(text: str, chunk_size: int = 2000,
overlap: int = 200) -> list:
chunks = []
start = 0
while start < len(text):
end = start + chunk_size
chunks.append({
"text": text[start:end],
"start": start,
"end": min(end, len(text))
})
start += chunk_size - overlap
return chunks
Step 5 — Financial table extraction
Structured financial data (revenue, debt, EBITDA) lives in tables, not prose. Docling and pdfplumber can extract tables as DataFrames; these are converted to triples directly.
import pdfplumber

def extract_financial_tables(pdf_path: str, company_name: str) -> list:
    triples = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                # Assume first row = headers, first col = metric name
                if not table or len(table[0]) < 2:
                    continue
                headers = table[0]  # e.g. ["Metric", "2023", "2022"]
                for row in table[1:]:
                    metric = row[0]
                    for i, year_col in enumerate(headers[1:], 1):
                        if i < len(row) and row[i]:
                            triples.append({
                                "s": company_name,
                                "p": f"hasFinancialMetric_{metric}",
                                "o": row[i],
                                "year": year_col,
                                "source": "financial_statements",
                            })
    return triples
Step 6 — Entity linking and ontology alignment
Raw extracted entity strings must be resolved to canonical ontology URIs; otherwise every variant (“Apple Inc.”, “Apple”, “AAPL”) becomes a different node.
# Step 1: Normalize surface form
def normalize_entity(text: str) -> str:
return text.strip().lower()\
.replace(",", "").replace("inc.", "").replace("corp.", "")
# Step 2: Fuzzy match against known entity registry (SEC CIK database)
from rapidfuzz import process
def link_to_registry(surface: str, registry: dict) -> str:
"""registry: {normalized_name: CIK_URI}"""
match, score, _ = process.extractOne(
normalize_entity(surface), registry.keys()
)
if score >= 85:
return registry[match] # return canonical URI
return None # flag for human review
# Step 3: Map raw predicates to ontology predicates
PREDICATE_MAP = {
"compete": "ex:hasCompetitor",
"acquire": "ex:acquiredBy",
"partner": "ex:partnersWith",
"invest": "ex:hasInvestor",
"sue": "ex:hasLegalRisk",
"operate": "ex:operatesIn",
"disclose": "ex:hasRegulatoryFlag",
}
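Mapping a raw extracted predicate onto an ontology term can combine a lemma lookup with a review flag for anything unmapped. A minimal sketch, reusing the map above; the prefix matching is an assumption made to tolerate inflected forms like "competes" or "acquired":

```python
PREDICATE_MAP = {
    "compete": "ex:hasCompetitor",
    "acquire": "ex:acquiredBy",
    "partner": "ex:partnersWith",
    "invest": "ex:hasInvestor",
    "sue": "ex:hasLegalRisk",
    "operate": "ex:operatesIn",
    "disclose": "ex:hasRegulatoryFlag",
}

def map_predicate(raw: str):
    """Return (ontology predicate or None, needs_review flag)."""
    lemma = raw.strip().lower()
    for stem, uri in PREDICATE_MAP.items():
        if lemma.startswith(stem):  # tolerate inflections: "competes", "acquired"
            return uri, False
    return None, True  # unknown predicate → queue for human review
```

Unmapped predicates feed the same human review queue as low-confidence triples, so the map grows as analysts approve new relations.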
Step 7 — Triple validation pipeline
Before loading any triple into the KG, it passes through a three-stage validation gate:
Triple candidate
│
▼
[Stage 1] Span validation
– subject AND object appear in source text (exact or fuzzy match)
– reject if either invented by LLM
│
▼
[Stage 2] Ontology constraint check
– predicate exists in M&A ontology
– domain/range types match (e.g., hasInvestor: Company → InvestmentFirm)
– reject or flag mismatches
│
▼
[Stage 3] Confidence threshold
– confidence ≥ 0.75 → auto-commit with provenance tag
– 0.50 ≤ confidence < 0.75 → queue for human review
– confidence < 0.50 → discard
│
▼
KG-ready triple with provenance
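The three-stage gate above can be sketched as a single routing function. The thresholds match the diagram; the in-memory `ONTOLOGY` dict and type lookup are simplified stand-ins for the real OWL domain/range check.

```python
ONTOLOGY = {  # predicate -> (domain class, range class); simplified TBox
    "ex:hasInvestor": ("Company", "InvestmentFirm"),
    "ex:hasCompetitor": ("Company", "Company"),
}

def validate_triple(triple: dict, source_text: str, types: dict) -> str:
    """Route a candidate triple to 'commit', 'review', or 'discard'."""
    # Stage 1: span validation — both ends must appear in the source text
    if triple["s"] not in source_text or triple["o"] not in source_text:
        return "discard"
    # Stage 2: ontology constraint check — predicate known, domain/range types match
    constraint = ONTOLOGY.get(triple["p"])
    if constraint is None or (types.get(triple["s"]), types.get(triple["o"])) != constraint:
        return "review"
    # Stage 3: confidence thresholds from the diagram
    c = triple.get("confidence", 0.0)
    if c >= 0.75:
        return "commit"
    return "review" if c >= 0.50 else "discard"
```

A real implementation would use fuzzy span matching (as in Step 4) and an OWL reasoner for stage 2, but the routing logic is the same.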
Each committed triple carries a provenance metadata node:
ex:triple_001 a rdf:Statement ;
rdf:subject ex:Apple ;
rdf:predicate ex:hasCompetitor ;
rdf:object ex:Microsoft ;
prov:wasDerivedFrom ex:Apple_10K_2024 ;
ex:sourceSentence "Apple faces competition from Microsoft..." ;
ex:extractionMethod "LLM-GPT4o" ;
ex:confidence 0.91 ;
ex:dateExtracted "2024-11-15"^^xsd:date .
Step 8 — Load into KG stores
Neo4j (LPG) loading via the official Python driver
from neo4j import GraphDatabase
driver = GraphDatabase.driver("bolt://localhost:7687",
auth=("neo4j", "password"))
def load_triple_neo4j(triple: dict):
with driver.session() as session:
session.run("""
MERGE (s:Company {name: $subject})
MERGE (o:Entity {name: $object, type: $obj_type})
MERGE (s)-[r:RELATION {type: $predicate,
confidence: $confidence,
source: $source,
date: $date}]->(o)
""", **triple)
Apache Jena (RDF) loading via rdflib
from rdflib import Graph, URIRef, Literal, Namespace
from rdflib.namespace import RDF, RDFS, XSD
EX = Namespace("http://ma-kg.org/entity/")
PROP = Namespace("http://ma-kg.org/property/")
g = Graph()
def load_triple_rdf(triple: dict):
s = EX[triple["subject_uri"]]
p = PROP[triple["predicate_uri"]]
o = EX[triple["object_uri"]] if triple["obj_is_entity"] \
else Literal(triple["object"], datatype=XSD.string)
g.add((s, p, o))
# Serialize to Turtle for Jena TDB load
g.serialize("ma_kg.ttl", format="turtle")
# Load via Jena CLI: tdb2.tdbloader --loc=/path/to/tdb ma_kg.ttl
Step 9 — Iterative quality cycle
After the initial bulk load (all S&P 500 10-Ks, for example), run the quality cycle:
- Coverage audit: Count triples per company, per section, per predicate. Flag companies with fewer than expected triples as candidates for re-extraction or human annotation.
- Graph connectivity audit: Find orphaned nodes (degree = 0); trace back to parsing failures or entity linking misses.
- Consistency check via SPARQL (Jena):
# Find companies with conflicting revenue signals
SELECT ?company ?metric1 ?metric2
WHERE {
?company ex:hasRevenue ?metric1 .
?company ex:hasRevenue ?metric2 .
FILTER(?metric1 != ?metric2 && ?metric1 > ?metric2 * 1.5)
}
- Active learning queue: Low-confidence triples surface to analysts, who label them; labels retrain the spaCy RE model and LLM extraction prompts.
- Incremental updates: On each new filing (annual 10-K or quarterly 10-Q), re-run the pipeline for that company only, diff new triples against existing ones, and apply a temporal update (add a timestamped version; do not overwrite).
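The incremental-update diff above can be sketched as a set comparison on (s, p, o) keys, where new or changed triples get a timestamped version and existing triples are never overwritten. A minimal sketch; the dict shapes are illustrative.

```python
def diff_triples(existing: list, incoming: list, as_of: str):
    """Return (new_versions, unchanged) for an incremental, append-only update."""
    def key(t):
        return (t["s"], t["p"], t["o"])

    seen = {key(t) for t in existing}
    # Facts not already in the KG become new timestamped versions
    new_versions = [{**t, "valid_from": as_of} for t in incoming if key(t) not in seen]
    # Facts already present are left untouched (append-only temporal model)
    unchanged = [t for t in incoming if key(t) in seen]
    return new_versions, unchanged
```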
Tools summary
| Purpose | Recommended Tool |
|---|---|
| EDGAR filing access | sec-edgar-downloader Python package, EDGAR REST API |
| Document parsing | Docling (PDF/iXBRL), pdfplumber (tables), BeautifulSoup (HTML) |
| NER | spaCy + en_core_web_lg fine-tuned on SEC corpus (Jodie NER-10K) |
| Relation extraction | spaCy dependency parser + custom RE model |
| LLM extraction | GPT-4o / LLaMA 3 70B with structured JSON output |
| Entity linking | RapidFuzz + SEC CIK database, OpenRefine |
| Triple validation | Rule-based span check + OWL reasoner |
| RDF store | Apache Jena TDB2 + Fuseki SPARQL endpoint |
| LPG store | Neo4j + GDS library |
| Orchestration | Prefect or Apache Airflow for pipeline scheduling |
Code example for NER extraction from 10-K using Jodie spaCy model
Here is a complete, runnable code example covering the full pipeline from EDGAR download through Jodie NER extraction to a structured entity output. Since the Jodie model is an older spaCy v2 package, the example shows how to use it directly and how to replicate its approach with a modern spaCy v3 fine-tuned model when needed.
Installation
# Core dependencies
pip install sec-edgar-downloader beautifulsoup4 spacy
# Download spaCy base model (used standalone or as fallback)
python -m spacy download en_core_web_lg
# Install Jodie NER-10K model (spaCy v2 package from GitHub)
# Clone the repo and install the .tar.gz
git clone https://github.com/jodietheai/NER-10K.git
cd NER-10K
pip install en_Jodie-0.0.0.tar.gz
# For spaCy v3+ environments, install compatibility shim:
pip install spacy-legacy
Note: The Jodie model was built on spaCy v2. For spaCy v3+ environments (recommended for production), see the fine-tuning section at the bottom of this page.
Step 1 — Download a 10-K from EDGAR
from sec_edgar_downloader import Downloader
import os
def download_10k(ticker: str, company_name: str,
email: str, save_dir: str = "./filings",
limit: int = 1) -> list:
"""
Download the most recent 10-K filings for a ticker.
Returns list of file paths to downloaded documents.
"""
dl = Downloader(company_name, email, save_dir)
dl.get("10-K", ticker, limit=limit, download_details=True)
# Collect downloaded HTML/HTM files
paths = []
for root, dirs, files in os.walk(save_dir):
for f in files:
if f.endswith((".htm", ".html", ".txt")):
paths.append(os.path.join(root, f))
return paths
# Usage
filing_paths = download_10k(
ticker="AAPL",
company_name="MyResearchFirm",
email="analyst@myresearchfirm.com"
)
print(f"Downloaded {len(filing_paths)} files")
Step 2 — Parse HTML and extract clean text
10-K filings are iXBRL/HTML; BeautifulSoup strips tags and inline XBRL annotations.
from bs4 import BeautifulSoup
import re
def parse_10k_html(file_path: str) -> str:
"""
Extract clean text from 10-K HTML/iXBRL filing.
Preserves paragraph boundaries; strips boilerplate tags.
"""
with open(file_path, "r", encoding="utf-8", errors="replace") as f:
raw = f.read()
soup = BeautifulSoup(raw, "html.parser")
# Remove non-content tags
for tag in soup(["script", "style", "ix:header",
"ix:nonfraction", "ix:nonnumeric"]):
tag.decompose()
# Get text with paragraph spacing preserved
text = soup.get_text(separator="\n")
# Clean excessive whitespace
text = re.sub(r"\n{3,}", "\n\n", text)
text = re.sub(r"[ \t]{2,}", " ", text)
text = text.strip()
return text
def split_into_sections(text: str) -> dict:
"""
Split 10-K text into standard Item sections.
Returns dict keyed by section name.
"""
SECTION_PATTERNS = {
"business": r"(?i)(item\s+1\.?\s{0,5}business\b)",
"risk_factors": r"(?i)(item\s+1a\.?\s{0,5}risk\s+factors\b)",
"legal": r"(?i)(item\s+3\.?\s{0,5}legal\s+proceedings\b)",
"mda": r"(?i)(item\s+7\.?\s{0,5}management.{0,30}discussion\b)",
"financials": r"(?i)(item\s+8\.?\s{0,5}financial\s+statements\b)",
"executives": r"(?i)(item\s+10\.?\s{0,5}directors\b)",
"transactions": r"(?i)(item\s+13\.?\s{0,5}certain\s+relationships\b)",
}
anchors = {}
for name, pattern in SECTION_PATTERNS.items():
match = re.search(pattern, text)
if match:
anchors[name] = match.start()
sorted_keys = sorted(anchors, key=lambda k: anchors[k])
sections = {}
for i, key in enumerate(sorted_keys):
start = anchors[key]
end = (anchors[sorted_keys[i + 1]]
if i + 1 < len(sorted_keys) else len(text))
sections[key] = text[start:end]
return sections
Step 3 — Load the Jodie NER model
import spacy
def load_ner_model(use_jodie: bool = True):
"""
Load Jodie SEC-specific NER (spaCy v2) or fallback to en_core_web_lg.
For spaCy v3+, use a fine-tuned model instead (see Step 6).
"""
if use_jodie:
try:
# spaCy v2 style load by package name after pip install
nlp = spacy.load("en_Jodie")
print("Loaded Jodie SEC NER model")
except OSError:
print("Jodie model not found, falling back to en_core_web_lg")
nlp = spacy.load("en_core_web_lg")
else:
nlp = spacy.load("en_core_web_lg")
return nlp
nlp = load_ner_model(use_jodie=True)
Step 4 — Run NER extraction with provenance
from dataclasses import dataclass
from typing import Optional
@dataclass
class ExtractedEntity:
text: str
label: str
start_char: int
end_char: int
sentence: str
section: str
filing: str
confidence: Optional[float] = None
def extract_entities_from_section(
section_text: str,
section_name: str,
filing_id: str,
nlp,
max_chunk: int = 100_000 # spaCy default max_length guard
) -> list[ExtractedEntity]:
"""
Run NER over a 10-K section.
Chunks long sections to stay within spaCy's max_length.
Returns a list of ExtractedEntity objects with full provenance.
"""
# spaCy has a default max_length; chunk if needed
chunks = [section_text[i:i + max_chunk]
for i in range(0, len(section_text), max_chunk)]
entities = []
for chunk in chunks:
doc = nlp(chunk)
for sent in doc.sents:
sent_ents = [e for e in doc.ents
if e.start >= sent.start and e.end <= sent.end]
for ent in sent_ents:
entities.append(ExtractedEntity(
text = ent.text.strip(),
label = ent.label_,
start_char = ent.start_char,
end_char = ent.end_char,
sentence = sent.text.strip(),
section = section_name,
filing = filing_id,
))
return entities
Step 5 — Post-process: deduplicate and normalize
from collections import defaultdict
import re
# M&A-relevant entity labels from Jodie + spaCy defaults
MA_RELEVANT_LABELS = {
"ORG", # organizations (Jodie + spaCy)
"PERSON", # executives, investors
"GPE", # geopolitical entities (countries, states)
"MONEY", # financial figures
"PERCENT", # growth rates, margins
"DATE", # fiscal year references
"PRODUCT", # product/service names
"LAW", # regulations, legal acts
"NORP", # nationalities, political groups
# Jodie-specific labels (if model is loaded):
"COMPETITOR", "EXECUTIVE", "MARKET",
"LEGAL_RISK", "FINANCIAL_METRIC",
}
def normalize_entity_text(text: str) -> str:
"""Lowercase, strip legal suffixes for company matching."""
text = text.strip()
# Remove common corporate suffixes for matching
text = re.sub(
r"\b(Inc\.?|Corp\.?|LLC\.?|Ltd\.?|Co\.?|Group|Holdings?)\b",
"", text, flags=re.IGNORECASE
)
return re.sub(r"\s+", " ", text).strip().lower()
def filter_and_deduplicate(
entities: list[ExtractedEntity],
min_length: int = 2
) -> list[dict]:
"""
Filter to M&A-relevant labels, remove duplicates.
Returns deduplicated entities with occurrence count.
"""
seen = defaultdict(lambda: {"count": 0, "sentences": [],
"label": "", "sections": set()})
for ent in entities:
if ent.label not in MA_RELEVANT_LABELS:
continue
if len(ent.text) < min_length:
continue
key = (normalize_entity_text(ent.text), ent.label)
seen[key]["count"] += 1
seen[key]["label"] = ent.label
seen[key]["sections"].add(ent.section)
if len(seen[key]["sentences"]) < 3: # keep up to 3 example sentences
seen[key]["sentences"].append(ent.sentence)
result = []
for (norm_text, label), data in seen.items():
result.append({
"normalized_text": norm_text,
"label": label,
"mention_count": data["count"],
"sections": list(data["sections"]),
"example_sentences": data["sentences"],
})
# Sort by frequency — most prominent entities first
return sorted(result, key=lambda x: x["mention_count"], reverse=True)
Step 6 — Full pipeline runner
import json
from pathlib import Path
def run_10k_ner_pipeline(
ticker: str,
company_name: str,
email: str,
output_dir: str = "./output",
use_jodie: bool = True
) -> dict:
"""
End-to-end: download 10-K → parse → split sections →
run NER → deduplicate → save JSON.
"""
Path(output_dir).mkdir(parents=True, exist_ok=True)
# 1. Download
print(f"Downloading 10-K for {ticker}...")
paths = download_10k(ticker, company_name, email)
if not paths:
raise FileNotFoundError("No filing files downloaded.")
filing_path = paths[0]
filing_id = f"{ticker}_10K_{Path(filing_path).stem}"
# 2. Parse
print("Parsing HTML...")
raw_text = parse_10k_html(filing_path)
sections = split_into_sections(raw_text)
print(f" Found {len(sections)} sections: {list(sections.keys())}")
# 3. Load model
nlp = load_ner_model(use_jodie=use_jodie)
# 4. Extract entities per section
all_entities = []
for section_name, section_text in sections.items():
print(f" Extracting entities from: {section_name} "
f"({len(section_text):,} chars)...")
ents = extract_entities_from_section(
section_text, section_name, filing_id, nlp
)
all_entities.extend(ents)
print(f" → {len(ents)} raw entities found")
# 5. Deduplicate and normalize
deduped = filter_and_deduplicate(all_entities)
print(f"\nTotal unique entities (M&A-relevant): {len(deduped)}")
# 6. Save
output = {
"filing_id": filing_id,
"ticker": ticker,
"sections_processed": list(sections.keys()),
"total_raw_entities": len(all_entities),
"unique_entities": len(deduped),
"entities": deduped,
}
out_path = f"{output_dir}/{filing_id}_entities.json"
with open(out_path, "w") as f:
json.dump(output, f, indent=2)
print(f"Saved to: {out_path}")
return output
# Run it
results = run_10k_ner_pipeline(
ticker="AAPL",
company_name="MyResearchFirm",
email="analyst@myresearchfirm.com",
use_jodie=True
)
# Preview top entities
for ent in results["entities"][:10]:
print(f" [{ent['label']:18s}] {ent['normalized_text']:35s} "
f"(mentioned {ent['mention_count']}x "
f"in: {', '.join(ent['sections'])})")
Example output:
[ORG ] apple inc. (mentioned 87x in: business, mda, financials)
[ORG ] google (mentioned 14x in: business, risk_factors)
[PERSON ] tim cook (mentioned 9x in: executives, mda)
[MONEY ] $383.3 billion (mentioned 7x in: financials, mda)
[ORG ] microsoft (mentioned 6x in: business, risk_factors)
[GPE ] united states (mentioned 31x in: business, risk_factors)
[LAW ] digital markets act (mentioned 5x in: risk_factors, legal)
[PERCENT ] 3.4% (mentioned 4x in: financials)
Step 7 — Fine-tuning for spaCy v3 (modern alternative to Jodie)
For production use on spaCy v3+, fine-tune en_core_web_lg on annotated 10-K examples using Prodigy or Label Studio.
# Create a spaCy training config
python -m spacy init config config.cfg --lang en --pipeline ner
# Edit config.cfg to set:
# [paths] train = ./train.spacy, dev = ./dev.spacy
# [components.ner.model] tok2vec source = en_core_web_lg
# Convert annotated data (from Prodigy/Label Studio JSONL export)
python -m spacy convert annotations.jsonl ./data --converter ner
# Fine-tune
python -m spacy train config.cfg --output ./sec_ner_model \
--paths.train ./data/train.spacy \
--paths.dev ./data/dev.spacy \
--initialize.vectors en_core_web_lg
# Load fine-tuned model
nlp = spacy.load("./sec_ner_model/model-best")
The fine-tuned v3 model supports the full spaCy component pipeline (sentencizer, dependency parser, coref), which integrates seamlessly with the relation extraction step in the broader KG pipeline.
https://arxiv.org/html/2502.09956v1
https://memgraph.com/blog/extract-entities-build-knowledge-graph-memgraph-spacy
https://support.prodi.gy/t/text-to-knowledge-graph-prodigy-spacy/4521
https://fairplus.github.io/the-fair-cookbook/content/recipes/interoperability/nlp2kg/creating-knowledge-graph-from-text.html
https://stackoverflow.com/questions/57291975/how-can-i-provide-a-relation-extraction-data-set-including-tuple-for-casual-infe
https://labs.lamatic.ai/p/llm-vs-nlp/
https://corp.yonyx.com/customer-service/nlp-vs-llm/
https://github.com/robert-mcdermott/ai-knowledge-graph
https://www.lettria.com/lettria-lab/introduction-to-knowledge-graph-completion-with-llms
https://www.linkedin.com/pulse/building-knowledge-graphs-using-llms-yugank-aman-v7jaf
https://aclanthology.org/2025.genaik-1.10.pdf
https://www.reddit.com/r/MachineLearning/comments/1opohcg/p_generating_knowledge_graphs_from_unstructured/
https://milvus.io/ai-quick-reference/what-is-a-knowledge-graph-ontology
https://www.puppygraph.com/blog/knowledge-graph-vs-ontology
https://www.falkordb.com/blog/understanding-ontologies-knowledge-graph-schemas/
https://enterprise-knowledge.com/wp-content/uploads/2020/01/Ontologies-vs.-Knowledge-Graphs.pdf
https://www.ontoforce.com/knowledge-graph/ontology
https://pmc.ncbi.nlm.nih.gov/articles/PMC12649945/
https://www.nist.gov/document/nist-ai-rfi-cubrcinc002pdf
https://arxiv.org/html/2503.05388v1
https://www.scitepress.org/Papers/2013/45179/45179.pdf
https://oa.upm.es/6115/1/CAEPIA09_-_Common_Pitfalls_in_Ontology_Development_-_final_version_fixed.pdf
https://www.webology.org/2018/v15n2/a173.pdf
https://www.dremio.com/wiki/apache-jena/
https://www.tigergraph.com/blog/rdf-vs-property-graph-choosing-the-right-foundation-for-knowledge-graphs/
https://www.puppygraph.com/blog/property-graph-vs-rdf
https://christinemdraper.wordpress.com/2017/04/09/getting-started-with-rdf-sparql-jena-fuseki/
https://taylorandfrancis.com/knowledge/Engineering_and_technology/Computer_science/Apache_Jena/
https://homepages.inf.ed.ac.uk/libkin/teach/beijing2018/neo4j-beijing.pdf
https://docs.nebula-graph.io/3.3.0/1.introduction/0-1-graph-database/
https://www.wisecube.ai/blog/knowledge-graphs-rdf-or-property-graphs-which-one-should-you-pick/
https://docs.oracle.com/en/database/oracle/oracle-database/26/rdfrm/rdf-graph-support-apache-jena.html
https://elixirforum.com/t/storing-system-information-in-a-graph-database/54159
https://dariastepanova.github.io/files/conferences/RW2018/paper/RW2018paper.pdf
https://community.sap.com/t5/technology-blog-posts-by-sap/knowledge-graphs-with-inductive-logic-programming/ba-p/13517645
https://www.sciencedirect.com/science/article/abs/pii/S0004370224000444
https://web.cs.ucla.edu/~yzsun/papers/2022_KDD_RLogic.pdf
https://graph4ai.github.io/graph4nlp/tutorial/knowledge_graph_completion.html
https://proceedings.neurips.cc/paper_files/paper/2021/hash/0fd600c953cde8121262e322ef09f70e-Abstract.html
https://www.reddit.com/r/MachineLearning/comments/1eg674y/discussion_thoughts_on_knowledge_graphs_and_graph/
https://zbrain.ai/knowledge-graphs-for-agentic-ai/
https://arxiv.org/html/2508.17906v2
https://damir.cavar.me/Pubs/Mapping_SEC_Deep_NLP_Knowledge_Graph.pdf
https://developers.lseg.com/en/article-catalog/article/predicting-MnA-targets-using-ML-Unlocking-the-potential-of-NLP-variables
https://www.artefact.com/blog/will-the-future-of-agentic-ai-rely-on-knowledge-graphs/
https://theaiinnovator.com/how-graph-thinking-empowers-agentic-ai/
https://www.deloitte.com/us/en/services/consulting/blogs/business-operations-room/agentic-ai-in-manufacturing.html
https://www.cio.com/article/4138732/the-transplantable-skeleton-why-agentic-ai-infrastructure-must-survive-corporate-surgery.html
https://www.sciencedirect.com/science/article/pii/S027861252500216X
https://www.linkedin.com/posts/amyhodler_i-wanted-to-share-the-recording-of-what-was-activity-7378891388731637761-I1KK
https://www.techrxiv.org/users/854434/articles/1245167-intelligent-anti-money-laundering-transaction-pattern-recognition-system-based-on-graph-neural-networks
https://towardsai.net/p/machine-learning/a-look-at-finreflectkg-ai-driven-knowledge-graph-in-finance
https://aws.amazon.com/blogs/industries/agentic-graphrag-for-capital-markets/
https://stackoverflow.com/questions/59480001/extracting-text-section-from-edgar-10-k-filings-html
https://journalwjarr.com/sites/default/files/fulltext_pdf/WJARR-2025-2517.pdf
https://sagemaker-jumpstart-industry-pack.readthedocs.io/en/latest/notebooks/finance/notebook4/SEC_10K_10Q_8K_section_extraction.html
https://codesignal.com/learn/courses/practical-applications-of-spacy-for-real-life-tasks/lessons/information-extraction-from-legal-documents-using-spacy
https://intuitionlabs.ai/articles/llm-financial-document-analysis
https://docs.snaplogic.com/agentcreator/agentcreator-use-cases/use-case-sec-filing.html
https://huggingface.co/datasets/kritsadaK/EDGAR-CORPUS-Financial-Summarization/blob/main/README.md
https://stackoverflow.com/questions/54855780/how-to-create-ner-pipeline-with-multiple-models-in-spacy
https://stackoverflow.com/questions/74225258/downloading-all-10-k-filings-for-sec-edgar-in-python
https://sec-api.io/docs/sec-filings-render-api/python-example
https://github.com/osamadev/Named-Entity-Recognition-Using-Spacy/blob/master/NER_Spacy.ipynb
https://www.reddit.com/r/LanguageTechnology/comments/1jv0aos/anyone_experienced_with_pushing_large_spacy_ner/