Knowledge Graphs
This session covers integrating knowledge graphs into agentic AI systems to enhance reasoning, memory, and factual grounding. It details the construction of graphs using both traditional NLP tools like spaCy for precision and Large Language Models for semantic flexibility. The session explains how formal ontologies serve as the structural blueprint for these systems, ensuring consistent vocabulary and logical constraints.
Furthermore, it compares RDF/SPARQL and LPG/Cypher database architectures, highlighting their respective strengths in semantic reasoning and high-performance graph traversals. Advanced techniques such as Inductive Logic Programming and Graph Neural Networks are introduced to discover latent patterns and verify connections within the data.
Finally, a practical merger and acquisition use case illustrates a multi-layered architecture where specialized agents utilize these technologies to identify and evaluate corporate targets.
| Speaker | Text |
|---|---|
| Alex | This is the brief on knowledge graphs for agentic AI. So, while you’re probably super comfortable with tabular data and vector stores for RAG, modern AI agents actually need a persistent, structured, semantic memory to truly reason, and, well, that’s exactly the gap knowledge graphs fill. First, let’s break down the core structure. You know how relational databases lock knowledge into these rigid rows and columns with really heavy joins? Well, a graph is totally different. It’s a mathematical structure made of nodes, like a person or a company, and edges, which represent the relationships between them. In a graph, relationships are literally first-class citizens, making it way easier to natively represent complex interconnected knowledge. Second, we’ve got to talk about the power of ontologies. This is what turns a basic graph into a true knowledge base. See, a traditional database schema just dictates data layout, right? But an ontology is a formal semantic blueprint that defines allowed classes, properties, and logical constraints. For agentic AI, it acts as a shared world model, giving different agents a standardized vocabulary for symbolic reasoning, though you definitely have to watch out for ontological bloat or overly complex rules. Finally, querying is fundamentally different. Because graphs serve a totally different purpose than tables, you aren’t doing standard SQL joins or running vector similarity searches. Instead, you’re using pattern matching and path traversals with specialized languages like SPARQL for semantic reasoning or Cypher for multi-hop traversals. By combining the flexible processing of LLMs with the explicit semantic structure of knowledge graphs, you’re giving your agentic systems a truly durable and explainable foundation for real-world reasoning. |
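
To make the brief's description of nodes, edges, pattern matching, and multi-hop traversal concrete, here is a minimal sketch in plain Python. The entity names, the predicates, and the helper functions `match` and `reachable` are all invented for illustration; a real system would use a graph database or a library such as NetworkX or rdflib rather than a list of tuples.

```python
from collections import deque

# Hypothetical facts, invented for illustration only.
# Each edge is a (subject, predicate, object) triple.
TRIPLES = [
    ("Alice", "works_at", "AcmeCorp"),
    ("AcmeCorp", "competes_with", "Globex"),
    ("Globex", "owned_by", "Initech"),
    ("Bob", "works_at", "Globex"),
]


def match(s=None, p=None, o=None):
    """Pattern matching: return triples that fit the pattern; None is a wildcard."""
    return [
        t for t in TRIPLES
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def reachable(start, max_hops):
    """Multi-hop traversal: breadth-first search along outgoing edges.

    Returns a dict mapping each reachable node to its hop distance from start.
    """
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for _, _, nxt in match(s=node):
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    return seen


who_works_where = match(p="works_at")   # one-edge pattern
three_hops = reachable("Alice", 3)      # path traversal
```

The point of the sketch is that relationships are stored as data in their own right, so a question like "what is within three hops of Alice?" is a traversal rather than a chain of joins.
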
| Speaker | Text |
|---|---|
| Alex | Welcome to today’s deep dive. Since you are, uh, deep into your data science graduate studies, you already know your way around machine learning, large language models, generative AI, and RAG architectures. You know how to build the foundation of an agentic system, right? You’ve got the building blocks. Exactly. But today we are looking at the missing piece. We are talking about the critical layer that takes an AI agent from being just a, well, a really fluent text generator to a system that is genuinely logical, grounded, and structurally reliable. |
| Sam | Because, let’s be honest, while LLMs are incredibly articulate, under the hood they’re essentially probabilistic word predictors. |
| Alex | Yeah, they generate responses that can be structurally shallow, or they just hallucinate when you push them into complex reasoning. And the fix for that? Structured, semantically grounded knowledge. |
| Sam | That is the core challenge right now in AI research. It’s about giving your system a foundation of verifiable facts, an actual world model, rather than, you know, relying purely on the statistical likelihood of what token comes next in a sequence. |
| Alex | And to figure out exactly how we do that, we have an incredibly dense, fascinating stack of source material today. We are looking at comprehensive lecture notes and architectural blueprints focused entirely on knowledge graphs for agentic systems. It’s a gold mine of a source. It really is. We are going to explore triple extraction, how to design robust ontologies, and we will anchor all of this by looking at a massive real-world solution architecture for mergers and acquisitions: specifically, how an AI agent can analyze SEC 10-K filings to find corporate acquisition targets. |
| Sam | The mergers and acquisitions use case is the perfect lens for this. I mean, it is a high-stakes environment. You cannot afford an AI hallucinating a competitor or misinterpreting a revenue stream. The system has to be flawless, right? And getting to that level of precision is a journey from messy, unstructured text all the way to a highly sophisticated neurosymbolic reasoning engine. |
| Alex | OK, let’s unpack this. We know that to build a knowledge graph, we need to extract information in the form of semantic triples. |
| Sam | Right. A subject, a predicate, and an object, like the classic example from the notes: Alan Turing is the subject node, “worked at” is the edge, and Bletchley Park is the target node. Exactly. |
| Alex | But since you already know your way around text processing, you know that pulling these relationships out of something as dense as a 10-K financial filing is notoriously difficult. Historically, we’ve relied on traditional NLP pipelines for this. Why hasn’t that been enough? |
| Sam | Well, traditional NLP relies heavily on custom handcrafted rules. You might use a library like spaCy to build a custom named entity recognition model to identify companies and financial figures. Then you use dependency-based relation extraction to find the verb patterns linking those entities together. |
| Alex | And when it works, it is fantastic. |
| Sam | Oh, it’s incredibly predictable, fast, and cheap to run at scale. The precision in narrow domains is rock solid. |
| Alex | But the recall is terrible, right? Because if a financial document uses convoluted phrasing, |
| Sam | which corporate lawyers love to do, right, |
| Alex | or it implies a relationship across three different sentences, those rigid, handcrafted rules just miss it entirely. The system is blind to anything that doesn’t fit its exact template. |
| Sam | That is the fatal flaw, and it’s why so many developers immediately swing to the other extreme, which is LLM-centric extraction. They just scrap the dependency rules completely and prompt an LLM. You feed it a passage from the 10-K and ask it to output a JSON-formatted list of triples. |
| Alex | And suddenly your semantic coverage goes through the roof. |
| Sam | Exactly. The LLM understands paraphrasing, it can bridge long-range dependencies across paragraphs, and it adapts to flexible schemas without you having to write a single regular expression. |
| Alex | I see why that semantic flexibility is a massive upgrade, but aren’t we just trading low recall for massive hallucination risks? An LLM might just invent a subsidiary that doesn’t exist because one statistically tends to appear in that context. How is that a reliable foundation for an agent? |
| Sam | Well, it isn’t if you do it blindly. What’s fascinating here is that the most robust agentic systems don’t force you to choose one or the other. The blueprints detail a hybrid sweet spot. So, combining them? Yes. You leverage traditional NLP like spaCy to do what it is undeniably best at: identifying rigid sentence boundaries and performing robust, strict entity extraction. You isolate the exact strings of text that represent companies, executives, or financial assets. |
| Alex | And then you hand those isolated pieces to the LLM. |
| Sam | Correct. You feed the LLM the sentence, but you also give it the strict entity bounds you just extracted. You prompt the LLM to extract the relationships between only those provided entities. |
| Alex | Ah, so you are essentially building heavy guardrails. By constraining the search space, the LLM cannot hallucinate new nodes in your graph, but you still harness its incredible pattern matching to understand the complex, messy relationships between the nodes that actually exist. |
| Sam | It gives the agentic system a highly reliable, semantically rich foundation of facts. It solves the extraction problem, but having a million perfectly extracted triples is completely useless if the AI agent doesn’t actually understand what those relationships mean in the real world. |
| Alex | How do we give it that rulebook? That brings us to the core of the system: the ontology, which elevates a massive pile of nodes and edges into an actual reasoning engine. |
| Sam | An ontology is the formal semantic blueprint of your domain. A knowledge graph without an ontology is just a bunch of floating, ambiguous data points. The ontology provides the shared meaning. |
| Alex | It establishes the classes. |
| Sam | Yes, the types of entities that can exist, like Company or Market. It establishes the properties, which are the allowed relations, like “has strategic overlap,” and most importantly for AI agents, it establishes constraints. |
| Alex | The constraints are what fascinated me in the notes. Take the e-commerce domain as a simple example. If you have a property called “place order,” the ontology strictly dictates that the domain, the starting point of that relationship, must be a customer, and the range, the end point, must be an order. |
| Sam | It physically cannot go from a product to a customer. |
| Alex | Right. So if an LLM tries to generate an action or a query that violates that constraint, the graph instantly rejects it. |
| Sam | The graph acts as a logic gate. But designing that blueprint is an incredibly delicate process. You have to design the TBox, or terminological box. Think of the TBox as the strict dictionary of your graph. It defines the rules of what a company is before you ever add a specific company like Microsoft into the system. You don’t just open a whiteboard and start drawing conceptual nodes. You have to start with competency questions. |
| Alex | Competency questions. These are the natural language problems your system absolutely must be able to resolve, right? |
| Sam | Yes, you work backward. In our M&A example, a competency question might be: which companies in the SaaS market have declining revenue and share an investor with their primary competitor? That specific question dictates exactly what classes, properties, and constraints your TBox needs to model. Nothing more, nothing less. |
| Alex | But the reality of enterprise data is messy. I imagine a lot of developers get this wrong. The sources mention several pitfalls identified by ontology evaluation tools like OOPS!, the OntOlogy Pitfall Scanner. |
| Sam | OOPS! is a fantastic diagnostic tool for this. It scans your TBox for structural errors. A very common pitfall it catches is a lack of annotations. That means classes are created but don’t have clear human-readable labels or definitions attached to them. That makes the graph totally opaque to other developers or to an LLM trying to read the schema. Another massive issue is orphaned elements, |
| Alex | classes that a developer created for some edge case but never connected back to the main hierarchical tree. |
| Sam | Exactly. They just float there uselessly. |
| Alex | Ah yes, and my personal favorite pitfall, ontological bloat. This is when a developer tries to model the exact temperature of the coffee on the CEO’s desk just in case the M&A analysts might need to know it one day. |
| Sam | It overcomplicates the schema and makes the entire reasoning system slow and fragile. |
| Alex | So the obvious question for someone studying generative AI: if building the TBox is this tedious, can we just use LLMs to build the ontology for us? |
| Sam | You can use them to assist through a method called CQ-by-CQ, that is, feeding those competency questions into an LLM one by one and asking it to generate small modules of the ontology. But you have to treat the LLM’s output with extreme skepticism, |
| Alex | because it hallucinates, right? |
| Sam | LLMs are prone to hallucinating logically inconsistent axioms. They might create a rule in module A that fundamentally contradicts a constraint in module B. |
| Alex | So you have to clean it up. The sources detail how we can use graph algorithms to mathematically optimize the LLM’s messy output. For instance, you can run minimum spanning tree algorithms to prune away structural redundancy, |
| Sam | and you can use community detection algorithms. These algorithms scan the structure to find natural clusters of concepts, areas where nodes are densely connected to each other but sparsely connected to the rest of the graph. When the algorithm finds those communities, it tells you that those concepts should probably be split out into their own separate modular ontologies to keep the system clean. |
| Alex | You also run centrality measures to identify overloaded concepts, nodes that have too many edges passing through them and are trying to do too much work in your model. |
| Sam | Exactly, it keeps the structure balanced. |
| Alex | So what does this all mean? We have this optimized, mathematically clean semantic blueprint. We know our constraints. We have our triples. But in the database world, there’s a massive holy war over how you actually store and query this data. We have RDF on one side and labeled property graphs on the other. Why does an AI agent care which one we use? |
| Sam | Because the database dictates how the agent can reason. Let’s look at the RDF side first. RDF stands for Resource Description Framework, usually queried using a language called SPARQL and running on engines like Apache Jena. RDF is built entirely for semantic interoperability and formal logical reasoning. Every single piece of data is strictly represented as a triple using global identifiers called IRIs, or Internationalized Resource Identifiers. |
| Alex | And because RDF is so mathematically formal, it supports things like OWL, the Web Ontology Language. I see that mentioned constantly in the blueprints, but what does OWL actually allow the system to do, fundamentally? |
| Sam | OWL gives the database itself the power to infer new facts without human intervention. It uses built-in RDFS and OWL reasoners. Let me give you an M&A example. Let’s say your ontology has a rule stating that a wholly owned subsidiary is a subclass of corporate asset. The raw text of the 10-K says Company A owns a wholly owned subsidiary, Company B. If your agent is running a query looking for all corporate assets of Company A, the RDF reasoner automatically knows to include Company B, even though the text never explicitly used the word “asset.” The logic is baked into the database layer. |
| Alex | That is incredibly powerful for strict compliance and logic. But I see everyone in the industry using tools like Neo4j. Where does that fit in? |
| Sam | Neo4j is a labeled property graph, or LPG, database. It uses a query language called Cypher. The structural difference is massive. In RDF, everything is a strict triple. In an LPG, both nodes and edges can have multiple key-value properties attached directly to them. |
| Alex | So you can attach metadata. |
| Sam | Yes, you can attach a timestamp, a confidence score, or an extraction source directly onto the strategic overlap edge. LPGs sacrifice some of that formal OWL logic in exchange for incredible developer ergonomics and lightning-fast, high-performance multi-hop traversals. |
| Alex | So if I’m building a modern agentic system, which one do I choose? Do I want deep logic or fast traversals? |
| Sam | If we connect this to the bigger picture, you don’t choose. You build a hybrid architecture. The blueprints show that you keep your formal ontology, your TBox taxonomies, and your long-term semantic rules in the RDF layer. This gives your agents a strict, unshakable, logical foundation. But then you mirror your operational facts, the real-time agent interactions, the dynamically extracted entities from the daily news, into an LPG like Neo4j. This gives the agent the high-speed graph analytics required to actually run complex scoring algorithms in real time. |
| Alex | Here’s where it gets really interesting, because you just mentioned scoring algorithms. We are now moving into the frontier of this field: neurosymbolic AI. This is the fusion of that strict symbolic logic we just talked about with the intuitive pattern matching of neural networks. |
| Sam | This intersection is exactly what data science graduate students need to master. Let’s look at the symbolic side first: inductive logic programming, or ILP. ILP algorithms mine the knowledge graph to learn explicit, human-readable first-order logic rules. |
| Alex | Explain how that works in practice. How does it learn a rule? |
| Sam | The ILP looks at the vast web of triples and spots structural consistencies. It might notice that every time Company X and Company Y share a primary investor and operate in the same geographic region, they end up competing. The ILP formalizes that into a rigid rule: competitorOf(X, Y) is implied by sharesInvestor(X, Y) and sameRegion(X, Y). |
| Alex | And the beauty of ILP is explainability. If the AI agent makes an assumption, it can print out that exact logic rule to tell a human user why it made that leap. But on the flip side of neurosymbolic AI, we have graph neural networks, or GNNs, like GraphSAGE or INDIGO. How do GNNs approach the same graph differently than ILP? |
| Sam | Where ILP learns rigid Boolean rules, GNNs learn continuous vector embeddings. A GNN looks at a node, say, a specific startup, and aggregates data from that node’s entire neighborhood. It looks at the startup’s investors, its founders, its patents. The GNN mathematically compresses all that rich structural neighborhood data into a dense vector embedding. |
| Alex | And once you have those embeddings, you can do things like link prediction. The GNN can predict missing edges in the graph. It can look at a new startup it has never seen before, generate an embedding based on its neighborhood, and predict with 90% confidence that a major tech giant is going to acquire it, purely based on its structural similarity to previous acquisitions. |
| Sam | Let’s tie this all together by walking step by step through the massive M&A solution architecture detailed in our sources. Imagine the full life cycle of an agentic AI designed to find acquisition targets. |
| Alex | Right, let’s trace the data. Step 1, ingestion and extraction. The system pulls in thousands of SEC 10-K filings and real-time financial news. It uses a custom spaCy model (the sources mention one specifically tuned for this, called Jody NER 10K) to extract entities like revenue, subsidiary, and CEO with absolute precision. |
| Sam | Simultaneously, it uses a tool like Docling to pull structured financial tables out of those dense PDFs. Then it passes those isolated entities to an LLM to dynamically extract the nuanced, complex relationships between them. |
| Alex | Step 2, the ontology. All of those extracted triples are mapped against the M&A TBox. Any extraction that violates the rules, like a product trying to acquire a company, is instantly rejected by the RDF reasoner. The clean data is mirrored into Neo4j. |
| Sam | Step 3, the reasoning layer. This is where the neurosymbolic fusion happens. The system runs a graph neural network over the Neo4j database. The GNN calculates an acquisition likelihood score for every company, weighing the network topology, sentiment analysis from the news, and revenue trends. |
| Alex | But again, you don’t spend a billion dollars on an acquisition just because a GNN spit out a high score. The human M&A analyst needs to know why. |
| Sam | Exactly, which is why the inductive logic programming is running in parallel. The ILP provides the explainability layer. It translates the graph’s structural patterns into plain-language rules. It tells the analyst: “Target flagged, score 92. Rationale: target shares an investor with your competitor, their supply chain heavily overlaps with your recent acquisitions, and their patent portfolio fills a known gap in your TBox.” |
| Alex | Finally, we get to the agentic action. The entire system is overseen by an orchestrator agent. When a target hits that 92 score, the orchestrator routes a task to a specialized due diligence agent. This agent dives deep into the graph. It uses SPARQL to run formal semantic queries, checking for regulatory compliance in the RDF layer, and uses Cypher to execute fast multi-hop traversals to map out the target’s entire executive network. The final output is a highly accurate, semantically grounded, and fully explainable dossier. |
| Sam | And that workflow proves the main thesis of these lecture notes. Building a powerful agentic AI is not just about writing clever system prompts. It requires rigorous design, hybrid extraction pipelines that balance precision and recall, and a highly sophisticated blend of structural graph traversal and neurosymbolic reasoning. |
| Alex | So what does this all mean for you as a data science grad student? Mastering this intersection of large language models and knowledge graphs is how you move beyond building simple reactive chatbots. It is the blueprint for architecting systems that can truly plan, verify, and reason over wildly complex, shifting enterprise data. It gives your AI an actual understanding of the world it operates in. |
| Sam | It replaces blind statistical guessing with verifiable, explainable truth. |
| Alex | Which leaves us with a fascinating paradox to consider. As we use LLMs to dynamically extract triples and build our knowledge graphs, and then use those very same structured graphs to constrain, train, and ground our LLMs, we’re approaching a future where the distinction between the neural weights of the model and the semantic structure of the graph completely dissolves. Will tomorrow’s AI be a single living architecture where strict mathematical logic and neural intuition are perfectly, indistinguishably fused? |
| Sam | That is the defining question for the next generation of artificial intelligence. |
| Alex | Something for you to mull over as you design your next agentic architecture. Thank you for joining us on this deep dive, and we hope this gives you a powerful new perspective on the future of reasoning and agentic memory. |
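
The hybrid extraction guardrail and the domain/range constraints discussed in the session can be sketched together as a small validation step. This is a plain-Python illustration under stated assumptions, not the pipeline from the lecture notes: the schema, the entity table, the candidate triples, and the function `validate_triples` are all hypothetical, and no NER model or LLM is actually called. It performs an explicit closed-world check, which a production system might instead delegate to a shape-validation layer such as SHACL or to consistency checking in a reasoner.

```python
# Hypothetical mini-ontology: property -> (domain class, range class).
SCHEMA = {
    "acquires": ("Company", "Company"),
    "has_ceo": ("Company", "Person"),
    "sells": ("Company", "Product"),
}

# Hypothetical output of the NER step: entity string -> class.
ENTITY_TYPES = {
    "Acme Corp": "Company",
    "Globex": "Company",
    "Jane Doe": "Person",
    "WidgetPro": "Product",
}


def validate_triples(candidates, entity_types, schema):
    """Split candidate triples into accepted and rejected (with a reason).

    A triple is accepted only if both endpoints were found by the NER step
    (the guardrail against invented nodes) and the predicate's domain and
    range match the endpoint classes (the ontology constraint).
    """
    accepted, rejected = [], []
    for triple in candidates:
        subj, pred, obj = triple
        if subj not in entity_types or obj not in entity_types:
            rejected.append((triple, "unknown entity"))
        elif pred not in schema:
            rejected.append((triple, "unknown property"))
        else:
            domain, range_ = schema[pred]
            if entity_types[subj] != domain or entity_types[obj] != range_:
                rejected.append((triple, "domain/range violation"))
            else:
                accepted.append(triple)
    return accepted, rejected


# Stand-in for what an LLM might return; these values are made up.
llm_candidates = [
    ("Acme Corp", "acquires", "Globex"),
    ("WidgetPro", "acquires", "Globex"),       # a product cannot acquire
    ("Acme Corp", "acquires", "Initech Ltd"),  # entity never seen by NER
    ("Acme Corp", "has_ceo", "Jane Doe"),
]

accepted, rejected = validate_triples(llm_candidates, ENTITY_TYPES, SCHEMA)
```

Keeping the rejection reason alongside each dropped triple is a deliberate choice here: it gives a human reviewer or a downstream agent something to audit when the LLM's output is filtered.
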
Presentation
Lecture Notes
Notebooks
- 01_Graph_Basics.ipynb — Introduction to directed, undirected, and weighted graphs using NetworkX, with graph algorithms (centrality, PageRank, shortest paths) applied to an agentic AI scenario.
- 02_Building_Knowledge_Graphs.ipynb — Building knowledge graphs from structured (CSV) and semi-structured (JSON) data, with querying and capability-based task routing.
- 03_Triple_Extraction_spaCy.ipynb — NLP-based triple extraction using spaCy for named entity recognition, dependency parsing, and pattern-based relation extraction.
- 04_Triple_Extraction_LLM.ipynb — LLM-powered triple extraction using single-pass, two-step, and schema-constrained approaches, with hallucination validation.
- 05_Ontology_Design.ipynb — Designing ontologies with RDF/OWL using rdflib, covering class hierarchies, properties, competency questions, and SPARQL queries.
- 06_Graph_Queries_SPARQL_vs_Cypher.ipynb — Side-by-side comparison of SPARQL and Cypher query languages on the same agentic AI dataset, with guidance on when to use each.
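
As a small companion to the reasoning ideas in the session, the sketch below shows two of them in plain Python: subclass inference of the kind an RDFS reasoner performs (the wholly owned subsidiary example) and the application of the competitor rule attributed to ILP. All facts, predicate names, and helper functions are hypothetical. The sketch only applies a rule that is already written down; it does not learn the rule from data the way an ILP system would, and it is not taken from the notebooks.

```python
# Hypothetical class hierarchy: subclass -> superclass.
SUBCLASS_OF = {"WhollyOwnedSubsidiary": "CorporateAsset"}

# Hypothetical facts as (subject, predicate, object) triples.
FACTS = {
    ("CompanyB", "type", "WhollyOwnedSubsidiary"),
    ("CompanyA", "owns", "CompanyB"),
    ("CompanyX", "has_investor", "FundZ"),
    ("CompanyY", "has_investor", "FundZ"),
    ("CompanyW", "has_investor", "FundZ"),
    ("CompanyX", "in_region", "EMEA"),
    ("CompanyY", "in_region", "EMEA"),
    ("CompanyW", "in_region", "APAC"),
}


def infer_types(facts, subclass_of):
    """Add (x, type, Super) for every (x, type, Sub) until nothing changes."""
    closed = set(facts)
    changed = True
    while changed:
        changed = False
        for subj, pred, obj in list(closed):
            if pred == "type" and obj in subclass_of:
                new_fact = (subj, "type", subclass_of[obj])
                if new_fact not in closed:
                    closed.add(new_fact)
                    changed = True
    return closed


def assets_of(owner, facts):
    """Owned entities that are corporate assets, after subclass inference."""
    closed = infer_types(facts, SUBCLASS_OF)
    return sorted(
        obj for subj, pred, obj in closed
        if subj == owner and pred == "owns"
        and (obj, "type", "CorporateAsset") in closed
    )


def pairs_sharing(facts, predicate):
    """Unordered pairs (x, y), x < y, that share an object for predicate."""
    by_object = {}
    for subj, pred, obj in facts:
        if pred == predicate:
            by_object.setdefault(obj, set()).add(subj)
    return {
        (x, y)
        for members in by_object.values()
        for x in members for y in members if x < y
    }


def competitor_rule(facts):
    """competitorOf(X, Y) <- sharesInvestor(X, Y) and sameRegion(X, Y)."""
    return pairs_sharing(facts, "has_investor") & pairs_sharing(facts, "in_region")


company_a_assets = assets_of("CompanyA", FACTS)
predicted_competitors = competitor_rule(FACTS)
```

CompanyB is returned as an asset even though no fact says so directly, and CompanyW is excluded from the competitor pairs because it shares the investor but not the region.
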
References
Foundational Knowledge Graphs & Semantic Web
- Hogan, A. et al. (2021). Knowledge Graphs. ACM Computing Surveys.
- Auer, S. et al. (2007). DBpedia: A Nucleus for a Web of Open Data. ISWC.
- Ehrlinger, L. & Wöß, W. (2016). Towards a Definition of Knowledge Graphs. SEMANTiCS.
Ontologies & Reasoning
- Gruber, T. (1995). Toward Principles for the Design of Ontologies.
- Studer, R., Benjamins, V., Fensel, D. (1998). Knowledge Engineering Principles.
- Baader, F. et al. (2010). The Description Logic Handbook.
Graph Query & Modeling
- Wood, P.T. (2012). Query Languages for Graph Databases. ACM SIGMOD.
- Pérez, J. et al. (2006). Semantics and Complexity of SPARQL. ISWC.