Syllabi Analysis Project

Syllabi Analysis — System Architecture

Variation A: Knowledge Graph Intelligence


The Syllabi Analysis DAIS is a Document-Driven Agentic Intelligence System that ingests the complete collection of class syllabi from all colleges and programs at Georgia State University, provided as PDF documents, and transforms them into a queryable semantic knowledge graph. The system serves academic program coordinators and curriculum analysts who need to answer complex questions spanning hundreds of documents: which courses across the university cover a particular topic, how different colleges address the use of generative AI in their policies, what prerequisite chains lead to advanced courses, and where overlaps or gaps exist between programs.

1. Project Overview

System Name: Syllabi Analysis DAIS (Document-Driven Agentic Intelligence System)

Corpus: Class syllabi from all colleges and programs at Georgia State University (GSU), downloaded as PDF documents.

Objective: Build a semantic knowledge graph from GSU syllabi to enable querying about topics covered across courses and programs, how instructors address AI usage policies, prerequisite chains, learning outcomes, and cross-program coverage overlaps.


2. User Persona

Role: Academic Program Coordinator / Curriculum Analyst

Context: Works in GSU’s Office of Academic Affairs or a college-level curriculum committee. Responsible for reviewing and aligning curricula across programs, identifying gaps or redundancies, ensuring coverage of emerging topics (e.g., AI, data ethics), and supporting accreditation reviews.

Goals:

  • Understand which courses cover specific topics (e.g., “machine learning,” “regression analysis,” “business ethics”).
  • Compare how different programs address the same subject area.
  • Analyze AI usage policies across instructors and departments.
  • Identify prerequisite dependencies and potential curriculum gaps.
  • Support accreditation by mapping learning outcomes to courses.

Pain Points:

  • Manually reviewing hundreds of syllabi is infeasible.
  • No structured, queryable representation of syllabi content exists.
  • Cross-program comparisons require scanning documents from multiple colleges.

3. Key Use Cases

# | Use Case                    | Example Query
1 | Topic Coverage Search       | “Which courses across GSU cover natural language processing?”
2 | AI Policy Comparison        | “How do instructors in the Robinson College of Business address the use of generative AI compared to the College of Arts & Sciences?”
3 | Cross-Program Overlap       | “What topics are shared between the MS in Analytics and the MS in Computer Science programs?”
4 | Prerequisite Chain Analysis | “What is the prerequisite chain leading to advanced machine learning courses?”
5 | Learning Outcome Mapping    | “Which courses list ‘critical thinking’ or ‘data-driven decision making’ as a learning outcome?”
6 | Temporal/Policy Trend       | “Has the mention of AI usage policies increased across syllabi over the last three semesters?”

4. Conceptual Knowledge Graph Schema

Entities (Nodes)

Entity Type       | Attributes
Course            | course_code, title, credit_hours, level (undergrad/grad)
Instructor        | name, department, college
Program           | name, degree_type (BS, MS, MBA, PhD), college
College           | name (e.g., Robinson College of Business)
Topic             | name, category (e.g., “statistics”, “programming”, “ethics”)
Learning Outcome  | description, bloom_taxonomy_level
Textbook          | title, author, edition
AI Policy         | policy_type (permitted, restricted, prohibited), details
Semester          | term, year
Assessment Method | type (exam, project, paper, participation)

Relationships (Edges)

Relationship     | From → To                   | Attributes
TAUGHT_BY        | Course → Instructor         | semester
BELONGS_TO       | Course → Program            | required/elective
OFFERED_BY       | Program → College           | —
COVERS_TOPIC     | Course → Topic              | depth (intro/intermediate/advanced), weeks_allocated
HAS_OUTCOME      | Course → Learning Outcome   | —
USES_TEXTBOOK    | Course → Textbook           | —
HAS_AI_POLICY    | Course → AI Policy          | semester
HAS_PREREQUISITE | Course → Course             | —
OFFERED_IN       | Course → Semester           | —
USES_ASSESSMENT  | Course → Assessment Method  | weight_percent
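To make the schema concrete, here is a minimal sketch of how one edge type could be written to Neo4j with the official Python driver. The helper name (covers_topic_query) and parameter keys are illustrative, not part of the project's actual codebase.

```python
# Hypothetical helper: build a parameterized Cypher MERGE for the
# Course -[:COVERS_TOPIC]-> Topic relationship defined in the schema above.

def covers_topic_query(course_code: str, topic: str, depth: str, weeks: int):
    """Return a (query, params) pair for the neo4j driver."""
    query = (
        "MERGE (c:Course {course_code: $course_code}) "
        "MERGE (t:Topic {name: $topic}) "
        "MERGE (c)-[r:COVERS_TOPIC]->(t) "
        "SET r.depth = $depth, r.weeks_allocated = $weeks"
    )
    params = {"course_code": course_code, "topic": topic,
              "depth": depth, "weeks": weeks}
    return query, params

# Against a live database the pair would be executed roughly like this:
# from neo4j import GraphDatabase
# with GraphDatabase.driver(uri, auth=auth) as driver:
#     driver.execute_query(*covers_topic_query("CSC 4520", "NLP", "intro", 3))
```

Using MERGE rather than CREATE keeps ingestion idempotent: re-processing the same syllabus does not duplicate nodes or edges.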

5. System Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                          USER INTERFACES                                │
│  ┌────────────────────────┐          ┌────────────────────────────────┐ │
│  │   Chat Interface       │          │   Batch Query Interface        │ │
│  │   (Web UI / API)       │          │   (CLI / REST endpoint)        │ │
│  └───────────┬────────────┘          └───────────────┬────────────────┘ │
└──────────────┼───────────────────────────────────────┼──────────────────┘
               │                                       │
               ▼                                       ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                    ORCHESTRATION LAYER (LangGraph)                      │
│  ┌────────────────┐   ┌──────────────────┐   ┌──────────────────────┐   │
│  │ Query Router   │──▶│ Agent Coordinator│──▶│ Response Generator   │   │
│  │ Agent          │   │                  │   │ Agent                │   │
│  └────────────────┘   └──────────────────┘   └──────────────────────┘   │
└─────────────────────────────┬───────────────────────────────────────────┘
               ┌──────────────┼──────────────┐
               ▼              ▼              ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                      QUERY & RETRIEVAL LAYER                            │
│  ┌─────────────────┐  ┌──────────────────┐  ┌───────────────────────┐   │
│  │ Graph Query     │  │ Vector Similarity│  │ Structured SQL        │   │
│  │ Agent           │  │ Search Agent     │  │ Query Agent           │   │
│  │ (Cypher → Neo4j)│  │ (Qdrant)         │  │ (PostgreSQL)          │   │
│  └────────┬────────┘  └────────┬─────────┘  └───────────┬───────────┘   │
└───────────┼─────────────────────┼────────────────────────┼──────────────┘
            ▼                     ▼                        ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                           DATA STORES                                   │
│  ┌─────────────────┐  ┌──────────────────┐  ┌───────────────────────┐   │
│  │ Neo4j           │  │ Qdrant           │  │ PostgreSQL            │   │
│  │ (Knowledge      │  │ (Vector          │  │ (Metadata, Documents, │   │
│  │  Graph)         │  │  Embeddings)     │  │  AI Policies)         │   │
│  └─────────────────┘  └──────────────────┘  └───────────────────────┘   │
└───────────▲─────────────────────▲────────────────────────▲──────────────┘
            │                     │                        │
┌───────────┼─────────────────────┼────────────────────────┼──────────────┐
│           EXTRACTION & GRAPH CONSTRUCTION LAYER          │              │
│   ┌─────────────────┐  ┌──────────────────┐  ┌───────────┴───────────┐  │
│   │ Entity          │  │ Relationship     │  │ Graph Validation      │  │
│   │ Extraction      │──▶ Extraction       │──▶ & Deduplication       │  │
│   │ Agent           │  │ Agent            │  │ Agent                 │  │
│   └───────▲─────────┘  └────────▲─────────┘  └───────────────────────┘  │
└───────────┼─────────────────────┼───────────────────────────────────────┘
            │                     │
┌───────────┼─────────────────────┼───────────────────────────────────────┐
│           DOCUMENT PROCESSING LAYER                                     │
│  ┌─────────────────┐  ┌──────────────────┐  ┌───────────────────────┐   │
│  │ PDF Ingestion   │  │ Text Extraction  │  │ Syllabus Section      │   │
│  │ (PyPDF2 /       │─▶│ & Chunking       │─▶│ Classifier (LLM)      │   │
│  │  PDFMiner)      │  │                  │  │                       │   │
│  └─────────────────┘  └──────────────────┘  └───────────────────────┘   │
│                                                                         │
│  ┌─────────────────┐  ┌──────────────────┐                              │
│  │ Vector          │  │ Metadata         │                              │
│  │ Embedding       │  │ Extraction       │                              │
│  │ (Ollama)        │  │ (LLM + regex)    │                              │
│  └─────────────────┘  └──────────────────┘                              │
└─────────────────────────────────────────────────────────────────────────┘
            ▲                                          ▲
            │                                          │
┌───────────┴──────────────────────────────────────────┴──────────────────┐
│  INPUT: PDF Syllabi Directory (by College / Program / Semester)         │
└─────────────────────────────────────────────────────────────────────────┘

                    . . . . . . . . . . . . . . . . .
                    .   EXTERNAL: Ollama Endpoint   .
                    .   (Text Generation +          .
                    .    Embeddings)                .
                    . . . . . . . . . . . . . . . . .

6. Component Details

6.1 Document Processing Layer

Component                  | Technology                                                           | Purpose
PDF Ingestion              | PyPDF2, PDFMiner                                                     | Read PDF syllabi from input directory
Text Extraction & Chunking | PDFMiner, custom logic                                               | Extract text; chunk by syllabus section (course info, schedule, policies, outcomes)
Syllabus Section Classifier| LLM via external Ollama endpoint                                     | Classify extracted text blocks into semantic sections: course description, topics/schedule, AI policy, grading, learning outcomes, textbooks, prerequisites
Metadata Extraction        | LLM via external Ollama endpoint + regex                             | Extract structured fields: course code, instructor name, semester, college, program
Vector Embedding           | Ollama embedding endpoint (e.g., nomic-embed-text or mxbai-embed-large) | Generate embeddings for text chunks for similarity search
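As a sketch of the regex half of the Metadata Extraction step, the snippet below pulls a course code and semester from raw syllabus text; fields the patterns miss would fall through to the LLM. The patterns are assumptions about typical GSU formatting, not verified conventions.

```python
import re

# Illustrative patterns -- assumed formats like "CSC 4520" and "Fall 2025".
COURSE_CODE = re.compile(r"\b([A-Z]{2,4})\s*(\d{4})\b")
SEMESTER = re.compile(r"\b(Fall|Spring|Summer)\s+(20\d{2})\b", re.IGNORECASE)

def extract_metadata(text: str) -> dict:
    """Pull course code and semester from raw syllabus text, if present."""
    meta = {}
    if m := COURSE_CODE.search(text):
        meta["course_code"] = f"{m.group(1)} {m.group(2)}"
    if m := SEMESTER.search(text):
        meta["semester"] = {"term": m.group(1).title(), "year": int(m.group(2))}
    return meta
```

Running the cheap regex pass first means the LLM is only asked about fields that could not be extracted deterministically.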

6.2 Extraction & Graph Construction Layer

Agent                                  | Purpose
Entity Extraction Agent                | Uses LLM to identify entities (courses, instructors, topics, outcomes, textbooks, AI policies) from classified syllabus sections
Relationship Extraction Agent          | Infers relationships (COVERS_TOPIC, HAS_PREREQUISITE, TAUGHT_BY, etc.) from extracted entities and context
Graph Validation & Deduplication Agent | Resolves duplicate entities (e.g., “Prof. Smith” vs “John Smith”), normalizes topic names, enforces schema consistency
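The normalization idea behind the dedup agent can be sketched without an LLM; a real pipeline would combine this with LLM judgment and embedding similarity. The function names and the 0.6 threshold below are illustrative assumptions.

```python
from difflib import SequenceMatcher

def normalize_topic(name: str) -> str:
    """Lowercase, trim, and collapse whitespace so 'NLP ' matches 'nlp'."""
    return " ".join(name.lower().split())

def same_instructor(a: str, b: str, threshold: float = 0.6) -> bool:
    """Heuristic alias check, e.g. 'Prof. John Smith' vs 'John Smith'."""
    a_last, b_last = a.split()[-1].lower(), b.split()[-1].lower()
    if a_last != b_last:          # surnames must agree before fuzzy matching
        return False
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold
```

Requiring the surname to match exactly before computing string similarity keeps false merges (distinct instructors with similar names) rare.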

6.3 Data Stores

Store            | Technology | Contents
Knowledge Graph  | Neo4j      | Entities and relationships per schema above
Vector Store     | Qdrant     | Text chunk embeddings for semantic similarity search
Relational Store | PostgreSQL | Document metadata, raw text, AI policy details, evaluation logs

6.4 Query & Retrieval Layer

Agent                          | Purpose
Graph Query Agent              | Translates natural language questions into Cypher queries against Neo4j
Vector Similarity Search Agent | Performs semantic search over syllabus text chunks in Qdrant
Structured SQL Query Agent     | Queries PostgreSQL for metadata-heavy questions (e.g., “How many courses in the Robinson College have AI policies?”)

6.5 Orchestration Layer

Component                | Purpose
Query Router Agent       | Analyzes the incoming user query and determines which retrieval agent(s) to invoke (graph, vector, SQL, or a combination)
Agent Coordinator        | Manages multi-agent execution flow; merges results from parallel retrievals
Response Generator Agent | Synthesizes the final natural-language answer from retrieved graph substructures and text snippets; includes citations back to specific syllabi
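The Agent Coordinator's merge step might look roughly like the following: hits from the three retrieval agents are deduplicated by syllabus and ranked by score. The hit shape ({'syllabus_id', 'score', 'source'}) is a hypothetical convention, not the project's actual data model.

```python
# Hypothetical merge step: combine hits from graph, vector, and SQL agents,
# keeping the highest-scoring hit per syllabus.

def merge_results(*result_lists):
    """Each hit is a dict: {'syllabus_id': str, 'score': float, 'source': str}."""
    best = {}
    for hits in result_lists:
        for hit in hits:
            sid = hit["syllabus_id"]
            if sid not in best or hit["score"] > best[sid]["score"]:
                best[sid] = hit
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)
```

Keeping the `source` field on each hit lets the Response Generator cite whether an answer came from the graph, a text chunk, or a metadata query.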

6.6 User Interfaces

Interface             | Technology                         | Purpose
Chat Interface        | Provided Web UI → REST API (FastAPI) | Interactive Q&A for curriculum analysts; shows answers with citations and graph visualizations
Batch Query Interface | REST endpoint / CLI script         | Accepts a JSON file of queries; returns structured outputs for evaluation

7. Technology Stack

Layer                 | Technology
Language              | Python 3.11+
LLM (text generation) | External Ollama endpoint (on-prem, e.g., llama3.1, mistral)
LLM (embeddings)      | External Ollama endpoint (e.g., nomic-embed-text, mxbai-embed-large)
Agent Framework       | LangGraph — state-graph orchestration with conditional routing
LangChain Integration | langchain-ollama (ChatOllama, OllamaEmbeddings) for LLM/embedding calls
PDF Processing        | PyPDF2, PDFMiner
Knowledge Graph       | Neo4j (+ neo4j Python driver)
Vector Database       | Qdrant (+ qdrant-client)
Relational Database   | PostgreSQL (+ psycopg2 / SQLAlchemy)
Web API               | FastAPI
Containerization      | Docker + Docker Compose
CI/CD                 | GitLab CI

8. LangGraph Agent Architecture

Why LangGraph: The system has two distinct multi-step pipelines — ingestion and query — each with conditional branching. LangGraph’s state-graph model maps naturally to both:

  • Nodes = agent steps (extract, classify, embed, query, respond)
  • Edges = transitions, including conditional routing (e.g., route query to graph vs. vector vs. SQL agent based on intent)
  • State = shared context passed between nodes (document metadata, extracted entities, query context, retrieved results)
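A sketch of what the shared query-graph state could look like as a TypedDict; the field names are assumptions for illustration, not the project's actual state schema.

```python
from typing import Optional, TypedDict

class QueryState(TypedDict, total=False):
    query: str              # raw user question
    intent: Optional[str]   # "graph" | "vector" | "sql", set by classify_intent
    graph_results: list     # rows returned by the Cypher agent
    vector_results: list    # chunks returned by Qdrant
    sql_results: list       # rows returned by PostgreSQL
    answer: str             # final synthesized response

# In LangGraph, each node receives this state and returns only the keys it
# updates, e.g. the intent classifier would return {"intent": "graph"}.
```

Declaring the state once makes each node's contract explicit: retrieval agents write their own results key, and only the response generator writes `answer`.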

8.1 Ingestion Graph

                     ┌──────────────────┐
                     │ START (PDF path) │
                     └────────┬─────────┘
                              │
                     ┌────────▼─────────┐
                     │  extract_text    │
                     │  PyPDF2/PDFMiner │
                     └────────┬─────────┘
                              │
                     ┌────────▼──────────────┐
                     │  classify_sections    │
                     │  LLM (Ollama)         │
                     └────────┬──────────────┘
                              │
            ┌─────────────────┼─────────────────┬──────────────────┐
            ▼                 ▼                 ▼                  ▼
   ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐
   │ extract      │  │ extract      │  │ generate     │  │ store        │
   │ entities     │  │ relations    │  │ embeddings   │  │ metadata     │
   │ (LLM)        │  │ (LLM)        │  │ (Ollama)     │  │ (PostgreSQL) │
   └──────┬───────┘  └──────┬───────┘  └──────┬───────┘  └──────────────┘
          │                 │                  │
          └────────┬────────┘                  │
                   ▼                           ▼
          ┌──────────────────┐        ┌───────────────┐
          │ validate_graph   │        │ store_vectors │
          │ & dedup (LLM)    │        │ (Qdrant)      │
          └────────┬─────────┘        └───────────────┘
                   │
                   ▼
          ┌──────────────────┐
          │ store_graph      │
          │ (Neo4j)          │
          └──────────────────┘

8.2 Query Graph

                     ┌─────────────────────┐
                     │  START (user query) │
                     └────────┬────────────┘
                              │
                     ┌────────▼───────────────┐
                     │  classify_intent       │
                     │  LLM determines type   │
                     └────────┬───────────────┘
                              │
            ┌─────────────────┼──────────────────┐
            │                 │                  │
       (graph query)   (semantic search)  (structured query)
            │                 │                  │
            ▼                 ▼                  ▼
   ┌──────────────┐  ┌──────────────┐  ┌────────────────┐
   │ graph_query  │  │ vector_search│  │ sql_query      │
   │ agent        │  │ agent        │  │ agent          │
   │ (Cypher →    │  │ (→ Qdrant)   │  │ (SQL →         │
   │  Neo4j)      │  │              │  │  PostgreSQL)   │
   └──────┬───────┘  └──────┬───────┘  └───────┬────────┘
          │                 │                  │
          └─────────────────┼──────────────────┘
                            ▼
                   ┌─────────────────┐
                   │  merge_results  │
                   │  combine & rank │
                   └────────┬────────┘
                            ▼
                   ┌─────────────────────┐
                   │  generate_response  │
                   │  LLM synthesizes    │
                   │  answer w/ citations│
                   └─────────────────────┘

The conditional edge after classify_intent can route to one, two, or all three retrieval agents depending on the query. For example, “Which courses cover NLP?” routes to the graph agent, while “Summarize the AI policy for CSC 4520” routes to vector search, and “How many Robinson College courses mention Python?” routes to SQL.
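A keyword-based stand-in for classify_intent, mirroring the routing examples above, might look like this; the real system would make an LLM call rather than match keywords, and the keyword lists here are illustrative only.

```python
# Simplified routing heuristic -- an LLM-free sketch of classify_intent.

def classify_intent(query: str) -> set:
    """Return the set of retrieval agents to invoke for this query."""
    q = query.lower()
    routes = set()
    if any(w in q for w in ("how many", "count", "average")):
        routes.add("sql")          # aggregation over metadata
    if any(w in q for w in ("summarize", "policy for", "what does")):
        routes.add("vector")       # needs the underlying text
    if any(w in q for w in ("which courses", "prerequisite", "cover")):
        routes.add("graph")        # structural / relationship question
    return routes or {"vector"}    # fall back to semantic search
```

Returning a set rather than a single label is what lets the conditional edge fan out to two or all three agents for mixed queries.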


9. Data Flow Summary

PDF Syllabi
    │
    ▼
[1] Ingest & Extract Text (PyPDF2/PDFMiner)
    │
    ▼
[2] Classify Syllabus Sections (LLM)
    │  → Course Info, Topics/Schedule, AI Policy, Grading,
    │    Learning Outcomes, Textbooks, Prerequisites
    │
    ├──▶ [3a] Entity Extraction Agent (LLM)
    │         → Courses, Instructors, Topics, Outcomes, Policies
    │
    ├──▶ [3b] Relationship Extraction Agent (LLM)
    │         → COVERS_TOPIC, HAS_PREREQUISITE, TAUGHT_BY, etc.
    │
    ├──▶ [3c] Vector Embedding (Ollama embedding endpoint)
    │         → Chunk embeddings → Qdrant
    │
    └──▶ [3d] Metadata → PostgreSQL
              │
              ▼
[4] Graph Validation & Dedup Agent
    │  → Normalize topics, resolve instructor aliases
    │
    ▼
[5] Write to Neo4j (graph), Qdrant (vectors), PostgreSQL (metadata)
    │
    ▼
[6] User Query → Query Router → Graph/Vector/SQL Agents → Response Generator → Answer

10. Milestone Alignment

Milestone | Deliverables for Syllabi Analysis
M01       | This document: variation selection (A), persona, use cases, schema
M02       | PDF ingestion pipeline, text extraction, section classification, vector embeddings, metadata to PostgreSQL; architecture diagram; Docker Compose setup
M03       | Multi-agent pipeline (entity/relationship extraction → Neo4j); initial chat interface wired to query agents; basic batch query endpoint
M04       | Evaluation test set (50–100 queries about topics, AI policies, cross-program coverage); baseline metrics; error analysis and 3+ improvement ideas
M05       | Improvements (e.g., better topic normalization, hybrid graph+vector retrieval); ablation study (graph-only vs. vector-only vs. hybrid); iteration report
M06       | Deployed system (chat + batch); technical report (10–15 pages); demo video (5–10 min); in-class presentation

11. Evaluation Test Set Plan

Size: 50–100 queries

Categories:

Category                 | Example Query                                                        | Metric
Topic lookup             | “Which courses cover regression analysis?”                           | Precision, Recall, F1
AI policy extraction     | “What is the AI usage policy for ACCT 2101?”                         | Exact match / LLM-judged accuracy
Cross-program comparison | “Compare data visualization coverage in Analytics vs. CS programs”   | LLM-judged answer quality (1–5 scale)
Prerequisite reasoning   | “What prerequisites are needed before taking CSC 8820?”              | Graph path accuracy
Aggregation              | “How many courses in Robinson College mention Python?”               | Numerical accuracy
Temporal trend           | “Has AI policy language changed between Fall 2024 and Spring 2026?”  | LLM-judged quality

Reference answers will be manually authored by the team using a sample of 20–30 syllabi reviewed in full.
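For the set-valued categories (e.g., topic lookup), precision/recall/F1 can be computed directly by comparing the course codes the system returns against the hand-authored reference set, as in this minimal sketch:

```python
def prf1(predicted: set, reference: set):
    """Precision, recall, and F1 for a set of predicted vs. reference items."""
    tp = len(predicted & reference)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1
```

The guards against empty sets matter in practice: a query for which the system returns nothing should score zero precision, not raise a division error.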


12. Containerization Plan

# docker-compose.yml (outline)
services:
  neo4j:
    image: neo4j:5
    ports: ["7474:7474", "7687:7687"]
    volumes:
      - neo4j_data:/data
  qdrant:
    image: qdrant/qdrant
    ports: ["6333:6333"]
    volumes:
      - qdrant_data:/qdrant/storage
  postgres:
    image: postgres:18
    ports: ["5432:5432"]
    volumes:
      - pg_data:/var/lib/postgresql/data
  app:
    build: ./app
    depends_on: [neo4j, qdrant, postgres]
    ports: ["8000:8000"]
    environment:
      - OLLAMA_BASE_URL=http://<external-ollama-host>:11434   # external endpoint
      - OLLAMA_MODEL=llama3.1
      - OLLAMA_EMBED_MODEL=nomic-embed-text
      - NEO4J_URI=bolt://neo4j:7687
      - QDRANT_URL=http://qdrant:6333
      - POSTGRES_URL=postgresql://user:pass@postgres:5432/syllabi
  web-ui:
    build: ./web-ui
    depends_on: [app]
    ports: ["3000:3000"]

volumes:
  neo4j_data:
  qdrant_data:
  pg_data:

Note: Ollama is not containerized locally — the system connects to an external on-prem Ollama endpoint for both text generation and vector embedding. All other components (Neo4j, Qdrant, PostgreSQL, application, web UI) run as Docker containers orchestrated by Docker Compose.
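The app container would read the Compose environment above at startup; a minimal sketch, with defaults mirroring the docker-compose values (the function name and dict keys are illustrative):

```python
import os

def load_config() -> dict:
    """Read service endpoints from the environment, with Compose defaults."""
    return {
        "ollama_base_url": os.environ.get("OLLAMA_BASE_URL", "http://localhost:11434"),
        "ollama_model": os.environ.get("OLLAMA_MODEL", "llama3.1"),
        "neo4j_uri": os.environ.get("NEO4J_URI", "bolt://neo4j:7687"),
        "qdrant_url": os.environ.get("QDRANT_URL", "http://qdrant:6333"),
        "postgres_url": os.environ.get(
            "POSTGRES_URL", "postgresql://user:pass@postgres:5432/syllabi"),
    }
```

Centralizing endpoint configuration this way keeps the external Ollama host swappable without code changes: only OLLAMA_BASE_URL differs between development and the on-prem deployment.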