Syllabi Analysis Project

Syllabi Analysis — System Architecture

Variation A: Knowledge Graph Intelligence


The Syllabi Analysis DAIS is a Document-Driven Agentic Intelligence System that ingests the complete collection of class syllabi from all colleges and programs at Georgia State University, provided as PDF documents, and transforms them into a queryable semantic knowledge graph. The system serves academic program coordinators and curriculum analysts who need to answer complex questions spanning hundreds of documents: which courses across the university cover a particular topic, how different colleges address the use of generative AI in their policies, what prerequisite chains lead to advanced courses, and where overlaps or gaps exist between programs.

1. Project Overview

System Name: Syllabi Analysis DAIS (Document-Driven Agentic Intelligence System)

Corpus: Class syllabi from all colleges and programs at Georgia State University (GSU), downloaded as PDF documents.

Objective: Build a semantic knowledge graph from GSU syllabi to enable querying about topics covered across courses and programs, how instructors address AI usage policies, prerequisite chains, learning outcomes, and cross-program coverage overlaps.


2. User Persona

Role: Academic Program Coordinator / Curriculum Analyst

Context: Works in GSU’s Office of Academic Affairs or a college-level curriculum committee. Responsible for reviewing and aligning curricula across programs, identifying gaps or redundancies, ensuring coverage of emerging topics (e.g., AI, data ethics), and supporting accreditation reviews.

Goals:

  • Understand which courses cover specific topics (e.g., “machine learning,” “regression analysis,” “business ethics”).
  • Compare how different programs address the same subject area.
  • Analyze AI usage policies across instructors and departments.
  • Identify prerequisite dependencies and potential curriculum gaps.
  • Support accreditation by mapping learning outcomes to courses.

Pain Points:

  • Manually reviewing hundreds of syllabi is infeasible.
  • No structured, queryable representation of syllabi content exists.
  • Cross-program comparisons require scanning documents from multiple colleges.

3. Key Use Cases

# | Use Case                    | Example Query
1 | Topic Coverage Search       | “Which courses across GSU cover natural language processing?”
2 | AI Policy Comparison        | “How do instructors in the Robinson College of Business address the use of generative AI compared to the College of Arts & Sciences?”
3 | Cross-Program Overlap       | “What topics are shared between the MS in Analytics and the MS in Computer Science programs?”
4 | Prerequisite Chain Analysis | “What is the prerequisite chain leading to advanced machine learning courses?”
5 | Learning Outcome Mapping    | “Which courses list ‘critical thinking’ or ‘data-driven decision making’ as a learning outcome?”
6 | Temporal/Policy Trend       | “Has the mention of AI usage policies increased across syllabi over the last three semesters?”

4. Conceptual Knowledge Graph Schema

Entities (Nodes)

Entity Type       | Attributes
Course            | course_code, title, credit_hours, level (undergrad/grad)
Instructor        | name, department, college
Program           | name, degree_type (BS, MS, MBA, PhD), college
College           | name (e.g., Robinson College of Business)
Topic             | name, category (e.g., “statistics”, “programming”, “ethics”)
Learning Outcome  | description, bloom_taxonomy_level
Textbook          | title, author, edition
AI Policy         | policy_type (permitted, restricted, prohibited), details
Semester          | term, year
Assessment Method | type (exam, project, paper, participation)

Relationships (Edges)

Relationship     | From → To                   | Attributes
TAUGHT_BY        | Course → Instructor         | semester
BELONGS_TO       | Course → Program            | required/elective
OFFERED_BY       | Program → College           | —
COVERS_TOPIC     | Course → Topic              | depth (intro/intermediate/advanced), weeks_allocated
HAS_OUTCOME      | Course → Learning Outcome   | —
USES_TEXTBOOK    | Course → Textbook           | —
HAS_AI_POLICY    | Course → AI Policy          | semester
HAS_PREREQUISITE | Course → Course             | —
OFFERED_IN       | Course → Semester           | —
USES_ASSESSMENT  | Course → Assessment Method  | weight_percent
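To make the schema concrete, here is a minimal sketch of how one edge type could be written to Neo4j with the official Python driver. The helper name (covers_topic_query) and parameter keys are illustrative, not part of the project's actual codebase.

```python
# Hypothetical helper: build a parameterized Cypher MERGE for the
# Course -[:COVERS_TOPIC]-> Topic relationship defined in the schema above.

def covers_topic_query(course_code: str, topic: str, depth: str, weeks: int):
    """Return a (query, params) pair for the neo4j driver."""
    query = (
        "MERGE (c:Course {course_code: $course_code}) "
        "MERGE (t:Topic {name: $topic}) "
        "MERGE (c)-[r:COVERS_TOPIC]->(t) "
        "SET r.depth = $depth, r.weeks_allocated = $weeks"
    )
    params = {"course_code": course_code, "topic": topic,
              "depth": depth, "weeks": weeks}
    return query, params

# Against a live database the pair would be executed roughly like this:
# from neo4j import GraphDatabase
# with GraphDatabase.driver(uri, auth=auth) as driver:
#     driver.execute_query(*covers_topic_query("CSC 4520", "NLP", "intro", 3))
```

Using MERGE rather than CREATE keeps ingestion idempotent: re-processing the same syllabus does not duplicate nodes or edges.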

5. System Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                          USER INTERFACES                                │
│  ┌────────────────────────┐          ┌────────────────────────────────┐ │
│  │   Chat Interface       │          │   Batch Query Interface        │ │
│  │   (Web UI / API)       │          │   (CLI / REST endpoint)        │ │
│  └───────────┬────────────┘          └───────────────┬────────────────┘ │
└──────────────┼───────────────────────────────────────┼──────────────────┘
               │                                       │
               ▼                                       ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                    ORCHESTRATION LAYER (LangGraph)                      │
│  ┌────────────────┐   ┌──────────────────┐   ┌──────────────────────┐   │
│  │ Query Router   │──▶│ Agent Coordinator│──▶│ Response Generator   │   │
│  │ Agent          │   │                  │   │ Agent                │   │
│  └────────────────┘   └──────────────────┘   └──────────────────────┘   │
└─────────────────────────────┬───────────────────────────────────────────┘
               ┌──────────────┼──────────────┐
               ▼              ▼              ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                      QUERY & RETRIEVAL LAYER                            │
│  ┌─────────────────┐  ┌──────────────────┐  ┌───────────────────────┐   │
│  │ Graph Query     │  │ Vector Similarity│  │ Structured SQL        │   │
│  │ Agent           │  │ Search Agent     │  │ Query Agent           │   │
│  │ (Cypher → Neo4j)│  │ (Qdrant)         │  │ (PostgreSQL)          │   │
│  └────────┬────────┘  └────────┬─────────┘  └───────────┬───────────┘   │
└───────────┼─────────────────────┼────────────────────────┼──────────────┘
            ▼                     ▼                        ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                           DATA STORES                                   │
│  ┌─────────────────┐  ┌──────────────────┐  ┌───────────────────────┐   │
│  │ Neo4j           │  │ Qdrant           │  │ PostgreSQL            │   │
│  │ (Knowledge      │  │ (Vector          │  │ (Metadata, Documents, │   │
│  │  Graph)         │  │  Embeddings)     │  │  AI Policies)         │   │
│  └─────────────────┘  └──────────────────┘  └───────────────────────┘   │
└───────────▲─────────────────────▲────────────────────────▲──────────────┘
            │                     │                        │
┌───────────┼─────────────────────┼────────────────────────┼──────────────┐
│           EXTRACTION & GRAPH CONSTRUCTION LAYER          │              │
│   ┌─────────────────┐  ┌──────────────────┐  ┌───────────┴───────────┐  │
│   │ Entity          │  │ Relationship     │  │ Graph Validation      │  │
│   │ Extraction      │──▶ Extraction       │──▶ & Deduplication       │  │
│   │ Agent           │  │ Agent            │  │ Agent                 │  │
│   └───────▲─────────┘  └────────▲─────────┘  └───────────────────────┘  │
└───────────┼─────────────────────┼───────────────────────────────────────┘
            │                     │
┌───────────┼─────────────────────┼───────────────────────────────────────┐
│           DOCUMENT PROCESSING LAYER                                     │
│  ┌─────────────────┐  ┌──────────────────┐  ┌───────────────────────┐   │
│  │ PDF Ingestion   │  │ Text Extraction  │  │ Syllabus Section      │   │
│  │ (PyPDF2 /       │─▶│ & Chunking       │─▶│ Classifier (LLM)      │   │
│  │  PDFMiner)      │  │                  │  │                       │   │
│  └─────────────────┘  └──────────────────┘  └───────────────────────┘   │
│                                                                         │
│  ┌─────────────────┐  ┌──────────────────┐                              │
│  │ Vector          │  │ Metadata         │                              │
│  │ Embedding       │  │ Extraction       │                              │
│  │ (Ollama)        │  │ (LLM + regex)    │                              │
│  └─────────────────┘  └──────────────────┘                              │
└─────────────────────────────────────────────────────────────────────────┘
            ▲                                          ▲
            │                                          │
┌───────────┴──────────────────────────────────────────┴──────────────────┐
│  INPUT: PDF Syllabi Directory (by College / Program / Semester)         │
└─────────────────────────────────────────────────────────────────────────┘

                    . . . . . . . . . . . . . . . . .
                    .   EXTERNAL: Ollama Endpoint   .
                    .   (Text Generation +          .
                    .    Embeddings)                .
                    . . . . . . . . . . . . . . . . .

6. Component Details

6.1 Document Processing Layer

Component                  | Technology                                                           | Purpose
PDF Ingestion              | PyPDF2, PDFMiner                                                     | Read PDF syllabi from input directory
Text Extraction & Chunking | PDFMiner, custom logic                                               | Extract text; chunk by syllabus section (course info, schedule, policies, outcomes)
Syllabus Section Classifier| LLM via external Ollama endpoint                                     | Classify extracted text blocks into semantic sections: course description, topics/schedule, AI policy, grading, learning outcomes, textbooks, prerequisites
Metadata Extraction        | LLM via external Ollama endpoint + regex                             | Extract structured fields: course code, instructor name, semester, college, program
Vector Embedding           | Ollama embedding endpoint (e.g., nomic-embed-text or mxbai-embed-large) | Generate embeddings for text chunks for similarity search
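As a sketch of the regex half of the Metadata Extraction step, the snippet below pulls a course code and semester from raw syllabus text; fields the patterns miss would fall through to the LLM. The patterns are assumptions about typical GSU formatting, not verified conventions.

```python
import re

# Illustrative patterns -- assumed formats like "CSC 4520" and "Fall 2025".
COURSE_CODE = re.compile(r"\b([A-Z]{2,4})\s*(\d{4})\b")
SEMESTER = re.compile(r"\b(Fall|Spring|Summer)\s+(20\d{2})\b", re.IGNORECASE)

def extract_metadata(text: str) -> dict:
    """Pull course code and semester from raw syllabus text, if present."""
    meta = {}
    if m := COURSE_CODE.search(text):
        meta["course_code"] = f"{m.group(1)} {m.group(2)}"
    if m := SEMESTER.search(text):
        meta["semester"] = {"term": m.group(1).title(), "year": int(m.group(2))}
    return meta
```

Running the cheap regex pass first means the LLM is only asked about fields that could not be extracted deterministically.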

6.2 Extraction & Graph Construction Layer

Agent                                  | Purpose
Entity Extraction Agent                | Uses LLM to identify entities (courses, instructors, topics, outcomes, textbooks, AI policies) from classified syllabus sections
Relationship Extraction Agent          | Infers relationships (COVERS_TOPIC, HAS_PREREQUISITE, TAUGHT_BY, etc.) from extracted entities and context
Graph Validation & Deduplication Agent | Resolves duplicate entities (e.g., “Prof. Smith” vs “John Smith”), normalizes topic names, enforces schema consistency
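The normalization idea behind the dedup agent can be sketched without an LLM; a real pipeline would combine this with LLM judgment and embedding similarity. The function names and the 0.6 threshold below are illustrative assumptions.

```python
from difflib import SequenceMatcher

def normalize_topic(name: str) -> str:
    """Lowercase, trim, and collapse whitespace so 'NLP ' matches 'nlp'."""
    return " ".join(name.lower().split())

def same_instructor(a: str, b: str, threshold: float = 0.6) -> bool:
    """Heuristic alias check, e.g. 'Prof. John Smith' vs 'John Smith'."""
    a_last, b_last = a.split()[-1].lower(), b.split()[-1].lower()
    if a_last != b_last:          # surnames must agree before fuzzy matching
        return False
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold
```

Requiring the surname to match exactly before computing string similarity keeps false merges (distinct instructors with similar names) rare.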

6.3 Data Stores

Store            | Technology | Contents
Knowledge Graph  | Neo4j      | Entities and relationships per schema above
Vector Store     | Qdrant     | Text chunk embeddings for semantic similarity search
Relational Store | PostgreSQL | Document metadata, raw text, AI policy details, evaluation logs

6.4 Query & Retrieval Layer

Agent                          | Purpose
Graph Query Agent              | Translates natural language questions into Cypher queries against Neo4j
Vector Similarity Search Agent | Performs semantic search over syllabus text chunks in Qdrant
Structured SQL Query Agent     | Queries PostgreSQL for metadata-heavy questions (e.g., “How many courses in the Robinson College have AI policies?”)

6.5 Orchestration Layer

Component                | Purpose
Query Router Agent       | Analyzes the incoming user query and determines which retrieval agent(s) to invoke (graph, vector, SQL, or a combination)
Agent Coordinator        | Manages multi-agent execution flow; merges results from parallel retrievals
Response Generator Agent | Synthesizes the final natural-language answer from retrieved graph substructures and text snippets; includes citations back to specific syllabi
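The Agent Coordinator's merge step might look roughly like the following: hits from the three retrieval agents are deduplicated by syllabus and ranked by score. The hit shape ({'syllabus_id', 'score', 'source'}) is a hypothetical convention, not the project's actual data model.

```python
# Hypothetical merge step: combine hits from graph, vector, and SQL agents,
# keeping the highest-scoring hit per syllabus.

def merge_results(*result_lists):
    """Each hit is a dict: {'syllabus_id': str, 'score': float, 'source': str}."""
    best = {}
    for hits in result_lists:
        for hit in hits:
            sid = hit["syllabus_id"]
            if sid not in best or hit["score"] > best[sid]["score"]:
                best[sid] = hit
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)
```

Keeping the `source` field on each hit lets the Response Generator cite whether an answer came from the graph, a text chunk, or a metadata query.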

6.6 User Interfaces

Interface             | Technology                         | Purpose
Chat Interface        | Provided Web UI → REST API (FastAPI) | Interactive Q&A for curriculum analysts; shows answers with citations and graph visualizations
Batch Query Interface | REST endpoint / CLI script         | Accepts a JSON file of queries; returns structured outputs for evaluation

7. Technology Stack

Layer                 | Technology
Language              | Python 3.11+
LLM (text generation) | External Ollama endpoint (on-prem, e.g., llama3.1, mistral)
LLM (embeddings)      | External Ollama endpoint (e.g., nomic-embed-text, mxbai-embed-large)
Agent Framework       | LangGraph — state-graph orchestration with conditional routing
LangChain Integration | langchain-ollama (ChatOllama, OllamaEmbeddings) for LLM/embedding calls
PDF Processing        | PyPDF2, PDFMiner
Knowledge Graph       | Neo4j (+ neo4j Python driver)
Vector Database       | Qdrant (+ qdrant-client)
Relational Database   | PostgreSQL (+ psycopg2 / SQLAlchemy)
Web API               | FastAPI
Containerization      | Docker + Docker Compose
CI/CD                 | GitLab CI

8. LangGraph Agent Architecture

Why LangGraph: The system has two distinct multi-step pipelines — ingestion and query — each with conditional branching. LangGraph’s state-graph model maps naturally to both:

  • Nodes = agent steps (extract, classify, embed, query, respond)
  • Edges = transitions, including conditional routing (e.g., route query to graph vs. vector vs. SQL agent based on intent)
  • State = shared context passed between nodes (document metadata, extracted entities, query context, retrieved results)
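A sketch of what the shared query-graph state could look like as a TypedDict; the field names are assumptions for illustration, not the project's actual state schema.

```python
from typing import Optional, TypedDict

class QueryState(TypedDict, total=False):
    query: str              # raw user question
    intent: Optional[str]   # "graph" | "vector" | "sql", set by classify_intent
    graph_results: list     # rows returned by the Cypher agent
    vector_results: list    # chunks returned by Qdrant
    sql_results: list       # rows returned by PostgreSQL
    answer: str             # final synthesized response

# In LangGraph, each node receives this state and returns only the keys it
# updates, e.g. the intent classifier would return {"intent": "graph"}.
```

Declaring the state once makes each node's contract explicit: retrieval agents write their own results key, and only the response generator writes `answer`.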

8.1 Ingestion Graph

                     ┌──────────────────┐
                     │ START (PDF path) │
                     └────────┬─────────┘
                              │
                     ┌────────▼─────────┐
                     │  extract_text    │
                     │  PyPDF2/PDFMiner │
                     └────────┬─────────┘
                              │
                     ┌────────▼──────────────┐
                     │  classify_sections    │
                     │  LLM (Ollama)         │
                     └────────┬──────────────┘
                              │
            ┌─────────────────┼─────────────────┬──────────────────┐
            ▼                 ▼                 ▼                  ▼
   ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐
   │ extract      │  │ extract      │  │ generate     │  │ store        │
   │ entities     │  │ relations    │  │ embeddings   │  │ metadata     │
   │ (LLM)        │  │ (LLM)        │  │ (Ollama)     │  │ (PostgreSQL) │
   └──────┬───────┘  └──────┬───────┘  └──────┬───────┘  └──────────────┘
          │                 │                  │
          └────────┬────────┘                  │
                   ▼                           ▼
          ┌──────────────────┐        ┌───────────────┐
          │ validate_graph   │        │ store_vectors │
          │ & dedup (LLM)    │        │ (Qdrant)      │
          └────────┬─────────┘        └───────────────┘
                   │
                   ▼
          ┌──────────────────┐
          │ store_graph      │
          │ (Neo4j)          │
          └──────────────────┘

8.2 Query Graph

                     ┌─────────────────────┐
                     │  START (user query) │
                     └────────┬────────────┘
                              │
                     ┌────────▼───────────────┐
                     │  classify_intent       │
                     │  LLM determines type   │
                     └────────┬───────────────┘
                              │
            ┌─────────────────┼──────────────────┐
            │                 │                  │
       (graph query)   (semantic search)  (structured query)
            │                 │                  │
            ▼                 ▼                  ▼
   ┌──────────────┐  ┌──────────────┐  ┌────────────────┐
   │ graph_query  │  │ vector_search│  │ sql_query      │
   │ agent        │  │ agent        │  │ agent          │
   │ (Cypher →    │  │ (→ Qdrant)   │  │ (SQL →         │
   │  Neo4j)      │  │              │  │  PostgreSQL)   │
   └──────┬───────┘  └──────┬───────┘  └───────┬────────┘
          │                 │                  │
          └─────────────────┼──────────────────┘
                            ▼
                   ┌─────────────────┐
                   │  merge_results  │
                   │  combine & rank │
                   └────────┬────────┘
                            ▼
                   ┌─────────────────────┐
                   │  generate_response  │
                   │  LLM synthesizes    │
                   │  answer w/ citations│
                   └─────────────────────┘

The conditional edge after classify_intent can route to one, two, or all three retrieval agents depending on the query. For example, “Which courses cover NLP?” routes to the graph agent, while “Summarize the AI policy for CSC 4520” routes to vector search, and “How many Robinson College courses mention Python?” routes to SQL.
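A keyword-based stand-in for classify_intent, mirroring the routing examples above, might look like this; the real system would make an LLM call rather than match keywords, and the keyword lists here are illustrative only.

```python
# Simplified routing heuristic -- an LLM-free sketch of classify_intent.

def classify_intent(query: str) -> set:
    """Return the set of retrieval agents to invoke for this query."""
    q = query.lower()
    routes = set()
    if any(w in q for w in ("how many", "count", "average")):
        routes.add("sql")          # aggregation over metadata
    if any(w in q for w in ("summarize", "policy for", "what does")):
        routes.add("vector")       # needs the underlying text
    if any(w in q for w in ("which courses", "prerequisite", "cover")):
        routes.add("graph")        # structural / relationship question
    return routes or {"vector"}    # fall back to semantic search
```

Returning a set rather than a single label is what lets the conditional edge fan out to two or all three agents for mixed queries.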


9. Data Flow Summary

PDF Syllabi
    │
    ▼
[1] Ingest & Extract Text (PyPDF2/PDFMiner)
    │
    ▼
[2] Classify Syllabus Sections (LLM)
    │  → Course Info, Topics/Schedule, AI Policy, Grading,
    │    Learning Outcomes, Textbooks, Prerequisites
    │
    ├──▶ [3a] Entity Extraction Agent (LLM)
    │         → Courses, Instructors, Topics, Outcomes, Policies
    │
    ├──▶ [3b] Relationship Extraction Agent (LLM)
    │         → COVERS_TOPIC, HAS_PREREQUISITE, TAUGHT_BY, etc.
    │
    ├──▶ [3c] Vector Embedding (Ollama embedding endpoint)
    │         → Chunk embeddings → Qdrant
    │
    └──▶ [3d] Metadata → PostgreSQL
              │
              ▼
[4] Graph Validation & Dedup Agent
    │  → Normalize topics, resolve instructor aliases
    │
    ▼
[5] Write to Neo4j (graph), Qdrant (vectors), PostgreSQL (metadata)
    │
    ▼
[6] User Query → Query Router → Graph/Vector/SQL Agents → Response Generator → Answer

10. Milestone Alignment

Milestone | Deliverables for Syllabi Analysis
M01       | This document: variation selection (A), persona, use cases, schema
M02       | PDF ingestion pipeline, text extraction, section classification, vector embeddings, metadata to PostgreSQL; architecture diagram; Docker Compose setup
M03       | Multi-agent pipeline (entity/relationship extraction → Neo4j); initial chat interface wired to query agents; basic batch query endpoint
M04       | Evaluation test set (50–100 queries about topics, AI policies, cross-program coverage); baseline metrics; error analysis and 3+ improvement ideas
M05       | Improvements (e.g., better topic normalization, hybrid graph+vector retrieval); ablation study (graph-only vs. vector-only vs. hybrid); iteration report
M06       | Deployed system (chat + batch); technical report (10–15 pages); demo video (5–10 min); in-class presentation

11. Evaluation Test Set Plan

Size: 50–100 queries

Categories:

Category                 | Example Query                                                        | Metric
Topic lookup             | “Which courses cover regression analysis?”                           | Precision, Recall, F1
AI policy extraction     | “What is the AI usage policy for ACCT 2101?”                         | Exact match / LLM-judged accuracy
Cross-program comparison | “Compare data visualization coverage in Analytics vs. CS programs”   | LLM-judged answer quality (1–5 scale)
Prerequisite reasoning   | “What prerequisites are needed before taking CSC 8820?”              | Graph path accuracy
Aggregation              | “How many courses in Robinson College mention Python?”               | Numerical accuracy
Temporal trend           | “Has AI policy language changed between Fall 2024 and Spring 2026?”  | LLM-judged quality

Reference answers will be manually authored by the team using a sample of 20–30 syllabi reviewed in full.
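For the set-valued categories (e.g., topic lookup), precision/recall/F1 can be computed directly by comparing the course codes the system returns against the hand-authored reference set, as in this minimal sketch:

```python
def prf1(predicted: set, reference: set):
    """Precision, recall, and F1 for a set of predicted vs. reference items."""
    tp = len(predicted & reference)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1
```

The guards against empty sets matter in practice: a query for which the system returns nothing should score zero precision, not raise a division error.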


12. Containerization Plan

# docker-compose.yml (outline)
services:
  neo4j:
    image: neo4j:5
    ports: ["7474:7474", "7687:7687"]
    volumes:
      - neo4j_data:/data
  qdrant:
    image: qdrant/qdrant
    ports: ["6333:6333"]
    volumes:
      - qdrant_data:/qdrant/storage
  postgres:
    image: postgres:18
    ports: ["5432:5432"]
    volumes:
      - pg_data:/var/lib/postgresql/data
  app:
    build: ./app
    depends_on: [neo4j, qdrant, postgres]
    ports: ["8000:8000"]
    environment:
      - OLLAMA_BASE_URL=http://<external-ollama-host>:11434   # external endpoint
      - OLLAMA_MODEL=llama3.1
      - OLLAMA_EMBED_MODEL=nomic-embed-text
      - NEO4J_URI=bolt://neo4j:7687
      - QDRANT_URL=http://qdrant:6333
      - POSTGRES_URL=postgresql://user:pass@postgres:5432/syllabi
  web-ui:
    build: ./web-ui
    depends_on: [app]
    ports: ["3000:3000"]

volumes:
  neo4j_data:
  qdrant_data:
  pg_data:

Note: Ollama is not containerized locally — the system connects to an external on-prem Ollama endpoint for both text generation and vector embedding. All other components (Neo4j, Qdrant, PostgreSQL, application, web UI) run as Docker containers orchestrated by Docker Compose.
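The app container would read the Compose environment above at startup; a minimal sketch, with defaults mirroring the docker-compose values (the function name and dict keys are illustrative):

```python
import os

def load_config() -> dict:
    """Read service endpoints from the environment, with Compose defaults."""
    return {
        "ollama_base_url": os.environ.get("OLLAMA_BASE_URL", "http://localhost:11434"),
        "ollama_model": os.environ.get("OLLAMA_MODEL", "llama3.1"),
        "neo4j_uri": os.environ.get("NEO4J_URI", "bolt://neo4j:7687"),
        "qdrant_url": os.environ.get("QDRANT_URL", "http://qdrant:6333"),
        "postgres_url": os.environ.get(
            "POSTGRES_URL", "postgresql://user:pass@postgres:5432/syllabi"),
    }
```

Centralizing endpoint configuration this way keeps the external Ollama host swappable without code changes: only OLLAMA_BASE_URL differs between development and the on-prem deployment.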