Interactive Course

How This RAG System Works

An inside look at the UChicago ADS Q&A chatbot — from user question to AI-generated answer, explained for everyone.

01 What It Does 02 Meet the Cast 03 How They Talk 04 Knowledge Factory 05 Finding the Needle 06 Going to Production 07 Measuring Success
01

What This App Does

A chatbot that answers questions about UChicago's Applied Data Science program — and the journey your question takes behind the scenes.

Imagine you're considering a master's degree

You visit the UChicago ADS website and have a question: "How much does the program cost?" Instead of digging through dozens of web pages, you type it into a chat box — and get an instant, accurate answer with links to the source pages.

That's what this app does. It's a chatbot powered by AI that has 'read' every page on the ADS program website and can answer questions about admissions, courses, tuition, careers, and more.

What happens when you ask a question

From the outside, it looks simple: type → answer. But behind the scenes, your question goes on quite a journey.

1
You type a question

In English, Chinese, or any language. The chat interface captures your words.

2
The system searches its knowledge

It hunts through 1,012 pre-processed text chunks from 147 web pages to find the most relevant information.

3
A judge re-checks the results

A specialized AI model re-scores the candidates, keeping only the top 5 most relevant chunks.

4
An AI composes the answer

Google's Gemini model reads the relevant chunks and writes a natural-language answer, word by word, streaming to your screen in real time.

5
Sources appear below

Only the pages the AI actually cited are shown — so you can verify every claim.

This pattern has a name: RAG

This approach — searching for relevant information first, then letting AI compose an answer from it — is called RAG (Retrieval-Augmented Generation).

💡
Why not just ask the AI directly?

Large language models like GPT or Gemini are trained on general internet data — they don't know specific, up-to-date details about UChicago's ADS program (like current tuition or the latest course list). RAG solves this by giving the AI the right information to read before it answers. Think of it as giving a brilliant student the textbook before the exam, rather than asking them to guess.
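The whole retrieve-then-generate pattern fits in a few lines of Python. This is a toy sketch, not the project's actual code: `retrieve` here scores chunks by naive word overlap (the real system uses vector and keyword search, covered in Modules 04-05), and the final LLM call is left out.

```python
# Toy RAG loop: retrieve relevant text first, then hand it to the LLM.
def retrieve(question, chunks, top_k=2):
    """Return the top_k chunks sharing the most words with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(chunks, key=lambda c: -len(q_words & set(c.lower().split())))
    return ranked[:top_k]

def build_prompt(question, context_chunks):
    """Give the LLM the retrieved text to read before it answers."""
    sources = "\n".join(f"[Doc{i}] {c}" for i, c in enumerate(context_chunks, 1))
    return f"Answer using ONLY these sources:\n{sources}\n\nQuestion: {question}"

chunks = [
    "Tuition for the program is charged per course.",
    "The capstone project spans two quarters.",
    "Applications are due in June.",
]
context = retrieve("How much does tuition cost?", chunks)
prompt = build_prompt("How much does tuition cost?", context)
# The prompt now leads with the tuition chunk as [Doc1]
```

The real pipeline swaps in much smarter retrieval and an actual Gemini call, but the shape is identical: search first, then generate from what was found.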

Check your understanding

Scenario

You're building a chatbot for your company's HR department. Employees ask questions like "How many vacation days do I get?" The answers change every year when new policies are published.

Should you use RAG or just let the AI answer from its training data?

Why does this system show source links below the answer?

02

Meet the Cast

Every system has its characters. Here are the main actors that work together to answer your questions.

Six actors, one mission

Think of this system like a hospital emergency room. Each specialist has a role — the triage nurse, the doctor, the lab — and they hand off patients (your question) to each other in a well-rehearsed sequence.

🖥️
Frontend (React)

The chat interface you see and type into — captures your question, displays the streaming answer

Backend API (FastAPI)

The coordinator — receives your question, orchestrates all the steps, streams the answer back

🔍
Retriever (FAISS + BM25)

The searcher — hunts through 1,012 chunks using two different strategies and combines results

⚖️
Reranker (Cross-Encoder)

The quality judge — re-scores search results with a smarter model, keeping only the best 5

🧠
LLM (Gemini 2.5 Flash)

The writer — reads the retrieved chunks and composes a natural-language answer

🏭
Chunk Pipeline (offline)

The prep cook — runs once to scrape, split, deduplicate, and embed all 147 web pages into searchable chunks

Where they live in the code

backend/ All server-side code
main.py The coordinator — FastAPI server + /ask endpoint
retrieval.py Hybrid search + reranking logic
prompt.py Prompt construction + citation parsing
onnx_models.py Fast CPU inference for embedder + reranker
embedder.py Gemini embedding wrapper (alternative)
eval.py Evaluation framework — test how good retrieval is
scripts/
prepare_chunks.py Offline pipeline: scrape → chunk → embed → index
frontend/src/ React chat interface
App.tsx Main chat logic + SSE streaming
components/
ChatMessage.tsx Renders messages + sources
ChatInput.tsx Text input box
SampleQuestions.tsx Sidebar with example questions

How the server boots up

When the server starts, it loads everything into memory so it's ready to answer questions instantly. This is the "opening the ER" moment — all equipment must be ready before the first patient arrives.

CODE
@asynccontextmanager
async def lifespan(app: FastAPI):
    embedder = OnnxEmbedder()
    cross_encoder = OnnxCrossEncoder()
    chunks = json.load(open("chunked_documents_dedup.json"))
    faiss_index = faiss.read_index("uchicago_ads_faiss_dedup.index")
    bm25 = BM25Okapi([tokenize_for_bm25(c["text"]) for c in chunks])
    llm = genai.Client(api_key=GOOGLE_API_KEY)
    yield  # hand control to FastAPI; requests are served from here on
PLAIN ENGLISH

This code runs once when the server starts up (the "lifespan" of the app)...

Load the embedder — the model that converts text into numbers (vectors) for similarity search

Load the cross-encoder — the smarter judge that re-scores search results

Load all 1,012 pre-processed text chunks from a file

Load the pre-built search index — a structure optimized for lightning-fast similarity search

Build a keyword search index from all chunk texts (for the other search strategy)

Connect to Google's Gemini AI for generating answers

Check your understanding

Scenario

You want to add a new feature: showing the user how confident the system is in its answer (e.g., "High confidence" or "Low confidence").

Which file would you most likely modify?

03

How the Pieces Talk

Follow your question on its journey from typing to answer — and see how data flows through the system like packages in a shipping network.

The question's journey

Click "Next Step" to trace what happens when you ask "What are the admission requirements?"

🖥️
Frontend
Backend
🔍
Retriever
⚖️
Reranker
🧠
Gemini

If the components could text each other

Here's that same flow, but as a group chat. Watch how each component talks to the next one in the chain.

🔒 RAG Pipeline Group Chat

How the answer appears word by word

You've probably noticed the answer types out like someone is writing it in real time. This isn't a trick — the LLM generates one word at a time, and each word is sent to your browser immediately via a technique called SSE.

CODE
for chunk in response:
    token = chunk.text
    full_answer += token
    yield f"data: {json.dumps({'type':'token','content':token})}\n\n"
PLAIN ENGLISH

For each word/token the AI generates, one at a time...

Grab the text of this token

Add it to the full answer we're building up

Immediately send it to the browser as a server-sent event — so the user sees it appear in real time

💡
Why stream instead of waiting?

If the system waited for the complete answer before sending it, the user would stare at a blank screen for 3-5 seconds. Streaming makes the AI feel conversational — like texting with someone who types fast. This is why ChatGPT, Gemini, and Claude all stream their answers too.
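What does the receiving end do with those `data:` lines? The real frontend handles this in TypeScript inside App.tsx; here is a toy Python parser showing the client side of the same SSE contract, just to make the framing concrete. Each event is a `data: <json>` line terminated by a blank line, which is exactly why the server code above appends `\n\n`.

```python
import json

def parse_sse(stream_text):
    """Toy SSE client: each event is a 'data: <json>' line ended by a blank line."""
    tokens = []
    for frame in stream_text.split("\n\n"):
        if frame.startswith("data: "):
            event = json.loads(frame[len("data: "):])
            if event.get("type") == "token":
                tokens.append(event["content"])
    return "".join(tokens)

stream = (
    'data: {"type": "token", "content": "Hello"}\n\n'
    'data: {"type": "token", "content": " world"}\n\n'
)
parse_sse(stream)  # → "Hello world"
```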

Check your understanding

Scenario

A user reports that answers sometimes show irrelevant source links below the response.

Based on what you've learned about the pipeline, where would you look first?

04

The Knowledge Factory

Before the system can answer any questions, it needs to "read" every page on the ADS website and organize that knowledge — like a librarian cataloging books before opening day.

From web pages to searchable knowledge

This process runs once, offline, before the app goes live. It transforms 147 raw web pages into 1,012 searchable text chunks.

1
Scrape the website

Download all 147 pages from the UChicago ADS site — admissions, courses, FAQ, faculty, careers, etc.

2
Label each page by topic

Automatically classify pages as "admission," "course," "career," "fee," etc. by counting keyword hits.

3
Smart chunking

Split pages into smaller pieces — but intelligently, respecting paragraph boundaries and FAQ structures.

4
Remove duplicates

Many pages repeat the same info. Deduplicate to go from 1,294 chunks down to 1,012.

5
Convert to vectors

Turn each text chunk into a 384-dimensional vector (a list of 384 numbers) that captures its meaning.

6
Build the search index

Store all vectors in a FAISS index — a special data structure that can search millions of vectors in milliseconds.
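What does "closest vector" actually mean? Here is the core idea with made-up 3-dimensional vectors. The real system uses 384 dimensions produced by an embedding model, and FAISS exists to do this comparison fast at scale, but the underlying measure is plain cosine similarity:

```python
import math

def cosine(a, b):
    """Similarity of two vectors: 1.0 = same direction, 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dim "embeddings"; the real system uses 384 dims from a model.
vectors = {
    "tuition chunk":  [0.9, 0.1, 0.0],
    "capstone chunk": [0.1, 0.8, 0.2],
}
query = [0.85, 0.15, 0.05]  # pretend embedding of "How much does it cost?"
best = max(vectors, key=lambda name: cosine(query, vectors[name]))
# best == "tuition chunk": its vector points in nearly the same direction
```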

Why you can't just feed whole pages to the AI

Chunking is one of the most important decisions in a RAG system. Here's why this project does it so carefully:

📐

Too big = diluted

A full page about "Admissions" has 2,000 words. If the AI gets the whole thing, the answer about GRE requirements is buried in noise.

✂️

Too small = lost context

Splitting mid-sentence can falsely attribute info. A chunk saying "90% placement rate" from USF's page might be mistaken for UChicago's stat.

🎯

Just right = smart boundaries

This system splits at paragraph and FAQ item boundaries, and prefixes each chunk with [Page Title] so the AI always knows the source context.

Three chunking strategies for three page types

Not all pages look the same. The system detects the page structure and picks the right strategy — like a filing clerk who handles books, folders, and index cards differently.

CODE
def extract_content(url, html):
    soup = BeautifulSoup(html, "html.parser")
    if _has_accordion(soup):
        # FAQ pages, course lists, schedules
        return extract_accordion_items(soup)
    elif _FAQ_URL_RE.search(url):
        # Q&A-style pages
        return extract_faq_items(soup)
    else:
        # Regular prose pages
        return split_by_paragraphs(soup.get_text())
PLAIN ENGLISH

Look at each web page and decide how to split it...

If the page has collapsible sections (like an accordion menu with Q&A items)...

→ Keep each accordion item as one chunk (preserves Q&A pairs intact)

If the URL suggests it's a FAQ page...

→ Extract question-answer pairs as individual chunks

Otherwise, for normal text pages...

→ Split at paragraph boundaries (double line breaks), never mid-sentence
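Here is what the paragraph-boundary branch can look like in full. A simplified sketch: the function name matches the snippet above, but the signature and the 800-character budget are illustrative assumptions, not the project's exact code.

```python
def split_by_paragraphs(text, page_title, max_chars=800):
    """Greedily pack whole paragraphs into chunks of roughly max_chars,
    prefixing each chunk with its page title for context."""
    chunks, current = [], ""
    for para in text.split("\n\n"):          # paragraph = double line break
        para = para.strip()
        if not para:
            continue
        if current and len(current) + len(para) > max_chars:
            chunks.append(f"[{page_title}] {current}")  # close current chunk
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(f"[{page_title}] {current}")      # flush the remainder
    return chunks
```

Because the loop only ever cuts between paragraphs, no chunk can start mid-sentence, and the `[Page Title]` prefix keeps the source context attached.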

The surprising power of deduplication

Here's a plot twist: removing duplicate chunks was the single biggest improvement to search quality — bigger than the embedding model, bigger than reranking, bigger than everything else combined.

Recall@5 at each stage:

FAISS only: 0.455
+ BM25 hybrid: 0.528 (+0.073)
+ Boost + rerank: 0.663 (+0.135)
+ Deduplication: 0.944 (+0.281) 🏆
💡
Why did dedup help so much?

When duplicate chunks exist, they hog the top-5 results. Imagine asking "What's the tuition?" and finding that 3 of your 5 results are the same sentence from 3 different pages. That leaves only 2 slots for other relevant info. Removing duplicates freed up those slots for diverse, useful chunks. This is the old engineering adage that data quality beats model quality, and it's a lesson that applies everywhere.
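A dedup pass can be as simple as hashing normalized text. This is a sketch only; the project's actual matching rules may be fuzzier than the exact-match-after-normalization shown here:

```python
import hashlib

def deduplicate(chunks):
    """Keep only the first occurrence of each normalized chunk text."""
    seen, unique = set(), []
    for chunk in chunks:
        # Lowercase and collapse whitespace so trivial variants hash the same
        normalized = " ".join(chunk["text"].lower().split())
        digest = hashlib.sha1(normalized.encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique

chunks = [
    {"text": "Tuition is charged per course."},
    {"text": "tuition is  charged per course."},  # same content, new spacing
    {"text": "The capstone spans two quarters."},
]
deduplicate(chunks)  # → only 2 chunks survive
```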

Check your understanding

Scenario

You're building a RAG system for a cooking website. One recipe page has a long list of 20 recipes, each with a title, ingredients, and instructions. Your chunking currently splits the page every 800 characters.

What's likely to go wrong, and how would you fix it?

05

Finding the Needle

How the system searches 1,012 chunks to find the 5 most relevant ones — using two different strategies, seven boosting signals, and a final judge.

Two ways to find what you're looking for

Imagine searching for a book in a library. You could search the catalog by keyword (title, author) or by topic (books about similar subjects). This system does both — and combines the results.

📝

BM25 — Keyword Search

Looks for exact word matches: "tuition" finds chunks containing "tuition." Also expands synonyms: "tuition" → "cost," "fee," "price." Great for specific terms, but misses rephrased concepts.

🧮

FAISS — Meaning Search

Converts your question and every chunk into numbers (vectors), then finds the chunks whose numbers are closest to yours. Catches synonyms and rephrasings, but can miss rare proper nouns.
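BM25 itself is just a formula over term counts. The project uses the rank_bm25 library's BM25Okapi; below is a condensed pure-Python version of the same scoring idea, to show what "keyword search" computes under the hood:

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query (classic BM25)."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N    # average document length
    df = Counter()                           # in how many docs each term appears
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)                      # term frequency within this doc
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores

docs = [
    "tuition fee per course".split(),
    "capstone project two quarters".split(),
]
bm25_scores(["tuition"], docs)  # first doc scores > 0, second scores exactly 0.0
```

Notice that the capstone document scores exactly zero for "tuition": exact-match search is blind to meaning, which is precisely why FAISS runs alongside it.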

Combining results: Reciprocal Rank Fusion

Each search method returns its own ranking. RRF merges them into one master ranking where a chunk ranked high by both methods gets the best combined score.

CODE
for rank, idx in enumerate(faiss_order, start=1):
    rrf[idx] += 1.0 / (k + rank)
for rank, idx in enumerate(bm25_order, start=1):
    rrf[idx] += 1.0 / (k + rank)
# k=60 by default, so 1st place contributes 1/61
PLAIN ENGLISH

Go through the FAISS rankings and give each chunk a score based on its position: 1st place gets 1/61, 2nd gets 1/62, etc.

Do the same for BM25 rankings — each chunk gets another score added to its total

A chunk ranked #1 in both lists gets 1/61 + 1/61 = 0.033. A chunk ranked #1 in one but #100 in the other gets 1/61 + 1/160 = 0.023. Being good in both wins.

Seven signals that fine-tune the ranking

After RRF fusion, the system applies extra signals — like a search engine giving bonus points to pages that match your intent perfectly.

Heading boost If a chunk's heading matches your query words (≥50% overlap), boost its score. "Core Courses" heading + "core courses" query = boost.
Label boost If the chunk's topic label (admission, course, career) matches what you're asking about, add 0.03 to score.
URL penalty Gently penalize the 2nd, 3rd chunk from the same page (score × 0.9ⁿ) to ensure diverse sources.
Synonym expansion "professor" also searches for "faculty," "instructor," "teacher" — domain-specific synonyms.
Stemming "courses" → "course," "studying" → "study" — so word forms don't prevent matches.
Stopword removal Ignores common words like "the," "is," "what" that don't carry meaning.
Page chunk swap If 3+ chunks from the same page are in the top results, replace fragments with the full page chunk.
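One of these signals, the URL penalty, fits in a few lines. Here is a sketch of the score × 0.9ⁿ idea described above; the data shapes and names are illustrative, not the project's exact code:

```python
from collections import defaultdict

def apply_url_penalty(ranked, decay=0.9):
    """The n-th chunk from the same page keeps score * decay**n,
    nudging repeats down so the top results span diverse pages."""
    seen = defaultdict(int)
    adjusted = []
    for score, chunk in ranked:
        n = seen[chunk["url"]]                    # how many earlier hits from this page
        adjusted.append((score * decay ** n, chunk))
        seen[chunk["url"]] += 1
    return sorted(adjusted, key=lambda pair: -pair[0])

ranked = [
    (1.00, {"url": "/tuition", "id": "t1"}),
    (0.95, {"url": "/tuition", "id": "t2"}),      # 2nd hit from the same page
    (0.90, {"url": "/scholarships", "id": "s1"}),
]
[c["id"] for _, c in apply_url_penalty(ranked)]   # → ["t1", "s1", "t2"]
```

The second tuition chunk drops from 0.95 to 0.855, so the scholarships chunk overtakes it: a soft penalty, not a hard cap, which is exactly what keeps sources diverse without ever discarding a strong match outright.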

The final judge: Cross-Encoder Reranking

The top 15 results go to a cross-encoder — a more powerful (but slower) model that reads the question and each chunk together to judge relevance. It keeps only the top 5.

CODE
def rerank(question, candidates, cross_encoder, top_k=5):
    pairs = [(question, c.text) for c in candidates]
    scores = cross_encoder.predict(pairs)
    ranked = sorted(zip(scores, candidates),
                    key=lambda pair: pair[0], reverse=True)
    return [c for _, c in ranked[:top_k]]
PLAIN ENGLISH

Take the question and the top 15 candidate chunks...

Create 15 pairs: (question + chunk 1), (question + chunk 2), etc.

Feed all 15 pairs into the cross-encoder model — it returns a relevance score for each

Sort by score, highest first

Return only the top 5 — these are the best-of-the-best chunks that go to the LLM

Why not use the cross-encoder for everything?

The cross-encoder is hundreds of times slower than FAISS: it takes ~0.9 seconds to score just 15 pairs, while FAISS searches all 1,012 chunks in 0.003 seconds. That's why the system uses a "funnel" approach: fast search first (1,012 → 15), then expensive re-scoring (15 → 5). This is a common pattern in search engines.

Check your understanding

Scenario

A user asks "How much does it cost?" and the system returns 3 chunks about tuition from the same page, plus 2 chunks about scholarships from a different page. The user only sees tuition info in the answer.

What engineering pattern in this system is designed to prevent this exact problem?

06

Going to Production

How the system is optimized for speed, packaged for deployment, and kept alive in the cloud — the engineering that turns a prototype into a product.

Making AI models run fast on cheap hardware

The cross-encoder and embedder models originally ran on PyTorch — great for research, but slow on the affordable CPU servers this app runs on. The solution: convert them to ONNX format.

PyTorch (before): 2.0s
ONNX (after): 0.9s
Result: 55% faster

How ONNX inference works

CODE
class OnnxCrossEncoder:
    def __init__(self):
        self.session = ort.InferenceSession(
            "models/cross_encoder.onnx",
            providers=["CPUExecutionProvider"])
        self.tokenizer = AutoTokenizer.from_pretrained(
            "models/cross_encoder_tokenizer")
PLAIN ENGLISH

Create the cross-encoder (the "final judge" from Module 5)...

Load the pre-exported ONNX model file (87MB, baked into the Docker image — no internet needed)

Tell it to use CPU (not GPU) — this is a budget-friendly server

Load the tokenizer — the tool that converts text into numbers the model understands

⚠️
A real-world gotcha this project discovered

The sentence-transformers library has a built-in ONNX backend, but it still calls HuggingFace's API at runtime to download model config — and that API is rate-limited on Cloud Run. The fix: export models locally and bake the .onnx files directly into the Docker image. No runtime downloads, no rate limits.

The deployment setup

The app runs on Google Cloud Platform with two services working together:

🐳

Cloud Run (Backend)

Runs the Python Docker container with FastAPI, models, and data. Scales from 0 to 3 instances automatically. 4GB memory, 4 CPUs.

🌐

Firebase Hosting (Frontend)

Serves the React app as static files via a global CDN. Fast loading from anywhere in the world, costs almost nothing.

The cold start problem (and a clever fix)

Cloud Run can scale to zero — meaning if nobody asks a question for a while, the server shuts down. When the next user arrives, it takes 20-30 seconds to restart (load models, build index). This is called cold start.

The fix? Cloud Scheduler pings the server's health endpoint every 5 minutes, keeping it warm — like leaving the car engine running in a parking lot so it's ready when you need it.

CODE
# From deploy.sh
gcloud scheduler jobs create http keepalive \
  --schedule="*/5 * * * *" \
  --uri="$SERVICE_URL/health" \
  --http-method=GET
PLAIN ENGLISH

Create a scheduled task called "keepalive"...

Run it every 5 minutes (cron syntax: "at minute 0, 5, 10, 15..." of every hour)

Send a simple GET request to the /health endpoint — just enough to keep the server awake

Check your understanding

Scenario

You're deploying a similar RAG system. Your cross-encoder takes 5 seconds per request on CPU. Users are complaining about slow responses.

What optimization approach from this project would you try first?

07

Measuring Success

How do you know if a RAG system actually works? This project built a rigorous evaluation framework — and the results are eye-opening.

The trap of "it looks right"

It's tempting to just ask a few questions, eyeball the answers, and call it done. But this project discovered something important: automated self-evaluation is dangerously optimistic.

🤖

Auto-evaluation

Recall@5 = 0.994 — looks amazing! But the system was grading its own homework. It marked chunks as "relevant" based on its own scores, not real relevance.

👤

Manual evaluation

Recall@5 = 0.944 — still great, but honest. A human read each of the 55 test queries and hand-labeled which chunks could actually answer the question. This exposed real gaps in career-related queries.

💡
The golden rule of evaluation

Never let a system grade its own work. In machine learning this is a form of evaluation leakage, a close cousin of "data leakage": the test uses information it shouldn't have, so the score is inflated. The manual test set with 55 queries and hand-labeled answers is this project's "source of truth." If you're building AI tools, always invest in honest evaluation; it's the only way to know if your system actually works.

The scorecard

Recall@5: 94.4%
MRR: 97.3%
NDCG@10: 95.0%
Faithfulness: 96.2%

These metrics come from 55 manually annotated test queries across 7 categories (admission, course, fee, capstone, career, application, contact) and RAGAS end-to-end evaluation.
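Recall@5 and MRR are simple to compute once you have hand-labeled relevant chunks per query. A minimal sketch with two toy queries (the chunk IDs are made up for illustration):

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of queries where at least one relevant chunk appears in the top k."""
    hits = sum(1 for got, rel in zip(retrieved, relevant) if set(got[:k]) & rel)
    return hits / len(retrieved)

def mrr(retrieved, relevant):
    """Mean of 1/rank of the first relevant chunk per query."""
    total = 0.0
    for got, rel in zip(retrieved, relevant):
        for rank, chunk_id in enumerate(got, start=1):
            if chunk_id in rel:
                total += 1 / rank
                break
    return total / len(retrieved)

# Two toy queries: what the system returned vs. hand-labeled relevant IDs
retrieved = [["c3", "c7", "c1"], ["c9", "c2", "c4"]]
relevant  = [{"c7"}, {"c4"}]
recall_at_k(retrieved, relevant)  # → 1.0 (both queries hit within top 5)
mrr(retrieved, relevant)          # → (1/2 + 1/3) / 2 ≈ 0.417
```

Note how MRR rewards ranking the relevant chunk early: both queries count as recall hits, yet MRR is well below 1.0 because neither relevant chunk came first.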

The complete picture

You've now seen every piece of the puzzle. Here's how they all fit together:

🏭
Offline: Knowledge Factory

147 pages → smart chunking → dedup → 1,012 chunks → vectors → FAISS index

Server startup: Load everything into memory

Embedder + cross-encoder (ONNX) + chunks + FAISS index + BM25 + Gemini connection

🔍
Per question: Hybrid search + RRF + 7 signals

BM25 keywords + FAISS vectors → fuse rankings → boost headings/labels → penalize same-URL repeats

⚖️
Rerank: Cross-encoder picks top 5

ONNX cross-encoder re-scores 15 → 5 candidates. Page chunk swap if 3+ from same page.

🧠
Generate: Gemini streams the answer

Prompt with [Doc1-5] context → stream tokens via SSE → parse citations → show only cited sources

What you can do now

After completing this course, you can:

🎯

Steer AI coding tools

Ask for specific patterns: "Use RRF to fuse BM25 and FAISS results," "Add a cross-encoder reranking step," "Export the model to ONNX for CPU inference."

🐛

Debug RAG problems

Know where to look: irrelevant sources → citation parser. Slow responses → cross-encoder latency. Wrong answers → chunking quality.

📐

Make architecture decisions

Evaluate tradeoffs: embedding model speed vs. accuracy, hard caps vs. soft penalties, PyTorch vs. ONNX, min-instances vs. keepalive pings.

Final challenge

Scenario

You're building a RAG system for a legal firm. Your retrieval metrics look great (Recall@5 = 0.95), but lawyers are complaining that answers sometimes include facts not found in the source documents.

Which metric should you check, and what might be the root cause?

If you could only make ONE improvement to a RAG system, which would give the biggest bang for the buck?

You did it! 🎓

You now understand how a production RAG system works — from web scraping to streaming answers. Use this knowledge to build better AI applications and debug them with confidence.