From .NET to AI Engineer — Part 4: I Built a RAG App Without an LLM

Part 4 of the series on staying current by adding AI. Days 11–13 — a project that fell short in exactly the right way, and taught me what RAG actually is.

By this point I could embed text and search it by meaning (Part 3). So I built something useful: a tool that takes my résumé and a job description and tells me how well they match — and what I should improve to fit the role better. Pure semantic search, no large language model anywhere in it.

It did half the job beautifully and fell flat on the other half. That gap was the best teacher in the whole series.

What worked, and what didn't

"Find the parts of this résumé that line up with this job description" — great. Semantic search is built for that. It surfaced the overlapping skills and experience even when the wording was completely different.

But the question I actually cared about was: "What's missing? What should I add, improve, or learn to match this role better?"

On that, it had nothing useful to offer. It could hand me the relevant passages from both documents, but it couldn't compare them, weigh the gap, or tell me what to do next. And the lightbulb finally went on:

Retrieval finds relevant text. It does not read it, reason about it, or tell you what it means.

Semantic search is a brilliant librarian who finds the right pages — but never reads them, never compares them, never advises you. "What should I improve?" needs someone to actually read both documents and reason about the difference. That someone is the LLM.
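To make the librarian concrete, here is a minimal sketch of what a pure retrieval step hands back. The three-dimensional vectors are invented for illustration (real ones would come from an embedding model, as in Part 3), and `best_matches` is a hypothetical helper, not code from the project repo.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_matches(resume_chunks, jd_chunks):
    """For each JD chunk, return (jd_text, closest_resume_text, score)."""
    matches = []
    for jd_text, jd_vec in jd_chunks:
        best_text, best_vec = max(resume_chunks, key=lambda rc: cosine(jd_vec, rc[1]))
        matches.append((jd_text, best_text, cosine(jd_vec, best_vec)))
    return matches


# Toy embeddings, made up for this example only.
resume_chunks = [
    ("Built REST services in ASP.NET Core", [0.9, 0.1, 0.0]),
    ("Led a team of four developers", [0.1, 0.9, 0.1]),
]
jd_chunks = [
    ("Experience designing web APIs", [0.8, 0.2, 0.1]),
    ("Hands-on with Kubernetes", [0.0, 0.1, 0.9]),
]

matches = best_matches(resume_chunks, jd_chunks)
for jd_text, resume_text, score in matches:
    print(f"{score:.2f}  {jd_text!r} -> {resume_text!r}")
```

The output is a list of passage pairs with similarity scores and nothing else. Nothing in it compares the two documents or suggests a next step.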

That's the whole point of RAG

RAG = Retrieval-Augmented Generation — and it's three steps, not two:

Retrieval — find the relevant chunks (the Part 3 machinery).
Augmented — add those retrieved chunks into the model's prompt as context.
Generation — the model reasons over that added context to produce the answer.

I'd built only the first step and called it RAG. It wasn't — it was search. The moment I retrieved the relevant parts of the résumé and the job description, fed them to a model as context, and asked "where's the gap, and what should I learn?", I got a real, useful answer. Retrieval did the finding; the model did the reasoning.
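The three steps can be sketched in a few lines. This is an illustration under stated assumptions, not the project's actual code: `retrieve` and `generate` are hypothetical callables standing in for a vector-store query and a model call, and the prompt wording is made up.

```python
from typing import Callable, List


def build_prompt(question: str, resume_chunks: List[str], jd_chunks: List[str]) -> str:
    """Augmentation: put the retrieved chunks into the prompt as context."""
    resume_block = "\n".join(f"- {chunk}" for chunk in resume_chunks)
    jd_block = "\n".join(f"- {chunk}" for chunk in jd_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Resume excerpts:\n{resume_block}\n\n"
        f"Job description excerpts:\n{jd_block}\n\n"
        f"Question: {question}\n"
    )


def rag_answer(
    question: str,
    retrieve: Callable[[str, str], List[str]],
    generate: Callable[[str], str],
) -> str:
    # 1. Retrieval: find relevant chunks from each document.
    resume_chunks = retrieve(question, "resume")
    jd_chunks = retrieve(question, "job_description")
    # 2. Augmentation: add them to the prompt.
    prompt = build_prompt(question, resume_chunks, jd_chunks)
    # 3. Generation: the model reasons over the added context.
    return generate(prompt)


# Stand-ins so the sketch runs without a vector store or an API key.
def fake_retrieve(question: str, source: str) -> List[str]:
    canned = {
        "resume": ["Built REST services in ASP.NET Core"],
        "job_description": ["Hands-on experience with Kubernetes"],
    }
    return canned[source]


def fake_generate(prompt: str) -> str:
    return f"[model reply based on a {len(prompt)}-character prompt]"


print(rag_answer("Where is the gap, and what should I learn?", fake_retrieve, fake_generate))
```

Passing the retriever and the model in as functions keeps the three steps visible, and makes it easy to swap the fakes for real calls later.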

The debugging that nobody warns you about

Adding the LLM is also where the unglamorous, real-world friction lives. The errors I lost time to, so you don't have to:

The key isn't loaded. Setting a value in a .env file does nothing until you actually load the file (load_dotenv() from the python-dotenv package) and read the variable under exactly the same name. Half my "authentication failed" errors were really "the key was never read."
Wrong chunks for the question. My queries kept pulling résumé chunks when I wanted job-description chunks. The fix: store a source tag on each chunk and filter by metadata at query time.

results = collection.query(
    query_embeddings=model.encode([question]).tolist(),
    n_results=5,
    where={"source": "job_description"},   # only search the JD's chunks
)

Misaligned data. When you zip chunks together with their metadata, a single off-by-one shifts every label onto the wrong chunk. Check alignment early.
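One way to make misalignment hard to introduce is to build the ids, texts, and metadata in a single loop, then verify them before anything is stored. The sketch below is an assumption about how that could look: `build_records` and `check_alignment` are hypothetical helpers, and the id scheme (`source-index`) is invented for the example.

```python
def build_records(chunks_by_source):
    """Build ids, documents, and metadatas in lockstep so they cannot drift apart."""
    ids, documents, metadatas = [], [], []
    for source, chunks in chunks_by_source.items():
        for i, text in enumerate(chunks):
            ids.append(f"{source}-{i}")
            documents.append(text)
            metadatas.append({"source": source})
    return ids, documents, metadatas


def check_alignment(ids, documents, metadatas):
    """Raise ValueError if the lists differ in length or a label does not match its id."""
    if not (len(ids) == len(documents) == len(metadatas)):
        raise ValueError("ids, documents, and metadatas have different lengths")
    for chunk_id, meta in zip(ids, metadatas):
        if not chunk_id.startswith(meta["source"] + "-"):
            raise ValueError(f"{chunk_id} is labelled as {meta['source']!r}")


ids, documents, metadatas = build_records({
    "resume": ["Built REST services in ASP.NET Core", "Led a team of four"],
    "job_description": ["Hands-on experience with Kubernetes"],
})
check_alignment(ids, documents, metadatas)  # passes silently when everything lines up
# The three lists could then be passed to the vector store together,
# for example via ChromaDB's collection.add(ids=..., documents=..., metadatas=...).
```

Running the check right after chunking surfaces the off-by-one at the point where it was introduced, instead of at query time when the wrong chunks come back.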

None of these are conceptual — they're the papercuts of wiring a real system together. Expect them, and they stop being scary.
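For the first papercut, a small guard can turn a confusing authentication error into an obvious one. This is a sketch using only the standard library: `require_env` is a hypothetical helper and `LLM_API_KEY` is a placeholder name, not any provider's real variable. In a real script you would call `load_dotenv()` first so the .env values are in the environment.

```python
import os


def require_env(name, environ=os.environ):
    """Return the variable's value, or fail loudly if it was never loaded."""
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Was the .env file loaded, "
            "and does the variable name match exactly?"
        )
    return value


# A near-miss name is the typical failure: the key exists, but under another name.
env_with_typo = {"LLM_APIKEY": "secret-value"}
try:
    require_env("LLM_API_KEY", env_with_typo)
except RuntimeError as err:
    print(err)
```

Failing at startup with the variable's name in the message makes "the key was never read" distinguishable from a key the provider actually rejected.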

When do you even need RAG?

The biggest thing this project taught me is when not to reach for RAG at all. If a document fits comfortably in the model's context window, just send the whole thing — no retrieval needed. RAG earns its complexity only when one of these is true:

Scale — too much text to fit in context
Freshness — the data changes and you don't want to retrain anything
Privacy — the data has to stay in your own store
Cost — sending only relevant chunks is cheaper than sending everything
Citations — you need to point back to the source
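The decision rule above can be written down as a small checklist function. Treat it as a rough sketch: the "about 4 characters per token" estimate is a crude assumption for English text rather than a real tokenizer, and the function name, flags, and reserved-token default are all invented for illustration.

```python
def rough_token_count(text):
    # Assumption: roughly 4 characters per token. Use a real tokenizer for accuracy.
    return len(text) // 4 + 1


def reasons_for_rag(
    document,
    context_window_tokens,
    reserved_tokens=2000,      # room left for the question and the answer
    changes_often=False,
    must_stay_private=False,
    cost_sensitive=False,
    needs_citations=False,
):
    """Return the reasons RAG is worth its complexity; an empty list means send the whole document."""
    reasons = []
    if rough_token_count(document) > context_window_tokens - reserved_tokens:
        reasons.append("scale")
    if changes_often:
        reasons.append("freshness")
    if must_stay_private:
        reasons.append("privacy")
    if cost_sensitive:
        reasons.append("cost")
    if needs_citations:
        reasons.append("citations")
    return reasons


resume_text = "word " * 600  # stand-in for a short document
print(reasons_for_rag(resume_text, context_window_tokens=8000))
```

An empty list for a short résumé is the "just send the whole thing" case; any non-empty result names the reason retrieval is earning its keep.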

A related realization: when you chat with an AI assistant and it "remembers" what you just said, that's usually not retrieval at all — it's the whole conversation sitting in the context window. Retrieval is what you add when the knowledge is too big, too fresh, or too private to fit there.
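A minimal sketch of that kind of "memory", assuming the role/content message shape that many chat APIs resemble (the exact fields vary by provider, and `add_turn` is a hypothetical helper):

```python
def add_turn(history, role, content):
    """Append one message; the whole list is what gets sent on each request."""
    history.append({"role": role, "content": content})
    return history


history = []
add_turn(history, "user", "I mostly work in C# and ASP.NET Core.")
add_turn(history, "assistant", "Got it. What would you like to build?")
add_turn(history, "user", "Which language did I say I use?")

# The "memory" is just the earlier turns, re-sent in full. No search happens.
payload = {"messages": history}
print(len(payload["messages"]), "messages sent with the latest request")
```

The model can answer the last question only because the first message travels along with it in the same request.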

The takeaway

Build the broken version on purpose. Stripping the LLM out of RAG and watching exactly which questions fall short teaches you, in an afternoon, what a dozen architecture diagrams can't: retrieval finds, generation reasons, and you need both. It's the single most clarifying thing I did in the whole series.

The 3-day plan (if you want to follow along)

| Day | Time | Learn (theory) | Build | Why it matters | Reference | Output |
| --- | --- | --- | --- | --- | --- | --- |
| 11 | ~6h | How retrieval differs from reasoning | Job-matcher with pure semantic search | Seeing search succeed and fail shows you what it is | (your project repo) | A working résumé↔JD relevance matcher |
| 12 | ~7h | Wiring in an LLM; API keys and `.env` gotchas | Feed retrieved chunks to a model; debug the auth/loading errors | The papercuts of real integration; better to meet them now | provider quickstart docs | The matcher explaining the gap and what to improve |
| 13 | ~6h | When to use RAG vs. just sending the whole doc | Add metadata filtering; write up the decision rule | Knowing when not to use RAG is half the skill | (none) | Finished project pushed to GitHub with a README |

What I used to learn this

Your chosen model provider's quickstart (for keys and the first call)
The ChromaDB query/filtering docs: https://docs.trychroma.com/
LlamaIndex: https://developers.llamaindex.ai/python/framework/understanding/rag/
MCP: https://modelcontextprotocol.io/docs/getting-started/intro

Related code: https://github.com/ashaniwale-codestack/job-matching-engine

Next up — Part 5: RAG & Orchestration, where I made this conversational with LangChain and memory, and learned how to actually measure whether a RAG system is any good.