Codepth.dev

From .NET to AI Engineer — Part 5: Making RAG Conversational

Part 5 of the series on staying current by adding AI, from a .NET background. Days 14–15 of the journey — turning a one-shot RAG script into something you can actually have a conversation with, and learning how to tell whether it's working.

After Part 4 I understood what RAG is. This short, dense stage was about doing it properly: composing the pieces into a clean pipeline, making it remember a conversation, and — the part most tutorials skip — measuring whether the answers are actually any good.

Two days, two big ideas: orchestration and evaluation.

Theory

Prompt engineering is just designing an interface

Coming from .NET, the useful reframe was this: a prompt is an API contract with a fuzzy, probabilistic service. You set the system prompt to define the role and the rules, you give examples when behavior matters, and you're explicit about the output shape you want. Vague prompt, vague results — same as a vague spec.

The single most important instruction for RAG: tell the model to answer only from the provided context, and to say it doesn't know when the context doesn't cover it. That one line is most of what stops a RAG system from making things up.

LangChain and LCEL: composing the pipeline

LangChain is the orchestration layer that wires retrieval, prompting, and the model together. Its expression language (LCEL) uses a pipe operator that will feel oddly familiar — it reads like a Unix pipe or a LINQ chain:

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_template(
    "Answer using ONLY this context. If it's not here, say you don't know.\n\n"
    "{context}\n\nQuestion: {question}"
)

chain = prompt | llm | StrOutputParser()
answer = chain.invoke({"context": retrieved_text, "question": user_question})

prompt | llm | parser — data flows left to right through each stage. Composition you already think in.

Memory: turning Q&A into conversation

A bare RAG call has no idea what you asked a moment ago. Memory is just passing the prior turns back into the next prompt so follow-ups like "and what about the second one?" make sense. Conceptually trivial; the libraries mostly handle the bookkeeping. (It's also worth knowing MCP, an emerging standard for feeding tools and context to models, and LlamaIndex, a framework specialized for the RAG/retrieval side — both worth a look as you go deeper.)

Evaluation: how do you know it's good?

This is the part I almost skipped and shouldn't have. "It looks right" is not a metric. The RAG-specific ones worth knowing:

  • Context precision/recall — did retrieval fetch the right chunks, and all of them?
  • Faithfulness — does the answer actually follow from the retrieved context, or did the model wander off and invent?

Separating these tells you where a bad answer went wrong: a faithfulness failure means the model misbehaved, but a recall failure means retrieval never gave it the right material in the first place — and no amount of prompt-tweaking fixes that. Tools like RAGAS automate these checks.

Build: chat with your documents

The Stage 4 project upgraded the Part 4 engine into a multi-turn assistant: ask a question, get an answer grounded in your documents, ask a follow-up that depends on the previous answer, and have it hold the thread — with a quick evaluation pass to confirm the answers were faithful to the sources.

The takeaway

Orchestration makes RAG usable; evaluation makes it trustworthy. As engineers we'd never ship a service with no tests and no logs, yet it's strangely tempting to ship an AI feature on vibes because the demo looked good. Resist that. Measure retrieval and generation separately, and you'll always know which half to fix.

The 2-day plan (if you want to follow along)

Day Time Learn (theory) Build Why it matters Reference Output
14 ~6h Prompt engineering; LangChain & LCEL; memory Compose a RAG chain; add conversation memory Orchestration turns a one-shot script into a real assistant OpenAI prompt-engineering guide; LangChain docs A multi-turn "chat with your docs"
15 ~6h RAG evaluation — precision, recall, faithfulness Add an evaluation pass; debug a wrong answer to its cause "Looks right" isn't a metric — measure it like any service RAGAS; LlamaIndex docs A RAG app with an evaluation report

What I used to learn this

Related code: https://github.com/ashaniwale-codestack/llm-agents-experiments

Next up — Part 6: Agents & Tool Use, where the model stops just answering and starts deciding what to do — and I learned that the real difference between a chatbot and an agent is who's in control.