From .NET to AI Engineer — Part 2: Where My SQL Finally Paid Off

Part 2 of the series on staying current by adding AI, from a .NET background. We're at Days 4–7 of the journey — the data engineering stage, where years of writing SQL stopped being baggage and turned into an advantage.

If Stage 1 was about unlearning an instinct, Stage 2 was the opposite — it was the moment a whole pile of things I already knew suddenly transferred over. AI runs on data, and getting data into a clean, queryable shape is plumbing. As a backend developer, I'd been doing plumbing for years.

This stage took four days. The theory was mostly "here's how the thing you already know works in Python," and the build was a proper ETL pipeline.

Theory

Analytical SQL is still SQL

The first pleasant surprise: window functions. If you've used ROW_NUMBER(), RANK(), or running totals in T-SQL, you already know these. They're how you compute things like a moving average in the database, before the data ever reaches Python:

SELECT symbol, trade_date, close,
       AVG(close) OVER (
           PARTITION BY symbol
           ORDER BY trade_date
           ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
       ) AS moving_avg_7
FROM prices;

That's the same 7-day moving average from Part 1, computed in SQL instead of Pandas. Knowing which layer should do a calculation — database vs. application — is judgment you already have.

SQLAlchemy, four ways (and how it maps to .NET)

SQLAlchemy is the Python data-access library, and the thing that confused me at first is that it offers several styles at once. Once I mapped them onto the .NET world, it clicked:

Raw / textual SQL — you write the SQL string yourself. Like ADO.NET with a SqlCommand.
Core (expression language) — you build queries with Python expressions instead of strings. Close to a lightweight query builder.
ORM, declarative — you define classes that map to tables and work with objects. This is Entity Framework's world.

The raw style, done safely with bound parameters (never string-concatenate user input — same rule as always):

from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg://user:pass@localhost/market")

with engine.connect() as conn:
    rows = conn.execute(
        text("SELECT symbol, close FROM prices WHERE close > :min_price"),
        {"min_price": 100},
    )

And the ORM style, which will feel like home if you've used EF:

from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class Price(Base):
    __tablename__ = "prices"
    id: Mapped[int] = mapped_column(primary_key=True)
    symbol: Mapped[str]
    close: Mapped[float]

Pulling data, and handling too much of it

Real pipelines pull from APIs (requests is the standard library for that) and frequently deal with more data than fits comfortably in memory. The answer is chunking — process the data in batches:

import pandas as pd

for batch in pd.read_csv("huge_feed.csv", chunksize=100_000):
    transform_and_load(batch)

Parquet, and validating data like you'd test code

Two more tools earned permanent spots. Parquet is a columnar, compressed, typed file format — far faster and smaller than CSV for analytical data, and it remembers column types so you don't re-guess them every load. And Great Expectations lets you write assertions about your data — "this column is never null," "prices are positive" — so a bad upstream feed fails loudly instead of poisoning everything downstream. Think of it as unit tests for data.

Build: an API-to-database pipeline

The Stage 2 project tied it together: pull market data from an API with requests, transform and clean it with Pandas (in chunks), validate it with Great Expectations, and load it into PostgreSQL through SQLAlchemy. Extract, transform, load — the classic shape, built with production habits rather than a one-off script.

The takeaway

This is the stage where I stopped feeling like a beginner. The tools have new names, but the concepts — querying, schemas, batching, validation, safe parameterization — are the ones you've been using for years. If you're a backend developer, lean into this stage. It's your home advantage, and a clean data layer is what every later AI stage quietly depends on.

The 4-day plan (if you want to follow along)

Day	Time	Learn (theory)	Build	Why it matters	Reference	Output
4	~5h	Analytical SQL — window functions, CTEs	Write queries that compute moving averages and rankings in SQL	Pushing work into the database is a judgment call you already have	Pandas "Comparison with SQL" docs; window-function guides	A query producing rolling metrics
5	~6h	SQLAlchemy — raw vs Core vs ORM; the `requests` library	Connect to PostgreSQL; pull data from an API	This is EF/ADO.NET in Python — your data-access instincts transfer	SQLAlchemy 2.0 tutorial; Requests docs	Data fetched from an API into a DataFrame
6	~6h	Chunking large data; Parquet; data validation	Process a large file in batches; save as Parquet; add Great Expectations checks	Real feeds are big and messy — batching and validation keep you sane	Pandas scaling docs; Great Expectations docs	A validated Parquet dataset
7	~7h	Consolidate Stage 2	The full ETL pipeline: API → transform → validate → PostgreSQL	A clean data layer is what every later AI stage depends on	—	Pipeline pushed to GitHub with a README

What I used to learn this

Pandas docs — Comparison with SQL: https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_sql.html
The SQLAlchemy 2.0 tutorial: https://docs.sqlalchemy.org/en/20/tutorial/
Requests quickstart: https://requests.readthedocs.io/
Great Expectations docs: https://docs.greatexpectations.io/

Related code: https://github.com/ashaniwale-codestack/stock_alert_intelligence_system

Next up — Part 3: Embeddings & Vector Search, where text turns into numbers a machine can actually compare, and "AI" starts to feel real.