From .NET to AI Engineer — Part 2: Where My SQL Finally Paid Off
Part 2 of the series on staying current by adding AI, from a .NET background. We're at Days 4–7 of the journey — the data engineering stage, where years of writing SQL stopped being baggage and turned into an advantage.
If Stage 1 was about unlearning an instinct, Stage 2 was the opposite — it was the moment a whole pile of things I already knew suddenly transferred over. AI runs on data, and getting data into a clean, queryable shape is plumbing. As a backend developer, I'd been doing plumbing for years.
This stage took four days. The theory was mostly "here's how the thing you already know works in Python," and the build was a proper ETL pipeline.
Theory
Analytical SQL is still SQL
The first pleasant surprise: window functions. If you've used ROW_NUMBER(), RANK(), or running totals in T-SQL, you already know these. They're how you compute things like a moving average in the database, before the data ever reaches Python:
SELECT symbol, trade_date, close,
AVG(close) OVER (
PARTITION BY symbol
ORDER BY trade_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS moving_avg_7
FROM prices;
That's the same 7-day moving average from Part 1, computed in SQL instead of Pandas. Knowing which layer should do a calculation — database vs. application — is judgment you already have.
SQLAlchemy, four ways (and how it maps to .NET)
SQLAlchemy is the Python data-access library, and the thing that confused me at first is that it offers several styles at once. Once I mapped them onto the .NET world, it clicked:
- Raw / textual SQL — you write the SQL string yourself. Like ADO.NET with a
SqlCommand. - Core (expression language) — you build queries with Python expressions instead of strings. Close to a lightweight query builder.
- ORM, declarative — you define classes that map to tables and work with objects. This is Entity Framework's world.
The raw style, done safely with bound parameters (never string-concatenate user input — same rule as always):
from sqlalchemy import create_engine, text
engine = create_engine("postgresql+psycopg://user:pass@localhost/market")
with engine.connect() as conn:
rows = conn.execute(
text("SELECT symbol, close FROM prices WHERE close > :min_price"),
{"min_price": 100},
)
And the ORM style, which will feel like home if you've used EF:
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column
class Base(DeclarativeBase):
pass
class Price(Base):
__tablename__ = "prices"
id: Mapped[int] = mapped_column(primary_key=True)
symbol: Mapped[str]
close: Mapped[float]
Pulling data, and handling too much of it
Real pipelines pull from APIs (requests is the standard library for that) and frequently deal with more data than fits comfortably in memory. The answer is chunking — process the data in batches:
import pandas as pd
for batch in pd.read_csv("huge_feed.csv", chunksize=100_000):
transform_and_load(batch)
Parquet, and validating data like you'd test code
Two more tools earned permanent spots. Parquet is a columnar, compressed, typed file format — far faster and smaller than CSV for analytical data, and it remembers column types so you don't re-guess them every load. And Great Expectations lets you write assertions about your data — "this column is never null," "prices are positive" — so a bad upstream feed fails loudly instead of poisoning everything downstream. Think of it as unit tests for data.
Build: an API-to-database pipeline
The Stage 2 project tied it together: pull market data from an API with requests, transform and clean it with Pandas (in chunks), validate it with Great Expectations, and load it into PostgreSQL through SQLAlchemy. Extract, transform, load — the classic shape, built with production habits rather than a one-off script.
The takeaway
This is the stage where I stopped feeling like a beginner. The tools have new names, but the concepts — querying, schemas, batching, validation, safe parameterization — are the ones you've been using for years. If you're a backend developer, lean into this stage. It's your home advantage, and a clean data layer is what every later AI stage quietly depends on.
The 4-day plan (if you want to follow along)
| Day | Time | Learn (theory) | Build | Why it matters | Reference | Output |
|---|---|---|---|---|---|---|
| 4 | ~5h | Analytical SQL — window functions, CTEs | Write queries that compute moving averages and rankings in SQL | Pushing work into the database is a judgment call you already have | Pandas "Comparison with SQL" docs; window-function guides | A query producing rolling metrics |
| 5 | ~6h | SQLAlchemy — raw vs Core vs ORM; the requests library |
Connect to PostgreSQL; pull data from an API | This is EF/ADO.NET in Python — your data-access instincts transfer | SQLAlchemy 2.0 tutorial; Requests docs | Data fetched from an API into a DataFrame |
| 6 | ~6h | Chunking large data; Parquet; data validation | Process a large file in batches; save as Parquet; add Great Expectations checks | Real feeds are big and messy — batching and validation keep you sane | Pandas scaling docs; Great Expectations docs | A validated Parquet dataset |
| 7 | ~7h | Consolidate Stage 2 | The full ETL pipeline: API → transform → validate → PostgreSQL | A clean data layer is what every later AI stage depends on | — | Pipeline pushed to GitHub with a README |
What I used to learn this
- Pandas docs — Comparison with SQL: https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_sql.html
- The SQLAlchemy 2.0 tutorial: https://docs.sqlalchemy.org/en/20/tutorial/
- Requests quickstart: https://requests.readthedocs.io/
- Great Expectations docs: https://docs.greatexpectations.io/
Related code: https://github.com/ashaniwale-codestack/stock_alert_intelligence_system
Next up — Part 3: Embeddings & Vector Search, where text turns into numbers a machine can actually compare, and "AI" starts to feel real.