What Is AI, ML, and Generative AI?
Before diving in, here's the hierarchy: AI ⊃ machine learning ⊃ generative AI. Each term is a subset of the one before it.
When people say "AI" today, they usually mean Large Language Models (LLMs) — a specific type of generative AI trained on text. This guide focuses primarily on LLMs and how they evolve into autonomous agents.
Understanding the Model Zoo
Not all models are built the same. Here's a taxonomy that helps you know what's what.
Base / Foundation Models
Trained on massive text corpora to predict the next word. Broad knowledge but unfocused — "raw intelligence" that needs tuning to be useful.
Llama base, GPT base weights
Instruction-Tuned Models
Base models fine-tuned to follow instructions and converse. This is what makes Claude and ChatGPT helpful — they understand what you're asking.
Claude, ChatGPT, Gemini
Reasoning Models
Trained via reinforcement learning to "think step by step." They spend extra compute on hard problems — math, logic, coding — for higher accuracy.
OpenAI o3, DeepSeek-R1
Mixture-of-Experts (MoE)
Many specialized "expert" sub-networks. Only a few activate per input — faster and cheaper while maintaining quality.
Llama 4 Scout, Qwen3-235B
Pre-trained Only
Learned from raw text. Knows a lot but unpredictable — might complete your sentence or give a random fact. Not optimized for dialogue.
Stage: Pre-training only
Fine-Tuned / SFT
Shown curated Q&A pairs (Supervised Fine-Tuning) after pre-training. Learns how to respond helpfully like an assistant.
Pre-train → SFT
RLHF-Aligned
Humans rank model outputs. The model learns to prefer higher-rated responses. Makes models safe, honest, and helpful.
Pre-train → SFT → RLHF
RLVR (Verifiable Rewards)
2025 breakthrough: instead of human ratings, feedback comes from verifiable answers (math, code tests). Dramatically improves reasoning.
Pre-train → SFT → RLVR
Text-Only LLMs
Process and generate text. The most common type.
GPT-4, Claude, Llama
Multimodal
Handle text + images + audio + video.
GPT-4o, Gemini, Claude
Image Gen
Generate images from text descriptions.
DALL-E 3, Midjourney, Flux
Audio
Generate/understand speech and music.
Whisper, Suno, ElevenLabs
Video
Create or edit video from text prompts.
Sora, Runway, Veo
Code
Specialized for programming tasks.
Codex, Claude Code
The 3-Stage Pipeline
Building an LLM is like training an expert: broad education → specialization → learning judgment.
Pre-Training
Analogy: A student who read every book in the library but hasn't learned how to answer exam questions.
Fine-Tuning (SFT)
Analogy: That student takes a course on "how to give good answers in interviews."
Alignment (RLHF/RLVR)
Analogy: A mentor watches their answers and says "this one's better, that one needs work."
Almost all modern LLMs use the Transformer architecture (2017). Its key innovation is the attention mechanism — the model looks at all parts of the input simultaneously to decide which parts are most relevant. This enables understanding context across long passages.
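The core computation can be sketched in a few lines. This toy version (plain Python, a single head, no learned projection matrices) shows the essential move: each query scores every key, the scores become weights via softmax, and the output is a weighted blend of the value vectors:

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability; weights sum to 1
    exps = [math.exp(x - max(xs)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(queries[0])
    outputs = []
    for q in queries:
        # Score each key against this query, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Output = attention-weighted mix of all value vectors
        outputs.append([sum(w * v[i] for w, v in zip(weights, values))
                        for i in range(len(values[0]))])
    return outputs
```

Because every query attends to every key at once, context from anywhere in the input can influence any output position — this is the "looks at all parts simultaneously" property.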
What Models Can't Do (Yet)
Understanding the boundaries is as important as knowing capabilities. These aren't bugs — they're inherent properties.
Hallucinations
Generate plausible but factually wrong text. They predict likely words, not true words. They pattern-match, they don't "know."
Knowledge Cutoff
Only know training data up to a specific date. Can't access real-time info unless connected to tools (search, APIs).
Context Window Limits
Can only process a fixed amount of text at once. Even 200K-token windows have practical limits: models tend to lose track of details buried in the middle of very long inputs.
Non-Deterministic
Same question, different answers. LLMs sample from probability distributions. Makes testing harder.
Weak at Precise Math
Word predictors, not calculators. Struggle with arithmetic and formal logic — though reasoning models are improving this.
No True Understanding
Manipulate statistical patterns, not concepts. Lack common sense, embodied experience, and genuine world knowledge.
These limitations are exactly why we build agents — systems wrapping a model with tools, memory, and guardrails. The model is the brain; the agent is the whole body. And the industry keeps solving these limitations — context windows keep growing, reasoning keeps improving, and tool integration keeps deepening.
How Agents Are Built
An agent wraps a model in a loop with memory, tools, and the ability to act — turning a chatbot into an autonomous worker.
Agent = Model + Loop + Memory + Tools. The loop gives the model multiple chances to think, act, observe results, and adjust. This is what gives AI systems agency.
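That equation can be sketched in a few lines. This is a minimal illustrative loop, not any particular framework's API; `call_model` and the tool set are stand-ins, and real agents add retries, tracing, and richer memory:

```python
# Minimal agent loop sketch: Reason -> Act -> Observe, bounded by max_steps.
# `call_model` and `tools` are hypothetical stand-ins for an LLM and real tools.

def run_agent(task, call_model, tools, max_steps=5):
    history = [f"Task: {task}"]           # short-term memory: the running transcript
    for _ in range(max_steps):            # the loop is what grants agency
        decision = call_model(history)    # Reason: model picks the next step
        if decision["action"] == "final":
            return decision["answer"]     # stopping condition
        tool = tools[decision["action"]]  # Act: invoke the chosen tool
        observation = tool(decision["input"])
        history.append(f"Observed: {observation}")  # Observe: feed result back
    return "Gave up after max_steps"      # guardrail against infinite loops
```

The orchestrator's real job is everything around that loop: step budgets, error handling, and deciding when the model is done.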
The 4 Core Components
1. The Model (Brain)
The LLM doing the reasoning. Interprets requests, selects tools, generates responses. The reasoning engine at the center.
Claude, GPT-4, Gemini, Llama
2. Memory (State)
Short-term: Current conversation context window.
Long-term: Persisted facts, preferences, learned info stored externally and retrieved when needed.
3. Tools (Hands)
External capabilities: web search, DB queries, code execution, APIs. Tools give the model data (what it can see) and actions (what it can do).
Search, APIs, code runners, MCP servers
4. The Loop / Orchestrator
The workflow driving Reason → Act → Observe. Manages step count, stopping conditions, error handling. Turns a single model call into an autonomous workflow.
LangGraph, CrewAI, custom code
Agent Composition Patterns
Single Agent
One model in a loop with tools. Good for focused tasks: Q&A, coding, analysis.
Simplest
Multi-Agent
Specialized agents collaborate: one researches, one writes, one reviews. They pass messages.
CrewAI, AutoGen
Orchestrator + Workers
A "boss" breaks tasks into subtasks and delegates. Workers execute; the orchestrator synthesizes.
Most scalable
MCP: USB-C for AI
MCP is an open standard (by Anthropic, Nov 2024) that standardizes how AI models connect to external tools and data. Adopted by OpenAI, Google, and all major platforms by mid-2025. Full spec and SDKs at modelcontextprotocol.io.
The 3 Server Capabilities
An MCP server can expose three categories of capabilities. Think of them as: nouns, verbs, and templates.
Resources (Nouns)
Read-only data the model can access. Files, database rows, API responses. Each resource has a URI, name, and MIME type. The model reads them for context.
Data the model can SEE
Tools (Verbs)
Actions the model can invoke. Each tool has a name, description, and JSON input schema. The model decides when to call them. Examples: run SQL query, send email, create file.
Actions the model can DO
Prompts (Templates)
Reusable prompt templates the host can render. Pre-built workflows like "analyze these logs" or "summarize this doc." Users or agents select the right prompt for the task.
Workflows the model can FOLLOW
Transport Layer
MCP servers communicate over standard transports (stdio for local servers, streamable HTTP for remote ones); no proprietary protocols are needed.
Before MCP, every AI app needed custom integrations for every tool — an M×N problem. MCP reduces this to M+N: build one server for your tool, and every AI app that speaks MCP can use it. GitHub, Google Drive, Slack, Postgres — all accessible through a single standard. The MCP ecosystem now has thousands of community servers.
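Concretely, a tool is advertised as plain JSON: a name, a description, and a JSON Schema for its arguments. A sketch of that shape (the field names follow the MCP spec's tool listing; the `query_orders` tool itself is made up):

```python
import json

# Shape of one tool as an MCP server advertises it when a client lists tools.
# Field names follow the MCP spec; the "query_orders" tool is hypothetical.
tool_declaration = {
    "name": "query_orders",
    "description": "Run a read-only query against the orders database.",
    "inputSchema": {                      # JSON Schema describing valid arguments
        "type": "object",
        "properties": {
            "customer_id": {"type": "string"},
            "since": {"type": "string", "format": "date"},
        },
        "required": ["customer_id"],
    },
}

# The model reads this schema and emits matching arguments when it calls the tool.
print(json.dumps(tool_declaration, indent=2))
```

Because every tool is described this way, any MCP-speaking host can discover and call it without custom integration code — that's the M+N payoff.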
A2A: Agents Talking to Agents
If MCP connects agents to tools, A2A connects agents to each other. Launched by Google in April 2025, now governed by the Linux Foundation with 150+ partner organizations. Spec at a2a-protocol.org.
MCP = Agent ↔ Tool communication (how an agent uses a database, API, or file system).
A2A = Agent ↔ Agent communication (how a research agent delegates to a data analysis agent from a different vendor).
Together, they form the "nervous system" of multi-agent enterprise systems.
A2A Core Components
Agent Card
A JSON file at /.well-known/agent.json that describes an agent's name, capabilities, skills, supported modalities, auth requirements, and endpoint URL. Think of it as a resume that lets other agents discover what you can do.
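A minimal Agent Card might look like the sketch below. The field names follow the A2A spec's card structure; the agent, URL, and skill are invented for illustration:

```python
import json

# A minimal Agent Card sketch, as served at /.well-known/agent.json.
# Field names follow the A2A spec; this particular agent is hypothetical.
agent_card = {
    "name": "research-agent",
    "description": "Finds and summarizes sources on a given topic.",
    "url": "https://agents.example.com/a2a",   # endpoint other agents call
    "version": "1.0.0",
    "capabilities": {"streaming": True},
    "defaultInputModes": ["text/plain"],
    "defaultOutputModes": ["text/plain"],
    "skills": [
        {"id": "summarize", "name": "Summarize", "description": "Condense documents."}
    ],
}

print(json.dumps(agent_card, indent=2))
```

A client agent fetches this card first, checks the skills and supported modes, then starts sending tasks to the listed URL.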
Tasks
The unit of work. A client agent sends a task; the remote agent processes it. Tasks have lifecycle states: submitted → working → completed/failed. Supports long-running async work.
Work management
Messages & Parts
Agents communicate via messages containing "parts" — text, files, structured data, or forms. Supports multi-modal exchange. Agents negotiate which content types they understand.
Communication
Artifacts
The output of a completed task. Could be generated text, files, data, or any deliverable. Artifacts are returned to the client agent for use in its own workflow.
Deliverables
The Art of Feeding the Right Info
If the LLM is a CPU and its context window is RAM, context engineering is the art of loading exactly the right data into memory — coined by Andrej Karpathy & Tobi Lütke (Shopify CEO) in mid-2025.
Building Blocks of Context
RAG — Retrieval-Augmented Generation
The most common context engineering technique. It has two continuous phases: indexing (chunk, embed, store), which re-runs whenever new data arrives, and query (retrieve, assemble, answer), which runs on every request.
1. Chunk: split documents into pieces (fixed-size, semantic, or recursive).
2. Embed: convert each chunk to a vector with an embedding model (BGE, Cohere Embed).
3. Store: index the vectors in a vector database (Weaviate, Chroma).
4. Query: embed the user's question with the same embedding model, optionally combined with keyword search (hybrid) and reranking.
5. Assemble: pack the top-scoring chunks into the prompt to fit the context window.
6. Answer: the model responds grounded in the retrieved context.
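To make the retrieval step concrete, here's a toy sketch in which a bag-of-words counter stands in for a real embedding model; the pipeline shape (embed, score, rank, select) is the same one production RAG systems use:

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy 'embedding': a word-count vector (stands in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, top_k=2):
    """Embed the query, score every stored chunk, return the best matches."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:top_k]

chunks = [
    "Refunds are processed within 14 days.",
    "Our office is closed on public holidays.",
    "Refund requests must include the order number.",
]
top = retrieve("when are refunds processed?", chunks)
# `top` would then be packed into the prompt as grounding context.
```

Real systems swap the counter for a learned embedding model and the sorted list for an approximate-nearest-neighbor index, but the data flow is identical.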
Search Types Within (and Beyond) RAG
Retrieval has evolved dramatically. Here are the 5 approaches you'll encounter — from simplest to most sophisticated:
1. Vector Search
Compare embedding similarity. Finds semantically related content even without exact keyword matches. Good for natural-language queries over document corpora.
Dense retrieval
2. Keyword / BM25
Traditional text matching. Excellent for exact terms, names, code identifiers, error messages. Fast, explainable, cheap.
Sparse retrieval
3. Hybrid + Rerank
Combine vector + keyword results, then use a reranker model to sort by true relevance. The production-standard RAG setup for most knowledge bases.
Best of both worlds
4. GraphRAG (Evolving)
Microsoft's approach: the LLM extracts a knowledge graph from documents (entities + relationships), builds community hierarchies, and queries across the graph. Handles multi-hop questions vector search can't — like "what are our termination rights if the supplier fails quality standards for 3 quarters?" The March 2026 release (LazyGraphRAG) dramatically reduced indexing costs. Strong for legal, medical, and enterprise knowledge work.
Knowledge graphs · Multi-hop reasoning
5. Agentic Search (New paradigm)
Instead of pre-indexing everything into a vector store, the agent explores at runtime using tools like grep, glob, and read. It's iterative: query → result → "not quite" → refined query → better result → synthesize. Made famous by Claude Code — Boris Cherny (creator) publicly explained why Anthropic dropped RAG + vector DB in favor of agentic search: simpler, no staleness, no privacy issues, and "outperformed everything else by a lot."
How Claude Code uses it: When asked to understand a codebase, Claude Code doesn't run embedding search. It explores — lists directories, greps for patterns, reads key files, follows imports, checks tests and docs, builds understanding iteratively. No pre-indexing required. The model's reasoning drives the search strategy.
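To make the idea concrete, here's a toy sketch with an in-memory "repo" and grep/read-style tools. In a real agent the LLM, not a fixed script, decides each next step; the point is the iterative refine-and-follow pattern:

```python
import re

# Agentic-search sketch: explore at runtime with grep/read tools instead of a
# pre-built vector index. The tiny in-memory repo and the fixed four-step
# strategy below are stand-ins for an LLM choosing each step.

repo = {
    "auth/login.py": "def login(user): return check_password(user)",
    "auth/passwords.py": "def check_password(user): ...",
    "docs/readme.md": "Start with auth/login.py",
}

def grep(pattern):
    """Return paths whose contents match the pattern (like `grep -l`)."""
    return [path for path, text in repo.items() if re.search(pattern, text)]

def read(path):
    return repo[path]

# Iterative exploration: locate, read, follow the call, refine the query.
hits = grep(r"def login")                               # step 1: find the entry point
source = read(hits[0])                                  # step 2: read it
callee = re.search(r"return (\w+)\(", source).group(1)  # step 3: spot the call it makes
defs = grep(rf"def {callee}")                           # step 4: refined query for the callee
```

No index was built, and the "understanding" (login delegates to check_password) emerged from following the code, which is why this approach never goes stale.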
When to use what: Agentic search wins for code exploration, fresh data, security-sensitive contexts. Vector RAG/GraphRAG still wins for large static corpora, concept search, multi-hop reasoning. Most production systems now use hybrid approaches — agentic as the backbone, with semantic index only where needed.
Runtime exploration · No indexing · Iterative
AI That Writes Code
Coding agents go beyond autocomplete — they read your entire codebase, make multi-file changes, run tests, and iterate. The hottest category in AI tooling right now.
Claude Code (Anthropic)
Terminal-based agent powered by Claude Opus/Sonnet. Runs locally — reads your filesystem, executes in your terminal, uses your git. Deep codebase awareness. Highest SWE-bench score (80.9%). Supports Agent Teams for multi-agent workflows.
Local · Terminal · 4% of all GitHub commits
Codex (OpenAI)
Dual-mode: cloud agent (async tasks in sandboxed containers via ChatGPT) + Codex CLI (local terminal). Powered by GPT-5.3-Codex. Leads Terminal-Bench 2.0 at 77.3%. Very token-efficient — 4x fewer tokens than Claude Code.
Cloud + Local · Async tasks
Use Claude Code for architecture and complex features (higher quality). Use Codex for autonomous/async tasks and cost-sensitive workloads (more efficient). Use OpenCode if you want open-source freedom and multi-provider flexibility. Use Cursor if you prefer IDE-based workflow. Many developers use multiple tools.
AI That Lives in Your Daily Tools
The next frontier: agents that integrate with your messaging apps (Telegram, WhatsApp, Slack), manage your calendar, email, and tasks — proactively, not just when you ask.
OpenClaw (née Clawdbot)
Started as Clawdbot — a weekend WhatsApp relay by Peter Steinberger (PSPDFKit founder) in late 2025. Renamed to Moltbot after a trademark complaint, settled on OpenClaw in Jan 2026. Hit 145K+ GitHub stars and kicked off the entire personal-agent movement. Self-hosted, connects to 23+ messaging platforms, supports MCP natively, 700+ community-contributed skills on ClawHub, three-tier memory. Creator joined OpenAI in Feb 2026; project moved to a non-profit foundation.
MIT · 145K+ stars · BYOK
Nanobot
Ultra-lightweight Clawdbot-style agent from the HKUDS lab — only ~4,000 lines of Python (vs OpenClaw's 430K+). Designed for full auditability: you can read the entire codebase in an afternoon. Runs on hardware as small as a Raspberry Pi (191MB RAM). Supports Claude, GPT, DeepSeek, and local models via Ollama/vLLM. Persistent memory with knowledge graph, web search, sub-agents, Telegram + WhatsApp.
Open source · Auditable · Edge-ready
The Broader Ecosystem
Zeroclaw (minimal Rust-based infrastructure), OneClaw / OpenClaw Launch / MyClaw (managed hosting for OpenClaw), Open Interpreter (local code execution agent), and AutoGPT (the original autonomous agent, April 2023). The space is evolving fast — new projects appear weekly, many forking or riffing on OpenClaw's design.
Evolving space
What Makes Personal Agents Different
Frameworks & Tools for Building AI
The frameworks and infrastructure you'll encounter when building AI applications.
LangChain / LangGraph
Most popular LLM framework. LangChain for chains of model calls; LangGraph adds stateful graph-based agent workflows with branching and error recovery.
Python & TypeScript
CrewAI
Multi-agent framework with role-based agents (researcher, writer, reviewer) that collaborate on tasks. Great for complex workflows.
Python · Role-based agents
AutoGen (Microsoft)
Multi-agent conversations framework. Agents code, debug, and discuss with each other. Strong at collaborative problem-solving.
Python · Multi-agent chat
BAML (BoundaryML)
A domain-specific language for getting structured outputs from LLMs reliably. Write typed function contracts in .baml files; BAML generates type-safe clients for Python, TypeScript, Ruby, Go, and more. Uses Schema-Aligned Parsing (SAP) to handle messy LLM output — works even with models that don't support native tool-calling. Built-in VS Code playground for live prompt testing.
Vector Databases
Store and search embeddings for RAG. Find "similar" documents by meaning, not keywords. Core infrastructure for retrieval.
Pinecone, Weaviate, ChromaDB, pgvector
LlamaIndex
Framework connecting LLMs to your data. Handles parsing, chunking, indexing, retrieval. The "data plumbing" for AI apps.
RAG pipelines · Data connectors
Embedding Models
Convert text to numerical vectors capturing meaning. Essential for RAG — the query is embedded and compared to document embeddings.
text-embedding-3, BGE, Cohere Embed
MCP Servers (Ecosystem)
Thousands of community-built servers exposing tools via the MCP standard. Google, Slack, GitHub, Postgres — a growing ecosystem.
Open standard · Universal
Cloud AI Platforms
Managed services for running models. GPUs, APIs, hosting, scaling without managing infrastructure.
AWS Bedrock, Azure AI, Google Vertex AI
Model APIs
Simplest way to use AI: HTTP call, text in, response back. No GPU needed — pay per token.
Anthropic, OpenAI, Google APIs
Open Source & Self-Hosting
Run models locally or on your servers. Full control & privacy, but requires GPUs and ops expertise.
Ollama, vLLM, Hugging Face, llama.cpp
Observability & Eval
Monitor agent behavior, trace tool calls, evaluate quality. Essential for debugging production agents.
LangSmith, Arize, Braintrust
Who's Who: Labs & Platforms
Current as of April 2026 — this space moves weekly
| Organization | Latest Flagship Models | Known For |
|---|---|---|
| Anthropic | Claude Opus 4.6, Sonnet 4.6, Haiku 4.5 | Safety-first, 1M context (beta), adaptive thinking, top coding (SWE-bench), MCP protocol, Claude Code |
| OpenAI | GPT-5.4, o3, GPT-5.3-Codex | Consumer AI leader, reasoning models, broadest ecosystem, Codex coding agent |
| Google DeepMind | Gemini 3.1 Ultra / Pro / Flash-Lite | Top reasoning scores (GPQA, ARC-AGI), cheapest output pricing, A2A protocol, 1M context |
| Meta | Llama 4 Scout / Maverick | Open-source leader, 10M context window (largest open-weight), MoE architecture |
| xAI | Grok 4.20 | Real-time X/Twitter data, multi-agent workflows, 1M context |
| DeepSeek | DeepSeek-R1, V3, V4 | Pioneered RLVR, open weights reasoning, exceptional cost efficiency |
| Alibaba / Qwen | Qwen 3.5 (0.8B–235B) | Strongest open multilingual, native multimodal, Apache 2.0 |
| Zhipu AI | GLM-5 / GLM-5.1 | Top open-weight coding (94.6% of Opus 4.6 perf), MIT license |
| Mistral | Mistral Large, Mixtral | European lab, efficient open models, MoE pioneer |

| Cloud | AI Service | Offers |
|---|---|---|
| AWS | Amazon Bedrock | Claude, Llama, Mistral + fine-tuning + RAG |
| Azure | Azure OpenAI | GPT-4 on Azure, enterprise security, Copilot |
| Google Cloud | Vertex AI | Gemini + open models + ML pipelines |
| Hugging Face | Hub + Inference | Open model hub, 500K+ models, community |
Your Learning Path
From "I know nothing" to "I can build agents" — pick your starting point.
Courses and articles teach concepts. Daily use teaches capabilities and limitations. Pick a real problem from your work or life and try to solve it with Claude, ChatGPT, or Gemini every single day. Notice where the model shines and where it breaks. Each time you hit a wall, you'll naturally invent a workaround — better prompting, adding context, chaining steps, adding tools.
That's how everyone becomes genuinely productive with AI. No amount of theory replaces the intuition you build by using it on your actual problems. The people getting the most out of AI today aren't the ones who read the most papers — they're the ones who used it the most. In every industry, the individuals who figure out workarounds for today's AI limitations are the ones transforming their productivity and, eventually, their field.
Stage 1 — Understand the Basics (Week 1–2)
Stage 2 — Use AI Effectively (Week 2–4)
Stage 3 — Build with AI (Week 4–8)
Stage 4 — Go Deep (Ongoing)
Two High-Leverage Career Paths
If you're a software engineer wondering where to focus, the AI boom is creating two distinct high-demand tracks. Both are hiring aggressively, both will be central to every industry for the next decade, and both reward deep technical skill.
Track 1 — AI Infrastructure Engineering
The "picks and shovels" of the AI boom. Every frontier lab, every hyperscaler, and every serious AI startup is bottlenecked on infrastructure. If you enjoy systems programming, distributed systems, and making hardware sing, this is where the highest-impact (and highest-paid) work is — and the problems get harder every year as models scale.
Foundational layers
Hardware & Accelerators
Understand the silicon: NVIDIA H100/H200/B200, GB200 NVL72, upcoming Rubin/Vera, AMD MI300X / MI325X / MI450X, Google TPU v5/v6, AWS Trainium/Inferentia. Know memory bandwidth, tensor cores, FP8/FP4 quantization, HBM3e/HBM4, and why GPU choice drives the entire cluster design.
NVIDIA · AMD · TPU · Custom silicon
High-Performance Networking
Training a 400B+ model means moving terabytes of gradients between thousands of GPUs. You need: InfiniBand (NDR 400G, XDR 800G), RDMA, ConnectX-7/8 NICs, RoCE v2 on Ethernet, and NVLink/NVSwitch for intra-node. Topology matters: fat-tree, rail-optimized, dragonfly. Latency in nanoseconds — not milliseconds.
InfiniBand · RDMA · NVLink · RoCE
Orchestration & Scheduling
Kubernetes with the NVIDIA GPU Operator for inference, Slurm or Kueue for batch training, Ray for distributed Python. Gang scheduling, topology-aware placement, job preemption, and multi-tenant isolation are hard problems with real revenue impact.
K8s · Slurm · Ray · Nomad
Observability & Reliability
GPU clusters fail constantly — a single flaky NIC can halt a training run worth millions. Instrument everything: Prometheus + Grafana, DCGM for GPU health, OpenTelemetry for tracing, NCCL debugging, checkpoint/restart, and fault-tolerant training (see Llama 3 paper's failure analysis — 419 interruptions in 54 days).
Prometheus · DCGM · OTel · NCCL
Two distinct specializations: Training vs Inference
These look similar but are completely different disciplines with different bottlenecks, different hardware preferences, and increasingly different teams. Pick one to go deep in — or know both and be very rare.
AI Model Training Infrastructure
The challenge: Synchronously coordinate 10,000+ GPUs for weeks without the whole training run collapsing. Gradients, weights, optimizer states — all moving at line rate across a purpose-built network.
Key frameworks: Megatron-LM, DeepSpeed, PyTorch FSDP, NeMo, Nanotron. 3D parallelism = data parallel + tensor parallel + pipeline parallel, plus sequence / expert parallelism for MoE.
Specific challenges: communication overhead (AllReduce bottleneck), stragglers, hardware failures mid-run, gradient explosion, checkpoint I/O, MFU (Model FLOPs Utilization) optimization. A 1% MFU improvement on a $100M training run = $1M saved.
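MFU is simple arithmetic: the FLOPs your run actually spends on the model, divided by the cluster's theoretical peak. A back-of-envelope sketch, using the standard ~6 × params FLOPs-per-token approximation for training; the throughput and hardware numbers below are illustrative, not measurements:

```python
# Back-of-envelope MFU (Model FLOPs Utilization) estimate.
# Uses the common ~6 * params FLOPs-per-token rule of thumb for training.
# All numbers are illustrative stand-ins, not real benchmark data.

params = 70e9                 # 70B-parameter model
tokens_per_sec = 2.4e6        # hypothetical measured cluster throughput
num_gpus = 2048
peak_flops_per_gpu = 989e12   # ~989 TFLOPs, roughly an H100's dense BF16 peak

achieved = 6 * params * tokens_per_sec   # FLOPs/s actually spent on the model
peak = num_gpus * peak_flops_per_gpu     # cluster's theoretical ceiling
mfu = achieved / peak
print(f"MFU: {mfu:.1%}")
```

Frontier runs typically land somewhere in the 30-50% range; everything below the ceiling is communication overhead, stragglers, and restarts, which is exactly what this specialization optimizes.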
Megatron · DeepSpeed · FSDP · 3D parallelism
AI Model Inference Infrastructure
The challenge: Serve millions of concurrent users with unpredictable request shapes, strict latency SLOs (TTFT, ITL), while keeping GPU utilization high and cost per token low. Volume is exploding faster than training.
Key frameworks: vLLM, TensorRT-LLM, SGLang, NVIDIA Dynamo, Triton Inference Server, llm-d. Techniques: PagedAttention, continuous batching, speculative decoding, prefix caching, quantization (INT8, FP8, FP4).
Every optimization = real $ saved. A 2× inference throughput gain on a production model can save millions per month.
vLLM · TensorRT-LLM · SGLang · Dynamo
LLM inference has two fundamentally different phases that most engineers lump together:
① Prefill — Process the entire input prompt at once. Massively parallel, compute-bound, FLOPs-limited. This is where the KV cache gets built. Determines TTFT (time-to-first-token).
② Decode — Generate tokens one at a time, autoregressively. Sequential, memory-bandwidth-bound, HBM-limited. Determines ITL (inter-token latency).
Running both on the same GPU causes brutal interference — prefill jobs starve decode batches, decode waits starve prefill. Disaggregated inference (vLLM, DistServe, NVIDIA Dynamo) splits them across separate GPU pools — often H100s for compute-heavy prefill, H200s with more HBM for memory-bound decode. The KV cache is then shipped between them over high-speed interconnect (NIXL, RDMA). Real-world gains: 70%+ throughput improvements, ~90% faster TTFT. This is where a lot of 2026 inference engineering work is concentrated.
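A back-of-envelope sketch shows why decode is bandwidth-bound: at batch size 1, generating each token must stream all the weights from HBM, which puts a hard floor on inter-token latency regardless of compute. The numbers below are illustrative:

```python
# Why decode is memory-bandwidth-bound: each new token reads the full weight
# set (plus KV cache, ignored here) from HBM. Numbers are illustrative.

params = 70e9
bytes_per_param = 2            # FP16/BF16 weights
hbm_bandwidth = 3.35e12        # ~3.35 TB/s, roughly an H100's HBM3 bandwidth

weight_bytes = params * bytes_per_param
itl_floor = weight_bytes / hbm_bandwidth   # best-case inter-token latency, one GPU, batch 1
tokens_per_sec_ceiling = 1 / itl_floor

print(f"ITL floor: {itl_floor * 1e3:.1f} ms, at most {tokens_per_sec_ceiling:.0f} tok/s")
```

Batching many requests amortizes the same weight reads across users, which is exactly what continuous batching exploits, and quantizing to FP8/FP4 halves or quarters the bytes that must move per token.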
Why this track is a great bet
Models keep getting larger, inference demand is exponential, and there's a global shortage of engineers who can operate 10,000+ GPU clusters. Compensation at frontier labs is among the highest in tech. The work is durable — the physics of moving bits between accelerators doesn't get disrupted by the next model release. Every 6 months brings new hardware, new parallelism schemes, new bottlenecks to solve.
Scarce skills · Durable demand
Who to follow to learn this field
SemiAnalysis (Dylan Patel) — the definitive research firm on AI hardware, datacenter economics, supply chain, and training cost analysis. 200K+ subscribers including every major AI lab. Also runs InferenceMAX / InferenceX (continuously updated open inference benchmark across all GPUs) and ClusterMAX (cluster performance research). Subscribe first — this is the single highest-signal source in the field.
Podcasts & deep interviews: Dwarkesh Patel Podcast (long-form AI/infra interviews), Latent Space (swyx — engineering-focused AI), Asianometry (semiconductor history & business, YouTube).
Technical deep-dives: Chips and Cheese (microarchitecture reviews), Making Deep Learning Go Brrrr (Horace He), How to Scale Your Model (Google DeepMind — free book), LLM Inference Handbook (BentoML), the PyTorch blog, and the NVIDIA Developer blog.
Landmark papers: Llama 3 paper (training infra reality check), DistServe (disaggregated serving), Mooncake (KVCache architecture).
SemiAnalysis · InferenceMAX · Dwarkesh · Latent Space
Track 2 — AI Engineering (Building Agents)
The "application layer" of AI. Every industry — legal, medical, finance, real estate, manufacturing, education, customer support — needs domain-specific agents that actually work in production. This is where your business-domain knowledge compounds with AI skills to create enormous leverage.
Agent Design & Orchestration
Master the patterns: ReAct loops, planner-executor, orchestrator-workers, multi-agent teams. Know when to use LangGraph vs CrewAI vs writing your own loop. Design for graceful degradation, retries, and human-in-the-loop checkpoints. A great agent architect can save a company 100x the cost of a naive implementation.
LangGraph · CrewAI · Custom loops
Context Engineering & RAG
As Karpathy put it, "the delicate art and science of filling the context window." Hybrid search, reranking, GraphRAG, agentic search, memory design, prompt compaction. The difference between a demo that impresses and an agent that works in production is almost entirely here.
RAG · Agentic search · Memory
Integration Layer (MCP & A2A)
Build MCP servers that expose your company's data and APIs safely to any AI. Design A2A-compatible agents that can collaborate across vendor boundaries. OAuth flows, permission scoping, rate limits, audit logs — boring work that gates every enterprise AI deployment.
MCP · A2A · OAuth · Permissions
Evals, Observability & Safety
Production agents need systematic eval pipelines — not vibes. LangSmith, Braintrust, Arize for tracing + offline eval. Design golden datasets, LLM-as-judge pipelines, regression tests. Security: prompt injection defenses, tool sandboxing, PII handling. The most valuable AI engineers are the ones who ship reliably.
Evals · Tracing · Guardrails
Structured Outputs & Reliability
LLMs are sloppy by default. Tools like BAML, Instructor, and JSON Schema-based function calling turn free-form text into reliable typed data. Retry policies, fallback models, schema validation — the engineering discipline that makes LLM output trustable for downstream systems.
BAML · Instructor · Type safety
Using AI at every step
The highest-output engineers use AI throughout their workflow: Claude Code or Cursor for building, AI for code review, AI for writing docs, AI for debugging, AI for learning new libraries. They're not "prompt engineers" — they're AI-augmented engineers who ship 3–5× more than their peers. Build this habit now.
AI-native workflow · Multiplier
Which track is right for you?
| | AI Infrastructure | AI Engineering |
|---|---|---|
| You love | Low-level systems, distributed computing, hardware, performance | Product building, APIs, user problems, domain workflows |
| Typical stack | C++/Rust/Go, CUDA, Kubernetes, NCCL, Linux internals | Python/TypeScript, LLM APIs, vector DBs, web frameworks |
| Day to day | Debug why training is 30% slower, tune NCCL topology, GPU failures | Ship agents that solve real user problems, iterate on prompts/evals |
| Who hires | Frontier labs, hyperscalers, AI chip companies, HPC orgs | Every company in every industry — the hiring pool is wider |
| Background fit | Systems, HPC, networking, SRE, performance engineering | Full-stack, backend, product engineering, domain expertise |
| 10-year outlook | Exponentially growing demand as models and inference scale | Every industry will need this — from legal tech to robotics software |
Many of the best engineers move between these tracks. Infrastructure experience makes you a dramatically better AI engineer (you understand the physics of what's slow and expensive). Application experience makes you a better infrastructure engineer (you know what users actually need). Start with wherever your current skills overlap most — then deliberately stretch into the other as you grow.
From Agentic AI to Physical AI
We're in the early stage of agentic AI. What's happening in software today is a preview of what will happen in every industry over the coming decade.
The Pattern: Coding Agents Are the Canary
Software engineering is the first industry being transformed by AI agents — and it's happening in real time, so we can watch the pattern unfold. What's happening here will repeat in every other knowledge-work field.
The same arc is coming for legal research, medical diagnostics, financial analysis, scientific research, creative work, customer operations, and more. Each field will see:
① A wave of productivity gains as AI agents automate the repetitive and tedious → ② A painful adjustment period with new failure modes, ethical questions, and regulatory gaps → ③ Emergence of new roles, new standards, and new kinds of work that didn't exist before → ④ Ultimately: more output, more opportunity, but requiring different skills.
If you work in any industry, the question isn't whether AI agents are coming for it — it's when and how you position yourself to ride the wave instead of being flattened by it.
The Next Frontier: Physical AI / Embodied AI
Beyond agents that live in software, the next phase brings AI into the physical world — robots, autonomous vehicles, drones, smart factories. Many researchers call 2026 the "GPT-3 moment" for robotics.
Vision-Language-Action Models
The new architecture powering physical AI. Models that perceive (vision), understand (language), and act (motion). Companies like Figure AI, Agility Robotics, and Tesla are training humanoids with these. Few-shot learning is reaching precision robotics — instead of months of programming, robots learn from a handful of demonstrations.
VLA · Embodied reasoning
World Models
AI systems that simulate how environments evolve. They let agents predict outcomes before acting — critical for robots, autonomous vehicles, and digital twins. Combined with reinforcement learning, they enable "learning by imagining" rather than learning only by doing.
Simulation · Prediction
Industrial & Service Robots
Shipments of autonomous mobile robots are projected to grow 6× by 2030. Manufacturing, logistics, healthcare, and hospitality will all see robots co-reasoning alongside humans — flexible enough for low-volume, high-mix work that rigid automation could never handle.
Near-term deployment
Healthcare & Science
AI-assisted robotic surgery already reduces operative time ~25% and complications ~30%. Embodied AI research in healthcare grew 7× between 2019–2024. Labs are being automated, drug discovery accelerated, patient care augmented. The productivity boost to science could be civilization-scale.
Transformative · Long-term
We're still in the early innings of the agentic AI wave — the space is evolving weekly, the industry is still solving basic limitations, and model capabilities keep increasing. Every limitation discussed in Section 4 is being actively chipped away at by frontier labs. What seems cutting-edge today will feel primitive in 18 months.
The playbook is simple: start now, use it daily, build real things, stay curious. The people building intuition today — by shipping projects, hitting walls, and inventing workarounds — are the ones who will shape how AI transforms their industries tomorrow.
Glossary
- LLM
- Large Language Model — neural network trained on text to generate language.
- Token
- A chunk of text (~¾ of a word). Models process and price by tokens.
- Context Window
- Total text a model can see at once, measured in tokens.
- Transformer
- Architecture behind all modern LLMs. Uses attention mechanisms.
- Attention
- Mechanism letting models weigh which input parts are most relevant.
- Pre-Training
- Phase where a model learns language from massive datasets.
- SFT
- Supervised Fine-Tuning — training on curated Q&A examples.
- RLHF
- Reinforcement Learning from Human Feedback.
- RLVR
- RL from Verifiable Rewards — alignment via automated checks.
- Embedding
- Numerical vector representing text meaning. Similar = nearby.
- RAG
- Retrieval-Augmented Generation — fetching docs for context.
- Hallucination
- Plausible-sounding but factually wrong model output.
- Agent
- Model + loop + tools + memory = autonomous system.
- MCP
- Model Context Protocol — open standard for AI ↔ tool connections.
- A2A
- Agent-to-Agent Protocol — open standard for agent ↔ agent comms.
- Agent Card
- JSON file describing an A2A agent's capabilities & endpoint.
- BAML
- Domain-specific language for type-safe structured LLM outputs.
- MoE
- Mixture of Experts — routes inputs to specialized sub-networks.
- Function Calling
- Model's ability to invoke external tools with formatted requests.
- Vector Database
- DB optimized for storing/searching embeddings by similarity.
- Context Engineering
- Curating everything a model sees to optimize output quality.
- Temperature
- Sampling parameter controlling randomness: 0 is near-deterministic (greedy); higher values give more varied, creative output.
- Few-Shot
- Providing examples in the prompt to teach a pattern.
- ReAct
- Reason + Act pattern — the standard agent loop architecture.
- Prefill
- Inference phase that processes the full input prompt in parallel. Compute-bound. Sets TTFT.
- Decode
- Inference phase that generates tokens one-at-a-time. Memory-bandwidth-bound. Sets ITL.
- KV Cache
- Cached key/value tensors from attention layers, reused during decode to avoid recomputation.
- TTFT / ITL
- Time-To-First-Token / Inter-Token Latency — the two latency metrics that matter for inference.
- Disaggregated Serving
- Running prefill and decode on separate GPU pools, shipping KV cache between them.
- 3D Parallelism
- Data + tensor + pipeline parallelism combined — how frontier models are trained across 1000s of GPUs.
- MFU
- Model FLOPs Utilization — what % of a GPU's peak FLOPs your training run actually uses. Gold metric.