The Machine That Predicts the Next Word: How LLMs Actually Work
From the attention formula to the economics of inference, a technical operator's guide to the architecture, training pipeline, and cost structure behind large language models.
The Core Insight
Everything about a large language model flows from one deceptively simple objective: predict the next token. There is no reasoning module, no knowledge store, no semantic engine in the classical sense. A neural network, having absorbed enough text to build an extraordinarily rich statistical map of how language works, uses that map at inference time to produce the most probable continuation of whatever you hand it. Understanding the mechanics is prerequisite to making sound engineering and product decisions with these systems.
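To make the objective concrete, here is a minimal sketch of next-token prediction using a toy bigram counter in place of a neural network. The bigram table, the tiny corpus, and the function names are illustrative assumptions, not how a real LLM stores its statistics; only the loop (score candidates, pick one, append, repeat) carries over.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_token_distribution(counts, prev):
    """Turn raw follow-counts into probabilities for the next token."""
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}

def generate(counts, start, max_new_tokens):
    """Greedy decoding: repeatedly append the most probable next token."""
    out = [start]
    for _ in range(max_new_tokens):
        dist = next_token_distribution(counts, out[-1])
        if not dist:  # no continuation was ever observed for this token
            break
        out.append(max(dist, key=dist.get))
    return out

corpus = "the cat sat on the mat and the cat slept".split()
model = train_bigram(corpus)
print(next_token_distribution(model, "the"))
print(generate(model, "the", 3))
```

A real model replaces the lookup table with a transformer that conditions on the whole preceding context, and production systems usually sample from the distribution rather than always taking the maximum.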
The Architecture That Changed Everything
The transformer architecture was introduced in the 2017 paper "Attention Is All You Need" by Ashish Vaswani et al., and it removed a bottleneck that had stalled progress for years. Recurrent neural networks processed sequences one element at a time, which made them slow to train and difficult to parallelize. Because information about early tokens had to survive every intermediate step, context from the start of a long sequence (say, by the 512th token) was largely gone.
The transformer's breakthrough was self-attention: each token attends to every other token in the same sequence, enabling efficient parallelization and better modeling of long-range dependencies. The operation is Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, where Q, K, and V are the Query, Key, and Value projections of the tokens and d_k is the dimension of the key vectors. Each token's query is scored against every key, the softmax turns those scores into weights, and the output is the weighted sum of the values. Because the scores for all token pairs come out of a single matrix product, no token has to wait for another.
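The attention formula can be sketched in a few lines of dependency-free Python. This is a readability-first illustration that uses nested lists as matrices; real implementations use batched tensor libraries, and the example inputs are made up.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(K[0])
    output = []
    for q in Q:
        # Score this query against every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output dimension at a time.
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return output

# A zero query scores every key equally, so the output is the mean of the values.
print(attention([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 4.0]]))
```

The outer loop over queries is what a matrix product computes in one shot, which is where the parallelism comes from.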
Multi-head attention extends this by running multiple attention mechanisms in parallel, enabling the model to focus on different aspects of the input at once. One head may track syntactic structure; another resolves pronouns; a third attends to topical coherence across distant passages. Outputs are concatenated and projected back to the residual stream.
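A rough sketch of the multi-head step follows: each head gets its own projection matrices, runs attention independently, and the per-head outputs are concatenated and passed through an output projection. The weight matrices here are hand-picked placeholders (learned in a real model), and the names `heads` and `W_o` are assumptions made for this example.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(X, W):
    """Multiply a list of row vectors X by a matrix W (list of rows)."""
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*W)] for row in X]

def attention(Q, K, V):
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

def multi_head_attention(X, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) tuples, one tuple per head."""
    head_outputs = [
        attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
        for W_q, W_k, W_v in heads
    ]
    # Concatenate the head outputs token by token, then project back.
    concat = [[x for h in head_outputs for x in h[i]] for i in range(len(X))]
    return matmul(concat, W_o)

# Two tokens, model width 2, two heads of width 1 (placeholder weights).
X = [[1.0, 2.0], [1.0, 2.0]]
head_a = ([[1.0], [0.0]], [[1.0], [0.0]], [[1.0], [0.0]])
head_b = ([[0.0], [1.0]], [[0.0], [1.0]], [[0.0], [1.0]])
identity = [[1.0, 0.0], [0.0, 1.0]]
print(multi_head_attention(X, [head_a, head_b], identity))
```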
One unavoidable constraint: the transformer relies on positional encodings to inject sequence order information, since pure attention has no inherent sense of which token came first. Without them, "the cat sat on the mat" and "the mat sat on the cat" contain the same set of tokens and become indistinguishable to the model.
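The sketch below illustrates the point with the sinusoidal encoding from the original transformer paper (many later models use other schemes). The 4-dimensional embeddings in `EMBED` are invented for the example. Comparing the sentences as unordered collections of vectors mimics what order-blind attention sees.

```python
import math

def sinusoidal_encoding(position, d_model):
    """Sinusoidal position vector: sine on even dims, cosine on odd dims."""
    pe = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe[:d_model]

# Hypothetical token embeddings, chosen only for illustration.
EMBED = {
    "the": [1.0, 0.0, 0.0, 0.0],
    "cat": [0.0, 1.0, 0.0, 0.0],
    "sat": [0.0, 0.0, 1.0, 0.0],
    "on":  [0.0, 0.0, 0.0, 1.0],
    "mat": [1.0, 1.0, 0.0, 0.0],
}

def encode(sentence, with_position):
    """Embed each word, optionally adding its position vector."""
    vectors = []
    for pos, word in enumerate(sentence.split()):
        vec = list(EMBED[word])
        if with_position:
            pe = sinusoidal_encoding(pos, len(vec))
            vec = [v + p for v, p in zip(vec, pe)]
        vectors.append(vec)
    return vectors

a = "the cat sat on the mat"
b = "the mat sat on the cat"
# Sorting discards order, leaving only the collection of vectors.
same_without = sorted(encode(a, False)) == sorted(encode(b, False))
same_with = sorted(encode(a, True)) == sorted(encode(b, True))
print(same_without, same_with)
```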
The Training Pipeline
Getting a model ready for deployment typically takes three training phases.
Phase 1: Pre-training. The model trains on vast, diverse text using next-token prediction as the objective. Parameters, ranging from millions to hundreds of billions, adjust via gradient descent and backpropagation to minimize prediction error. This is where broad language competence and world knowledge accumulate. The resource curve is steep: training compute for frontier models doubles roughly every five months, dataset sizes double roughly every eight months, and the power required for training doubles roughly every year.
Phase 2: Instruction fine-tuning. The pre-trained model learns from curated, labeled examples of instruction-following. This converts a next-token predictor into something that actually answers questions.
Phase 3: RLHF. Reinforcement Learning from Human Feedback optimizes for helpfulness, harmlessness, and honesty using human rater preferences as the signal. Major labs—OpenAI, Anthropic, Google DeepMind, Meta—now treat this as standard. The tradeoff is real: optimizing for rater preference can breed sycophancy and suppress outputs that are correct but unpopular.
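The pre-training objective in Phase 1 can be sketched as a cross-entropy loss on the next token, followed by one gradient-descent step. As a loud simplification, this example treats the logits themselves as the trainable parameters; in a real model the gradient flows back through billions of weights via backpropagation. The vocabulary size, target index, and learning rate are arbitrary.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability assigned to the correct next token."""
    return -math.log(softmax(logits)[target])

def grad_wrt_logits(logits, target):
    """Gradient of the loss w.r.t. the logits: probabilities minus one-hot."""
    grad = softmax(logits)
    grad[target] -= 1.0
    return grad

def sgd_step(logits, target, lr):
    """One gradient-descent update that lowers the prediction error."""
    grad = grad_wrt_logits(logits, target)
    return [x - lr * g for x, g in zip(logits, grad)]

logits = [0.0, 0.0, 0.0]   # a 3-token vocabulary, initially undecided
target = 1                 # index of the token that actually came next
loss_before = cross_entropy(logits, target)
loss_after = cross_entropy(sgd_step(logits, target, lr=1.0), target)
print(loss_before, loss_after)
```

Instruction fine-tuning in Phase 2 commonly reuses this same loss on curated examples; what changes is the data, not the objective.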
The Cost Structure
Training cost claims circulate everywhere and are almost always incomplete.
The amortized hardware and energy cost for frontier model training has grown at 2.4x per year since 2016. A detailed breakdown shows hardware represents 47-67% of total development cost, R&D staff 29-49%, and energy the remaining 2-6%.
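As a quick piece of arithmetic, the 2.4x annual growth rate compounds quickly. The helper below is a trivial sketch; the function name and the chosen year spans are just for illustration, and the result is a multiplier, not a dollar estimate.

```python
def cost_multiplier(years, annual_growth=2.4):
    """How many times larger the cost becomes after compounding for `years`."""
    return annual_growth ** years

for years in (1, 2, 4, 8):
    print(years, round(cost_multiplier(years), 1))
```

At this rate, costs grow by more than a thousandfold over eight years.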
The widely reported figures capture only the successful final run. DeepSeek's $5.6 million claim for its 671-billion parameter model covers just the final training run on 2,048 H800 GPUs and excludes R&D, failed experiments, hardware acquisition, and infrastructure. Direct comparison to full-cost estimates from other labs is apples-to-oranges.
Anthropic CEO Dario Amodei has stated that current frontier training spans $100 million to $1 billion. Stanford's AI Index 2025 estimates the compute cost of training GPT-4 at approximately $78-100 million.
Inference is where most organizations actually pay. In production, a common rule of thumb is that roughly 80% of the AI budget goes to inference and 20% to training. Organizations fixated on headline training costs routinely miscalculate total cost of ownership.
The genuine good news: inference pricing has collapsed. The cost of querying a model that scores at GPT-3.5 level on MMLU fell from $20.00 per million tokens in November 2022 to $0.07 by October 2024, a reduction of more than 280-fold in just under two years. Depending on the task, price drops have ranged from 9 to 900 times per year. Hardware gains, quantization, speculative decoding, distillation, and open-weight model competition are all driving the decline.
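A back-of-the-envelope inference budget can be sketched from token volumes. Every number below (request volume, token counts, per-million-token prices) is a hypothetical placeholder; substitute measured production traffic and your provider's actual price sheet.

```python
def monthly_inference_cost(requests_per_day, input_tokens, output_tokens,
                           price_in_per_million, price_out_per_million,
                           days=30):
    """Estimate monthly spend in dollars from per-request token counts."""
    daily_in = requests_per_day * input_tokens / 1_000_000 * price_in_per_million
    daily_out = requests_per_day * output_tokens / 1_000_000 * price_out_per_million
    return (daily_in + daily_out) * days

# Hypothetical workload: 100k requests/day, 1,000 tokens in, 500 tokens out.
estimate = monthly_inference_cost(
    requests_per_day=100_000,
    input_tokens=1_000,
    output_tokens=500,
    price_in_per_million=0.50,   # assumed price, USD
    price_out_per_million=1.50,  # assumed price, USD
)
print(f"${estimate:,.2f} per month")
```

Input and output tokens are priced separately here because many providers charge more for generated tokens than for prompt tokens.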
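The size of the drop can be checked directly from the two prices and the two dates quoted above. This sketch counts calendar months between November 2022 and October 2024 and converts the total reduction into an equivalent per-year factor; the annualized figure is a derived illustration, not a number from the cited reports.

```python
def months_between(start_year, start_month, end_year, end_month):
    """Whole calendar months from the start date to the end date."""
    return (end_year - start_year) * 12 + (end_month - start_month)

start_price = 20.00   # USD per million tokens, November 2022
end_price = 0.07      # USD per million tokens, October 2024

fold_reduction = start_price / end_price
elapsed_months = months_between(2022, 11, 2024, 10)
# Equivalent constant yearly factor that would produce the same total drop.
annualized = fold_reduction ** (12 / elapsed_months)

print(round(fold_reduction), elapsed_months, round(annualized, 1))
```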
Capabilities and Their Limits
In-context learning sets modern LLMs apart: the model performs new tasks from examples in the prompt at inference time, with no weight update, no gradient computation. It generalizes to unseen task formats from pattern alone. This unlocks deployment flexibility—you don't retrain to shift behavior.
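In practice, in-context learning often comes down to how the prompt is assembled. The helper below is a minimal sketch of a few-shot prompt builder; the "Input:/Output:" layout and the sentiment task are arbitrary choices for illustration, and no particular model API is assumed.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, worked examples, and a new query into one prompt."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Input: {text}")
        lines.append(f"Output: {label}")
        lines.append("")
    lines.append(f"Input: {query}")
    lines.append("Output:")  # left open for the model to complete
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Classify the sentiment of each review as positive or negative.",
    [("Loved it, would buy again.", "positive"),
     ("Broke after two days.", "negative")],
    "Exactly what I needed.",
)
print(prompt)
```

Changing the task means changing the examples in the prompt; the model's weights are untouched.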
Emergent capabilities (multi-step reasoning, code generation, summarization) surface from scaling the next-token objective across massive data and parameters. They weren't explicitly programmed, and they can't be reliably predicted before the model is trained—a consequential unsolved problem.
Two structural limits warrant operator attention:
Hallucination is architectural. The model generates statistically probable continuations. If that continuation happens to be a plausible but wrong chemical formula or made-up fact, so be it. No fine-tuning regimen has eliminated this, because the mechanism producing fluent accurate text is identical to the one producing fluent inaccurate text.
Context scaling is quadratic. Self-attention's compute cost scales as O(n^2) with sequence length. Long contexts at inference remain expensive even with extensions to 128k+ tokens. Coherence at the far edges of a long context is an unsolved engineering problem.
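The quadratic term is easy to underestimate, so here is a small arithmetic sketch. It counts only the entries of the n-by-n attention score matrix and ignores constants, other layers, and optimizations such as caching, so treat the ratio as a scaling illustration rather than a cost forecast. The context sizes are example values.

```python
def attention_score_entries(n_tokens):
    """Number of query-key pairs scored in one full self-attention pass."""
    return n_tokens * n_tokens

def relative_attention_cost(base_tokens, new_tokens):
    """How many times more pairwise scores the longer context requires."""
    return attention_score_entries(new_tokens) / attention_score_entries(base_tokens)

for n in (1_000, 8_000, 128_000):
    print(n, attention_score_entries(n))

# Growing the context 16x (8k to 128k tokens) multiplies the pair count by 16**2.
print(relative_attention_cost(8_000, 128_000))
```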
What Operators Should Take Away
LLMs are next-token predictors built on a parallelizable attention architecture that, since 2017, has scaled in ways prior NLP systems couldn't match. Training layers broad knowledge, instruction-following, and preference alignment atop that core objective. Inference costs are plummeting—roughly 10x decrease per year across performance tiers is a reasonable approximation. But frontier training costs compound upward, and running a model in production typically costs far more over its lifetime than training it cost upfront.
For engineers building on these systems: account for quadratic attention cost before extending context. Treat hallucination as an architectural problem requiring retrieval, verification, or structured output. Size inference budgets against actual production token volumes, not training headlines.
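One way to combine structured output with verification is sketched below: require the model to answer in JSON with a supporting quote, then check that the quote really appears in the retrieved source text. The field names `answer` and `quote` and the function `check_answer` are assumptions made for this example, and exact substring matching is a deliberately strict stand-in for whatever verification a real system would use.

```python
import json

REQUIRED_KEYS = ("answer", "quote")  # hypothetical response schema

def check_answer(raw_output, source_text):
    """Return (ok, reason) for a model response expected to be a JSON object."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False, "not valid JSON"
    if not isinstance(data, dict):
        return False, "not a JSON object"
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        return False, "missing keys: " + ", ".join(missing)
    quote = data["quote"]
    # Reject empty quotes and quotes that do not occur verbatim in the source.
    if not isinstance(quote, str) or not quote or quote not in source_text:
        return False, "quote not found in source"
    return True, "ok"

source = "The capital of France is Paris."
good = '{"answer": "Paris", "quote": "capital of France is Paris"}'
bad = '{"answer": "Lyon", "quote": "capital of France is Lyon"}'
print(check_answer(good, source))
print(check_answer(bad, source))
```

A check like this cannot prove an answer is correct, but it does turn an unsupported claim into a detectable failure instead of fluent text.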
- Attention Is All You Need (Vaswani et al., 2017) — arxiv
- Stanford HAI 2025 AI Index Report — Chapter 1: Research and Development
- How Much Does It Cost to Train Frontier AI Models? — Epoch AI
- LLM Inference Price Trends — Epoch AI
- Inference Unit Economics: The True Cost Per Million Tokens — Introl
- Welcome to LLMflation — Andreessen Horowitz
- Context Collapse: In-Context Learning and Model Collapse — arxiv
- Navigating the Landscape of Large Language Models — arxiv
- Transformer Model Explained: Attention Is All You Need | Alex Xu posted on the topic | LinkedIn
- The Transformer Revolution: How “Attention Is All You Need” Changed AI Forever | by Sebastian Buzdugan | Medium
- Attention Is All You Need - A Deep Dive into the Revolutionary Transformer Architecture | Towards AI
- Transformer Architecture: “Attention is All You Need” | by Mehmet Ozkaya | Medium
- Review of “Attention Is All You Need (Vaswani et al., 2017)”
- Detecting Insincere Questions from Text: A Transfer Learning Approach
- Transformer Architecture: How Attention Changed AI | Introl Blog
- adaptNMT: an open-source, language-agnostic development environment for Neural Machine Translation
- Optimizing Inference Costs: The Complete Guide | Mirantis
- Artificial Intelligence Index Report 2025 CHAPTER 1: Research and Development
- Report by the Stanford Institute (HAI): State of AI Index 2025
- The AI Price Collapse Is Real. Your Excuse to Wait Is Not. | by Jan Horecny | Medium
- AI Index 2025: Inference costs plummet, hardware trends ...
- Stanford AI Index: Inference costs drop 280x in 2 years
- Stanford AI Index 2025 — Grokipedia
- Token Burnout: Why AI Costs Are Climbing and How Product Leaders Can Prototype Smarter
- LLM Inference Cost 2026: Complete Pricing Guide
- Photons = Tokens: The Physics of AI and the Economics of Knowledge
- The LLM Cost Paradox: How "Cheaper" AI Models Are Breaking Budgets
- Densing Law of LLMs
- Observations About LLM Inference Pricing | MIRI TGT
- Tiered Super-Moore's Law: Price Evolution, Production Frontiers, and Market Competition in Large Language Model Inference Services