◆ NOISE IN → SIGNAL OUT ◆ READALCHEMIST.COM ◆ FREE / NO PAYWALL
THE DIGITAL ALCHEMIST
AI IMPACT 8

The Machine That Predicts the Next Word: How LLMs Actually Work

From the attention formula to the economics of inference, a technical operator's guide to the architecture, training pipeline, and cost structure behind large language models.

2026-06-07 · 6 MIN READ · #LLMs · #Transformers · #Attention · #RLHF · #Inference · #Training Costs · #Architecture

The Core Insight

Everything about a large language model flows from one deceptively simple objective: predict the next token. There is no reasoning module, no knowledge store, no semantic engine in the classical sense. A neural network, having absorbed enough text to build an extraordinarily rich statistical map of how language works, uses that map at inference time to produce the most probable continuation of whatever you hand it. Understanding the mechanics is prerequisite to making sound engineering and product decisions with these systems.
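To make the objective concrete, here is a minimal sketch of next-token prediction using a toy bigram counter rather than a neural network. The corpus, the function names, and the greedy decoding choice are all assumptions made for illustration; a real LLM learns a far richer distribution, but the loop of "look at the context, pick a probable next token, append, repeat" has the same shape.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_token_distribution(counts, prev):
    """Turn raw follow-counts into a probability distribution."""
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}

def greedy_continue(counts, prompt, steps):
    """Repeatedly append the single most probable next token."""
    out = list(prompt)
    for _ in range(steps):
        followers = counts.get(out[-1])
        if not followers:
            break  # no statistics for this token, so stop
        out.append(followers.most_common(1)[0][0])
    return out

# Hypothetical toy corpus; whitespace splitting stands in for a real tokenizer.
corpus = "the cat sat on the mat . the cat sat on the sofa .".split()
model = train_bigram(corpus)

print(next_token_distribution(model, "the"))
print(greedy_continue(model, ["the"], 3))
```

In this corpus "the" is followed by "cat" half the time, so greedy decoding always continues with "cat". Nothing in the loop checks whether the continuation is true, only whether it is probable.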

The Architecture That Changed Everything

The transformer architecture was introduced in the 2017 paper "Attention Is All You Need" by Ashish Vaswani et al., and it removed a bottleneck that had stalled progress for years. Recurrent neural networks processed sequences one element at a time, which made them slow and hard to parallelize. They also had to compress everything seen so far into a fixed-size hidden state, so by the time a long sequence reached its later tokens (the 512th, say), much of the earlier context had faded.

The transformer's breakthrough was self-attention: each token attends to every other token in the same sequence, enabling efficient parallelization and better modeling of long-range dependencies. The operation is Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, where each token computes Query, Key, and Value projections and scores itself against all others simultaneously. All tokens communicate in parallel, allowing efficient context understanding.
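The formula above can be sketched in plain Python, with nested lists standing in for matrices. This is an illustrative, unoptimized version with no masking or batching, and the example inputs are made up; production implementations use tensor libraries and fused kernels.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for row-major lists of vectors."""
    d_k = len(K[0])
    output = []
    for q in Q:
        # Score this query against every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return output

# Hypothetical inputs: 2 tokens, d_k = 2. Both keys are zero, so every score
# is 0 and each query attends equally to both values.
Q = [[1.0, 2.0], [3.0, 4.0]]
K = [[0.0, 0.0], [0.0, 0.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
print(attention(Q, K, V))
```

Each query row is processed independently of the others, which is what makes the operation easy to parallelize across tokens.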

Multi-head attention extends this by running multiple attention mechanisms in parallel, enabling the model to focus on different aspects of the input at once. One head may track syntactic structure; another resolves pronouns; a third attends to topical coherence across distant passages. Outputs are concatenated and projected back to the residual stream.
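A rough sketch of the multi-head pattern follows: each head applies its own query, key, and value projections, runs attention independently, and the head outputs are concatenated and projected back to the model width. The dimensions, the random weights, and the helper names are assumptions for illustration; in a trained model these matrices are learned.

```python
import math
import random

def matmul(A, B):
    """Multiply two row-major matrices given as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    K_t = [list(col) for col in zip(*K)]          # transpose of K
    scores = matmul(Q, K_t)                       # Q K^T
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(X, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) triples, one per head."""
    head_outputs = [attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
                    for W_q, W_k, W_v in heads]
    # Concatenate the per-head outputs token by token, then project.
    concat = [sum((h[i] for h in head_outputs), []) for i in range(len(X))]
    return matmul(concat, W_o)

# Hypothetical sizes: 3 tokens, model width 4, 2 heads of width 2.
rng = random.Random(0)
d_model, n_heads, d_head = 4, 2, 2
X = random_matrix(3, d_model, rng)
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
         for _ in range(n_heads)]
W_o = random_matrix(n_heads * d_head, d_model, rng)

out = multi_head_attention(X, heads, W_o)
print(len(out), len(out[0]))
```

The output has the same shape as the input (one vector of width `d_model` per token), which is what lets the result be added back into the residual stream.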

One constraint follows directly: pure attention has no inherent sense of which token came first, so the transformer relies on positional encodings to inject sequence order. Without them, "the cat sat on the mat" and "the mat sat on the cat" present the model with the same unordered set of tokens.
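The effect can be illustrated with the sinusoidal encoding from the original paper. The tiny embedding table below is invented for the example, and many current models use other position schemes, so treat this as one possible instance rather than how any particular model does it.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding: sin on even dimensions, cos on odd dimensions."""
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe[:d_model]

# Hypothetical 4-dimensional token embeddings.
EMBEDDINGS = {
    "the": [1.0, 0.0, 0.0, 0.0],
    "cat": [0.0, 1.0, 0.0, 0.0],
    "sat": [0.0, 0.0, 1.0, 0.0],
    "on":  [0.0, 0.0, 0.0, 1.0],
    "mat": [1.0, 1.0, 0.0, 0.0],
}

def encode(tokens, use_positions):
    """Look up each token and optionally add its position encoding."""
    vectors = []
    for pos, tok in enumerate(tokens):
        vec = list(EMBEDDINGS[tok])
        if use_positions:
            pe = positional_encoding(pos, len(vec))
            vec = [e + p for e, p in zip(vec, pe)]
        vectors.append(vec)
    return vectors

a = "the cat sat on the mat".split()
b = "the mat sat on the cat".split()

# Sorting discards order, so this compares the two inputs as unordered sets.
same_without = sorted(encode(a, False)) == sorted(encode(b, False))
same_with = sorted(encode(a, True)) == sorted(encode(b, True))
print(same_without, same_with)
```

Without positions the two sentences supply exactly the same collection of vectors. Once each vector carries its position, "cat" at index 1 and "cat" at index 5 are different inputs.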

The Training Pipeline

Getting a model ready for deployment typically takes three training phases.

Phase 1: Pre-training. The model trains on vast, diverse text using next-token prediction as the objective. Parameters, ranging from millions to hundreds of billions, adjust via gradient descent and backpropagation to minimize prediction error. This is where broad language competence and world knowledge accumulate. Training compute for frontier models doubles roughly every five months, dataset sizes double roughly every eight months, and power requirements double roughly every year.
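As a small illustration of "minimize prediction error", the sketch below computes the cross-entropy loss for a single next-token prediction and takes one gradient descent step. To stay short it treats the logits themselves as the trainable parameters, which is a simplifying assumption; in a real model the gradient flows back through billions of weights via backpropagation. The three-word vocabulary is made up.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits, target_index):
    """Cross-entropy: negative log-probability of the true next token."""
    return -math.log(softmax(logits)[target_index])

def gradient_step(logits, target_index, lr):
    """For softmax + cross-entropy, d(loss)/d(logits) = probs - one_hot."""
    probs = softmax(logits)
    grads = [p - (1.0 if i == target_index else 0.0) for i, p in enumerate(probs)]
    return [x - lr * g for x, g in zip(logits, grads)]

# Hypothetical vocabulary and an untrained, uniform prediction.
vocab = ["mat", "sofa", "moon"]
logits = [0.0, 0.0, 0.0]
target = vocab.index("mat")

before = next_token_loss(logits, target)
logits = gradient_step(logits, target, lr=1.0)
after = next_token_loss(logits, target)
print(f"loss before: {before:.3f}, loss after: {after:.3f}")
```

The loss starts at ln(3) because the uniform prediction gives the correct token a one-in-three chance, and it falls after the update because probability mass has shifted toward "mat". Pre-training repeats this kind of update across an enormous number of token positions.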

Phase 2: Instruction fine-tuning. The pre-trained model learns from curated, labeled examples of instruction-following. This converts a next-token predictor into something that actually answers questions.

Phase 3: RLHF. Reinforcement Learning from Human Feedback optimizes for helpfulness, harmlessness, and honesty using human rater preferences as the signal. Major labs—OpenAI, Anthropic, Google DeepMind, Meta—now treat this as standard. The tradeoff is real: optimizing for rater preference can breed sycophancy and suppress outputs that are correct but unpopular.
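One common way to turn rater preferences into a training signal is a pairwise loss on a reward model: the response the rater preferred should receive a higher score than the one they rejected. The article does not say which formulation any given lab uses, so the sketch below is an assumption, and the reward values are invented.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), in a numerically stable form."""
    diff = reward_chosen - reward_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    # For negative diff, rewrite log(1 + exp(-diff)) to avoid overflow.
    return -diff + math.log1p(math.exp(diff))

# Hypothetical reward-model scores for two candidate responses.
print(preference_loss(2.0, 0.0))   # reward model agrees with the rater: small loss
print(preference_loss(0.0, 2.0))   # reward model disagrees: large loss
```

The loss depends only on which answer raters preferred, not on which answer is correct. That is the mechanism behind the sycophancy tradeoff: if raters systematically prefer agreeable answers, the optimization rewards agreeableness.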

The Cost Structure

Training cost claims circulate everywhere and are almost always incomplete.

The amortized hardware and energy cost for frontier model training has grown at 2.4x per year since 2016. A detailed breakdown shows hardware represents 47-67% of total development cost, R&D staff 29-49%, and energy the remaining 2-6%.
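To get a feel for what 2.4x per year means when compounded, the following sketch converts the growth rate into a doubling time and a multi-year multiplier. The eight-year horizon is an arbitrary choice for illustration, not a figure from the cited analysis.

```python
import math

def growth_multiplier(rate_per_year, years):
    """Total cost multiplier after compounding for the given number of years."""
    return rate_per_year ** years

def doubling_time_months(rate_per_year):
    """Months needed for cost to double at a constant annual growth rate."""
    return 12 * math.log(2) / math.log(rate_per_year)

rate = 2.4  # annual growth factor quoted in the text
print(f"doubling time: {doubling_time_months(rate):.1f} months")
print(f"multiplier over 8 years: {growth_multiplier(rate, 8):,.0f}x")
```

At this rate costs double in under ten months, and eight years of compounding multiplies them by roughly a thousand.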

The widely reported figures capture only the successful final run. DeepSeek's $5.6 million claim for its 671-billion parameter model covers just the final training run on 2,048 H800 GPUs and excludes R&D, failed experiments, hardware acquisition, and infrastructure. Direct comparison to full-cost estimates from other labs is apples-to-oranges.

Anthropic CEO Dario Amodei has stated that current frontier training runs cost between $100 million and $1 billion. Stanford's AI Index 2025 estimates the cost of GPT-4's training compute at approximately $78-100 million.

Inference is where most organizations actually pay. In production, roughly 80% of AI budget goes to inference versus 20% to training. Organizations fixated on headline training costs routinely miscalculate total cost of ownership.
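A back-of-the-envelope version of that split can be sketched as follows. Every number here is a hypothetical placeholder (the training or fine-tuning cost, the monthly token volume, the price, and the deployment lifetime); the point is only that inference is a recurring cost that scales with usage while training is paid once.

```python
def inference_cost(tokens_per_month, price_per_million_tokens, months):
    """Cumulative inference spend over the deployment lifetime."""
    return tokens_per_month / 1_000_000 * price_per_million_tokens * months

def inference_share(training_cost, inference_total):
    """Fraction of total spend that goes to inference."""
    return inference_total / (training_cost + inference_total)

# Hypothetical inputs, chosen only to illustrate the arithmetic.
training_cost = 50_000.0          # one-off cost, USD
tokens_per_month = 2_000_000_000  # production volume
price = 5.0                       # USD per million tokens
lifetime_months = 20

total = inference_cost(tokens_per_month, price, lifetime_months)
share = inference_share(training_cost, total)
print(f"inference: ${total:,.0f} ({share:.0%} of total spend)")
```

Changing the token volume or the lifetime moves the share substantially, which is why the budget should be sized from measured production traffic.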

The genuine good news: inference pricing has collapsed. The cost of querying a model that matches GPT-3.5 on MMLU fell from $20.00 per million tokens in November 2022 to $0.07 by October 2024, a roughly 280-fold reduction in just under two years. Depending on the task, price drops have ranged from 9 to 900 times per year. Hardware gains, quantization, speculative decoding, distillation, and open-weight model competition are all driving the decline.
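The fold reduction and an implied annual rate can be recomputed from the two price points and dates quoted above. This sketch assumes a constant rate of decline between the two dates, which is a simplification, and counts whole calendar months.

```python
def months_between(start_year, start_month, end_year, end_month):
    """Whole calendar months from the start date to the end date."""
    return (end_year - start_year) * 12 + (end_month - start_month)

def annualized_factor(fold_reduction, months):
    """Equivalent per-year price reduction, assuming a constant rate."""
    return fold_reduction ** (12 / months)

start_price, end_price = 20.00, 0.07   # USD per million tokens, from the text
fold = start_price / end_price
months = months_between(2022, 11, 2024, 10)

print(f"fold reduction: {fold:.0f}x over {months} months")
print(f"implied annual factor: {annualized_factor(fold, months):.1f}x per year")
```

The implied annual factor for this benchmark and quality tier falls inside the 9x to 900x per year range quoted for different tasks.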

Capabilities and Their Limits

In-context learning sets modern LLMs apart: the model performs new tasks from examples in the prompt at inference time, with no weight update, no gradient computation. It generalizes to unseen task formats from pattern alone. This unlocks deployment flexibility—you don't retrain to shift behavior.
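In practice, in-context learning often amounts to formatting a handful of worked examples into the prompt. The sketch below builds such a few-shot prompt as a plain string. The "Input:" and "Output:" labels, the sentiment task, and the function name are illustrative choices, not a required format.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, worked examples, and a new query into one prompt."""
    lines = [instruction, ""]
    for text, label in examples:
        lines += [f"Input: {text}", f"Output: {label}", ""]
    # The prompt ends where the model is expected to continue.
    lines += [f"Input: {query}", "Output:"]
    return "\n".join(lines)

# Hypothetical task: sentiment labels defined only by the examples.
examples = [
    ("The battery lasts all day.", "positive"),
    ("The screen cracked within a week.", "negative"),
]
prompt = build_few_shot_prompt(
    "Classify the sentiment of each review.",
    examples,
    "Setup took five minutes and everything worked.",
)
print(prompt)
```

Swapping the examples changes the task the model performs without touching its weights, which is the deployment flexibility the paragraph describes.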

Emergent capabilities (multi-step reasoning, code generation, summarization) surface from scaling the next-token objective across massive data and parameters. They weren't explicitly programmed, and they can't be reliably predicted before the model is trained—a consequential unsolved problem.

Two structural limits warrant operator attention:

Hallucination is architectural. The model generates statistically probable continuations. If that continuation happens to be a plausible but wrong chemical formula or made-up fact, so be it. No fine-tuning regimen has eliminated this, because the mechanism producing fluent accurate text is identical to the one producing fluent inaccurate text.

Context scaling is quadratic. Self-attention's compute cost scales as O(n^2) with sequence length. Long contexts at inference remain expensive even with extensions to 128k+ tokens. Coherence at the far edges of a long context is an unsolved engineering problem.
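A quick calculation shows what O(n^2) means for context length. The sketch counts only the attention score matrix and the operations needed to fill it for a single head in a single layer. The head dimension of 128 and the context sizes are assumptions, and optimized kernels change the constants but not the quadratic shape.

```python
def attention_score_entries(n_tokens):
    """Every token scores against every token: n * n entries."""
    return n_tokens * n_tokens

def qk_flops(n_tokens, d_k):
    """Each score is a length-d_k dot product: d_k multiplies plus d_k adds."""
    return 2 * n_tokens * n_tokens * d_k

short_ctx, long_ctx, d_k = 8_000, 128_000, 128  # hypothetical sizes

ratio = qk_flops(long_ctx, d_k) / qk_flops(short_ctx, d_k)
print(f"context grows {long_ctx // short_ctx}x, score compute grows {ratio:.0f}x")
print(f"score entries at {long_ctx:,} tokens: {attention_score_entries(long_ctx):,}")
```

A 16x longer context costs 256x more for this part of the computation, so it is worth checking before extending context.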

What Operators Should Take Away

LLMs are next-token predictors built on a parallelizable attention architecture that, since 2017, has scaled in ways prior NLP systems couldn't match. Training layers broad knowledge, instruction-following, and preference alignment atop that core objective. Inference costs are plummeting—roughly 10x decrease per year across performance tiers is a reasonable approximation. But frontier training costs compound upward, and running a model in production typically costs far more over its lifetime than training it cost upfront.

For engineers building on these systems: account for quadratic attention cost before extending context. Treat hallucination as an architectural problem requiring retrieval, verification, or structured output. Size inference budgets against actual production token volumes, not training headlines.
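One way to act on the verification advice is to require structured output and reject anything that fails basic checks before it reaches a user. The sketch below assumes the model was asked for a JSON object with an `answer` field and a `sources` list of document IDs. Those field names, the ID format, and the function are hypothetical, and passing the checks does not prove the answer is true.

```python
import json

def validate_answer(raw_output, required_fields, known_source_ids):
    """Return (ok, reason) after structural and citation checks."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False, "not valid JSON"
    if not isinstance(data, dict):
        return False, "expected a JSON object"
    missing = [f for f in required_fields if f not in data]
    if missing:
        return False, f"missing fields: {missing}"
    # Reject citations that do not correspond to a retrieved document.
    unknown = [s for s in data.get("sources", []) if s not in known_source_ids]
    if unknown:
        return False, f"cites unknown sources: {unknown}"
    return True, "ok"

# Hypothetical retrieved document IDs and two model outputs.
retrieved = {"doc-1", "doc-2"}
good = '{"answer": "Attention cost is quadratic.", "sources": ["doc-1"]}'
bad = '{"answer": "Attention cost is linear.", "sources": ["doc-9"]}'

print(validate_answer(good, ["answer", "sources"], retrieved))
print(validate_answer(bad, ["answer", "sources"], retrieved))
```

A check like this catches fabricated citations and malformed output cheaply. Factual verification of the answer text itself still needs retrieval or a separate checking step.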

Sources
  1. Attention Is All You Need (Vaswani et al., 2017) — arxiv
  2. Stanford HAI 2025 AI Index Report — Chapter 1: Research and Development
  3. How Much Does It Cost to Train Frontier AI Models? — Epoch AI
  4. LLM Inference Price Trends — Epoch AI
  5. Inference Unit Economics: The True Cost Per Million Tokens — Introl
  6. Welcome to LLMflation — Andreessen Horowitz
  7. Context Collapse: In-Context Learning and Model Collapse — arxiv
  8. Navigating the Landscape of Large Language Models — arxiv
  9. Transformer Model Explained: Attention Is All You Need | Alex Xu posted on the topic | LinkedIn
  10. The Transformer Revolution: How “Attention Is All You Need” Changed AI Forever | by Sebastian Buzdugan | Medium
  11. Attention Is All You Need - A Deep Dive into the Revolutionary Transformer Architecture | Towards AI
  12. Transformer Architecture: “Attention is All You Need” | by Mehmet Ozkaya | Medium
  13. Review of “Attention Is All You Need (Vaswani et al., 2017)”
  14. Detecting Insincere Questions from Text: A Transfer Learning Approach
  15. Transformer Architecture: How Attention Changed AI | Introl Blog
  16. adaptNMT: an open-source, language-agnostic development environment for Neural Machine Translation
  17. Optimizing Inference Costs: The Complete Guide | Mirantis
  18. Artificial Intelligence Index Report 2025 CHAPTER 1: Research and Development
  19. Report by the Stanford Institute (HAI): State of AI Index 2025
  20. The AI Price Collapse Is Real. Your Excuse to Wait Is Not. | by Jan Horecny | Medium
  21. AI Index 2025: Inference costs plummet, hardware trends ...
  22. Stanford AI Index: Inference costs drop 280x in 2 years
  23. Stanford AI Index 2025 — Grokipedia
  24. Token Burnout: Why AI Costs Are Climbing and How Product Leaders Can Prototype Smarter
  25. LLM Inference Cost 2026: Complete Pricing Guide
  26. Photons = Tokens: The Physics of AI and the Economics of Knowledge
  27. The LLM Cost Paradox: How "Cheaper" AI Models Are Breaking Budgets
  28. Densing Law of LLMs
  29. Observations About LLM Inference Pricing | MIRI TGT
  30. Tiered Super-Moore's Law: Price Evolution, Production Frontiers, and Market Competition in Large Language Model Inference Services