AIIMPACT 8

The Machine That Predicts the Next Word: How LLMs Actually Work

From the attention formula to the economics of inference, a technical operator's guide to the architecture, training pipeline, and cost structure behind large language models.

2026-06-07 · 6 MIN READ · #LLMs · #Transformers · #Attention · #RLHF · #Inference · #Training Costs · #Architecture

The Core Insight

Everything about a large language model flows from one deceptively simple objective: predict the next token. There is no reasoning module, no knowledge store, no semantic engine in the classical sense. A neural network, having absorbed enough text to build an extraordinarily rich statistical map of how language works, uses that map at inference time to produce a probable continuation of whatever you hand it. Understanding the mechanics is a prerequisite for making sound engineering and product decisions with these systems.

The Architecture That Changed Everything

The transformer architecture was introduced in the 2017 paper "Attention Is All You Need" by Ashish Vaswani et al., and it removed a bottleneck that had stalled progress for years. Recurrent neural networks processed sequences one element at a time, which made them slow to train and difficult to parallelize. They also had to carry everything seen so far in a fixed-size hidden state, so by the time a sequence reached, say, its 512th token, much of the earlier context had faded.

The transformer's breakthrough was self-attention: each token attends to every other token in the same sequence, enabling efficient parallelization and better modeling of long-range dependencies. The operation is Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, where Q, K, and V are the Query, Key, and Value projections of the tokens and d_k is the key dimension. Dividing by sqrt(d_k) keeps the dot products from growing with dimension and saturating the softmax. Every token is scored against all others in a single matrix product, so the whole sequence is processed in parallel rather than step by step.
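The formula can be traced by hand on a toy input. The following is a minimal sketch in plain Python, assuming tiny made-up Q, K, and V matrices stored as lists of rows; it leaves out the learned projections, batching, and the causal mask that decoder-style LLMs apply, and a real implementation would use a tensor library.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V.

    Q and K hold n rows of length d_k; V holds n rows of length d_v.
    Returns n rows of length d_v.
    """
    d_k = len(K[0])
    output = []
    for q in Q:
        # Score this query against every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Each output row is a weighted average of the value rows.
        row = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        output.append(row)
    return output

# Toy input: 3 tokens, d_k = d_v = 2 (made-up numbers, not real embeddings).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = attention(Q, K, V)
```

Each row of `out` is a blend of all three value rows, weighted by how strongly that token's query matches each key. The loop over queries is written out for readability; in matrix form the same work is a pair of matrix multiplications, which is what makes it parallelizable.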

Multi-head attention extends this by running multiple attention mechanisms in parallel, enabling the model to focus on different aspects of the input at once. One head may track syntactic structure; another resolves pronouns; a third attends to topical coherence across distant passages. Outputs are concatenated and projected back to the residual stream.

One unavoidable constraint: the transformer relies on positional encodings to inject sequence order information, since pure attention has no inherent sense of which token came first. Without them, "the cat sat on the mat" and "the mat sat on the cat" look identical to the attention layers, because both contain exactly the same set of tokens.
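A small experiment makes the point concrete. This sketch uses the sinusoidal encoding from the 2017 paper, which is only one of several positional schemes in use, and a hypothetical five-word embedding table with made-up values.

```python
import math

def sinusoidal_position(pos, d_model):
    """Sinusoidal encoding for one position: sin on even dims, cos on odd dims."""
    vec = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        vec.append(math.sin(angle))
        vec.append(math.cos(angle))
    return vec[:d_model]

# Hypothetical 4-dimensional embeddings (illustrative values only).
EMBED = {
    "the": [0.1, 0.2, 0.3, 0.4],
    "cat": [0.9, 0.1, 0.5, 0.7],
    "sat": [0.3, 0.8, 0.2, 0.6],
    "on":  [0.6, 0.4, 0.9, 0.1],
    "mat": [0.2, 0.7, 0.8, 0.3],
}

def encode(tokens, use_positions):
    """Look up each token's vector, optionally adding its position encoding."""
    rows = []
    for pos, tok in enumerate(tokens):
        vec = EMBED[tok]
        if use_positions:
            pe = sinusoidal_position(pos, len(vec))
            vec = [a + b for a, b in zip(vec, pe)]
        rows.append(vec)
    return rows

a = "the cat sat on the mat".split()
b = "the mat sat on the cat".split()

# Sorting discards order, mimicking a model that only sees a set of vectors.
same_without = sorted(encode(a, False)) == sorted(encode(b, False))
same_with = sorted(encode(a, True)) == sorted(encode(b, True))
```

Here `same_without` is true and `same_with` is false: once position is added, "cat" in slot 1 and "cat" in slot 5 are different vectors, so the two sentences no longer collapse into the same input.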

The Training Pipeline

Getting a model ready for deployment takes three training phases.

Phase 1: Pre-training. The model trains on vast, diverse text using next-token prediction as the objective. Parameters, ranging from millions to hundreds of billions, are adjusted via gradient descent and backpropagation to minimize prediction error. This is where broad language competence and world knowledge accumulate. Training compute for frontier models doubles roughly every five months, dataset sizes roughly every eight months, and power requirements roughly every year.

Phase 2: Instruction fine-tuning. The pre-trained model learns from curated, labeled examples of instruction-following. This converts a next-token predictor into something that actually answers questions.

Phase 3: RLHF. Reinforcement Learning from Human Feedback optimizes for helpfulness, harmlessness, and honesty using human rater preferences as the signal. Major labs—OpenAI, Anthropic, Google DeepMind, Meta—now treat this as standard. The tradeoff is real: optimizing for rater preference can breed sycophancy and suppress outputs that are correct but unpopular.
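The objectives behind Phase 1 and Phase 3 can each be written in a few lines. The sketch below shows the next-token cross-entropy loss and, as an assumption about one common RLHF recipe rather than something this article specifies, the pairwise loss often used to train a reward model from rater preferences. The vocabulary, logits, and reward scores are made up, and the reinforcement learning step that then optimizes the model against the reward is not shown.

```python
import math

def next_token_loss(logits, target_index):
    """Cross-entropy at one position: -log softmax(logits)[target_index]."""
    peak = max(logits)
    # log-sum-exp, shifted by the max for numerical stability.
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target_index]

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss -log sigmoid(r_chosen - r_rejected), computed stably."""
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical 4-word vocabulary and made-up model scores for the next token.
VOCAB = ["the", "cat", "sat", "mat"]
logits = [1.0, 0.5, 3.0, 0.2]
loss_if_sat = next_token_loss(logits, VOCAB.index("sat"))  # model favored "sat"
loss_if_mat = next_token_loss(logits, VOCAB.index("mat"))  # model disfavored "mat"

# Made-up reward scores for a response the rater preferred vs. one they rejected.
agree = preference_loss(2.0, 0.0)     # reward model ranks them like the rater
disagree = preference_loss(0.0, 2.0)  # reward model ranks them the other way
```

In both cases the loss is small when the model already agrees with the data (`loss_if_sat`, `agree`) and large when it does not. The preference loss rewards agreement with raters, not correctness, which is why a response raters like but that is wrong can still score well.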

The Cost Structure

Training cost claims circulate everywhere and are almost always incomplete.

The amortized hardware and energy cost for frontier model training has grown at 2.4x per year since 2016. A detailed breakdown shows hardware represents 47-67% of total development cost, R&D staff 29-49%, and energy the remaining 2-6%.
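To get a feel for what 2.4x per year means, the growth rate can be converted into a doubling time and compounded forward. This is a back-of-envelope sketch; the $100 million starting figure is a hypothetical input, not a number for any particular model.

```python
import math

def doubling_months(annual_factor):
    """Months needed to double at a given year-over-year growth factor."""
    return 12 * math.log(2) / math.log(annual_factor)

def projected_cost(base_cost, years, annual_factor=2.4):
    """Compound a starting cost forward at a fixed annual growth factor."""
    return base_cost * annual_factor ** years

ANNUAL_FACTOR = 2.4          # from the article
BASE_COST_MILLIONS = 100.0   # hypothetical starting point

months = doubling_months(ANNUAL_FACTOR)                  # a little over 9 months
in_three_years = projected_cost(BASE_COST_MILLIONS, 3)   # 2.4 ** 3 is about 13.8x
```

At that rate, costs double in a bit under ten months and grow roughly 14-fold in three years, assuming the trend simply continues.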

The widely reported figures capture only the successful final run. DeepSeek's $5.6 million claim for its 671-billion parameter model covers just the final training run on 2,048 H800 GPUs and excludes R&D, failed experiments, hardware acquisition, and infrastructure. Direct comparison to full-cost estimates from other labs is apples-to-oranges.
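A final-run figure like this is essentially GPU count times hours times a price per GPU-hour. The sketch below works backward from the two numbers quoted above; the $2.00 per GPU-hour rate is an assumption chosen for illustration, so the implied duration is only as good as that guess.

```python
def final_run_cost(gpu_count, hours, dollars_per_gpu_hour):
    """Cost of a single training run priced as rented GPU time."""
    return gpu_count * hours * dollars_per_gpu_hour

def implied_hours(total_cost, gpu_count, dollars_per_gpu_hour):
    """Wall-clock hours implied by a total cost, cluster size, and hourly rate."""
    return total_cost / (gpu_count * dollars_per_gpu_hour)

REPORTED_COST = 5_600_000   # dollars, from the article
GPU_COUNT = 2_048           # H800 GPUs, from the article
ASSUMED_RATE = 2.00         # dollars per GPU-hour: hypothetical, not from the article

hours = implied_hours(REPORTED_COST, GPU_COUNT, ASSUMED_RATE)
days = hours / 24
```

Under that assumed rate the figure corresponds to roughly 57 days on the cluster. Nothing in this arithmetic accounts for salaries, earlier experiments, or buying the hardware, which is the gap the paragraph above describes.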

Anthropic CEO Dario Amodei has stated that current frontier training runs cost between $100 million and $1 billion. Stanford's AI Index 2025 estimates GPT-4's training compute cost at approximately $78-100 million.

Inference is where most organizations actually pay. In production, spending on inference typically dwarfs spending on training: roughly 80% of the AI budget goes to inference versus 20% to training. Organizations fixated on headline training costs routinely miscalculate total cost of ownership.
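The 80/20 split follows naturally once a one-time training cost is set against a recurring inference bill. This sketch uses hypothetical dollar amounts to show how quickly the share shifts as a deployment ages.

```python
def inference_share(training_cost, monthly_inference_cost, months):
    """Fraction of cumulative spend that went to inference after `months`."""
    inference_total = monthly_inference_cost * months
    return inference_total / (training_cost + inference_total)

def months_to_reach_share(training_cost, monthly_inference_cost, target_share):
    """Months until inference accounts for `target_share` of cumulative spend.

    Solves m*c / (T + m*c) = s for m, giving m = s*T / ((1 - s)*c).
    """
    return target_share * training_cost / ((1 - target_share) * monthly_inference_cost)

TRAINING_COST = 1_000_000       # hypothetical one-time cost, dollars
MONTHLY_INFERENCE = 100_000     # hypothetical recurring cost, dollars per month

share_after_one_year = inference_share(TRAINING_COST, MONTHLY_INFERENCE, 12)
months_to_80_percent = months_to_reach_share(TRAINING_COST, MONTHLY_INFERENCE, 0.80)
```

With these inputs inference is already more than half of cumulative spend after one year and reaches 80% at 40 months. Higher traffic shortens that timeline proportionally.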

The genuine good news: inference pricing has collapsed. The cost of querying a model that scores at GPT-3.5 level on MMLU fell from $20.00 per million tokens in November 2022 to $0.07 by October 2024, a roughly 280-fold reduction in under two years. Depending on the task, price drops have ranged from 9 to 900 times per year. Hardware gains, quantization, speculative decoding, distillation, and open-weight model competition are all driving the decline.
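The headline multiple and its annualized equivalent are easy to check from the two price points. This sketch counts November 2022 to October 2024 as 23 months, which is my own reading of the dates quoted above.

```python
def fold_reduction(start_price, end_price):
    """How many times cheaper the end price is than the start price."""
    return start_price / end_price

def annualized_factor(fold, months):
    """Equivalent per-year reduction factor for a fold change over `months`."""
    return fold ** (12 / months)

START_PRICE = 20.00   # dollars per million tokens, November 2022 (from the article)
END_PRICE = 0.07      # dollars per million tokens, October 2024 (from the article)
MONTHS = 23           # assumption: elapsed months between the two dates

fold = fold_reduction(START_PRICE, END_PRICE)   # a bit over 285x
per_year = annualized_factor(fold, MONTHS)      # roughly 19x per year
```

A decline of about 19x per year sits inside the 9 to 900 range cited for individual tasks.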

Capabilities and Their Limits

In-context learning sets modern LLMs apart: the model performs new tasks from examples in the prompt at inference time, with no weight update, no gradient computation. It generalizes to unseen task formats from pattern alone. This unlocks deployment flexibility—you don't retrain to shift behavior.
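In practice, in-context learning means the "training data" for a new task is just text placed ahead of the query. The helper below is a minimal sketch of assembling such a few-shot prompt; the `Input:` and `Output:` labels and the function name are hypothetical conventions, not a format any particular model requires.

```python
def build_few_shot_prompt(examples, query, instruction=""):
    """Assemble a few-shot prompt from (input, output) example pairs.

    The model sees the worked examples, then the new input with the
    output left blank for it to continue. No weights are updated.
    """
    lines = [instruction] if instruction else []
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)

# Made-up sentiment-labeling task defined entirely by examples.
examples = [
    ("The battery died within an hour.", "negative"),
    ("Setup took two minutes and it just works.", "positive"),
]
prompt = build_few_shot_prompt(
    examples,
    "The screen is gorgeous but the speakers crackle.",
    instruction="Label the sentiment of each review.",
)
```

Changing the task means swapping the examples and instruction, not retraining: the same model weights handle a different job on the next request.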

Emergent capabilities (multi-step reasoning, code generation, summarization) surface from scaling the next-token objective across massive data and parameters. They weren't explicitly programmed, and they can't be reliably predicted before the model is trained—a consequential unsolved problem.

Two structural limits warrant operator attention:

Hallucination is architectural. The model generates statistically probable continuations. If that continuation happens to be a plausible but wrong chemical formula or made-up fact, so be it. No fine-tuning regimen has eliminated this, because the mechanism producing fluent accurate text is identical to the one producing fluent inaccurate text.

Context scaling is quadratic. Self-attention's compute cost scales as O(n^2) with sequence length. Long contexts at inference remain expensive even with extensions to 128k+ tokens. Coherence at the far edges of a long context is an unsolved engineering problem.
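The quadratic term is worth working out before committing to a longer context window. This sketch counts only the entries of the n-by-n attention score matrix, per head and per layer, and ignores constant factors and every other part of the model, so it gives relative growth, not an absolute cost.

```python
def attention_score_entries(n_tokens):
    """Number of query-key scores in one n-by-n attention matrix."""
    return n_tokens * n_tokens

def relative_attention_cost(n_tokens, baseline_tokens):
    """How many times more attention scores than at the baseline length."""
    return attention_score_entries(n_tokens) / attention_score_entries(baseline_tokens)

BASELINE = 8_192      # hypothetical starting context length
EXTENDED = 131_072    # 128k tokens

growth = relative_attention_cost(EXTENDED, BASELINE)   # (16x the length) squared
```

Sixteen times the context length means 256 times the attention scores, which is why long-context requests are priced and budgeted differently from short ones.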

What Operators Should Take Away

LLMs are next-token predictors built on a parallelizable attention architecture that, since 2017, has scaled in ways prior NLP systems couldn't match. Training layers broad knowledge, instruction-following, and preference alignment atop that core objective. Inference costs are plummeting—roughly 10x decrease per year across performance tiers is a reasonable approximation. But frontier training costs compound upward, and running a model in production typically costs far more over its lifetime than training it cost upfront.

For engineers building on these systems: account for quadratic attention cost before extending context. Treat hallucination as an architectural problem requiring retrieval, verification, or structured output. Size inference budgets against actual production token volumes, not training headlines.
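Sizing an inference budget from token volume is a short calculation. The sketch below assumes a single blended price per million tokens and a 30-day month; all traffic figures are hypothetical, and real pricing usually distinguishes input tokens from output tokens.

```python
def monthly_inference_cost(requests_per_day, tokens_per_request,
                           dollars_per_million_tokens, days=30):
    """Estimated monthly spend from request volume and a blended token price."""
    tokens_per_month = requests_per_day * tokens_per_request * days
    return tokens_per_month / 1_000_000 * dollars_per_million_tokens

# Hypothetical workload: 10,000 requests a day at 1,500 tokens each.
cheap_tier = monthly_inference_cost(10_000, 1_500, 0.07)
premium_tier = monthly_inference_cost(10_000, 1_500, 20.00)
```

The same workload costs about $31.50 a month at $0.07 per million tokens and $9,000 a month at $20.00, so the model tier and the token volume together decide the budget.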

