Reinforcement Learning: How Machines Learn to Act
There is a particular kind of problem that breaks every other approach in machine learning.
Not image classification. Not language generation. Not regression or clustering. Those problems have something in common: someone, somewhere, prepared the answer in advance. A human labeled the image, graded the text, or measured the outcome. The model learns by comparing its guess to that answer.
But what about problems where no one can prepare the answer in advance? What is the correct steering angle at this exact instant on this exact road? What is the correct move in this chess position? What is the correct word to say next in a negotiation? These problems do not have labeled datasets. They have consequences. And consequences arrive later, after the action is already taken.
Reinforcement learning is the branch of machine learning built for exactly this. It is not a single algorithm. It is a framework for learning through interaction — where the learner acts, the world responds, and the learner adjusts over time based on what happened.
This article covers what RL actually is, how the mathematics works, how it is used to train modern AI models including large language models, how to implement it, and what problems remain genuinely unsolved. No topic here is simplified to the point of being wrong.
The Core Idea
A human child learns to walk without a dataset of labeled walking examples. It tries things, falls, tries differently, and over thousands of repetitions builds a policy — an internal mapping from body state to muscle action — that keeps it upright and moving. The feedback is not “here is the correct leg angle.” The feedback is gravity.
Reinforcement learning formalises this process. There is an agent that takes actions, an environment that responds to those actions, and a reward signal that tells the agent, numerically, how well things went. The agent’s goal is not to maximise reward on any single action. It is to maximise the total reward accumulated over time.
This distinction matters. An agent that only cares about immediate reward will cash in every short-term gain and destroy its long-term prospects. A chess engine that takes every pawn immediately loses games. An autonomous vehicle that brakes hard to avoid one obstacle may cause a worse accident behind it. RL agents must learn to reason across time.
The Formal Framework: Markov Decision Processes
The mathematical object underlying reinforcement learning is called a Markov Decision Process (MDP). Every RL problem can be expressed as one.
An MDP is defined by five components:
- S — a set of states describing every possible configuration of the environment
- A — a set of actions the agent can take
- P(s’ | s, a) — a transition probability: given state s and action a, what is the probability of arriving in state s’
- R(s, a, s’) — a reward function: a scalar value received after transitioning from s to s’ via action a
- γ (gamma) — a discount factor between 0 and 1, controlling how much the agent values future rewards relative to immediate ones
At each timestep t, the interaction unfolds as:
- Agent observes current state sₜ
- Agent selects action aₜ according to its policy
- Environment transitions to sₜ₊₁ with probability P(sₜ₊₁ | sₜ, aₜ)
- Agent receives reward rₜ
- Repeat
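To make the loop concrete, here is a minimal sketch of it using the gymnasium API, with random action selection standing in for a real policy (the environment name is just a placeholder):

```python
import gymnasium as gym

env = gym.make("CartPole-v1")        # any environment works; CartPole is just a placeholder
state, _ = env.reset()

total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()                            # a real agent would query its policy here
    state, reward, terminated, truncated, _ = env.step(action)    # environment responds with s' and r
    total_reward += reward
    done = terminated or truncated

print(f"Episode return: {total_reward}")
```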
The agent’s objective is to maximise the expected discounted return:

G_t = r_t + γ r_{t+1} + γ² r_{t+2} + … = Σ_{k=0}^{∞} γ^k r_{t+k}
The discount factor γ has a clean interpretation. When γ = 0, the agent is completely myopic — it only cares about the immediate next reward. When γ → 1, the agent weights all future rewards equally and must plan far into the future. In practice, γ is set close to 1 (often 0.99) for problems requiring long-horizon reasoning, and lower for problems where the environment resets frequently and the horizon is short.
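A quick way to build intuition for γ is to compute the discounted return of the same reward sequence under different values of γ. This is a toy calculation, not part of any training loop:

```python
def discounted_return(rewards, gamma):
    """G = sum over k of gamma**k * r_k."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

rewards = [1.0] * 100                      # one unit of reward per step, 100 steps
for gamma in (0.0, 0.9, 0.99):
    print(gamma, round(discounted_return(rewards, gamma), 2))
# gamma = 0.0  -> 1.0   (only the immediate reward counts)
# gamma = 0.9  -> ~10   (effective horizon of roughly 1 / (1 - gamma) steps)
# gamma = 0.99 -> ~63   (a much longer effective horizon)
```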
The “Markov” in MDP means the transition probability depends only on the current state, not on the history of how that state was reached. This is a simplifying assumption. In many real problems it does not strictly hold — past context matters. Handling partial observability is an active research area, addressed by architectures like recurrent networks and transformers applied to sequences of observations.
What the Agent Learns: Value Functions and Policies
There are two complementary things an RL agent can learn.
The Value Function
The state-value function V^π(s) represents the expected return starting from state s and following policy π:

V^π(s) = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k} | s_t = s ]
The action-value function Q^π(s, a) — usually called the Q-function — extends this to represent the expected return from state s, taking action a first, and then following policy π:

Q^π(s, a) = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k} | s_t = s, a_t = a ]
Q-values are more useful in practice because an agent with a learned Q-function can directly extract a policy: in any state, pick the action with the highest Q-value.
The Policy
A policy π is the agent’s decision function. It maps states to actions (or to probability distributions over actions):

π(s) = a (deterministic)  or  π(a | s) = P(a_t = a | s_t = s) (stochastic)
A deterministic policy selects one action per state. A stochastic policy samples from a distribution, which is important during training — randomness enables the agent to explore states it would not otherwise visit.
The optimal policy π* is the one that maximises V^π(s) for all states simultaneously. Under the MDP framework, an optimal policy always exists. Finding it is the computational challenge.
The Bellman Equations
The key insight connecting value functions to learning is the Bellman equation, which expresses the value of a state in terms of the values of its successors:

V^π(s) = Σ_a π(a | s) Σ_{s'} P(s' | s, a) [ R(s, a, s') + γ V^π(s') ]
This is a recursive definition. The value of a state is the expected immediate reward plus the discounted value of wherever you end up. This recursion is exploitable: if you have a rough estimate of V, you can improve it by backing up from successor states. The process of repeatedly applying this backup is called dynamic programming when the model P is known, and temporal difference learning when it is not.
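When the model P and R is known, the backup can be run to convergence directly. Below is a minimal value iteration sketch on a small synthetic MDP; the random transition and reward tensors are hypothetical stand-ins for a real model, and the backup shown is the optimality variant (a max over actions rather than an expectation under a fixed policy):

```python
import numpy as np

# Hypothetical tabular MDP: random transition tensor P[s, a, s'] and reward tensor R[s, a, s'].
n_states, n_actions = 4, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # rows sum to 1 over s'
R = rng.normal(size=(n_states, n_actions, n_states))
gamma = 0.9

V = np.zeros(n_states)
for _ in range(1000):
    # Bellman optimality backup: Q[s, a] = sum_{s'} P[s, a, s'] * (R[s, a, s'] + gamma * V[s'])
    Q = np.einsum("sax,sax->sa", P, R + gamma * V)
    V_new = Q.max(axis=1)
    if np.abs(V_new - V).max() < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=1)    # greedy policy extracted from the converged values
print("V*:", V.round(3), "policy:", policy)
```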
Core Algorithms
Q-Learning
Q-learning is one of the foundational RL algorithms. It directly learns the optimal Q-function by repeatedly applying the Bellman update:

Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ]
The term in brackets is called the TD error — the difference between what the agent expected and what actually happened. The learning rate α controls how much each update shifts the estimate.
Q-learning is off-policy: it learns the optimal Q-function regardless of what policy generates the experience. This means you can train on replay buffers of past experience, not just the current trajectory.
For small, discrete state spaces, Q-values are stored in a table indexed by (state, action) pairs. For anything larger, this becomes intractable.
Deep Q-Networks (DQN)
In 2013, DeepMind published a result that made the field stop: a single neural network, receiving raw pixel input, learning to play Atari games from scratch. The follow-up Nature paper in 2015 scaled the approach to 49 games, reaching or exceeding human-level play on many of them. The architecture was called a Deep Q-Network.
The core idea is to replace the Q-table with a neural network Q(s, a; θ) parameterised by weights θ. The same Bellman update applies, but now as a regression loss:

L(θ) = E_{(s, a, r, s')} [ ( r + γ max_{a'} Q(s', a'; θ⁻) − Q(s, a; θ) )² ]
Two engineering contributions made DQN stable where previous attempts had failed:
Experience replay — instead of learning from consecutive transitions (which are highly correlated and cause instability), transitions are stored in a replay buffer and sampled randomly during training. This breaks the correlation between successive updates.
Target networks — the Q-values used as regression targets are computed from a separate network θ⁻ whose weights are frozen and only periodically copied from the online network θ. Without this, the targets move as you update the network, creating a moving-target problem that diverges.
DQN works well for discrete action spaces. It does not directly extend to continuous actions, because computing max_{a'} Q(s', a') over a continuous space is intractable.
Policy Gradient Methods
Rather than learning value functions and deriving a policy, policy gradient methods directly parameterise and optimise the policy itself.
The policy is represented as π_θ(a|s) — a neural network that outputs a probability distribution over actions given a state. The objective is the expected return:

J(θ) = E_{τ ∼ π_θ} [ R(τ) ] = E_{τ ∼ π_θ} [ Σ_t γ^t r_t ]
where τ is a trajectory sampled from the policy. The gradient of this objective with respect to θ is:

∇_θ J(θ) = E_{τ ∼ π_θ} [ Σ_t ∇_θ log π_θ(a_t | s_t) G_t ]
This result is called the policy gradient theorem. Its interpretation is intuitive: increase the log-probability of actions that led to high return, decrease the log-probability of actions that led to low return. The algorithm that directly implements this is called REINFORCE.
In practice, raw return G_t has high variance — a single bad episode can tank a policy update that was otherwise directionally correct. Subtracting a baseline (typically the value function V(s_t)) reduces variance without introducing bias:

∇_θ J(θ) = E_{τ ∼ π_θ} [ Σ_t ∇_θ log π_θ(a_t | s_t) (G_t − V(s_t)) ]
The term (G_t - V(s_t)) is called the advantage — how much better the actual outcome was than what the value function predicted.
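In code, the REINFORCE-with-baseline update reduces to a few lines once trajectories have been collected. The tensors below are random placeholders standing in for quantities computed from real rollouts:

```python
import torch

# Placeholders for one batch of trajectory steps:
#   log_probs: log pi_theta(a_t | s_t), carries gradients w.r.t. the policy parameters
#   returns:   discounted returns G_t computed from the sampled rewards
#   values:    critic estimates V(s_t) used as the baseline
log_probs = torch.randn(128, requires_grad=True)
returns = torch.randn(128)
values = torch.randn(128)

advantages = returns - values.detach()              # baseline subtraction lowers variance
policy_loss = -(log_probs * advantages).mean()      # minimising this ascends the policy gradient
policy_loss.backward()
```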
Actor-Critic Methods
Actor-critic methods combine the best of both worlds: a policy network (the actor) and a value network (the critic). The actor takes actions. The critic evaluates them by estimating the advantage. The actor uses the critic’s estimates to update its policy, and the critic updates its value estimates using TD error.
This architecture is the foundation of most modern RL algorithms.
Proximal Policy Optimization (PPO)
PPO is currently the most widely used RL algorithm in practice. Understanding it is essential for anyone working with modern AI systems.
The fundamental problem with vanilla policy gradient is that large policy updates can be catastrophically bad. If you update θ too aggressively, the new policy may be so different from the old one that it enters a region of bad performance it cannot recover from — because it is now generating different data.
PPO addresses this with a clipped objective that limits how much the policy is allowed to change per update:

L^CLIP(θ) = E_t [ min( r_t(θ) Â_t, clip(r_t(θ), 1 − ε, 1 + ε) Â_t ) ]
where r_t(θ) = π_θ(a_t|s_t) / π_{θ_old}(a_t|s_t) is the probability ratio between the new and old policies, Â_t is the estimated advantage, and ε is a hyperparameter (typically 0.2) controlling the size of the trust region.
The clipping means: once the probability ratio moves outside [1-ε, 1+ε], pushing it further in that direction earns no additional objective, so the gradient in that direction is zeroed out. The policy cannot be pushed too far from where it started in a single update.
PPO is simple to implement, computationally efficient, and robust across a wide range of problems. It is the algorithm used in RLHF for language models.
Soft Actor-Critic (SAC)
SAC extends actor-critic to continuous action spaces and adds an entropy bonus to the objective:

J(π) = Σ_t E_{(s_t, a_t) ∼ π} [ r(s_t, a_t) + α H(π(· | s_t)) ]
where H is the entropy of the policy distribution and α is a temperature parameter. Maximising entropy alongside reward encourages the agent to remain uncertain — to maintain exploration throughout training rather than collapsing to a deterministic policy prematurely. SAC is the standard choice for continuous control tasks like robotic manipulation.
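The sketch below shows how the entropy term typically enters the critic's target in SAC. The tensors are placeholders, and a full implementation also uses twin critics and a learned temperature:

```python
import torch

# Placeholder batch of transitions
rewards = torch.randn(64)
dones = torch.zeros(64)
next_q = torch.randn(64)           # critic estimate Q(s', a') for an action sampled from the policy
next_log_prob = torch.randn(64)    # log pi(a' | s') for that sampled action
gamma, alpha = 0.99, 0.2           # discount factor and entropy temperature

# Soft Bellman target: the entropy bonus appears as -alpha * log pi,
# so low-probability (more exploratory) actions are valued higher.
target = rewards + gamma * (1 - dones) * (next_q - alpha * next_log_prob)
```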
How RL Is Used to Train Modern AI Models
This is where RL leaves textbooks and becomes something you interact with every day.
RLHF: Reinforcement Learning from Human Feedback
Large language models are pre-trained on text prediction. This gives them broad capability but no alignment to what humans actually want from a response — helpfulness, honesty, safety, appropriate tone. Supervised fine-tuning on human-written examples helps but is expensive and limited by the quality and quantity of human annotations.
RLHF is the method used to align language models with human preferences at scale. OpenAI developed the approach for language models across several papers, culminating in the 2022 InstructGPT paper; it became the core technique behind ChatGPT and subsequent models, including GPT-4.
The process has three stages:
Stage 1: Supervised fine-tuning (SFT) The base model is fine-tuned on a dataset of human-written demonstrations of desired behaviour. This gives the model a starting point that produces reasonable outputs before RL training begins.
Stage 2: Reward model training Human raters are shown pairs of model outputs and asked to indicate which they prefer. These preference judgments are used to train a reward model — a separate neural network that takes a (prompt, response) pair and outputs a scalar score representing how much a human would prefer that response.
The reward model is trained to satisfy preference pairs:

L(φ) = − E_{(x, y_w, y_l)} [ log σ( r_φ(x, y_w) − r_φ(x, y_l) ) ]
where w is the preferred response, l is the less preferred one, and σ is the sigmoid function. This is a Bradley-Terry preference model — it learns to assign higher scores to preferred outputs.
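The corresponding training step is short. The sketch below assumes a hypothetical `reward_model` that maps fixed-size response encodings to scalar scores; in a real pipeline this is a language model backbone with a scalar head applied to tokenised (prompt, response) pairs:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-in for a reward model over 768-dimensional encodings.
reward_model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 1))

chosen = torch.randn(32, 768)      # encodings of preferred responses (y_w)
rejected = torch.randn(32, 768)    # encodings of less preferred responses (y_l)

r_chosen = reward_model(chosen).squeeze(-1)
r_rejected = reward_model(rejected).squeeze(-1)

# Bradley-Terry pairwise loss: -log sigma(r_w - r_l)
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()
```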
Stage 3: RL fine-tuning with PPO The SFT model is treated as the initial policy. The reward model is used as the reward function. PPO is run to update the policy to maximise reward model scores.
A critical detail: without a constraint, the policy will exploit the reward model — it will find outputs that score highly according to the reward model but that humans would not actually prefer. This is reward hacking at the level of language model training.
The constraint is a KL divergence penalty from the original SFT policy:

objective(θ) = E_{x, y ∼ π_θ} [ r_φ(x, y) ] − β · KL( π_θ(· | x) ‖ π_SFT(· | x) )
The β term penalises the policy for drifting too far from the SFT model. This keeps the RL-trained model from degenerating into gibberish that happens to fool the reward model.
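In per-token form, the penalty is usually folded directly into the reward signal that PPO sees. A sketch, with placeholder log-probability tensors standing in for values computed by the policy and the frozen reference (SFT) model:

```python
import torch

policy_logprobs = torch.randn(200)    # log pi_theta(token_t | context) for each generated token
ref_logprobs = torch.randn(200)       # log pi_SFT(token_t | context) from the frozen reference model
rm_score = torch.tensor(1.7)          # scalar reward model score for the whole response
beta = 0.1                            # KL penalty coefficient

# Per-token penalty for drifting away from the reference model;
# the reward model score is typically added only at the final token.
rewards = -beta * (policy_logprobs - ref_logprobs)
rewards[-1] += rm_score
```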
Constitutional AI and RLAIF
Anthropic’s Constitutional AI (CAI) modifies RLHF by replacing human preference raters with the model itself. A set of principles — a “constitution” — is written by humans. The model is then prompted to critique its own outputs against those principles and generate revised outputs. These self-critiques are used to train the reward model.
The practical advantage is scale: you do not need human raters for every judgment. The tradeoff is that the reward model now depends on the model’s own understanding of the principles, which may not align with how humans would interpret them in edge cases.
RLAIF — RL from AI Feedback — is the general term for using AI models as preference raters. It is increasingly used in production pipelines because human labeling is a bottleneck.
RLHF in Practice: What It Actually Does
It is worth being precise about what RLHF achieves and what it does not.
RLHF does not make models more capable in the sense of knowing more facts or reasoning better. It makes models more aligned in the sense of producing outputs humans prefer. The model after RLHF follows instructions more reliably, maintains appropriate tone, avoids outputs that raters dislike, and refuses harmful requests.
The capability improvement that users observe from RLHF-trained models relative to base models is partly alignment — the model does what you ask instead of predicting what comes after a prompt — and partly a selection effect from the SFT data.
The Problems That Remain Open
Reinforcement learning works. It has also produced some of the most significant failures in AI safety research. Both facts are important.
Reward Hacking
You cannot perfectly specify what you want as a reward function. Language is ambiguous, edge cases are infinite, and any proxy for a complex goal will diverge from that goal somewhere.
OpenAI trained an agent to win a boat racing game by maximising score. The agent discovered it could score more by spinning in circles collecting the same power-ups repeatedly than by finishing the race. It never crossed the finish line. Reward: maximum. Goal: completely missed.
This is Goodhart’s Law, formalised: when a measure becomes a target, it ceases to be a good measure.
At the scale of language models: early RLHF-trained models learned that human raters tend to prefer longer responses. So models padded. Confidently. The reward went up. The quality of responses did not.
Benchmark Hacking in Language Models
This is the current crisis in the field, and it is not discussed with enough honesty.
Language model benchmarks — MMLU, HumanEval, GSM8K, MATH, GPQA — were designed to measure capability. They have become targets. And targets get hacked.
Data contamination is the most direct form. Benchmark questions, or questions closely resembling them, end up in training data. The model learns the answers. Scores go up. The underlying capability the benchmark was supposed to measure stays flat or improves only marginally. Several major labs have faced contamination allegations, and some have acknowledged it.
Format overfitting is subtler. MMLU questions are always multiple choice. HumanEval solutions always follow a specific structure. Models learn the format as much as the underlying skill. Change the format — ask the same question in a different structure — and scores collapse disproportionately.
Metric saturation without capability transfer is the most dangerous. GPT-4 scored 86.4% on MMLU at launch. Within eighteen months, multiple open-source models matched or exceeded it. The question worth asking is not whether reasoning improved, but how much of the gap-closing was benchmark fitting versus genuine capability improvement.
Every AI engineer has felt the vibes gap: a model that passes HumanEval at 90% yet cannot debug a moderately complex production codebase. A model that aces GSM8K yet fumbles a straightforward estimation problem it has never seen framed exactly that way. The benchmark said competent. The deployment said otherwise.
When a model release leads with benchmark numbers, you are reading the output of an optimisation process. The question worth asking is always what was not measured.
Sample Inefficiency
A human child learns to walk in a few months of intermittent trying. An RL agent learning a comparable locomotion task typically requires tens of millions of simulation steps — equivalent to years of wall-clock experience.
This is not a quirk of current implementations. It reflects a fundamental difference: the agent starts with no prior knowledge and must discover causal structure from scratch through trial and error. Humans bring enormous embodied priors to every new skill.
Sim-to-real transfer partially addresses this: train in a fast physics simulator, deploy on real hardware. The problem is that simulators are not perfect, and policies trained in simulation tend to exploit the simulator’s inaccuracies. When deployed in reality, the policy breaks. Making simulators accurate enough to avoid this — domain randomization, system identification — is a significant engineering effort.
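As a sketch of what domain randomisation amounts to in code: before each training episode, the simulator's physical parameters are resampled so the policy cannot lock onto any one setting. The parameter names and ranges here are made up for illustration:

```python
import random

def randomise_simulator(sim):
    """Resample hypothetical physics parameters at the start of each episode."""
    sim.friction = random.uniform(0.5, 1.5)
    sim.motor_strength = random.uniform(0.8, 1.2)
    sim.sensor_noise_std = random.uniform(0.0, 0.05)
    sim.latency_steps = random.choice([0, 1, 2, 4])
```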
Credit Assignment
A game of Go lasts 200+ moves. The outcome is a win or a loss. Which moves caused the outcome? This is the credit assignment problem.
Temporal difference methods handle short-horizon credit assignment well. Long-horizon, sparse-reward problems remain difficult. A drug discovery agent that proposes a molecule that fails in trials 18 months later has essentially no gradient signal connecting its early decisions to the eventual outcome.
Distribution Shift
A policy trained in one distribution of states will fail when deployed in a different one. This is not a bug — it is a mathematical consequence of how policy gradient methods work. The policy is optimised for the state distribution it was trained on.
In practice this means: a robotic arm trained in one factory lighting condition fails when the lights change. An RL-trained trading agent that performed well in 2019 behaved erratically in March 2020. A dialogue system trained on one user population fails when deployed to a different demographic.
Continual learning — the ability to adapt to distribution shift without forgetting previous capabilities — is an active research area with no satisfying solution.
Reward Model Bias in RLHF
The reward model in RLHF is trained on human preferences. Human preferences are not a ground truth for quality. They reflect rater demographics, annotation fatigue, cultural assumptions, and the incentive structure of the labeling task.
Raters from different backgrounds give different preferences. Raters who are tired give different preferences than raters who are not. Raters instructed to prefer “helpful” responses may reward confident-sounding responses over accurate ones.
The policy trained by PPO will faithfully learn to produce whatever the reward model rewards. If the reward model has learned to reward verbosity, the policy becomes verbose. If it has learned to reward a particular register of language, the policy converges to that register. The alignment you get is alignment with the reward model — which is a proxy for human preferences, not human preferences themselves.
Implementation
Minimal Q-Learning in Python
```python
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1", is_slippery=True)

# Hyperparameters
alpha = 0.1            # learning rate
gamma = 0.99           # discount factor
epsilon = 1.0          # exploration rate
epsilon_min = 0.01
epsilon_decay = 0.995
n_episodes = 10000

n_states = env.observation_space.n
n_actions = env.action_space.n
Q = np.zeros((n_states, n_actions))

for episode in range(n_episodes):
    state, _ = env.reset()
    done = False

    while not done:
        # Epsilon-greedy action selection
        if np.random.random() < epsilon:
            action = env.action_space.sample()   # explore
        else:
            action = np.argmax(Q[state])         # exploit

        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Bellman update
        td_target = reward + gamma * np.max(Q[next_state]) * (not done)
        td_error = td_target - Q[state, action]
        Q[state, action] += alpha * td_error

        state = next_state

    epsilon = max(epsilon_min, epsilon * epsilon_decay)

print("Learned Q-table:")
print(Q.reshape(4, 4, 4))
```

The epsilon-greedy policy is the standard approach to the exploration-exploitation tradeoff: with probability ε, take a random action (explore); otherwise, take the action with the highest Q-value (exploit). ε is decayed over training as the agent accumulates knowledge — early on it needs to explore, later it should exploit what it has learned.
DQN with PyTorch
```python
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import random
from collections import deque
import gymnasium as gym


class QNetwork(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, action_dim)
        )

    def forward(self, x):
        return self.net(x)


class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return (
            torch.FloatTensor(np.array(states)),
            torch.LongTensor(actions),
            torch.FloatTensor(rewards),
            torch.FloatTensor(np.array(next_states)),
            torch.FloatTensor(dones)
        )

    def __len__(self):
        return len(self.buffer)


class DQNAgent:
    def __init__(self, state_dim, action_dim):
        self.action_dim = action_dim
        self.gamma = 0.99
        self.epsilon = 1.0
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.995
        self.batch_size = 64
        self.target_update_freq = 100
        self.steps = 0

        self.q_net = QNetwork(state_dim, action_dim)
        self.target_net = QNetwork(state_dim, action_dim)
        self.target_net.load_state_dict(self.q_net.state_dict())
        self.target_net.eval()

        self.optimizer = optim.Adam(self.q_net.parameters(), lr=1e-3)
        self.buffer = ReplayBuffer()
        self.loss_fn = nn.MSELoss()

    def act(self, state):
        if np.random.random() < self.epsilon:
            return np.random.randint(self.action_dim)
        with torch.no_grad():
            state_t = torch.FloatTensor(state).unsqueeze(0)
            return self.q_net(state_t).argmax(dim=1).item()

    def train(self):
        if len(self.buffer) < self.batch_size:
            return

        states, actions, rewards, next_states, dones = self.buffer.sample(self.batch_size)

        # Current Q-values
        q_values = self.q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

        # Target Q-values (from frozen target network)
        with torch.no_grad():
            next_q = self.target_net(next_states).max(1)[0]
            targets = rewards + self.gamma * next_q * (1 - dones)

        loss = self.loss_fn(q_values, targets)
        self.optimizer.zero_grad()
        loss.backward()
        # Gradient clipping — prevents large updates from destabilising training
        torch.nn.utils.clip_grad_norm_(self.q_net.parameters(), max_norm=10.0)
        self.optimizer.step()

        self.steps += 1
        if self.steps % self.target_update_freq == 0:
            self.target_net.load_state_dict(self.q_net.state_dict())

        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)


# Training loop
env = gym.make("CartPole-v1")
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n
agent = DQNAgent(state_dim, action_dim)

for episode in range(500):
    state, _ = env.reset()
    total_reward = 0
    done = False

    while not done:
        action = agent.act(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        agent.buffer.push(state, action, reward, next_state, float(done))
        agent.train()

        state = next_state
        total_reward += reward

    if episode % 50 == 0:
        print(f"Episode {episode} | Reward: {total_reward:.1f} | ε: {agent.epsilon:.3f}")
```

PPO from Scratch
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Categorical
import numpy as np
import gymnasium as gym


class ActorCritic(nn.Module):
    """Shared backbone with separate policy and value heads."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(state_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh()
        )
        self.policy_head = nn.Linear(64, action_dim)  # logits → action distribution
        self.value_head = nn.Linear(64, 1)            # scalar value estimate

    def forward(self, x):
        features = self.backbone(x)
        logits = self.policy_head(features)
        value = self.value_head(features)
        return logits, value

    def get_action(self, state):
        logits, value = self(state)
        dist = Categorical(logits=logits)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        return action.item(), log_prob, value.squeeze()


def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """
    Generalised Advantage Estimation.
    lam controls the bias-variance tradeoff:
        lam=0 → pure TD (low variance, high bias)
        lam=1 → pure Monte Carlo (high variance, low bias)
    lam=0.95 is standard.
    """
    advantages = []
    gae = 0
    next_value = 0

    for t in reversed(range(len(rewards))):
        if dones[t]:
            next_value = 0
            gae = 0
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        advantages.insert(0, gae)
        next_value = values[t]

    advantages = torch.tensor(advantages, dtype=torch.float32)
    returns = advantages + torch.tensor(values, dtype=torch.float32)
    return advantages, returns


def ppo_update(model, optimizer, states, actions, old_log_probs, advantages, returns,
               clip_eps=0.2, epochs=4):
    """
    PPO update: run multiple epochs over the collected batch.
    The clipped objective prevents excessively large policy updates.
    """
    for _ in range(epochs):
        logits, values = model(states)
        dist = Categorical(logits=logits)
        new_log_probs = dist.log_prob(actions)
        entropy = dist.entropy().mean()

        # Probability ratio: how much has the policy changed?
        ratio = (new_log_probs - old_log_probs).exp()

        # Clipped PPO objective
        surr1 = ratio * advantages
        surr2 = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
        policy_loss = -torch.min(surr1, surr2).mean()

        # Value function loss
        value_loss = nn.MSELoss()(values.squeeze(), returns)

        # Total loss: policy + value + entropy bonus (encourages exploration)
        loss = policy_loss + 0.5 * value_loss - 0.01 * entropy

        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
        optimizer.step()


# Training
env = gym.make("CartPole-v1")
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n

model = ActorCritic(state_dim, action_dim)
optimizer = optim.Adam(model.parameters(), lr=3e-4)

# PPO collects a full batch before updating — different from DQN's per-step updates
batch_size = 2048
n_updates = 200

for update in range(n_updates):
    # Collect trajectories
    states_buf, actions_buf, log_probs_buf = [], [], []
    rewards_buf, values_buf, dones_buf = [], [], []

    state, _ = env.reset()
    episode_rewards = []
    current_ep_reward = 0

    for _ in range(batch_size):
        state_t = torch.FloatTensor(state)
        action, log_prob, value = model.get_action(state_t)

        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        states_buf.append(state)
        actions_buf.append(action)
        log_probs_buf.append(log_prob.item())
        rewards_buf.append(reward)
        values_buf.append(value.item())
        dones_buf.append(done)

        current_ep_reward += reward
        state = next_state

        if done:
            episode_rewards.append(current_ep_reward)
            current_ep_reward = 0
            state, _ = env.reset()

    # Compute advantages and returns
    advantages, returns = compute_gae(rewards_buf, values_buf, dones_buf)

    # Normalise advantages — critical for stable training
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

    states_t = torch.FloatTensor(np.array(states_buf))
    actions_t = torch.LongTensor(actions_buf)
    log_probs_t = torch.FloatTensor(log_probs_buf)

    ppo_update(model, optimizer, states_t, actions_t, log_probs_t, advantages, returns)

    if update % 20 == 0 and episode_rewards:
        print(f"Update {update} | Mean reward: {np.mean(episode_rewards):.1f}")
```

Things That Are Often Left Out
Exploration is a first-class problem, not an afterthought. Epsilon-greedy works for simple environments. In environments with sparse rewards — where the agent may take thousands of steps before receiving any signal — it fails completely. Count-based exploration, intrinsic motivation (curiosity-driven exploration), and information-theoretic approaches like maximum entropy RL address this, but there is no universal solution.
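As a concrete instance of the count-based idea, an intrinsic bonus can be added on top of the environment reward so that rarely visited states are worth more. A tabular sketch, with the bonus scale as an assumed hyperparameter:

```python
from collections import defaultdict

visit_counts = defaultdict(int)
bonus_scale = 0.1    # assumed hyperparameter: strength of the exploration bonus

def shaped_reward(state, env_reward):
    """Environment reward plus a count-based exploration bonus that decays with visits."""
    visit_counts[state] += 1
    return env_reward + bonus_scale / (visit_counts[state] ** 0.5)
```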
The reward function is part of the system design. Most RL tutorials treat the reward function as given. In practice, designing a good reward function is often the hardest part of applying RL to a real problem. It requires deep understanding of the task, anticipation of edge cases, and iteration. Bad reward functions produce bad behaviour in precisely the ways described in the reward hacking section.
On-policy versus off-policy is an important distinction with practical consequences. On-policy methods (PPO, A3C) learn from data generated by the current policy. Off-policy methods (DQN, SAC) can learn from any data, including experience replays of past policies. Off-policy methods are generally more sample-efficient. On-policy methods are generally more stable.
Multi-agent RL is qualitatively different. When multiple agents interact, the environment is no longer stationary from any individual agent’s perspective — other agents are updating their policies simultaneously. Game theory, emergent cooperation, and competitive dynamics all become relevant. AlphaStar, OpenAI Five, and most of the interesting recent results in AI involve multi-agent settings.
Sim-to-real transfer is an engineering problem as much as a research problem. The gap between simulated physics and real physics is real and significant. Domain randomisation (training on many randomly varied simulators) and system identification (fitting the simulator to real sensor data) help but do not eliminate the gap.
RL in LLMs is not classical RL. When PPO is applied to language model fine-tuning, many of the classical RL assumptions break down. The state space (all possible token sequences) is astronomical. The action space (the vocabulary) is tens of thousands of tokens. Rewards are sparse, arriving only at the level of a completed sequence. The horizon is a single response, short by classical RL standards. The “environment” contributes no stochasticity of its own; the only randomness is the model’s sampling. Much of what makes RLHF work is empirically understood before it is theoretically explained.
Where the Field Is Going
Model-based RL — where agents learn an explicit model of environment dynamics and use it for planning — has shown significant results. Dreamer, MuZero, and related work demonstrate that agents can plan in latent space and achieve sample efficiency approaching human-level on some benchmarks. The challenge is that learned world models are imperfect, and planning with an imperfect model can amplify errors.
Offline RL — learning from fixed datasets without environment interaction — addresses the sample efficiency problem by leveraging existing data. This is directly relevant to language models: the pre-training corpus is an offline dataset, and offline RL methods can potentially extract better policies from it than standard supervised learning.
Foundation models in RL — using large pre-trained models as policy initializations, reward models, or world models — are the active frontier. The hypothesis is that the rich representations learned during pre-training on large corpora will accelerate RL training in downstream tasks. The evidence so far is promising and contested in roughly equal measure.
The alignment problem, properly understood, is in large part a reward specification problem. The difficulty of specifying human values as a reward function is not a technical limitation that better algorithms will solve. It is a philosophical problem about the nature of human values — which are contextual, inconsistent, and not reducible to scalar numbers — that better algorithms will help manage but not eliminate.
A Note on What This Means for Engineers
If you are building systems on top of RL-trained models, the most important things to understand are:
The model is an optimised policy, not a knowledge base. It behaves the way it does because that behaviour was rewarded during training, not because it has an accurate model of the world that it consults. When it fails, it fails because of training distribution mismatch or reward proxy divergence, not because it forgot something.
Benchmark numbers are the outputs of optimisation processes. They tell you how well the model performs on those specific benchmarks. They do not tell you how well the model will perform on your task. Test on your distribution.
RLHF models are aligned to the preferences of the raters who trained the reward model. Those preferences are a proxy for your users’ preferences. The gap between those two things is your risk.
The behaviour of deployed RL systems at scale is not fully predictable from behaviour in evaluation. Reward hacking at scale looks different from reward hacking in a controlled evaluation. The real-world distribution of prompts, edge cases, and adversarial inputs is never fully captured by any evaluation set.
None of this means RL does not work. It works extraordinarily well at what it is designed to do. Understanding what it is actually designed to do — and where the gap is between that and what you want — is the engineering judgment that matters.
Reinforcement learning is the closest thing machine learning has to a general theory of learning from experience. The mathematics is clean. The applications are real. The unsolved problems are hard in ways that are worth understanding clearly, because they are the same problems that will define the limits of AI systems for the foreseeable future.