Learning by Doing

Reinforcement learning (RL) is the AI paradigm where an agent learns by interacting with an environment and receiving feedback in the form of rewards or penalties. Unlike supervised learning (which needs labeled examples) or unsupervised learning (which finds patterns in data), RL learns from consequences.

The agent takes actions, observes outcomes, and adjusts its strategy (called a policy) to maximize cumulative reward over time. It is the same principle that governs how humans learn to ride a bicycle — you try, fail, adjust, and eventually succeed.
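The act-observe-adjust loop can be sketched in a few lines of Python. The five-cell Corridor environment and the random (non-learning) policy below are inventions for this sketch, not a standard API; a real agent would also update its policy from the rewards it collects:

```python
import random

class Corridor:
    """Toy environment: the agent starts at cell 0; cell 4 is the goal."""
    def __init__(self):
        self.state = 0

    def step(self, action):               # action is -1 (left) or +1 (right)
        self.state = max(0, min(4, self.state + action))
        done = self.state == 4
        reward = 1.0 if done else -0.1    # small step cost, big goal reward
        return self.state, reward, done

env = Corridor()
rng = random.Random(0)                    # seeded so the run is repeatable
total_reward, done = 0.0, False
while not done:                           # the interaction loop: act, observe
    action = rng.choice([-1, 1])          # a random (non-learning) policy
    state, reward, done = env.step(action)
    total_reward += reward
print(total_reward)
```

The step cost makes slow wandering expensive, so the cumulative reward already tells us this random policy is a poor one.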

Key Concepts

Agent: the learner and decision-maker.
Environment: the world the agent interacts with.
State: the current situation.
Action: what the agent can do.
Reward: the feedback signal after each action.

The agent's goal is to learn a policy — a mapping from states to actions — that maximizes total expected reward. This can be simple (go left when near a wall) or incredibly complex (how to play chess at a grandmaster level).
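As a concrete (if toy) illustration, the sketch below learns such a state-to-action mapping with tabular Q-learning on a five-cell corridor. The environment, the +1 reward at the goal, and all hyperparameters are assumptions made up for this example:

```python
import random

# Tabular Q-learning on a toy corridor: states 0-4, goal at cell 4.
N_STATES, GOAL = 5, 4
ACTIONS = [-1, 1]                         # left, right
alpha, gamma, epsilon = 0.5, 0.9, 0.2    # learning rate, discount, exploration
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
rng = random.Random(0)                    # seeded for repeatability

def step(state, action):
    nxt = max(0, min(GOAL, state + action))
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL

def greedy(s):                            # break ties randomly among best actions
    best = max(Q[(s, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if Q[(s, a)] == best])

for _ in range(300):                      # episodes
    s, done, steps = 0, False, 0
    while not done and steps < 50:
        # epsilon-greedy: mostly exploit the best-known action, sometimes explore
        a = rng.choice(ACTIONS) if rng.random() < epsilon else greedy(s)
        s2, r, done = step(s, a)
        # move Q(s, a) toward reward + discounted best future value
        target = r if done else r + gamma * max(Q[(s2, a2)] for a2 in ACTIONS)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s, steps = s2, steps + 1

# The learned policy: the highest-valued action in each state.
policy = {s: greedy(s) for s in range(N_STATES)}
print(policy)
```

After a few hundred episodes the policy maps every non-goal state to "go right." Tabular Q-learning is among the simplest RL algorithms; systems at the scale of Go or StarCraft replace the table with a neural network.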

Famous Successes

RL is behind some of AI's most dramatic achievements. AlphaGo defeated the world champion at Go. AlphaStar reached top-tier play in StarCraft II. OpenAI Five beat pro teams at Dota 2.

Beyond games, RL optimizes data center cooling (saving millions in energy costs), controls robotic manipulation, and tunes recommendation algorithms. Perhaps most importantly, RLHF (reinforcement learning from human feedback) is a central technique for aligning large language models, making chatbots helpful rather than harmful.

When to Consider RL

RL works best in sequential decision-making problems where the outcome depends on a series of actions. If you have a clear reward signal and a simulable environment, RL can discover strategies no human would design.

However, RL is sample-inefficient (it typically needs vast numbers of environment interactions), sensitive to reward design, and hard to debug. For most business problems, supervised learning is the simpler choice. But for optimization, control, and alignment, RL is irreplaceable.