Introduction to Reinforcement Learning
Reinforcement Learning (RL) is a framework for sequential decision-making, where an agent learns to achieve goals by interacting with an environment. Unlike supervised learning, where correct outputs are provided for every input, in RL the agent must explore, experiment, and learn from feedback in the form of rewards.
Core Concepts: Agent, Environment, Reward, and Policy
Reinforcement learning relies on four fundamental components that define how an agent interacts with its environment. These components work together to allow the agent to learn from experience, make decisions, and achieve its objectives over time.
Agent
The agent is the decision-maker in a reinforcement learning system. It observes the current state of the environment, selects actions based on its policy, and learns from the outcomes of those actions. The agent’s goal is to maximize cumulative rewards over time, improving its decisions as it gains more experience.
Agents can be simple, like a program moving through a grid, or complex, like a self-driving car or an AI playing a strategy game. They rely on feedback from the environment (rewards and new states) to refine their behavior.
Example:
- A robot navigating a maze, learning which paths lead to the goal efficiently
- A software program playing chess, deciding which moves increase its chances of winning
Environment
The environment includes everything outside the agent that the agent interacts with. It defines the rules of the task, provides feedback in the form of rewards, and transitions the agent from one state to another based on its actions. The environment can be simple, like a grid world, or complex, like a real-world robot setting or a video game.
Example: In a grid world, the environment consists of:
- The layout of the grid
- Obstacles that block movement
- Goal locations that provide rewards
- Any rules for movement or interactions
The environment responds to the agent’s actions by updating the state and giving rewards, forming the basis for the agent’s learning process.
Reward
The reward is a numerical signal provided by the environment that indicates the immediate success or failure of the agent’s action. Rewards serve as feedback to guide the agent toward desirable behavior and away from undesirable actions. By maximizing cumulative rewards over time, the agent learns which actions lead to better outcomes.
Rewards can be positive (encouraging certain behavior), negative (penalizing mistakes), or neutral (no immediate effect). The design of the reward system is critical, as it directly shapes the agent’s learning and decision-making.
Example: A robot navigating a maze might receive:
- +1 for reaching the goal
- 0 for moving to an empty square
- -1 for hitting a wall
The agent uses these rewards to evaluate its actions and adjust its policy to achieve long-term success.
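The maze reward scheme above can be sketched as a small function. The goal position and wall set below are illustrative assumptions, not part of any specific maze:

```python
# Hypothetical maze layout: goal cell and wall cells are illustrative assumptions.
GOAL = (3, 3)
WALLS = {(1, 1), (2, 1)}

def reward(next_pos):
    """Immediate reward for the robot moving into next_pos."""
    if next_pos == GOAL:
        return +1   # reaching the goal
    if next_pos in WALLS:
        return -1   # hitting a wall
    return 0        # moving to an empty square
```

Note that the reward only scores the immediate outcome; it is the agent's job to combine these signals into an estimate of long-term value.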
Policy
The policy defines the agent’s behavior by specifying how it chooses actions in each state. It is essentially the agent’s “strategy” for interacting with the environment. A good policy balances immediate rewards with long-term gains, helping the agent make decisions that maximize cumulative reward over time.
Policies can be of two main types:
- Deterministic: The agent always chooses the same action in a given state. For example, in a maze, if the robot is in the top-left corner, it always moves right.
- Stochastic: The agent selects actions according to a probability distribution, allowing for exploration and adaptability.
Purpose of the Policy:
- Guides the agent’s actions based on current state information
- Balances exploration (trying new actions) and exploitation (choosing the best-known action)
- Determines how quickly and effectively the agent learns to achieve its goal
Example: In a grid-world maze, the policy maps each square to an action (up, down, left, right) in a way that helps the robot reach the goal efficiently while avoiding obstacles.
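The two policy types can be contrasted in a short sketch. The state-to-action table and the probability weights below are illustrative assumptions:

```python
import random

ACTIONS = ["up", "down", "left", "right"]

def deterministic_policy(state):
    """Always returns the same action for a given state."""
    table = {(0, 0): "right", (0, 1): "down"}  # hypothetical state -> action mapping
    return table.get(state, "up")

def stochastic_policy(state, rng=random):
    """Samples an action from a probability distribution over actions."""
    probs = [0.1, 0.1, 0.1, 0.7]  # illustrative: this state mostly favors 'right'
    return rng.choices(ACTIONS, weights=probs, k=1)[0]
```

A stochastic policy naturally supports exploration, since even low-probability actions are occasionally tried.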
Episodic vs Continuing Tasks
Reinforcement learning tasks can generally be divided into two main types: episodic and continuing. Understanding the difference between these types is important because it affects how the agent evaluates rewards, learns from experience, and updates its strategy over time.
Episodic Tasks
Episodic tasks have a clearly defined beginning and end, forming what is called an episode. Once the agent reaches a terminal state, the environment resets, and a new episode begins.
Characteristics:
- Defined start and end points
- Returns are computed per episode
- The environment resets to a start state after each episode, while what the agent has learned carries over
Example:
- A single game of chess
- Navigating a maze from start to goal
Continuing Tasks
Continuing tasks do not have terminal states. The agent interacts with the environment indefinitely, and learning occurs over an ongoing stream of experience.
Characteristics:
- No natural endpoints
- Returns are computed over an unbounded horizon, so discounting (γ < 1) is needed to keep them finite
- Policies are evaluated continuously
Example:
- Managing a stock portfolio
- Controlling a robot in continuous operation
Return and Discount Factor
The return is the total reward an agent expects to accumulate from a given time step onward. To account for uncertainty and the relative value of immediate versus future rewards, a discount factor γ is used:
Gₜ = Rₜ₊₁ + γ Rₜ₊₂ + γ² Rₜ₊₃ + …
γ ranges from 0 to 1:
- γ ≈ 0 → agent focuses on immediate rewards
- γ ≈ 1 → agent considers long-term rewards
Example: In a game, small rewards for moving toward the goal are balanced with a large reward at the goal. Discounting ensures the agent values both short-term and long-term outcomes.
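The return formula above translates directly into code. The reward sequence below (three empty steps, then the goal) is an illustrative assumption:

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ..."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# Small step rewards plus a large goal reward (illustrative numbers)
rewards = [0, 0, 0, 1]
discounted_return(rewards, 0.9)  # 0.9^3 * 1, roughly 0.729
discounted_return(rewards, 0.0)  # only the immediate reward counts: 0.0
```

Comparing γ = 0.9 with γ = 0.0 shows the effect described above: a far-sighted agent still values the goal reward three steps away, while a myopic one ignores it entirely.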
The Markov Property
A state satisfies the Markov property if the future depends only on the current state and not on the sequence of past states.
Formula:
P(Sₜ₊₁ | Sₜ) = P(Sₜ₊₁ | S₁, S₂, …, Sₜ)
Intuition:
The agent does not need to remember the full history of past states to make optimal decisions. The current state contains all relevant information needed to predict the next state.
Example:
- In chess, the next board configuration depends only on the current arrangement of pieces, not on the moves that led there.
- In a robot navigating a maze, the optimal next move depends only on the robot’s current position, not on the exact path it took to get there.
Markov Decision Processes (MDPs)
Most reinforcement learning problems are formalized as Markov Decision Processes (MDPs). An MDP provides a mathematical framework for modeling decision-making in environments where outcomes are partly random and partly under the control of an agent. It is defined by the tuple (S, A, P, R, γ):
- S (States): The set of all possible states the agent can be in. Each state represents a unique configuration of the environment.
- A (Actions): The set of all actions the agent can take. Actions influence how the agent transitions between states.
- P (Transition Probabilities): A function that defines the probability of moving to a new state given the current state and action. Formally, P(s′ | s, a) is the probability that taking action a in state s leads to state s′.
- R (Reward Function): A function that assigns a numerical reward for each state-action pair (or state transition). Rewards provide feedback to the agent about the desirability of its actions.
- γ (Discount Factor): A number between 0 and 1 that determines the importance of future rewards relative to immediate rewards.
Goal: The agent’s objective is to find a policy, a strategy mapping states to actions, that maximizes the expected cumulative return (sum of discounted rewards) over time.
Example: A robot navigating a grid world:
- Each square on the grid represents a state.
- Moving up, down, left, or right are the possible actions.
- Transition probabilities determine how reliably the robot moves as intended (e.g., slipping or obstacles may affect outcomes).
- Reaching the goal square provides a reward, while hitting walls or taking unnecessary steps may have negative rewards.
- The discount factor γ ensures the agent balances short-term rewards with long-term planning, helping it find the most efficient path.
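The grid-world MDP above can be sketched as a single transition function. The grid size, slip probability, and reward values are illustrative assumptions, not a standard benchmark:

```python
import random

SIZE = 4                 # states S: cells of a 4x4 grid
GOAL = (3, 3)            # terminal goal state
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
SLIP = 0.1               # stochastic transitions: chance the intended move is replaced

def step(state, action, rng=random):
    """One MDP transition: returns (next_state, reward, done)."""
    if rng.random() < SLIP:                      # models P(s' | s, a) with slipping
        action = rng.choice(list(ACTIONS))
    dr, dc = ACTIONS[action]
    r, c = state
    nr, nc = r + dr, c + dc
    if not (0 <= nr < SIZE and 0 <= nc < SIZE):  # bumping into the boundary
        return state, -1, False
    next_state = (nr, nc)
    if next_state == GOAL:
        return next_state, +1, True              # episode ends at the goal
    return next_state, 0, False
```

Here S, A, P, and R are all visible in one place; only γ is absent, since it belongs to the agent's learning objective rather than the environment dynamics.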
Exploration vs Exploitation
A central challenge in reinforcement learning is the exploration-exploitation trade-off. The agent must balance trying new actions to gain more knowledge about the environment with choosing actions it already knows will yield high rewards.
Exploration: Trying new or less-known actions to discover their potential rewards and improve future decisions.
Exploitation: Selecting actions that the agent currently believes will maximize reward, based on past experience.
Intuition:
- Excessive exploration can slow down learning, as the agent spends too much time experimenting rather than taking advantage of known good actions.
- Excessive exploitation can trap the agent in suboptimal strategies, preventing it from discovering better actions that lead to higher long-term rewards.
Example: the multi-armed bandit problem:
- The agent faces several slot machines (arms) with unknown payout probabilities.
- To maximize rewards, it must sometimes explore by trying different machines, but also exploit by repeatedly playing the machine that has so far paid out the most.
Balancing exploration and exploitation is critical for the agent to learn an effective strategy that maximizes long-term rewards.
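A common way to strike this balance is ε-greedy action selection, sketched below for the bandit setting. The arm payout probabilities, ε = 0.1, and step count are illustrative assumptions:

```python
import random

def epsilon_greedy(estimates, epsilon, rng=random):
    """With probability epsilon explore a random arm; otherwise exploit the best estimate."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))                       # explore
    return max(range(len(estimates)), key=estimates.__getitem__)   # exploit

def run_bandit(true_probs, steps=1000, epsilon=0.1, seed=0):
    """Play a multi-armed bandit, learning each arm's value by incremental averaging."""
    rng = random.Random(seed)
    counts = [0] * len(true_probs)
    estimates = [0.0] * len(true_probs)
    total = 0
    for _ in range(steps):
        arm = epsilon_greedy(estimates, epsilon, rng)
        payout = 1 if rng.random() < true_probs[arm] else 0
        counts[arm] += 1
        estimates[arm] += (payout - estimates[arm]) / counts[arm]  # running mean
        total += payout
    return estimates, total

estimates, total = run_bandit([0.2, 0.5, 0.8])
```

With ε = 0.1 the agent exploits its best-known arm about 90% of the time but still samples the others often enough to discover that the third arm pays best.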
Unit Summary
- Reinforcement Learning Framework: Reinforcement learning studies how an agent learns to make decisions by interacting with an environment and improving its behavior based on rewards received from its actions.
- Core Components: RL systems consist of an agent that takes actions in an environment, receives rewards as feedback, and follows a policy that determines how actions are selected in different states.
- Task Structures: RL problems can be episodic, where interactions have a clear start and end, or continuing, where the agent operates indefinitely without terminal states.
- Return and Discounting: The agent evaluates decisions using the return, which is the accumulated future reward. The discount factor (γ) determines how strongly the agent values long-term rewards compared to immediate ones.
- Markov Property: In many RL problems, the future depends only on the current state, not on the full history of previous states. This property simplifies decision-making and allows efficient modeling.
- Markov Decision Processes: RL problems are commonly formalized as MDPs, which define states, actions, transition probabilities, rewards, and discounting within a mathematical decision-making framework.
- Exploration vs Exploitation: A central challenge in RL is balancing exploration (trying new actions to gain knowledge) with exploitation (choosing actions that currently produce the highest rewards).