Introduction to Reinforcement Learning
Reinforcement Learning (RL) is a framework for sequential decision-making, where an agent learns to achieve goals by interacting with an environment. Unlike supervised learning, where correct outputs are provided for every input, in RL the agent must explore, experiment, and learn from feedback in the form of rewards.
Core Concepts: Agent, Environment, Reward, and Policy
Reinforcement learning relies on four fundamental components that define how an agent interacts with its environment. These components work together to allow the agent to learn from experience, make decisions, and achieve its objectives over time.
Agent
The agent is the decision-maker in a reinforcement learning system. It observes the current state of the environment, selects actions based on its policy, and learns from the outcomes of those actions. The agent’s goal is to maximize cumulative rewards over time, improving its decisions as it gains more experience.
Agents can be simple, like a program moving through a grid, or complex, like a self-driving car or an AI playing a strategy game. They rely on feedback from the environment (rewards and new states) to refine their behavior.
Example:
- A robot navigating a maze, learning which paths lead to the goal efficiently
- A software program playing chess, deciding which moves increase its chances of winning
Environment
The environment includes everything outside the agent that the agent interacts with. It defines the rules of the task, provides feedback in the form of rewards, and transitions the agent from one state to another based on its actions. The environment can be simple, like a grid world, or complex, like a real-world robot setting or a video game.
Example: In a grid world, the environment consists of:
- The layout of the grid
- Obstacles that block movement
- Goal locations that provide rewards
- Any rules for movement or interactions
The environment responds to the agent’s actions by updating the state and giving rewards, forming the basis for the agent’s learning process.
Reward
The reward is a numerical signal provided by the environment that indicates the immediate success or failure of the agent’s action. Rewards serve as feedback to guide the agent toward desirable behavior and away from undesirable actions. By maximizing cumulative rewards over time, the agent learns which actions lead to better outcomes.
Rewards can be positive (encouraging certain behavior), negative (penalizing mistakes), or neutral (no immediate effect). The design of the reward system is critical, as it directly shapes the agent’s learning and decision-making.
Example: A robot navigating a maze might receive:
- +1 for reaching the goal
- 0 for moving to an empty square
- -1 for hitting a wall
The agent uses these rewards to evaluate its actions and adjust its policy to achieve long-term success.
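The maze reward scheme above can be sketched as a small function. The goal position and wall set below are illustrative assumptions, not part of any specific maze:

```python
# Hypothetical maze layout: goal cell and wall cells are illustrative assumptions.
GOAL = (3, 3)
WALLS = {(1, 1), (2, 1)}

def reward(next_pos):
    """Immediate reward for the robot moving into next_pos."""
    if next_pos == GOAL:
        return +1   # reaching the goal
    if next_pos in WALLS:
        return -1   # hitting a wall
    return 0        # moving to an empty square
```

Note that the reward only scores the immediate outcome; it is the agent's job to combine these signals into an estimate of long-term value.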
Policy
The policy defines the agent’s behavior by specifying how it chooses actions in each state. It is essentially the agent’s “strategy” for interacting with the environment. A good policy balances immediate rewards with long-term gains, helping the agent make decisions that maximize cumulative reward over time.
Policies can be of two main types:
- Deterministic: The agent always chooses the same action in a given state. For example, in a maze, if the robot is in the top-left corner, it always moves right.
- Stochastic: The agent selects actions according to a probability distribution, allowing for exploration and adaptability.
Purpose of the Policy:
- Guides the agent’s actions based on current state information
- Balances exploration (trying new actions) and exploitation (choosing the best-known action)
- Determines how quickly and effectively the agent learns to achieve its goal
Example: In a grid-world maze, the policy maps each square to an action (up, down, left, right) in a way that helps the robot reach the goal efficiently while avoiding obstacles.
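The two policy types can be contrasted in a short sketch. The state-to-action table and the probability weights below are illustrative assumptions:

```python
import random

ACTIONS = ["up", "down", "left", "right"]

def deterministic_policy(state):
    """Always returns the same action for a given state."""
    table = {(0, 0): "right", (0, 1): "down"}  # hypothetical state -> action mapping
    return table.get(state, "up")

def stochastic_policy(state, rng=random):
    """Samples an action from a probability distribution over actions."""
    probs = [0.1, 0.1, 0.1, 0.7]  # illustrative: this state mostly favors 'right'
    return rng.choices(ACTIONS, weights=probs, k=1)[0]
```

A stochastic policy naturally supports exploration, since even low-probability actions are occasionally tried.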
Episodic vs Continuing Tasks
Reinforcement learning tasks can generally be divided into two main types: episodic and continuing. Understanding the difference between these types is important because it affects how the agent evaluates rewards, learns from experience, and updates its strategy over time.
Episodic Tasks
Episodic tasks have a clearly defined beginning and end, forming what is called an episode. Once the agent reaches a terminal state, the environment resets, and a new episode begins.
Characteristics:
- Defined start and end points
- Returns are computed per episode
- The environment resets to a start state after each episode, while what the agent has learned carries over
Example:
- A single game of chess
- Navigating a maze from start to goal
Continuing Tasks
Continuing tasks do not have terminal states. The agent interacts with the environment indefinitely, and learning occurs over an ongoing stream of experience.
Characteristics:
- No natural endpoints
- Returns are computed over an unbounded horizon, so discounting (γ < 1) is needed to keep them finite
- Policies are evaluated continuously
Example:
- Managing a stock portfolio
- Controlling a robot in continuous operation
Return and Discount Factor
The return is the total reward an agent expects to accumulate from a given time step onward. To account for uncertainty and the relative value of immediate versus future rewards, a discount factor γ is used:
Gₜ = Rₜ₊₁ + γ Rₜ₊₂ + γ² Rₜ₊₃ + …
γ ranges from 0 to 1:
- γ ≈ 0 → agent focuses on immediate rewards
- γ ≈ 1 → agent considers long-term rewards
Example: In a game, small rewards for moving toward the goal are balanced with a large reward at the goal. Discounting ensures the agent values both short-term and long-term outcomes.
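The return formula above translates directly into code. The reward sequence below (three empty steps, then the goal) is an illustrative assumption:

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ..."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# Small step rewards plus a large goal reward (illustrative numbers)
rewards = [0, 0, 0, 1]
discounted_return(rewards, 0.9)  # 0.9^3 * 1, roughly 0.729
discounted_return(rewards, 0.0)  # only the immediate reward counts: 0.0
```

Comparing γ = 0.9 with γ = 0.0 shows the effect described above: a far-sighted agent still values the goal reward three steps away, while a myopic one ignores it entirely.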
The Markov Property
A state satisfies the Markov property if the future depends only on the current state and not on the sequence of past states.
Formula:
P(Sₜ₊₁ | Sₜ) = P(Sₜ₊₁ | S₁, S₂, …, Sₜ)
Intuition:
The agent does not need to remember the full history of past states to make optimal decisions. The current state contains all relevant information needed to predict the next state.
Example:
- In chess, the next board configuration depends only on the current arrangement of pieces, not on the moves that led there.
- In a robot navigating a maze, the optimal next move depends only on the robot’s current position, not on the exact path it took to get there.
Markov Decision Processes (MDPs)
Most reinforcement learning problems are formalized as Markov Decision Processes (MDPs). An MDP provides a mathematical framework for modeling decision-making in environments where outcomes are partly random and partly under the control of an agent. It is defined by the tuple (S, A, P, R, γ):
- S (States): The set of all possible states the agent can be in. Each state represents a unique configuration of the environment.
- A (Actions): The set of all actions the agent can take. Actions influence how the agent transitions between states.
- P (Transition Probabilities): A function that defines the probability of moving to a new state given the current state and action. Formally, P(s′ | s, a) is the probability that taking action a in state s leads to state s′.
- R (Reward Function): A function that assigns a numerical reward for each state-action pair (or state transition). Rewards provide feedback to the agent about the desirability of its actions.
- γ (Discount Factor): A number between 0 and 1 that determines the importance of future rewards relative to immediate rewards.
Goal: The agent’s objective is to find a policy, a strategy mapping states to actions, that maximizes the expected cumulative return (sum of discounted rewards) over time.
Example: A robot navigating a grid world:
- Each square on the grid represents a state.
- Moving up, down, left, or right are the possible actions.
- Transition probabilities determine how reliably the robot moves as intended (e.g., slipping or obstacles may affect outcomes).
- Reaching the goal square provides a reward, while hitting walls or taking unnecessary steps may have negative rewards.
- The discount factor γ ensures the agent balances short-term rewards with long-term planning, helping it find the most efficient path.
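The grid-world MDP above can be sketched as a single transition function. The grid size, slip probability, and reward values are illustrative assumptions, not a standard benchmark:

```python
import random

SIZE = 4                 # states S: cells of a 4x4 grid
GOAL = (3, 3)            # terminal goal state
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
SLIP = 0.1               # stochastic transitions: chance the intended move is replaced

def step(state, action, rng=random):
    """One MDP transition: returns (next_state, reward, done)."""
    if rng.random() < SLIP:                      # models P(s' | s, a) with slipping
        action = rng.choice(list(ACTIONS))
    dr, dc = ACTIONS[action]
    r, c = state
    nr, nc = r + dr, c + dc
    if not (0 <= nr < SIZE and 0 <= nc < SIZE):  # bumping into the boundary
        return state, -1, False
    next_state = (nr, nc)
    if next_state == GOAL:
        return next_state, +1, True              # episode ends at the goal
    return next_state, 0, False
```

Here S, A, P, and R are all visible in one place; only γ is absent, since it belongs to the agent's learning objective rather than the environment dynamics.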
Exploration vs Exploitation
A central challenge in reinforcement learning is the exploration-exploitation trade-off. The agent must balance trying new actions to gain more knowledge about the environment with choosing actions it already knows will yield high rewards.
Exploration: Trying new or less-known actions to discover their potential rewards and improve future decisions.
Exploitation: Selecting actions that the agent currently believes will maximize reward, based on past experience.
Intuition:
- Excessive exploration can slow down learning, as the agent spends too much time experimenting rather than taking advantage of known good actions.
- Excessive exploitation can trap the agent in suboptimal strategies, preventing it from discovering better actions that lead to higher long-term rewards.
Example: the multi-armed bandit problem:
- The agent faces several slot machines (arms) with unknown payout probabilities.
- To maximize rewards, it must sometimes explore by trying different machines, but also exploit by repeatedly playing the machine that has so far paid out the most.
Balancing exploration and exploitation is critical for the agent to learn an effective strategy that maximizes long-term rewards.
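A common way to strike this balance is ε-greedy action selection, sketched below for the bandit setting. The arm payout probabilities, ε = 0.1, and step count are illustrative assumptions:

```python
import random

def epsilon_greedy(estimates, epsilon, rng=random):
    """With probability epsilon explore a random arm; otherwise exploit the best estimate."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))                       # explore
    return max(range(len(estimates)), key=estimates.__getitem__)   # exploit

def run_bandit(true_probs, steps=1000, epsilon=0.1, seed=0):
    """Play a multi-armed bandit, learning each arm's value by incremental averaging."""
    rng = random.Random(seed)
    counts = [0] * len(true_probs)
    estimates = [0.0] * len(true_probs)
    total = 0
    for _ in range(steps):
        arm = epsilon_greedy(estimates, epsilon, rng)
        payout = 1 if rng.random() < true_probs[arm] else 0
        counts[arm] += 1
        estimates[arm] += (payout - estimates[arm]) / counts[arm]  # running mean
        total += payout
    return estimates, total

estimates, total = run_bandit([0.2, 0.5, 0.8])
```

With ε = 0.1 the agent exploits its best-known arm about 90% of the time but still samples the others often enough to discover that the third arm pays best.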
Unit Summary
- Reinforcement Learning Framework: Reinforcement learning studies how an agent learns to make decisions by interacting with an environment and improving its behavior based on rewards received from its actions.
- Core Components: RL systems consist of an agent that takes actions in an environment, receives rewards as feedback, and follows a policy that determines how actions are selected in different states.
- Task Structures: RL problems can be episodic, where interactions have a clear start and end, or continuing, where the agent operates indefinitely without terminal states.
- Return and Discounting: The agent evaluates decisions using the return, which is the accumulated future reward. The discount factor (γ) determines how strongly the agent values long-term rewards compared to immediate ones.
- Markov Property: In many RL problems, the future depends only on the current state, not on the full history of previous states. This property simplifies decision-making and allows efficient modeling.
- Markov Decision Processes: RL problems are commonly formalized as MDPs, which define states, actions, transition probabilities, rewards, and discounting within a mathematical decision-making framework.
- Exploration vs Exploitation: A central challenge in RL is balancing exploration (trying new actions to gain knowledge) with exploitation (choosing actions that currently produce the highest rewards).