Mathematical Foundations
Reinforcement learning relies on formal mathematical tools to describe how agents evaluate actions and make decisions. This unit introduces the key concepts and equations that form the backbone of classical RL, including value functions, Bellman equations, and the notion of an optimal policy. Understanding these foundations is essential before implementing RL algorithms.
State-Value and Action-Value Functions
In reinforcement learning, the agent’s goal is to maximize expected cumulative reward. To measure how good a state or action is, we define two fundamental functions: state-value and action-value.
State-Value Function Vπ(s)
The state-value function, denoted as Vπ(s), measures how “good” it is for an agent to be in a particular state when following a specific policy π. It captures the total expected reward the agent can collect starting from that state. Understanding Vπ(s) is essential for decision-making in reinforcement learning because it tells the agent which states are valuable and which are not.
V^π(s) = E_π [ G_t | S_t = s ]
Where:
- S_t = s – the agent is in state s at time t.
- π – the policy, which defines the probability of taking each possible action in each state.
- G_t – the total future cumulative reward from time t onward:
G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + … = Σ_{k=0}^{∞} γ^k R_{t+k+1}
- R_{t+1} – reward received immediately after leaving state s.
- γ – discount factor (0 ≤ γ < 1) that reduces the weight of future rewards.
Intuition: Vπ(s) tells us the expected value of being in a state s if the agent acts according to policy π. Higher values mean the state is “better” because it leads to more rewards in the future.
Example: In a grid world, a square near the goal or containing a coin will have a higher Vπ(s) than a square far from rewards.
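The discounted return G_t defined above can be computed directly from a finite sequence of rewards. Here is a minimal Python sketch; the reward values and γ are made-up illustration numbers:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for a finite reward list."""
    g = 0.0
    # Accumulate backward, using the recursion G_t = R_{t+1} + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Rewards received after time t in some hypothetical episode
print(discounted_return([1.0, 0.0, 2.0], gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
```

The backward accumulation avoids computing powers of γ explicitly and is the same recursion the Bellman equations exploit.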
Action-Value Function Qπ(s, a)
The action-value function tells us how good it is to take a specific action in a state, considering not only the immediate reward but also all future rewards if we follow a policy π afterward.
Q^π(s, a) = E_π [ G_t | S_t = s, A_t = a ]
Explanation of symbols:
- S_t = s - starting state.
- A_t = a - action taken at time t.
- π - policy followed after taking action a.
- E_π[·] - expectation over all possible future trajectories.
- G_t - cumulative discounted reward, as defined above.
Intuition: Qπ(s, a) tells us how good a specific action is in a given state, considering both immediate and future rewards under policy π.
Example: In a grid world, Qπ(s, “up”) might be higher than Qπ(s, “down”) if moving up leads more directly toward the goal or coins.
Relation Between Vπ and Qπ
Once Qπ(s, a) is known, the state-value function can be recovered as:
V^π(s) = Σ_a π(a | s) Q^π(s, a)
Intuition: The value of a state is the expected value of all possible actions, weighted by the policy’s probabilities.
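This relation is a one-line weighted sum. A small sketch in Python, with invented policy probabilities and Q-values for a two-action state:

```python
def state_value_from_q(policy_probs, q_values):
    """V^pi(s) = sum_a pi(a|s) * Q^pi(s, a)."""
    return sum(p * q for p, q in zip(policy_probs, q_values))

# Hypothetical state with two actions, e.g. "up" and "right"
pi = [0.5, 0.5]   # pi(a|s) for each action
q = [1.0, 3.0]    # Q^pi(s, a) for each action
print(state_value_from_q(pi, q))  # 0.5*1.0 + 0.5*3.0 = 2.0
```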
Bellman Expectation Equation
The Bellman Expectation Equation expresses a recursive relationship for the value of a state under a given policy π:
V^π(s) = Σₐ π(a | s) Σₛ' P(s' | s, a) [ R(s, a, s') + γ V^π(s') ]
Explanation of symbols:
- V^π(s) - the value of state s under policy π; the expected cumulative reward when starting in state s and following π.
- Σₐ - sum over all possible actions a available in state s.
- π(a | s) - probability of taking action a in state s according to policy π.
- Σₛ' - sum over all possible next states s'.
- P(s' | s, a) - probability of transitioning to state s' from state s by taking action a.
- R(s, a, s') - reward received when transitioning from s to s' using action a.
- γ ∈ [0,1) - discount factor, which reduces the weight of future rewards.
- V^π(s') - value of the next state s', assuming the agent continues following policy π.
Intuition: The value of a state is the expected immediate reward plus the discounted expected value of the next state, averaged over all possible actions and transitions according to the policy π.
Example: In a grid world, if a state has two possible actions, up and right, each with probability 0.5 under π, then V^π(s) is the weighted sum of the expected rewards and next-state values for both actions. This lets the agent estimate the long-term value of each state under its current policy.
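One sweep of this backup over all states is the core of iterative policy evaluation. The sketch below assumes a tabular MDP stored in plain dictionaries; the two-state chain at the bottom (states "A" and "B", single action "go") is an invented example:

```python
def bellman_expectation_update(V, policy, P, R, gamma=0.9):
    """One sweep of V(s) <- sum_a pi(a|s) sum_s' P(s'|s,a) [R(s,a,s') + gamma V(s')].

    V:      dict  state -> current value estimate
    policy: dict  state -> {action: probability}
    P:      dict  (s, a) -> {s': probability}
    R:      dict  (s, a, s') -> reward
    """
    new_V = {}
    for s in V:
        total = 0.0
        for a, pi_sa in policy[s].items():
            for s2, p in P[(s, a)].items():
                total += pi_sa * p * (R[(s, a, s2)] + gamma * V[s2])
        new_V[s] = total
    return new_V

# Toy chain: A -> B earns reward 1, B -> B earns 0; one action "go"
policy = {"A": {"go": 1.0}, "B": {"go": 1.0}}
P = {("A", "go"): {"B": 1.0}, ("B", "go"): {"B": 1.0}}
R = {("A", "go", "B"): 1.0, ("B", "go", "B"): 0.0}
V = bellman_expectation_update({"A": 0.0, "B": 0.0}, policy, P, R, gamma=0.9)
print(V)  # {'A': 1.0, 'B': 0.0}
```

Repeating this sweep until the values stop changing yields V^π for the given policy.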
Bellman Optimality Equation
The Bellman Optimality Equation describes the maximum value an agent can achieve from any state by always choosing the best possible action. It is the foundation for many reinforcement learning algorithms, including dynamic programming, value iteration, and Q-learning.
Optimal State-Value Function V*(s)
The optimal state-value function gives the highest expected cumulative reward the agent can achieve starting from state s:
V*(s) = maxₐ Σₛ' P(s' | s, a) [ R(s, a, s') + γ V*(s') ]
Explanation of symbols:
- V*(s) - optimal value of state s, i.e., the maximum expected return achievable if the agent acts optimally afterward.
- maxₐ - choose the best action a from state s to maximize future reward.
- Σₛ' - sum over all possible next states s' reachable from s by action a.
- P(s' | s, a) - probability of ending up in state s' after taking action a in state s.
- R(s, a, s') - reward received when transitioning from s to s' using action a.
- γ ∈ [0,1) - discount factor, reducing the weight of future rewards.
- V*(s') - optimal value of the next state s', assuming optimal behavior continues.
Intuition: The optimal value of a state is the best total reward achievable if you pick the best action now and then act optimally in all future states.
Example: In a maze, if moving right leads to the exit faster than moving up, then V*(s) is determined by the path that goes right. V*(s) accounts for both immediate rewards (like a coin) and future rewards (the remaining distance to the goal).
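Turning the optimality equation into a loop gives value iteration. The following is a sketch on an invented two-state maze ("near" and an absorbing "goal"); the state names, rewards, and transitions are illustration values, not a general library API:

```python
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-8):
    """Iterate V(s) <- max_a sum_s' P(s'|s,a) [R(s,a,s') + gamma V(s')] to a fixed point."""
    V = {s: 0.0 for s in states}
    while True:
        new_V = {}
        for s in states:
            new_V[s] = max(
                sum(p * (R[(s, a, s2)] + gamma * V[s2])
                    for s2, p in P[(s, a)].items())
                for a in actions[s]
            )
        if max(abs(new_V[s] - V[s]) for s in states) < tol:
            return new_V
        V = new_V

# From "near", "right" reaches the goal (reward 1); "up" stays put (reward 0)
actions = {"near": ["right", "up"], "goal": ["stay"]}
P = {("near", "right"): {"goal": 1.0},
     ("near", "up"): {"near": 1.0},
     ("goal", "stay"): {"goal": 1.0}}
R = {("near", "right", "goal"): 1.0,
     ("near", "up", "near"): 0.0,
     ("goal", "stay", "goal"): 0.0}
V_star = value_iteration(["near", "goal"], actions, P, R, gamma=0.9)
print(V_star)  # {'near': 1.0, 'goal': 0.0}
```

The `max` over actions is exactly the maxₐ in the equation; everything inside it is the same expected-backup as in the expectation equation.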
Optimal Action-Value Function Q*(s, a)
The optimal action-value function gives the expected total reward of taking a specific action a in state s and then acting optimally afterward:
Q*(s, a) = Σₛ' P(s' | s, a) [ R(s, a, s') + γ maxₐ' Q*(s', a') ]
Explanation of symbols:
- Q*(s, a) - optimal value of taking action a in state s.
- s' - possible next states after taking action a.
- a' - possible future actions in state s'.
- P(s' | s, a) - probability of transitioning to s' from s using action a.
- R(s, a, s') - reward received for the transition.
- γ - discount factor for future rewards.
- maxₐ' Q*(s', a') - value of the best possible action in the next state s'.
Intuition: The value of taking action a in state s is the expected reward now plus the best possible rewards in the future, assuming optimal behavior continues.
Example: In a grid world, if moving up now leads closer to the goal, Q*(s, up) will be higher than Q*(s, down). Q*(s, a) lets the agent compare all actions in a state and select the one that maximizes long-term reward.
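The Q-form of the backup can be sketched the same way. Below, one backup is applied to a made-up state with actions "up" and "down"; all names and numbers are illustrative:

```python
def q_optimality_backup(Q, actions, P, R, gamma=0.9):
    """One backup of Q(s,a) <- sum_s' P(s'|s,a) [R(s,a,s') + gamma * max_a' Q(s',a')]."""
    new_Q = {}
    for (s, a), trans in P.items():
        total = 0.0
        for s2, p in trans.items():
            best_next = max(Q[(s2, a2)] for a2 in actions[s2])  # max_a' Q(s', a')
            total += p * (R[(s, a, s2)] + gamma * best_next)
        new_Q[(s, a)] = total
    return new_Q

# "up" moves to the goal (reward 1); "down" stays in place (reward 0)
actions = {"s": ["up", "down"], "goal": ["stay"]}
P = {("s", "up"): {"goal": 1.0}, ("s", "down"): {"s": 1.0},
     ("goal", "stay"): {"goal": 1.0}}
R = {("s", "up", "goal"): 1.0, ("s", "down", "s"): 0.0,
     ("goal", "stay", "goal"): 0.0}
Q = q_optimality_backup({sa: 0.0 for sa in P}, actions, P, R)
print(Q[("s", "up")], Q[("s", "down")])  # 1.0 0.0
```

Iterating this backup converges to Q*, from which both V* and the optimal policy follow.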
Relation Between V*(s) and Q*(s, a)
Once Q*(s, a) is known, the optimal state-value can be computed directly from the action-values:
V*(s) = maxₐ Q*(s, a)
Intuition: The optimal value of a state equals the value of its best action. This relation is crucial because in many RL algorithms (like Q-learning), we often compute Q* first, then derive V* from it.
Defining the Optimal Policy
An optimal policy π∗ is a policy that maximizes the expected return from every state. Formally:
π*(s) = argmaxₐ Q*(s, a)
Explanation of symbols:
- π*(s) - the optimal action to take in state s.
- argmaxₐ - the action a that maximizes the expected total reward.
- Q*(s, a) - optimal action-value function; the expected total reward of taking action a in state s and acting optimally thereafter.
Key points:
- Following π* ensures the agent achieves the highest possible cumulative reward.
- There may be multiple optimal policies, all producing the same optimal state-value function V*(s).
Example: In a maze, π*(s) tells the agent which direction to move in each square to reach the goal as efficiently as possible. If several paths lead to the goal with the same total reward, any of them can correspond to an optimal policy.
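Extracting π* from Q* is a per-state argmax. A minimal sketch, using an invented Q-table for a single state:

```python
def greedy_policy(Q, actions):
    """pi*(s) = argmax_a Q*(s, a), computed state by state."""
    return {s: max(acts, key=lambda a: Q[(s, a)]) for s, acts in actions.items()}

Q = {("s", "up"): 1.0, ("s", "down"): 0.2}
print(greedy_policy(Q, {"s": ["up", "down"]}))  # {'s': 'up'}
```

If several actions tie for the maximum, `max` returns the first one encountered; any tie-breaking rule still yields an optimal policy, matching the note above that multiple optimal policies can exist.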
The Contraction Property
The contraction property ensures that repeated application of the Bellman operator brings value estimates closer to the true value. Formally:
‖ T V − T U ‖∞ ≤ γ ‖ V − U ‖∞
Explanation of symbols:
- T - the Bellman operator, which maps a value function to a new value function using either the Bellman Expectation or Optimality equation.
- V, U - two arbitrary value functions.
- ‖ · ‖∞ - the max-norm, measuring the largest difference across all states.
- γ < 1 - the discount factor; guarantees that the operator is a contraction, shrinking differences between value functions.
Intuition: Each iteration of updating value functions shrinks the difference between current estimates and the true values. Over repeated updates, this ensures convergence to a unique fixed point, which is the optimal value function V*.
Example: Suppose an agent starts with arbitrary value estimates for each state. Applying the Bellman update repeatedly moves these estimates closer to the true V*(s). Because γ < 1, the largest error shrinks by at least a factor of γ on every update, so the values steadily converge.
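The contraction bound can be checked numerically: apply the Bellman backup to two different value functions and compare max-norm gaps before and after. The two-state MDP below (states "A" and "B", one action each) is an invented example with γ = 0.8:

```python
gamma = 0.8
P = {("A", "go"): {"B": 1.0}, ("B", "go"): {"A": 1.0}}
R = {("A", "go", "B"): 1.0, ("B", "go", "A"): 0.0}

def backup(V):
    """Bellman optimality backup (the max is over the single action here)."""
    return {s: max(sum(p * (R[(s, a, s2)] + gamma * V[s2])
                       for s2, p in trans.items())
                   for (s0, a), trans in P.items() if s0 == s)
            for s in V}

V = {"A": 0.0, "B": 0.0}
U = {"A": 10.0, "B": -5.0}
gap = max(abs(V[s] - U[s]) for s in V)          # ||V - U||_inf = 10.0
V2, U2 = backup(V), backup(U)
gap2 = max(abs(V2[s] - U2[s]) for s in V)       # ||TV - TU||_inf
print(gap, gap2)  # 10.0 8.0  (gap2 <= gamma * gap)
```

No matter which two value functions you start from, the gap after one backup is at most γ times the gap before, which is exactly the inequality above.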
Unit Summary
- State-Value and Action-Value Functions: Vπ(s) and Qπ(s, a) measure how good states and actions are under a policy.
- Bellman Expectation Equation: Computes expected cumulative reward for a state based on the policy and next-state values.
- Bellman Optimality Equation: Determines the maximum achievable value by always choosing the best action.
- Optimal Policy (π*): Selects actions that maximize expected total reward from each state.
- Contraction Property: Repeated updates of value functions converge to the optimal values due to γ < 1.