Unit 5: Model-Free Reinforcement Learning

Introduction

In earlier units, we assumed that the environment is a fully known Markov Decision Process (MDP). The agent has access to the transition probabilities and the reward function, so it can plan and compute optimal behavior without ever interacting with the environment.

In many real-world problems, however, this information is not available. The agent does not know how the environment works and must instead learn directly from interaction. It observes states, takes actions, receives rewards, and gradually improves over time.

This setting is known as model-free reinforcement learning. Instead of relying on a model, the agent learns purely from experience.

Model-Based vs Model-Free Learning

Reinforcement learning methods can be divided into two broad categories.

Model-based learning uses a model of the environment dynamics, either given in advance or learned from data. The agent can use this model to plan ahead and compute optimal strategies. Dynamic programming is the classic example: it assumes the model is known exactly.

Model-free learning, on the other hand, does not assume any knowledge of the environment. The agent:

does not know transition probabilities
does not know the reward function
learns only from interaction

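Because they need no model, model-free methods apply to a wider range of problems and are widely used in practice, especially when the environment is complex or unknown. The trade-off is that they typically need more interaction with the environment than a method that can plan with an accurate model.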
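To make the idea of learning only from interaction concrete, here is a minimal sketch in Python. The environment CoinFlipEnv, its actions "safe" and "risky", its probabilities and payouts, and the helper collect_experience are all hypothetical names and numbers invented for illustration. The point is only that the agent sees (state, action, reward, next state) tuples and never reads the probabilities stored inside the environment.

```python
import random


class CoinFlipEnv:
    """Hypothetical one-step environment whose dynamics are hidden from the agent."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self.state = "start"

    def reset(self):
        self.state = "start"
        return self.state

    def step(self, action):
        # Hidden dynamics: the agent never reads these numbers directly.
        win_prob, payout = (0.8, 1.0) if action == "safe" else (0.3, 5.0)
        reward = payout if self._rng.random() < win_prob else 0.0
        self.state = "end"
        return self.state, reward, True  # next_state, reward, done


def collect_experience(env, policy, num_episodes):
    """Run the policy and record only what the agent can observe."""
    experience = []
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            experience.append((state, action, reward, next_state))
            state = next_state
    return experience


# The agent's entire knowledge of the environment is this list of tuples.
experience = collect_experience(CoinFlipEnv(seed=1), lambda s: "safe", 10)
print(experience[:3])
```

Everything a model-free method learns has to be computed from a list like this one.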

Monte Carlo Methods

One of the simplest model-free approaches is Monte Carlo learning. These methods learn by observing complete episodes and using the actual outcomes to update value estimates.

An episode is a sequence of states, actions, and rewards that eventually ends. After the episode finishes, the agent computes the return, which is the discounted sum of the rewards received:

G = R₁ + γR₂ + γ²R₃ + ...

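Here γ (with 0 ≤ γ ≤ 1) is the discount factor, which controls how much later rewards count relative to earlier ones. With γ = 1 the return is simply the total reward. This return represents how good the starting state (or state-action pair) turned out to be. The same calculation applies to any state visited during the episode, using only the rewards that followed it.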
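As a small worked example, the return can be computed in Python by walking the reward list backwards. The function name discounted_return and the reward values are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Compute G = R1 + gamma*R2 + gamma^2*R3 + ... for one finished episode."""
    g = 0.0
    # Walking backwards needs one multiply and one add per step:
    # the return from step t equals that step's reward plus gamma times
    # the return from step t+1.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 0.0, 2.0]  # hypothetical rewards R1, R2, R3
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
```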

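Monte Carlo methods estimate value functions by averaging the returns observed after visits to each state, over many episodes. By the law of large numbers, these averages converge to the expected return under the policy being followed as the number of visits grows.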
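The averaging step might look like the following sketch. It uses the first-visit variant, which counts only the first occurrence of each state in an episode. That choice, the function name first_visit_mc, the episode format (a list of (state, reward) pairs, with the reward received on leaving the state), and the sample data are all assumptions made for this example.

```python
from collections import defaultdict


def first_visit_mc(episodes, gamma):
    """Estimate V(s) as the average return following the first visit to s."""
    returns_sum = defaultdict(float)
    visit_count = defaultdict(int)
    for episode in episodes:
        # Return following each time step, computed backwards.
        g = 0.0
        returns = [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            _, reward = episode[t]
            g = reward + gamma * g
            returns[t] = g
        # Only the first occurrence of each state contributes.
        seen = set()
        for t, (state, _) in enumerate(episode):
            if state in seen:
                continue
            seen.add(state)
            returns_sum[state] += returns[t]
            visit_count[state] += 1
    return {s: returns_sum[s] / visit_count[s] for s in returns_sum}


# Three hypothetical episodes over states "A" and "B".
episodes = [
    [("A", 0.0), ("B", 1.0)],
    [("A", 0.0), ("B", 0.0)],
    [("B", 1.0)],
]
values = first_visit_mc(episodes, gamma=1.0)
print(values)  # V(A) = (1 + 0) / 2, V(B) = (1 + 0 + 1) / 3
```

Note that nothing is updated until an episode's rewards are all known, which is why the function takes finished episodes as input.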

Key idea:

Learning happens only after the final outcome is known.

Properties:

Uses full episodes
Does not rely on intermediate estimates (no bootstrapping)
Conceptually simple

Limitations:

Updates are delayed until the episode ends
High variance in returns
Slow for long episodes, and not directly applicable to continuing tasks, which never terminate

Temporal Difference Learning

Temporal Difference (TD) learning improves on Monte Carlo methods by updating estimates during the episode, rather than waiting until the end.

Instead of using the full return, TD learning updates values using:

the immediate reward
the estimated value of the next state

This introduces bootstrapping, where estimates are updated using other estimates.

The simplest TD method, TD(0), uses the update rule:

V(s) ← V(s) + α [ R + γV(s') − V(s) ]

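Here s is the current state, R is the reward received, s' is the next state, α is the step size (learning rate), and γ is the discount factor. The quantity R + γV(s') is called the TD target, and the bracketed difference is the TD error. The value of the current state is moved a fraction α of the way toward the target, which includes one step of new information.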
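The update rule translates almost directly into code. The sketch below is one possible tabular version. The function name td0_update, the terminal flag (which treats the value of a terminal state as 0), and the example numbers are assumptions made for illustration.

```python
from collections import defaultdict


def td0_update(V, state, reward, next_state, alpha, gamma, terminal=False):
    """Apply one TD(0) update to V in place and return the TD error."""
    # By convention, the value of a terminal state is 0.
    next_value = 0.0 if terminal else V[next_state]
    td_error = reward + gamma * next_value - V[state]
    V[state] += alpha * td_error
    return td_error


V = defaultdict(float)  # all estimates start at 0
V["B"] = 1.0            # suppose B is already believed to be worth 1

# One observed transition: from A, reward 0, arriving in B.
delta = td0_update(V, "A", reward=0.0, next_state="B", alpha=0.5, gamma=1.0)
print(delta, V["A"])  # TD error 1.0; V(A) moves halfway from 0 toward 1, giving 0.5
```

A single transition is enough to change V(A). No return has been observed yet; the update relies entirely on the current estimate of V(B).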

Key idea:

Learn immediately from each step, not just from final outcomes.

Properties:

Learns online (step-by-step)
Lower-variance updates than Monte Carlo, and often faster to learn in practice
Works for both episodic and continuing tasks

Limitations:

Introduces bias due to bootstrapping
Sensitive to the step size α: too large and the estimates oscillate, too small and learning is slow

Monte Carlo vs Temporal Difference

Monte Carlo and TD learning represent two different ways of learning from experience.

Monte Carlo methods use actual returns from complete episodes. This makes them unbiased but often slow and high in variance.

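Temporal Difference methods update incrementally using estimates. This lowers variance and lets learning start before an episode ends, but introduces bias, because each update target depends on the current and possibly inaccurate estimate of the next state's value.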

In practice, TD methods are more commonly used because they can learn online, including in tasks that never terminate, and they scale better to problems with long episodes.
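The two approaches can be run side by side on a small problem. The sketch below assumes a five-state random walk, a common textbook test case that these notes have not introduced: states 1 to 5 are non-terminal, states 0 and 6 are terminal, each step moves left or right with equal probability, and the only reward is 1 for exiting on the right. With γ = 1 the true value of state s is s/6. The Monte Carlo half uses a constant step size rather than a plain average so that both methods share the same α. All names and parameter values are illustrative.

```python
import random

STATES = [1, 2, 3, 4, 5]  # non-terminal states; 0 and 6 are terminal
TRUE_VALUES = {s: s / 6 for s in STATES}


def run_episode(rng):
    """Random walk from the centre; reward 1 only when exiting on the right."""
    state, transitions = 3, []
    while state not in (0, 6):
        next_state = state + rng.choice([-1, 1])
        reward = 1.0 if next_state == 6 else 0.0
        transitions.append((state, reward, next_state))
        state = next_state
    return transitions


def evaluate(num_episodes=3000, alpha=0.02, gamma=1.0, seed=0):
    rng = random.Random(seed)
    v_mc = {s: 0.5 for s in STATES}
    v_td = {s: 0.5 for s in STATES}
    for _ in range(num_episodes):
        episode = run_episode(rng)
        # TD(0): one update per transition, using the next state's estimate.
        # The transitions are replayed in order here, but each update needs
        # only a single transition, so it could be applied mid-episode.
        for state, reward, next_state in episode:
            next_value = v_td.get(next_state, 0.0)  # terminal value is 0
            v_td[state] += alpha * (reward + gamma * next_value - v_td[state])
        # Monte Carlo (every-visit, constant step size): needs the finished
        # episode, because the target is the actual return.
        g = 0.0
        for state, reward, _ in reversed(episode):
            g = reward + gamma * g
            v_mc[state] += alpha * (g - v_mc[state])
    return v_mc, v_td


v_mc, v_td = evaluate()
for s in STATES:
    print(s, round(TRUE_VALUES[s], 3), round(v_mc[s], 3), round(v_td[s], 3))
```

Both sets of estimates should end up close to s/6. The exact numbers depend on the seed, the step size, and the number of episodes, so a single run like this is not a general comparison of the two methods.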

Summary

Model-free reinforcement learning allows agents to learn without prior knowledge of the environment.

Monte Carlo methods learn from complete episodes using actual returns
Temporal Difference learning updates estimates step by step using bootstrapping

Both approaches rely on experience and form the foundation for more advanced algorithms such as Q-learning and Deep Reinforcement Learning.