Unit 5: Model-Free Reinforcement Learning
Introduction
In earlier units, we assumed that the environment was a fully known Markov Decision Process (MDP). The agent had access to the transition probabilities and the reward function, so it could plan ahead and compute optimal behavior without ever acting in the environment.
In many real-world problems, however, this information is not available. The agent does not know how the environment works and must instead learn directly from interaction. It observes states, takes actions, receives rewards, and gradually improves over time.
This setting is known as model-free reinforcement learning. Instead of relying on a model, the agent learns purely from experience.
Model-Based vs Model-Free Learning
Reinforcement learning methods can be divided into two categories.
Model-based learning assumes the agent has a model of the environment dynamics, meaning the transition probabilities and rewards. It can use this model to plan ahead and compute optimal strategies. Dynamic programming is the standard example, where the model is given in advance.
Model-free learning, on the other hand, does not assume any knowledge of the environment. The agent:
- does not know transition probabilities
- does not know the reward function
- learns only from interaction
Because they need no model, model-free methods can be applied even when the environment is complex or its dynamics are unknown, which is why they are widely used in practice.
Monte Carlo Methods
One of the simplest model-free approaches is Monte Carlo learning. These methods learn by observing complete episodes and using the actual outcomes to update value estimates.
An episode is a sequence of states, actions, and rewards that eventually ends. After the episode finishes, the agent computes the return, which is the discounted sum of the rewards received, where γ (between 0 and 1) is the discount factor:
G = R₁ + γR₂ + γ²R₃ + ...
This return represents how good the starting state (or state-action pair) turned out to be.
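As a small illustration of the return formula, the sketch below computes G for one finished episode. The function name discounted_return and the example rewards are made up for this illustration; the rewards list is assumed to hold R₁, R₂, R₃, ... in order.

```python
def discounted_return(rewards, gamma):
    """Return G = R1 + gamma*R2 + gamma^2*R3 + ... for one episode."""
    g = 0.0
    # Work backwards so each step applies G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 0.0, 2.0]  # illustrative rewards R1, R2, R3
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
```

Working backwards avoids computing powers of γ explicitly and gives the same result as the forward sum.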
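Monte Carlo methods estimate value functions by averaging the returns observed after visiting each state (or state-action pair) over many episodes. As the number of episodes grows, these averages converge to the expected return under the policy being followed.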
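The averaging idea can be sketched as follows. This is one common variant, first-visit Monte Carlo, in which only the return following the first visit to a state in each episode is counted. The episode format (a list of (state, reward) pairs, where the reward is the one received after leaving that state) and the function name are assumptions made for this example.

```python
from collections import defaultdict


def first_visit_mc(episodes, gamma):
    """Estimate V(s) by averaging first-visit returns over complete episodes."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        g = 0.0
        first_visit_return = {}
        # Walk backwards through the episode; overwriting means the value
        # left in the dict is the return from the earliest visit.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            first_visit_return[state] = g
        for state, g_s in first_visit_return.items():
            returns_sum[state] += g_s
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}


# Two illustrative episodes: A -> B -> end, with different final rewards.
episodes = [
    [("A", 0.0), ("B", 1.0)],
    [("A", 0.0), ("B", 0.0)],
]
values = first_visit_mc(episodes, gamma=1.0)
print(values["A"], values["B"])  # 0.5 0.5
```

Note that nothing is updated until an episode is complete, which matches the delayed-update behavior of Monte Carlo methods.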
Key idea:
Learning happens only after the final outcome is known.
Properties:
- Uses full episodes
- Does not rely on intermediate estimates (no bootstrapping)
- Conceptually simple
Limitations:
- Updates are delayed until the episode ends
- High variance in returns
- Inefficient for long episodes, and not directly applicable to continuing tasks that never terminate
Temporal Difference Learning
Temporal Difference (TD) learning improves on Monte Carlo methods by updating estimates during the episode, rather than waiting until the end.
Instead of using the full return, TD learning updates values using:
- the immediate reward
- the estimated value of the next state
This introduces bootstrapping, where estimates are updated using other estimates.
The simplest TD method, TD(0), uses the update rule:
V(s) ← V(s) + α [ R + γV(s') − V(s) ]
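Here, α is the step size (learning rate), R is the reward just received, and s' is the next state. The value of the current state is moved a fraction α of the way toward the target R + γV(s'), which combines the newly observed reward with the current estimate of the next state's value.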
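A single application of this update rule might look like the sketch below. The value table is assumed to be a plain dictionary, and the function name td0_update and the terminal flag are hypothetical choices for this example; the value of a terminal state is taken to be 0.

```python
def td0_update(V, s, r, s_next, alpha, gamma, terminal=False):
    """Apply one TD(0) update to the value table V in place."""
    # Bootstrapped target: reward plus discounted estimate of the next state.
    next_value = 0.0 if terminal else V.get(s_next, 0.0)
    target = r + gamma * next_value
    td_error = target - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * td_error
    return td_error


V = {"A": 0.0, "B": 1.0}  # illustrative current estimates
td0_update(V, "A", r=0.0, s_next="B", alpha=0.5, gamma=0.9)
print(V["A"])  # 0 + 0.5 * (0 + 0.9*1 - 0) = 0.45
```

The quantity in brackets, target minus current estimate, is usually called the TD error.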
Key idea:
Learn immediately from each step, not just from final outcomes.
Properties:
- Learns online (step-by-step)
- Often more data-efficient than Monte Carlo in practice
- Works for both episodic and continuing tasks
Limitations:
- Introduces bias, because each target depends on current estimates that may be inaccurate
- Can be unstable without careful tuning of the step size α
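To show the step-by-step learning in a full loop, here is a minimal sketch of TD(0) prediction on a toy environment. The environment is hypothetical and chosen only for illustration: a random walk over states 1 to 5 that starts in state 3, moves left or right with equal probability, and pays reward 1 only when it steps off the right end. The step size, episode count, and seed are arbitrary choices.

```python
import random


def run_td0_random_walk(num_episodes=5000, alpha=0.02, gamma=1.0, seed=0):
    """TD(0) prediction on a 5-state random walk; states 0 and 6 are terminal."""
    rng = random.Random(seed)
    V = [0.0] + [0.5] * 5 + [0.0]  # terminal values stay at 0
    for _ in range(num_episodes):
        s = 3
        while s not in (0, 6):
            s_next = s + rng.choice((-1, 1))
            r = 1.0 if s_next == 6 else 0.0
            # Update immediately after each step, before the episode ends.
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next
    return V[1:6]


estimates = run_td0_random_walk()
# For this walk the exact values are 1/6, 2/6, ..., 5/6.
true_values = [i / 6 for i in range(1, 6)]
for est, true in zip(estimates, true_values):
    print(f"estimate {est:.2f}  true {true:.2f}")
```

With a constant step size the estimates keep fluctuating around the true values rather than settling exactly, which is one reason the choice of α matters.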
Monte Carlo vs Temporal Difference
Monte Carlo and TD learning represent two different ways of learning from experience.
Monte Carlo methods use actual returns from complete episodes. Their targets are unbiased, but they have high variance, and learning must wait until each episode ends.
Temporal Difference methods update incrementally using estimates. Their targets have lower variance and are available after every step, but they are biased while the estimates are still inaccurate.
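The difference between the two kinds of target can be seen on a single step of one episode. In the sketch below, the episode A → B → end, its rewards, and the current estimate V(B) = 0.6 are invented for illustration, and both helper functions are hypothetical names.

```python
def mc_target(rewards, t, gamma):
    """Actual discounted return from step t to the end of the episode."""
    g = 0.0
    for r in reversed(rewards[t:]):
        g = r + gamma * g
    return g


def td_target(rewards, next_values, t, gamma):
    """One-step target: reward at step t plus the discounted next-state estimate."""
    return rewards[t] + gamma * next_values[t]


rewards = [0.0, 1.0]      # reward after leaving A, then after leaving B
next_values = [0.6, 0.0]  # current estimate V(B), then 0 for the terminal state

print(mc_target(rewards, 0, gamma=1.0))               # 1.0
print(td_target(rewards, next_values, 0, gamma=1.0))  # 0.6
```

The Monte Carlo target for A is the return that actually occurred, so it changes from episode to episode. The TD target depends only on one reward and on V(B), so it varies less but inherits any error in V(B).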
In practice, TD methods are more commonly used because they can learn online, without waiting for episodes to finish, and tend to scale better to large problems.
Summary
Model-free reinforcement learning allows agents to learn without prior knowledge of the environment.
- Monte Carlo methods learn from complete episodes using actual returns
- Temporal Difference learning updates estimates step by step using bootstrapping
Both approaches rely on experience and form the foundation for more advanced algorithms such as Q-learning and deep reinforcement learning.