Unit 6: Q-Learning and Advanced Methods
Introduction
In the previous unit, we explored how agents can learn value functions from experience. In this unit, we move from evaluation to control, where the goal is to learn how to act optimally.
Q-learning is one of the most widely used reinforcement learning algorithms. It allows an agent to learn the best action to take in each state without being given a model of the environment's transitions or rewards; it learns purely from the experience it gathers.
Q-Learning
Q-learning is a model-free algorithm that learns the action-value function Q(s, a). This function estimates the expected discounted return from taking action a in state s, assuming optimal behavior afterward.
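The update rule is:
Q(s, a) ← Q(s, a) + α [ R + γ max_{a'} Q(s', a') − Q(s, a) ]
Here α is the learning rate, γ is the discount factor, R is the reward received after taking a in s, and s' is the resulting next state. The maximum is taken over the actions a' available in s'.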
This update moves the current estimate a small step toward a target built from:
- the reward just received
- the discounted estimate of the best value available from the next state
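By repeatedly applying this update, the agent learns which actions lead to the highest long-term reward. In the tabular case the estimates converge to the optimal action values, provided every state-action pair keeps being visited and the learning rate is reduced appropriately over time.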
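The update can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the dictionary-based table, the function name `q_learning_update`, the `done` flag for terminal transitions, the state and action names, and the values α = 0.1 and γ = 0.9 are all assumptions made for the example.

```python
from collections import defaultdict


def q_learning_update(Q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.9, done=False):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # Best estimated value from the next state (taken as 0 if the episode ended).
    best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
    # TD error: the target (reward + discounted best next value) minus the current estimate.
    td_error = reward + gamma * best_next - Q[(state, action)]
    # Move the estimate a fraction alpha of the way toward the target.
    Q[(state, action)] += alpha * td_error
    return td_error


# Hypothetical two-action problem; unseen (state, action) pairs start at 0.
Q = defaultdict(float)
actions = ["left", "right"]
Q[("s1", "right")] = 2.0

# One observed transition: in s0 the agent took "right", got reward 1, and landed in s1.
q_learning_update(Q, "s0", "right", reward=1.0, next_state="s1", actions=actions)
print(Q[("s0", "right")])  # roughly 0.28
```

In this transition the target is 1 + 0.9 × 2 = 2.8, so the estimate for ("s0", "right") moves from 0 to about 0.1 × 2.8 = 0.28.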
Once the Q-values have been learned, the optimal policy follows directly:
- in each state s, choose the action a with the highest Q(s, a)
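Reading a policy off a learned table might look like the sketch below. The Q-values, state names, and the helper `greedy_action` are invented for illustration; ties are broken in favor of the first action listed, which is one choice among several.

```python
def greedy_action(Q, state, actions):
    """Return the action with the highest Q-value in `state` (first action wins ties)."""
    # Missing entries are treated as 0.0, an assumption of this sketch.
    return max(actions, key=lambda a: Q.get((state, a), 0.0))


# Hypothetical learned Q-table for two states and two actions.
Q = {
    ("s0", "left"): 0.2, ("s0", "right"): 0.7,
    ("s1", "left"): 1.5, ("s1", "right"): 0.4,
}
actions = ["left", "right"]

# The greedy policy maps each state to its highest-valued action.
policy = {s: greedy_action(Q, s, actions) for s in ["s0", "s1"]}
print(policy)  # {'s0': 'right', 's1': 'left'}
```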
On-Policy vs Off-Policy Learning
Reinforcement learning methods differ in whether the policy they learn about is the same as the policy that generates their experience.
On-policy learning means the agent learns from the actions it actually takes. The same policy is used for both acting and learning.
Example: SARSA, which updates using the action the agent actually takes in the next state.
Off-policy learning means the agent learns about an optimal policy while possibly following a different one.
Q-learning is off-policy because:
- it may explore using random actions
- but it updates using the best possible action (max Q)
This allows it to learn optimal behavior even when the actions it takes during learning are not optimal.
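The difference shows up in the target each method uses for its update. The sketch below compares the two on a single transition; the function names, the Q-values, and γ = 0.9 are assumptions chosen for the example.

```python
def sarsa_target(Q, reward, next_state, next_action, gamma=0.9):
    """On-policy target: uses the action the agent actually takes next."""
    return reward + gamma * Q[(next_state, next_action)]


def q_learning_target(Q, reward, next_state, actions, gamma=0.9):
    """Off-policy target: uses the best action in the next state, whatever is taken."""
    return reward + gamma * max(Q[(next_state, a)] for a in actions)


# Hypothetical Q-values for the next state s1.
Q = {("s1", "left"): 1.0, ("s1", "right"): 3.0}
actions = ["left", "right"]

# Suppose the agent explores and takes the worse action "left" in s1.
on_policy = sarsa_target(Q, reward=0.0, next_state="s1", next_action="left")
off_policy = q_learning_target(Q, reward=0.0, next_state="s1", actions=actions)
print(on_policy, off_policy)  # roughly 0.9 and 2.7
```

The SARSA target reflects the exploratory action that was taken, while the Q-learning target ignores it and uses the highest-valued action in s1. When the next action happens to be the greedy one, the two targets coincide.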
Exploration Strategies (ε-greedy)
To learn effectively, the agent must balance:
- exploitation → choosing the best known action
- exploration → trying new actions
A common strategy is ε-greedy:
- With probability ε → choose a random action
- With probability 1 − ε → choose the best known action
This ensures the agent keeps exploring the environment while still mostly acting on what it has learned so far.
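A minimal sketch of ε-greedy action selection follows. The function name `epsilon_greedy`, the optional `rng` parameter (used here so the run is repeatable), and the sample Q-values are assumptions for illustration.

```python
import random


def epsilon_greedy(Q, state, actions, epsilon, rng=random):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        # Explore: any action, which may happen to be the greedy one.
        return rng.choice(actions)
    # Exploit: the action with the highest current Q-value (missing entries count as 0).
    return max(actions, key=lambda a: Q.get((state, a), 0.0))


# Hypothetical Q-values in which "right" currently looks best.
Q = {("s0", "left"): 0.1, ("s0", "right"): 0.9}
actions = ["left", "right"]

rng = random.Random(0)  # seeded generator so the example is reproducible
counts = {a: 0 for a in actions}
for _ in range(1000):
    counts[epsilon_greedy(Q, "s0", actions, epsilon=0.1, rng=rng)] += 1
print(counts)  # "right" should dominate, with "left" chosen occasionally
```

Because the random branch samples from all actions, the greedy action is selected slightly more often than 1 − ε of the time.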
In practice, ε is often reduced over time:
- early → more exploration
- later → more exploitation
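One simple way to reduce ε is exponential decay with a floor, sketched below. The schedule itself and the values `eps_start=1.0`, `eps_min=0.05`, and `decay=0.99` are assumptions picked for illustration; linear schedules and other choices are equally common.

```python
def decayed_epsilon(episode, eps_start=1.0, eps_min=0.05, decay=0.99):
    """Exponentially decay epsilon per episode, never dropping below eps_min."""
    return max(eps_min, eps_start * decay ** episode)


# Early episodes explore heavily; later episodes settle at the floor.
for episode in (0, 50, 100, 500):
    print(episode, decayed_epsilon(episode))
```

Keeping a small floor rather than decaying to zero means the agent never stops exploring entirely, which can help if early estimates were misleading.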
Summary
In this unit, we introduced Q-learning as a method for learning optimal behavior.
- Q-learning learns the action-value function Q(s, a)
- It is an off-policy algorithm
- It is commonly paired with ε-greedy action selection to balance exploration and exploitation
These ideas form the basis for many advanced reinforcement learning methods.