Unit 6: Q-Learning and Advanced Methods

Introduction

In the previous unit, we explored how agents can learn value functions from experience. In this unit, we move from evaluation to control, where the goal is to learn how to act optimally.

Q-learning is one of the most widely used reinforcement learning algorithms. It allows an agent to learn the best action to take in each state without a model of the environment, that is, without knowing the transition dynamics or the reward function in advance.

Q-Learning

Q-learning is a model-free algorithm that learns the action-value function Q(s, a). This function estimates the expected cumulative discounted reward for taking action a in state s and behaving optimally afterward.

The update rule is:

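Q(s, a) ← Q(s, a) + α [ R + γ max_{a'} Q(s', a') − Q(s, a) ]

Here α is the learning rate, γ is the discount factor, R is the reward received after taking action a in state s, s' is the resulting next state, and the maximum is taken over the actions a' available in s'.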

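This update moves the current estimate toward a target built from:

the reward received, R
the discounted estimate of the best value achievable from the next state s'

The bracketed term is the difference between that target and the current estimate (the temporal-difference error), and α controls how far each update moves the estimate.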

By repeatedly applying this update, the agent learns which actions lead to the highest long-term reward.
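
The update rule above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name q_learning_update, the dictionary-based Q-table, the state and action labels, and the values of alpha and gamma are all assumptions chosen for the example.

```python
from collections import defaultdict


def q_learning_update(Q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.9, done=False):
    """Apply one tabular Q-learning update in place; return the new Q(state, action)."""
    # Best estimated value from the next state (zero if the episode has ended).
    best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
    # Target = reward + discounted best future value.
    td_target = reward + gamma * best_next
    # Move the current estimate a fraction alpha of the way toward the target.
    td_error = td_target - Q[(state, action)]
    Q[(state, action)] += alpha * td_error
    return Q[(state, action)]


# Hypothetical two-action example; unseen (state, action) pairs default to 0.
Q = defaultdict(float)
actions = ["left", "right"]
Q[("s1", "right")] = 2.0

# Arithmetic: 0 + 0.5 * (1.0 + 0.9 * 2.0 - 0) = 1.4, up to floating-point rounding.
new_value = q_learning_update(Q, "s0", "right", reward=1.0, next_state="s1",
                              actions=actions, alpha=0.5, gamma=0.9)
print(new_value)
```

The done flag is an extra assumption for episodic tasks: when the next state is terminal there is no future value, so the target reduces to the reward alone.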

Once Q has been learned, the corresponding policy is greedy with respect to it:

in each state s, choose the action with the highest Q-value, π(s) = argmax_a Q(s, a)
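
As a small sketch of reading a policy out of a learned table, the snippet below picks the highest-valued action in each state. The Q-values, state names, and the helper greedy_policy are made up for illustration.

```python
def greedy_policy(Q, states, actions):
    """Return a dict mapping each state to its highest-valued action."""
    # Missing entries are treated as 0.0; ties go to the first action listed.
    return {s: max(actions, key=lambda a: Q.get((s, a), 0.0)) for s in states}


# Hypothetical learned values.
Q_example = {
    ("s0", "left"): 0.2, ("s0", "right"): 1.4,
    ("s1", "left"): 0.9, ("s1", "right"): 0.3,
}
policy = greedy_policy(Q_example, ["s0", "s1"], ["left", "right"])
print(policy)  # {'s0': 'right', 's1': 'left'}
```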

On-Policy vs Off-Policy Learning

Reinforcement learning methods differ in whether the policy being learned is the same as the policy used to select actions.

On-policy learning means the agent learns from the actions it actually takes. The same policy is used for both acting and learning.

Example: SARSA

Off-policy learning means the agent learns about an optimal policy while possibly following a different one.

Q-learning is off-policy because:

it may explore using random actions
but it updates using the best possible action (max Q)

This allows it to learn optimal behavior even when the actions it takes during learning are not optimal.
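
The difference shows up in the target each method updates toward. The sketch below compares the two targets for the same transition; the function names and the Q-values are illustrative assumptions, not part of either algorithm's definition.

```python
def sarsa_target(Q, reward, next_state, next_action, gamma=0.9):
    """On-policy target: uses the action the agent actually takes next."""
    return reward + gamma * Q[(next_state, next_action)]


def q_learning_target(Q, reward, next_state, actions, gamma=0.9):
    """Off-policy target: uses the best action in the next state, taken or not."""
    return reward + gamma * max(Q[(next_state, a)] for a in actions)


# Hypothetical values for the next state s1.
Q_demo = {("s1", "left"): 0.0, ("s1", "right"): 2.0}

# Suppose exploration makes the agent take "left" in s1.
on_policy = sarsa_target(Q_demo, 1.0, "s1", "left")                   # 1.0 + 0.9 * 0.0
off_policy = q_learning_target(Q_demo, 1.0, "s1", ["left", "right"])  # 1.0 + 0.9 * 2.0
print(on_policy, off_policy)
```

With these numbers the SARSA target reflects the exploratory move, while the Q-learning target ignores it and uses the value of "right".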

Exploration Strategies (ε-greedy)

To learn effectively, the agent must balance:

exploitation → choosing the best known action
exploration → trying new actions

A common strategy is ε-greedy:

With probability ε → choose an action uniformly at random
With probability 1 − ε → choose the action with the highest current Q-value

This ensures the agent keeps trying every action occasionally while mostly acting on what it has learned so far.
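
A minimal sketch of ε-greedy action selection, assuming a dictionary Q-table keyed by (state, action). The function name epsilon_greedy and the optional rng parameter (added so results can be made reproducible) are choices made for this example.

```python
import random


def epsilon_greedy(Q, state, actions, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    # rng.random() returns a float in [0, 1), so epsilon=0 never explores
    # and epsilon=1 always explores.
    if rng.random() < epsilon:
        return rng.choice(actions)
    # Exploit: highest current Q-value (missing entries count as 0.0).
    return max(actions, key=lambda a: Q.get((state, a), 0.0))


# Hypothetical Q-values for a single state.
Q_table = {("s0", "left"): 0.1, ("s0", "right"): 0.7}
greedy_choice = epsilon_greedy(Q_table, "s0", ["left", "right"], epsilon=0.0)
print(greedy_choice)  # right
```

Note that the random branch can also pick the greedy action, so the greedy action is chosen somewhat more often than 1 − ε of the time.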

In practice, ε is often reduced over time:

early → more exploration
later → more exploitation
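
One possible way to reduce ε over time is exponential decay with a floor, sketched below. The schedule and the values of start, end, and decay are illustrative assumptions; linear and stepwise schedules are also common.

```python
def decayed_epsilon(episode, start=1.0, end=0.05, decay=0.99):
    """Exponentially decay epsilon per episode, never dropping below `end`."""
    return max(end, start * decay ** episode)


print(decayed_epsilon(0))     # 1.0  (fully exploratory at the start)
print(decayed_epsilon(1000))  # 0.05 (clamped at the floor)
```

Keeping a small floor rather than decaying to zero means the agent never stops exploring entirely.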

Summary

In this unit, we introduced Q-learning as a method for learning optimal behavior.

Q-learning learns the action-value function Q(s, a)
It is an off-policy algorithm
It is typically paired with an exploration strategy such as ε-greedy to balance exploration and exploitation

These ideas form the basis for many advanced reinforcement learning methods.