Unit 7: Recap & Key Takeaways

Introduction

Reinforcement Learning (RL) combines concepts from mathematics, algorithms, and decision-making. Throughout this course, you have moved from intuitive ideas to formal definitions and practical methods. This unit brings everything together, reinforcing the key concepts and showing how they connect.

Core Concepts

Agent & Environment

An agent interacts with an environment by taking actions and receiving rewards.

A state represents the current situation of the environment.

The interaction loop: observe -> act -> receive reward -> update knowledge
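
The loop above can be sketched in a few lines of Python. This is only an illustration: the Corridor environment, its reward of 1.0 at the goal, and the run_episode helper are hypothetical names invented for this recap, not part of any library. The agent here acts at random and never learns; the "update knowledge" step is where a learning rule would go.

```python
import random


class Corridor:
    """Hypothetical toy environment: positions 0..4, start at 0, goal at 4."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action: 0 = left, 1 = right; moves are clipped at both ends
        move = 1 if action == 1 else -1
        self.state = max(0, min(self.length - 1, self.state + move))
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def run_episode(env, choose_action, max_steps=1000):
    """Run observe -> act -> receive reward; return (total reward, steps)."""
    state = env.reset()                          # observe
    total_reward, steps = 0.0, 0
    for _ in range(max_steps):
        action = choose_action(state)            # act
        state, reward, done = env.step(action)   # receive reward, observe again
        total_reward += reward
        steps += 1
        # a learning agent would update its knowledge here
        if done:
            break
    return total_reward, steps


rng = random.Random(0)
total, steps = run_episode(Corridor(), lambda s: rng.choice([0, 1]))
```

Passing `lambda s: 1` instead of the random choice gives an agent that always moves right and reaches the goal in four steps.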

Markov Decision Processes (MDPs)

MDPs provide the formal framework for reinforcement learning problems. An MDP is specified by five components:

States (S)
Actions (A)
Transition probabilities (P)
Reward function (R)
Discount factor (gamma)

Markov Property: given the current state and action, the next state and reward do not depend on the earlier history.
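
To make the five components concrete, here is one possible way to write a small MDP as plain Python data. The two-state "cool"/"warm" example, its numbers, and the helper names are all made up for illustration. Note how the Markov property shows up in the data structure: the outcomes are indexed by the current state and action only.

```python
# Hypothetical two-state MDP written out as plain data.
states = ["cool", "warm"]          # S
actions = ["slow", "fast"]         # A
gamma = 0.9                        # discount factor

# P and R together: P[s][a] is a list of (probability, next_state, reward).
P = {
    "cool": {"slow": [(1.0, "cool", 1.0)],
             "fast": [(0.5, "cool", 2.0), (0.5, "warm", 2.0)]},
    "warm": {"slow": [(0.5, "cool", 1.0), (0.5, "warm", 1.0)],
             "fast": [(1.0, "warm", -10.0)]},
}


def is_valid_mdp(states, actions, P, tol=1e-9):
    """Check that each (state, action) pair has a proper distribution."""
    for s in states:
        for a in actions:
            outcomes = P[s][a]
            if any(s2 not in states for _, s2, _ in outcomes):
                return False
            if abs(sum(p for p, _, _ in outcomes) - 1.0) > tol:
                return False
    return True


def expected_reward(P, s, a):
    """Expected immediate reward for taking action a in state s."""
    return sum(p * r for p, _, r in P[s][a])
```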

Policies

A policy pi(s) defines how the agent selects actions.

Deterministic: pi(s) always returns the same action for a given state
Stochastic: pi(a | s) assigns a probability to each action in a given state
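
The difference between the two kinds of policy is easy to see in code. The sketch below assumes a hypothetical problem with states "cool" and "warm" and actions "slow" and "fast"; the probabilities are arbitrary.

```python
import random

# Deterministic policy: one fixed action per state.
deterministic_policy = {"cool": "fast", "warm": "slow"}

# Stochastic policy: a probability for each action in each state.
stochastic_policy = {
    "cool": {"slow": 0.2, "fast": 0.8},
    "warm": {"slow": 0.9, "fast": 0.1},
}


def act_deterministic(policy, state):
    """Look up the single action the policy prescribes."""
    return policy[state]


def act_stochastic(policy, state, rng):
    """Sample an action according to the policy's probabilities."""
    choices = list(policy[state])
    weights = [policy[state][a] for a in choices]
    return rng.choices(choices, weights=weights, k=1)[0]


rng = random.Random(0)
samples = [act_stochastic(stochastic_policy, "cool", rng) for _ in range(1000)]
fast_fraction = samples.count("fast") / len(samples)  # should be close to 0.8
```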

Value Functions

State Value Function V_pi(s): expected discounted return when starting in state s and following policy pi
Action-Value Function Q_pi(s, a): expected discounted return when taking action a in state s and following pi afterward

Bellman Equations define these values recursively and form the foundation of RL algorithms.
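
For reference, the Bellman equations can be written out as follows. The notation is an assumption made for this recap: P(s' | s, a) for the transition probabilities, R(s, a, s') for a reward that may depend on the next state, and pi(a | s) for a stochastic policy. The third line is the Bellman optimality equation for the optimal state values.

```latex
\begin{align}
V_\pi(s) &= \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)
            \left[ R(s, a, s') + \gamma V_\pi(s') \right] \\
Q_\pi(s, a) &= \sum_{s'} P(s' \mid s, a)
            \left[ R(s, a, s') + \gamma \sum_{a'} \pi(a' \mid s') Q_\pi(s', a') \right] \\
V_*(s) &= \max_{a} \sum_{s'} P(s' \mid s, a)
            \left[ R(s, a, s') + \gamma V_*(s') \right]
\end{align}
```

Each equation expresses the value of a state (or state-action pair) in terms of the immediate reward plus the discounted value of whatever comes next, which is what "recursively" means here.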

Dynamic Programming

Dynamic programming methods solve RL problems when the environment model is known:

Policy Evaluation: compute V_pi(s) for a given policy
Policy Improvement: update the policy using value estimates
Policy Iteration: alternate evaluation and improvement until convergence
Value Iteration: directly compute optimal values using repeated updates

Policy iteration and value iteration are guaranteed to converge to an optimal policy in finite MDPs (finitely many states and actions) with a discount factor gamma < 1.
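
The four steps listed above can be sketched on a tiny known model. The two-state MDP below is hypothetical (state 0 = "cool", state 1 = "warm"; action 0 = "slow", action 1 = "fast"), as are the function names and the stopping tolerance. Policies are deterministic here to keep the code short.

```python
GAMMA = 0.9

# Hypothetical model: P[s][a] = list of (probability, next_state, reward).
P = {
    0: {0: [(1.0, 0, 1.0)],
        1: [(0.5, 0, 2.0), (0.5, 1, 2.0)]},
    1: {0: [(0.5, 0, 1.0), (0.5, 1, 1.0)],
        1: [(1.0, 1, -10.0)]},
}


def q_value(V, s, a):
    """One-step lookahead: expected reward plus discounted next-state value."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])


def policy_evaluation(policy, tol=1e-10):
    """Iterate the Bellman equation for a fixed policy until values settle."""
    V = {s: 0.0 for s in P}
    while True:
        new_V = {s: q_value(V, s, policy[s]) for s in P}
        delta = max(abs(new_V[s] - V[s]) for s in P)
        V = new_V
        if delta < tol:
            return V


def greedy_policy(V):
    """Policy improvement: pick the action with the best lookahead value."""
    return {s: max(P[s], key=lambda a: q_value(V, s, a)) for s in P}


def policy_iteration():
    """Alternate evaluation and improvement until the policy stops changing."""
    policy = {s: 0 for s in P}
    while True:
        V = policy_evaluation(policy)
        improved = greedy_policy(V)
        if improved == policy:
            return policy, V
        policy = improved


def value_iteration(tol=1e-10):
    """Apply the Bellman optimality update until values settle."""
    V = {s: 0.0 for s in P}
    while True:
        new_V = {s: max(q_value(V, s, a) for a in P[s]) for s in P}
        delta = max(abs(new_V[s] - V[s]) for s in P)
        V = new_V
        if delta < tol:
            return V


V_star = value_iteration()
pi_star, V_pi = policy_iteration()
```

On this model both routes agree: the policy is "fast" when cool and "slow" when warm, with values of about 15.5 and 14.5 for the two states.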

Model-Free Methods

Model-free methods learn directly from experience without knowing the environment dynamics:

Monte Carlo Methods: learn from complete episodes
Temporal Difference (TD) Learning: update estimates step by step using bootstrapping

These approaches are more practical in real-world scenarios where the model is unknown.
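
A small side-by-side sketch may help recall how the two families differ. It assumes a constant step size alpha, an every-visit Monte Carlo update, and one-step TD(0); the two-state episode and the function names are invented for illustration.

```python
def mc_update(V, episode, alpha, gamma):
    """Every-visit Monte Carlo: wait for the episode to end, then move each
    visited state's value toward the full return observed from it."""
    G = 0.0
    for state, reward, _next_state, _done in reversed(episode):
        G = reward + gamma * G
        V[state] += alpha * (G - V[state])


def td0_update(V, state, reward, next_state, done, alpha, gamma):
    """TD(0): after a single step, move the value toward the reward plus the
    current estimate of the next state (bootstrapping)."""
    target = reward + (0.0 if done else gamma * V[next_state])
    V[state] += alpha * (target - V[state])


# One hypothetical episode: A -> B -> terminal, reward 1 on the last step.
# Each transition is (state, reward, next_state, done).
episode = [("A", 0.0, "B", False), ("B", 1.0, None, True)]

V_mc = {"A": 0.0, "B": 0.0}
mc_update(V_mc, episode, alpha=0.5, gamma=1.0)

V_td = {"A": 0.0, "B": 0.0}
for state, reward, next_state, done in episode:
    td0_update(V_td, state, reward, next_state, done, alpha=0.5, gamma=1.0)
```

After this single episode, Monte Carlo has already credited state A with part of the final reward (V_mc is 0.5 for both states), while TD(0) has only updated B (V_td is 0.0 for A and 0.5 for B), because A's update relied on B's estimate, which was still zero at that point.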

Q-Learning & Control

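Q-Learning estimates the optimal action-value function Q*(s, a) from sampled transitions
It is an off-policy algorithm, meaning it learns the optimal policy while possibly following a different one
Epsilon-greedy action selection balances exploration and exploitation: with probability epsilon the agent picks a random action, otherwise the best known one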

Exploration means trying new actions, while exploitation means choosing the best known action. This balance allows the agent to learn optimal behavior through interaction with the environment.
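
Putting these pieces together, here is a minimal tabular Q-learning sketch with epsilon-greedy action selection. The five-state corridor, the reward of 1.0 at the goal, and the hyperparameter values are assumptions chosen for illustration, not recommended settings.

```python
import random

N_STATES, GOAL = 5, 4      # hypothetical corridor: states 0..4, goal at 4
ACTIONS = [0, 1]           # 0 = left, 1 = right


def step(state, action):
    """Move one cell; reward 1.0 only when the goal is reached."""
    move = 1 if action == 1 else -1
    next_state = max(0, min(N_STATES - 1, state + move))
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done


def epsilon_greedy(Q, state, epsilon, rng):
    """Explore with probability epsilon, otherwise exploit (random tie-break)."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(Q[state])
    return rng.choice([a for a in ACTIONS if Q[state][a] == best])


def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = epsilon_greedy(Q, state, epsilon, rng)
            next_state, reward, done = step(state, action)
            # Off-policy target: uses the best next action, not the one
            # the epsilon-greedy behavior will actually take.
            target = reward + (0.0 if done else gamma * max(Q[next_state]))
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
    return Q


Q = q_learning()
greedy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(GOAL)]
```

The `max` inside the target is what makes the method off-policy: the agent behaves epsilon-greedily but learns about the greedy policy. After training, the greedy action in every non-goal state should be "right".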

Key Takeaways

Reinforcement learning is about learning optimal decisions through interaction
MDPs provide the mathematical foundation for modeling environments
Value functions and Bellman equations allow evaluation of decisions
Dynamic programming methods solve problems when the model is known
Model-free methods and Q-learning enable learning directly from experience
Exploration is essential: an agent that only exploits can settle on suboptimal behavior because it never tries the actions that would turn out to be better