Unit 3: Mathematical Foundations of Reinforcement Learning
Introduction
Reinforcement learning relies heavily on precise mathematical definitions to evaluate policies and derive optimal actions. This unit covers the key equations like state value function, action-value function (Q-function), Bellman equations, policy improvement, and value iteration and includes Python examples to illustrate how these formulas are implemented in practice.
1. State Value Function
Formula:
V^π(s) = E^π [ Σ_{t=0}^{∞} γ^t · R_{t+1} | S₀ = s ]
Explanation:
- Measures the expected cumulative reward starting from state s and following policy π.
- γ (0 ≤ γ < 1) is the discount factor, controlling how future rewards are valued.
- E^π denotes expectation assuming the agent follows policy π.
- Provides a way to evaluate the “goodness” of a state under a given policy.
Python Example:
import numpy as np
# Simple 3-state environment with rewards
rewards = [0, 0, 1]
gamma = 0.9
V = np.zeros(len(rewards)) # initialize state values
# Iterative policy evaluation
for _ in range(10):
for s in range(len(rewards)):
V[s] = rewards[s] + gamma * V[s] # update V^π(s)
print("State Values:", V)
2. Action-Value Function (Q-Function)
Formula:
Q^π(s,a) = Σ_{s'} P(s' | s, a) · [ R(s, a, s') + γ · V^π(s') ]
Explanation:
- Measures the expected return of taking action a in state s, then following policy π.
- P(s' | s, a) is the probability of transitioning to state s'.
- R(s,a,s') is the immediate reward.
- Q(s,a) is fundamental for policy improvement and Q-learning.
Python Example:
# 2 actions per state
Q = np.zeros((len(rewards), 2))
# Transition rewards: (state, action) -> reward
transitions = {
(0, 0): 0, (0, 1): 0,
(1, 0): 0, (1, 1): 1,
(2, 0): 1, (2, 1): 1
}
for s in range(len(rewards)):
for a in range(2):
Q[s, a] = transitions[(s,a)] + gamma * np.max(Q[s,:])
print("Q-values:\n", Q)
3. Bellman Equations
Bellman Expectation Equation
Formula:
V^π(s) = Σ_a π(a | s) Σ_{s'} P(s' | s, a) · [ R(s, a, s') + γ · V^π(s') ]
- Expresses the value of a state recursively in terms of the values of successor states.
Bellman Optimality Equations
Formulas:
V*(s) = max_a Σ_{s'} P(s' | s, a) · [ R(s, a, s') + γ · V*(s') ]
Q*(s,a) = Σ_{s'} P(s' | s, a) · [ R(s, a, s') + γ · max_{a'} Q*(s', a') ]
- V*(s) is the maximum value achievable from state s.
- Q*(s,a) is the maximum expected return of taking action a in state s.
- These equations are central to computing optimal policies.
4. Policy Improvement
Formula:
π_new(s) = argmax_a Q^π(s, a)
Explanation:
- Update the policy to always pick the action with the highest Q-value.
- Repeat until the policy converges to the optimal policy π*.
5. Value Iteration
Formula:
V_{k+1}(s) = max_a Σ_{s'} P(s' | s, a) · [ R(s, a, s') + γ · V_k(s') ]
Optimal Policy:
π*(s) = argmax_a Σ_{s'} P(s' | s, a) · [ R(s, a, s') + γ · V*(s') ]
Explanation:
- Combines evaluation and improvement in one step.
- Iteratively update V(s) until it converges to V*.
- Once V* is known, derive π* easily.
Python Example:
V = np.zeros(len(rewards))
theta = 1e-4 # convergence threshold
while True:
delta = 0
for s in range(len(rewards)):
v = V[s]
V[s] = max([transitions[(s,a)] + gamma * V[s] for a in range(2)])
delta = max(delta, abs(v - V[s]))
if delta < theta:
break
# Derive optimal policy
pi_star = np.zeros(len(rewards), dtype=int)
for s in range(len(rewards)):
pi_star[s] = np.argmax([transitions[(s,a)] + gamma * V[s] for a in range(2)])
print("Optimal State Values:", V)
print("Optimal Policy:", pi_star)
6. Summary Table of Key Equations
| Concept | Formula | Explanation |
|---|---|---|
| State Value Function | V^π(s) = Σ_a π(a | s) Σ_{s'} P(s' |
| Action-Value Function | Q^π(s,a) = Σ_{s'} P(s' | s,a) · [ R(s,a,s') + γ · V^π(s') ] |
| Policy Improvement | π_new(s) = argmax_a Q^π(s,a) | Update policy to pick the best action |
| Value Iteration | V_{k+1}(s) = max_a Σ_{s'} P(s' | s,a) · [ R(s,a,s') + γ · V_k(s') ] |
| Optimal Policy | π*(s) = argmax_a Σ_{s'} P(s' | s,a) · [ R(s,a,s') + γ · V*(s') ] |