Unit 3: Mathematical Foundations of Reinforcement Learning

Introduction

Reinforcement learning relies heavily on precise mathematical definitions to evaluate policies and derive optimal actions. This unit covers the key equations like state value function, action-value function (Q-function), Bellman equations, policy improvement, and value iteration and includes Python examples to illustrate how these formulas are implemented in practice.

1. State Value Function

Formula:

V^π(s) = E^π [ Σ_{t=0}^{∞} γ^t · R_{t+1} | S₀ = s ]

Explanation:

  • Measures the expected cumulative reward starting from state s and following policy π.
  • γ (0 ≤ γ < 1) is the discount factor, controlling how future rewards are valued.
  • E^π denotes expectation assuming the agent follows policy π.
  • Provides a way to evaluate the “goodness” of a state under a given policy.

Python Example:


import numpy as np 
# Simple 3-state environment with rewards 
rewards = [0, 0, 1]   
gamma = 0.9 
V = np.zeros(len(rewards))  # initialize state values 
 
# Iterative policy evaluation 
for _ in range(10): 
    for s in range(len(rewards)): 
        V[s] = rewards[s] + gamma * V[s]  # update V^π(s) 
 
print("State Values:", V) 

2. Action-Value Function (Q-Function)

Formula:

Q^π(s,a) = Σ_{s'} P(s' | s, a) · [ R(s, a, s') + γ · V^π(s') ]

Explanation:

  • Measures the expected return of taking action a in state s, then following policy π.
  • P(s' | s, a) is the probability of transitioning to state s'.
  • R(s,a,s') is the immediate reward.
  • Q(s,a) is fundamental for policy improvement and Q-learning.

Python Example:


# 2 actions per state 
Q = np.zeros((len(rewards), 2)) 
 
# Transition rewards: (state, action) -> reward 
transitions = { 
    (0, 0): 0, (0, 1): 0, 
    (1, 0): 0, (1, 1): 1, 
    (2, 0): 1, (2, 1): 1 
} 
 
for s in range(len(rewards)): 
    for a in range(2): 
        Q[s, a] = transitions[(s,a)] + gamma * np.max(Q[s,:]) 
 
print("Q-values:\n", Q) 

3. Bellman Equations

Bellman Expectation Equation

Formula:

V^π(s) = Σ_a π(a | s) Σ_{s'} P(s' | s, a) · [ R(s, a, s') + γ · V^π(s') ]

  • Expresses the value of a state recursively in terms of the values of successor states.

Bellman Optimality Equations

Formulas:

V*(s) = max_a Σ_{s'} P(s' | s, a) · [ R(s, a, s') + γ · V*(s') ]

Q*(s,a) = Σ_{s'} P(s' | s, a) · [ R(s, a, s') + γ · max_{a'} Q*(s', a') ]

  • V*(s) is the maximum value achievable from state s.
  • Q*(s,a) is the maximum expected return of taking action a in state s.
  • These equations are central to computing optimal policies.

4. Policy Improvement

Formula:

π_new(s) = argmax_a Q^π(s, a)

Explanation:

  • Update the policy to always pick the action with the highest Q-value.
  • Repeat until the policy converges to the optimal policy π*.

5. Value Iteration

Formula:

V_{k+1}(s) = max_a Σ_{s'} P(s' | s, a) · [ R(s, a, s') + γ · V_k(s') ]

Optimal Policy:

π*(s) = argmax_a Σ_{s'} P(s' | s, a) · [ R(s, a, s') + γ · V*(s') ]

Explanation:

  • Combines evaluation and improvement in one step.
  • Iteratively update V(s) until it converges to V*.
  • Once V* is known, derive π* easily.

Python Example:


V = np.zeros(len(rewards)) 
theta = 1e-4  # convergence threshold 
 
while True: 
    delta = 0 
    for s in range(len(rewards)): 
        v = V[s] 
        V[s] = max([transitions[(s,a)] + gamma * V[s] for a in range(2)]) 
        delta = max(delta, abs(v - V[s])) 
    if delta < theta: 
        break 
 
# Derive optimal policy 
pi_star = np.zeros(len(rewards), dtype=int) 
for s in range(len(rewards)): 
    pi_star[s] = np.argmax([transitions[(s,a)] + gamma * V[s] for a in range(2)]) 
 
print("Optimal State Values:", V) 
print("Optimal Policy:", pi_star) 
6. Summary Table of Key Equations
Concept Formula Explanation
State Value Function V^π(s) = Σ_a π(a s) Σ_{s'} P(s'
Action-Value Function Q^π(s,a) = Σ_{s'} P(s' s,a) · [ R(s,a,s') + γ · V^π(s') ]
Policy Improvement π_new(s) = argmax_a Q^π(s,a) Update policy to pick the best action
Value Iteration V_{k+1}(s) = max_a Σ_{s'} P(s' s,a) · [ R(s,a,s') + γ · V_k(s') ]
Optimal Policy π*(s) = argmax_a Σ_{s'} P(s' s,a) · [ R(s,a,s') + γ · V*(s') ]