Unit 4: Dynamic Programming Methods

Introduction

Dynamic Programming (DP) is one of the foundational methods in reinforcement learning. It provides systematic techniques to compute optimal policies when the full model of the environment is known.

In DP, the agent has complete knowledge of:

All possible states
All possible actions
Transition probabilities P(s' | s, a)
Reward function R(s, a, s')

With this information, DP methods iteratively improve value estimates to find optimal solutions. While many real-world environments lack a full model, understanding DP is crucial for developing advanced RL algorithms.

Policy Evaluation

Concept: Determine how good a given policy π is.

If an agent follows policy π, the value of each state represents the expected cumulative reward starting from that state.
Policy evaluation iteratively updates state values until they converge.

Key idea:

Start with initial estimates V(s)
Use the Bellman expectation equation: V^π(s) = Σ_a π(a|s) Σ_{s'} P(s'|s,a) [ R(s,a,s') + γ V^π(s') ]
Repeat until values stabilize.

Purpose: Answers the question: “If the agent follows this policy, how good are the states?”

Policy Improvement

Concept: Use the current value estimates to make the policy better.

For each state s, select the action a that maximizes expected return: π_new(s) = argmax_a Σ_{s'} P(s'|s,a) [ R(s,a,s') + γ V^π(s') ]
Replace the old policy with the improved policy.
Repeat this process iteratively.

Key idea: Make the policy greedy with respect to the value function, always choosing actions that maximize expected reward.

Policy Iteration

Concept: Combine policy evaluation and improvement into a loop.

Steps:

Start with an initial policy.
Evaluate the policy (compute V^π).
Improve the policy using current value estimates.
Repeat until the policy stops changing.

Outcome: Guarantees convergence to the optimal policy π* for finite MDPs.

Note: Can be computationally intensive for large environments, but provides a solid theoretical framework.

Value Iteration

Concept: A more efficient alternative to policy iteration.

Instead of fully evaluating the policy before improving it, value iteration combines evaluation and improvement in one step.
Update state values using: V_{k+1}(s) = max_a Σ_{s'} P(s'|s,a) [ R(s,a,s') + γ V_k(s') ]
Iterate until V_k(s) → V*(s).

Optimal policy: Once V* is known: π*(s) = argmax_a Σ_{s'} P(s'|s,a) [ R(s,a,s') + γ V*(s') ]

Key advantage: Often faster and simpler than full policy iteration.

Convergence of Dynamic Programming

Dynamic Programming methods are guaranteed to converge under certain conditions:

Finite states and actions
Discount factor 0 ≤ γ < 1
Accurate transition probabilities and rewards

Limitation: Requires a complete model of the environment, which is rarely available in practice. This limitation motivates model-free RL methods, such as Monte Carlo and Temporal Difference learning.

Summary

Dynamic Programming provides the theoretical foundation for solving reinforcement learning problems when the full environment model is known.

Key concepts:

Policy Evaluation: Estimate state values under a policy
Policy Improvement: Update policy using value estimates
Policy Iteration: Repeat evaluation and improvement until convergence
Value Iteration: Combine evaluation and improvement into a single iterative step

Even though DP is often impractical for large or unknown environments, it is essential for understanding how optimal RL policies can be computed.