Unit 4: Dynamic Programming Methods
Introduction
Dynamic Programming (DP) is one of the foundational methods in reinforcement learning. It provides systematic techniques to compute optimal policies when the full model of the environment is known.
In DP, the agent has complete knowledge of:
- All possible states
- All possible actions
- Transition probabilities P(s' | s, a)
- Reward function R(s, a, s')
With this information, DP methods iteratively improve value estimates to find optimal solutions. While many real-world environments lack a full model, understanding DP is crucial for developing advanced RL algorithms.
Policy Evaluation
Concept: Determine how good a given policy π is.
- If an agent follows policy π, the value of each state represents the expected cumulative reward starting from that state.
- Policy evaluation iteratively updates state values until they converge.
Key idea:
- Start with initial estimates V(s)
- Use the Bellman expectation equation: V^π(s) = Σ_a π(a|s) Σ_{s'} P(s'|s,a) [ R(s,a,s') + γ V^π(s') ]
- Repeat until values stabilize.
Purpose: Answers the question: “If the agent follows this policy, how good are the states?”
Policy Improvement
Concept: Use the current value estimates to make the policy better.
- For each state s, select the action a that maximizes expected return: π_new(s) = argmax_a Σ_{s'} P(s'|s,a) [ R(s,a,s') + γ V^π(s') ]
- Replace the old policy with the improved policy.
- Repeat this process iteratively.
Key idea: Make the policy greedy with respect to the value function, always choosing actions that maximize expected reward.
Policy Iteration
Concept: Combine policy evaluation and improvement into a loop.
Steps:
- Start with an initial policy.
- Evaluate the policy (compute V^π).
- Improve the policy using current value estimates.
- Repeat until the policy stops changing.
Outcome: Guarantees convergence to the optimal policy π* for finite MDPs.
Note: Can be computationally intensive for large environments, but provides a solid theoretical framework.
Value Iteration
Concept: A more efficient alternative to policy iteration.
- Instead of fully evaluating the policy before improving it, value iteration combines evaluation and improvement in one step.
- Update state values using: V_{k+1}(s) = max_a Σ_{s'} P(s'|s,a) [ R(s,a,s') + γ V_k(s') ]
- Iterate until V_k(s) → V*(s).
Optimal policy: Once V* is known: π*(s) = argmax_a Σ_{s'} P(s'|s,a) [ R(s,a,s') + γ V*(s') ]
Key advantage: Often faster and simpler than full policy iteration.
Convergence of Dynamic Programming
Dynamic Programming methods are guaranteed to converge under certain conditions:
- Finite states and actions
- Discount factor 0 ≤ γ < 1
- Accurate transition probabilities and rewards
Limitation: Requires a complete model of the environment, which is rarely available in practice. This limitation motivates model-free RL methods, such as Monte Carlo and Temporal Difference learning.
Summary
Dynamic Programming provides the theoretical foundation for solving reinforcement learning problems when the full environment model is known.
Key concepts:
- Policy Evaluation: Estimate state values under a policy
- Policy Improvement: Update policy using value estimates
- Policy Iteration: Repeat evaluation and improvement until convergence
- Value Iteration: Combine evaluation and improvement into a single iterative step
Even though DP is often impractical for large or unknown environments, it is essential for understanding how optimal RL policies can be computed.