Dynamic Programming (Model-Based RL)
Dynamic Programming (DP) is a set of algorithms for solving problems where the environment's rules and outcomes (its transition dynamics and rewards) are fully known. In reinforcement learning, it is used to calculate how good each situation is and to determine the best actions to take. This unit introduces DP methods and illustrates them with a grid world, a simple visual example.
Evaluating a Policy
Evaluating a policy means figuring out how good it is to be in each situation if you follow that policy. The process goes as follows:
- Start with an initial guess for the value of each situation (often zero).
- Repeatedly update each value by averaging over the actions the policy would take from that situation, combining the immediate rewards with the values of the resulting situations.
- Stop when the values stop changing significantly.
Example: In a 4×4 grid world, all values start at zero. Under a random strategy, the value of each cell gradually adjusts, reflecting how promising that cell is for reaching the goal.
Intuition: Each update spreads the expected rewards backward through the grid, showing which positions are better over time.
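The update loop above can be sketched in Python. The 4×4 layout, a goal cell at (3, 3) that pays a reward of 1 when reached, a discount factor of 0.9, and deterministic moves are all illustrative assumptions, not details fixed by the text:

```python
# Sketch: iterative policy evaluation for a uniform-random policy on a 4x4 grid.
# Assumed setup: goal at (3, 3), reward 1 for reaching it, discount 0.9.
N, GOAL, GAMMA, THETA = 4, (3, 3), 0.9, 1e-6
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(state, move):
    # Deterministic move; bumping into a wall leaves the agent in place.
    nr = max(0, min(N - 1, state[0] + move[0]))
    nc = max(0, min(N - 1, state[1] + move[1]))
    nxt = (nr, nc)
    return nxt, (1.0 if nxt == GOAL else 0.0)

V = {(r, c): 0.0 for r in range(N) for c in range(N)}  # initial guess: all zeros
while True:
    delta = 0.0
    for s in V:
        if s == GOAL:
            continue  # terminal cell: its value stays 0
        # Average over the four equally likely actions of the random policy.
        total = 0.0
        for a in ACTIONS:
            nxt, reward = step(s, a)
            total += 0.25 * (reward + GAMMA * V[nxt])
        delta = max(delta, abs(total - V[s]))
        V[s] = total
    if delta < THETA:
        break  # values have stopped changing significantly

print(round(V[(3, 2)], 3))  # the cell next to the goal ends up relatively valuable
```

Cells closer to the goal settle at higher values, which is exactly the "rewards spreading backward" intuition.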
Improving and Iterating Policies
Once we know the value of a policy, we can try to improve it by choosing actions that lead to better results.
Improving a Policy
For each situation, pick the action that seems to lead to the highest total reward based on the current values. Policy iteration then alternates between evaluating how good the policy is and making it better:
- Evaluate the current policy.
- Update the policy by choosing the best action in each situation.
- Repeat until the policy stops changing.
Example: In a grid world, this process might begin with a random policy. The agent evaluates each cell’s value, adjusts its strategy to move toward higher-value cells, and continues iterating. Eventually, the agent discovers a path that reaches the goal efficiently.
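The evaluate-improve loop can be sketched as below, again under an assumed setup (4×4 grid, goal at (3, 3) worth a reward of 1, discount 0.9, deterministic moves, and an arbitrary "always up" starting policy):

```python
# Sketch: policy iteration on an assumed 4x4 grid world.
N, GOAL, GAMMA, THETA = 4, (3, 3), 0.9, 1e-6
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(state, move):
    # Deterministic move; walls keep the agent in place.
    nr = max(0, min(N - 1, state[0] + move[0]))
    nc = max(0, min(N - 1, state[1] + move[1]))
    nxt = (nr, nc)
    return nxt, (1.0 if nxt == GOAL else 0.0)

states = [(r, c) for r in range(N) for c in range(N)]
policy = {s: ACTIONS[0] for s in states}  # arbitrary start: always move up
V = {s: 0.0 for s in states}

while True:
    # 1. Policy evaluation: sweep until the values of the current policy settle.
    while True:
        delta = 0.0
        for s in states:
            if s == GOAL:
                continue
            nxt, reward = step(s, policy[s])
            new_v = reward + GAMMA * V[nxt]
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < THETA:
            break
    # 2. Policy improvement: act greedily with respect to the current values.
    stable = True
    for s in states:
        if s == GOAL:
            continue
        best = max(ACTIONS, key=lambda a: step(s, a)[1] + GAMMA * V[step(s, a)[0]])
        if best != policy[s]:
            policy[s] = best
            stable = False
    if stable:
        break  # 3. Stop once the policy no longer changes

print(policy[(3, 2)])  # greedy action from the cell just left of the goal
```

Starting from a policy that never reaches the goal, the improvement step first fixes the cells adjacent to the goal, then the fix propagates outward until the whole policy is optimal.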
Value Iteration
Value iteration combines evaluating and improving the policy into one step. Update the value of each situation by considering all possible actions and keeping the result of the best one. Once the values stabilize, the best action for each situation is simply the one that achieves that highest value.
Why it’s useful: Value iteration is often faster than policy iteration because it does not require fully evaluating a policy before improving it; the best actions emerge directly as the values are updated.
Example: In a grid world, value iteration quickly highlights the cells near the goal as the most valuable, guiding the agent toward the best path automatically.
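A sketch of the fused update, under the same assumed setup as before (4×4 grid, goal at (3, 3) paying a reward of 1, discount 0.9, deterministic moves):

```python
# Sketch: value iteration on an assumed 4x4 grid world.
N, GOAL, GAMMA, THETA = 4, (3, 3), 0.9, 1e-6
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(state, move):
    # Deterministic move; walls keep the agent in place.
    nr = max(0, min(N - 1, state[0] + move[0]))
    nc = max(0, min(N - 1, state[1] + move[1]))
    nxt = (nr, nc)
    return nxt, (1.0 if nxt == GOAL else 0.0)

V = {(r, c): 0.0 for r in range(N) for c in range(N)}
while True:
    delta = 0.0
    for s in V:
        if s == GOAL:
            continue
        # Evaluation and improvement fused: back up the BEST action directly.
        best = max(step(s, a)[1] + GAMMA * V[step(s, a)[0]] for a in ACTIONS)
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < THETA:
        break

# Read the greedy policy off the converged values with a one-step lookahead.
greedy = {s: max(ACTIONS, key=lambda a: step(s, a)[1] + GAMMA * V[step(s, a)[0]])
          for s in V if s != GOAL}
print(round(V[(0, 0)], 3), greedy[(3, 2)])
```

Under these assumptions a cell d steps from the goal converges to 0.9 ** (d - 1), so the far corner (0, 0), six steps away, settles near 0.590.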
Convergence
Both policy iteration and value iteration gradually refine values. Each update brings the values closer to the true optimal values. The differences between successive value estimates shrink over repeated updates. Eventually, the values stabilize, showing the best expected rewards for each situation.
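The shrinking differences can be seen on even a tiny example. This sketch uses an assumed four-state chain where the only action moves right and a reward of 1 is earned on reaching the last state (discount 0.9):

```python
# Sketch: track the largest value change per sweep on an assumed 4-state chain
# (states 0..3, single "move right" action, reward 1 on reaching state 3).
gamma = 0.9
V = [0.0] * 4
deltas = []
for sweep in range(6):
    delta = 0.0
    for s in range(3):                      # state 3 is terminal
        reward = 1.0 if s + 1 == 3 else 0.0
        new_v = reward + gamma * V[s + 1]   # deterministic one-step backup
        delta = max(delta, abs(new_v - V[s]))
        V[s] = new_v
    deltas.append(round(delta, 6))
print(deltas)  # → [1.0, 0.9, 0.81, 0.0, 0.0, 0.0]
```

Each sweep's largest change is smaller than the last, and once the values match the true ones the changes drop to zero: that is convergence.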
Grid World Examples
A grid world is a simple, visual environment used to illustrate how Dynamic Programming works in practice. It helps make abstract concepts like policy evaluation and value iteration concrete and easy to follow.
Structure of a Grid World
The environment is a 4×4 grid, where each square represents a distinct state the agent can be in. Certain squares provide rewards (for example, reaching the goal), while most others give no reward.
The agent can move up, down, left, or right, and the movements may be deterministic (always succeed) or stochastic (sometimes fail or go in a different direction).
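These dynamics can be sketched as a single transition function; the "slip" probability used for the stochastic variant is an assumed illustration:

```python
# Sketch of grid world dynamics: deterministic by default, stochastic with slip.
import random

N = 4
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action, slip=0.0):
    """Move on the grid. With probability `slip`, a random direction is taken
    instead of the chosen one. Bumping into a wall leaves the agent in place."""
    if slip > 0 and random.random() < slip:
        action = random.choice(list(ACTIONS))  # stochastic: the move "fails"
    dr, dc = ACTIONS[action]
    r = max(0, min(N - 1, state[0] + dr))
    c = max(0, min(N - 1, state[1] + dc))
    return (r, c)

print(step((0, 0), "up"))     # → (0, 0): wall bump, agent stays put
print(step((1, 1), "right"))  # → (1, 2): deterministic move
```

With `slip=0.0` the environment is deterministic; passing, say, `slip=0.2` makes it stochastic in the sense described above.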
Applying Dynamic Programming in a Grid World
Policy Evaluation
Start with a fixed strategy for moving around (for example, choosing actions at random). Assign values to each square based on the expected total reward from that square under the current strategy. Over multiple iterations, the values stabilize, reflecting how good it is to be in each square.
Policy Iteration
- Begin with a random strategy.
- Evaluate the current policy to find the value of each square.
- Improve the strategy by choosing the best action in each square.
- Repeat evaluation and improvement until the strategy no longer changes and the optimal policy is found.
Value Iteration
Update the value of each square by considering all possible moves and picking the one that gives the highest expected reward. Continue updating until the values stabilize. Once complete, the optimal actions for each square can be directly determined from the final values.
Visualization
Heatmaps show the value of each square, helping visualize which areas are most valuable. Arrows or directions indicate the best action to take from each square. Over iterations, the arrows converge toward the goal, showing how the agent learns the optimal path.
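A text-mode stand-in for these plots can be sketched as below, assuming a 4×4 grid with the goal at (3, 3), deterministic moves, a reward of 1 at the goal, and discount 0.9, under which a cell d steps from the goal converges to the value 0.9 ** (d - 1):

```python
# Sketch: text "heatmap" of converged values plus an arrow plot of greedy moves.
N, GOAL, GAMMA = 4, (3, 3), 0.9

value_rows = []
for r in range(N):
    cells = []
    for c in range(N):
        if (r, c) == GOAL:
            cells.append(" GOAL")
        else:
            d = abs(GOAL[0] - r) + abs(GOAL[1] - c)   # steps to the goal
            cells.append(f"{GAMMA ** (d - 1):.3f}")   # assumed converged value
    value_rows.append(" ".join(cells))

# Arrows follow one shortest path: move down until the goal row, then right.
arrow_rows = [" ".join("G" if (r, c) == GOAL else ("v" if r < GOAL[0] else ">")
                       for c in range(N)) for r in range(N)]

print("\n".join(value_rows))
print()
print("\n".join(arrow_rows))
```

The printed values grow toward the goal corner (the "heatmap"), and every arrow points along a shortest path to it, mirroring what the converged plots look like.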
Unit Summary
- Dynamic Programming: A set of model-based reinforcement learning methods used when the environment’s dynamics (transition probabilities and rewards) are known.
- Policy Evaluation: Calculates how good each state is under a given policy by repeatedly updating value estimates until they converge.
- Policy Iteration: Alternates between evaluating the current policy and improving it by selecting actions that lead to higher expected rewards.
- Value Iteration: Combines policy evaluation and policy improvement into a single update step, directly moving toward the optimal value function.
- Convergence: Repeated value updates gradually reduce differences between estimates until the values stabilize at the optimal solution.
- Grid World Example: A simple environment used to visualize how values propagate through states and how optimal actions emerge.