Project: Navigating a Grid World using Value Iteration

In this project, we implement a simple 4×4 grid world environment and use Value Iteration to compute the optimal policy. The agent learns which action maximizes the expected discounted return from each cell.

Step 1: Define the Grid World


# Gridworld setup
grid_size = 4
goal_state = (3, 3)

# Rewards: +1 for goal, 0 otherwise
rewards = {(i,j): 0 for i in range(grid_size) for j in range(grid_size)}
rewards[goal_state] = 1

# Possible actions
actions = ['U', 'D', 'L', 'R']

# Function to move in the grid; moves off the edge are clipped
def next_state(state, action):
    i, j = state
    if action == 'U':
        i = max(i - 1, 0)
    elif action == 'D':
        i = min(i + 1, grid_size - 1)
    elif action == 'L':
        j = max(j - 1, 0)
    elif action == 'R':
        j = min(j + 1, grid_size - 1)
    return (i, j)

Explanation:

  • The goal is the bottom-right corner with reward +1.
  • The agent can move Up, Down, Left, or Right.
  • Moves that would leave the grid are clipped, so the agent stays in place at the boundary.
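
A quick sanity check of the clipping behavior (repeating the Step 1 definitions so the snippet runs on its own):

```python
# Sanity check of the transition function (same definitions as Step 1;
# moves that would leave the grid keep the agent in place).
grid_size = 4

def next_state(state, action):
    i, j = state
    if action == 'U': i = max(i - 1, 0)
    elif action == 'D': i = min(i + 1, grid_size - 1)
    elif action == 'L': j = max(j - 1, 0)
    elif action == 'R': j = min(j + 1, grid_size - 1)
    return (i, j)

print(next_state((0, 0), 'U'))  # (0, 0) -- clipped at the top edge
print(next_state((0, 0), 'R'))  # (0, 1)
print(next_state((3, 3), 'D'))  # (3, 3) -- clipped at the bottom edge
```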

Step 2: Value Iteration


gamma = 0.9   # Discount factor
theta = 0.0001  # Convergence threshold

# Initialize state values to 0
V = {(i,j): 0 for i in range(grid_size) for j in range(grid_size)}

# Value Iteration
while True:
    delta = 0
    for state in V:
        if state == goal_state:
            continue
        v = V[state]
        # Bellman optimality backup: best one-step lookahead over all actions
        V[state] = max(rewards[next_state(state, a)] + gamma * V[next_state(state, a)]
                       for a in actions)
        delta = max(delta, abs(v - V[state]))
    if delta < theta:
        break

Explanation:

  • gamma discounts future rewards.
  • We repeatedly update each state’s value based on the best next action.
  • Convergence occurs when changes (delta) are smaller than theta.
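
The update inside the loop is the Bellman optimality backup for deterministic transitions, V(s) ← max_a [r(s') + γ·V(s')]. Putting Steps 1 and 2 together into one self-contained snippet makes the fixed point easy to verify: with deterministic moves, the optimal value of a state is γ^(d−1), where d is its Manhattan distance to the goal.

```python
# Self-contained value-iteration sketch combining Steps 1 and 2
# (re-declares the same definitions so it runs on its own).
grid_size = 4
goal_state = (3, 3)
rewards = {(i, j): 0 for i in range(grid_size) for j in range(grid_size)}
rewards[goal_state] = 1
actions = ['U', 'D', 'L', 'R']
gamma, theta = 0.9, 1e-4

def next_state(state, action):
    i, j = state
    if action == 'U': i = max(i - 1, 0)
    elif action == 'D': i = min(i + 1, grid_size - 1)
    elif action == 'L': j = max(j - 1, 0)
    elif action == 'R': j = min(j + 1, grid_size - 1)
    return (i, j)

V = {s: 0.0 for s in rewards}
while True:
    delta = 0.0
    for state in V:
        if state == goal_state:
            continue  # terminal state: value stays 0
        v = V[state]
        V[state] = max(rewards[next_state(state, a)] + gamma * V[next_state(state, a)]
                       for a in actions)
        delta = max(delta, abs(v - V[state]))
    if delta < theta:
        break

# States one step from the goal are worth exactly 1; the far corner
# is worth roughly gamma**5.
print(V[(3, 2)])             # 1.0
print(round(V[(0, 0)], 2))   # 0.59
```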

Step 3: Extract the Optimal Policy


policy = {}
for state in V:
    if state == goal_state:
        policy[state] = '.'  # No action at goal
    else:
        best_action = max(actions,
                          key=lambda a: rewards[next_state(state, a)] + gamma * V[next_state(state, a)])
        policy[state] = best_action

Explanation:

  • For each state, pick the action with the highest one-step lookahead value.
  • Ties are broken by max, which returns the first best action in the actions list.
  • The goal state has no outgoing action.
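
One detail worth knowing: Python's max returns the first of several tied candidates, so when Down and Right are equally good (which happens everywhere in the interior of this grid) the order of the actions list decides. A self-contained sketch, seeding V with the closed-form optimal values γ^(d−1) that value iteration converges to here:

```python
# Tie-breaking sketch: seed V with the closed-form optimal values for this
# deterministic grid, V*(s) = gamma**(d-1) with d the Manhattan distance
# to the goal, then extract the greedy policy as in Step 3.
grid_size, goal_state, gamma = 4, (3, 3), 0.9
actions = ['U', 'D', 'L', 'R']
rewards = {(i, j): 0 for i in range(grid_size) for j in range(grid_size)}
rewards[goal_state] = 1

def next_state(state, action):
    i, j = state
    if action == 'U': i = max(i - 1, 0)
    elif action == 'D': i = min(i + 1, grid_size - 1)
    elif action == 'L': j = max(j - 1, 0)
    elif action == 'R': j = min(j + 1, grid_size - 1)
    return (i, j)

def dist(s):  # Manhattan distance to the goal
    return abs(goal_state[0] - s[0]) + abs(goal_state[1] - s[1])

V = {s: 0.0 if s == goal_state else gamma ** (dist(s) - 1) for s in rewards}

policy = {}
for state in V:
    if state == goal_state:
        policy[state] = '.'
    else:
        policy[state] = max(actions, key=lambda a: rewards[next_state(state, a)]
                            + gamma * V[next_state(state, a)])

# 'D' and 'R' tie in the interior; max keeps the first, and 'D' comes
# before 'R' in the actions list. On the bottom row only 'R' helps.
print(policy[(0, 0)], policy[(3, 0)])  # D R
```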

Step 4: Display Values and Policy


# Print state values
print("State Values:")
for i in range(grid_size):
    print([round(V[(i,j)],2) for j in range(grid_size)])

# Print optimal policy
print("\nOptimal Policy:")
for i in range(grid_size):
    print([policy[(i,j)] for j in range(grid_size)])

Explanation:

  • State Values: the expected discounted return from each state under the optimal policy.
  • Policy: the optimal action per state (U, D, L, R); '.' marks the goal.

Step 5: Understanding the Process

  • Grid Setup: Each cell represents a state; only the goal has a reward.
  • Value Iteration: Updates state values iteratively until convergence.
  • Policy Extraction: Chooses the action that leads to the highest expected return.
  • Visualization: State values and policy provide intuition about optimal paths.
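
To see the outcome concretely, we can roll out a greedy policy from the top-left corner and count the steps to the goal (a minimal sketch; the policy below is written by hand, moving down to the last row and then right, which is one of the optimal paths):

```python
# Greedy rollout sketch: follow an optimal policy from the top-left
# corner to the goal. The hand-written policy (down to the last row,
# then right) matches the tie-broken result for this grid.
grid_size, goal_state = 4, (3, 3)

def next_state(state, action):
    i, j = state
    if action == 'U': i = max(i - 1, 0)
    elif action == 'D': i = min(i + 1, grid_size - 1)
    elif action == 'L': j = max(j - 1, 0)
    elif action == 'R': j = min(j + 1, grid_size - 1)
    return (i, j)

policy = {(i, j): 'D' if i < grid_size - 1 else 'R'
          for i in range(grid_size) for j in range(grid_size)}
policy[goal_state] = '.'

state, path = (0, 0), [(0, 0)]
while state != goal_state:
    state = next_state(state, policy[state])
    path.append(state)

print(len(path) - 1)  # 6 -- the Manhattan distance from (0,0) to (3,3)
```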

Outcome

The agent learns the optimal path to the goal. Value iteration ensures convergence to the true optimal values. This project demonstrates a model-based RL approach with deterministic transitions in a simple environment.

Example Output:


State Values:
[0.59, 0.66, 0.73, 0.81]
[0.66, 0.73, 0.81, 0.9]
[0.73, 0.81, 0.9, 1.0]
[0.81, 0.9, 1.0, 0]

Optimal Policy:
['D', 'D', 'D', 'D']
['D', 'D', 'D', 'D']
['D', 'D', 'D', 'D']
['R', 'R', 'R', '.']

(The goal keeps a value of 0 because it is terminal and never updated, and ties between equally good Down/Right moves resolve to 'D', the first winner in the actions list.)