Mathematical Foundations
Reinforcement learning relies on formal mathematical tools to describe how agents evaluate actions and make decisions. This unit introduces the key concepts and equations that form the backbone of classical RL, including value functions, Bellman equations, and the notion of an optimal policy. Understanding these foundations is essential before implementing RL algorithms.
State-Value and Action-Value Functions
In reinforcement learning, the agent’s goal is to maximize expected cumulative reward. To measure how good a state or action is, we define two fundamental functions: state-value and action-value.
State-Value Function Vπ(s)
The state-value function, denoted as Vπ(s), measures how “good” it is for an agent to be in a particular state when following a specific policy π. It captures the total expected reward the agent can collect starting from that state. Understanding Vπ(s) is essential for decision-making in reinforcement learning because it tells the agent which states are valuable and which are not.
V^π(s) = E_π [ G_t | S_t = s ]
Where:
- S_t = s – the agent is in state s at time t.
- π – the policy, which defines the probability of taking each possible action in each state.
- G_t – the total future cumulative reward from time t onward:
G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + … = Σ_{k=0}^{∞} γ^k R_{t+k+1}
- R_{t+1} – reward received immediately after leaving state s.
- γ – discount factor (0 ≤ γ < 1) that reduces the weight of future rewards.
Intuition: Vπ(s) tells us the expected value of being in a state s if the agent acts according to policy π. Higher values mean the state is “better” because it leads to more rewards in the future.
Example: In a grid world, a square near the goal or containing a coin will have a higher Vπ(s) than a square far from rewards.
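The discounted return G_t defined above can be computed directly from a finite sequence of rewards. Here is a minimal Python sketch; the reward values and γ are made-up illustration numbers:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for a finite reward list."""
    g = 0.0
    # Accumulate backward, using the recursion G_t = R_{t+1} + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Rewards received after time t in some hypothetical episode
print(discounted_return([1.0, 0.0, 2.0], gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
```

The backward accumulation avoids computing powers of γ explicitly and is the same recursion the Bellman equations exploit.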
Action-Value Function Qπ(s, a)
The action-value function tells us how good it is to take a specific action in a state, considering not only the immediate reward but also all future rewards if we follow a policy π afterward.
Q^π(s, a) = E_π [ G_t | S_t = s, A_t = a ]
Explanation of symbols:
- S_t = s - starting state.
- A_t = a - action taken at time t.
- π - policy followed after taking action a.
- E_π[·] - expectation over all possible future trajectories.
- G_t - cumulative discounted reward, as defined above.
Intuition: Qπ(s, a) tells us how good a specific action is in a given state, considering both immediate and future rewards under policy π.
Example: In a grid world, Qπ(s, “up”) might be higher than Qπ(s, “down”) if moving up leads more directly toward the goal or coins.
Relation Between Vπ and Qπ
Once Qπ(s, a) is known, the state-value function can be recovered as:
V^π(s) = Σ_a π(a | s) Q^π(s, a)
Intuition: The value of a state is the expected value of all possible actions, weighted by the policy’s probabilities.
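This relation is a one-line weighted sum. A small sketch in Python, with invented policy probabilities and Q-values for a two-action state:

```python
def state_value_from_q(policy_probs, q_values):
    """V^pi(s) = sum_a pi(a|s) * Q^pi(s, a)."""
    return sum(p * q for p, q in zip(policy_probs, q_values))

# Hypothetical state with two actions, e.g. "up" and "right"
pi = [0.5, 0.5]   # pi(a|s) for each action
q = [1.0, 3.0]    # Q^pi(s, a) for each action
print(state_value_from_q(pi, q))  # 0.5*1.0 + 0.5*3.0 = 2.0
```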
Bellman Expectation Equation
The Bellman Expectation Equation expresses a recursive relationship for the value of a state under a given policy π:
V^π(s) = Σₐ π(a | s) Σₛ' P(s' | s, a) [ R(s, a, s') + γ V^π(s') ]
Explanation of symbols:
- V^π(s) - the value of state s under policy π; the expected cumulative reward when starting in state s and following π.
- Σₐ - sum over all possible actions a available in state s.
- π(a | s) - probability of taking action a in state s according to policy π.
- Σₛ' - sum over all possible next states s'.
- P(s' | s, a) - probability of transitioning to state s' from state s by taking action a.
- R(s, a, s') - reward received when transitioning from s to s' using action a.
- γ ∈ [0,1) - discount factor, which reduces the weight of future rewards.
- V^π(s') - value of the next state s', assuming the agent continues following policy π.
Intuition: The value of a state is the expected immediate reward plus the discounted expected value of the next state, averaged over all possible actions and transitions according to the policy π.
Example: In a grid world, if a state has two possible actions, up and right, each with probability 0.5 under π, then V^π(s) is the weighted sum of the expected rewards and next-state values for both actions. This lets the agent estimate the long-term value of each state under its current policy.
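One sweep of this backup over all states is the core of iterative policy evaluation. The sketch below assumes a tabular MDP stored in plain dictionaries; the two-state chain at the bottom (states "A" and "B", single action "go") is an invented example:

```python
def bellman_expectation_update(V, policy, P, R, gamma=0.9):
    """One sweep of V(s) <- sum_a pi(a|s) sum_s' P(s'|s,a) [R(s,a,s') + gamma V(s')].

    V:      dict  state -> current value estimate
    policy: dict  state -> {action: probability}
    P:      dict  (s, a) -> {s': probability}
    R:      dict  (s, a, s') -> reward
    """
    new_V = {}
    for s in V:
        total = 0.0
        for a, pi_sa in policy[s].items():
            for s2, p in P[(s, a)].items():
                total += pi_sa * p * (R[(s, a, s2)] + gamma * V[s2])
        new_V[s] = total
    return new_V

# Toy chain: A -> B earns reward 1, B -> B earns 0; one action "go"
policy = {"A": {"go": 1.0}, "B": {"go": 1.0}}
P = {("A", "go"): {"B": 1.0}, ("B", "go"): {"B": 1.0}}
R = {("A", "go", "B"): 1.0, ("B", "go", "B"): 0.0}
V = bellman_expectation_update({"A": 0.0, "B": 0.0}, policy, P, R, gamma=0.9)
print(V)  # {'A': 1.0, 'B': 0.0}
```

Repeating this sweep until the values stop changing yields V^π for the given policy.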
Bellman Optimality Equation
The Bellman Optimality Equation describes the maximum value an agent can achieve from any state by always choosing the best possible action. It is the foundation for many reinforcement learning algorithms, including dynamic programming, value iteration, and Q-learning.
Optimal State-Value Function V*(s)
The optimal state-value function gives the highest expected cumulative reward the agent can achieve starting from state s:
V*(s) = maxₐ Σₛ' P(s' | s, a) [ R(s, a, s') + γ V*(s') ]
Explanation of symbols:
- V*(s) - optimal value of state s, i.e., the maximum expected return achievable if the agent acts optimally afterward.
- maxₐ - choose the best action a from state s to maximize future reward.
- Σₛ' - sum over all possible next states s' reachable from s by action a.
- P(s' | s, a) - probability of ending up in state s' after taking action a in state s.
- R(s, a, s') - reward received when transitioning from s to s' using action a.
- γ ∈ [0,1) - discount factor, reducing the weight of future rewards.
- V*(s') - optimal value of the next state s', assuming optimal behavior continues.
Intuition: The optimal value of a state is the best total reward achievable if you pick the best action now and then act optimally in all future states.
Example: In a maze, if moving right leads to the exit faster than moving up, then V*(s) is determined by the path that goes right. V*(s) accounts for both immediate rewards (like a coin) and future rewards (the remaining distance to the goal).
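Turning the optimality equation into a loop gives value iteration. The following is a sketch on an invented two-state maze ("near" and an absorbing "goal"); the state names, rewards, and transitions are illustration values, not a general library API:

```python
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-8):
    """Iterate V(s) <- max_a sum_s' P(s'|s,a) [R(s,a,s') + gamma V(s')] to a fixed point."""
    V = {s: 0.0 for s in states}
    while True:
        new_V = {}
        for s in states:
            new_V[s] = max(
                sum(p * (R[(s, a, s2)] + gamma * V[s2])
                    for s2, p in P[(s, a)].items())
                for a in actions[s]
            )
        if max(abs(new_V[s] - V[s]) for s in states) < tol:
            return new_V
        V = new_V

# From "near", "right" reaches the goal (reward 1); "up" stays put (reward 0)
actions = {"near": ["right", "up"], "goal": ["stay"]}
P = {("near", "right"): {"goal": 1.0},
     ("near", "up"): {"near": 1.0},
     ("goal", "stay"): {"goal": 1.0}}
R = {("near", "right", "goal"): 1.0,
     ("near", "up", "near"): 0.0,
     ("goal", "stay", "goal"): 0.0}
V_star = value_iteration(["near", "goal"], actions, P, R, gamma=0.9)
print(V_star)  # {'near': 1.0, 'goal': 0.0}
```

The `max` over actions is exactly the maxₐ in the equation; everything inside it is the same expected-backup as in the expectation equation.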
Optimal Action-Value Function Q*(s, a)
The optimal action-value function gives the expected total reward of taking a specific action a in state s and then acting optimally afterward:
Q*(s, a) = Σₛ' P(s' | s, a) [ R(s, a, s') + γ maxₐ' Q*(s', a') ]
Explanation of symbols:
- Q*(s, a) - optimal value of taking action a in state s.
- s' - possible next states after taking action a.
- a' - possible future actions in state s'.
- P(s' | s, a) - probability of transitioning to s' from s using action a.
- R(s, a, s') - reward received for the transition.
- γ - discount factor for future rewards.
- maxₐ' Q*(s', a') - value of the best possible action in the next state s'.
Intuition: The value of taking action a in state s is the expected reward now plus the best possible rewards in the future, assuming optimal behavior continues.
Example: In a grid world, if moving up now leads closer to the goal, Q*(s, up) will be higher than Q*(s, down). Q*(s, a) lets the agent compare all actions in a state and select the one that maximizes long-term reward.
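The Q-form of the backup can be sketched the same way. Below, one backup is applied to a made-up state with actions "up" and "down"; all names and numbers are illustrative:

```python
def q_optimality_backup(Q, actions, P, R, gamma=0.9):
    """One backup of Q(s,a) <- sum_s' P(s'|s,a) [R(s,a,s') + gamma * max_a' Q(s',a')]."""
    new_Q = {}
    for (s, a), trans in P.items():
        total = 0.0
        for s2, p in trans.items():
            best_next = max(Q[(s2, a2)] for a2 in actions[s2])  # max_a' Q(s', a')
            total += p * (R[(s, a, s2)] + gamma * best_next)
        new_Q[(s, a)] = total
    return new_Q

# "up" moves to the goal (reward 1); "down" stays in place (reward 0)
actions = {"s": ["up", "down"], "goal": ["stay"]}
P = {("s", "up"): {"goal": 1.0}, ("s", "down"): {"s": 1.0},
     ("goal", "stay"): {"goal": 1.0}}
R = {("s", "up", "goal"): 1.0, ("s", "down", "s"): 0.0,
     ("goal", "stay", "goal"): 0.0}
Q = q_optimality_backup({sa: 0.0 for sa in P}, actions, P, R)
print(Q[("s", "up")], Q[("s", "down")])  # 1.0 0.0
```

Iterating this backup converges to Q*, from which both V* and the optimal policy follow.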
Relation Between V*(s) and Q*(s, a)
Once Q*(s, a) is known, the optimal state-value can be computed directly from the action-values:
V*(s) = maxₐ Q*(s, a)
Intuition: The optimal value of a state equals the value of its best action. This relation is crucial because in many RL algorithms (like Q-learning), we often compute Q* first, then derive V* from it.
Defining the Optimal Policy
An optimal policy π∗ is a policy that maximizes the expected return from every state. Formally:
π*(s) = argmaxₐ Q*(s, a)
Explanation of symbols:
- π*(s) - the optimal action to take in state s.
- argmaxₐ - the action a that maximizes the expected total reward.
- Q*(s, a) - optimal action-value function; the expected total reward of taking action a in state s and acting optimally thereafter.
Key points:
- Following π* ensures the agent achieves the highest possible cumulative reward.
- There may be multiple optimal policies, all producing the same optimal state-value function V*(s).
Example: In a maze, π*(s) tells the agent which direction to move in each square to reach the goal as efficiently as possible. If several paths lead to the goal with the same total reward, any of them can correspond to an optimal policy.
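Extracting π* from Q* is a per-state argmax. A minimal sketch, using an invented Q-table for a single state:

```python
def greedy_policy(Q, actions):
    """pi*(s) = argmax_a Q*(s, a), computed state by state."""
    return {s: max(acts, key=lambda a: Q[(s, a)]) for s, acts in actions.items()}

Q = {("s", "up"): 1.0, ("s", "down"): 0.2}
print(greedy_policy(Q, {"s": ["up", "down"]}))  # {'s': 'up'}
```

If several actions tie for the maximum, `max` returns the first one encountered; any tie-breaking rule still yields an optimal policy, matching the note above that multiple optimal policies can exist.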
The Contraction Property
The contraction property ensures that repeated application of the Bellman operator brings value estimates closer to the true value. Formally:
‖ T V − T U ‖∞ ≤ γ ‖ V − U ‖∞
Explanation of symbols:
- T - the Bellman operator, which maps a value function to a new value function using either the Bellman Expectation or Optimality equation.
- V, U - two arbitrary value functions.
- ‖ · ‖∞ - the max-norm, measuring the largest difference across all states.
- γ < 1 - the discount factor; guarantees that the operator is a contraction, shrinking differences between value functions.
Intuition: Each iteration of updating value functions shrinks the difference between current estimates and the true values. Over repeated updates, this ensures convergence to a unique fixed point, which is the optimal value function V*.
Example: Suppose an agent starts with arbitrary value estimates for each state. Applying the Bellman update repeatedly moves these estimates closer to the true V*(s). Because γ < 1, the largest error shrinks by at least a factor of γ on every update, so the values steadily converge.
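The contraction bound can be checked numerically: apply the Bellman backup to two different value functions and compare max-norm gaps before and after. The two-state MDP below (states "A" and "B", one action each) is an invented example with γ = 0.8:

```python
gamma = 0.8
P = {("A", "go"): {"B": 1.0}, ("B", "go"): {"A": 1.0}}
R = {("A", "go", "B"): 1.0, ("B", "go", "A"): 0.0}

def backup(V):
    """Bellman optimality backup (the max is over the single action here)."""
    return {s: max(sum(p * (R[(s, a, s2)] + gamma * V[s2])
                       for s2, p in trans.items())
                   for (s0, a), trans in P.items() if s0 == s)
            for s in V}

V = {"A": 0.0, "B": 0.0}
U = {"A": 10.0, "B": -5.0}
gap = max(abs(V[s] - U[s]) for s in V)          # ||V - U||_inf = 10.0
V2, U2 = backup(V), backup(U)
gap2 = max(abs(V2[s] - U2[s]) for s in V)       # ||TV - TU||_inf
print(gap, gap2)  # 10.0 8.0  (gap2 <= gamma * gap)
```

No matter which two value functions you start from, the gap after one backup is at most γ times the gap before, which is exactly the inequality above.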
Unit Summary
- State-Value and Action-Value Functions: Vπ(s) and Qπ(s, a) measure how good states and actions are under a policy.
- Bellman Expectation Equation: Computes expected cumulative reward for a state based on the policy and next-state values.
- Bellman Optimality Equation: Determines the maximum achievable value by always choosing the best action.
- Optimal Policy (π*): Selects actions that maximize expected total reward from each state.
- Contraction Property: Repeated updates of value functions converge to the optimal values due to γ < 1.