In reinforcement learning, problems often take the form of a tradeoff between short-term and long-term rewards. Should an agent devote resources to exploring a system and setting up a larger payoff tomorrow? Or is it smarter to cash in accumulated resources for a reward today? In all likelihood, the optimal strategy lies somewhere in the middle, with a rational maximizing agent making a decision of what action to take at each point in time based on the current state of their environment. This paradigm can help us solve problems as varied as elevator scheduling, playing checkers, and swarm simulation. Today, we’ll explore how these problems are normally done -- and one way they could possibly be done more efficiently -- through the application of endogenous gridpoints to tax policy evaluation.

## Markov Decision Processes

The problem formulation described above can be represented mathematically as a Markov Decision Process (MDP). An MDP is a discrete-time stochastic control process, which makes them a suitable framework for modeling decision making when the outcomes of those decisions are partially random and not entirely in the agent’s control. To define an MDP, we need just four things:

**\(S\)**– a finite set of states that an agent can occupy at any given point in time**\(A\)**– a finite set of actions that an agent can take (that may depend on their current state)**\(P\)**or \(f\) – a function governing the agent’s transition from their current state to their state in the next timestep**\(R\)**

**Policy Functions**

Given a fully defined MDP, we can answer a question that often interests researchers: if the agent is at state s, what is the optimal action for them to take? A function that maps states to optimal actions is called the policy function. Mathematically, we represent the policy function as pi, and can define it like so:

$$\pi^* = \underset{\pi}{argmax} \bigg( \sum_{t=1}^T \beta^t R_{a_t}(s_t, s_{t+1})\bigg)$$

**Value Functions**

An adjacent concept to the policy function is the value function. The value function uses the policy function to make assumptions about future agent behavior and in turn calculates the cumulative value of the agent’s decision today and the discounted expected value of the agent’s possible decisions at future timesteps. This yields a recursive definition like so:

$$V(s) := \sum_{s'}P_{\pi(s)}(s, s')\big(R_{\pi(s)}(s,s') + \beta V(s') \big)$$

Note that we can equivalently use a value function to go the other logical direction and define our policy function:

$$\pi(s) := \underset{a}{argmax} \bigg\{ \sum_{s'} P(s' | s,a) \big( R(s'|s,a) + \beta V(s') \big) \bigg\}$$

**Example: Spend-Save Decisions**

The amount of money that a person chooses to spend or save each year can be represented as a Markov Decision Process. We can let an individual’s state be the amount of money they have today, \(y\). This person can then choose to consume an amount \(c\) between \(0\) and \(y\) of that money this year and save the rest. When a person consumes \(c\) they can expect to derive a certain amount of happiness from it. This expected happiness is their reward function \(R\). Finally, given how much money a person starts with \(y\) and how much they choose to consume \(c\), we know the probability distribution of them beginning the next year in various other states \(y\).

The optimal choice of \(c\) is represented by \(\pi(y)\) and the lifetime, cumulative discounted value of having \(y\) dollars at the present moment is \(V(y)\).

## Value Function Iteration

Unfortunately, while it depends on your choice of \(P\) and \(R\), the value function as defined above likely has no closed form solution. To arrive at an approximate solution, we can employ a technique known as value function iteration, plugging in a guess on the right side and calculating the implied left side. This left side then becomes our next best guess on the right side until both sides of the equation converge, i.e. our current best guess for \(V\) and the next iteration of \(V\) are sufficiently similar.

Practically what this entails is discretizing the state space (if not already discrete) and solving for the optimal action at each discrete state. When this has been done for all discrete states, we have what we’d need to linearly interpolate between these points and approximate the value function.

**Example: Income Grid**

In the *Spend-Save *example, we’ll need to discretize our state space to be income levels between \(0\) and the \(\max(y)\) we could ever expect to see. Now, this could be quite difficult -- Jeff Bezos pushes \(\max(y)\) to new heights every day -- but for purposes of our simulation, we’ll consider the \(\max(y)\) to be \(400000\). A discretized version of our one-dimensional “gridspace” can be represented as points equally spaced between \(0\) and \(400000\):

Suppose our initial guess of an optimal policy function is that a person should consume all \(y\) in the current year. We know this isn’t likely the true optimal policy function, but we have to start somewhere. When we do value function iteration, this rudimentary \(V\) serves as the basis for us to calculate a slightly more informed \(V\) that takes into account not only spending in the current timestep, but also likely states we may end in next year as a result and the discounted expected value of potentially inhabiting them.

## Exogenous Gridpoints

To get here, we’ve glossed over a seemingly harmless decision -- we discretized our continuous state space \(y\) from \(0\) to \(400000\) into multiples of \(1000\), at which we’ll calculate the optimal \(c\) until we have our latest iteration of the value function. Doing this for any given gridpoint \(y\) requires us to choose \(c\) that solves the following equation:

$$V(y_t) = \max_{c_t} R(c_t) + \beta \int V(y_{t+1}) dF(\xi)$$

$$\text{such that } c_t < y_t$$

$$\text{and } y_{t+1} = f(y_t - c_t) * \xi_{t+1}$$

By first order conditions, the \(c\) that maximizes the above will also be that which solves the following:

$$R'(c_t) = \beta \int (R' \circ V')(f(y_t - c_t) \xi_t) f'(y_t - c_t) \xi_t dF(\xi)$$

Which can be solved via iterative search, computing the above integral for each guess of optimal \(c\). This computation can be quite expensive and may need to be done for many guesses of \(c\) before you arrive at an answer. And we’ll need to repeat this process \(400\) times just to get the current iteration of our value function with no guarantee that our \(V\) is near convergence.

## Endogenous Gridpoints

Here’s an idea -- when we complete that expensive root-finding process, we do find a \(c\) that is optimal for some \(y\) in our pre-discretization state space, but it may not be the \(y\) we had in mind (given a few assumptions; see applicability section below). In the above example, that means we may have calculated the \(c\) that is optimal for \(y = \$137,567\), but that isn’t in our grid so we keep going. *But why isn’t it in our grid?* Our decision to discretize our state space into multiples of \(1000\) was arbitrary anyways. We’ll linearly interpolate between the points on our grid regardless of if they are evenly spaced and with a sufficient number of gridpoints it will work about the same. So we might as well allow the output of this expensive calculation step to endogenously determine where our gridpoints lie.

We can accomplish this mathematically by reformulating our problem slightly. If we define a to be our ** post-decision, pre-transition state**, \(y - c\), then we can rewrite the above FOC to yield optimal consumption c in terms of post-decision state \(a\):

$$c_t = R'^{-1} \bigg( \beta \int (R' \circ V')(f(a_t)\xi_t)f'(a_t)\xi_t d(F_\xi)\bigg)$$

We can set up an exogenous grid of \(a\) values, feed each point on the grid through this expensive calculation only one time to yield an optimal \(c\) value for that \(a\). In doing so, we also yield an endogenous grid of implied \(y\) values, \(a + c\), where we must have started to expect to arrive at a having taken action \(c\).

This endogenous grid need not be perfectly spaced out, in fact it will probably be anything but, but the penalty for this is small -- we’ll need to interpolate on a non-rectilinear grid to obtain a continuous estimate of the value function across our state space, which is slightly more complicated than interpolation on a rectilinear grid. The tradeoff in efficiency there is certainly worth it, however, and the speed benefits scale with the number of gridpoints. Regardless of grid size, performing value function iteration with the endogenous gridpoints method consistently takes about \(¼\) of the time it takes to use the standard exogenous method.

**Applicability**

This concept can be applied to well-formulated dynamic stochastic optimization problems, given a few assumptions. For one, \(f\) must be differentiable. For each increase in a in the above example, we must be able to quantify the expected effect on the distribution of states tomorrow. Furthermore, the reward function \(R\) must both be differentiable AND have a differentiable inverse. This is necessary in order to back out what c must have chosen to arrive at an exogenous a value. Reward functions that meet these conditions include strictly concave / convex functions like logarithms and many even degree polynomials, but exclude straight lines and polynomials of odd degree.

## Extended Example: Tax Policy Evaluation

Thus far, we’ve detailed how Endogenous Gridpoints can help speed up Value Function Iteration to arrive at an estimate of the optimal policy function. Given a miniature microeconomic system, we’ve alluded to how the introduction of the post-decision state a can help us answer the question of how we might expect a rational person to spend or save given their available resources. The value to us in doing this lies in our ability to tweak the system and see what the effects are over the long term.

We can choose a CRRA (Constant Relative Risk Aversion) utility function to be our reward function and for our evolution function we’ll choose a Cobb-Douglas with a random shock added on top, like so:

$$ R(c) = c^{1-\rho} / (1-\rho) $$

$$ f(a) = (a^\alpha + z) * \xi$$

$$\text{where } \xi \sim \text{Lognormal}(0, 0.1)$$

Note that the above problem setup has a variety of parameters we can choose. A greater \(\rho\) value (between \(0\) and \(1\)) will mean that the marginal reward from consumption will diminish more quickly. Alpha, a value around \(1\), represents the rate of depreciation of unconsumed resources over time. \(X_{i}\), an annual random value around \(1\) represents random shocks that may lead someone’s wealth to unexpectedly rise or fall in the next year, but could be parameterized differently to reflect more volatile conditions. And \(z\) represents a constant, annual income level. In real life, rho and alpha, along with discount rate beta, vary throughout a population. For now, we’ll solve for optimal consumption policy for one individual with a \(\rho\) of \(0.8\), an alpha of \(0.98\), and an annual discount rate of \(3.5\%\).

**Iteration**

To illustrate VFI, we start with an initial estimate of the optimal policy function that \(\pi(y) = y\). For simplicity, we assume \(income = 0\). At each iteration until convergence, we plot the new estimate of VFI below. Our final estimate of the value function is in red and can be interpreted as if a person has (x-axis) dollars, it is optimal for them to consume (y-axis) dollars this year.

**Income**

Now we can add an income level and see how the optimal policy function changes.

Notice that the optimal policy function for someone with an income of \(\$40,000\) has a kink. The kink occurs because at low enough consumption levels (below about \(\$34,000\)) the marginal benefit of consumption is greater than any potential benefits of saving wealth for the next year, after accounting for the expected influx of \(\$40,000\). According to this optimal policy function, whether this person has \(\$0\) or \(\$55,000\) in the bank, if they make \(\$40,000\), it will be optimal for them to consume \(\$34,000\) of their wealth next year. Or put another way, if a person with \(\$55,000\) loses some of that money, their saving behavior will change to bear that entire loss, while their consumption remains the same. On the other hand, if that same person gains some money, they won’t choose to consume any more until their accrued wealth reaches the kink point. This relationship between income and base-level consumption is consistent with the Permanent Income Hypothesis.

We can see how that kink point varies with the size of expected income in the next year. As income increases, the marginal benefit of consumption is stronger than the benefits of retaining depreciating assets at higher and higher consumption levels, despite diminishing marginal returns, leading to the existence of kink points at both higher levels of wealth and of consumption.

**Taxes**

Suppose we are considering instituting a \(5\%\) consumption tax on our microeconomic system and are curious how someone with these risk preferences at various wealth and income levels will respond. We can implement this task by updating our reward function such that individuals only receive reward for \(\$0.95\) out of every \(\$1.00\) they dedicate to consumption, like so:

$$ R(c) = (0.95*c)^{1-\rho} / (1-\rho) $$

Then, we can run the simulation again with this new tax.

As a result of the tax, people need to consume more to get the same reward. Because of this and because the appeal of saving is now slightly less than before, the optimal policy function adjusts to allocate more to consumption as a proportion of wealth to make up for the tax. For a person with these risk preferences, adding a tax on consumption will do more to alter their saving behavior than their consumption behavior.

We can also see that the kink point occurs at higher wealth and consumption levels than before across all income levels. This means that as a result of this tax: a. people with little savings have to spend more to achieve the same basic necessities (things that would make them happier than saving) and b. People need to accrue more wealth before they feel comfortable increasing their spending beyond this base level amount. Furthermore, the magnitude of this effect is not constant across income levels.

**System-wide Analysis**

The above is for only one discount factor and parameterization of our reward and evolution functions. To complete a system-wide analysis would require us to estimate a prior distribution across all these parameters, as well as income and wealth, for our entire population and solve optimal policy functions for each of them, weighting their influence in our system wide analysis by their prevalence in the population. We might also be interested in expanding the dimensionality of simulation. For instance, we may modify our action space to include life insurance, taking out loans, and more efficient instruments of saving; our state space to include income level, retirement, and unemployment; or our reward function to include bequests (along with knowledge of our age and the typical human lifespan) and the disutility of labor. All in all, it can require the estimation of hundreds if not thousands of optimal policy functions for different parameterizations of spenders over large, multidimensional decision grid spaces, both before and after the prospective change in tax policy. Leveraging the method of endogenous gridpoints can drastically speed up this calculation, enabling more complex and accurate economic simulations that can help us to understand potential effects on consumption patterns, generational wealth dynamics, and more.