## Simulating the Future

This blog has two aims: first, to describe how we go about simulation
in the Projections Toolkit; second, to emphasize the important role a
model has in determining the width of the confidence interval of the
forecast.

We use US male mortality data for years 1970 to 2009 downloaded from
the Human Mortality Database. Figure 1 shows the observed log
mortality. Unlike UK mortality (which shows accelerating improvements
in log mortality over the same period), the US improvement is
reasonably well described by a straight line. We fit the simplest of models:
\(y_j = a + b x_j + \epsilon_j\), where \(x_j\) is year \(j\), \(y_j =
\log(d_j/e_j)\) with \(d_j\) the observed number of deaths in year \(j\)
and \(e_j\) the corresponding central exposed to risk; the error terms
are independent \({\cal N}(0,\,\sigma^2)\). This is the familiar linear
regression model. (Of course, these data are better modelled by a
Gompertz model with Poisson errors but our purpose here is to keep
things as simple as possible so that our main points are made
clearly.) The fitted regression is also shown in Figure 1.

Figure 1. Log mortality with fitted linear regression.
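
To make the model concrete, here is a minimal sketch of the least-squares fit in plain Python. The series is synthetic, generated from an assumed straight line with noise; it is a stand-in for, not a copy of, the Human Mortality Database data, and the names `fit_line`, `true_a` and `true_b` are illustrative only.

```python
import math
import random

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b, sigma_hat)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    sigma_hat = math.sqrt(rss / (n - 2))  # n - 2: two coefficients estimated
    return a, b, sigma_hat

# Synthetic stand-in for the observed log mortality (NOT the HMD data).
rng = random.Random(1)
years = list(range(1970, 2010))
true_a, true_b, true_sigma = 36.0, -0.02, 0.022  # assumed values, for illustration
log_m = [true_a + true_b * yr + rng.gauss(0.0, true_sigma) for yr in years]

a_hat, b_hat, sigma_hat = fit_line(years, log_m)
print(f"a = {a_hat:.3f}, b = {b_hat:.5f}, sigma = {sigma_hat:.4f}")
```

With real data one would replace `log_m` by \(\log(d_j/e_j)\) for each year.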

Suppose we wish to forecast log(mortality) to 2050. Forecasts in R
are done with the \({\tt predict}\) function and importantly this function
has the option of computing the standard error of both the fit and the
forecast. Figure 2 shows the fitted and forecast values with their
standard errors. Readers may be surprised at the narrow confidence
intervals. Intuitively, forecasts so far into the future should not
be made with such apparent certainty. This is our second point. The
width of the confidence interval is determined not only by the data
but also by the model assumption. Here our data varies tightly around
the fitted mean line so the estimate of the residual variance is small
\((\hat \sigma = 0.022)\). We have also made a very strong model
assumption and the strength of this assumption also leads to tight
confidence intervals. Thus model risk affects not only the central
forecast but also the confidence of that forecast.

Figure 2. Forecast of log mortality with 95% confidence interval.
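
The standard error of the fitted or forecast mean at year \(x_0\) in this model is \(\hat\sigma\sqrt{1/n + (x_0 - \bar x)^2/S_{xx}}\), with \(S_{xx} = \sum_j (x_j - \bar x)^2\). The sketch below computes it out to 2050 on synthetic data (again an assumed stand-in, not the HMD series). It uses a normal quantile for the 95% band as an approximation; a \(t\) quantile on \(n-2\) degrees of freedom would be slightly wider. The function name `fit_and_forecast` is hypothetical.

```python
import math
import random
from statistics import NormalDist

def fit_and_forecast(x, y, x_new):
    """Least-squares line; returns means m and standard errors s of the mean at x_new."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    rss = sum((yi - y_bar - b * (xi - x_bar)) ** 2 for xi, yi in zip(x, y))
    sigma_hat = math.sqrt(rss / (n - 2))
    m = [y_bar + b * (x0 - x_bar) for x0 in x_new]
    s = [sigma_hat * math.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx) for x0 in x_new]
    return m, s, sigma_hat

# Synthetic stand-in data (assumed line and noise level, for illustration only).
rng = random.Random(1)
years = list(range(1970, 2010))
log_m = [36.0 - 0.02 * yr + rng.gauss(0.0, 0.022) for yr in years]

horizon = list(range(1970, 2051))          # fitted years plus forecast to 2050
m, s, sigma_hat = fit_and_forecast(years, log_m, horizon)

z975 = NormalDist().inv_cdf(0.975)         # normal approximation to the t quantile
lower = [mi - z975 * si for mi, si in zip(m, s)]
upper = [mi + z975 * si for mi, si in zip(m, s)]
print(f"2050: mean {m[-1]:.3f}, se {s[-1]:.4f}, 95% band ({lower[-1]:.3f}, {upper[-1]:.3f})")
```

Even at 2050 the standard error of the mean stays below \(\hat\sigma\), which is why the band looks so narrow.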

We turn now to simulation. There are two ways of proceeding. The
more obvious way is to simulate from the distribution of the estimates
of the coefficients. If \(\boldsymbol{\theta} = (a,\,b)'\) then we have the well
known result from linear regression that \(\hat {\boldsymbol{\theta}} \sim {\cal
N}(\boldsymbol{\theta_0}, \sigma^2(\boldsymbol{X'} \boldsymbol{X})^{-1})\) where \(\boldsymbol{\theta_0}\) is the
true but unknown value of \(\boldsymbol{\theta}\) and \(\boldsymbol{X}\) is the regression
matrix.
We can allow for the uncertainty in the estimation of
\(\boldsymbol{\theta}\) by simulating values \(\boldsymbol{\tilde \theta}\) from \({\cal N}(
\boldsymbol{\hat \theta}, \hat \sigma^2(\boldsymbol{X'} \boldsymbol{X})^{-1})\). If \(\boldsymbol{\tilde \theta}
= (\tilde a, \tilde b)'\) is a simulated value of \(\boldsymbol{\theta}\), then the
simulated forecast for year \(x\) is \(\tilde a + \tilde b x\).
The Toolkit uses a more direct method. We already know the
fitted and forecast means and their standard errors, which we denote by
the vectors \(\boldsymbol{m}\) and \(\boldsymbol{s}\) respectively; see Figure 2. The
simulated values of the fitted and forecast means are given by
\(\boldsymbol{m} + z\boldsymbol{s}\) where \(z \sim {\cal N}(0,\,1)\). It is a simple matter to check
that the two methods are equivalent in the sense that, for any given year, both
give a simulated value with the same normal distribution, whose mean and standard
deviation are the corresponding entries of \(\boldsymbol{m}\) and \(\boldsymbol{s}\).
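
A quick numerical check of this equivalence at a single year, as a sketch under stated assumptions. The residual standard deviation 0.022 and the years 1970 to 2009 come from the text; the fitted level `y_bar_hat` and slope `b_hat` are hypothetical values chosen for illustration. The first method is written in terms of the centred regressor \(x - \bar x\), for which \(\boldsymbol{X'}\boldsymbol{X}\) is diagonal, so the two coefficients can be drawn independently; this is the same distribution as drawing \((\tilde a, \tilde b)'\) jointly.

```python
import math
import random
import statistics

years = list(range(1970, 2010))
n = len(years)
x_bar = sum(years) / n
sxx = sum((x - x_bar) ** 2 for x in years)
sigma_hat = 0.022                  # residual standard deviation quoted in the text
y_bar_hat, b_hat = -3.8, -0.02     # hypothetical fitted level and slope

def mean_and_se(x0):
    """Mean forecast and its standard error at year x0."""
    m = y_bar_hat + b_hat * (x0 - x_bar)
    s = sigma_hat * math.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)
    return m, s

def simulate_from_coefficients(x0, rng):
    """Method 1: simulate the coefficients, then evaluate the line at x0."""
    level = rng.gauss(y_bar_hat, sigma_hat / math.sqrt(n))
    slope = rng.gauss(b_hat, sigma_hat / math.sqrt(sxx))
    return level + slope * (x0 - x_bar)

def simulate_direct(x0, rng):
    """Method 2: mean forecast plus z times its standard error."""
    m, s = mean_and_se(x0)
    return m + rng.gauss(0.0, 1.0) * s

rng = random.Random(2024)
x0 = 2050
draws_1 = [simulate_from_coefficients(x0, rng) for _ in range(20000)]
draws_2 = [simulate_direct(x0, rng) for _ in range(20000)]

m_2050, s_2050 = mean_and_se(x0)
sd_1 = statistics.stdev(draws_1)
sd_2 = statistics.stdev(draws_2)
print(f"analytic se {s_2050:.5f}, method 1 sd {sd_1:.5f}, method 2 sd {sd_2:.5f}")
```

Both sample standard deviations should sit close to the analytic standard error, up to Monte Carlo error.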

This is an important result as far as the Toolkit is concerned since
it applies to time series models as well. With a time series model we
obtain the mean forecast together with its standard error. Simulation
can now proceed with the second method in the previous paragraph.

Of course, there are other things to worry about when simulating
future mortality. The above method deals only with *parameter risk*. Both
*model risk* and *stochastic risk* must also be
accounted for. The Projections Toolkit has a selection of models to
help with the former, and parameter risk and stochastic risk can be
independently switched on and off as required within sample path generation.
See my earlier blog for details.
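
For the regression example in this post, switching the two risks on and off might look like the sketch below. This is not the Toolkit's interface: `sample_path` and its flags are hypothetical, and treating stochastic risk as independent \({\cal N}(0,\,\hat\sigma^2)\) noise in each year is an assumption that follows from the simple regression model used here. The fitted level and slope are again illustrative values.

```python
import math
import random

def sample_path(m, s, sigma, rng, parameter_risk=True, stochastic_risk=True):
    """Hypothetical helper: one simulated path of log mortality.

    parameter_risk:  shift the mean path by z * s with a single z ~ N(0, 1).
    stochastic_risk: add independent N(0, sigma^2) noise in each year.
    """
    z = rng.gauss(0.0, 1.0) if parameter_risk else 0.0
    path = []
    for m_j, s_j in zip(m, s):
        value = m_j + z * s_j
        if stochastic_risk:
            value += rng.gauss(0.0, sigma)
        path.append(value)
    return path

# Illustrative inputs: years and sigma from the text, level and slope assumed.
years = list(range(1970, 2010))
n = len(years)
x_bar = sum(years) / n
sxx = sum((x - x_bar) ** 2 for x in years)
sigma_hat = 0.022
y_bar_hat, b_hat = -3.8, -0.02     # hypothetical fitted level and slope

future = list(range(2010, 2051))
m = [y_bar_hat + b_hat * (x - x_bar) for x in future]
s = [sigma_hat * math.sqrt(1 / n + (x - x_bar) ** 2 / sxx) for x in future]

rng = random.Random(7)
mean_only = sample_path(m, s, sigma_hat, rng, parameter_risk=False, stochastic_risk=False)
parameter_only = sample_path(m, s, sigma_hat, rng, stochastic_risk=False)
both = sample_path(m, s, sigma_hat, rng)
```

With both flags off the path is just the mean forecast; with only parameter risk on, the whole path moves by the same multiple of its standard error.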

**References:**

R Core Team (2014). *R: A Language and Environment for Statistical
Computing*. Vienna: R Foundation for Statistical Computing.

Human Mortality Database. University of California, Berkeley,
USA. Available at www.mortality.org (data downloaded 2012).
