## Right-Censoring Rules!

A fundamental assumption underlying most modern presentations of mortality modelling (see our new book) is that the future lifetime of a person now age \(x\) can be represented as a non-negative random variable \(T_x\). The actuary's standard functions can then be defined in terms of the distribution of \(T_x\), for example:

\[{}_tp_x = \Pr[ T_x > t ].\]

In fact, all of classical life insurance mathematics follows from this assumption; see Dickson, Hardy and Waters (2013). This is an example of a *probabilistic model* in action. We specify a model in terms of one or more random variables and then calculate the probabilities of interesting events.

The inverse problem is the domain of *statistics*. Given some observed data, find a probabilistic model that best explains them. The usual approach is to sample the quantities of interest and fit a probability distribution using one of many standard methods. So, in our case, sample many values of \(T_x\) and fit a plausible distribution, maybe one of those described in Section 5.9 of our book.

The problem is, we mostly don't get to observe values of \(T_x\). If we have a population (say of pensioners, or life-insurance policyholders) and we observe them for a few calendar years (as we often do) then it is obvious that most of them will still be alive when we cease observation. In our book, we use as a case study a pension scheme with 16,043 members, of whom 2,087 (13%) were observed to die. In 87% of cases, \(T_x\) was not observed. The survivors tell us only that \(x+T_x\) is greater than the age to which, annoyingly, they have survived.

This is *right-censoring*. If someone is alive today, 13th December 2018, when we close the study and analyze the data, then all we know is that they are now alive; we do not observe their age at death. Except in studies of the oldest old, or completed cohorts, most of our observations will be right-censored. So the statistician's usual recipe, described above, doesn't work.

One approach is to define two random variables, \(T\) and \(D\), as follows:

\[ T=\mbox{Total time spent under observation by the individual} \]

and:

\[D=\begin{cases}1\quad\mbox{if observation ceased because the individual died}\\0\quad\mbox{if observation ended by right-censoring}.\end{cases}\]

Clearly, \(T=T_x\) if and only if \(D=1\). These definitions allow us to write down a likelihood and fit a model. This is the essence of *survival analysis*, which is sometimes defined as the statistical analysis of right-censored data.
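To make the likelihood concrete, here is a minimal sketch in Python. It assumes, purely for illustration, a constant hazard \(\mu\). That assumption is not part of the discussion above and would be unrealistic for human mortality over a wide age range. Under it, an individual with observed pair \((T, D)\) contributes \(\mu^{D} e^{-\mu T}\) to the likelihood. The data and the function names are hypothetical.

```python
import math


def log_likelihood(mu, times, deaths):
    """Log-likelihood of a constant hazard mu for right-censored data.

    Each life contributes -mu * t for surviving time t under observation,
    plus log(mu) if observation ended with death (d = 1).
    """
    return sum(d * math.log(mu) - mu * t for t, d in zip(times, deaths))


def fit_constant_hazard(times, deaths):
    """Closed-form maximum-likelihood estimate: deaths / total time observed."""
    return sum(deaths) / sum(times)


# Hypothetical data: five lives, two observed deaths, three right-censored.
times = [2.0, 3.5, 1.2, 4.0, 4.0]   # T: time spent under observation
deaths = [1, 0, 1, 0, 0]            # D: 1 = died, 0 = right-censored

mu_hat = fit_constant_hazard(times, deaths)
print(f"Estimated hazard: {mu_hat:.4f}")
```

The estimate is observed deaths divided by total time observed, so the censored lives still contribute through their exposure even though their \(T_x\) was never seen.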

Another, and ultimately even more fruitful, approach starts by defining some stochastic processes. Suppose we observe a person called Jane, and define two processes \(N(x)\) and \(Y(x)\) as follows, for ages \(x \ge 0\).

\[N(x)=\begin{cases}0\quad\mbox{if Jane's death has not been observed by age $x$}\\1\quad\mbox{if Jane's death has been observed by age $x$}\end{cases}\]

and:

\[Y(x)=\begin{cases}1\quad\mbox{if Jane is alive and under observation at age $x^-$}\\0\quad\mbox{otherwise}\end{cases}\]

where \(x^-\) means the instant before age \(x\). The words 'observed' and 'under observation' are important. They tell us that \(N(x)\) and \(Y(x)\) are not trivial complements of each other; we can have \(N(x)=Y(x)=0\) if Jane is alive but not being observed. They describe observations subject to censoring, including, but not limited to, the right-censoring described above. For example, \(Y(x)\) could equally well describe time Jane spends in an 'able' state, in between spells of illness. (Try doing that with random variables analogous to \(T_x\)! When you've given up, turn to Section 14.7 of our book.) In the simple case of right-censoring, we have the neat results that \(T = \int_x^{\infty} Y(t) \, dt\) and \(D = \lim_{t \to \infty} N(t)\), so this formulation also includes our first model. This way of allowing for censoring was hinted at in an earlier blog.
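The link between the two formulations can be checked numerically. The following Python sketch builds \(N\) and \(Y\) for a single right-censored life and recovers \(T\) and \(D\) from them. The function names, the ages and the upper age limit of 120 are hypothetical choices for illustration. The integral is taken from age 0, which gives the same answer because \(Y\) is zero before observation starts.

```python
def make_processes(entry_age, exit_age, died):
    """Return (N, Y) for one life observed from entry_age to exit_age.

    died is True if observation ended with death, False if right-censored.
    """
    def N(x):
        # 1 once the death has been observed, 0 otherwise
        return 1 if died and x >= exit_age else 0

    def Y(x):
        # 1 if alive and under observation at the instant before age x
        return 1 if entry_age < x <= exit_age else 0

    return N, Y


def time_observed(Y, max_age=120.0, step=0.001):
    """Approximate the integral of Y over [0, max_age] by a midpoint sum."""
    n = int(round(max_age / step))
    return step * sum(Y((k + 0.5) * step) for k in range(n))


# Hypothetical example: Jane enters observation at age 60 and is still
# alive when the study closes at age 67.5 (right-censored).
N, Y = make_processes(entry_age=60.0, exit_age=67.5, died=False)

T = time_observed(Y)   # approximately 7.5 years under observation
D = N(120.0)           # stands in for the limit: N cannot change after exit
print(T, D)
```

Before age 60 both `N` and `Y` return 0, which is the case of Jane being alive but not under observation.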

This approach is more fruitful because it opens up a whole new way of modelling a *life history*, of which a (possibly right-censored) future lifetime is just one example. \(N(x)\) is an example of a special kind of *counting process*: it counts how many events have been *observed* to happen up to and including the present time (or age). With its essential consort \(Y(x)\), it provides all that we need to build a statistical model. In Part Three of our book we find many reasons, both mathematical and statistical, to build models that count events, like \(N(x)\), rather than models that measure times between events, like \(T_x\). The definitive account of this approach is Andersen et al. (1993), but Chapter 17 of our book attempts to give an elementary introduction.

**References**

Andersen, P. K., Borgan, Ø., Gill, R. D. and Keiding, N. (1993). Statistical Models Based on Counting Processes. Springer-Verlag, New York.

Dickson, D. C. M., Hardy, M. R. and Waters, H. R. (2013). Actuarial Mathematics for Life Contingent Risks (second edition). Cambridge University Press.

Macdonald, A. S., Richards, S. J. and Currie, I. D. (2018). Modelling Mortality with Actuarial Applications. Cambridge University Press.