## The Curse of Cause of Death Models

Stephen's earlier blog explained the origin of the very useful result relating the life-table survival probability \({}_tp_x\) and the hazard rate \(\mu_{x+t}\), namely:

\[ {}_tp_x = \exp \left( - \int_0^t \mu_{x+s} \, ds \right). \qquad (1) \]
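As a quick numerical illustration of expression (1), the sketch below integrates a hazard rate with the trapezoidal rule and compares the result with the closed form. The Gompertz hazard \(\mu_x = \exp(a + bx)\), the parameter values, and the function names are all assumptions made for this example; they do not come from the earlier blog.

```python
import math

# Hypothetical Gompertz parameters, chosen only for illustration.
A, B = -12.0, 0.1


def gompertz_hazard(age):
    """Assumed hazard rate mu_age = exp(A + B * age)."""
    return math.exp(A + B * age)


def survival_probability(x, t, hazard, steps=1000):
    """Approximate tpx = exp(-integral_0^t hazard(x + s) ds).

    The integral is evaluated with the composite trapezoidal rule.
    """
    h = t / steps
    total = 0.5 * (hazard(x) + hazard(x + t))
    for k in range(1, steps):
        total += hazard(x + k * h)
    return math.exp(-h * total)


x, t = 70.0, 10.0
numeric = survival_probability(x, t, gompertz_hazard)

# For a Gompertz hazard the integral in expression (1) has a closed form.
closed_form = math.exp(-math.exp(A + B * x) * (math.exp(B * t) - 1.0) / B)

print(numeric, closed_form)
```

Any other positive hazard function could be passed in place of `gompertz_hazard`; only the closed-form check is specific to the Gompertz assumption.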

To complete the picture, we assume that the future lifetime of a person now aged \(x\) is a random variable, denoted by \(T_x\), which is connected to expression (1) by:

\[ {}_tp_x = \Pr[ T_x > t ]. \qquad (2) \]

The 'package' of random lifetime \(T_x\), hazard rate \(\mu_{x+t}\), survival function \({}_tp_x\), and expression (1) tying everything together, sums up the *mathematics* of a survival model. For the *statistics*, we have observations on the \(i^{\rm th}\) of \(n\) individuals, namely \(t_x^i\) the observed lifetime, and \(d_x^i\) an indicator equal to 1 if observation ended in death and 0 otherwise. We suppose each \((t_x^i,d_x^i)\) is a sample drawn from the distribution of a bivariate random variable \((T_x,D_x)\). The target of estimation is the distribution of \((T_x,D_x)\).

What does the 'package' above look like if we try to model cause of death, as well as time of death? A natural extension is to associate a random future lifetime with each separate cause. As a simple example, suppose heart attack and stroke are the only causes of death. Then, for a person aged \(x\), define:

\begin{eqnarray*} U_x & = & \mbox{the time until death by heart attack occurs} \\ V_x & = & \mbox{the time until death by stroke occurs}. \end{eqnarray*}

The two causes of death *compete* with each other to be the first to claim their victim, hence the name *competing-risks model* for models of this kind. We discuss these in more depth in Chapter 16 of our forthcoming book, *Modelling Mortality with Actuarial Applications*; Crowder (2001) is another good source.

We can never observe both \(U_x\) and \(V_x\), but only their minimum, which we denote by \(T_x^*\):

\[ T_x^* = \min[U_x , V_x]. \]

Associated with \(T_x^*\) is a survival function denoted by \({}_tp_x^*\):

\[ {}_tp_x^* = \Pr[ T_x^* > t ]. \]

Define a hazard rate \(\lambda_{x+t}\) associated with \(U_x\) (in the presence of \(V_x\)) and a hazard rate \(\nu_{x+t}\) associated with \(V_x\) (in the presence of \(U_x\)) in exactly the same way as the hazard rate \(\mu_{x+t}\) is associated with \(T_x\) (see Stephen's earlier blog). Then we can show that:

\[ {}_tp_x^* = \exp \left( - \int_0^t ( \lambda_{x+s} + \nu_{x+s} ) \, ds \right). \qquad \mbox{(3)} \]
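For readers who want the intermediate step, here is a sketch of why expression (3) holds. It rests on two assumptions that are not spelled out above: that the two causes cannot strike at exactly the same instant, and that the hazard rates "in the presence of" the other cause are defined through an indicator of the cause of death, written here as \(D_x^*\) (1 for heart attack, 2 for stroke):

\[ \lambda_{x+t} = \lim_{h \to 0^+} \frac{\Pr[\, T_x^* \le t+h, \, D_x^* = 1 \mid T_x^* > t \,]}{h}, \qquad \nu_{x+t} = \lim_{h \to 0^+} \frac{\Pr[\, T_x^* \le t+h, \, D_x^* = 2 \mid T_x^* > t \,]}{h}. \]

Every death is attributed to exactly one of the two causes, so the two events in the numerators are disjoint and together make up the event \(T_x^* \le t+h\). Adding the two limits therefore gives the hazard rate of \(T_x^*\) itself:

\[ \lim_{h \to 0^+} \frac{\Pr[\, T_x^* \le t+h \mid T_x^* > t \,]}{h} = \lambda_{x+t} + \nu_{x+t}. \]

Applying expression (1) to the lifetime \(T_x^*\), with this total hazard in place of \(\mu_{x+t}\), yields expression (3). Under these definitions the argument uses only the observable pair \((T_x^*, D_x^*)\); the joint distribution of \(U_x\) and \(V_x\) never enters.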

So we have the *mathematical* 'package' of random lifetimes \(U_x\) and \(V_x\) and their observable minimum \(T_x^*\), hazard rates \(\lambda_{x+t}\) and \(\nu_{x+t}\), survival function \({}_tp_x^*\), and expression (3) tying everything together.

The *statistical* side of the competing risks model is much the same as before, because we can observe only \(T_x^* = \min[U_x,V_x]\), and the cause of death (or, at least, what prevailing practices define as the cause of death). Our observations of the \(i^{\rm th}\) of \(n\) individuals will take the form of the time of death \(t_x^{i*}\), and an indicator \(d_x^{i*}\) of cause of death, for example \(d_x^{i*} = 1\) if heart attack caused death or \(d_x^{i*} = 2\) if stroke caused death. The target of estimation is the joint distribution of the bivariate random variable \((T_x^*, D_x^*)\) from which the observed data \((t_x^{i*}, d_x^{i*})\) are supposed to be sampled.
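To make the data structure concrete, the sketch below estimates the two hazard rates from a handful of records \((t_x^{i*}, d_x^{i*})\). It makes strong simplifying assumptions that are not part of the discussion above: each hazard is constant over time, every life is observed until death, and the four records are invented toy data. Under those assumptions the maximum-likelihood estimate of each hazard is the number of deaths from that cause divided by the total time lived.

```python
def cause_specific_rates(times, causes):
    """Estimate constant cause-specific hazards as deaths / total exposure.

    times  : observed times to death, one per individual
    causes : cause-of-death indicators (1 = heart attack, 2 = stroke)
    """
    exposure = sum(times)  # total time lived by all individuals
    rates = {}
    for label in sorted(set(causes)):
        deaths = sum(1 for d in causes if d == label)
        rates[label] = deaths / exposure
    return rates


# Invented toy data: four individuals, purely for illustration.
times = [2.0, 5.0, 3.0, 10.0]
causes = [1, 2, 1, 1]

rates = cause_specific_rates(times, causes)
print(rates)
```

Note that the estimator uses only the time of death and the recorded cause for each individual, never the two latent lifetimes separately.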

Now comes the crux: *we have said nothing about the dependence or independence of \(U_x\) and \(V_x\)*. For given hazard functions \(\lambda_{x+t}\) and \(\nu_{x+t}\), the lifetimes \(U_x\) and \(V_x\) may be independent, or may have any kind of dependence that is mathematically possible. Expression (3) holds regardless. But this means that *the distribution of \((T_x^*, D_x^*)\) is always the same, whatever the dependence between \(U_x\) and \(V_x\) may be*. The time to death is governed by the same survival function \({}_tp_x^*\) and, if death occurs at time \(t\), the probability that the cause was a heart attack is always the same, namely:

\[ \Pr[\, \mbox{Heart attack} \mid \mbox{Death at time } t \,] = \frac{\lambda_{x+t}}{\lambda_{x+t} + \nu_{x+t}}. \]

As a result, *it is impossible to tell, from any amount of data*, whether these two competing causes of death are dependent or independent. And, if those causes of death are dependent, it is equally impossible to say what the nature of that dependence is.
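A small simulation can make this tangible. The sketch below builds two hypothetical models with the same constant hazards \(\lambda = 0.03\) and \(\nu = 0.02\) (values invented for the example). In the first, \(U_x\) and \(V_x\) are independent exponential lifetimes. In the second, they are strongly dependent by construction: a single time of death is drawn, a cause is assigned, and the "losing" lifetime is set to the time of death plus an arbitrary extra delay. The construction and all names are assumptions made for illustration, not a method from Tsiatis (1975) or Crowder (2001).

```python
import math
import random

LAMBDA, NU = 0.03, 0.02  # hypothetical constant hazards
N = 100_000              # number of simulated lives per model


def independent_pair(rng):
    """U and V drawn independently with hazards LAMBDA and NU."""
    return rng.expovariate(LAMBDA), rng.expovariate(NU)


def dependent_pair(rng):
    """U and V share the same time of death t, so they are dependent."""
    t = rng.expovariate(LAMBDA + NU)
    extra = rng.expovariate(0.1)  # arbitrary delay for the losing cause
    if rng.random() < LAMBDA / (LAMBDA + NU):
        return t, t + extra       # heart attack strikes first
    return t + extra, t           # stroke strikes first


def summarise(pairs):
    """Return (mean observed lifetime, share of heart attacks, corr(U, V))."""
    n = len(pairs)
    mean_time = sum(min(u, v) for u, v in pairs) / n
    share_heart = sum(1 for u, v in pairs if u < v) / n
    mean_u = sum(u for u, _ in pairs) / n
    mean_v = sum(v for _, v in pairs) / n
    cov = sum((u - mean_u) * (v - mean_v) for u, v in pairs) / n
    var_u = sum((u - mean_u) ** 2 for u, _ in pairs) / n
    var_v = sum((v - mean_v) ** 2 for _, v in pairs) / n
    return mean_time, share_heart, cov / math.sqrt(var_u * var_v)


rng = random.Random(2024)
ind = summarise([independent_pair(rng) for _ in range(N)])
dep = summarise([dependent_pair(rng) for _ in range(N)])

print("independent:", ind)
print("dependent:  ", dep)
```

The first two summaries, which are all an observer of \((T_x^*, D_x^*)\) could compute, should agree between the two models up to sampling noise. The third, the correlation between \(U_x\) and \(V_x\), differs sharply, but it can be computed only because the simulation reveals both latent lifetimes. Matching two summaries is of course a sanity check rather than a proof that the observable distributions coincide.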

Our chosen causes, heart attack and stroke, very likely are strongly related, because of genetic and lifestyle factors (e.g. smoking and alcohol consumption). We would expect a cause-of-death model to let us estimate such dependencies. But no.

This is the *identifiability problem* (or, as one statistician called it, crisis). It is usually attributed to Tsiatis (1975), although it was noticed earlier. It poses an obvious challenge to cause-of-death modelling of a rather fundamental nature. Dependencies between different causes, which we believe to exist on biological grounds, can *never* be estimated in a survival model based on random lifetimes. They are forever out of reach, unless we equip the model with an additional mechanism giving rise to the dependencies, which often amounts to assuming the very thing we set out to estimate. A curse indeed.

**References**

Crowder, M. (2001). Classical Competing Risks. *Chapman & Hall/CRC*, Boca Raton, FL.

Macdonald, A. S., Richards, S. J. and Currie, I. D. Modelling Mortality with Actuarial Applications. *Cambridge University Press* (forthcoming).

Tsiatis, A. A. (1975). A Nonidentifiability Aspect of the Problem of Competing Risks. *Proceedings of the National Academy of Sciences, U.S.A.*, **72**, 20–22.