## Correlation complications

A basic result in probability theory is that the variance of the sum of two random variables is not necessarily the same as the sum of their variances. Mathematically, the variance of the sum of two random variables, A and B, is as follows:

Var(A+B) = Var(A) + Var(B) + 2*Cov(A,B)                (1)

where Var() denotes the variance and Cov() denotes the covariance.  The above result shows that the variance of A+B equals the sum of the variances only when the covariance (and hence the correlation) is zero.  This is the case when A and B are independent, although zero covariance does not by itself imply independence.  If A and B are positively correlated, for example, then ignoring the covariance term will cause the total variance to be underestimated.  This basic result is relevant to two common scenarios where cause-of-death data are sometimes used: (i) to project mortality rates by cause, and (ii) to examine the impact of "eliminating" a particular cause.
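As a rough numerical illustration of Equation (1), the sketch below computes Var(A+B) from the two variances and a correlation coefficient, using Cov(A,B) = corr × sd(A) × sd(B).  The function name `variance_of_sum` and the input values are purely hypothetical and chosen for easy arithmetic.

```python
import math


def variance_of_sum(var_a: float, var_b: float, corr: float) -> float:
    """Var(A+B) as in Equation (1), with Cov(A,B) = corr * sd(A) * sd(B)."""
    cov = corr * math.sqrt(var_a) * math.sqrt(var_b)
    return var_a + var_b + 2.0 * cov


# Hypothetical inputs: Var(A) = 4 and Var(B) = 9.
for corr in (0.0, 0.5):
    print(corr, variance_of_sum(4.0, 9.0, corr))
```

With these made-up inputs the variance of the sum is 13 under zero correlation and 19 with a correlation of 0.5, so assuming zero correlation here would understate the variance by roughly a third.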

Carriere (1994) wrote about attempts to investigate the impact of "eliminating" a cause of death.  He showed mathematically that a proper cause-of-death elimination is impossible without knowing the correlation structure between the various causes of death.  For convenience, many attempts at cause-of-death elimination simply assume that the causes are uncorrelated, even though this is not a particularly realistic assumption.  As an illustration, consider a simple model with three competing risks: (i) heart disease, (ii) cancer, and (iii) all other causes.  Since we know that smoking causes both heart disease and numerous cancers, for example, the mortality rates in categories (i) and (ii) are positively correlated.  There are numerous other correlating factors, of course, such as diet, so any attempt at a cause-of-death elimination should acknowledge the correlation structure between the sub-groups.  Unfortunately, Carriere (1994) also showed that it is impossible to know what these correlations actually are.  The correlations are unlikely to be simple, so some heroic assumptions have to be made.  Even so, it isn't a good idea to ignore the problem entirely and assume zero correlation for convenience.
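To make the smoking example a little more concrete, here is a toy sketch of how a shared risk factor induces positive correlation between two competing risks.  It is not Carriere's method.  It assumes, purely for illustration, that a fraction of the population smokes, that smokers' hazards for both heart disease and cancer are a common multiple of non-smokers' hazards, and that, given smoking status, the latent times to death from each cause are independent exponentials.  The function name, parameters and hazard values are all hypothetical.

```python
import math


def latent_time_correlation(p_smoker: float, h_heart: float,
                            h_cancer: float, multiplier: float) -> float:
    """Correlation between latent times to death from heart disease and cancer.

    Toy assumptions: given smoking status, the two latent times are independent
    exponentials; smokers' hazards are `multiplier` times those of non-smokers.
    """
    def moments(hazard: float):
        # Conditional mean lifetimes for non-smokers (m0) and smokers (m1).
        m0, m1 = 1.0 / hazard, 1.0 / (multiplier * hazard)
        mean = (1 - p_smoker) * m0 + p_smoker * m1
        # For an exponential lifetime, E[T^2] = 2 * mean^2 within each group.
        second = (1 - p_smoker) * 2 * m0 ** 2 + p_smoker * 2 * m1 ** 2
        return m0, m1, mean, second - mean ** 2

    a0, a1, mean_a, var_a = moments(h_heart)
    b0, b1, mean_b, var_b = moments(h_cancer)
    # Conditional independence: E[T1*T2] is the group-weighted product of means.
    cross = (1 - p_smoker) * a0 * b0 + p_smoker * a1 * b1
    cov = cross - mean_a * mean_b
    return cov / math.sqrt(var_a * var_b)


# Hypothetical inputs: 30% smokers, hazards doubled for smokers.
print(latent_time_correlation(0.3, 0.01, 0.008, 2.0))
```

In this toy set-up the correlation is positive whenever the multiplier differs from one, and it vanishes when smokers and non-smokers share the same hazards.  The causes are correlated in the population as a whole even though they are independent within each group.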

This problem also applies to attempts at projecting disaggregated mortality rates.  There are many practical problems with cause-of-death projections, some of which are discussed in Richards (2010).  As with "elimination"-style exercises, one of the biggest problems is the correlation between the various categories for cause of death.  Once again, the assumption of independence (and hence zero correlation) is often made to make life easier for the model builder, rather than because there is much evidence for it.

Any projection which ignores correlations like this would be misleading, especially as regards the uncertainty over the projected rates.  The reason lies in Equation (1): if you ignore the covariance (correlation), then you mis-state the total variance.  In particular, if positive correlations between the causes are ignored, the uncertainty over the projections is understated.  As Iain discussed earlier, correlations cannot be brushed aside in projections work, however inconvenient they may be!
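The same arithmetic extends beyond two variables.  The sketch below applies the multi-variable analogue of Equation (1) to three cause-of-death categories, comparing the standard deviation of the total under independence with that under a common pairwise correlation.  The single shared correlation, the function name `total_sd` and the standard deviations are simplifying assumptions for illustration only.  Real correlations between causes are unlikely to be this tidy.

```python
import math


def total_sd(sds: list[float], corr: float) -> float:
    """Standard deviation of a sum, given component standard deviations and a
    single (assumed) pairwise correlation shared by every pair of components."""
    n = len(sds)
    var = sum(s * s for s in sds)
    # Each unordered pair (i, j) contributes 2 * Cov = 2 * corr * sd_i * sd_j.
    for i in range(n):
        for j in range(i + 1, n):
            var += 2.0 * corr * sds[i] * sds[j]
    return math.sqrt(var)


# Hypothetical standard deviations for projected deaths from heart disease,
# cancer and all other causes.
sds = [3.0, 4.0, 12.0]
independent = total_sd(sds, 0.0)
correlated = total_sd(sds, 0.5)
print(independent, correlated)
```

With these made-up figures the independence assumption gives a standard deviation of 13 for the total, whereas a pairwise correlation of 0.5 gives a little over 16, so any interval built on the independence assumption would be too narrow.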

References

Carriere, J. F. (1994) Dependent decrement theory, Transactions of the Society of Actuaries, XLVI, 1–21.

Richards, S. J. (2010) Selected issues in modelling mortality by cause and in small populations, British Actuarial Journal, 15 (supplement), 267–283.