How much data do you need?

We have written before about how survival models make better use of available data. Another way of viewing this is that survival models can make do with smaller data volumes than methods based on the rate of mortality, q_x. But what do we mean by "data volumes"? Should we measure this by claim events, by number of lives or by exposure time? And how much is enough?

For survival models the most sensible measure is a combination of claim events and exposure time. The number of lives is of secondary importance for survival models, since they naturally and easily span multi-year investigations. For a survival model it matters little whether 10,000 life-years of exposure are observed amongst 10,000 people for one year or amongst 5,000 people for two years.
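As a minimal sketch of why lives matter less than exposure, assume a constant claim hazard (a simplification for illustration only; real survival models let the hazard vary with age and other factors). Under that assumption the maximum-likelihood estimate and its approximate standard error depend only on the number of claims and the total exposure time. The function name and the figure of 20 claims are hypothetical, and the small loss of exposure from claimants is ignored.

```python
import math

def hazard_estimate(claims, exposure_years):
    """MLE of a constant hazard and its approximate standard error.

    Both depend on the data only through the claim count and the
    total exposure time, not through the number of lives.
    """
    rate = claims / exposure_years
    std_err = math.sqrt(claims) / exposure_years
    return rate, std_err

# Hypothetical portfolios with the same claims and the same exposure:
# 10,000 people observed for one year versus 5,000 people for two years.
one_year = hazard_estimate(claims=20, exposure_years=10_000 * 1.0)
two_year = hazard_estimate(claims=20, exposure_years=5_000 * 2.0)

print(one_year)
print(two_year)
```

Both calls receive identical inputs once lives and years are multiplied out, so the two investigations carry the same information about the hazard in this simplified setting.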

In an analysis of a critical-illness portfolio we had 267 claims out of nearly 130,000 life-years of exposure. Of these 267 claims, just 56 were made by smokers, who accounted for 20,000 life-years of the exposure time. A natural reaction would be to think that these claim counts are too small to detect any smoker/non-smoker differential. Natural, but mistaken: the survival model we fitted to these data estimated that smokers had a 57% higher critical-illness claim rate than non-smokers, with a standard error of 15%. This gave a p-value of 0.02% for the effect of smoking, i.e. the result was highly significant even at the 0.1% level.

The reason such a significant result can be obtained from so few claims is that the event we are modelling is comparatively rare in this portfolio: on average just two claims for each thousand life-years of exposure. Thus, comparatively few additional claims are required amongst a sub-group to provide significant evidence of higher risk.
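The arithmetic behind this can be sketched with crude rates alone. The snippet below uses the rounded figures quoted in the post (so non-smoker exposure is assumed to be about 110,000 life-years) and a standard large-sample Wald test on the log rate ratio for Poisson counts. This is not the survival model that was fitted: a crude comparison ignores age and any other factors a fitted model would allow for, so it is not expected to reproduce the 57% estimate or its p-value. The function and variable names are illustrative.

```python
import math

# Figures quoted in the post; exposures are rounded, so results are approximate.
TOTAL_CLAIMS = 267
TOTAL_EXPOSURE = 130_000.0   # "nearly 130,000" life-years
SMOKER_CLAIMS = 56
SMOKER_EXPOSURE = 20_000.0

def crude_comparison(claims_a, exposure_a, claims_b, exposure_b):
    """Compare the crude claim rate of group A against group B."""
    rate_a = claims_a / exposure_a
    rate_b = claims_b / exposure_b
    ratio = rate_a / rate_b
    # Large-sample standard error of the log rate ratio for Poisson counts.
    std_err = math.sqrt(1.0 / claims_a + 1.0 / claims_b)
    z = math.log(ratio) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    # Claims expected in group A if it had experienced group B's rate.
    expected_a = rate_b * exposure_a
    return {
        "rate_a_per_1000": 1000.0 * rate_a,
        "rate_b_per_1000": 1000.0 * rate_b,
        "ratio": ratio,
        "std_err_log_ratio": std_err,
        "z": z,
        "p_value": p_value,
        "expected_a": expected_a,
        "excess_a": claims_a - expected_a,
    }

result = crude_comparison(
    SMOKER_CLAIMS,
    SMOKER_EXPOSURE,
    TOTAL_CLAIMS - SMOKER_CLAIMS,
    TOTAL_EXPOSURE - SMOKER_EXPOSURE,
)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

Because the baseline rate is so low, smokers would be expected to generate only a few dozen claims at the non-smoker rate, so an excess of well under twenty claims is already a large proportional increase.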

Even if you think you have relatively little data, you might be surprised what you can achieve with survival models.

Written by: Stephen Richards

Publication Date: 26 March 2010

Last Updated: 26 March 2010

Tags: survival models, data volumes, critical illness

Model types in Longevitas

Longevitas users can choose between seventeen types of survival model (μ_x) and seven types of GLM (q_x). There are also seven extensions of the GLMs for q_x that span multi-year data without violating the independence assumption. Longevitas also offers non-parametric analysis, including Kaplan-Meier survival curves and traditional A/E comparisons against standard tables.
