Visualising data-quality in time

(Nov 30, 2020)

In a recent blog I defined the Nelson-Aalen estimate with respect to calendar time, rather than with respect to age as is usual.  I showed how a simple difference of this estimate could reveal seasonal patterns in mortality, and how it could identify shocks like COVID-19.  However, this time-based non-parametric estimator also turns out to be handy for detecting data-quality issues.

To recap, the Nelson-Aalen estimate of the integrated hazard from time \(y\) to \(y+t\) is denoted \(\hat\Lambda_{y,t}\); it is defined as follows:

\[\hat\Lambda_{y,t} = \sum_{t_i\le t}\frac{d_{y+t_i}}{l_{y+t_i^-}}\qquad (1)\]

for a set \(\{y+t_i\}\) of distinct times (dates) of death with \(d_{y+t_i}\)…
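As a minimal sketch of how equation (1) can be evaluated, the function below accumulates the deaths at each distinct death date over the number of lives at risk just beforehand.  The names here are illustrative: the `at_risk` function stands in for \(l_{y+t_i^-}\) and would in practice be derived from the exposure data.

```python
from collections import Counter
from datetime import date

def nelson_aalen(death_dates, at_risk):
    """Nelson-Aalen estimate of the integrated hazard in calendar time.

    death_dates: one entry per death (dates need not be distinct).
    at_risk: function returning the number of lives at risk just
             before a given date (the l_{y+t_i^-} in equation (1)).
    Returns a list of (date, cumulative hazard) pairs.
    """
    deaths = Counter(death_dates)          # d_{y+t_i} at each distinct date
    cum = 0.0
    estimate = []
    for t in sorted(deaths):
        cum += deaths[t] / at_risk(t)      # d_{y+t_i} / l_{y+t_i^-}
        estimate.append((t, cum))
    return estimate

# Example: 100 lives at risk throughout; two deaths on 10 Jan, one on 5 Feb.
est = nelson_aalen(
    [date(2020, 1, 10), date(2020, 1, 10), date(2020, 2, 5)],
    at_risk=lambda t: 100,
)
```

Differencing the resulting estimate over fixed calendar intervals (say, month on month) is what surfaces seasonal swings and shocks.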

Read more

Tags: data validation, missing data, Nelson-Aalen

Dealing with missing data

(Oct 14, 2011)

In an earlier post we looked at how to create a proxy for ill-health early retirements based on age at commencement.  This is an example of dealing with missing data: we infer a useful proxy to replace the lost or missing health status at retirement.

Another common problem occurs during data or system migrations, where historical experience data is often not carried across to a new administration system.  Migrations happen when a life office consolidates multiple systems into one, or when a pension scheme changes administrator.  System migrations aren't easy, and migrating historical data is usually one of the last tasks on the priority list.  As a result, data migration is unfortunately one of the first…

Read more

Tags: missing data

Summary judgement

(Jul 18, 2011)

In previous posts we have looked at problems with the quality and reliability of cause-of-death data, and a list of hurdles for mortality projections based on such data.  One other issue is that of detail.  While cause-of-death data is spread over thousands of individual causes, important detail is lost on the most important mortality risk factor of all: age.  Oeppen (2008) states the problem:

"deaths are often tabulated by 5 year age groups and the open age interval into which the deaths of the oldest-old are aggregated is often defined at a relatively young age such as 85. Unfortunately, it is at these high ages where most of the temporal dynamics are occurring."

Read more

Tags: cause of death, missing data

Forecasting mortality at high ages

(Feb 28, 2011)

The forecasting of future mortality at high ages presents additional challenges to the actuary.  As an illustration of the problem, let us consider the CMI assured-lives data set for years 1950-2005 and ages 40-100 (see Stephen's blog posts on selection and data volumes).  The blue curve (partly hidden under the green curve) in Figure 1 shows observed log(mortality) averaged over time.  A striking feature of this curve is the suggestion of data-quality issues above age 95:

Figure 1. log(mortality) by age for CMI assured-lives data, ages 40-100.


We don't believe mortality rates fall at high ages, so there must be a problem with the data.  The obvious first solution is simply to model mortality up to an age where…
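One way to picture the "model up to a reliable age" idea is a straight-line extrapolation of log(mortality) above a chosen cutoff, sketched below.  This is a deliberately crude illustration under assumed inputs, not the approach used for real projection work, where more flexible smoothers are preferred.

```python
def extrapolate_log_mortality(log_mu, max_reliable_age, fit_window=10):
    """Fit a least-squares straight line through log(mortality) at the
    last fit_window reliable ages, and return a function extending it
    to any higher age.  log_mu maps age -> observed log(mortality)."""
    ages = [a for a in sorted(log_mu) if a <= max_reliable_age][-fit_window:]
    xbar = sum(ages) / len(ages)
    ybar = sum(log_mu[a] for a in ages) / len(ages)
    slope = (sum((a - xbar) * (log_mu[a] - ybar) for a in ages)
             / sum((a - xbar) ** 2 for a in ages))
    return lambda age: ybar + slope * (age - xbar)

# Synthetic Gompertz-like data: log(mortality) linear in age.
log_mu = {a: -10.0 + 0.1 * a for a in range(40, 101)}
fitted = extrapolate_log_mortality(log_mu, max_reliable_age=95)
```

Because the synthetic data are exactly linear, the fitted line reproduces them; with real data the choice of cutoff age and fitting window would matter greatly.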

Read more

Tags: missing data, mortality projections, age extrapolation

Out for the count

(Jul 31, 2009)

In an earlier post we described a problem when fitting GLMs for qx over multiple years.  The key mistake is to divide up the period over which an individual was observed in a model for individual mortality.  This violates the independence assumption and leads to parameter bias (amongst other undesirable consequences).  If someone has three records with initial ages 60, 61 and 62, then these are not independent trials: the mere existence of the record at age 62 tells you that there was no death at age 60 or 61.

Life-company data often comes as a series of in-force extracts, together with a list of movements.  The usual procedure is to re-assemble the data to create a single record for each policy, using the policy number…
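A minimal sketch of this re-assembly step is given below, with illustrative field names (`policy_id`, `date`, `type`) rather than any real system's layout: each policy gets a single record spanning its whole observed period, closed by a death movement where one exists.

```python
from datetime import date

def assemble_records(extracts, movements):
    """Rebuild one observation record per policy from a series of
    in-force extracts plus a movement list.  Field names illustrative.
    Returns {policy_id: {"entry": date, "exit": date, "died": bool}}."""
    records = {}
    for snapshot in extracts:                  # one list of rows per extract date
        for row in snapshot:
            rec = records.setdefault(
                row["policy_id"],
                {"entry": row["date"], "exit": row["date"], "died": False},
            )
            rec["entry"] = min(rec["entry"], row["date"])
            rec["exit"] = max(rec["exit"], row["date"])
    for mv in movements:                       # a death movement closes the record
        if mv["type"] == "death" and mv["policy_id"] in records:
            records[mv["policy_id"]]["exit"] = mv["date"]
            records[mv["policy_id"]]["died"] = True
    return records

# Policy "A" appears in two annual extracts and then dies mid-2020.
extracts = [
    [{"policy_id": "A", "date": date(2019, 1, 1)}],
    [{"policy_id": "A", "date": date(2020, 1, 1)}],
]
movements = [{"policy_id": "A", "type": "death", "date": date(2020, 6, 1)}]
recs = assemble_records(extracts, movements)
```

The single record per policy is what a survival model needs; fitting to the separate yearly rows instead is exactly the independence violation described above.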

Read more

Tags: survival models, force of mortality, GLM, missing data

Sweating your data assets

(Apr 16, 2009)

In recent years insurers have looked to make better use of the data they already have.  The appeal is simple: if you have already collected the data, it is like leaving money on the table not to exploit it to the full.  Worse, if your competitors make better use of their data, you can be selected against and lose money.

The biggest change has been in insurers' attitude towards the use of postcodes. Postcodes have to be collected and maintained anyway as part of normal business, so any extra value which can be squeezed out of them is a low-cost bonus.  As we will see, this can sometimes even be a zero-cost bonus.

Every UK residential postcode can be assigned a geodemographic type to describe the sort of people…

Read more

Tags: postcodes, geodemographics, smoking, missing data, P-squared

Early retirements

(Mar 25, 2009)

Members of defined-benefit pension schemes can often retire early if they are in poor health.  Unsurprisingly, such ill-health retirements exhibit higher mortality rates than those who retire at the normal scheme age.

Over time, however, the information on the health status of a pensioner is often lost.  When administrators are changed, for example, the original reason for retirement may not be migrated across onto the new payment system.  This poses a dilemma: we know that the reason for retirement will be a material risk factor, but we often won't know the reason codes for all pensioners.

One solution adopted by the CMI in the U.K. is to assume that everyone whose pension began before a certain age has retired…
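The proxy described above can be sketched in a few lines.  The cutoff age used here is purely illustrative, not the CMI's actual figure:

```python
ILL_HEALTH_AGE_CUTOFF = 60  # illustrative threshold only

def retirement_proxy(age_at_commencement, cutoff=ILL_HEALTH_AGE_CUTOFF):
    """Proxy for a missing health-status field: treat pensions
    commencing before the cutoff age as ill-health retirements,
    and the rest as normal retirements."""
    return "ill-health" if age_at_commencement < cutoff else "normal"
```

The proxy is crude — some early retirements are healthy and some normal-age retirees are not — but it restores a material risk factor that would otherwise be lost entirely.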

Read more

Tags: early retirement, missing data, Kaplan-Meier
