No calculation without representation

You are in charge of systems programming for an insurer writing disability insurance. It is your job to write reporting modules to meet the needs of the actuaries, claims managers, accountants and so on. Where to start?

The data would seem to be a good place. I'll take it as read what kind of data the business will generate. The question is how to represent it for efficient use in our programs - something we worry about so that the user doesn't have to.

To keep things simple, suppose: (a) that data are sorted into cohorts, so we have everyone age \(x\) at policy inception; (b) that de-duplication has been carried out, so we can assume one policy per individual; and (c) that the policies define just three states:

1. Alive (paying premiums),
2. Ill (receiving benefits) and
3. Dead (possibly receiving a lump-sum benefit on death).

We will refer to these states by their numbers above. Now consider a user's request:

Example 1. The accountants need to know how many persons were alive and in possession of a policy on each monthly policy anniversary after inception. How might you proceed?

Define \(S_i(t)\) to be the state occupied by the \(i\)th policy at time (meaning policy duration) \(t\). We suppose that your system will always be able to work out \(S_i(t)\), though not necessarily cheaply in terms of computing effort. Then the answer is:

\[ \mbox{No. Policies such that } S_i(t) = 1 \; \; \mbox{or} \; \; S_i(t) = 2 \]

which could be calculated in many ways.
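One of those many ways is sketched below in Python. It is a minimal illustration, not a prescription: it assumes each sample path is available as a function of policy duration in years, that everyone starts in state 1, and that paths are right-continuous (the state at an event time is the state just entered). The names `step_path` and `count_in_force` are hypothetical helpers invented for this example.

```python
from typing import Callable, List, Sequence, Tuple

# State numbers: 1 = Alive, 2 = Ill, 3 = Dead.
ALIVE, ILL, DEAD = 1, 2, 3

# A sample path S_i maps policy duration t (in years) to a state.
SamplePath = Callable[[float], int]


def step_path(changes: List[Tuple[float, int]]) -> SamplePath:
    """Hypothetical helper: build S_i(t) from time-ordered (time, new_state) pairs.

    Assumes the policy starts in ALIVE at t = 0 and that the path is
    right-continuous, so the new state applies from the event time onwards.
    """
    def s(t: float) -> int:
        state = ALIVE
        for time, new_state in changes:
            if time > t:
                break
            state = new_state
        return state
    return s


def count_in_force(paths: Sequence[SamplePath], t: float) -> int:
    """Number of policies with S_i(t) = 1 or S_i(t) = 2."""
    return sum(1 for s in paths if s(t) in (ALIVE, ILL))


# Three made-up life histories, purely for illustration.
paths = [
    step_path([]),                           # stays in state 1 throughout
    step_path([(0.5, ILL), (1.25, ALIVE)]),  # falls ill, then recovers
    step_path([(0.75, DEAD)]),               # dies after nine months
]

# Counts at each monthly policy anniversary over the first two years.
monthly_counts = [count_in_force(paths, m / 12) for m in range(25)]
```

Each call to `s(t)` here walks the list of changes, which is exactly the kind of cost that may make a different representation more attractive for other requests.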

Example 2. The actuaries need to know the number of persons who fall ill or who recover from illness during each year after policy inception.

This can be worked out from \(S_i(t)\), but it could be cheaper, computationally, to set up the following functions at outset:

Function	Returns
\(N_i^{12}(t)\)	Number of transitions state 1 \(\rightarrow\) state 2 by time \(t\)
\(N_i^{13}(t)\)	Number of transitions state 1 \(\rightarrow\) state 3 by time \(t\)
\(N_i^{21}(t)\)	Number of transitions state 2 \(\rightarrow\) state 1 by time \(t\)
\(N_i^{23}(t)\)	Number of transitions state 2 \(\rightarrow\) state 3 by time \(t\)

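You may recognize these as the set of counting processes representing the life history. The actuaries' needs for the year from duration \(t\) to \(t+1\) can now be met easily by returning \(\sum_i (N_i^{jk}(t+1) - N_i^{jk}(t))\), taking \(jk = 12\) for onsets of illness and \(jk = 21\) for recoveries.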
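As a rough sketch of how such functions might be set up, the Python below stores, for each policy and each transition type, the sorted times at which that transition occurred, so that \(N_i^{jk}(t)\) becomes a binary search. The class `CountingProcesses` and the function `transitions_in_year` are hypothetical names, and the sketch assumes events are recorded in time order and that a transition at exactly time \(t\) is counted in \(N_i^{jk}(t)\).

```python
from bisect import bisect_right
from typing import Dict, List, Sequence, Tuple

Transition = Tuple[int, int]  # (from_state, to_state), e.g. (1, 2)


class CountingProcesses:
    """Jump times of one policy, held separately for each transition type."""

    def __init__(self) -> None:
        self.times: Dict[Transition, List[float]] = {}

    def record(self, t: float, j: int, k: int) -> None:
        # Assumption: events are recorded in increasing time order,
        # so each list of jump times stays sorted.
        self.times.setdefault((j, k), []).append(t)

    def N(self, j: int, k: int, t: float) -> int:
        """N^{jk}(t): number of j -> k transitions at or before time t."""
        return bisect_right(self.times.get((j, k), []), t)


def transitions_in_year(policies: Sequence[CountingProcesses],
                        j: int, k: int, t: float) -> int:
    """Sum over policies of N_i^{jk}(t + 1) - N_i^{jk}(t)."""
    return sum(p.N(j, k, t + 1) - p.N(j, k, t) for p in policies)


# Two made-up policies, purely for illustration.
p1 = CountingProcesses()
p1.record(0.5, 1, 2)    # falls ill
p1.record(1.25, 2, 1)   # recovers
p1.record(1.75, 1, 2)   # falls ill again
p2 = CountingProcesses()
p2.record(0.75, 1, 3)   # dies

onsets_year_1 = transitions_in_year([p1, p2], 1, 2, 0)
recoveries_year_2 = transitions_in_year([p1, p2], 2, 1, 1)
```

Note that this counts transitions rather than distinct persons; someone who falls ill twice in one year contributes two to the total.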

Example 3. Your company has agreed to provide Longevitas with data for research purposes, so you need to keep a permanent record of every life history. What would be a compact representation for storing this information?

An efficient method would be to record, for the \(i\)th individual: (a) the number \(k_i \ge 0\) of events; (b) a list \(T_i^1 < T_i^2 < \ldots < T_i^{k_i}\) of event times; and (c) at the \(j\)th event time for the \(i\)th individual, a pair \((u_i^j,v_i^j)\) called a mark indicating the two states involved. This scheme compresses, in storage, long periods of time when nothing happens. Conveniently, it lends itself to a linked list structure in software.
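A minimal Python sketch of such a record follows. It is only one possible layout: a plain Python list stands in for the linked list, the class names `Event` and `LifeHistory` are hypothetical, and the consistency checks (strictly increasing times, each event leaving the state last entered) are my own assumptions about what a sensible record should enforce.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Event:
    time: float            # event time T_i^j (policy duration)
    mark: Tuple[int, int]  # mark (u_i^j, v_i^j): state left, state entered


@dataclass
class LifeHistory:
    """Stored record for one individual: just the events, nothing in between."""
    events: List[Event] = field(default_factory=list)

    @property
    def k(self) -> int:
        """Number of events k_i >= 0."""
        return len(self.events)

    def add(self, time: float, u: int, v: int) -> None:
        if self.events:
            last = self.events[-1]
            if time <= last.time:
                raise ValueError("event times must be strictly increasing")
            if u != last.mark[1]:
                raise ValueError("event must leave the state last entered")
        self.events.append(Event(time, (u, v)))


# A made-up history: ill at six months, recovered at fifteen months.
h = LifeHistory()
h.add(0.5, 1, 2)
h.add(1.25, 2, 1)
```

An individual to whom nothing ever happens is stored as an empty list with `k == 0`, which is where the compression comes from.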

The three ways above of representing the data are all equivalent computationally, in the sense that any of them can be reconstructed from any other. They also map on to three standard ways of representing the data from a multiple-state model, which are formally equivalent mathematically (see Jacobsen (2006)).

The function \(S_i(t)\) in Example 1 is the sample path representation.
The functions \(N_i^{jk}(t)\) in Example 2 give the counting process representation.
The random times \(T_i^1, T_i^2, \ldots, T_i^{k_i}\) and marks \((u_i^j,v_i^j)\) in Example 3 give the marked point process (MPP) representation.
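To make the computational equivalence concrete, here is a small Python sketch that starts from the MPP representation and recovers the other two, then goes back from counting-process jump times to the MPP. The function names are hypothetical, events are held as simple `(time, u, v)` tuples, and I assume everyone starts in state 1 and that both \(S_i(t)\) and \(N_i^{jk}(t)\) include an event occurring exactly at time \(t\).

```python
from typing import Dict, List, Tuple

Event = Tuple[float, int, int]  # (event time T, state left u, state entered v)


def sample_path(events: List[Event], t: float, initial: int = 1) -> int:
    """S(t) from the MPP: the state entered at the last event at or before t."""
    state = initial
    for time, _u, v in events:  # events assumed sorted by time
        if time > t:
            break
        state = v
    return state


def counting_process(events: List[Event], j: int, k: int, t: float) -> int:
    """N^{jk}(t) from the MPP: number of events with mark (j, k) by time t."""
    return sum(1 for time, u, v in events if time <= t and (u, v) == (j, k))


def events_from_jump_times(
    jumps: Dict[Tuple[int, int], List[float]]
) -> List[Event]:
    """MPP from counting processes: merge the jump times of every N^{jk}."""
    return sorted((t, j, k) for (j, k), times in jumps.items() for t in times)


# A made-up life history: ill at 0.5, recovered at 1.25, dead at 3.0.
events = [(0.5, 1, 2), (1.25, 2, 1), (3.0, 1, 3)]

state_at_two = sample_path(events, 2.0)
ill_by_one = counting_process(events, 1, 2, 1.0)
rebuilt = events_from_jump_times({(1, 2): [0.5], (2, 1): [1.25], (1, 3): [3.0]})
```

None of this involves a model; it is bookkeeping that moves the same information between three layouts.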

A vital point is that while these all represent data, by which I mean they describe it with enough precision to allow mathematics to proceed, they do not in any sense model the data; there are no intensities, no smoothing parameters, and so on. That comes after the more basic decision, of how to represent the data.

Actuarial eyebrows tend to rise archly when counting processes are mentioned, perhaps because things stochastic cannot be far behind. Counting processes do figure in more advanced treatments (see Macdonald et al. (2018, Chapter 15) or Andersen et al. (1993) for example), but, of themselves, all they do is represent survival data - something that programmers must have re-invented many times.

References:

Andersen, P. K., Borgan, Ø., Gill, R. D. and Keiding, N. (1993). Statistical Models Based on Counting Processes. Springer, New York.

Jacobsen, M. (2006). Point Process Theory and Applications: Marked Point and Piecewise Deterministic Processes. Birkhäuser, Boston.

Macdonald, A. S., Richards, S. J. and Currie, I. D. (2018). Modelling Mortality with Actuarial Applications. Cambridge University Press, Cambridge.

Written by: Angus Macdonald

Publication Date: 09 August 2023

Last Updated: 09 August 2023

Tags: data representation, sample paths, counting process, marked point processes
