## Factors

In statistical terminology, a *factor* is a categorisation which contains two or more mutually exclusive values called *levels*. These levels may have a natural order, in which case the variable is said to be an *ordinal factor*. An example might be year of birth: 1931 must lie between 1930 and 1932. Another example would be benefit size band: the 9th decile of sums assured must lie between the 8th and 10th deciles.

In contrast to ordinal factors, a *categorical factor* is a variable where the levels do not have an obvious order. An example of a categorical factor would be gender: all you can say about females is that they are categorically different from males, but whether you list males before females or vice versa is unimportant. Gender is an example of a *binary factor*. Binary factors are often quite powerful explanatory variables, as each requires only one parameter. Other possible examples include smoker status (smoker v. non-smoker) and employment status (employed v. self-employed).
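To see why a binary factor needs only one parameter, consider the usual indicator (dummy) coding, in which one level is treated as the baseline and every other level gets its own 0/1 column. The sketch below is illustrative only; the helper name `dummy_code` and the sample values are assumptions, not part of any particular modelling package.

```python
def dummy_code(values, baseline):
    """Indicator-code a factor: one 0/1 column per non-baseline level.

    A factor with k levels therefore needs k - 1 parameters,
    so a binary factor needs exactly one.
    """
    levels = sorted(set(values) - {baseline})
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


# Hypothetical data: gender with males as the baseline level.
levels, rows = dummy_code(["M", "F", "F"], baseline="M")
print(levels)  # one column only, for "F"
print(rows)
```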

Ideally a risk factor would contain a small number of levels with very strong risk differentials between each level. However, in practice we often have a large number of levels which we would like to simplify into a more manageable number of groups. A common example presents itself with geodemographics: there may be fifty or sixty geodemographic types, but one typically only wants to work with four or five lifestyle groups. There are other examples as well: grouping perhaps fifty different years of birth into a small number of birth cohorts, or finding the best boundary definitions amongst a hundred percentiles of benefit size-band.
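As a small illustration of what such a grouping looks like in practice, an ordinal factor like year of birth can be collapsed into cohorts by choosing a handful of boundary years. The boundaries below are purely hypothetical, as is the function name `cohort_of`; finding good boundaries is the optimisation problem itself.

```python
from bisect import bisect_right


def cohort_of(year_of_birth, boundaries):
    """Map a year of birth to a 0-based cohort index.

    `boundaries` is a sorted list holding the first year of
    cohorts 1, 2, ...; anything earlier falls in cohort 0.
    """
    return bisect_right(boundaries, year_of_birth)


# Hypothetical boundaries giving four birth cohorts.
boundaries = [1925, 1935, 1945]
for year in (1924, 1931, 1935, 1950):
    print(year, cohort_of(year, boundaries))
```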

In each case the problem is the same: how to reduce the number of original groups while maximising the explanatory power of the resulting simplified factor. One way to do this is to fit the factor 'as is' and visually inspect which levels should be grouped together. A more thorough approach is to look at every combination of groupings and pick the one with the lowest Akaike Information Criterion (AIC). This was discussed in a paper presented to the Institute of Actuaries in 2008.
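The exhaustive approach can be sketched for a small ordinal factor. The example assumes a Poisson model with one mortality rate per group, fitted to a death count and an exposure for each level. The model, the figures and the function names are all assumptions made for illustration, not details taken from the paper. The AIC is computed as twice the number of parameters minus twice the maximised log-likelihood.

```python
import math
from itertools import combinations


def group_loglik(deaths, exposure):
    """Maximised Poisson log-likelihood for one group sharing a single rate.

    Terms that do not depend on the grouping are dropped, which leaves
    the ranking of groupings by AIC unchanged.
    """
    if deaths == 0:
        return 0.0
    return deaths * math.log(deaths / exposure) - deaths


def contiguous_groupings(n_levels):
    """Yield every way of cutting n ordered levels into contiguous groups.

    Each grouping is a list of (start, end) slices over the levels.
    """
    for n_cuts in range(n_levels):
        for cuts in combinations(range(1, n_levels), n_cuts):
            bounds = (0, *cuts, n_levels)
            yield list(zip(bounds[:-1], bounds[1:]))


def aic(grouping, deaths, exposure):
    """AIC = 2 * (number of groups) - 2 * log-likelihood."""
    loglik = sum(
        group_loglik(sum(deaths[a:b]), sum(exposure[a:b])) for a, b in grouping
    )
    return 2 * len(grouping) - 2 * loglik


def best_grouping(deaths, exposure):
    """Score every contiguous grouping and return the one with the lowest AIC."""
    return min(
        contiguous_groupings(len(deaths)),
        key=lambda g: aic(g, deaths, exposure),
    )


# Hypothetical data: six ordered levels with two clearly different risk bands.
deaths = [10, 12, 11, 30, 29, 31]
exposure = [1000] * 6
print(best_grouping(deaths, exposure))
```

With six levels there are only 32 contiguous groupings to score, so trying every one is trivial here.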

The problem is that the number of combinations explodes, so it quickly becomes infeasible to look at every one. In the majority of cases one needs a search algorithm to crawl selectively through the combination space. This might mean that the final "optimised" factor is not optimal in absolute terms. After all, one cannot definitively claim something is optimal without exhaustively trying every combination to prove it. However, the final result is typically hard to beat, and the risk of there being a slightly better combination is an acceptable price to pay for finding a good answer in a workable timeframe.
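To put a number on the explosion: an ordinal factor with `n` levels has `2**(n-1)` contiguous groupings, so a hundred percentiles give roughly 6.3 × 10^29 candidates. One simple kind of search, shown here only as an illustration and not necessarily the algorithm used in practice, is a greedy merge: start with every level in its own group and repeatedly merge the adjacent pair that lowers the AIC the most, stopping when no merge helps. The Poisson model, the data and all names are assumptions.

```python
import math


def group_loglik(deaths, exposure):
    """Maximised Poisson log-likelihood for one group (grouping-independent terms dropped)."""
    if deaths == 0:
        return 0.0
    return deaths * math.log(deaths / exposure) - deaths


def aic(groups, deaths, exposure):
    """AIC for a grouping given as lists of level indices, one list per group."""
    loglik = sum(
        group_loglik(sum(deaths[i] for i in g), sum(exposure[i] for i in g))
        for g in groups
    )
    return 2 * len(groups) - 2 * loglik


def greedy_merge(deaths, exposure):
    """Greedy search: merge adjacent groups while doing so lowers the AIC.

    This examines far fewer groupings than an exhaustive search, so the
    result is not guaranteed to be the global optimum.
    """
    groups = [[i] for i in range(len(deaths))]
    current = aic(groups, deaths, exposure)
    while len(groups) > 1:
        # Build every grouping reachable by merging one adjacent pair.
        candidates = [
            groups[:j] + [groups[j] + groups[j + 1]] + groups[j + 2:]
            for j in range(len(groups) - 1)
        ]
        best = min(candidates, key=lambda g: aic(g, deaths, exposure))
        best_aic = aic(best, deaths, exposure)
        if best_aic >= current:
            break  # no merge improves the AIC, so stop
        groups, current = best, best_aic
    return groups, current


# Hypothetical data: six ordered levels with two clearly different risk bands.
deaths = [10, 12, 11, 30, 29, 31]
exposure = [1000] * 6
groups, score = greedy_merge(deaths, exposure)
print(groups, round(score, 2))
```

Each pass scores at most `n - 1` candidate merges, so the whole search needs on the order of `n**2` AIC evaluations rather than `2**(n-1)`.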

A further boost can be had from using parallel processing to test multiple combinations simultaneously. This either allows a greater number of combinations to be examined in the same amount of time, or completes the same search procedure sooner.
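Because each candidate grouping can be scored independently of the others, the work divides naturally across processes. The following is a minimal sketch using Python's standard `concurrent.futures` module, again assuming a Poisson model with one rate per group; the function names, the `use_processes` switch and the data are hypothetical.

```python
import math
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from functools import partial


def group_loglik(deaths, exposure):
    """Maximised Poisson log-likelihood for one group (grouping-independent terms dropped)."""
    if deaths == 0:
        return 0.0
    return deaths * math.log(deaths / exposure) - deaths


def aic(grouping, deaths, exposure):
    """AIC of a grouping given as (start, end) slices over the ordered levels."""
    loglik = sum(
        group_loglik(sum(deaths[a:b]), sum(exposure[a:b])) for a, b in grouping
    )
    return 2 * len(grouping) - 2 * loglik


def best_in_parallel(candidates, deaths, exposure, workers=None, use_processes=True):
    """Score candidate groupings concurrently and return (best grouping, its AIC).

    Separate processes give true parallelism for this CPU-bound work;
    threads are offered only as a lightweight fallback.
    """
    executor_cls = ProcessPoolExecutor if use_processes else ThreadPoolExecutor
    score = partial(aic, deaths=deaths, exposure=exposure)
    with executor_cls(max_workers=workers) as pool:
        scores = list(pool.map(score, candidates))
    best = min(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]
```

When using processes, the call should be made from under an `if __name__ == "__main__":` guard so that worker processes can import the script safely. For very cheap scoring functions the cost of passing data between processes can outweigh the gain, so batching candidates per worker may be worth considering.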
