Confounding compounding

Earlier posts discussed the importance of deduplication in annuity portfolios and pension schemes and some of the issues around the deduplication of names, specifically the use of double metaphone to look through common variant spellings of the surname or family name.

One problem is that the surname data is often preceded by first or middle names as well. Or it might be followed by a post-nominal term, as in Douglas Fairbanks Junior. Even trickier is the presence of compound names like Simon Van der Valk, and the fact that in teleservicing Van der Valk sounds awfully like Vandervalk or even Vander Valk.

So trying to match Mr Simon Piet Van der Valk with S VanderValk Senior PHD isn't a walk in the park. If we try a metaphone match on the final token we'll find Valk doesn't match PHD on a primary or alternate basis.

What to do? Well, you can never be perfect in this area, but you can be more than good enough with the appropriate effort. Recognising common trailing terms and disregarding them, as with titles, takes us part of the way there.
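As a rough illustration of disregarding titles and trailing terms, here is a minimal Python sketch. The term lists and the function name `strip_titles_and_suffixes` are hypothetical and chosen only to cover the examples in this post; a production list would be much longer and would need tuning to the portfolio in question.

```python
# Hypothetical, deliberately short term lists for illustration only.
TITLES = {"mr", "mrs", "ms", "miss", "dr"}
TRAILING_TERMS = {"junior", "jr", "senior", "sr", "phd"}


def strip_titles_and_suffixes(name):
    """Return lower-cased name tokens without leading titles or trailing terms."""
    # Split on whitespace and drop surrounding full stops and commas.
    tokens = [t.strip(".,").lower() for t in name.split()]
    tokens = [t for t in tokens if t]
    # Remove any run of titles at the front of the name.
    while tokens and tokens[0] in TITLES:
        tokens.pop(0)
    # Remove any run of post-nominal terms at the end of the name.
    while tokens and tokens[-1] in TRAILING_TERMS:
        tokens.pop()
    return tokens


print(strip_titles_and_suffixes("Mr Simon Piet Van der Valk"))
print(strip_titles_and_suffixes("S VanderValk Senior PHD"))
```

With these lists, the final token of the second name becomes vandervalk rather than PHD, although it still does not match the final token valk of the first name.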

The final trick is to combine space-separated tokens based on a list of compounding name elements, such as the fragment shown here:

[Figure: Compound Name Elements Fragment]

Doing this, you can actually match such complex names on a string equivalence basis without metaphone.
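The merging step might be sketched in Python as follows. This is an illustration under stated assumptions, not the method used in any particular product: the `COMPOUND_ELEMENTS` set and the function name `merge_compound_surname` are hypothetical, and the input is assumed to be lower-cased tokens from which titles and trailing terms have already been removed.

```python
# Hypothetical fragment of compounding name elements, for illustration only.
COMPOUND_ELEMENTS = {"van", "der", "den", "de", "von", "vander"}


def merge_compound_surname(tokens):
    """Join the final token with any compounding elements directly before it."""
    if not tokens:
        return ""
    surname = tokens[-1]
    i = len(tokens) - 2
    # Walk backwards, absorbing tokens for as long as they are compound elements.
    while i >= 0 and tokens[i] in COMPOUND_ELEMENTS:
        surname = tokens[i] + surname
        i -= 1
    return surname


a = merge_compound_surname(["simon", "piet", "van", "der", "valk"])
b = merge_compound_surname(["s", "vandervalk"])
c = merge_compound_surname(["simon", "vander", "valk"])
print(a == b == c)
```

All three variants reduce to the same string, so a plain equality test is enough. One design choice worth noting is that only elements immediately preceding the final token are merged, so a forename that happens to appear in the list elsewhere in the name is left alone.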

This kind of name is of course more common in some territories than others, and some might argue it will be a small part of most portfolios. This may be true, but if it occurs amongst the wealthiest policyholders representing the largest concentration of risk, it has a disproportionate impact.

The general point shouldn't be conceded in any case, since creating statistical models responsibly means making every effort to preserve the independence assumption. And that makes it worth going the extra mile.

Written by: Gavin Ritchie

Publication Date: 08 December 2008

Last Updated: 08 December 2008

Tags: deduplication, duplicates

Deduplicating Names in Longevitas

Longevitas offers ten deduplication schemes, many of which process names for use either directly or as metaphone codes. You can choose which of the core set of prefixes and suffixes to ignore in name processing, as well as supply some of your own. The same applies to merging across compound name elements: you opt out of core terms and supply others as you see fit. Finally, the ability to inspect potential duplicates interactively allows you to investigate your data before finalising the deduplication settings to be applied automatically.
