Confounding Compounding

Earlier posts discussed the importance of deduplication in annuity portfolios and pension schemes and some of the issues around the deduplication of names, specifically the use of double metaphone to look through common variant spellings of the surname or family name.

One problem is that the surname data is often preceded by first or middle names. It may also be followed by a post-nominal term, as in Douglas Fairbanks Junior. Even trickier are compound names like Simon Van der Valk, given that in teleservicing Van der Valk sounds awfully like Vandervalk or even Vander Valk.

So trying to match Mr Simon Piet Van der Valk with S VanderValk Senior PHD isn't a walk in the park. If we try a metaphone match on the final token we'll find Valk doesn't match PHD on a primary or alternate basis.

What to do? Well, you can never be perfect in this area, but you can be more than good enough with the appropriate effort. Recognising common trailing terms and disregarding them, as we do with titles, takes us part of the way there.
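
As a minimal sketch of that step, the snippet below drops leading titles and trailing terms before any comparison is made. The lists TITLES and TRAILING_TERMS and the function strip_affixes are hypothetical names with illustrative contents, not the lists used by any particular product.

```python
# Hypothetical, deliberately tiny lists; a production list would be far longer.
TITLES = {"mr", "mrs", "ms", "miss", "dr", "prof", "sir"}
TRAILING_TERMS = {"junior", "jr", "senior", "sr", "phd", "md", "esq"}


def strip_affixes(name: str) -> list[str]:
    """Return the name's tokens with leading titles and trailing terms removed."""
    # Lower-case and drop full stops so that "Jr." and "JR" compare equal.
    tokens = name.lower().replace(".", "").split()
    while tokens and tokens[0] in TITLES:
        tokens.pop(0)
    while tokens and tokens[-1] in TRAILING_TERMS:
        tokens.pop()
    return tokens


print(strip_affixes("S VanderValk Senior PHD"))
print(strip_affixes("Mr Simon Piet Van der Valk"))
```

With the trailing terms gone, the final tokens of the two names are `vandervalk` and `valk`, which is why compound names still need separate handling.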

The final trick is to combine space-separated tokens based on a list of compounding name elements, such as the fragment shown here:

[Figure: Compound Name Elements Fragment]

Doing this, you can actually match such complex names on a string equivalence basis without metaphone.
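
The merging step might be sketched as follows. COMPOUND_ELEMENTS is a hypothetical handful of elements standing in for the real list, surname_key is an illustrative name, and the sketch assumes titles and trailing terms have already been removed from the input.

```python
# Hypothetical sample of compounding elements; a real list would be much longer.
COMPOUND_ELEMENTS = {"van", "vander", "von", "der", "den", "de", "la", "le", "di", "du"}


def surname_key(name: str) -> str:
    """Join the final token with any compounding elements directly before it."""
    tokens = name.lower().split()
    if not tokens:
        return ""
    key = tokens[-1]
    i = len(tokens) - 2
    # Walk backwards from the final token, absorbing elements such as
    # "van" and "der" until an ordinary name token is reached.
    while i >= 0 and tokens[i] in COMPOUND_ELEMENTS:
        key = tokens[i] + key
        i -= 1
    return key


for example in ("Simon Piet Van der Valk", "S VanderValk", "S Vander Valk"):
    print(example, "->", surname_key(example))
```

All three spellings reduce to the same key, so a plain string comparison is enough to flag them as candidate duplicates. One design caveat: a first name that happens to appear in the element list would be absorbed too, so the list needs curating for the territories in the portfolio.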

This kind of name is of course more common in some territories than others, and some might argue it will be a small part of most portfolios. This may be true, but if it occurs amongst the wealthiest policyholders representing the largest concentration of risk, it has a disproportionate impact.

The general point shouldn't be conceded in any case, since creating statistical models responsibly means making every effort to preserve the independence assumption. And that makes it worth going the extra mile.





Gavin Ritchie
Gavin Ritchie is the IT Director of Longevitas
Deduplicating Names in Longevitas
Longevitas offers ten deduplication schemes, many of which process names for use either directly or as metaphone codes. You can choose which of the core set of prefixes and suffixes to ignore in name processing, as well as supply some of your own. The same applies to merging across compound name elements: you opt out of core terms and supply others as you see fit. Finally, the ability to interactively inspect potential duplicates allows you to investigate your data before finalising the deduplication settings to be applied automatically.