For the record

Stephen has written about the challenges in using population cause-of-death data for mortality analysis and forecasting.   Another potential source of data is computerised patient records such as the General Practice Research Database (GPRD).  However, when using them you need to know their strengths and weaknesses.

These databases are derived by extracting anonymized data from general practice (GP) computer records.  When GP computer systems emerged in the mid-1980s, they were used for keeping registers of patients and for administering repeat prescribing.  Recording of other aspects of the record has evolved incrementally since.  Consequently, data about prescribing is highly reliable, but data in other areas suffers from incompleteness and systematic biases.

Take smoking: a GP is more likely to record a history of smoking than non-smoking, not least because non-smokers are less likely to be sick.  If a well person never sees the doctor, their smoking status may never be recorded.  Consequently, estimates of smoking rates from GP databases are likely to be skewed upwards if the denominator is taken as the number of patients with a smoking-status entry, or downwards if the denominator is taken as the number of patient records.  Statistical corrections and imputations can be made, but these change the status of the data from observation to approximation.
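To make the denominator effect concrete, here is a minimal Python sketch.  The function name and the patient counts are hypothetical and chosen purely for illustration; they are not taken from the GPRD or any real practice.  It computes the smoking rate twice from the same records, once per denominator.

```python
def smoking_rate_estimates(n_patients, n_recorded_smokers, n_recorded_nonsmokers):
    """Return (rate over all patient records, rate over records with a status).

    Illustrative only: assumes every recorded status is correct and that
    patients with no entry are simply of unknown smoking status.
    """
    n_with_status = n_recorded_smokers + n_recorded_nonsmokers
    if n_with_status == 0:
        raise ValueError("no smoking status recorded for any patient")
    if n_with_status > n_patients:
        raise ValueError("more status entries than patient records")

    # Denominator = all patient records: patients with no entry are
    # implicitly counted as non-smokers, pulling the estimate down.
    rate_over_all = n_recorded_smokers / n_patients

    # Denominator = patients with a status entry: if smokers are more
    # likely to have an entry, this pushes the estimate up.
    rate_over_recorded = n_recorded_smokers / n_with_status

    return rate_over_all, rate_over_recorded


# Hypothetical practice: 10,000 patients, 2,000 recorded smokers,
# 4,000 recorded non-smokers, 4,000 with no smoking entry at all.
low, high = smoking_rate_estimates(10_000, 2_000, 4_000)
print(f"over all records: {low:.1%}, over recorded statuses: {high:.1%}")
```

With these made-up counts the two estimates are 20.0% and 33.3%.  If unrecorded patients really are less likely to smoke than recorded ones, the true rate would lie somewhere between the two, but the records alone cannot say where.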

To make a written diagnosis readable for a database, it must be coded in some way, and this was often not done.  Whilst this might seem a bit sloppy, it is best to view it from the perspective of the purpose of the record.  This data was recorded as an aid to patient care and fee claims, and not as a research database.  This problem was recognized in the early 1990s by a health informaticist who coined the ‘First Law of Informatics’ which states:

"Data shall be used only for the purpose for which they are collected."

Van der Lei, J. Use and abuse of computer-stored medical records,
Methods Inf. Med. 1991 April; 30(2):79–80.

This is a severe response to the problem, but makes a valid point.  GP databases were primarily used to support doctors' day-to-day work and data was not entered to the rigorous standards required for research.

In 2003, the new GP contract introduced a payment-by-results system called the ‘Quality and Outcomes Framework’ (QoF) which requires extensive computerised audit, and sets standards for record keeping.  It even specifies the codes to be used.  This means that, for QoF-related diagnoses, data is much more consistent and complete after 2003 than before.  Such discontinuities must be borne in mind when trying to research trends using databases like the GPRD: an apparent jump in a diagnosis around that time may reflect better recording rather than a real change in disease.
