# Quantiles and percentiles

Quantiles are points taken at regular intervals from the cumulative distribution function of a random variable. They are generally described as *q*-quantiles, where *q* specifies the number of intervals which are separated by *q*−1 points. For example, the 2-quantile is the median, i.e. the point where values of a distribution are equally likely to be above or below this point.
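As a small illustration of the *q*-quantile idea, the sketch below lists the *q*−1 probability levels that split a distribution into *q* equally likely intervals, and evaluates them for the N(0,1) distribution. The helper name `q_quantile_probs` is hypothetical and introduced here only for illustration.

```r
# Hypothetical helper: the q - 1 probability levels separating q equal intervals
q_quantile_probs <- function(q) {
  stopifnot(q >= 2)
  seq_len(q - 1) / q
}

q_quantile_probs(2)          # one cut point: the median level
q_quantile_probs(4)          # three cut points: the quartile levels

# Corresponding theoretical quantiles of the N(0,1) distribution
qnorm(q_quantile_probs(4))

# The 99.5th percentile is the 995th cut point of the 1000-quantiles
qnorm(q_quantile_probs(1000)[995])
```

The last line evaluates the N(0,1) quantile at level 0.995, which is the quantity estimated from simulations in the rest of this post.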

A *percentile* is the name given to a 100-quantile. In Solvency II work we most commonly look for the 99.5th percentile, i.e. the point at which the probability that a random event exceeds this value is 0.5%. The simplest approach to estimating the 99.5th percentile might be to simulate 1,000 times, sort the results in ascending order and take the 995th or 996th value. However, there are several alternative ways of estimating a quantile or percentile, as documented by Hyndman and Fan (1996). One of the commonest approaches is the definition used by Microsoft Excel, which is also type 7 in the R function `quantile()`. In general, we seek the percentile at a level *p* which lies in the interval (0, 1). If *x*[*i*] denotes the *i*th smallest value in a data set of *n* values, then the percentile sought by Excel is *x*[(*n* − 1)*p* + 1], with linear interpolation between adjacent values when (*n* − 1)*p* + 1 is not an integer.
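The position rule (*n* − 1)*p* + 1 can be sketched in a few lines of R. This is a minimal illustration, assuming the data are sorted in ascending order and that we interpolate linearly when the position is not an integer; the function name `excel_quantile` is hypothetical and not part of any package.

```r
# Hypothetical helper: sample quantile at position h = (n - 1) * p + 1
# in the ascending-sorted data, with linear interpolation between neighbours
excel_quantile <- function(x, p) {
  stopifnot(length(x) > 0, p >= 0, p <= 1)
  xs <- sort(x)
  n <- length(xs)
  h <- (n - 1) * p + 1      # (possibly fractional) position in sorted data
  lo <- floor(h)
  hi <- min(lo + 1, n)      # guard against running off the end when p = 1
  frac <- h - lo            # interpolation weight on the upper neighbour
  xs[lo] + frac * (xs[hi] - xs[lo])
}

x <- c(3, 1, 4, 1, 5, 9, 2, 6)
excel_quantile(x, 0.5)
quantile(x, 0.5, type = 7, names = FALSE)   # built-in equivalent for comparison
```

In practice one would simply call `quantile()` with `type = 7`; the hand-written version is only there to make the definition concrete.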

To illustrate the calculation of sample quantiles, consider the following R commands to generate a simulated loss distribution of 1,000 values from the N(0,1) distribution:

    # Generate some pseudo-random N(0,1) variates
    set.seed(1)
    temp <- rnorm(1000)

    # Show the seven largest values
    sort(temp)[994:1000]

When these commands are run you should see the seven largest values as follows:

2.401618, 2.446531, 2.497662, 2.649167, 2.675741, 3.055742, 3.810277

In this example, *n* = 1000 and *p* = 0.995, so (*n* − 1)*p* + 1 = 995.005. This latter value is not an integer, so we must interpolate between the 995th and 996th values in ascending order, i.e. the second and third values listed above. The final answer is then:

0.995 × 2.446531 + 0.005 × 2.497662 = 2.447
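The interpolation can be checked directly in R. The sketch below regenerates the same simulated values, repeats the calculation by hand and compares it with the built-in `quantile()` function using `type = 7`; the variable names `by_hand` and `builtin` are ours.

```r
# Regenerate the simulated losses used in the example
set.seed(1)
temp <- rnorm(1000)

n <- length(temp)
p <- 0.995
h <- (n - 1) * p + 1           # position 995.005 in the sorted data
frac <- h - floor(h)           # weight of 0.005 on the 996th value

xs <- sort(temp)
by_hand <- (1 - frac) * xs[floor(h)] + frac * xs[floor(h) + 1]

# Built-in calculation using the Excel-style definition
builtin <- quantile(temp, p, type = 7, names = FALSE)

c(by_hand = by_hand, builtin = builtin)
```

Both figures should agree with each other and with the interpolated value shown above.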

So we now have our estimate of the 99.5th percentile. What is often overlooked is that the sample percentile is only an estimate, i.e. there is uncertainty over the true underlying value of the 99.5th percentile. In fact, the percentile is part of a branch of probability theory called order statistics, and it turns out that the sample percentile above is not the most efficient estimator. There are many other estimators, one of which is due to Harrell and Davis (1982). One reason the Harrell-Davis estimator is more efficient is that it is a weighted average of all of the order statistics, rather than just the one or two nearest the percentile. In the example above, the Harrell-Davis estimate of the 99.5th percentile can be found with the following extra R commands:

    # Calculate the 99.5th percentile and a standard error for it
    library(Hmisc)
    hdquantile(temp, 0.995, se=TRUE)

This yields an estimate of the 99.5th percentile of 2.534. In this sample the Harrell-Davis estimate is closer than the interpolated value of 2.447 to the known 99.5th percentile of the N(0,1) distribution (2.576), which is consistent with its greater efficiency, although a single sample cannot prove it. Perhaps even more useful is the fact that the Harrell-Davis estimator comes with a standard error, which here is 0.136. Table 1 shows how the estimate and its standard error behave as the number of simulations increases, and hence how many simulations are needed for a given level of precision.
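To make the point about using all of the data concrete, here is a minimal base-R sketch of the Harrell-Davis point estimate. It assumes the usual form of the estimator: a weighted sum of all *n* order statistics, with weights given by increments of a Beta((*n* + 1)*p*, (*n* + 1)(1 − *p*)) distribution function over the intervals ((*i* − 1)/*n*, *i*/*n*). The function name `hd_quantile` is hypothetical; it does not compute the standard error, for which `hdquantile()` remains the tool to use.

```r
# Hypothetical helper: Harrell-Davis point estimate of the p-quantile,
# written out as a weighted average of all the order statistics
hd_quantile <- function(x, p) {
  stopifnot(length(x) > 0, p > 0, p < 1)
  xs <- sort(x)
  n <- length(xs)
  a <- p * (n + 1)
  b <- (1 - p) * (n + 1)
  # Weight on the i-th order statistic: Beta(a, b) probability of ((i-1)/n, i/n]
  w <- pbeta((1:n) / n, a, b) - pbeta((0:(n - 1)) / n, a, b)
  sum(w * xs)
}

# Same simulated data as in the example
set.seed(1)
temp <- rnorm(1000)
hd_quantile(temp, 0.995)   # compare with the hdquantile() estimate quoted above
```

Because the weights sum to one and are concentrated around the 995th order statistic, the estimate is a smoothed version of the interpolated sample percentile.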

Table 1. Harrell-Davis estimates and standard errors of the 99.5th percentile of *n* N(0,1) variates. Source: Own calculations.

| *n* | 99.5th percentile | Standard error | Coefficient of variation |
|---|---|---|---|
| 1,000 | 2.534 | 0.136 | 5.4% |
| 10,000 | 2.517 | 0.047 | 1.9% |
| 25,000 | 2.564 | 0.027 | 1.1% |
| 50,000 | 2.577 | 0.020 | 0.8% |
| 100,000 | 2.564 | 0.014 | 0.5% |
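The pattern in Table 1 can also be explored by brute force. The sketch below is not how the table was produced: it assumes we are happy to repeat the whole simulation many times and measure the spread of the resulting estimates directly, and it uses the simpler type-7 sample percentile rather than the Harrell-Davis estimator. The sample sizes and the number of repetitions (`sizes`, `reps`) are illustrative choices.

```r
# Brute-force look at how the spread of a 99.5th-percentile estimate
# shrinks as the number of simulations n grows (type-7 sample percentile)
set.seed(1)
sizes <- c(1000, 4000, 16000)   # illustrative sample sizes
reps  <- 200                    # illustrative number of repeated simulations

se_hat <- sapply(sizes, function(n) {
  estimates <- replicate(reps, quantile(rnorm(n), 0.995, type = 7, names = FALSE))
  sd(estimates)                 # empirical standard error at this n
})

data.frame(
  n  = sizes,
  se = se_hat,
  cv = se_hat / qnorm(0.995)    # relative to the true N(0,1) percentile
)
```

Each fourfold increase in *n* should roughly halve the empirical standard error, which is the familiar square-root rule for simulation precision.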

**References:**

Harrell, F. E. and Davis, C. E. (1982) A new distribution-free quantile estimator. *Biometrika*, **69**, 635–640.

Hyndman, R. J. and Fan, Y. (1996) Sample quantiles in statistical packages. *The American Statistician*, **50**(4), 361–365.
