Kernel Density Estimation

Up until now, we've been simulating data by randomly sampling from an underlying theoretical distribution, then constructing a sampling distribution from our observed data. But what if we only had the observed data and wanted to recover the underlying distribution itself? Kernel Density Estimation (KDE) addresses exactly this problem; it is among the most important topics in nonparametric statistics and has myriad practical applications.

Histogram Binning

As a motivating example, let's consider how we might even begin to approach estimating the source probability density function (PDF), given only the observed sample data. An elementary (and respectable) choice would be to make a histogram: a graph constructed by grouping data into "bins" (equally spaced intervals), with bar heights corresponding to the proportion of the data that falls within each interval. Intuitively (but naively), we might think to take the limit of the bin width going to zero to find the value of the PDF at each point, as visualized below.

To briefly explain the plot, the curve drawn with the dashed black line is the true underlying Normal distribution from which we are sampling our data. The black ticks below the plot are the random sample of 250 data values we observe. The colored boxes are our estimated values of the underlying density over each interval.
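If you'd like to follow along, here is a minimal sketch in Python of how such a density-scaled histogram can be built. The seed, standard Normal sample, and bin count are stand-ins for illustration, not the exact values behind the plot.

```python
import numpy as np

rng = np.random.default_rng(0)                 # arbitrary seed for reproducibility
sample = rng.normal(loc=0, scale=1, size=250)  # stand-in for the 250 observed values

# density=True rescales bar heights so the total area under the histogram is 1,
# turning the histogram into a crude estimate of the PDF
heights, edges = np.histogram(sample, bins=30, density=True)

# each bar's height is the estimated density over [edges[i], edges[i+1])
for left, right, height in list(zip(edges[:-1], edges[1:], heights))[:3]:
    print(f"[{left:5.2f}, {right:5.2f}) -> {height:.3f}")
```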

There are obvious shortcomings to this approach. A finite sample is not "dense" in the continuum of possible values: as the bins get narrower, gaps open up in the estimate. This artificial sparsity yields a disproportionate number of zero-height bins, which forces the probability mass to be redistributed onto the non-empty bins, blowing up the error of our PDF estimate.

In fact, for histograms approximating a Normal distribution, there is an optimal binwidth, given by "Silverman's Rule of Thumb": $$h = 0.9 \min \left(\hat{\sigma}, \frac{\mathrm{IQR}}{1.34}\right) n^{-\frac{1}{5}}$$ where h is the binwidth, sigma_hat is the sample standard deviation, IQR is the interquartile range, and n is the sample size. In other words, there is a simple, easily calculated formula for the binwidth that is optimal on average. For our random sample of 250 data points, the histogram using Silverman's binwidth is visualized below.
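As a sketch of that calculation in Python (the helper name is my own, and it reuses the sample from the snippet above):

```python
import numpy as np

def silverman_bandwidth(x):
    """Silverman's Rule of Thumb: h = 0.9 * min(sigma_hat, IQR / 1.34) * n^(-1/5)."""
    x = np.asarray(x)
    sigma_hat = x.std(ddof=1)                       # sample standard deviation
    iqr = np.subtract(*np.percentile(x, [75, 25]))  # interquartile range
    return 0.9 * min(sigma_hat, iqr / 1.34) * x.size ** (-1 / 5)

h = silverman_bandwidth(sample)                        # roughly 0.3 for 250 standard Normal draws
edges = np.arange(sample.min(), sample.max() + h, h)   # equal-width bins of width h
heights, _ = np.histogram(sample, bins=edges, density=True)
```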

Wow, that looks a lot closer! We no longer have big sparse gaps in our binning, and each bin sits relatively close to the underlying truth drawn with the dashed black line. However, our initial problem with histograms remains: we can only calculate the probability density estimate over intervals, not at individual points; every point within an interval is assigned the same density. This was the reason we wanted to take the binwidth to zero in the first place. Can we do better than histograms?

The Kernel Density Estimator

In fact, we can. The solution is similar in spirit. The idea is to estimate the probability density at each point by how many points lie close to it. The question becomes: how do we weight the contribution of neighboring points, and how do we ensure points contribute "their fair share", no more, no less? The Kernel Density Estimator is given by: $$\widehat{f}_{h}(x)=\frac{1}{n h} \sum_{i=1}^{n} K\left(\frac{x-x_{i}}{h}\right)$$ where h is the "bandwidth", K(·) is a "kernel function" (a weighting function), and n is the sample size. The bandwidth of the kernel is analogous to the binwidth of the histogram. Unlike the histogram, points farther than one bandwidth away still contribute to the density estimate; their contribution just shrinks as the distance grows. Bandwidth gives us smooth control (as opposed to the histogram's discrete bins) over how fast our weighting function drops off; in other words, how much weight we give nearby points compared to points further away. For estimating an underlying Normal PDF, the bandwidth h is also calculated using Silverman's Rule of Thumb from the binwidth discussion above, and we see that the bandwidth is proportional to our estimate of the standard deviation.
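As a concrete (if unoptimized) sketch, the estimator can be written directly from the formula; the function names here are just illustrative:

```python
import numpy as np

def gaussian_kernel(u):
    """Standard Normal density: a common choice for K."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def kde(x_grid, data, h, K=gaussian_kernel):
    """Evaluate f_hat_h(x) = (1 / (n * h)) * sum_i K((x - x_i) / h) at each grid point."""
    data = np.asarray(data)
    # scaled distances, shape (len(x_grid), n): one row per evaluation point
    u = (np.asarray(x_grid, dtype=float)[:, None] - data[None, :]) / h
    return K(u).sum(axis=1) / (data.size * h)
```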

A common choice of kernel function is the Normal distribution that we are already very familiar with from the previous 3 issues. Let's see KDE in action on the same data as in the histograms above: 250 points randomly sampled from a Normal distribution.
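If you'd rather not hand-roll the estimator, SciPy's gaussian_kde uses a Normal kernel and can pick its bandwidth with a Silverman-style rule (its constant differs slightly from the formula quoted above). A minimal usage sketch, with an illustrative seed and evaluation grid:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(size=250)        # 250 draws from a standard Normal, as in the plots

grid = np.linspace(-4, 4, 200)
kde_hat = stats.gaussian_kde(sample, bw_method="silverman")   # Normal kernel
print(f"max absolute error vs. true PDF: "
      f"{np.abs(kde_hat(grid) - stats.norm.pdf(grid)).max():.3f}")
```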

For small bandwidth, we run into the same problems as with small binwidth: too much weight is placed locally, so the PDF estimate is far too high and spiky at the data we observed and far too low in the regions directly adjacent. Conversely, when the bandwidth is too high, the PDF estimate is overly smoothed, and in the tails we get a relatively high estimate over regions where we observed no data anywhere nearby. Note that as the bandwidth tends towards infinity, the KDE tends towards the shape of the kernel function itself and ignores the observed data; here we see the KDE tending towards a very wide Normal distribution.

However, at around bandwidth = 0.29 (approximately Silverman's Rule of Thumb estimate from above), the KDE appears close to optimal, and looks even better than the histogram!
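To see this numerically, here is a quick sweep over bandwidths, reusing the kde and silverman_bandwidth sketches above on the same sample; the endpoints 0.05 and 2.0 are just illustrative choices for "far too narrow" and "far too wide":

```python
import numpy as np
from scipy import stats

# reuses kde(), silverman_bandwidth(), and sample from the sketches above
grid = np.linspace(-4, 4, 200)
for h in (0.05, silverman_bandwidth(sample), 2.0):
    max_err = np.abs(kde(grid, sample, h) - stats.norm.pdf(grid)).max()
    print(f"h = {h:.2f}: max absolute error = {max_err:.3f}")
```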

Comparison of Kernels

While the Normal kernel function is a common, and often good, choice, it's worth noting that there are other kernel functions which are also used for weighting. Here are the kernel functions themselves plotted on the same axis, below.
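For reference, here are a few of the standard alternatives written out (the exact set shown in the plot isn't restated here). Each is a symmetric density that integrates to 1 and, unlike the Gaussian kernel defined earlier, vanishes outside [-1, 1]:

```python
import numpy as np

def uniform_kernel(u):
    """Rectangular kernel: constant weight 1/2 on [-1, 1], zero elsewhere."""
    return 0.5 * (np.abs(u) <= 1)

def triangular_kernel(u):
    """Weight decreases linearly from 1 at u = 0 to 0 at |u| = 1."""
    return np.clip(1 - np.abs(u), 0, None)

def epanechnikov_kernel(u):
    """Parabolic kernel, 3/4 * (1 - u^2) on [-1, 1]."""
    return 0.75 * np.clip(1 - u ** 2, 0, None)
```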

And for comparison, here is the same bandwidth simulation as above, repeated using the uniform kernel function, rather than the Normal.

Again we note that as the bandwidth tends towards infinity, the KDE assumes the shape of the kernel function itself, and here we see it tending towards the "rectangular" uniform distribution.
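A quick numerical check of that claim, reusing the kde, uniform_kernel, and sample sketches above: with an enormous bandwidth, every observation falls inside the window at every evaluation point, so the uniform-kernel estimate flattens out to the constant 1/(2h).

```python
import numpy as np

# reuses kde(), uniform_kernel(), and sample from the sketches above
big_h = 50.0
flat = kde(np.array([-3.0, 0.0, 3.0]), sample, big_h, K=uniform_kernel)
print(flat)               # [0.01, 0.01, 0.01]
print(1 / (2 * big_h))    # 0.01, the height of the limiting rectangle
```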

And here is the KDE of the data estimated using each kernel function with Silverman's bandwidth.
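Reusing the earlier sketches, swapping kernels is just a matter of passing a different K into the same estimator. Note that applying the same Silverman bandwidth to every kernel is a simplification; each kernel technically has its own optimal scaling constant.

```python
import numpy as np
from scipy import stats

# reuses the estimator, kernels, and 250-point sample sketched above
grid = np.linspace(-4, 4, 200)
h = silverman_bandwidth(sample)
kernels = {"uniform": uniform_kernel, "triangular": triangular_kernel,
           "epanechnikov": epanechnikov_kernel, "gaussian": gaussian_kernel}
for name, K in kernels.items():
    max_err = np.abs(kde(grid, sample, h, K=K) - stats.norm.pdf(grid)).max()
    print(f"{name:>12}: max absolute error = {max_err:.3f}")
```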

Wackier Densities

I would be remiss to leave the impression that KDE is only used to estimate PDFs which are Normal. Silverman's rule is only optimal on average, and only for a Normal PDF; any given dataset may have a different optimal bandwidth, which is why it's important to choose the bandwidth by visualizing the KDE at a variety of bandwidths, as we are doing here. KDE itself, however, is a very powerful and relatively robust tool. In fact, we are able to estimate more complicated densities, such as Gaussian (Normal) Mixtures:
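As a sketch, here is what that looks like with an illustrative two-component mixture (the weights and components behind the plot may differ), again scoring the estimate against the known truth:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 250
# illustrative mixture: component 1 with probability 0.4, component 2 otherwise
is_first = rng.random(n) < 0.4
sample_mix = np.where(is_first, rng.normal(-2.0, 0.5, n), rng.normal(1.5, 1.0, n))

kde_mix = stats.gaussian_kde(sample_mix, bw_method="silverman")
grid = np.linspace(-5, 5, 400)
true_pdf = 0.4 * stats.norm.pdf(grid, -2.0, 0.5) + 0.6 * stats.norm.pdf(grid, 1.5, 1.0)
print(f"max absolute error: {np.abs(kde_mix(grid) - true_pdf).max():.3f}")
```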

And even densities which are far from Normal like the exponential PDF:

However, note that we run into issues when the kernel function and the PDF we are trying to estimate have different support. Our Normal kernel function is defined over the entire real line, while the exponential PDF is defined only for non-negative values. Thus, as the bandwidth increases and the KDE tends towards the Normal shape of the kernel, we start putting substantial weight on negative values that are impossible to observe. Despite this, for the right choice of bandwidth (by eyeball, approximately 0.15), the KDE still provides a reasonably good approximation of the true underlying exponential PDF.
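One way to see the support problem concretely is to ask how much probability mass the estimate leaks onto negative values. A sketch with an illustrative rate-1 exponential sample (SciPy's scalar bw_method is a factor multiplying the sample standard deviation, which is close to 1 here, so it roughly matches the eyeballed bandwidth of 0.15):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample_exp = rng.exponential(scale=1.0, size=250)   # illustrative rate-1 exponential sample

# scalar bw_method sets the factor on the sample std; effective bandwidth is roughly 0.15 here
kde_exp = stats.gaussian_kde(sample_exp, bw_method=0.15)

# probability mass the estimate places on impossible negative values
leaked = kde_exp.integrate_box_1d(-np.inf, 0)
print(f"estimated mass below zero: {leaked:.3f}")
```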

