One thing many people forget when dealing with data: outliers.
Even in a controlled online experiment, your dataset may be skewed by extreme values.
How do you deal with them?
Trim them out, or is there some other way?
How do you even detect outliers and measure how extreme they are?
Especially if you’re optimizing your site for revenue, you should care about outliers.
This post will dive into the nature of outliers in general, how to detect them, and then some popular methods for dealing with them.
What Are Outliers?
First, what exactly are outliers?
An outlier is an observation that lies an abnormal distance from other values in a random sample from a population.
There is, of course, a degree of ambiguity here.
Deciding whether a data point is an anomaly is left to the analyst or model, which must determine what counts as abnormal and what to do with such data points.
There are also different degrees of outliers:
Mild outliers lie beyond an inner fence on either side
Extreme outliers are beyond an outer fence
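As a rough illustration of the two degrees above, here is a minimal Python sketch. It assumes the common convention of placing the inner fences 1.5 interquartile ranges beyond the first and third quartiles and the outer fences 3 interquartile ranges beyond them. Those multipliers are an assumption, not something this post defines, and the function name and the sample order values are hypothetical.

```python
import statistics


def classify_outliers(values):
    """Label each value 'normal', 'mild', or 'extreme' using IQR-based fences.

    Assumes inner fences at 1.5 * IQR and outer fences at 3 * IQR
    beyond the first and third quartiles (a common convention).
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1

    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3.0 * iqr, q3 + 3.0 * iqr

    labels = []
    for v in values:
        if v < outer_low or v > outer_high:
            labels.append("extreme")  # beyond an outer fence
        elif v < inner_low or v > inner_high:
            labels.append("mild")     # beyond an inner fence only
        else:
            labels.append("normal")
    return labels


# Hypothetical order values from an experiment, in dollars.
orders = [20, 22, 23, 25, 26, 27, 28, 30, 45, 90]
for value, label in zip(orders, classify_outliers(orders)):
    if label != "normal":
        print(value, label)
```

With this sample, most orders fall inside the inner fences, while the two largest orders are flagged. Different quartile methods or fence multipliers will move the boundaries, so treat the labels as a starting point for investigation rather than a verdict.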
Why do outliers occur?
According to Tom Bodenberg, chief economist and data consultant at market research firm Unity Marketing, “It can be the result of measurement or recording errors, or the unintended and truthful outcome resulting from the set’s definition.”
Outliers could contain valuable information, or they could be meaningless aberrations caused by measurement and recording errors. Either way, they can cause problems with the repeatability of A/B test results.
So it’s important to question and analyze outliers in any case to see what their actual meaning is.