How Can You Identify Outliers in a Dataset?
Outliers are numbers that are much higher or lower than the rest of the dataset. They are the odd ones out! In this article, we will explain how you can find these outliers in a dataset using simple steps.
What are Outliers?
Outliers are values that differ dramatically from the rest of the dataset. For example, if most numbers in a group fall between 10 and 20, a value of 100 is an outlier. Outliers can be very important because they might point to something special, or they might be errors.
How to identify outliers in a dataset
There are quite a few ways to identify outliers in a dataset. Let’s start with some of the most basic.
Examine the data
First of all, just look at the numbers in the dataset. If the set is small enough, you can glance through the values (sorting them first helps) to see whether any number seems too big or too small compared to the others. This is probably the easiest way, but it becomes impractical for large datasets.
Use a Box Plot
- A box plot is useful for finding outliers. It shows the spread of your data and the median (middle value).
- The plot's box spans the middle half of the data, from the first quartile (Q1) to the third quartile (Q3). The lines, known as whiskers, extend from the box to the most extreme values that are not considered outliers; by a common convention, they reach no farther than 1.5 times the length of the box from either edge. If a number falls beyond the whiskers, it is drawn as a separate point and treated as an outlier.
Find the interquartile range (IQR)
The interquartile range, or IQR, is the difference between the third quartile, Q3, and the first quartile, Q1. In other words, it measures the spread of the middle half of the data. Here is how to use the IQR to find outliers:
- Step 1: Calculate Q1 and Q3
- Step 2: Find the IQR by subtracting Q1 from Q3 (IQR = Q3 – Q1).
- Step 3: Multiply the IQR by 1.5.
- Step 4: Determine the lower and upper bounds
  - Lower bound = Q1 – (1.5 * IQR)
  - Upper bound = Q3 + (1.5 * IQR)
- Step 5: Any data point below the lower bound or above the upper bound is an outlier.
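The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a definitive implementation: the function name `iqr_outliers`, the parameter `k`, and the sample data are made up for this example. It also assumes the "inclusive" quartile method of the standard library's `statistics.quantiles`; other tools may compute Q1 and Q3 slightly differently and so give slightly different bounds.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return (lower_bound, upper_bound, outliers) using the IQR rule."""
    # Step 1: with n=4, quantiles() returns the three quartile cut points.
    # The "inclusive" method is an assumption; other definitions exist.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    # Step 2: the IQR is the distance between the quartiles.
    iqr = q3 - q1
    # Steps 3 and 4: scale the IQR by k (1.5 by default) to get the bounds.
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Step 5: anything outside the bounds is flagged as an outlier.
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers

# Hypothetical sample: most values lie between 10 and 20, plus one value of 100.
data = [12, 14, 15, 11, 18, 13, 16, 19, 17, 100]
lower, upper, outliers = iqr_outliers(data)
print(lower, upper, outliers)  # 6.5 24.5 [100]
```

For this sample, Q1 is 13.25 and Q3 is 17.75, so the IQR is 4.5 and the bounds are 6.5 and 24.5. Only the value 100 falls outside them.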
Z-Score Method
- The Z-score indicates how many standard deviations a number lies from the mean. The farther the Z-score is from zero, in either direction, the more likely the number is an outlier.
- To calculate the Z-score, subtract the mean from the number, then divide the result by the standard deviation: Z = (number - mean) / standard deviation. A Z-score higher than 3 or lower than -3 is commonly treated as an outlier.
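As a rough sketch of the Z-score method, the snippet below flags every value whose Z-score is above 3 or below -3. The function name `z_score_outliers`, the `threshold` parameter, and the sample data are hypothetical. Using the sample standard deviation (`statistics.stdev`) is also an assumption, since the method can equally be applied with the population version (`statistics.pstdev`).

```python
import statistics

def z_score_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds `threshold`."""
    mean = statistics.mean(values)
    # Sample standard deviation (divides by n - 1). This is an assumption;
    # statistics.pstdev would give the population version instead.
    sd = statistics.stdev(values)
    # Z = (value - mean) / standard deviation, checked in both directions.
    return [v for v in values if abs((v - mean) / sd) > threshold]

# Hypothetical sample: 19 values between 10 and 20, plus one value of 100.
data = [12, 14, 15, 11, 18, 13, 16, 19, 17, 15,
        14, 13, 16, 12, 18, 15, 14, 17, 13, 100]
print(z_score_outliers(data))  # [100]
```

The example uses 20 values on purpose. In very small samples no point can reach a Z-score of 3: with the sample standard deviation, the largest possible absolute Z-score is (n - 1) / sqrt(n), which is about 2.85 for n = 10. The function also assumes the values are not all identical, since the standard deviation would then be zero.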
Visualize the data
Sometimes, looking at the data in graphical form, such as a scatter plot or histogram, will help you find outliers. Outliers usually appear as points that sit far from the rest of the data. Visualizing the data is a quick way to spot these unusual numbers.
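To show the idea without any plotting library, here is a small sketch that prints a text histogram using only the Python standard library. In practice you would more likely draw the chart with a plotting tool. The function name `text_histogram`, the `bin_width` parameter, and the sample data are hypothetical, and the sketch assumes whole-number values.

```python
from collections import Counter

def text_histogram(values, bin_width=10):
    """Return one row per bin; each '#' stands for one value in that bin."""
    # Map every value to the start of its bin, e.g. 17 -> 10 when bin_width=10.
    counts = Counter((v // bin_width) * bin_width for v in values)
    rows = []
    # Include empty bins too, so gaps in the data stay visible.
    for start in range(min(counts), max(counts) + bin_width, bin_width):
        label = f"{start:>4}-{start + bin_width - 1:<4}"
        rows.append(f"{label} {'#' * counts[start]}".rstrip())
    return "\n".join(rows)

# Hypothetical sample: 19 values between 10 and 20, plus one value of 100.
data = [12, 14, 15, 11, 18, 13, 16, 19, 17, 15,
        14, 13, 16, 12, 18, 15, 14, 17, 13, 100]
print(text_histogram(data))
```

In the printed output, the first row holds 19 marks, the rows for 20 through 99 are empty, and a single mark sits alone in the 100-109 row. That isolated mark, separated from the rest by a long gap, is the visual signature of an outlier.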
Why is it important to find outliers?
Identifying outliers is important because they can:
- Show mistakes in the data collection process.
- Indicate special events or important information.
- Skew the results of your analysis, which means they require careful attention.
Conclusion
The methods described above can help you identify outliers in a dataset: examining the data, creating a box plot, calculating the IQR, calculating the Z-score, and visualizing the data. Once you have found the outliers, you can decide whether to include them, remove them, or investigate them further.