Anomaly detection has the potential to bring sweeping changes to an organization. When data pros identify extreme deviations from the norm, businesses can start asking interesting questions:
- Why are sales up 400%?
- Why is a customer who lives in Ohio making a purchase from Nigeria?
- Why does the Texas shipping warehouse have twice as many injuries as the national average?
- Why hasn't Joe logged into his account for 60 days?
So how do we know if an outlier in the data is really an anomaly? One approach is to measure how many standard deviations a value lies from the mean. A value three to four standard deviations out might indicate an anomaly.
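As a minimal sketch of that rule, the snippet below flags any value more than three standard deviations from the mean. The sales figures and the function name `flag_by_deviation` are made up for illustration, and the threshold of 3 is simply the low end of the range mentioned above.

```python
from statistics import mean, stdev

def flag_by_deviation(values, threshold=3.0):
    """Return (value, z_score) pairs lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        return []  # every value is identical, so nothing can stand out
    return [(v, (v - mu) / sigma) for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical daily sales: nineteen ordinary days and one extreme day.
daily_sales = [100, 102, 98, 101, 99, 103, 97, 100, 102, 98,
               101, 99, 100, 104, 96, 100, 101, 99, 102, 500]

flagged = flag_by_deviation(daily_sales)
for value, z in flagged:
    print(f"{value} is {z:.2f} standard deviations from the mean")
```

Only the 500 day is flagged here. Note that the extreme value itself inflates both the mean and the standard deviation, which is one reason the rule gets shaky on small or skewed samples.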
A few pitfalls exist with this simplistic approach. First, most data in the real world is not normally distributed. Skewness and high kurtosis make deviation analysis problematic. For these real-world data sets there's no easy way to determine how many standard deviations from the mean constitute an anomaly. The data must be analyzed in context by a user with domain knowledge. For example, a distribution of company salaries would most likely be skewed right. Would the CEO's salary be an anomaly? Probably not, but we wouldn't know without analyzing the data in context.
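To make the salary example concrete, here is a small sketch with an invented payroll (40 staff, 8 managers, 1 CEO). The numbers are hypothetical, and `skewness` is a hand-rolled moment-based estimate rather than a library call.

```python
from statistics import mean, pstdev

def skewness(values):
    """Moment-based (population) skewness: mean cubed deviation divided by sigma cubed."""
    mu = mean(values)
    sigma = pstdev(values)
    return sum((v - mu) ** 3 for v in values) / (len(values) * sigma ** 3)

# Hypothetical salaries, purely for illustration.
salaries = [50_000] * 40 + [90_000] * 8 + [900_000]

mu, sigma = mean(salaries), pstdev(salaries)
ceo_z = (max(salaries) - mu) / sigma
skew = skewness(salaries)

print(f"skewness: {skew:.2f}")        # positive, i.e. skewed right
print(f"CEO z-score: {ceo_z:.2f}")    # well beyond 3
```

The deviation rule flags the CEO's salary even though nothing is wrong with it. The numbers alone can't tell us that; someone who knows the company has to.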
Without analytical tools, detection could be performed manually by inspecting each case. Once our data grows beyond a few thousand rows this becomes impractical. Later in the post we will look at a data set containing 26 million rows. Task someone with manually inspecting this set and they will be done in about seven years. That's assuming an inspection rate of 10,000 rows a day, 365 days a year. Luckily we don't need to resort to manual detection. We have a plethora of tools and methods to speed up our search, including:
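For the curious, the seven-year figure is simple arithmetic on the movement data's row count (26,021,962, given later in the post) and the assumed inspection rate:

```python
rows = 26_021_962        # movement data row count
rows_per_day = 10_000    # assumed manual inspection rate
days = rows / rows_per_day
years = days / 365       # no weekends, no holidays

print(f"{days:.0f} days, or about {years:.1f} years")
```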
- Graphical methods
- Statistical methods
- Data mining methods
This post will explore anomaly detection using the graphical methods approach. Our sample data set comes from the VAST 2015 visualization competition. The set contains movement and communication transaction data from a fictitious theme park called Dino World. We have data from only one weekend, during which an act of vandalism occurred in the park. We have not been given any details about the vandalism ahead of time. For more information about the competition, check out the VAST website: http://bit.ly/2gJ43Ac
The provided data set is relatively straightforward. It's just two files with a few columns each. Every park visitor is assigned a unique ID.
Here's a peek at the communication data:
Row count: 4,153,329
The columns "from" and "to" are the visitor IDs of the guests involved in the phone call. The location column indicates the area of the park where the call was placed. Guests have the ability to place external calls.
Row count for the movement data: 26,021,962
The type column indicates if the movement event was a check in to a ride or just walking through the park.
A simple time-series analysis of the communication data using Tableau immediately reveals a potential anomaly. Internal calls are represented by the orange line and external calls by the blue line. Look at the red box.
A spike in external calls emerges from the set. When we drill down to this time period the pattern becomes easy to describe. On June 8th between 11:45 am and 12 pm guests made 500 calls a minute, by far the highest volume for this type of call over the entire weekend.
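The chart was built in Tableau, but the same drill-down can be sketched in a few lines of Python: bucket the external calls by minute and look for the busiest bucket. Everything about the rows below is an assumption made for illustration, including the timestamp format and year, the use of the string "external" in the "to" column, and the visitor IDs and area names.

```python
from collections import Counter
from datetime import datetime

# Hypothetical rows: (timestamp, from, to, location).
calls = [
    ("2014-06-08 11:44:10", "1001", "1002", "Area A"),
    ("2014-06-08 11:44:30", "1003", "external", "Area A"),
    ("2014-06-08 11:45:05", "1004", "external", "Wet lands"),
    ("2014-06-08 11:45:20", "1005", "external", "Wet lands"),
    ("2014-06-08 11:45:50", "1006", "external", "Wet lands"),
    ("2014-06-08 11:46:15", "1007", "1008", "Area B"),
]

def calls_per_minute(rows, external_only=True):
    """Count calls in one-minute buckets, optionally keeping only external calls."""
    counts = Counter()
    for timestamp, _sender, receiver, _location in rows:
        if external_only and receiver != "external":
            continue
        minute = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").replace(second=0)
        counts[minute] += 1
    return counts

per_minute = calls_per_minute(calls)
busiest_minute, volume = per_minute.most_common(1)[0]
print(busiest_minute.strftime("%H:%M"), volume)
```

On the real data we would plot `per_minute` rather than just take the maximum, since the shape of the spike matters as much as its height.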
The spike merits additional analysis. The natural next question is which area of the park these calls originated from. The next graph gives the answer: the Wet lands.
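A rough code equivalent of that location breakdown: keep only the external calls inside the spike window and count them by area. As before, the rows, the timestamp format and year, and the "external" marker are placeholders invented for this sketch, not the actual file layout.

```python
from collections import Counter
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format

# Hypothetical rows: (timestamp, from, to, location).
calls = [
    ("2014-06-08 11:40:00", "1001", "external", "Area A"),
    ("2014-06-08 11:46:12", "1002", "external", "Wet lands"),
    ("2014-06-08 11:47:45", "1003", "external", "Wet lands"),
    ("2014-06-08 11:52:03", "1004", "external", "Area B"),
    ("2014-06-08 11:55:30", "1005", "external", "Wet lands"),
    ("2014-06-08 11:58:09", "1006", "1007", "Wet lands"),
]

def external_calls_by_location(rows, start, end):
    """Count external calls per location for timestamps in [start, end)."""
    start_dt = datetime.strptime(start, FMT)
    end_dt = datetime.strptime(end, FMT)
    counts = Counter()
    for timestamp, _sender, receiver, location in rows:
        when = datetime.strptime(timestamp, FMT)
        if receiver == "external" and start_dt <= when < end_dt:
            counts[location] += 1
    return counts

by_location = external_calls_by_location(
    calls, "2014-06-08 11:45:00", "2014-06-08 12:00:00"
)
print(by_location.most_common())
```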
Finally, let's look at communication patterns between guests during the spike. The following graph was generated using the igraph package in R.
Each dot represents a park visitor and each connection indicates that a communication occurred between the two visitors. The node at the center of the largest green dot cluster represents an external number. Thus we confirm the visualizations from above (lots of external calls) but also have an opportunity to look for anomalies outside the main clusters. For example, the blue dot network in the upper right-hand corner looks suspicious. Why did no one in it communicate with an external number when most others did?
These visitors may have had a legitimate reason for not calling an external line or they could be related to the vandalism.
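The graph above was drawn with igraph in R. The same question, which groups of callers never touched the external number, can also be asked without a picture. The sketch below is a plain-Python stand-in, not the igraph code: it finds connected components with a breadth-first search and keeps the ones that don't contain the external node. The edge list and visitor IDs are invented.

```python
from collections import defaultdict, deque

def connected_components(edges):
    """Group nodes into connected components of an undirected graph."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)

    seen = set()
    components = []
    for start in list(graph):
        if start in seen:
            continue
        component = set()
        queue = deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components

# Hypothetical calls during the spike: (from, to).
edges = [
    ("1001", "external"),
    ("1002", "external"),
    ("1003", "external"),
    ("2001", "2002"),
    ("2002", "2003"),
]

components = connected_components(edges)
no_external = [c for c in components if "external" not in c]
print(no_external)
```

Any component that comes back in `no_external` is a group worth a closer look, just like the blue cluster in the graph.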
With just four graphs we've identified a potential time frame for the vandalism, a location, and some suspects. Not bad for having no prior knowledge 15 minutes ago.
Searching for anomalies using graphical methods can be an effective starting point. However, graphical analysis is too slow when we need to analyze hundreds of columns or variables. The next post in this series will address alternative methods and build on the analysis performed thus far.