Which algorithm is best for detecting outliers?

Anomaly detection with machine learning

Why not every outlier is an outlier


An anomaly or outlier is a data point whose properties deviate so strongly from the norm that it is suspected of having been generated by a different mechanism. This admittedly somewhat circular definition already hints at the difficulties inherent in the topic. Yet it is precisely the anomalous events that are extremely valuable for a business: from fraudulent payment transactions and chronically failing "Monday" equipment on the negative side to customers with a particularly high willingness to pay on the positive side, anomalies signal a need for action. It is above all in the data that deviates from the crowd that interesting business events can be found and new insights uncovered.

With smaller amounts of data and low demands on timeliness, such checks can still be carried out "by hand". Large amounts of data and short processing times, on the other hand, require the support of intelligent algorithms. It is these algorithms that objectify the decision about what counts as abnormal and thus make it accessible to quantitative analysis in the first place.

This article aims to give an overview of the various facets of the topic (see Figure 1). In addition to basic properties of data anomalies and use cases of anomaly detection, suitable approaches from the field of machine learning are presented, showing both the possibilities and the limits of the methods used.

Are all dashboards green? Why intelligent, automated solutions are necessary

In addition to the terms "anomaly" and "outlier", the English term "novelty detection" is also common for the detection of unusual data points and is largely used synonymously [Pim14]. In a narrower sense, novelty detection means comparing a data point against a population known to be normal, while outlier detection denotes the identification of outliers in a mixed normal/abnormal population.

Framed as novelty detection, the approach receives a little more of the appreciation it actually deserves: it not only offers protection against unwelcome dangers, but also the opportunity to increase the value of data for business intelligence and to make it accessible to decision-makers.

The data available for analysis can, of course, be diverse. Anomaly detection was adopted very early on in the field of IT security, precisely because large amounts of data are available there in the form of logs. Access to an IT system from abroad, at an unusual time or with a different usage pattern would, for example, be an anomalous signal that such an analysis can reveal.

Another facet of anomaly monitoring can be found in the Internet of Things (IoT). Data supplied by sensors provides information on the condition of machines, IT devices and other assets and enables predictive maintenance. It also makes it possible to react early to conditions that deviate from the norm.

A similar methodology emerges in fraud detection, which plays a central role in payment transactions - but not only there. An illustrative example is the stolen credit card: the criminal's purchasing behavior deviates so far from the norm in terms of transaction volume, location and frequency that an alarm is triggered and the fraudulent transactions can be prevented. Detecting attempted fraud is an important measure in the rest of the service business as well.

In addition to these "classic" use cases, the digitization of all business processes also increasingly results in the need to identify anomalies over time in economically relevant key performance indicators (KPIs). These include, for example, slumps in sales, changes in the payment behavior of customers or, in online business, also decreasing click-through rates. Proactive reporting of such abnormalities complements classic business intelligence and can reduce response times.

Intelligent automation of anomaly detection enables ever finer granularity of the monitored indicators. Different channels, which classic BI only considers en bloc, can be analyzed individually in this way. A company's sales can thus be monitored, for example, at the level of product categories, products and sales channels - a task that manual checking could only perform with great effort.

To substantiate the definition of an anomaly as a "suspicious" data point given at the beginning, it helps to look first at the different types of anomalies [CBK09]. This classification already makes clear that in every application of automated anomaly detection, a careful definition of the objective is indispensable beforehand.

Point anomaly

A point anomaly occurs when a single value is exceptional on its own. A distinction must be made between univariate and multivariate point anomalies [Jol02]:

A univariate anomaly is already expressed in a single data dimension. If, for example, the pupils of a primary school class are measured, a height of 1.80 meters stands out - the suspicion is that the teacher was also measured here or that there is a recording error. This data point was "generated" by a different mechanism. Univariate anomalies are usually noticed quickly, even with superficial analysis.
More complex anomalies, on the other hand, only show themselves when several dimensions are considered together; they are multivariate. Looking at a single dimension in isolation does not reveal such outliers. When measuring the pupils of a comprehensive school, neither a height of 1.70 m nor a weight of 25 kg would be particularly out of the ordinary, since a 10-year-old child can weigh 25 kg and a 16-year-old teenager can well reach a height of 1.70 m. The combination of both measurements (25 kg, 1.70 m) in one child, however, should be almost impossible. Such a measurement would therefore be a multivariate anomaly, whose generating mechanism is very likely a recording error.
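
As a rough illustration of the difference, the following Python sketch uses purely synthetic, hypothetical height and weight values: each dimension of the suspicious measurement looks plausible on its own, while the Mahalanobis distance, which takes the correlation of the two dimensions into account, clearly exposes the combination.

```python
import numpy as np

# Purely illustrative synthetic data: heights (m) and weights (kg) of pupils
# aged 10 to 16; height and weight are strongly correlated via age.
rng = np.random.default_rng(0)
age = rng.uniform(10, 16, 500)
height = 1.10 + 0.10 * (age - 10) + rng.normal(0, 0.05, 500)
weight = 25.0 + 8.0 * (age - 10) + rng.normal(0, 4.0, 500)
X = np.column_stack([height, weight])

# Suspicious measurement: 1.70 m and 25 kg - each value is plausible on its own.
candidate = np.array([1.70, 25.0])

# Univariate view: the z-scores of the single dimensions stay moderate.
z_scores = (candidate - X.mean(axis=0)) / X.std(axis=0)

# Multivariate view: the Mahalanobis distance uses the covariance of the two
# dimensions and clearly exposes the implausible combination.
diff = candidate - X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
mahalanobis = float(np.sqrt(diff @ cov_inv @ diff))

print("univariate z-scores:", z_scores)      # roughly +1.7 and -1.7
print("Mahalanobis distance:", mahalanobis)  # roughly 8, clearly anomalous
```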

Contextual anomaly

Another type of anomaly is the contextual anomaly. These data points only become conspicuous when they are seen in a larger context. Take a case from IT security as an example: a company's network traffic fluctuates considerably between day and night. A high volume of data, which is normal during working hours, can be an indication of unauthorized access to company data at night - a security-relevant contextual anomaly.
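
A minimal sketch of this idea, with hypothetical traffic figures: the context is the hour of day, and a new measurement is scored against the historical distribution of its own hour rather than against the global distribution.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly network-traffic volumes (GB) for two weeks of history.
rng = np.random.default_rng(1)
hours = np.tile(np.arange(24), 14)
workday = (6 <= hours) & (hours < 18)
traffic = np.where(workday, rng.normal(80, 10, hours.size),  # busy by day
                            rng.normal(5, 2, hours.size))    # quiet at night
history = pd.DataFrame({"hour": hours, "traffic": traffic})

# The context is the hour of day: compute a per-hour baseline.
baseline = history.groupby("hour")["traffic"].agg(["mean", "std"])

def contextual_z(hour, value):
    """z-score of a new measurement relative to its hourly context."""
    m, s = baseline.loc[hour, "mean"], baseline.loc[hour, "std"]
    return (value - m) / s

print(contextual_z(14, 85.0))  # normal: high traffic during working hours
print(contextual_z(3, 60.0))   # the same volume at night: strongly anomalous
```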

Collective anomaly

The final, and perhaps most challenging, type of anomaly is the so-called collective anomaly. Here, individual data points are not conspicuous; an abnormality only emerges when a group of data points is considered. The signal in Figure 2 shows an electrocardiogram in which every single heartbeat appears normal. Only the irregularity of an additional beat (extrasystole) defines the abnormal condition, as it is called in the medical context.
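
Collective anomalies often only become visible on features computed over groups of points. The following sketch, with hypothetical beat timestamps, illustrates the ECG example: every single beat is unremarkable, but the spacing of the group reveals the extra beat.

```python
import numpy as np

# Hypothetical timestamps (seconds) of detected heartbeats: each beat looks
# normal on its own, only the rhythm of the group contains the anomaly.
beats = np.array([0.0, 0.8, 1.6, 2.4, 3.2, 3.6, 4.4, 5.2, 6.0])

# Aggregate feature over groups of points: the inter-beat intervals.
intervals = np.diff(beats)
median = np.median(intervals)

# An interval that deviates strongly from the typical rhythm is flagged.
irregular = np.where(np.abs(intervals - median) > 0.25 * median)[0]
print(intervals)
print("irregular interval before beat number:", irregular + 1)
```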

Time series

Time series occupy a special position: in addition to univariate anomalies, they frequently contain contextual or collective anomalies (see Figure 3). On the one hand, extrapolating the time series into the future as part of predictive analytics makes it possible to identify future anomalies before they occur; statistical methods such as ARIMA or machine learning algorithms such as recurrent neural networks (LSTM) are used for this. On the other hand, the predicted values and their confidence intervals can serve as a definition of normality in order to identify anomalies as they occur [Gup14].
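
A minimal sketch of the second approach using statsmodels; the KPI series and the chosen ARIMA orders are illustrative assumptions, not a recommendation. The prediction interval of the forecast serves as the definition of normality, and new observations outside it are flagged.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical daily KPI with a weekly pattern plus noise.
rng = np.random.default_rng(2)
idx = pd.date_range("2024-01-01", periods=120, freq="D")
values = 100 + 10 * np.sin(2 * np.pi * np.arange(120) / 7) + rng.normal(0, 2, 120)
series = pd.Series(values, index=idx, name="kpi")

# Fit a seasonal ARIMA model on the history and forecast one week ahead.
model = ARIMA(series, order=(1, 0, 1), seasonal_order=(1, 0, 1, 7)).fit()
forecast = model.get_forecast(steps=7)
interval = forecast.conf_int(alpha=0.01)          # 99% prediction interval

# New observations outside the prediction interval are flagged as anomalies.
observed = pd.Series([108.0, 75.0], index=interval.index[:2])
lower, upper = interval.iloc[:2, 0], interval.iloc[:2, 1]
print(observed[(observed < lower) | (observed > upper)])  # flags the slump to 75
```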

Anomaly detection algorithms

Various approaches are available for the algorithmic identification of outliers; the choice must depend on the specific properties of the problem as well as on general properties of the data. Three situations can be distinguished:

1. In addition to data marked as normal, already-known anomalous data is available (supervised learning).

2. Only data marked as normal are available, but no anomalies are marked (semi-supervised learning).

3. Only unlabelled data is available (unsupervised learning).

A second classification is based on the algorithmic approach:

In probabilistic methods, a statistical model is fitted to the data. The probability that a data point was generated by this model is then assessed in order to identify outliers. The statistical models must be suitably selected and, if necessary, parameterized.
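
A minimal probabilistic sketch with scikit-learn (the data and the choice of a two-component Gaussian mixture are assumptions for illustration): a mixture model is fitted to data assumed to be normal, and points whose likelihood under the model is very low are flagged.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Illustrative data assumed to be normal: two well-separated clusters.
rng = np.random.default_rng(3)
X_train = np.vstack([rng.normal([0, 0], 1, (200, 2)),
                     rng.normal([6, 6], 1, (200, 2))])

# Fit a statistical model (here a Gaussian mixture) to the data.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X_train)

# Normality threshold: e.g. the 1st percentile of the training log-likelihoods.
threshold = np.percentile(gmm.score_samples(X_train), 1)

X_new = np.array([[0.5, -0.3],   # close to a known cluster -> normal
                  [3.0, 3.0]])   # between the clusters     -> anomaly
print(gmm.score_samples(X_new) < threshold)   # [False  True]
```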

Distance and density methods, such as the k-NN algorithm, instead consider each data point in the context of its neighborhood or its similarity to other data points. If a sufficiently large number of similar data points exists for an instance, the method evaluates that data point as normal.
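
As an illustration, a small density-based sketch with scikit-learn's Local Outlier Factor (data and parameters are assumptions for demonstration): the local density of each point is compared with that of its k nearest neighbors.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Illustrative data: a dense cloud of normal points plus one isolated point.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (300, 2)),
               [[8.0, 8.0]]])

# The Local Outlier Factor compares each point's local density with the
# densities of its k nearest neighbors; -1 marks an outlier, 1 an inlier.
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)

print(labels[-1])                         # -1: the isolated point is flagged
print(lof.negative_outlier_factor_[-1])   # far below -1, i.e. very low local density
```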

Clustering methods work on a similar principle: they use machine learning algorithms such as k-means to divide the data into groups. Instances that lie far from all groups are identified as outliers.
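
A clustering-based sketch (again with assumed toy data): k-means groups the data, and the distance to the nearest cluster center serves as the anomaly score.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative data with two groups.
rng = np.random.default_rng(5)
X = np.vstack([rng.normal([0, 0], 0.5, (150, 2)),
               rng.normal([5, 5], 0.5, (150, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

def distance_to_nearest_center(points):
    # kmeans.transform returns the distance of each point to every center.
    return kmeans.transform(points).min(axis=1)

# Threshold from the training data, e.g. the 99th percentile of distances.
threshold = np.percentile(distance_to_nearest_center(X), 99)

X_new = np.array([[0.2, 0.1],    # inside a cluster
                  [2.5, 2.5]])   # far from both clusters
print(distance_to_nearest_center(X_new) > threshold)  # [False  True]
```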

Class-based methods require an at least partially labeled training data set (supervised or semi-supervised learning). A machine learning classifier is trained on this data to predict whether a data point belongs to a class. One-class support vector machines (SVMs), which determine a boundary between normality and anomaly and are therefore also referred to as domain methods, are widespread.
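
A sketch of the class-based (domain) approach with a one-class SVM from scikit-learn; the training data, which is assumed to be entirely normal, and the parameter nu are illustrative choices.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Semi-supervised setting: only data assumed to be normal is available.
rng = np.random.default_rng(6)
X_train = rng.normal(0, 1, (500, 2))

# The one-class SVM learns a boundary around the normal region;
# nu roughly controls the fraction of training points left outside.
ocsvm = OneClassSVM(kernel="rbf", nu=0.01, gamma="scale").fit(X_train)

X_new = np.array([[0.3, -0.5],   # inside the learned boundary
                  [4.0, 4.0]])   # far outside
print(ocsvm.predict(X_new))      # [ 1 -1]: the second point is an anomaly
```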

In both reconstruction methods and spectral methods, the data is transformed to a lower dimensionality and thus compressed. Instances that are poorly represented by this compression are considered anomalies. These methods include principal component analysis (PCA) and replicator neural networks. A comparable idea is found in information-theoretic methods, in which quantities such as entropy and Kolmogorov complexity are assessed.
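
A reconstruction-based sketch using PCA (synthetic data constructed so that it lies close to a two-dimensional subspace): points that the compression cannot reproduce well receive a high reconstruction error and are treated as anomalies.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 10-dimensional data that effectively lives on a 2-D subspace.
rng = np.random.default_rng(7)
latent = rng.normal(0, 1, (500, 2))
X_train = latent @ rng.normal(0, 1, (2, 10)) + rng.normal(0, 0.05, (500, 10))

# Compress to two components and measure how well points can be reconstructed.
pca = PCA(n_components=2).fit(X_train)

def reconstruction_error(X):
    return np.linalg.norm(X - pca.inverse_transform(pca.transform(X)), axis=1)

threshold = np.percentile(reconstruction_error(X_train), 99)

X_new = np.vstack([X_train[:1],                          # fits the learned structure
                   X_train[:1] + rng.normal(0, 1, 10)])  # violates the structure
print(reconstruction_error(X_new) > threshold)           # the perturbed point is flagged
```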

Use of the algorithms

Choosing an appropriate approach to anomaly detection depends on many factors. A few illustrative examples will demonstrate the possibilities of the algorithms here. Figure 4 shows the phenol content of wines from three different grape varieties [DKT17]. The y-axis shows the content of flavonoids, which are mainly responsible for the color of the wine; the x-axis shows the total content of all phenols, which have a far-reaching influence on the taste. In addition to the measured data points, randomly generated impurities are shown, which the algorithms should recognize as anomalies.

One algorithm each from the class-based, probabilistic and density-based families was used to find a region of normality, marked in Figure 4 by the dashed line. The overall quality of the three algorithms is comparably high.
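
The article does not state which concrete implementations were used; as a rough re-enactment under that caveat, the following sketch takes the UCI wine data [DKT17] (also shipped with scikit-learn), reduces it to total phenols and flavonoids, and lets three plausible stand-ins - a one-class SVM, an elliptic envelope as a probabilistic model and the Local Outlier Factor as a density method - judge randomly generated impurities.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import load_wine
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

# Wine data [DKT17] reduced to the two phenol dimensions from Figure 4.
wine = load_wine()
cols = [wine.feature_names.index("total_phenols"),
        wine.feature_names.index("flavanoids")]
X = wine.data[:, cols]

# Random "impurities" in the bounding box of the data, to be judged as anomalies.
rng = np.random.default_rng(8)
impurities = rng.uniform(X.min(axis=0), X.max(axis=0), (20, 2))

models = {
    "one-class SVM (class-based)": OneClassSVM(nu=0.05, gamma="scale"),
    "elliptic envelope (probabilistic)": EllipticEnvelope(contamination=0.05),
    "local outlier factor (density)": LocalOutlierFactor(novelty=True),
}
for name, model in models.items():
    model.fit(X)                                   # learn the region of normality
    flagged = (model.predict(impurities) == -1).mean()
    print(f"{name}: {flagged:.0%} of the impurities flagged")
```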

For the anomaly detection in Figure 4, on the other hand, only the data of two grape varieties, which fall into two clusters, was available as the definition of normality. In this difficult data situation, the density method in particular proves suitable. It should be noted that in practice data of higher dimensionality is usually available, which allows the algorithms to distinguish between normality and anomaly even more clearly.

Conclusion

The amount of data generated in modern companies and the associated data granularity make the use of machine learning algorithms increasingly attractive. Combined with automated real-time analysis, these approaches offer noticeable added value compared with classic BI tools and help to make the benefit of the data accessible. Even with automated processes, however, the selection and use of the right tools must be carried out carefully. It is important to find the level of granularity at which anomalies are reliably detected while false alarms are kept to a minimum. Only under these conditions will an automated solution gain acceptance in the company.

Literature

[CBK09] Chandola, V. / Banerjee, A. / Kumar, V.: Anomaly detection: A survey. In: ACM Computing Surveys 41, 2009

[DKT17] Dua, D. / Karra Taniskidou, E.: UCI Machine Learning Repository. University of California, School of Information and Computer Science, 2017, archive.ics.uci.edu/ml

[Gup14] Gupta, M. et al.: Outlier Detection for Temporal Data: A Survey. In: IEEE Transactions on Knowledge and Data Engineering 26, 2014

[Jol02] Jolliffe, I. T.: Principal Component Analysis. New York, Berlin, Heidelberg: Springer 2002

[Pim14] Pimentel, M. A. F. et al.: A review of novelty detection. In: Signal Processing 99, 2014, pp. 215–249

[Sub17] Subutai, A. et al.: Unsupervised real-time anomaly detection for streaming data. In: Neurocomputing 262, 2017, pp. 134–147

