Anomaly Detection Using Gaussian Distribution

Jupyter Demos

▶️ Demo | Anomaly Detection - find anomalies in server operational parameters like latency and throughput

Gaussian (Normal) Distribution

The normal (or Gaussian) distribution is a very common continuous probability distribution. Normal distributions are important in statistics and are often used in the natural and social sciences to represent real-valued random variables whose distributions are not known. A random variable with a Gaussian distribution is said to be normally distributed and is called a normal deviate.

Let’s say:

$x \in \mathbb{R}$

If $x$ is normally distributed, its density may be displayed as the familiar bell-shaped curve.

_Figure: probability density of the Gaussian distribution._

$\mu$ - mean value,

$\sigma^2$ - variance.

$x \sim \mathcal{N}(\mu, \sigma^2)$ - the "$\sim$" symbol means "$x$ is distributed as ...".

Then the Gaussian density (how likely a value $x$ is under a distribution with a given mean and variance) is given by:

$p(x; \mu, \sigma^2) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$
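As a quick sanity check, the density above can be evaluated directly. A minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma_squared):
    # p(x; mu, sigma^2) = 1 / (sigma * sqrt(2*pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
    return (1.0 / np.sqrt(2.0 * np.pi * sigma_squared)) * np.exp(
        -((x - mu) ** 2) / (2.0 * sigma_squared)
    )

# Density of a standard normal at its mean is 1 / sqrt(2*pi) ≈ 0.3989.
print(gaussian_pdf(0.0, mu=0.0, sigma_squared=1.0))
```

Note that the function takes the variance $\sigma^2$ directly, matching the notation above, rather than the standard deviation $\sigma$.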

Estimating Parameters for a Gaussian

We may use the following formulas to estimate the Gaussian parameters (mean and variance) for the $i$-th feature:

$\mu_i = \frac{1}{m} \sum_{j=1}^{m} x_i^{(j)}$

$\sigma_i^2 = \frac{1}{m} \sum_{j=1}^{m} \left(x_i^{(j)} - \mu_i\right)^2$

$i = 1, \ldots, n$

$m$ - number of training examples.

$n$ - number of features.
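These per-feature estimates are just the column-wise mean and (population) variance of the training matrix. A sketch with a tiny made-up training set (the values are ours):

```python
import numpy as np

# Hypothetical training set: m = 4 examples, n = 2 features.
X = np.array([[1.0, 10.0],
              [2.0, 12.0],
              [3.0, 14.0],
              [4.0, 16.0]])

# mu_i = (1/m) * sum_j x_i^(j), one value per feature (column).
mu = X.mean(axis=0)

# sigma_i^2 = (1/m) * sum_j (x_i^(j) - mu_i)^2.
# NumPy's default ddof=0 divides by m, matching the 1/m formula above.
sigma_squared = X.var(axis=0)

print(mu)             # [ 2.5 13. ]
print(sigma_squared)  # [1.25 5.  ]
```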

Density Estimation

So we have a training set:

$\{x^{(1)}, x^{(2)}, \ldots, x^{(m)}\}$

$x \in \mathbb{R}^n$

We assume that each feature of the training set is normally distributed:

$x_1 \sim \mathcal{N}(\mu_1, \sigma_1^2)$

$x_2 \sim \mathcal{N}(\mu_2, \sigma_2^2)$

$\ldots$

$x_n \sim \mathcal{N}(\mu_n, \sigma_n^2)$

Then:

$p(x) = p(x_1; \mu_1, \sigma_1^2) \, p(x_2; \mu_2, \sigma_2^2) \cdots p(x_n; \mu_n, \sigma_n^2)$

$p(x) = \prod_{i=1}^{n} p(x_i; \mu_i, \sigma_i^2) = \prod_{i=1}^{n} \frac{1}{\sigma_i \sqrt{2\pi}} \exp\left(-\frac{(x_i - \mu_i)^2}{2\sigma_i^2}\right)$
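The product over features can be sketched in a few lines (function name and example parameter values are ours):

```python
import numpy as np

def density(x, mu, sigma_squared):
    # Per-feature Gaussian densities p(x_i; mu_i, sigma_i^2)...
    per_feature = (1.0 / np.sqrt(2.0 * np.pi * sigma_squared)) * np.exp(
        -((x - mu) ** 2) / (2.0 * sigma_squared)
    )
    # ...multiplied together give p(x).
    return np.prod(per_feature)

mu = np.array([2.5, 13.0])
sigma_squared = np.array([1.25, 5.0])

# The density is highest at the mean and falls off away from it.
print(density(np.array([2.5, 13.0]), mu, sigma_squared))
print(density(np.array([0.0, 0.0]), mu, sigma_squared))
```

With many features, the product of small densities can underflow; summing log-densities is the usual numerically safer variant.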

Anomaly Detection Algorithm

  1. Choose features $x_i$ that might be indicative of anomalous examples.
  2. Fit parameters $\mu_1, \ldots, \mu_n, \sigma_1^2, \ldots, \sigma_n^2$ using the formulas:

$\mu_i = \frac{1}{m} \sum_{j=1}^{m} x_i^{(j)}$

$\sigma_i^2 = \frac{1}{m} \sum_{j=1}^{m} \left(x_i^{(j)} - \mu_i\right)^2$

  3. Given a new example $x$, compute $p(x)$:

$p(x) = \prod_{i=1}^{n} p(x_i; \mu_i, \sigma_i^2) = \prod_{i=1}^{n} \frac{1}{\sigma_i \sqrt{2\pi}} \exp\left(-\frac{(x_i - \mu_i)^2}{2\sigma_i^2}\right)$

Flag an anomaly if $p(x) < \varepsilon$.

$\varepsilon$ - probability threshold.
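Putting the three steps together, a minimal end-to-end sketch (the data is synthetic and the threshold value is an assumption; in practice $\varepsilon$ is tuned on a labeled validation set):

```python
import numpy as np

def fit(X):
    # Step 2: per-feature mean and (population) variance.
    return X.mean(axis=0), X.var(axis=0)

def p(X, mu, sigma_squared):
    # Step 3: p(x) for each row of X, as the product over features.
    densities = (1.0 / np.sqrt(2.0 * np.pi * sigma_squared)) * np.exp(
        -((X - mu) ** 2) / (2.0 * sigma_squared)
    )
    return np.prod(densities, axis=1)

# Hypothetical training data: a normal cluster around (5, 5).
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[5.0, 5.0], scale=[1.0, 1.0], size=(500, 2))
mu, sigma_squared = fit(X_train)

epsilon = 1e-4  # probability threshold (assumed value)
X_new = np.array([[5.1, 4.9],     # typical point, near the training cloud
                  [20.0, -3.0]])  # far outside the training cloud
anomalies = p(X_new, mu, sigma_squared) < epsilon
print(anomalies)  # expect the second point to be flagged
```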

Algorithm Evaluation

The algorithm may be evaluated using the F1 score.

The F1 score is the harmonic average of the precision and recall, where an F1 score reaches its best value at 1 (perfect precision and recall) and worst at 0.

$F_1 = 2 \cdot \frac{P \cdot R}{P + R}$

Where:

$P = \frac{tp}{tp + fp}$ - precision,

$R = \frac{tp}{tp + fn}$ - recall.

$tp$ - number of true positives.

$fp$ - number of false positives.

$fn$ - number of false negatives.
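The metric is a one-liner from the counts above. A sketch (the example counts are made up):

```python
def f1_score(tp, fp, fn):
    # Precision: fraction of flagged examples that are true anomalies.
    precision = tp / (tp + fp)
    # Recall: fraction of true anomalies that were flagged.
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall.
    return 2.0 * precision * recall / (precision + recall)

# Hypothetical evaluation counts for a chosen epsilon.
print(f1_score(tp=8, fp=2, fn=4))  # ≈ 0.727
```

Sweeping $\varepsilon$ over a range and keeping the value with the best F1 on the cross-validation set is the usual way to pick the threshold.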