Author: pseudo_Skye Updated: 2023-04-05
Time series anomaly detection is an essential task in various domains such as finance, healthcare, and security. However, the existing evaluation metrics for time series anomaly detection algorithms have limitations that can lead to misleading results. In this article, we discuss these limitations and introduce, from recent publications, new evaluation metrics that address them.
- Towards a Rigorous Evaluation of Time-series Anomaly Detection (AAAI 2022)
- Local Evaluation of Time Series Anomaly Detection Algorithms (KDD 2022)
Summary: The paper highlights that most studies apply a peculiar evaluation protocol called point adjustment (PA) before scoring. The authors reveal that the PA protocol is highly likely to inflate F1 scores and can therefore be misleading. They propose a new evaluation protocol called PA%K that provides more reliable evaluation results. (paper)
If at least one time point in a contiguous anomaly segment is detected as an anomaly, the entire segment is considered to be correctly predicted as an anomaly.
Two limitations of the current evaluation practice are discussed here: (1) PA has a high possibility of overestimating model performance, and (2) there is no baseline for relative comparison against the proposed model.
Unsupervised model - trained on normal signals only, and then used to assign anomaly scores to inputs.
(1) Reconstruction-based AD method: train a model to minimize the distance between a normal input and its reconstruction; anomalies are then the inputs that the model reconstructs poorly, yielding a large distance.
(2) Forecast-based AD method: train a model to predict the signal that follows the normal input, and take the distance between the ground truth and the predicted signal as the anomaly score (see the sketch below).
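To make the two paradigms concrete, here is a minimal NumPy sketch; the `reconstruct` and `forecast` callables are hypothetical stand-ins for trained models, and only the scoring logic follows the descriptions above.

```python
import numpy as np

def reconstruction_score(x, reconstruct):
    """Reconstruction-based: distance between the input and its reconstruction."""
    x_hat = reconstruct(x)                       # reconstruction of the input window
    return np.linalg.norm(x - x_hat, axis=-1)    # large distance => likely anomalous

def forecast_score(history, observed_next, forecast):
    """Forecast-based: distance between the observed next value and the forecast."""
    y_hat = forecast(history)                    # prediction of what should follow `history`
    return np.linalg.norm(observed_next - y_hat, axis=-1)

# Tiny runnable example with placeholder "models" just to exercise the functions:
window = np.random.randn(100, 3)                          # 100 time steps, 3 channels
rec_scores = reconstruction_score(window, lambda w: np.zeros_like(w))
fc_score = forecast_score(window, np.random.randn(3), lambda w: w[-1])
```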
PA adjusts the predicted labels as follows:
$$\hat{y}_t = \begin{cases} 1, & \text{if } A(x_t) > \delta \ \text{or}\ \big(t \in S_m \ \text{and}\ \exists_{t' \in S_m}: A(x_{t'}) > \delta\big) \\ 0, & \text{otherwise} \end{cases}$$
where $A(x_t)$ is the anomaly score at time $t$, $\delta$ is the detection threshold, and $S_m$ denotes the $m$-th contiguous anomaly segment in the ground truth.
F1 can unexpectedly underestimate the detection capability: due to incomplete test set labeling, some signals labeled as anomalies are statistically closer to normal signals, so a reasonable detector may miss them and be penalized.
Given the definitions of precision (P), recall (R), and F1, the scores after PA are computed on the adjusted predictions rather than the raw ones.
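In symbols (standard definitions; the PA subscripts are my own notation):

$$P_{PA} = \frac{TP_{PA}}{TP_{PA} + FP}, \qquad R_{PA} = \frac{TP_{PA}}{TP_{PA} + FN_{PA}}, \qquad F1_{PA} = \frac{2 \cdot P_{PA} \cdot R_{PA}}{P_{PA} + R_{PA}}$$

Since PA relabels every point of a partially detected segment as a true positive, $TP_{PA} \ge TP$ and $FN_{PA} \le FN$ while $FP$ is unchanged, so $F1_{PA} \ge F1$.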
A deep neural network is generally initialized with random weights drawn from a Gaussian distribution, so before any training its output, and hence its reconstruction error, is essentially random with respect to the input. The F1 score is then measured from the prediction of a randomly initialized reconstruction model with a simple architecture, such as an untrained autoencoder comprising a single-layer LSTM.
Alternatively, the anomaly score can be defined as the input itself.
The idea of PA%K is to apply PA to a ground-truth anomaly segment only if the ratio of correctly detected points within that segment exceeds the threshold K%.
K can be selected manually between 0 and 100 based on prior knowledge; for example, if the test set labels are reliable, a larger K is allowable. If a user wants to remove the dependency on K, it is recommended to measure the area under the curve of $F1_{PA\%K}$ obtained by varying K from 0 to 100.
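A minimal NumPy sketch of the PA%K adjustment as described above (function and variable names are mine; setting K = 0 recovers plain PA):

```python
import numpy as np

def pa_percent_k_adjust(pred, gt, k=20):
    """Apply PA%K: adjust predictions inside a ground-truth anomaly segment only if
    more than K% of the points in that segment were already detected."""
    pred = np.asarray(pred).copy()
    gt = np.asarray(gt)
    # locate contiguous ground-truth anomaly segments [start, end)
    boundaries = np.flatnonzero(np.diff(np.concatenate(([0], gt, [0]))))
    for start, end in zip(boundaries[::2], boundaries[1::2]):
        detected_ratio = 100.0 * pred[start:end].mean()
        if detected_ratio > k:            # PA is applied only above the K% threshold
            pred[start:end] = 1
    return pred

# Example: one ground-truth segment of length 10 with 3 detected points (30%)
gt   = np.array([0]*5 + [1]*10 + [0]*5)
pred = np.array([0]*5 + [1, 0, 0, 1, 0, 0, 1, 0, 0, 0] + [0]*5)
print(pa_percent_k_adjust(pred, gt, k=20))   # segment fully adjusted (30% > 20%)
print(pa_percent_k_adjust(pred, gt, k=50))   # left unchanged (30% <= 50%)
```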
No evidence is given to assure the existence of the correlation between
1. Random anomaly score
2. Input itself as an anomaly score
3. Anomaly score from the randomized model
(1) Case 2&3: window size
(2) Case 1&3: randomness controlled with 5 different seeds, reporting the mean value.
(3) No preprocessing such as removal of early time steps or downsampling.
(4) The thresholds that yielded the best score were used.
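A rough sketch of how the three baseline anomaly scores (Cases 1-3) could be produced, assuming PyTorch for the untrained single-layer LSTM reconstruction model; the architecture below is my own minimal stand-in, not the paper's exact configuration.

```python
import numpy as np
import torch
import torch.nn as nn

torch.manual_seed(0)
np.random.seed(0)

x = np.random.randn(1000, 1).astype(np.float32)   # toy univariate test signal

# Case 1: random anomaly score, irrelevant to the input
score_random = np.random.rand(len(x))

# Case 2: the raw input value itself used as the anomaly score
score_input = x.squeeze()

# Case 3: reconstruction error of an untrained (randomly initialized) single-layer LSTM model
class TinyLSTMRecon(nn.Module):
    def __init__(self, dim=1, hidden=16):
        super().__init__()
        self.encoder = nn.LSTM(dim, hidden, batch_first=True)
        self.decoder = nn.Linear(hidden, dim)

    def forward(self, x):
        h, _ = self.encoder(x)          # weights stay at their random initialization
        return self.decoder(h)

model = TinyLSTMRecon()
with torch.no_grad():
    xt = torch.from_numpy(x).unsqueeze(0)          # shape (1, T, 1)
    recon = model(xt).squeeze(0).numpy()
score_untrained = np.abs(x - recon).squeeze()
```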
The up arrow is displayed with the results for the following cases:
- Case 1: the $F1_{PA}$ of a random anomaly score appears to rival that of the SOTA methods; thus it is impossible to distinguish whether a SOTA method successfully detects anomalies or merely outputs a random anomaly score irrelevant to the input (extremely low F1 but high $F1_{PA}$: the illusion of robust TAD methods).
- Case 2&3: the F1 score depends on the length of the input window; the longer the input window, the larger the F1. Judging from the F1 scores obtained under Cases 2 and 3, the proposed SOTA methods may have obtained marginal or even no advancement over these baselines.
- The $F1_{PA\%K}$ of a well-trained model is expected to show constant results regardless of the value of K.
Existing TAD methods set the threshold after investigating the test dataset or simply use the optimal threshold that yields the best F1. Thus, the detection result depends significantly on threshold selection. Additional metrics with reduced threshold dependency, such as the area under the receiver operating characteristic curve (AUROC) or the area under the precision-recall curve (AUPR), will help in rigorous evaluation.
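For example, both threshold-free metrics can be computed directly from the raw anomaly scores with scikit-learn (the labels and scores below are illustrative only):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])                     # ground-truth anomaly labels
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.15, 0.3])   # raw anomaly scores

auroc = roc_auc_score(y_true, scores)            # threshold-free ranking quality
aupr = average_precision_score(y_true, scores)   # step-wise approximation of the PR-curve area
print(f"AUROC={auroc:.3f}  AUPR={aupr:.3f}")
```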
Summary: This paper discusses the limitations of classical precision and recall in TAD evaluation. It proposes new metrics, named 'affiliation metrics', that are theoretically principled, parameter-free, robust against adversary predictions, retain a physical meaning (as they are connected to quantities expressed in time units), and are locally interpretable (allowing troubleshooting of detection at the individual event level). (paper, code)
1. Unawareness of temporal adjacency (A)
Point-wise evaluation penalizes, without any tolerance, wrong predictions located close to the ground truth event.
2. Unawareness of the event durations (B)
Point-wise evaluation rewards detections more if the anomalous segment is longer, so longer events dominate the score.
1. Handle the limitations (A) and (B) introduced by the presence of time.
2. Parameter-free definition.
3. Expressiveness of the scores, which means a slight improvement of the predictions should result in a slight improvement of the scores.
4. Local interpretability of the scores.
5. Existence of statistical bounds of the scores.
STEP 1: Calculate the average directed distance between sets to measure how far the events are from each other (a discretized code sketch follows the worked example at the end of this step).
- The definition of the average directed distance:
$\mathrm{dist}(X,Y) = \frac{1}{|X|} \int_{x \in X} \mathrm{dist}(x,Y)\, dx$
where $\mathrm{dist}(x, Y) = \min_{y \in Y} |x - y|$ is the distance from a point $x$ to the set $Y$.
- Precision: From prediction ($Y^\prime$) to ground truth ($Y$) (left)
Here, we calculate the distance from the predicted points to the ground truth, following the definition of precision as how many of the predictions are correct. In this case, removing the FPs can largely increase the precision.
- Recall: From ground truth to prediction (right)
Here, we calculate the distance within the range of the ground truth, given the definition of recall as how many true anomalies are correctly detected.
In this case, removing the FPs wouldn't change the results of recall.
- Zone of affiliation: Assign each time stamp along the time axis to the nearest ground truth event ($gt_j$); the time axis is then partitioned into several intervals $I_j$.
- Individual precision: $D_{p_j} = \mathrm{dist} \left( Y^\prime \cap I_j, gt_j \right)$
- Individual recall: $D_{r_j} = \mathrm{dist} \left( gt_j, Y^\prime \cap I_j \right)$
- Example: the time axis has been partitioned into 3 zones of affiliation named $I_1$ (discussed in STEP 1), $I_2$ and $I_3$. Thus, we have $D_{p_1} = 18\,\mathrm{s}$ and $D_{r_1} = 76.5\,\mathrm{s}$. Similarly, the individual precision and recall of the other zones are $D_{p_2} = 11\,\mathrm{min}\,30\,\mathrm{s}$, $D_{r_2} = 2\,\mathrm{min}\,30\,\mathrm{s}$, $D_{p_3} = 31\,\mathrm{min}\,15\,\mathrm{s}$, and $D_{r_3} = 2\,\mathrm{min}\,30\,\mathrm{s}$. The individual distances provide a summarized view of each ground truth event, expressed in a meaningful time unit.
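A discretized sketch of STEP 1 on hypothetical events (integer time stamps instead of continuous time; the events and numbers below are made up and do not reproduce the figure):

```python
import numpy as np

def affiliation_zones(T, gt_events):
    """Assign every time stamp 0..T-1 to the index of its nearest ground-truth event."""
    centers = [np.array(sorted(e)) for e in gt_events]
    d = np.array([[np.min(np.abs(c - t)) for t in range(T)] for c in centers])
    return d.argmin(axis=0)                                # zone index per time stamp

def directed_dist(X, Y):
    """Average directed distance from set X to set Y (asymmetric in general)."""
    X, Y = np.asarray(sorted(X)), np.asarray(sorted(Y))
    return np.mean([np.min(np.abs(Y - x)) for x in X]) if len(X) else np.nan

T = 60
gt_events = [set(range(10, 15)), set(range(40, 45))]       # two hypothetical gt events
pred = set(range(12, 18)) | {50}                           # hypothetical predictions
zones = affiliation_zones(T, gt_events)

for j, gt_j in enumerate(gt_events):
    I_j = {t for t in range(T) if zones[t] == j}           # zone of affiliation of gt_j
    D_p = directed_dist(pred & I_j, gt_j)                  # individual precision distance
    D_r = directed_dist(gt_j, pred & I_j)                  # individual recall distance
    print(f"zone {j}: D_p={D_p}, D_r={D_r}")
```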
It would be desirable to normalize each individual value to the [0, 1] range, by assuming that each ground truth event is equally important.
- Individual precision probability: For the precision, the distance from a point $X$ sampled at random within the affiliation zone to the ground truth $gt_j$ is a random variable with a cumulative distribution function $F_{p_j}$. The survival function is given by $\bar{F}_{p_j}(d) = 1 - F_{p_j}(d^-)$. Thus, the individual precision probability is given by
$$P_{\text{precision}_j}=\frac{1}{\left|\text{pred} \cap I_j\right|} \int_{x \in \text{pred} \cap I_j} \bar{F}_{p_j}\left(\mathrm{dist}\left(x, \mathrm{gt}_j\right)\right) d x$$
Explanation: the cumulative distribution function answers the question: if we randomly sample a point within the affiliation zone, what is the probability that its distance to the ground truth event $gt_j$ is at most $d$? The survival function is therefore the probability that such a random point lies at least as far away as $d$, so a prediction that is closer than most random points receives an individual precision probability close to 1.
- Individual recall probability: Similarly, we have
$$P_{\text{recall}_j}=\frac{1}{\left|\mathrm{gt}_j\right|} \int_{y \in \mathrm{gt}_j} \bar{F}_{y, \text{recall}_j}\left(\mathrm{dist}\left(y, \text{pred} \cap I_j\right)\right) d y$$
Explanation: for recall, for each ground truth point $y$, the cumulative distribution function describes the distance from $y$ to a point sampled at random within the affiliation zone; the survival function $\bar{F}_{y,\text{recall}_j}$ evaluated at $\mathrm{dist}(y, \text{pred} \cap I_j)$ is thus the probability that a random prediction would lie at least as far from $y$ as the actual prediction does, where $\text{pred} \cap I_j$ denotes the predicted points falling inside the affiliation zone $I_j$.
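A rough discretized approximation of these probabilities (uniform sampling over integer time stamps; the continuous integrals and the $d^-$ subtlety are ignored, and the events are hypothetical):

```python
import numpy as np

def survival(samples, d):
    """Empirical P(D >= d) over the sampled distances."""
    return np.mean(np.asarray(samples) >= d)

def individual_precision_prob(pred_in_zone, gt_j, zone):
    # distances from a uniformly random zone point to the gt event vs. observed distances
    random_d = [min(abs(g - t) for g in gt_j) for t in zone]
    obs_d = [min(abs(g - x) for g in gt_j) for x in pred_in_zone]
    return np.mean([survival(random_d, d) for d in obs_d])

def individual_recall_prob(pred_in_zone, gt_j, zone):
    probs = []
    for y in gt_j:
        random_d = [abs(y - t) for t in zone]          # distance from y to a random zone point
        d = min(abs(y - x) for x in pred_in_zone)      # distance from y to the prediction
        probs.append(survival(random_d, d))
    return np.mean(probs)

zone = range(0, 28)                      # hypothetical zone of affiliation
gt_1 = range(10, 15)                     # hypothetical ground-truth event
pred_1 = {12, 13, 14, 15, 16, 17}        # predictions falling inside the zone
print(individual_precision_prob(pred_1, gt_1, zone),
      individual_recall_prob(pred_1, gt_1, zone))
```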
From my perspective, the probability function here is not given by the possibility of sampling a random point of
The final precision/recall are defined as the mean over the individual probabilities computed for each zone of affiliation.
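In symbols (my notation, following my reading of the paper: the individual precision is undefined, and therefore skipped, for zones that contain no prediction):

$$P_{\text{precision}} = \frac{1}{|\mathcal{J}|} \sum_{j \in \mathcal{J}} P_{\text{precision}_j}, \qquad P_{\text{recall}} = \frac{1}{n} \sum_{j=1}^{n} P_{\text{recall}_j}$$

where $\mathcal{J} = \{ j : \text{pred} \cap I_j \neq \emptyset \}$ and $n$ is the number of ground truth events.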
- As for algorithmic comparisons, the affiliation metrics are expressive enough to perform a fair comparison without the need to introduce additional parameters.
- They assess the proximity between the predicted and the ground truth labels, which is the primary aspect for comparing anomaly detection algorithms from a research point of view.