🐘 The Elephant in the Room:
Towards A Reliable Time-series Anomaly Detection Benchmark

The Ohio State University
NeurIPS 2024 D&B


Example time series from TSB-AD, with anomalies highlighted in red.

Abstract

Time-series anomaly detection is a fundamental task across scientific fields and industries. However, the field has long faced an "🐘 elephant in the room": critical issues, including flawed datasets, biased evaluation measures, and inconsistent benchmarking practices, that have remained largely ignored and unaddressed. We introduce TSB-AD to systematically tackle these issues in three aspects: (i) Dataset Integrity: with 1070 high-quality time series from a diverse collection of 40 datasets (doubling the size of the largest existing collection and quadrupling the number of existing curated datasets), we provide the first large-scale, heterogeneous, meticulously curated dataset that combines the effort of human perception and model interpretation; (ii) Measure Reliability: by revealing issues and biases in evaluation measures, we identify VUS-PR as the most reliable and accurate measure for time-series anomaly detection, addressing concerns raised by the community; and (iii) Comprehensive Benchmarking: with a broad spectrum of 40 detection algorithms, from statistical methods to the latest foundation models, we perform a comprehensive evaluation that includes thorough hyperparameter tuning and a unified setup for a fair and reproducible comparison. Our findings challenge the conventional wisdom regarding the superiority of advanced neural network architectures, revealing that simpler architectures and statistical methods often yield better performance. The promising performance of neural networks on multivariate cases and of foundation models on point anomalies highlights the need for further advancements in these methods.
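VUS-PR, the measure singled out above, is described in the VUS paper (Paparrizos et al., 2022, listed in the BibTeX section). Rather than scoring precision and recall against a single, strict anomaly boundary, it summarizes them across a range of tolerance buffers around each labeled anomaly.

The Python snippet is a minimal, simplified sketch of that idea. It is not the official TSB-AD or VUS implementation. It makes these assumptions:

- Label weight decays linearly inside the buffer.
- True positives are weighted by these soft labels.
- A step-wise area under the PR curve is computed for each buffer length, and the results are averaged over buffer lengths 0 to `max_buffer`.

The function names and the `max_buffer` parameter are hypothetical. The official definition differs in its details, for example in how the buffer is weighted and how range-level recall is computed.

```python
import numpy as np


def soft_labels(labels, buffer):
    """Extend binary labels with a margin of `buffer` points on each side of
    every anomalous point. Weights decay linearly with distance
    (an assumption made for this sketch)."""
    labels = np.asarray(labels, dtype=float)
    soft = labels.copy()
    anomalous = np.flatnonzero(labels > 0)
    if buffer == 0 or anomalous.size == 0:
        return soft
    for i in np.flatnonzero(labels == 0):
        dist = np.min(np.abs(anomalous - i))  # distance to nearest anomaly
        if dist <= buffer:
            soft[i] = 1.0 - dist / (buffer + 1)
    return soft


def soft_average_precision(scores, soft):
    """Step-wise area under the precision-recall curve, with true positives
    weighted by the soft labels."""
    scores = np.asarray(scores, dtype=float)
    ap, prev_recall = 0.0, 0.0
    for t in np.unique(scores)[::-1]:  # thresholds from high to low
        pred = scores >= t
        tp = np.sum(soft[pred])
        precision = tp / np.sum(pred)
        recall = tp / np.sum(soft)
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


def vus_pr_sketch(scores, labels, max_buffer=10):
    """Average the soft average precision over buffer lengths
    0..max_buffer, a discretized stand-in for the volume under the
    PR surface."""
    if not np.any(np.asarray(labels) > 0):
        raise ValueError("labels must contain at least one anomalous point")
    return float(np.mean([
        soft_average_precision(scores, soft_labels(labels, b))
        for b in range(max_buffer + 1)
    ]))


# Usage: one labeled anomaly range and two toy detectors.
labels = np.zeros(100)
labels[40:50] = 1
print(vus_pr_sketch(labels, labels, max_buffer=0))  # 1.0
print(vus_pr_sketch(labels, labels), vus_pr_sketch(1 - labels, labels))
```

With a single buffer length of 0, the sketch reduces to ordinary average precision, so a detector that reproduces the labels exactly scores 1.0. The inverted detector `1 - labels` scores far lower. Averaging over several buffer lengths makes the score less sensitive to small misalignments between predicted and labeled anomaly boundaries.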

📈 TSB-AD-U Leaderboard

📈 TSB-AD-M Leaderboard

Summary of Datasets

Dataset | Description | Data Source | License
UCR | A collection of univariate time series from multiple domains, including air temperature, arterial blood pressure, astronomy, ECG, and more. Most anomalies are introduced artificially. | Website | None
NAB | Labeled real-world and artificial time series, including AWS server metrics, online advertisement click rates, real-time traffic data, and Twitter mentions of publicly traded companies. | Website | GPL
YAHOO | A dataset published by Yahoo Labs, consisting of real and synthetic time series based on production traffic to Yahoo systems. | Website | See website for details
IOPS | Performance indicators reflecting the scale and quality of web services and the health status of machines. | Website | None
MGAB | Mackey-Glass time series, where anomalies exhibit chaotic behavior that is difficult for the human eye to distinguish. | Website | CC0-1.0
WSD | A web service dataset containing real-world KPIs collected from large Internet companies. | Website | None
SED | Simulated engine disk data from the NASA Rotary Dynamics Laboratory, representing disk revolutions recorded over several runs at 3K rpm. | Website | None
TODS | A synthetic dataset comprising global, contextual, shapelet, seasonal, and trend anomalies. | Website | Apache-2.0
NEK | Data collected from real production network equipment. | Website | None
Stock | Stock trading traces containing one million transaction records throughout the trading hours of a day. | Website | None
Power | Power consumption of a Dutch research facility over the entire year of 1997. | Website | None
GHL | Status readings, such as temperature and level, of 3 reservoirs. Anomalies indicate changes in maximum temperature or pump frequency. | Website | None
Daphnet | Annotated readings of 3 acceleration sensors at the hip and leg of Parkinson's disease patients who experience freezing of gait (FoG) during walking tasks. | Website | CC BY 4.0
Exathlon | Real data traces collected from a Spark cluster over 2.5 months. For each anomaly, ground-truth labels are provided for both the root-cause interval and the corresponding effect interval. | Website | Apache-2.0
Genesis | Data from a portable pick-and-place demonstrator that uses an air tank to supply all the gripping and storage units. | Website | CC BY-NC-SA 4.0
OPP | Readings of motion sensors recorded while users performed typical daily activities, originally devised to benchmark human activity recognition algorithms (e.g., classification, automatic data segmentation, sensor fusion, and feature extraction). | Website | CC BY 4.0
SMD | A 5-week-long dataset collected from a large Internet company, containing 3 groups of entities from 28 different machines. | Website | MIT
SWaT | A secure water treatment dataset collected from 51 sensors and actuators, where anomalies represent abnormal behaviors under attack scenarios. | Website | Request form required
PSM | Data collected internally from multiple application server nodes at eBay. | Website | CC 4.0
SMAP | Real spacecraft telemetry data with anomalies from the Soil Moisture Active Passive (SMAP) satellite. Each time series contains one feature representing a sensor measurement, while the remaining features represent binary-encoded commands. | Website | Caltech
MSL | Spacecraft telemetry data collected from the Curiosity rover of the Mars Science Laboratory (MSL) mission. | Website | Caltech
CreditCard | An intrusion detection evaluation dataset consisting of labeled network flows, including full packet payloads in pcap format and the corresponding profiles. | Website | None
GECCO | A water quality dataset used in a competition for online anomaly detection of drinking water quality. | Website | CC BY 4.0
MITDB | 48 half-hour excerpts of two-channel ambulatory ECG recordings, obtained from 47 subjects studied by the BIH Arrhythmia Laboratory between 1975 and 1979. | Website | Open Data Commons Attribution License v1.0
SVDB | 78 half-hour ECG recordings chosen to supplement the examples of supraventricular arrhythmias in the MIT-BIH Arrhythmia Database. | Website | Open Data Commons Attribution License v1.0
LTDB | 7 long-duration ECG recordings (14 to 22 hours each) with manually reviewed beat annotations. | Website | Open Data Commons Attribution License v1.0
CATSv2 | The second version of the Controlled Anomalies Time Series (CATS) dataset, consisting of commands, external stimuli, and telemetry readings of a simulated complex dynamical system with 200 injected anomalies. | Website | CC BY 4.0


BibTeX

@inproceedings{liu2024elephant,
  title={The Elephant in the Room: Towards A Reliable Time-Series Anomaly Detection Benchmark},
  author={Liu, Qinghua and Paparrizos, John},
  booktitle={NeurIPS 2024},
  year={2024}
}

@article{paparrizos2022tsb,
  title={{TSB-UAD}: An End-to-End Benchmark Suite for Univariate Time-Series Anomaly Detection},
  author={Paparrizos, John and Kang, Yuhao and Boniol, Paul and Tsay, Ruey S and Palpanas, Themis and Franklin, Michael J},
  journal={Proceedings of the VLDB Endowment},
  volume={15},
  number={8},
  pages={1697--1711},
  year={2022},
  publisher={VLDB Endowment}
}

@article{paparrizos2022volume,
  title={{Volume Under the Surface: A New Accuracy Evaluation Measure for Time-Series Anomaly Detection}},
  author={Paparrizos, John and Boniol, Paul and Palpanas, Themis and Tsay, Ruey S and Elmore, Aaron and Franklin, Michael J},
  journal={Proceedings of the VLDB Endowment},
  volume={15},
  number={11},
  pages={2774--2787},
  year={2022},
  publisher={VLDB Endowment}
}