Time-series anomaly detection is a fundamental task across scientific fields and industries. However, the field has long faced the "🐘 elephant in the room": critical issues, including flawed datasets, biased evaluation measures, and inconsistent benchmarking practices, that have remained largely ignored and unaddressed. We introduce TSB-AD to systematically tackle these issues in three aspects: (i) Dataset Integrity: with 1070 high-quality time series from a diverse collection of 40 datasets (twice the size of the largest existing collection and four times the number of existing curated datasets), we provide the first large-scale, heterogeneous, meticulously curated dataset that combines human perception and model interpretation; (ii) Measure Reliability: by revealing issues and biases in evaluation measures, we identify the most reliable and accurate measure for time-series anomaly detection, namely VUS-PR, to address concerns from the community; and (iii) Comprehensive Benchmarking: with a broad spectrum of 40 detection algorithms, from statistical methods to the latest foundation models, we perform a comprehensive evaluation that includes thorough hyperparameter tuning and a unified setup for a fair and reproducible comparison. Our findings challenge the conventional wisdom regarding the superiority of advanced neural network architectures, revealing that simpler architectures and statistical methods often yield better performance. The promising performance of neural networks on multivariate cases and of foundation models on point anomalies highlights the need for further advancements in these methods.
Dataset | Description | Data Source | License |
---|---|---|---|
UCR | A collection of univariate time series of multiple domains, including air temperature, arterial blood pressure, astronomy, ECG, and more. Most anomalies are introduced artificially. | Website | None |
NAB | Labeled real-world and artificial time series, including AWS server metrics, online advertisement click rates, real-time traffic data, and Twitter mentions of publicly traded companies. | Website | GPL |
YAHOO | A dataset published by Yahoo Labs, consisting of real and synthetic time series based on production traffic to Yahoo systems. | Website | See details in Website |
IOPS | A dataset with performance indicators reflecting the scale, quality of web services, and machine health status. | Website | None |
MGAB | Mackey-Glass time series, where anomalies exhibit chaotic behavior that is challenging for the human eye to distinguish. | Website | CC0-1.0 |
WSD | A web service dataset containing real-world KPIs collected from large Internet companies. | Website | None |
SED | Simulated engine disk data from the NASA Rotary Dynamics Laboratory, representing disk revolutions recorded over several runs (3K rpm speed). | Website | None |
TODS | A synthetic dataset comprising global, contextual, shapelet, seasonal, and trend anomalies. | Website | Apache-2.0 |
NEK | Collected from real production network equipment. | Website | None |
Stock | A dataset of stock trading traces, containing one million transaction records throughout the trading hours of a day. | Website | None |
Power | Power consumption of a Dutch research facility for the entire year of 1997. | Website | None |
GHL | Contains the status of 3 reservoirs, such as temperature and level. Anomalies indicate changes in max temperature or pump frequency. | Website | None |
Daphnet | Contains annotated readings of 3 acceleration sensors at the hip and leg of Parkinson's disease patients who experience freezing of gait (FoG) during walking tasks. | Website | CC BY 4.0 |
Exathlon | Based on real data traces collected from a Spark cluster over 2.5 months. For each anomaly, ground truth labels are provided for both the root cause interval and the corresponding effect interval. | Website | Apache-2.0 |
Genesis | Collected from a portable pick-and-place demonstrator that uses an air tank to supply all the gripping and storage units. | Website | CC BY-NC-SA 4.0 |
OPP | Devised to benchmark human activity recognition algorithms (e.g., classification, automatic data segmentation, sensor fusion, and feature extraction). It comprises readings of motion sensors recorded while users executed typical daily activities. | Website | CC BY 4.0 |
SMD | A 5-week-long dataset collected from a large Internet company, containing 3 groups of entities from 28 different machines. | Website | MIT |
SWaT | A secure water treatment dataset collected from 51 sensors and actuators, where anomalies represent abnormal behaviors under attack scenarios. | Website | Needs request form |
PSM | A dataset collected internally from multiple application server nodes at eBay. | Website | CC 4.0 |
SMAP | Real spacecraft telemetry data with anomalies from the Soil Moisture Active Passive satellite. It contains time series with one feature representing a sensor measurement, while the rest represent binary encoded commands. | Website | Caltech |
MSL | Collected from the Curiosity rover on Mars. | Website | Caltech |
CreditCard | An intrusion detection evaluation dataset consisting of labeled network flows, including full packet payloads in pcap format, the corresponding profiles, and the labeled flows. | Website | None |
GECCO | A water quality dataset used in a competition for online anomaly detection of drinking water quality. | Website | CC BY 4.0 |
MITDB | Contains 48 half-hour excerpts of two-channel ambulatory ECG recordings, obtained from 47 subjects studied by the BIH Arrhythmia Laboratory between 1975 and 1979. | Website | Open Data Commons Attribution License v1.0 |
SVDB | Includes 78 half-hour ECG recordings chosen to supplement the examples of supraventricular arrhythmias in the MIT-BIH Arrhythmia Database. | Website | Open Data Commons Attribution License v1.0 |
LTDB | A collection of 7 long-duration ECG recordings (14 to 22 hours each), with manually reviewed beat annotations. | Website | Open Data Commons Attribution License v1.0 |
CATSv2 | The second version of the Controlled Anomalies Time Series (CATS) Dataset, which consists of commands, external stimuli, and telemetry readings of a simulated complex dynamical system with 200 injected anomalies. | Website | CC BY 4.0 |
@inproceedings{liu2024elephant,
title={The Elephant in the Room: Towards A Reliable Time-Series Anomaly Detection Benchmark},
author={Liu, Qinghua and Paparrizos, John},
booktitle={NeurIPS 2024},
year={2024}
}
@article{paparrizos2022tsb,
title={{TSB-UAD: An End-to-End Benchmark Suite for Univariate Time-Series Anomaly Detection}},
author={Paparrizos, John and Kang, Yuhao and Boniol, Paul and Tsay, Ruey S and Palpanas, Themis and Franklin, Michael J},
journal={Proceedings of the VLDB Endowment},
volume={15},
number={8},
pages={1697--1711},
year={2022},
publisher={VLDB Endowment}
}
@article{paparrizos2022volume,
title={{Volume Under the Surface: A New Accuracy Evaluation Measure for Time-Series Anomaly Detection}},
author={Paparrizos, John and Boniol, Paul and Palpanas, Themis and Tsay, Ruey S and Elmore, Aaron and Franklin, Michael J},
journal={Proceedings of the VLDB Endowment},
volume={15},
number={11},
pages={2774--2787},
year={2022},
publisher={VLDB Endowment}
}