CREME: A toolchain of automatic dataset collection for machine learning in intrusion detection

Huu Khoi Bui, Ying Dar Lin, Ren Hung Hwang*, Po Ching Lin, Van Linh Nguyen, Yuan Cheng Lai

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

8 Scopus citations


Intrusion detection is one of the most common approaches to addressing security attacks in modern networks. However, as attack behaviors grow more diverse, efficient detection becomes more challenging. Machine learning (ML) has recently emerged as one of the most promising techniques for improving the detection accuracy of intrusion detection systems (IDS). With ML-based approaches, a high-quality training dataset is key to achieving high detection performance. Unfortunately, there are few methods for assessing dataset quality, particularly for ML training. This work presents an automated toolchain, termed CREME (Configuration, REproduction, Multi-dataset, and Evaluation), to generate a dataset and measure its quality and efficiency. CREME integrates various tools to automate every stage: configuration, attack and benign behavior reproduction, data collection, feature extraction, data labeling, and evaluation. CREME can also automatically collect and generate a dataset from multiple sources such as accounting, network traffic, and system logs. Compared with the available datasets in the same category, experimental results show that the datasets generated by CREME contribute up to 20% better performance to ML-based IDS in terms of coverage. They are also significantly more efficient than most other datasets. The CREME source code is available at

Original language: English
Article number: 103212
Journal: Journal of Network and Computer Applications
State: Published - 1 Nov 2021


  • Dataset evaluation
  • Dataset generation
  • Dataset toolchain
  • Intrusion detection
  • Machine learning
  • Multiple data sources
  • Security dataset

