Multi-datasource machine learning in intrusion detection: Packet flows, system logs and host statistics

Ying Dar Lin, Ze Yu Wang, Po Ching Lin*, Van Linh Nguyen, Ren Hung Hwang, Yuan Cheng Lai

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

8 Scopus citations


This work compares, in depth, how different combinations of data sources perform for intrusion detection. To learn to distinguish normal from malicious behavior, we train three typical machine learning models on three kinds of datasets: system logs, packet flows and host statistics. Unlike other studies, ours captures and monitors behavior from multiple data sources in order to detect security attacks. Our aim is to determine how to build the most effective dataset for machine learning by combining multiple sources. Because no existing dataset has been generated from multiple sources for given attacks, we show how to build and generate a dataset with three data sources. We then compare the F1-scores of detection when machine learning algorithms are applied to various combinations of the data sources. Our evaluation results show that the host statistics dataset performs better (0.91) than packet flows (0.63) and system logs (0.44): it has the highest average F1-score across the three stages of attacks, whereas the other datasets may have poor F1-scores in some stages, particularly the impact stage. However, in the initial access stage of attacks, the log dataset performs best (0.94), and packet flows are well suited to detecting network DoS attacks (0.82). Furthermore, running detection with all three data sources incurs a minor overhead of at most 2.1% CPU utilization. Finally, we analyze the important features of each model, such as the number of logs generated by apache-access, in.telnetd and postfix in the log dataset, SrcBytes and TotBytes in the flow dataset, and MINFLT, VSTEXT and RSIZE in the statistics dataset.
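As a rough illustration of the comparison described in the abstract, the sketch below computes a binary F1-score and enumerates the candidate combinations of the three data sources. It is not the authors' code. The names `SOURCES`, `f1_score` and `source_combinations` are hypothetical, labels are assumed to be 1 for malicious and 0 for normal, and treating every non-empty subset of sources as a candidate is an assumption, since the abstract only says "various combinations".

```python
from itertools import combinations

# Hypothetical short names for the three data sources named in the abstract.
SOURCES = ("logs", "flows", "statistics")


def f1_score(y_true, y_pred):
    """Binary F1 = 2PR / (P + R), assuming label 1 = malicious, 0 = normal."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # attacks caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # attacks missed
    if tp == 0:
        # No true positives: precision and recall are 0 or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def source_combinations(sources=SOURCES):
    """All non-empty subsets of the sources (an assumed candidate set)."""
    return [
        combo
        for size in range(1, len(sources) + 1)
        for combo in combinations(sources, size)
    ]


if __name__ == "__main__":
    for combo in source_combinations():
        print("+".join(combo))
```

In an actual evaluation, each combination would be used to select the feature columns for training, and `f1_score` would be applied to that model's predictions on held-out data, separately for each attack stage.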

Original language: English
Article number: 103248
Journal: Journal of Information Security and Applications
State: Published - Aug 2022


Keywords

  • Feature engineering
  • Host statistics
  • Intrusion detection
  • Machine learning
  • System log
  • Traffic flow

