TY - JOUR
T1 - Multi-datasource machine learning in intrusion detection
T2 - Packet flows, system logs and host statistics
AU - Lin, Ying Dar
AU - Wang, Ze Yu
AU - Lin, Po Ching
AU - Nguyen, Van Linh
AU - Hwang, Ren Hung
AU - Lai, Yuan Cheng
N1 - Publisher Copyright:
© 2022 Elsevier Ltd
PY - 2022/8
Y1 - 2022/8
N2 - This work compares the performance of different combinations of data sources for intrusion detection in depth. To learn and distinguish between normal and malicious behavior, we use machine learning algorithms and train three typical models on three kinds of datasets: system logs, packet flows and host statistics. Unlike other studies, our study captures and monitors the behavior from multiple data sources in order to catch security attacks. Our aim is to figure out how to build the most effective dataset for machine learning with a combination of multiple sources. However, since there are no such datasets which have been generated from multiple sources for given attacks, we show how to build and generate a dataset with three data sources. We then compare the F1 score of the detection by applying machine learning algorithms for various combinations of the data sources. Our evaluation results show that the dataset of host statistics results in better performance (0.91) than traffic flows (0.63) and system logs (0.44) because it has the highest average F1-score in the three stages of attacks, while the other datasets may have poor F1-scores in some of the stages, particularly in the stage of impact. However, in the initial access stage of attacks, the dataset of logs performs the best (0.94), and the packet flows are suitable for detecting network DoS attacks (0.82). Furthermore, running this detection with all three data sources results in minor overheads of at most 2.1% CPU utilization. Finally, we analyze the important features of each model, such as the number of logs generated by apache-access, in.telnetd and postfix in the dataset of logs, SrcBytes and TotBytes in the dataset of flows, and MINFLT, VSTEXT and RSIZE in the dataset of statistics.
AB - This work compares the performance of different combinations of data sources for intrusion detection in depth. To learn and distinguish between normal and malicious behavior, we use machine learning algorithms and train three typical models on three kinds of datasets: system logs, packet flows and host statistics. Unlike other studies, our study captures and monitors the behavior from multiple data sources in order to catch security attacks. Our aim is to figure out how to build the most effective dataset for machine learning with a combination of multiple sources. However, since there are no such datasets which have been generated from multiple sources for given attacks, we show how to build and generate a dataset with three data sources. We then compare the F1 score of the detection by applying machine learning algorithms for various combinations of the data sources. Our evaluation results show that the dataset of host statistics results in better performance (0.91) than traffic flows (0.63) and system logs (0.44) because it has the highest average F1-score in the three stages of attacks, while the other datasets may have poor F1-scores in some of the stages, particularly in the stage of impact. However, in the initial access stage of attacks, the dataset of logs performs the best (0.94), and the packet flows are suitable for detecting network DoS attacks (0.82). Furthermore, running this detection with all three data sources results in minor overheads of at most 2.1% CPU utilization. Finally, we analyze the important features of each model, such as the number of logs generated by apache-access, in.telnetd and postfix in the dataset of logs, SrcBytes and TotBytes in the dataset of flows, and MINFLT, VSTEXT and RSIZE in the dataset of statistics.
KW - Feature engineering
KW - Host statistics
KW - Intrusion detection
KW - Machine learning
KW - System log
KW - Traffic flow
UR - http://www.scopus.com/inward/record.url?scp=85133542989&partnerID=8YFLogxK
U2 - 10.1016/j.jisa.2022.103248
DO - 10.1016/j.jisa.2022.103248
M3 - Article
AN - SCOPUS:85133542989
SN - 2214-2134
VL - 68
JO - Journal of Information Security and Applications
JF - Journal of Information Security and Applications
M1 - 103248
ER -