Towards a Utopia of Dataset Sharing: A Case Study on Machine Learning-based Malware Detection Algorithms

Ping Jui Chuang, Chih Fan Hsu, Yung Tien Chu, Szu Chun Huang, Chun Ying Huang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Working with a high-quality (complete and up-to-date) dataset is the key to building a good machine learning model, especially in security research areas. However, it is not easy for security research communities to collect a good-quality dataset because most security datasets are sensitive in nature. We believe that having more contributors share up-to-date samples would increase the quality of datasets. Therefore, this study aims to increase security dataset sharing for research communities by eliminating possible information leakage. We propose a dataset sharing model and its core algorithm, FeatureTransformer, which guarantees that no sensitive information leaks from a shared dataset. FeatureTransformer transforms extracted raw features into intermediate features that conceal sensitive information. Meanwhile, models built from transformed features perform similarly to models built from the original raw features. We show the effectiveness of our model by evaluating FeatureTransformer on typical malware classification problems using (1) traditional machine learning classifiers and (2) neural network-based classifiers. The experimental results show that models trained with transformed features suffer only 2.56% and 1.48% accuracy degradation on the investigated problems. This indicates that models validated on datasets processed by FeatureTransformer work well with the original raw (untransformed) datasets. We believe that our privacy-preserving model can stimulate dataset sharing and advance the development of machine learning approaches to solving security problems.

Original language: English
Title of host publication: ASIA CCS 2022 - Proceedings of the 2022 ACM Asia Conference on Computer and Communications Security
Publisher: Association for Computing Machinery, Inc
Pages: 479-493
Number of pages: 15
ISBN (Electronic): 9781450391405
State: Published - 30 May 2022
Event: 17th ACM ASIA Conference on Computer and Communications Security 2022, ASIA CCS 2022 - Virtual, Online, Japan
Duration: 30 May 2022 to 3 Jun 2022

Publication series

Name: ASIA CCS 2022 - Proceedings of the 2022 ACM Asia Conference on Computer and Communications Security

Conference

Conference: 17th ACM ASIA Conference on Computer and Communications Security 2022, ASIA CCS 2022
Country/Territory: Japan
City: Virtual, Online
Period: 30/05/22 to 3/06/22

Keywords

  • dataset sharing
  • machine learning
  • malware classification
  • reproducible research
