TY - JOUR
T1 - Big data platform configuration using machine learning
AU - Yeh, Chao Chun
AU - Lu, Han Lin
AU - Zhou, Jiazheng
AU - Chang, Sheng An
AU - Lin, Xuan Yi
AU - Sun, Yi Chiao
AU - Huang, Shih Kun
N1 - Publisher Copyright:
© 2020 Institute of Information Science. All rights reserved.
PY - 2020/5
Y1 - 2020/5
N2 - By building and maintaining complex big data platform architectures, data engineers provide data scientists and analysts with the computational and storage infrastructure needed for their research. With such support, data scientists can focus on their domain problems and design the required intelligent modules (i.e., preparing the data; selecting, training, and tuning the machine-learning modules; and validating the results). However, gaps remain between system engineering teams and data science/engineering teams. System engineers generally have limited knowledge of the application domains and the purposes of an analytical program; conversely, data scientists and engineers are usually unfamiliar with the configuration of the computational system, file system, and database. Yet an application's performance can be strongly affected by the system's configuration, and data scientists and engineers have little information about which system properties influence that performance. For Internet-scale applications with thousands of computing nodes or billions of Internet of Things devices, even a slight improvement can have an enormous impact on energy management and environmental protection. To bridge this gap, we propose the concept of a configuration layer on top of a big data platform, Hadoop. We built a configuration tuner, BigExplorer, to collect and preprocess data and to generate golden configurations for performance improvement. Based on the processed data, we used a semi-automatic feature engineering technique to provide more features for data engineers and developed performance models using three machine learning algorithms (i.e., random forest, gradient boosting machine, and support vector machine). On the commonly used WordCount, TeraSort, and Pig benchmark workloads, our configuration tuner achieved a significant performance improvement of 28%-51% over the rule-of-thumb configuration.
KW - Algorithms
KW - Big data platform
KW - Configuration optimization
KW - Learning by design
KW - Machine learning
UR - http://www.scopus.com/inward/record.url?scp=85093889498&partnerID=8YFLogxK
U2 - 10.6688/JISE.202005_36(3).0001
DO - 10.6688/JISE.202005_36(3).0001
M3 - Article
AN - SCOPUS:85093889498
SN - 1016-2364
VL - 36
SP - 469
EP - 493
JO - Journal of Information Science and Engineering
JF - Journal of Information Science and Engineering
IS - 3
ER -