Big data platform configuration using machine learning

Chao Chun Yeh, Han Lin Lu, Jiazheng Zhou, Sheng An Chang, Xuan Yi Lin, Yi Chiao Sun, Shih Kun Huang

Research output: Contribution to journal, Article, peer-review

1 Scopus citation


By ensuring that complex big data platform architectures are well developed, data engineers provide data scientists and analysts with infrastructure that offers the computational and storage resources needed for their research. With this support, data scientists can focus on their domain problems and design the required intelligent modules (i.e., prepare the data; select, train, and tune the machine-learning modules; and validate the results). However, gaps remain between system engineering teams and data science/engineering teams. Generally, system engineers have limited knowledge of the application domains and the purposes of an analytical program. Conversely, data scientists and data engineers are usually unfamiliar with the configuration of a computational system, file system, and database. Yet an application's performance can be affected by the system's configuration, and data scientists and engineers have little information about which system properties affect that performance. For example, for Internet-scale applications that have thousands of computing nodes or billions of Internet of Things devices, even a slight improvement may have an enormous influence on energy management and environmental protection. To bridge this gap, we proposed the concept of a configuration layer built on a big data platform, Hadoop. We built a configuration tuner, BigExplorer, to collect and preprocess data, and we created golden configurations for performance improvement. Based on the processed data, we used a semi-automatic feature engineering technique to provide more features for data engineers and developed the performance model using three different machine learning algorithms (i.e., random forest, gradient boosting machine, and support vector machine).
On the commonly used WordCount, TeraSort, and Pig benchmark workloads, our configuration tuner improved performance by 28%-51%, depending on the workload, compared with the rule-of-thumb configuration.

Original language: English
Pages (from-to): 469-493
Number of pages: 25
Journal: Journal of Information Science and Engineering
Issue number: 3
State: Published - May 2020


  • Algorithms
  • Big data platform
  • Configuration optimization
  • Learning by design
  • Machine learning

