TY - GEN
T1 - BigExplorer
T2 - 2016 Conference on Technologies and Applications of Artificial Intelligence, TAAI 2016
AU - Yeh, Chao Chun
AU - Zhou, Jiazheng
AU - Chang, Sheng An
AU - Lin, Xuan Yi
AU - Sun, Yichiao
AU - Huang, Shih-Kun
N1 - Publisher Copyright:
© 2016 IEEE.
PY - 2017/3/16
Y1 - 2017/3/16
N2 - As big data platform architectures grow more complex, data engineers provide the infrastructure, with its computation and storage resources, for data scientists and data analysts. With this support, data scientists can focus on their domain problems and design the intelligence module (e.g., prepare the data, select/train/tune the machine learning models, and validate the results). However, there is still a gap between the system engineering team and the data scientist/engineer team. System engineers have no knowledge of the application domain or the purpose of the analytic program. Data scientists and data engineers do not know the configuration of the computation system, file system, and database. Some application performance issues are related to system configuration, yet data scientists and data engineers lack information and knowledge about the system properties. In this paper, we propose a configuration layer for the current big data platform (i.e., Hadoop) and build a configuration recommendation system that collects and pre-processes data. Based on the processed data, we use semi-automatic feature engineering to provide features for data engineers and build the performance model with three different machine learning algorithms (i.e., random forest, gradient boosting machine, and support vector regression). On the same two benchmarks (i.e., wordcount and terasort), our recommended configuration achieves a remarkable improvement over the rule-of-thumb configuration and outperforms the improvements reported in prior work.
AB - As big data platform architectures grow more complex, data engineers provide the infrastructure, with its computation and storage resources, for data scientists and data analysts. With this support, data scientists can focus on their domain problems and design the intelligence module (e.g., prepare the data, select/train/tune the machine learning models, and validate the results). However, there is still a gap between the system engineering team and the data scientist/engineer team. System engineers have no knowledge of the application domain or the purpose of the analytic program. Data scientists and data engineers do not know the configuration of the computation system, file system, and database. Some application performance issues are related to system configuration, yet data scientists and data engineers lack information and knowledge about the system properties. In this paper, we propose a configuration layer for the current big data platform (i.e., Hadoop) and build a configuration recommendation system that collects and pre-processes data. Based on the processed data, we use semi-automatic feature engineering to provide features for data engineers and build the performance model with three different machine learning algorithms (i.e., random forest, gradient boosting machine, and support vector regression). On the same two benchmarks (i.e., wordcount and terasort), our recommended configuration achieves a remarkable improvement over the rule-of-thumb configuration and outperforms the improvements reported in prior work.
KW - big data platform
KW - configuration optimization
KW - machine learning
UR - http://www.scopus.com/inward/record.url?scp=85017600013&partnerID=8YFLogxK
U2 - 10.1109/TAAI.2016.7880179
DO - 10.1109/TAAI.2016.7880179
M3 - Conference contribution
AN - SCOPUS:85017600013
T3 - TAAI 2016 - 2016 Conference on Technologies and Applications of Artificial Intelligence, Proceedings
SP - 228
EP - 234
BT - TAAI 2016 - 2016 Conference on Technologies and Applications of Artificial Intelligence, Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 25 November 2016 through 27 November 2016
ER -