TY - GEN
T1 - RaFIO
T2 - 36th Annual ACM Symposium on Applied Computing, SAC 2021
AU - Slimani, Camélia
AU - Wu, Chun Feng
AU - Chang, Yuan Hao
AU - Rubini, Stéphane
AU - Boukhobza, Jalil
N1 - Publisher Copyright:
© 2021 ACM.
PY - 2021/3/22
Y1 - 2021/3/22
N2 - Random Forest-based classification is a widely used Machine Learning algorithm. Training a random forest consists of building several decision trees that classify elements of the input dataset according to their features. This process is memory intensive: when datasets are larger than the available memory, the number of I/O operations grows significantly, causing a dramatic performance drop. Our experiments showed that, for a dataset 8 times larger than the available memory workspace, training a random forest is 25 times slower than when the dataset fits in memory. In this paper, we revisit the tree-building algorithm to optimize performance for datasets larger than the memory workspace. The proposed strategy reduces the number of I/O operations by exploiting the temporal locality exhibited by the random forest building algorithm. Experiments showed that our method reduced tree-building execution time by up to 90%, and by 60% on average, compared to a state-of-the-art method when datasets are larger than the main memory workspace.
AB - Random Forest-based classification is a widely used Machine Learning algorithm. Training a random forest consists of building several decision trees that classify elements of the input dataset according to their features. This process is memory intensive: when datasets are larger than the available memory, the number of I/O operations grows significantly, causing a dramatic performance drop. Our experiments showed that, for a dataset 8 times larger than the available memory workspace, training a random forest is 25 times slower than when the dataset fits in memory. In this paper, we revisit the tree-building algorithm to optimize performance for datasets larger than the memory workspace. The proposed strategy reduces the number of I/O operations by exploiting the temporal locality exhibited by the random forest building algorithm. Experiments showed that our method reduced tree-building execution time by up to 90%, and by 60% on average, compared to a state-of-the-art method when datasets are larger than the main memory workspace.
UR - http://www.scopus.com/inward/record.url?scp=85104958387&partnerID=8YFLogxK
U2 - 10.1145/3412841.3441932
DO - 10.1145/3412841.3441932
M3 - Conference contribution
AN - SCOPUS:85104958387
T3 - Proceedings of the ACM Symposium on Applied Computing
SP - 521
EP - 528
BT - Proceedings of the 36th Annual ACM Symposium on Applied Computing, SAC 2021
PB - Association for Computing Machinery
Y2 - 22 March 2021 through 26 March 2021
ER -