TY - GEN
T1 - Gumbel MuZero for the Game of 2048
AU - Kao, Chih-Yu
AU - Guei, Hung
AU - Wu, Ti-Rong
AU - Wu, I-Chen
N1 - Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - In recent years, AlphaZero and MuZero have achieved remarkable success in a broad range of applications. AlphaZero masters game playing without human knowledge, while MuZero also learns the game rules and the environment's dynamics without access to a simulator during planning, which makes it applicable to complex environments. Both algorithms adopt Monte Carlo tree search (MCTS) during self-play, usually using hundreds of simulations per move. To handle stochasticity, Stochastic MuZero was proposed to learn a stochastic model and use the learned model to perform the tree search. Recently, Gumbel MuZero was proposed to guarantee policy improvement, allowing it to learn reliably with a small number of simulations. However, Gumbel MuZero uses a deterministic model, as in MuZero, limiting its performance in stochastic environments. In this paper, we propose combining Gumbel MuZero and Stochastic MuZero, the first attempt to apply Gumbel MuZero to a stochastic environment. Our experiment on the stochastic puzzle game 2048 demonstrates that the combined algorithm performs well, achieving an average score of 394,645 with only 3 simulations during training, greatly reducing the computational resources needed for training.
AB - In recent years, AlphaZero and MuZero have achieved remarkable success in a broad range of applications. AlphaZero masters game playing without human knowledge, while MuZero also learns the game rules and the environment's dynamics without access to a simulator during planning, which makes it applicable to complex environments. Both algorithms adopt Monte Carlo tree search (MCTS) during self-play, usually using hundreds of simulations per move. To handle stochasticity, Stochastic MuZero was proposed to learn a stochastic model and use the learned model to perform the tree search. Recently, Gumbel MuZero was proposed to guarantee policy improvement, allowing it to learn reliably with a small number of simulations. However, Gumbel MuZero uses a deterministic model, as in MuZero, limiting its performance in stochastic environments. In this paper, we propose combining Gumbel MuZero and Stochastic MuZero, the first attempt to apply Gumbel MuZero to a stochastic environment. Our experiment on the stochastic puzzle game 2048 demonstrates that the combined algorithm performs well, achieving an average score of 394,645 with only 3 simulations during training, greatly reducing the computational resources needed for training.
KW - 2048 game
KW - Gumbel MuZero
KW - Monte Carlo tree search
KW - Stochastic MuZero
KW - deep reinforcement learning
UR - http://www.scopus.com/inward/record.url?scp=85148274418&partnerID=8YFLogxK
U2 - 10.1109/TAAI57707.2022.00017
DO - 10.1109/TAAI57707.2022.00017
M3 - Conference contribution
AN - SCOPUS:85148274418
T3 - Proceedings - 2022 International Conference on Technologies and Applications of Artificial Intelligence, TAAI 2022
SP - 42
EP - 47
BT - Proceedings - 2022 International Conference on Technologies and Applications of Artificial Intelligence, TAAI 2022
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 27th International Conference on Technologies and Applications of Artificial Intelligence, TAAI 2022
Y2 - 1 December 2022 through 3 December 2022
ER -