Model architecture for single-agent deterministic games that can be trained without prior human knowledge of the rules or strategies.
Main Contributions:
- Monte Carlo Tree Search (MCTS) addresses the exploration vs. exploitation dilemma.
- Use of representation, prediction, and dynamics functions
  - Prediction function $f$
    - predicts the policy and value, $p_t$ and $v_t$, from the latent state $s_t$
  - Dynamics function $g$
    - given the current latent state and the action taken, $s_t$ and $a_{t+1}$,
    - predicts the next latent state and the immediate reward, $s_{t+1}$ and $r_{t+1}$
  - Representation function $h$
    - encodes the current observation into a latent state, $s_t$
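
How the three functions fit together can be sketched as follows. This is a toy illustration only: the bodies of `h`, `g`, and `f` are hand-written stand-ins for what would be learned networks, and `NUM_ACTIONS`, `unroll`, and the example observation are hypothetical names and values chosen for the sketch. The point it shows is that the observation is encoded once by `h`, after which every step happens in latent space using `g` and `f`.

```python
import math
from typing import List, Tuple

NUM_ACTIONS = 3  # hypothetical action-space size, chosen for this sketch


def h(observation: List[float]) -> List[float]:
    """Representation: encode an observation into a latent state (toy stand-in)."""
    return [math.tanh(0.5 * x) for x in observation]


def g(state: List[float], action: int) -> Tuple[List[float], float]:
    """Dynamics: (s_t, a_{t+1}) -> (s_{t+1}, r_{t+1}) (toy stand-in)."""
    shift = (action + 1) / NUM_ACTIONS
    next_state = [math.tanh(x + shift) for x in state]
    reward = sum(next_state) / len(next_state)
    return next_state, reward


def f(state: List[float]) -> Tuple[List[float], float]:
    """Prediction: s_t -> (p_t, v_t), with p_t a softmax over actions (toy stand-in)."""
    logits = [sum(state) * (a + 1) / NUM_ACTIONS for a in range(NUM_ACTIONS)]
    top = max(logits)
    exps = [math.exp(logit - top) for logit in logits]
    total = sum(exps)
    policy = [e / total for e in exps]
    value = math.tanh(sum(state))
    return policy, value


def unroll(observation: List[float], actions: List[int]):
    """Encode once with h, then step through the actions purely in latent space."""
    state = h(observation)
    trajectory = []
    for action in actions:
        policy, value = f(state)          # p_t, v_t for the current latent state
        state, reward = g(state, action)  # s_{t+1}, r_{t+1}
        trajectory.append((policy, value, reward))
    return trajectory


trajectory = unroll([0.2, -0.1, 0.4], actions=[0, 2, 1])
for t, (policy, value, reward) in enumerate(trajectory):
    print(t, [round(p, 3) for p in policy], round(value, 3), round(reward, 3))
```

Note that the real environment is never queried inside `unroll`; this is what would let a tree search plan ahead without access to the game rules.
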
Cons:
- Can only learn in environments with a relatively small action space
Combines policy evaluation and policy improvement (together known as policy iteration).
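
As a reminder of what policy iteration means in its simplest form, here is a minimal tabular sketch on a made-up deterministic chain environment. Everything in it (`N_STATES`, `ACTIONS`, `GAMMA`, `step`, `evaluate`, `improve`) is hypothetical and chosen for illustration; it shows only the general evaluate-then-improve loop, not how MuZero carries out these two steps.

```python
# Hypothetical deterministic chain: states 0..3, state 3 is terminal.
N_STATES = 4
ACTIONS = (-1, +1)  # move left or move right
GAMMA = 0.9         # discount factor, chosen arbitrarily for the sketch


def step(state, action):
    """Deterministic transition; reward 1.0 only on reaching the terminal state."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward


def evaluate(policy, sweeps=100):
    """Policy evaluation: repeated Bellman backups for a fixed policy."""
    values = [0.0] * N_STATES
    for _ in range(sweeps):
        for s in range(N_STATES - 1):  # the terminal state keeps value 0
            next_state, reward = step(s, policy[s])
            values[s] = reward + GAMMA * values[next_state]
    return values


def improve(values):
    """Policy improvement: act greedily with respect to the current values."""
    def action_value(s, a):
        next_state, reward = step(s, a)
        return reward + GAMMA * values[next_state]

    return [max(ACTIONS, key=lambda a: action_value(s, a))
            for s in range(N_STATES - 1)]


policy = [-1] * (N_STATES - 1)  # start from "always move left"
while True:
    values = evaluate(policy)      # evaluation step
    new_policy = improve(values)   # improvement step
    if new_policy == policy:       # stop once the policy no longer changes
        break
    policy = new_policy

print(policy, [round(v, 3) for v in values])
```

The loop alternates the two steps until the policy is stable, which for this chain means always moving right toward the rewarding terminal state.
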
A brilliant summary of MuZero is given in this paper: https://arxiv.org/pdf/2104.06303.pdf