Model architecture for single-agent deterministic games that can be trained without prior human knowledge of the rules or strategies.
Main Contributions:
- Monte Carlo Tree Search (MCTS) addresses the exploration vs. exploitation dilemma.
- The use of Representation, Prediction and Dynamics functions
  - Prediction function $f$
    - predicts the policy and value, $p_t$ and $v_t$
  - Dynamics function $g$
    - takes the current state and the action taken, $s_t$ and $a_{t+1}$
    - predicts the next state and immediate reward, $s_{t+1}$ and $r_{t+1}$
  - Representation function $h$
    - converts the current state (observation) into a latent state, $s_t$
- Can only learn in environments with a relatively small action space
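
To make the data flow between $h$, $g$ and $f$ concrete, here is a minimal sketch. It is not the paper's implementation: in MuZero the three functions are neural networks, whereas the bodies below are hypothetical hand-written stand-ins, and `NUM_ACTIONS`, `LATENT_DIM` and `unroll` are names assumed for illustration. The point is only that the observation is encoded once, after which everything steps in latent space.

```python
import math

NUM_ACTIONS = 3  # assumed: a small discrete action space
LATENT_DIM = 4   # assumed: size of the latent state


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def h(observation):
    """Representation: observation -> latent state s_t (toy encoding)."""
    s = [0.0] * LATENT_DIM
    for i, x in enumerate(observation):
        s[i % LATENT_DIM] += x
    return [math.tanh(v) for v in s]


def g(state, action):
    """Dynamics: (s_t, a_{t+1}) -> (s_{t+1}, r_{t+1}) (toy transition)."""
    next_state = [
        math.tanh(v + 0.1 * (action + 1) * (i + 1))
        for i, v in enumerate(state)
    ]
    reward = sum(next_state) / LATENT_DIM
    return next_state, reward


def f(state):
    """Prediction: s_t -> (p_t, v_t) (toy policy and value heads)."""
    logits = [sum(state) * (a + 1) / NUM_ACTIONS for a in range(NUM_ACTIONS)]
    policy = softmax(logits)
    value = math.tanh(sum(state))
    return policy, value


def unroll(observation, actions):
    """Encode once with h, then step purely in latent space with f and g."""
    state = h(observation)
    trajectory = []
    for a in actions:
        policy, value = f(state)      # p_t, v_t from the current latent state
        state, reward = g(state, a)   # s_{t+1}, r_{t+1} from (s_t, a_{t+1})
        trajectory.append((policy, value, reward))
    return trajectory
```

Calling `unroll([0.5, -0.2, 0.1, 0.3, 0.9], [0, 2, 1])` returns one `(policy, value, reward)` triple per action. Note that the real environment is never queried after the initial call to `h`.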
Combines policy evaluation and policy improvement (together known as policy iteration). Brilliant summary of MuZero in this paper
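
One way to see how search plays both roles is a deliberately flattened, one-level version of MCTS. The sketch below uses a simplified pUCT-style score, not the exact formula from the paper. The constant `c`, the function names and the fixed `action_values` are assumptions made for illustration. The score trades off exploitation (the mean value so far) against exploration (a prior-weighted bonus that shrinks with visits). The normalized visit counts act as the improved policy, and the averaged backed-up value acts as the evaluation.

```python
import math


def puct_score(q, prior, parent_visits, child_visits, c=1.25):
    """Exploitation term q plus an exploration bonus that decays with visits."""
    return q + c * prior * math.sqrt(parent_visits) / (1 + child_visits)


def search(prior, action_values, num_simulations=200):
    """One-level search at a root node (a flattened stand-in for MCTS).

    action_values is a hypothetical deterministic evaluation of each action;
    in MuZero this would come from the learned model instead.
    """
    n = len(prior)
    visits = [0] * n
    value_sum = [0.0] * n
    for _ in range(num_simulations):
        total = sum(visits)
        scores = []
        for a in range(n):
            q = value_sum[a] / visits[a] if visits[a] else 0.0
            scores.append(puct_score(q, prior[a], total + 1, visits[a]))
        best = max(range(n), key=scores.__getitem__)
        visits[best] += 1
        value_sum[best] += action_values[best]  # back up the evaluated value
    # Policy improvement: visit counts sharpen the prior toward good actions.
    improved_policy = [v / num_simulations for v in visits]
    # Policy evaluation: average of all backed-up values at the root.
    root_value = sum(value_sum) / num_simulations
    return improved_policy, root_value
```

With a uniform prior and `action_values = [0.1, 0.8, 0.3]`, most simulations end up on the second action, while the other two still receive a few exploratory visits.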