• Model architecture for single-agent deterministic games that can be trained without prior human knowledge of the rules or strategies.

  • Main Contributions:

    • Monte Carlo Tree Search (MCTS) addresses the exploration vs. exploitation dilemma.
    • The use of representation, prediction, and dynamics functions
    • Prediction function $f$,
      • predicts policy and value, $p_t$ and $v_t$
    • Dynamics function $g$,
      • given the current state and action taken, $s_t$ and $a_{t+1}$
      • predicts the next state and immediate reward, $s_{t+1}$ and $r_{t+1}$
    • Representation function $h$,
      • converts the current observation into a latent (hidden) state, $s_t$
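
A minimal sketch of how the three functions fit together, assuming toy sizes and randomly initialised linear layers in place of trained networks. The names `unroll`, `W_h`, `W_g`, `w_r`, `W_p`, `w_v` and the dimensions are hypothetical, chosen only for illustration. The point it shows: the observation is encoded once by $h$, and every later step happens in latent space through $g$ and $f$.

```python
import math
import random

random.seed(0)

# Toy sizes, chosen arbitrarily for this sketch.
OBS_DIM, LATENT_DIM, NUM_ACTIONS = 4, 3, 2


def random_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Random weights stand in for trained networks (an assumption of this sketch).
W_h = random_matrix(LATENT_DIM, OBS_DIM)
W_g = random_matrix(LATENT_DIM, LATENT_DIM + NUM_ACTIONS)
w_r = random_matrix(1, LATENT_DIM + NUM_ACTIONS)[0]
W_p = random_matrix(NUM_ACTIONS, LATENT_DIM)
w_v = random_matrix(1, LATENT_DIM)[0]


def h(observation):
    """Representation: observation -> latent state s_t."""
    return [math.tanh(x) for x in matvec(W_h, observation)]


def g(state, action):
    """Dynamics: (s_t, a_{t+1}) -> (s_{t+1}, r_{t+1})."""
    one_hot = [1.0 if i == action else 0.0 for i in range(NUM_ACTIONS)]
    x = state + one_hot
    next_state = [math.tanh(v) for v in matvec(W_g, x)]
    reward = sum(w * xi for w, xi in zip(w_r, x))
    return next_state, reward


def f(state):
    """Prediction: s_t -> (p_t, v_t)."""
    policy = softmax(matvec(W_p, state))
    value = math.tanh(sum(w * s for w, s in zip(w_v, state)))
    return policy, value


def unroll(observation, actions):
    """Encode once with h, then step through latent space with g, calling f at each state."""
    state = h(observation)
    trajectory = []
    for action in actions:
        policy, value = f(state)
        state, reward = g(state, action)
        trajectory.append((policy, value, reward))
    return trajectory


steps = unroll([0.1, -0.2, 0.3, 0.0], actions=[0, 1, 1])
```

Each entry of `steps` holds the predicted policy, value, and reward for one latent step. The real environment is never queried after the initial observation, which is why the rules do not need to be known in advance.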
  • Cons:

    • Can only learn in environments with a relatively small action space
  • Combines policy evaluation and policy improvement (together these form policy iteration)

  • https://arxiv.org/pdf/2104.06303.pdf This paper contains a brilliant summary of MuZero.
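
To make the MCTS bullet concrete, here is a hedged sketch of a pUCT-style selection rule, the part of the search that trades exploration against exploitation. The function names `puct_score` and `select_action` and the dictionary layout are hypothetical. The constants `c1` and `c2` are the values I recall from the MuZero paper and should be treated as assumptions, and `q_value` is assumed to be already normalised to a comparable range.

```python
import math


def puct_score(q_value, prior, parent_visits, child_visits, c1=1.25, c2=19652.0):
    """Exploitation term (q_value) plus a prior-weighted exploration bonus.

    The bonus shrinks as child_visits grows, so rarely tried actions
    are favoured early and high-value actions win out later.
    """
    bonus = prior * math.sqrt(parent_visits) / (1 + child_visits)
    bonus *= c1 + math.log((parent_visits + c2 + 1) / c2)
    return q_value + bonus


def select_action(children):
    """Pick the child action with the highest pUCT score.

    children maps action -> {"q": mean value, "prior": policy prior, "visits": count}.
    """
    parent_visits = sum(child["visits"] for child in children.values())
    return max(
        children,
        key=lambda a: puct_score(
            children[a]["q"], children[a]["prior"], parent_visits, children[a]["visits"]
        ),
    )


# An unvisited action beats a well-explored one despite its lower value estimate.
early = {
    "left": {"q": 0.5, "prior": 0.5, "visits": 50},
    "right": {"q": 0.0, "prior": 0.5, "visits": 0},
}
early_choice = select_action(early)

# Once both actions are well explored, the higher value estimate dominates.
late = {
    "left": {"q": 0.5, "prior": 0.5, "visits": 50},
    "right": {"q": 0.0, "prior": 0.5, "visits": 50},
}
late_choice = select_action(late)
```

In this toy setup the search first tries `right` because it has never been visited, then settles on `left` once visit counts are equal. The same rule also hints at the small-action-space limitation noted under Cons: every child of a node needs a prior and a visit count.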