cs234-2: Markov Reward Processes and Policy Improvement

Lecture 2

MRP: Markov Reward Process

The value function of an MRP satisfies the Bellman equation, written here in matrix form:

\[ \begin{equation} \left(\begin{array}{c} V\left(s_1\right) \\ \vdots \\ V\left(s_N\right) \end{array}\right)=\left(\begin{array}{c} R\left(s_1\right) \\ \vdots \\ R\left(s_N\right) \end{array}\right)+\gamma\left(\begin{array}{ccc} P\left(s_1 \mid s_1\right) & \cdots & P\left(s_N \mid s_1\right) \\ P\left(s_1 \mid s_2\right) & \cdots & P\left(s_N \mid s_2\right) \\ \vdots & \ddots & \vdots \\ P\left(s_1 \mid s_N\right) & \cdots & P\left(s_N \mid s_N\right) \end{array}\right)\left(\begin{array}{c} V\left(s_1\right) \\ \vdots \\ V\left(s_N\right) \end{array}\right) \end{equation} \]

Written compactly and solved for \(V\):

\[ \begin{equation} \begin{aligned} V &=R+\gamma P V \\ V-\gamma P V &=R \\ (I-\gamma P) V &=R \\ V &=(I-\gamma P)^{-1} R \end{aligned} \end{equation} \]

As long as \(\gamma < 1\), the matrix \(I-\gamma P\) is always invertible, so this solution is well defined.
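A minimal numpy sketch of this closed-form solve; the function name and the two-state example numbers are my own, for illustration only:

```python
import numpy as np

def mrp_value_analytic(R, P, gamma):
    """Solve V = R + gamma * P @ V exactly, i.e. V = (I - gamma*P)^{-1} R.

    R: reward vector, shape (N,); P: row-stochastic transition matrix, shape (N, N);
    gamma: discount factor in [0, 1).
    """
    N = R.shape[0]
    # Solving the linear system is cheaper and more stable than forming the inverse.
    return np.linalg.solve(np.eye(N) - gamma * P, R)

# Tiny made-up 2-state MRP, just to exercise the formula.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
R = np.array([1.0, 0.0])
print(mrp_value_analytic(R, P, gamma=0.9))
```

The direct solve costs roughly \(O(N^3)\), which motivates the iterative method below.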

How to compute the value of an MRP

  • Dynamic programming (a code sketch follows this list)
  • Initialize \(V_0(s)=0\) for all \(s\)
  • For \(k=1\) until convergence
  • For all \(s\) in \(S\)

\[ V_k(s)=R(s)+\gamma \sum_{s^{\prime} \in S} P\left(s^{\prime} \mid s\right) V_{k-1}\left(s^{\prime}\right) \]

  • Computational complexity: \(O\left(|S|^2\right)\) for each iteration \((|S|=N)\)
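A sketch of this dynamic-programming evaluation; the stopping tolerance and function name are my own choices:

```python
import numpy as np

def mrp_value_dp(R, P, gamma, tol=1e-8):
    """Iterate V_k(s) = R(s) + gamma * sum_{s'} P(s'|s) V_{k-1}(s') until convergence.

    Each sweep is O(|S|^2) because of the matrix-vector product.
    """
    V = np.zeros_like(R, dtype=float)   # V_0(s) = 0 for all s
    while True:
        V_new = R + gamma * P @ V       # one backup over all states
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```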

Markov Decision Process (MDP): definition

  • Markov Decision Process is Markov Reward Process \(+\) actions
  • Definition of MDP
    • \(S\) is a (finite) set of Markov states \(s \in S\)
    • \(A\) is a (finite) set of actions \(a \in A\)
    • \(P\) is dynamics/transition model for each action, that specifies \(P\left(s_{t+1}=s^{\prime} \mid s_t=s, a_t=a\right)\)
    • \(R\) is a reward function

\[ R\left(s_t=s, a_t=a\right)=\mathbb{E}\left[r_t \mid s_t=s, a_t=a\right] \]

  • Discount factor \(\gamma \in[0,1]\)
  • MDP is a tuple: \((S, A, P, R, \gamma)\)

MDPs and Markov Reward Processes

  • \(\mathrm{MDP}+\pi(a \mid s)=\) Markov Reward Process
  • Precisely, it is the MRP \(\left(S, R^\pi, P^\pi, \gamma\right)\), where

\[ \begin{aligned} R^\pi(s) &=\sum_{a \in A} \pi(a \mid s) R(s, a) \\ P^\pi\left(s^{\prime} \mid s\right) &=\sum_{a \in A} \pi(a \mid s) P\left(s^{\prime} \mid s, a\right) \end{aligned} \]
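A small numpy sketch of this averaging step; the array layout (R of shape (S, A), P of shape (S, A, S), pi of shape (S, A)) is an assumption of mine:

```python
import numpy as np

def mdp_to_mrp(R, P, pi):
    """Average an MDP over a stochastic policy pi(a|s) to get the induced MRP.

    Returns (R_pi, P_pi) with
      R_pi[s]     = sum_a pi(a|s) R(s, a)
      P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a)
    """
    R_pi = np.einsum("sa,sa->s", pi, R)
    P_pi = np.einsum("sa,sat->st", pi, P)
    return R_pi, P_pi
```

With R_pi and P_pi in hand, either MRP evaluation method above (direct solve or dynamic programming) computes \(V^\pi\).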

Iterative computation of the state values (policy evaluation)

  • Initialize \(V_0(s)=0\) for all \(s\)
  • For \(k=1\) until convergence
  • For all \(s\) in \(S\)

\[ V_k^\pi(s)=r(s, \pi(s))+\gamma \sum_{s^{\prime} \in S} p\left(s^{\prime} \mid s, \pi(s)\right) V_{k-1}^\pi\left(s^{\prime}\right) \]

  • This is a Bellman backup for a particular policy
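A sketch for a deterministic policy, where pi[s] is an action index; the layout and names follow the earlier sketches:

```python
import numpy as np

def policy_evaluation(R, P, pi, gamma, tol=1e-8):
    """Iterative evaluation of a deterministic policy pi.

    R: shape (S, A); P: shape (S, A, S); pi: int array of shape (S,).
    Returns V^pi as a length-S vector.
    """
    S = R.shape[0]
    R_pi = R[np.arange(S), pi]           # r(s, pi(s))
    P_pi = P[np.arange(S), pi]           # p(.|s, pi(s)), shape (S, S)
    V = np.zeros(S)                      # V_0(s) = 0
    while True:
        V_new = R_pi + gamma * P_pi @ V  # Bellman backup for this policy
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```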

Deterministic policy & policy iteration

\(|A|^{|S|}\) is the total number of deterministic policies.

  • Set \(i=0\)
  • Initialize \(\pi_0(s)\) randomly for all states \(s\)
  • While \(i=0\) or \(\left\|\pi_i-\pi_{i-1}\right\|_1>0\) (L1-norm, measures if the policy changed for any state):
    • \(V^{\pi_i} \leftarrow\) MDP \(V\) function policy evaluation of \(\pi_i\)
    • \(\pi_{i+1} \leftarrow\) Policy improvement
    • \(i=i+1\)
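A sketch of this outer loop; it relies on the policy_evaluation sketch above and on a policy_improvement helper sketched after the Q-value section below (both names are mine):

```python
import numpy as np

def policy_iteration(R, P, gamma, rng=None):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    rng = np.random.default_rng() if rng is None else rng
    S, A = R.shape
    pi = rng.integers(A, size=S)                     # pi_0 chosen randomly
    while True:
        V = policy_evaluation(R, P, pi, gamma)       # evaluate pi_i
        pi_new = policy_improvement(R, P, V, gamma)  # greedy w.r.t. Q^{pi_i}
        if np.array_equal(pi_new, pi):               # ||pi_{i+1} - pi_i||_1 = 0
            return pi, V
        pi = pi_new
```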

Policy improvement

Definition of the state-action value \(Q\):

\[ Q^{\pi_i}(s, a)=R(s, a)+\gamma \sum_{s^{\prime} \in S} P\left(s^{\prime} \mid s, a\right) V^{\pi_i}\left(s^{\prime}\right) \]

Policy improvement derivation: the improved policy picks, in each state, the action with the largest Q-value:

\[ \begin{gathered} Q^{\pi_i}(s, a)=R(s, a)+\gamma \sum_{s^{\prime} \in S} P\left(s^{\prime} \mid s, a\right) V^{\pi_i}\left(s^{\prime}\right) \\ \max _a Q^{\pi_i}(s, a) \geq R\left(s, \pi_i(s)\right)+\gamma \sum_{s^{\prime} \in S} P\left(s^{\prime} \mid s, \pi_i(s)\right) V^{\pi_i}\left(s^{\prime}\right)=V^{\pi_i}(s) \\ \pi_{i+1}(s)=\arg \max _a Q^{\pi_i}(s, a) \end{gathered} \]
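A sketch of the improvement step itself, matching the array layout used above:

```python
import numpy as np

def policy_improvement(R, P, V, gamma):
    """Greedy improvement: pi_{i+1}(s) = argmax_a Q^{pi_i}(s, a).

    R: shape (S, A); P: shape (S, A, S); V: V^{pi_i}, shape (S,).
    """
    Q = R + gamma * np.einsum("sat,t->sa", P, V)  # Q^{pi_i}(s, a)
    return np.argmax(Q, axis=1)
```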

Monotonic improvement in policy

Definition: \(V^{\pi_1} \geq V^{\pi_2}\) means that the value under \(\pi_1\) is at least as large as the value under \(\pi_2\) in every state:

\[ V^{\pi_1} \geq V^{\pi_2}: V^{\pi_1}(s) \geq V^{\pi_2}(s), \forall s \in S \]

  • Proposition: \(V^{\pi_{i+1}} \geq V^{\pi_i}\) with strict inequality if \(\pi_i\) is suboptimal, where \(\pi_{i+1}\) is the new policy we get from policy improvement on \(\pi_i\)

The value of a state under policy \(\pi_i\) is at most the maximum Q-value over actions in that state:

\[ \begin{equation} \begin{aligned} V^{\pi_i}(s) & \leq \max _a Q^{\pi_i}(s, a) \\ &=\max _a\left[R(s, a)+\gamma \sum_{s^{\prime} \in S} P\left(s^{\prime} \mid s, a\right) V^{\pi_i}\left(s^{\prime}\right)\right] \end{aligned} \end{equation} \]

Proof:

\[ \begin{aligned} V^{\pi_i}(s) & \leq \max _a Q^{\pi_i}(s, a) \\ &=\max _a\left[R(s, a)+\gamma \sum_{s^{\prime} \in S} P\left(s^{\prime} \mid s, a\right) V^{\pi_i}\left(s^{\prime}\right)\right] \\ &=R\left(s, \pi_{i+1}(s)\right)+\gamma \sum_{s^{\prime} \in S} P\left(s^{\prime} \mid s, \pi_{i+1}(s)\right) V^{\pi_i}\left(s^{\prime}\right) \quad \text{(by the definition of } \pi_{i+1}\text{)} \\ & \leq R\left(s, \pi_{i+1}(s)\right)+\gamma \sum_{s^{\prime} \in S} P\left(s^{\prime} \mid s, \pi_{i+1}(s)\right) \max _{a^{\prime}} Q^{\pi_i}\left(s^{\prime}, a^{\prime}\right) \\ &=R\left(s, \pi_{i+1}(s)\right)+\gamma \sum_{s^{\prime} \in S} P\left(s^{\prime} \mid s, \pi_{i+1}(s)\right)\left(R\left(s^{\prime}, \pi_{i+1}\left(s^{\prime}\right)\right)+\gamma \sum_{s^{\prime \prime} \in S} P\left(s^{\prime \prime} \mid s^{\prime}, \pi_{i+1}\left(s^{\prime}\right)\right) V^{\pi_i}\left(s^{\prime \prime}\right)\right) \\ & \;\;\vdots \\ &=V^{\pi_{i+1}}(s) \end{aligned} \]

If the policy does not change during policy improvement, it will never change again.

Summary: the improved policy is obtained by maximizing the Q-values of the current policy.

Value Iteration

\[ V^\pi(s)=R^\pi(s)+\gamma \sum_{s^{\prime} \in S} P^\pi\left(s^{\prime} \mid s\right) V^\pi\left(s^{\prime}\right) \]

Bellman backup operator

  • Applied to a value function
  • Returns a new value function
  • Improves the value if possible

\[ B V(s)=\max _a R(s, a)+\gamma \sum_{s^{\prime} \in S} p\left(s^{\prime} \mid s, a\right) V\left(s^{\prime}\right) \]
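A sketch of one application of \(B\), again with R of shape (S, A) and P of shape (S, A, S):

```python
import numpy as np

def bellman_backup(R, P, V, gamma):
    """(B V)(s) = max_a [ R(s, a) + gamma * sum_{s'} P(s'|s, a) V(s') ]."""
    Q = R + gamma * np.einsum("sat,t->sa", P, V)
    return np.max(Q, axis=1)
```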

Value iteration algorithm

Set \(k=1\)

Initialize \(V_0(s)=0\) for all states \(s\)

Loop until [finite horizon, convergence]:

  • For each state \(s\)

\[ V_{k+1}(s)=\max _a R(s, a)+\gamma \sum_{s^{\prime} \in S} P\left(s^{\prime} \mid s, a\right) V_k\left(s^{\prime}\right) \]

  • View as Bellman backup on value function

\[ \begin{gathered} V_{k+1}=B V_k \\ \pi_{k+1}(s)=\arg \max _a R(s, a)+\gamma \sum_{s^{\prime} \in S} P\left(s^{\prime} \mid s, a\right) V_k\left(s^{\prime}\right) \end{gathered} \]

Improve the value of every state until convergence, then extract the optimal policy from the final values.
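A sketch of the full loop; it simply repeats the backup above and reads off the greedy policy at the end (the tolerance-based stopping rule is my own choice):

```python
import numpy as np

def value_iteration(R, P, gamma, tol=1e-8):
    """Apply V_{k+1} = B V_k until the values stop changing, then act greedily."""
    S, A = R.shape
    V = np.zeros(S)                                   # V_0(s) = 0
    while True:
        Q = R + gamma * np.einsum("sat,t->sa", P, V)  # Q_k(s, a)
        V_new = np.max(Q, axis=1)                     # V_{k+1} = B V_k
        if np.max(np.abs(V_new - V)) < tol:
            return np.argmax(Q, axis=1), V_new        # greedy policy and values
        V = V_new
```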

  • Value iteration converges as long as the discount factor \(\gamma < 1\), or the process ends in a terminal state with probability 1
  • The Bellman backup is a contraction operator, provided \(\gamma < 1\)
  • Applying the Bellman backup to two different value functions shrinks the distance between them

Contraction operator

Definition:

  • Let \(O\) be an operator, and \(|x|\) denote (any) norm of \(x\)
  • If \(\left|O V-O V^{\prime}\right| \leq\left|V-V^{\prime}\right|\), then \(O\) is a contraction operator

Proof that the Bellman backup is a contraction operator
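A sketch of the standard argument under the infinity norm, using the fact that \(|\max_a f(a) - \max_a g(a)| \leq \max_a |f(a) - g(a)|\):

\[ \begin{aligned} \left\|B V-B V^{\prime}\right\|_{\infty} &=\max _s\left|\max _a\left[R(s, a)+\gamma \sum_{s^{\prime} \in S} P\left(s^{\prime} \mid s, a\right) V\left(s^{\prime}\right)\right]-\max _{a^{\prime}}\left[R\left(s, a^{\prime}\right)+\gamma \sum_{s^{\prime} \in S} P\left(s^{\prime} \mid s, a^{\prime}\right) V^{\prime}\left(s^{\prime}\right)\right]\right| \\ & \leq \max _{s, a}\left|\gamma \sum_{s^{\prime} \in S} P\left(s^{\prime} \mid s, a\right)\left(V\left(s^{\prime}\right)-V^{\prime}\left(s^{\prime}\right)\right)\right| \\ & \leq \gamma \max _{s, a} \sum_{s^{\prime} \in S} P\left(s^{\prime} \mid s, a\right)\left\|V-V^{\prime}\right\|_{\infty} \\ &=\gamma\left\|V-V^{\prime}\right\|_{\infty} \end{aligned} \]

So for \(\gamma < 1\) the backup is a \(\gamma\)-contraction, and value iteration converges to a unique fixed point.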


PI vs VI: the difference between policy iteration and value iteration

Policy iteration fully evaluates the current policy before each greedy improvement step, so each iteration is expensive but the number of iterations is small (bounded by the \(|A|^{|S|}\) deterministic policies). Value iteration performs a single Bellman optimality backup per sweep, interleaving truncated evaluation with improvement. For \(\gamma < 1\) both converge to the optimal value function.