1. Introduction
1.1 Problem Setup
How do we estimate the value function for large-scale MDPs?
Answer: For large-scale MDPs, we should avoid representing values with a table (e.g., a Q-table) and instead use parameterized function approximation to estimate V and Q.
1.2 Solution
Estimate the value function with function approximation: $\hat{v}(s, \mathbf{w}) \approx v^{\pi}(s)$ or $\hat{q}(s, a, \mathbf{w}) \approx q^{\pi}(s, a)$
- Generalize from seen states to unseen states
- Update the parameter w using MC and TD learning
1.3 Types of value function approximation
- State in, value out: $\hat{v}(s, \mathbf{w})$ approximates the state-value function
- State and action in, value out: $\hat{q}(s, a, \mathbf{w})$ approximates the action-value function
- State in, one value per action out: $\hat{q}(s, a_1, \mathbf{w}), \dots, \hat{q}(s, a_m, \mathbf{w})$
1.4 Approximation Methods
- Linear combinations of features
- Neural networks
- Decision trees
- Nearest neighbors
1.5 Gradient Descent

- Consider a function $J(\mathbf{w})$ that is a differentiable function of a parameter vector $\mathbf{w}$
- The goal is to find the parameter $\mathbf{w}^{*}$ that minimizes $J(\mathbf{w})$
- Define the gradient of $J(\mathbf{w})$ to be $\nabla_{\mathbf{w}} J(\mathbf{w}) = \left( \frac{\partial J(\mathbf{w})}{\partial w_1}, \dots, \frac{\partial J(\mathbf{w})}{\partial w_n} \right)^{\top}$
- Adjust $\mathbf{w}$ in the direction of the negative gradient, $\Delta \mathbf{w} = -\frac{1}{2} \alpha \nabla_{\mathbf{w}} J(\mathbf{w})$, where $\alpha$ is the step size
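As a concrete illustration (a minimal sketch, not from the original notes), the Python snippet below minimizes a made-up quadratic objective by repeatedly stepping along the negative gradient; the objective, its minimizer, and the step size are arbitrary demonstration choices.

```python
import numpy as np

def J(w):
    # Made-up quadratic objective for illustration: J(w) = ||w - w_star||^2
    w_star = np.array([1.0, -2.0])
    return np.sum((w - w_star) ** 2)

def grad_J(w):
    # Gradient of the quadratic objective above
    w_star = np.array([1.0, -2.0])
    return 2.0 * (w - w_star)

w = np.zeros(2)      # initial parameter vector
alpha = 0.1          # step size
for _ in range(100):
    w = w - alpha * grad_J(w)   # move in the direction of the negative gradient

print(w)  # approaches the minimizer [1.0, -2.0]
```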
1.6 Value Function Approximation with an Oracle
"Oracle" here means the ground truth: if the true value function under the given policy were known, we could fit the approximate value function directly to those true values.
- We assume that we have an oracle giving the true value $v^{\pi}(s)$ for any given state $s$
- Then the objective is to find the best approximate representation $\hat{v}(s, \mathbf{w})$ of $v^{\pi}(s)$
- Thus we use the mean squared error and define the loss function as $J(\mathbf{w}) = \mathbb{E}_{\pi}\left[ \left( v^{\pi}(s) - \hat{v}(s, \mathbf{w}) \right)^{2} \right]$
- Follow gradient descent to find a local minimum: $\Delta \mathbf{w} = \alpha \, \mathbb{E}_{\pi}\left[ \left( v^{\pi}(s) - \hat{v}(s, \mathbf{w}) \right) \nabla_{\mathbf{w}} \hat{v}(s, \mathbf{w}) \right]$; stochastic gradient descent samples this gradient: $\Delta \mathbf{w} = \alpha \left( v^{\pi}(s) - \hat{v}(s, \mathbf{w}) \right) \nabla_{\mathbf{w}} \hat{v}(s, \mathbf{w})$
2. Prediction
2.1 With a Known Oracle
Represent the state using a feature vector: $\mathbf{x}(s) = \left( x_1(s), \dots, x_n(s) \right)^{\top}$
(1) Linear value function
Key point: with a linear value function, stochastic gradient descent converges to the global optimum.
- Represent the value function by a linear combination of features: $\hat{v}(s, \mathbf{w}) = \mathbf{x}(s)^{\top} \mathbf{w} = \sum_{j=1}^{n} x_j(s) \, w_j$
- The objective function is quadratic in the parameter $\mathbf{w}$: $J(\mathbf{w}) = \mathbb{E}_{\pi}\left[ \left( v^{\pi}(s) - \mathbf{x}(s)^{\top} \mathbf{w} \right)^{2} \right]$
- Thus the update rule is as simple as $\Delta \mathbf{w} = \alpha \left( v^{\pi}(s) - \hat{v}(s, \mathbf{w}) \right) \mathbf{x}(s)$, i.e. update = step size × prediction error × feature value
- Stochastic gradient descent converges to the global optimum: in the linear case the objective has a single optimum, so the local optimum found is automatically at (or near) the global optimum.
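A minimal Python sketch of the linear-VFA SGD update above, assuming an oracle that supplies $v^{\pi}(s)$ and hand-crafted features; the toy states, features, and values are purely illustrative.

```python
import numpy as np

def linear_vfa_sgd(states, features, true_values, alpha=0.05, epochs=200):
    """SGD for linear VFA: w <- w + alpha * (v_pi(s) - x(s)^T w) * x(s)."""
    n_features = features[states[0]].shape[0]
    w = np.zeros(n_features)
    for _ in range(epochs):
        for s in states:
            x = features[s]
            error = true_values[s] - x @ w    # prediction error against the oracle
            w += alpha * error * x            # step size * error * feature value
    return w

# Toy example: 3 states with made-up features and oracle values
states = [0, 1, 2]
features = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 1.0]), 2: np.array([1.0, 1.0])}
true_values = {0: 1.0, 1: 2.0, 2: 3.0}
w = linear_vfa_sgd(states, features, true_values)
print({s: features[s] @ w for s in states})   # close to the oracle values
```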
(2) Table lookup features
With one-hot encoding we obtain table lookup features:
- Table lookup is a special case of linear value function approximation
- The table lookup feature vector is a one-hot vector: $\mathbf{x}^{\text{table}}(s) = \left( \mathbf{1}(s = s_1), \dots, \mathbf{1}(s = s_n) \right)^{\top}$
- Then each element of the parameter vector indicates the value of one individual state
- Thus we have $\hat{v}(s_k, \mathbf{w}) = \mathbf{x}^{\text{table}}(s_k)^{\top} \mathbf{w} = w_k$
- The fitted value of each state is simply the corresponding entry of $\mathbf{w}$.
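A tiny check (illustrative state count and weights) that one-hot table-lookup features reduce linear VFA to a plain value table:

```python
import numpy as np

n_states = 4

def one_hot(s, n=n_states):
    x = np.zeros(n)
    x[s] = 1.0          # table-lookup feature: 1 for the current state, 0 elsewhere
    return x

w = np.array([0.5, -1.0, 2.0, 0.0])   # parameter vector doubles as the value table
for s in range(n_states):
    # v_hat(s, w) = x(s)^T w picks out w[s], so each weight stores one state's value
    print(s, one_hot(s) @ w, w[s])
```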
2.2 Without an Oracle
(1) Model-free prediction with VFA
The computation is based on samples.
In practice, there is no access to an oracle giving the true value $v^{\pi}(s)$ of any state.
Recall model-free prediction
- The goal is to evaluate $v^{\pi}$ (or $q^{\pi}$) while following a fixed policy $\pi$
- A lookup table is maintained to store the estimates $\hat{v}$ or $\hat{q}$
- Estimates are updated after each episode (MC method) or after each step (TD method)
Thus what we can do is to include the function approximation step in the loop
Incremental VFA Prediction Algorithms
- We assumed that the true value function $v^{\pi}(s)$ is given by a supervisor / oracle
- But in RL there is no supervisor, only rewards
- In practice, we substitute a target for $v^{\pi}(s)$:
- For MC, the target is the actual return $G_t$: $\Delta \mathbf{w} = \alpha \left( G_t - \hat{v}(s_t, \mathbf{w}) \right) \nabla_{\mathbf{w}} \hat{v}(s_t, \mathbf{w})$
- For TD(0), the target is the TD target $r_{t+1} + \gamma \hat{v}(s_{t+1}, \mathbf{w})$: $\Delta \mathbf{w} = \alpha \left( r_{t+1} + \gamma \hat{v}(s_{t+1}, \mathbf{w}) - \hat{v}(s_t, \mathbf{w}) \right) \nabla_{\mathbf{w}} \hat{v}(s_t, \mathbf{w})$
(2) Monte-Carlo prediction with VFA
Characteristics of the MC return $G_t$: it is an unbiased but noisy sample of the true value (many samples are needed to average out the noise), so it has high variance.
The return $G_t$ is an unbiased, but noisy, sample of the true value $v^{\pi}(s_t)$.
Why unbiased? By definition of the value function, $\mathbb{E}\left[ G_t \mid s_t = s \right] = v^{\pi}(s)$.
So we have training data that can be used for supervised learning in VFA: $\langle s_1, G_1 \rangle, \langle s_2, G_2 \rangle, \dots, \langle s_T, G_T \rangle$.
Using linear Monte-Carlo policy evaluation, the update is $\Delta \mathbf{w} = \alpha \left( G_t - \hat{v}(s_t, \mathbf{w}) \right) \mathbf{x}(s_t)$.
Monte-Carlo prediction converges, with both linear and non-linear value function approximation.
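A minimal sketch of every-visit MC prediction with linear VFA, assuming episodes are given as lists of (state, reward) pairs collected under the fixed policy; the function and argument names are illustrative.

```python
import numpy as np

def mc_prediction_linear(episodes, feature_fn, n_features, gamma=0.99, alpha=0.01):
    """Every-visit Monte-Carlo prediction with linear VFA.

    episodes: list of trajectories, each a list of (state, reward) pairs
    feature_fn: maps a state to its feature vector x(s)
    """
    w = np.zeros(n_features)
    for episode in episodes:
        G = 0.0
        # iterate backwards so G accumulates the return G_t for each step
        for state, reward in reversed(episode):
            G = reward + gamma * G
            x = feature_fn(state)
            w += alpha * (G - x @ w) * x   # w <- w + alpha (G_t - v_hat(s_t, w)) x(s_t)
    return w
```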
(3) TD prediction with VFA
Characteristics of the TD target: it is a biased estimate (the expectation involves the weights $\mathbf{w}$ that are currently being optimized).
The TD target $r_{t+1} + \gamma \hat{v}(s_{t+1}, \mathbf{w})$ is a biased sample of the true value $v^{\pi}(s_t)$.
Why biased? It is drawn from our previous estimate rather than the true value: in general $\mathbb{E}\left[ r_{t+1} + \gamma \hat{v}(s_{t+1}, \mathbf{w}) \right] \neq v^{\pi}(s_t)$.
We have training data used for supervised learning in VFA: $\langle s_1, r_2 + \gamma \hat{v}(s_2, \mathbf{w}) \rangle, \langle s_2, r_3 + \gamma \hat{v}(s_3, \mathbf{w}) \rangle, \dots$
Using linear TD(0), the stochastic gradient descent update is $\Delta \mathbf{w} = \alpha \left( r_{t+1} + \gamma \hat{v}(s_{t+1}, \mathbf{w}) - \hat{v}(s_t, \mathbf{w}) \right) \mathbf{x}(s_t)$.
This is also called a semi-gradient method, because we ignore the effect of changing the weight vector $\mathbf{w}$ on the target.
Linear TD(0) converges to (close to) the global optimum.
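A matching sketch of semi-gradient TD(0) prediction with linear VFA, assuming transitions (s, r, s', done) collected under the fixed policy; the interface is illustrative, not a fixed API.

```python
import numpy as np

def td0_prediction_linear(transitions, feature_fn, n_features, gamma=0.99, alpha=0.01):
    """Semi-gradient TD(0) with linear VFA.

    transitions: list of (s, r, s_next, done) tuples collected under a fixed policy
    """
    w = np.zeros(n_features)
    for s, r, s_next, done in transitions:
        x = feature_fn(s)
        v_next = 0.0 if done else feature_fn(s_next) @ w
        td_target = r + gamma * v_next        # target treated as a constant (semi-gradient)
        td_error = td_target - x @ w
        w += alpha * td_error * x
    return w
```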
3. Control
Generalized policy iteration

Policy evaluation: approximate policy evaluation, $\hat{q}(\cdot, \cdot, \mathbf{w}) \approx q^{\pi}$
Policy improvement: $\epsilon$-greedy policy improvement
3.1 Action-Value Function Estimation
(1) With the true Q function known
The Q function can be estimated by minimizing the MSE and applying gradient descent:
Approximate the action-value function: $\hat{q}(s, a, \mathbf{w}) \approx q^{\pi}(s, a)$
Minimize the MSE (mean squared error) between the approximate action-value and the true action-value (assuming an oracle): $J(\mathbf{w}) = \mathbb{E}_{\pi}\left[ \left( q^{\pi}(s, a) - \hat{q}(s, a, \mathbf{w}) \right)^{2} \right]$
Use stochastic gradient descent to find a local minimum: $\Delta \mathbf{w} = \alpha \left( q^{\pi}(s, a) - \hat{q}(s, a, \mathbf{w}) \right) \nabla_{\mathbf{w}} \hat{q}(s, a, \mathbf{w})$
(2) Linear Q function
Represent the state and action using a feature vector: $\mathbf{x}(s, a) = \left( x_1(s, a), \dots, x_n(s, a) \right)^{\top}$
Represent the action-value function by a linear combination of features: $\hat{q}(s, a, \mathbf{w}) = \mathbf{x}(s, a)^{\top} \mathbf{w}$
Thus the stochastic gradient descent update is $\Delta \mathbf{w} = \alpha \left( q^{\pi}(s, a) - \hat{q}(s, a, \mathbf{w}) \right) \mathbf{x}(s, a)$
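A one-step sketch of this update, assuming a joint state-action feature function; feature_fn and q_target are placeholders supplied by the caller, not fixed names from the notes.

```python
import numpy as np

def q_sgd_update(w, s, a, q_target, feature_fn, alpha=0.01):
    """One SGD step for linear action-value approximation:
    w <- w + alpha * (q_target - x(s, a)^T w) * x(s, a)."""
    x = feature_fn(s, a)                    # joint state-action feature vector x(s, a)
    w += alpha * (q_target - x @ w) * x     # in-place update of the numpy weight vector
    return w
```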
3.2 Control without an Oracle
Without access to the ground-truth Q function, we substitute $G_t$ or a TD target for the oracle and use it to update the parameters of the approximate Q function.
For MC, the target is the return $G_t$: $\Delta \mathbf{w} = \alpha \left( G_t - \hat{q}(s_t, a_t, \mathbf{w}) \right) \nabla_{\mathbf{w}} \hat{q}(s_t, a_t, \mathbf{w})$
For Sarsa, the target is the TD target $r_{t+1} + \gamma \hat{q}(s_{t+1}, a_{t+1}, \mathbf{w})$: $\Delta \mathbf{w} = \alpha \left( r_{t+1} + \gamma \hat{q}(s_{t+1}, a_{t+1}, \mathbf{w}) - \hat{q}(s_t, a_t, \mathbf{w}) \right) \nabla_{\mathbf{w}} \hat{q}(s_t, a_t, \mathbf{w})$
For Q-learning, the target is the TD target $r_{t+1} + \gamma \max_{a} \hat{q}(s_{t+1}, a, \mathbf{w})$: $\Delta \mathbf{w} = \alpha \left( r_{t+1} + \gamma \max_{a} \hat{q}(s_{t+1}, a, \mathbf{w}) - \hat{q}(s_t, a_t, \mathbf{w}) \right) \nabla_{\mathbf{w}} \hat{q}(s_t, a_t, \mathbf{w})$
As an example, Sarsa for VFA control proceeds as in the sketch below:

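A minimal sketch of Sarsa control with linear VFA and an ε-greedy policy. It assumes an environment object with illustrative reset()/step() methods (step returning (next_state, reward, done)) and integer action indices; this is a sketch of the loop, not a definitive implementation.

```python
import numpy as np

def epsilon_greedy(w, s, actions, feature_fn, epsilon=0.1):
    # With probability epsilon pick a random action, otherwise the greedy action under q_hat
    if np.random.rand() < epsilon:
        return int(np.random.choice(actions))
    q_values = [feature_fn(s, a) @ w for a in actions]
    return actions[int(np.argmax(q_values))]

def sarsa_vfa(env, actions, feature_fn, n_features,
              n_episodes=500, gamma=0.99, alpha=0.01, epsilon=0.1):
    """Sarsa control with linear VFA: evaluate q_hat, improve the policy epsilon-greedily."""
    w = np.zeros(n_features)
    for _ in range(n_episodes):
        s = env.reset()
        a = epsilon_greedy(w, s, actions, feature_fn, epsilon)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            x = feature_fn(s, a)
            if done:
                td_target = r
            else:
                a_next = epsilon_greedy(w, s_next, actions, feature_fn, epsilon)
                td_target = r + gamma * feature_fn(s_next, a_next) @ w
            w += alpha * (td_target - x @ w) * x   # semi-gradient Sarsa update
            if not done:
                s, a = s_next, a_next
    return w
```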
3.3 Convergence Issues of VFA Control
TD can diverge (fail to converge) when used off-policy or with nonlinear function approximation:
- With VFA, the "gradient" followed by TD is not the true gradient of an objective (the target itself contains the parameters being optimized).
- The update involves both an approximation of the Bellman backup and an approximation of the value function.
- In the off-policy case, the behavior policy (which collects the data) and the target policy (which is being optimized) are not the same, so the value estimates may fail to converge.
3.4 The Deadly Triad in Reinforcement Learning
Sources of instability and divergence in reinforcement learning:
- Function approximation: approximating V or Q introduces approximation error.
- Bootstrapping: TD estimates the current value from earlier estimates, which introduces noise and lets the network fit its own errors; MC uses the actual return, which is unbiased, so it is somewhat safer (with enough samples it approaches the true value).
- Off-policy training: data are collected with a behavior policy while a target policy is being optimized, so the data distribution does not match the function being optimized.
3.5 Convergence of Control Algorithms
Monte-Carlo control and Sarsa converge with table lookup and (√) with linear VFA, but may diverge with non-linear VFA; Q-learning converges with table lookup but may diverge with both linear and non-linear VFA.
(√): the algorithm moves around (chatters near) the near-optimal value function rather than converging exactly.
3.6 Batch Methods
The single-sample updates above are sample-inefficient; batch-based methods instead fit the value function to all of the samples in a batch.
3.7 Least Squares Prediction
Goal: minimize the mean squared error over the pairs in the experience dataset $\mathcal{D}$.
Given the value function approximation $\hat{v}(s, \mathbf{w})$
and experience $\mathcal{D}$ consisting of $\langle \text{state}, \text{value} \rangle$ pairs, $\mathcal{D} = \{ \langle s_1, v_1^{\pi} \rangle, \langle s_2, v_2^{\pi} \rangle, \dots, \langle s_T, v_T^{\pi} \rangle \}$ (possibly from one episode or from many previous episodes),
the objective is to find the parameter $\mathbf{w}$ that best fits all of the experience.
Least squares algorithms minimize the sum-squared error between $\hat{v}(s_t, \mathbf{w})$ and the target values $v_t^{\pi}$: $LS(\mathbf{w}) = \sum_{t=1}^{T} \left( v_t^{\pi} - \hat{v}(s_t, \mathbf{w}) \right)^{2}$
If the dataset is too large, repeatedly sample a random subset of it, compute the gradient on that mini-batch, and update the parameters iteratively.
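Two hedged sketches of least squares prediction for linear VFA: a closed-form fit via the normal equations when the dataset is small, and mini-batch SGD when it is too large to fit at once. Array shapes and hyperparameters are illustrative.

```python
import numpy as np

def least_squares_prediction(features, targets):
    """Closed-form least squares fit for linear VFA:
    w* = argmin_w sum_t (target_t - x_t^T w)^2, solved via the normal equations."""
    X = np.asarray(features)                  # shape (N, n_features)
    y = np.asarray(targets)                   # shape (N,)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def minibatch_sgd_prediction(features, targets, alpha=0.01, batch_size=32, steps=1000):
    """When the dataset is too large, repeatedly sample mini-batches and do SGD."""
    X = np.asarray(features)
    y = np.asarray(targets)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        idx = np.random.randint(0, len(X), size=batch_size)   # sample with replacement
        xb, yb = X[idx], y[idx]
        grad = -2.0 * xb.T @ (yb - xb @ w) / batch_size       # gradient of the mean squared error
        w -= alpha * grad
    return w
```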
4. Deep Q Networks
4.1 Deep Reinforcement Learning
Frontier in machine learning and artificial intelligence
Deep neural networks are used to represent
- value function
- Policy function (policy gradient methods to be introduced)
- World model
Loss function is optimized by stochastic gradient descent (SGD)
Challenges
- Efficiency : too many model parameters to optimize
- The Deadly Triad for the danger of instability and divergence in training
- Nonlinear function approximation
- Bootstrapping
- Off-policy training
4.2 DQN for Playing Atari Games
End-to-end learning of the values Q(s, a) from input pixel frames
Input state is a stack of raw pixels from the latest 4 frames
Output is Q(s, a) for each of the 18 joystick / button positions
Reward is the change in score for that step
Network architecture and hyperparameters fixed across all games

4.3 Problems in Training DQN and Their Solutions
- Correlations between consecutive samples: addressed by experience replay
- Non-stationary (unstable) targets: addressed by fixed Q targets
4.4 DQNs: Experience Replay
To reduce the correlations among samples, store the transitions $(s_t, a_t, r_t, s_{t+1})$ in a replay memory $\mathcal{D}$

To perform experience replay, repeat the following
- sample an experience tuple from the dataset: $(s, a, r, s') \sim \mathcal{D}$
- compute the target value for the sampled tuple: $y = r + \gamma \max_{a'} \hat{Q}(s', a', \mathbf{w})$
- use stochastic gradient descent to update the network weights: $\Delta \mathbf{w} = \alpha \left( r + \gamma \max_{a'} \hat{Q}(s', a', \mathbf{w}) - \hat{Q}(s, a, \mathbf{w}) \right) \nabla_{\mathbf{w}} \hat{Q}(s, a, \mathbf{w})$
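A minimal replay-memory sketch (illustrative class and method names): transitions are stored in a fixed-size buffer and sampled uniformly at random, which breaks the correlation between consecutive samples.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of transitions (s_t, a_t, r_t, s_{t+1}, done)."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # old transitions are dropped automatically

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # uniform random sampling decorrelates the mini-batch
        batch = random.sample(list(self.buffer), batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```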
4.5 DQNs: Fixed Targets
To help improve stability, fix the target weights used in the target calculation for multiple updates
Let a different set of parameters $\mathbf{w}^{-}$ be the weights used in the target, and $\mathbf{w}$ be the weights that are being updated
To perform experience replay with a fixed target, repeat the following
- sample an experience tuple from the dataset: $(s, a, r, s') \sim \mathcal{D}$
- compute the target value for the sampled tuple, using the fixed weights: $y = r + \gamma \max_{a'} \hat{Q}(s', a', \mathbf{w}^{-})$
- use stochastic gradient descent to update the network weights: $\Delta \mathbf{w} = \alpha \left( r + \gamma \max_{a'} \hat{Q}(s', a', \mathbf{w}^{-}) - \hat{Q}(s, a, \mathbf{w}) \right) \nabla_{\mathbf{w}} \hat{Q}(s, a, \mathbf{w})$
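A PyTorch-style sketch of one such update with a fixed target network. It assumes q_net and target_net are any nn.Module mapping a batch of states to per-action Q values and that the mini-batch tensors are already prepared; the names and shapes are illustrative, not the original DQN code.

```python
import torch
import torch.nn as nn

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One DQN update: targets come from the frozen target network (w-),
    gradients flow only through the online network (w)."""
    # states/next_states: float tensors (B, state_dim); actions: long tensor (B,);
    # rewards: float tensor (B,); dones: float tensor of 0/1 (B,)
    states, actions, rewards, next_states, dones = batch
    with torch.no_grad():
        # r + gamma * max_a' Q(s', a'; w-), zeroed for terminal transitions
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Every C updates, copy the online weights into the target network:
# target_net.load_state_dict(q_net.state_dict())
```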
4.6 Loss
The loss is computed as the difference between the TD target (Q_target) and the current Q value (our estimate of Q).

However, we do not actually know the true TD target; it also has to be estimated. Using the Bellman equation, the TD target is simply the reward of the current action plus the discounted highest Q value of the next state.

The problem is that the same parameters (network weights) are used to estimate both the target and the Q value. This creates a strong correlation between the network weights and the TD target, which means that at every training step our Q values shift, and the Q targets shift with them.
Therefore we use a separate network with fixed parameters ($\mathbf{w}^{-}$) to estimate the TD target: $L(\mathbf{w}) = \mathbb{E}\left[ \left( r + \gamma \max_{a'} \hat{Q}(s', a', \mathbf{w}^{-}) - \hat{Q}(s, a, \mathbf{w}) \right)^{2} \right]$

4.7 Results and Demos

- Demo of Breakout by DQN : https://www.youtube.com/watch?v=V1eYniJ0Rnk
- Demo of Flappy Bird by DQN : https://www.youtube.com/watch?v=xM62SpKAZHU
- Code of DQN in PyTorch : https://github.com/cuhkrlcourse/DeepRL-Tutorials/blob/master/01.DQN.ipynb
- Code of Flappy Bird : https://github.com/xmfbit/DQN-FlappyBird
- Reference: https://blog.csdn.net/mike11222