This project is a test of how well you have mastered DQN. It assumes familiarity with the basics of reinforcement learning and with DQN itself; for details, refer to the earlier articles:

First, some background on the project:

DeepTraffic is a deep reinforcement learning competition part of the MIT Deep Learning for Self-Driving Cars course. The goal is to create a neural network to drive a vehicle (or multiple vehicles) as fast as possible through dense highway traffic. An instance of your neural network gets to control one of the cars (displayed in red) and has to learn how to navigate efficiently to go as fast as possible. The car already comes with a safety system, so you don't have to worry about the basic task of driving – the net only has to tell the car if it should accelerate/slow down or change lanes, and it will do so if that is possible without crashing into other cars.

—— from "DeepTraffic: About"

DeepTraffic is essentially an experimental platform for DQN. See the official website for details.

First of all, the goal of DeepTraffic is to drive the car as fast as possible. The speed limit is 80 MPH, so the objective can be understood as getting as close to that limit as possible. The minimum goal for this project is to get the car above 65 MPH. Let's look at the leaderboard first:

[Figure: DeepTraffic leaderboard]

Source: https://selfdrivingcars.mit.edu/deeptraffic-leaderboard/

As you can see, the current leader is already quite close to the 80 MPH limit.

Now let's see how the baseline model performs when we change nothing at all:

[Figure: baseline model run]

Baseline model

After a few runs, it hovers around 51 MPH.

Alright, let's start with the code analysis:

// a few things don't have var in front of them - they update already existing variables the game needs

lanesSide = 0;           // how many neighbouring lanes are observable
patchesAhead = 1;        // how many patches ahead are observable
patchesBehind = 0;       // how many patches behind are observable
trainIterations = 10000; // number of training iterations

// the number of other autonomous vehicles controlled by your network
otherAgents = 0;

// compute the size of the state vector (no need to modify)
var num_inputs = (lanesSide * 2 + 1) * (patchesAhead + patchesBehind);

// number of actions (no need to modify)
var num_actions = 5;

// temporal window to look back over
var temporal_window = 3;

// turn the state vector into the network input vector (no need to modify)
var network_size = num_inputs * temporal_window + num_actions * temporal_window + num_inputs;

// network architecture
var layer_defs = [];
layer_defs.push({
    type: 'input',
    out_sx: 1,
    out_sy: 1,
    out_depth: network_size
});
layer_defs.push({
    type: 'fc',
    num_neurons: 1,
    activation: 'relu'
});
layer_defs.push({
    type: 'regression',
    num_neurons: num_actions
});

// training hyperparameters
var tdtrainer_options = {
    learning_rate: 0.001, // learning rate
    momentum: 0.0,        // momentum
    batch_size: 64,       // batch size
    l2_decay: 0.01        // L2 weight decay
};

// other hyperparameters
var opt = {};
opt.temporal_window = temporal_window;
opt.experience_size = 3000;        // size of the experience replay buffer
opt.start_learn_threshold = 500;   // number of stored experiences required before training starts
opt.gamma = 0.7;                   // discount factor for future rewards
// this.epsilon = Math.min(1.0, Math.max(this.epsilon_min, 1.0-(this.age - this.learning_steps_burnin)/(this.learning_steps_total - this.learning_steps_burnin)));
opt.learning_steps_total = 10000;  // used to compute epsilon
opt.learning_steps_burnin = 1000;  // used to compute epsilon; number of purely random actions at the start of training
opt.epsilon_min = 0.0;             // minimum epsilon during training
opt.epsilon_test_time = 0.0;       // epsilon at test time

opt.layer_defs = layer_defs;
opt.tdtrainer_options = tdtrainer_options;

// build the network
brain = new deepqlearn.Brain(num_inputs, num_actions, opt);

// train and produce an action
learn = function (state, lastReward) {
    brain.backward(lastReward);
    var action = brain.forward(state);

    draw_net();
    draw_stats();

    return action;
}

Notes on the code (the overall logic is already annotated in the comments above; below I only cover the key points):

What is the state vector?

The car's observation range is divided into a grid of patches, as shown below:

[Figure: the gridded observation area around the car]

So the state vector is simply the vector formed by the states of these grid cells. Let's verify that:

First, add a log statement to the code:

lanesSide = 0;
patchesAhead = 5;   // set the forward observation range to 5 patches
patchesBehind = 0;

// ... code in between omitted ...

learn = function (state, lastReward) {
    console.log(state);   // log the state so we can inspect it
    brain.backward(lastReward);
    var action = brain.forward(state);

    draw_net();
    draw_stats();

    return action;
}

Then open the browser console (F12); state arrays are printed continuously, as shown below:

[Figure: state arrays printed in the browser console]

Each state value lies in [0, 1]: it is 1 when the cell contains no other car, strictly between 0 and 1 when the cell is occupied by another car, and 0 when the cell is not drivable (for example, off the edge of the road).

Let's also verify the case where one neighbouring lane on each side is observed:

[Figure: observation grid with one lane visible on each side]

lanesSide = 1;      // observe one lane on each side
patchesAhead = 5;   // set the forward observation range to 5 patches
patchesBehind = 0;

[Figure: console output for lanesSide = 1]

As you can see, when the state is an $m \times n$ grid, it is flattened into a one-dimensional array of length $m \times n$.
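As a quick sanity check, using only the settings above, the expected array length is:

// with lanesSide = 1, patchesAhead = 5, patchesBehind = 0:
var num_inputs = (lanesSide * 2 + 1) * (patchesAhead + patchesBehind);
// = (1 * 2 + 1) * (5 + 0) = 3 * 5 = 15
// so each array printed by console.log(state) should contain 15 values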

How is a time step defined?

In the theoretical framework of reinforcement learning, the time step is a central concept: at every time step the agent takes an action based on the state of the environment and receives a reward in return. In the DeepTraffic model, one frame out of every 30 frames is taken as one time step.

The simulation uses frames as an internal measure of time – so neither a slow computer, nor a slow net influences the result. The Simulation Speed setting lets you control how the simulation is displayed to you – using the Normal setting the simulation tries to draw the frames matching real time, so it waits if the actual calculation is going faster – Fast displays frames as soon as they are finished, which may be much faster.

—— from "DeepTraffic: About"

Further, there is one car (displayed in red) that is not using these random actions. This is the car controlled by the deep reinforcement learning agent. It is able to choose an action every 30 frames (the time it takes to make a lane change) and gets a cutout of the state map as an input to compute its actions.

—— from "Driving Fast through Dense Traffic with Deep Reinforcement Learning"

It is also worth noting that the speed of the machine running the simulation affects neither the computed result nor the simulation itself.

How is the state vector turned into the network input vector?

From the code given earlier we can see that the input vector and the state vector do not have the same size:

// the state vector is expanded using temporal_window
var network_size = num_inputs * temporal_window + num_actions * temporal_window + num_inputs;

var layer_defs = [];
layer_defs.push({
    type: 'input',
    out_sx: 1,
    out_sy: 1,
    out_depth: network_size
});
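With the default settings (lanesSide = 0, patchesAhead = 1, patchesBehind = 0, so num_inputs = 1, together with num_actions = 5 and temporal_window = 3), this works out to:

// network_size = num_inputs * temporal_window + num_actions * temporal_window + num_inputs
//              = 1 * 3 + 5 * 3 + 1
//              = 19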

What is temporal_window?

To figure this out, I went through the source code from start to finish and found the following in deepqlearn.js:

// from deepqlearn.js
getNetInput: function(xt) {
    // return s = (x,a,x,a,x,a,xt) state vector.
    // It's a concatenation of last window_size (x,a) pairs and current state x
    var w = [];
    w = w.concat(xt); // start with current state

    // and now go backwards and append states and actions from history temporal_window times
    var n = this.window_size;
    for(var k = 0; k < this.temporal_window; k++) {
        // state
        w = w.concat(this.state_window[n-1-k]);
        // action, encoded as 1-of-k indicator vector. We scale it up a bit because
        // we dont want weight regularization to undervalue this information, as it only exists once
        var action1ofk = new Array(this.num_actions);
        for(var q = 0; q < this.num_actions; q++) action1ofk[q] = 0.0;
        action1ofk[this.action_window[n-1-k]] = 1.0 * this.num_states;
        w = w.concat(action1ofk);
    }
    return w;
},

Notes on the code:

The English comments are the original source comments;

getNetInput turns the state vector of the current time step into an input vector that also contains the state vectors and action vectors of the previous several time steps;

temporal_window is how many preceding $\langle s, a \rangle$ state-action pairs are looked back over;

After this transformation, the network input effectively turns the state from a static snapshot into a vector that also carries dynamic (temporal) information; the resulting layout is sketched below.
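A minimal sketch of the returned vector for the default temporal_window = 3 (the subscripts are my own notation; the source comment writes it as s = (x,a,x,a,x,a,xt)):

// w = [ x_t,                // current state cutout      : num_inputs values
//       x_{t-1}, a_{t-1},   // most recent history pair  : num_inputs + num_actions values
//       x_{t-2}, a_{t-2},
//       x_{t-3}, a_{t-3} ]
//
// length = num_inputs + temporal_window * (num_inputs + num_actions)
//        = num_inputs * temporal_window + num_actions * temporal_window + num_inputs
//        = network_size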

How should we understand the network architecture?

Taking the default network as an example, it is simply an MLP with three layers: an input layer, one hidden layer and an output layer. Since the network ultimately has to output action values, the output layer uses no activation function.

[Figure: the default MLP architecture]

The detailed construction of the network can be found in the convnet.js source; for illustration, an alternative hidden-layer definition is sketched below.
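For illustration only (this is not the default code), a wider hidden layer would be declared with the same layer_defs API; the width of 32 below is an arbitrary choice:

var layer_defs = [];
layer_defs.push({type: 'input', out_sx: 1, out_sy: 1, out_depth: network_size});
layer_defs.push({type: 'fc', num_neurons: 32, activation: 'relu'}); // hidden layer, width chosen arbitrarily
layer_defs.push({type: 'regression', num_neurons: num_actions});    // linear output: one value per action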

How should we read the average reward curve?

learn = function (state, lastReward) {
    brain.backward(lastReward);
    var action = brain.forward(state);

    console.log(lastReward);   // log the reward so we can inspect it

    draw_net();
    draw_stats();

    return action;
}

Through the browser console you can observe that the reward is 3 when the speed reaches 80 MPH and 0 when the speed is 40 MPH; the reward function is linear in the speed.
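Assuming nothing more than those two observed points and linearity, the reward would be (my reconstruction, not taken from the source):

$r(v) = 3 \cdot \dfrac{v - 40}{80 - 40}$, with $v$ the speed in MPH.

Inverting this gives the mapping used below: an average reward of 0 corresponds to 40 MPH, and an average reward of 3 corresponds to 80 MPH.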

[Figure: average reward curve]

The curve shown above is the average reward curve. When you modify the code and click "Apply Code", the curve continues from the previously computed point (this confused me for a while, but it does not seem to affect the result, so you can ignore it if it is unclear). The key point is simply this: a policy with an average reward of 0 means the car averages 40 MPH, and a policy with an average reward of 3 means it averages 80 MPH.

How is the policy generated?

From DQN theory we know that experience replay, fixed Q-targets and the other DQN refinements all have a significant effect on how well the final function approximation turns out. Does the default implementation take any of these into account? Time to dig through the source again:

// from deepqlearn.js
backward: function(reward) {
    this.latest_reward = reward;
    this.average_reward_window.add(reward);
    this.reward_window.shift();
    this.reward_window.push(reward);

    if(!this.learning) { return; }

    // various book-keeping
    this.age += 1;

    // it is time t+1 and we have to store (s_t, a_t, r_t, s_{t+1}) as new experience
    // (given that an appropriate number of state measurements already exist, of course)
    if(this.forward_passes > this.temporal_window + 1) {
        var e = new Experience();
        var n = this.window_size;
        e.state0 = this.net_window[n-2];
        e.action0 = this.action_window[n-2];
        e.reward0 = this.reward_window[n-2];
        e.state1 = this.net_window[n-1];
        if(this.experience.length < this.experience_size) {
            this.experience.push(e);
        } else {
            // replace. finite memory!
            var ri = convnetjs.randi(0, this.experience_size);
            this.experience[ri] = e;
        }
    }

    // learn based on experience, once we have some samples to go on
    // this is where the magic happens...
    // training only starts once the replay buffer holds more experiences than the threshold
    if(this.experience.length > this.start_learn_threshold) {
        var avcost = 0.0;
        for(var k = 0; k < this.tdtrainer.batch_size; k++) {
            // re is a random index into the replay buffer
            var re = convnetjs.randi(0, this.experience.length);
            // e is a randomly sampled experience tuple
            var e = this.experience[re];
            var x = new convnetjs.Vol(1, 1, this.net_inputs);
            x.w = e.state0;
            // use the policy function to pick the greedy action in the next state
            var maxact = this.policy(e.state1);
            // r is the TD target
            var r = e.reward0 + this.gamma * maxact.value;
            var ystruct = {dim: e.action0, val: r};
            var loss = this.tdtrainer.train(x, ystruct);
            avcost += loss.loss;
        }
        avcost = avcost / this.tdtrainer.batch_size;
        this.average_loss_window.add(avcost);
    }
},

Notes on the code:

The English comments are from the original source; the additional annotations are mine.

From the backward function we can see that experience replay is indeed used during training.
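For reference, the TD target r computed in the sampling loop corresponds to the standard one-step Q-learning target (notation mine, not from the source):

$y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a')$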

The essence of fixed Q-targets is that the network Q used for prediction and the network Q' used to compute the TD target are not the same network. So what does this source do? As the backward function above shows, each training step first obtains a maxact object from the policy function and then computes the target value ystruct.val from the TD-target formula. Looking at the policy function:

// from deepqlearn.js
policy: function(s) {
    // compute the value of doing any action in this state
    // and return the argmax action and its value
    var svol = new convnetjs.Vol(1, 1, this.net_inputs);
    svol.w = s;
    // evaluate with the current value network
    var action_values = this.value_net.forward(svol);
    var maxk = 0;
    var maxval = action_values.w[0];
    for(var k = 1; k < this.num_actions; k++) {
        if(action_values.w[k] > maxval) { maxk = k; maxval = action_values.w[k]; }
    }
    return {action: maxk, value: maxval};
},
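In other words, policy implements the greedy policy over the Q-values produced by the single value network (notation mine):

$\pi(s) = \arg\max_{a} Q(s, a)$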

Next, let's check what this.value_net actually is:

// from deepqlearn.js
this.value_net = new convnetjs.Net();
this.value_net.makeLayers(layer_defs);

// and finally we need a Temporal Difference Learning trainer!
var tdtrainer_options = {learning_rate: 0.01, momentum: 0.0, batch_size: 64, l2_decay: 0.01};
if(typeof opt.tdtrainer_options !== 'undefined') {
    tdtrainer_options = opt.tdtrainer_options; // allow user to overwrite this
}
this.tdtrainer = new convnetjs.SGDTrainer(this.value_net, tdtrainer_options);

So this.value_net is exactly the network wrapped by this.tdtrainer, which means we still have to dig into the internals of tdtrainer before drawing a conclusion.

// from convnet.js
train: function(x, y) {
    var start = new Date().getTime();
    this.net.forward(x, true); // also set the flag that lets the net know we're just training
    var end = new Date().getTime();
    var fwd_time = end - start;

    var start = new Date().getTime();
    var cost_loss = this.net.backward(y);
    var l2_decay_loss = 0.0;
    var l1_decay_loss = 0.0;
    var end = new Date().getTime();
    var bwd_time = end - start;

    this.k++;
    if(this.k % this.batch_size === 0) {

        var pglist = this.net.getParamsAndGrads();

        // initialize lists for accumulators. Will only be done once on first iteration
        if(this.gsum.length === 0 && (this.method !== 'sgd' || this.momentum > 0.0)) {
            // only vanilla sgd doesnt need either lists
            // momentum needs gsum
            // adagrad needs gsum
            // adadelta needs gsum and xsum
            for(var i = 0; i < pglist.length; i++) {
                this.gsum.push(global.zeros(pglist[i].params.length));
                if(this.method === 'adadelta') {
                    this.xsum.push(global.zeros(pglist[i].params.length));
                } else {
                    this.xsum.push([]); // conserve memory
                }
            }
        }

        // perform an update for all sets of weights
        for(var i = 0; i < pglist.length; i++) {
            var pg = pglist[i]; // param, gradient, other options in future (custom learning rate etc)
            var p = pg.params;
            var g = pg.grads;

            // learning rate for some parameters.
            var l2_decay_mul = typeof pg.l2_decay_mul !== 'undefined' ? pg.l2_decay_mul : 1.0;
            var l1_decay_mul = typeof pg.l1_decay_mul !== 'undefined' ? pg.l1_decay_mul : 1.0;
            var l2_decay = this.l2_decay * l2_decay_mul;
            var l1_decay = this.l1_decay * l1_decay_mul;

            var plen = p.length;
            for(var j = 0; j < plen; j++) {
                l2_decay_loss += l2_decay * p[j] * p[j] / 2; // accumulate weight decay loss
                l1_decay_loss += l1_decay * Math.abs(p[j]);
                var l1grad = l1_decay * (p[j] > 0 ? 1 : -1);
                var l2grad = l2_decay * (p[j]);

                var gij = (l2grad + l1grad + g[j]) / this.batch_size; // raw batch gradient

                var gsumi = this.gsum[i];
                var xsumi = this.xsum[i];
                if(this.method === 'adagrad') {
                    // adagrad update
                    gsumi[j] = gsumi[j] + gij * gij;
                    var dx = -this.learning_rate / Math.sqrt(gsumi[j] + this.eps) * gij;
                    p[j] += dx;
                } else if(this.method === 'windowgrad') {
                    // this is adagrad but with a moving window weighted average
                    // so the gradient is not accumulated over the entire history of the run.
                    // it's also referred to as Idea #1 in Zeiler paper on Adadelta. Seems reasonable to me!
                    gsumi[j] = this.ro * gsumi[j] + (1 - this.ro) * gij * gij;
                    var dx = -this.learning_rate / Math.sqrt(gsumi[j] + this.eps) * gij; // eps added for better conditioning
                    p[j] += dx;
                } else if(this.method === 'adadelta') {
                    // assume adadelta if not sgd or adagrad
                    gsumi[j] = this.ro * gsumi[j] + (1 - this.ro) * gij * gij;
                    var dx = -Math.sqrt((xsumi[j] + this.eps) / (gsumi[j] + this.eps)) * gij;
                    xsumi[j] = this.ro * xsumi[j] + (1 - this.ro) * dx * dx; // yes, xsum lags behind gsum by 1.
                    p[j] += dx;
                } else if(this.method === 'nesterov') {
                    var dx = gsumi[j];
                    gsumi[j] = gsumi[j] * this.momentum + this.learning_rate * gij;
                    dx = this.momentum * dx - (1.0 + this.momentum) * gsumi[j];
                    p[j] += dx;
                } else {
                    // assume SGD
                    if(this.momentum > 0.0) {
                        // momentum update
                        var dx = this.momentum * gsumi[j] - this.learning_rate * gij; // step
                        gsumi[j] = dx; // back this up for next iteration of momentum
                        p[j] += dx;    // apply corrected gradient
                    } else {
                        // vanilla sgd
                        p[j] += -this.learning_rate * gij;
                    }
                }
                g[j] = 0.0; // zero out gradient so that we can begin accumulating anew
            }
        }
    }

    // appending softmax_loss for backwards compatibility, but from now on we will always use cost_loss
    // in future, TODO: have to completely redo the way loss is done around the network as currently
    // loss is a bit of a hack. Ideally, user should specify arbitrary number of loss functions on any layer
    // and it should all be computed correctly and automatically.
    return {fwd_time: fwd_time, bwd_time: bwd_time,
            l2_decay_loss: l2_decay_loss, l1_decay_loss: l1_decay_loss,
            cost_loss: cost_loss, softmax_loss: cost_loss,
            loss: cost_loss + l1_decay_loss + l2_decay_loss};
},

From this look at the source, there is no handling of fixed Q-targets anywhere.
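For contrast only, a fixed-Q-target variant would keep a periodically refreshed copy of the value network and compute TD targets with that frozen copy. A minimal sketch of the idea, assuming convnetjs's toJSON/fromJSON can be used to clone a network, with an arbitrary refresh interval of 1000 steps (policyWith is a hypothetical helper; none of this is in deepqlearn.js):

// NOT part of deepqlearn.js: illustrative sketch of fixed Q-targets
var target_net = new convnetjs.Net();
target_net.fromJSON(this.value_net.toJSON());     // frozen copy of the current value network

// inside the training loop the target would be computed with the frozen copy:
//   var maxact = policyWith(target_net, e.state1);   // hypothetical helper
//   var r = e.reward0 + this.gamma * maxact.value;

// and every C steps (C = 1000 here, chosen arbitrarily) the copy is refreshed:
if (this.age % 1000 === 0) {
    target_net.fromJSON(this.value_net.toJSON());
}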

That said, this is the same for everyone, which also means we cannot gain an edge by optimizing the training procedure itself, so the issue can safely be ignored.

That concludes the analysis of the environment. With this understanding in place, we can go on to build our own DQN and see how much the score can be improved.

References:

DeepTraffic - About | MIT 6.S094: Deep Learning for Self-Driving Cars

Deep Learning in your browser

ConvNetJS Deep Q Learning Reinforcement Learning with Neural Network demo

The convnet.js and deepqlearn.js sources, viewed via F12 on the competition page

Driving Fast through Dense Traffic with Deep Reinforcement Learning