This project is a test of how well you have mastered DQN. It assumes familiarity with the basics of reinforcement learning and with DQN itself; for details, refer to the earlier articles:

First, some background on the project:

DeepTraffic is a deep reinforcement learning competition part of the MIT Deep Learning for Self-Driving Cars course. The goal is to create a neural network to drive a vehicle (or multiple vehicles) as fast as possible through dense highway traffic. An instance of your neural network gets to control one of the cars (displayed in red) and has to learn how to navigate efficiently to go as fast as possible. The car already comes with a safety system, so you don't have to worry about the basic task of driving – the net only has to tell the car if it should accelerate/slow down or change lanes, and it will do so if that is possible without crashing into other cars.

—— from "DeepTraffic: About"

DeepTraffic is essentially an experimental platform for DQN. See the official website for details.

First of all, the goal of DeepTraffic is to drive the car as fast as possible. The speed limit is 80 MPH, so the objective can be understood as getting as close to that limit as possible. The minimum goal for this project is to get the car above 65 MPH. Let's look at the leaderboard first:

[Figure: DeepTraffic leaderboard]

Source: https://selfdrivingcars.mit.edu/deeptraffic-leaderboard/

As you can see, the current leader is already quite close to the 80 MPH limit.

Now let's see how the baseline model performs when we change nothing at all:

[Figure: baseline model run]

Baseline model

After a few runs, it hovers around 51 MPH.

Alright, let's start with the code analysis:

// a few things don't have var in front of them - they update already existing variables the game needs

lanesSide = 0;           // how many neighbouring lanes are observable
patchesAhead = 1;        // how many patches ahead are observable
patchesBehind = 0;       // how many patches behind are observable
trainIterations = 10000; // number of training iterations

// the number of other autonomous vehicles controlled by your network
otherAgents = 0;

// compute the size of the state vector (no need to modify)
var num_inputs = (lanesSide * 2 + 1) * (patchesAhead + patchesBehind);

// number of actions (no need to modify)
var num_actions = 5;

// temporal window to look back over
var temporal_window = 3;

// turn the state vector into the network input vector (no need to modify)
var network_size = num_inputs * temporal_window + num_actions * temporal_window + num_inputs;

// network architecture
var layer_defs = [];
layer_defs.push({
    type: 'input',
    out_sx: 1,
    out_sy: 1,
    out_depth: network_size
});
layer_defs.push({
    type: 'fc',
    num_neurons: 1,
    activation: 'relu'
});
layer_defs.push({
    type: 'regression',
    num_neurons: num_actions
});

// training hyperparameters
var tdtrainer_options = {
    learning_rate: 0.001, // learning rate
    momentum: 0.0,        // momentum
    batch_size: 64,       // batch size
    l2_decay: 0.01        // L2 weight decay
};

// other hyperparameters
var opt = {};
opt.temporal_window = temporal_window;
opt.experience_size = 3000;        // size of the experience replay buffer
opt.start_learn_threshold = 500;   // number of stored experiences required before training starts
opt.gamma = 0.7;                   // discount factor for future rewards
// this.epsilon = Math.min(1.0, Math.max(this.epsilon_min, 1.0-(this.age - this.learning_steps_burnin)/(this.learning_steps_total - this.learning_steps_burnin)));
opt.learning_steps_total = 10000;  // used to compute epsilon
opt.learning_steps_burnin = 1000;  // used to compute epsilon; number of purely random actions at the start of training
opt.epsilon_min = 0.0;             // minimum epsilon during training
opt.epsilon_test_time = 0.0;       // epsilon at test time

opt.layer_defs = layer_defs;
opt.tdtrainer_options = tdtrainer_options;

// build the network
brain = new deepqlearn.Brain(num_inputs, num_actions, opt);

// train and produce an action
learn = function (state, lastReward) {
    brain.backward(lastReward);
    var action = brain.forward(state);

    draw_net();
    draw_stats();

    return action;
}

Notes on the code (the overall logic is already annotated in the comments above; below I only cover the key points):

What is the state vector?

The car's observation range is divided into a grid of patches, as shown below:

[Figure: the gridded observation area around the car]

So the state vector is simply the vector formed by the states of these grid cells. Let's verify that:

First, add a log statement to the code:

lanesSide = 0;
patchesAhead = 5;   // set the forward observation range to 5 patches
patchesBehind = 0;

// ... code in between omitted ...

learn = function (state, lastReward) {
    console.log(state);   // log the state so we can inspect it
    brain.backward(lastReward);
    var action = brain.forward(state);

    draw_net();
    draw_stats();

    return action;
}

Then open the browser console (F12); state arrays are printed continuously, as shown below:

[Figure: state arrays printed in the browser console]

Each state value lies in [0, 1]: it is 1 when the cell contains no other car, strictly between 0 and 1 when the cell is occupied by another car, and 0 when the cell is not drivable (for example, off the edge of the road).

Let's also verify the case where one neighbouring lane on each side is observed:

[Figure: observation grid with one lane visible on each side]

lanesSide = 1;      // observe one lane on each side
patchesAhead = 5;   // set the forward observation range to 5 patches
patchesBehind = 0;

[Figure: console output for lanesSide = 1]

As you can see, when the state is an $m \times n$ grid, it is flattened into a one-dimensional array of length $m \times n$.
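As a quick sanity check, using only the settings above, the expected array length is:

// with lanesSide = 1, patchesAhead = 5, patchesBehind = 0:
var num_inputs = (lanesSide * 2 + 1) * (patchesAhead + patchesBehind);
// = (1 * 2 + 1) * (5 + 0) = 3 * 5 = 15
// so each array printed by console.log(state) should contain 15 values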

How is a time step defined?

In the theoretical framework of reinforcement learning, the time step is a central concept: at every time step the agent takes an action based on the state of the environment and receives a reward in return. In the DeepTraffic model, one frame out of every 30 frames is taken as one time step.

The simulation uses frames as an internal measure of time – so neither a slow computer, nor a slow net influences the result. The Simulation Speed setting lets you control how the simulation is displayed to you – using the Normal setting the simulation tries to draw the frames matching real time, so it waits if the actual calculation is going faster – Fast displays frames as soon as they are finished, which may be much faster.

—— from "DeepTraffic: About"

Further, there is one car (displayed in red) that is not using these random actions. This is the car controlled by the deep reinforcement learning agent. It is able to choose an action every 30 frames (the time it takes to make a lane change) and gets a cutout of the state map as an input to compute its actions.

—— from "Driving Fast through Dense Traffic with Deep Reinforcement Learning"

It is also worth noting that the speed of the machine running the simulation affects neither the computed result nor the simulation itself.

How is the state vector turned into the network input vector?

From the code given earlier we can see that the input vector and the state vector do not have the same size:

// the state vector is expanded using temporal_window
var network_size = num_inputs * temporal_window + num_actions * temporal_window + num_inputs;

var layer_defs = [];
layer_defs.push({
    type: 'input',
    out_sx: 1,
    out_sy: 1,
    out_depth: network_size
});
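With the default settings (lanesSide = 0, patchesAhead = 1, patchesBehind = 0, so num_inputs = 1, together with num_actions = 5 and temporal_window = 3), this works out to:

// network_size = num_inputs * temporal_window + num_actions * temporal_window + num_inputs
//              = 1 * 3 + 5 * 3 + 1
//              = 19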

What is temporal_window?

To figure this out, I went through the source code from start to finish and found the following in deepqlearn.js:

// from deepqlearn.js
getNetInput: function(xt) {
    // return s = (x,a,x,a,x,a,xt) state vector.
    // It's a concatenation of last window_size (x,a) pairs and current state x
    var w = [];
    w = w.concat(xt); // start with current state

    // and now go backwards and append states and actions from history temporal_window times
    var n = this.window_size;
    for(var k = 0; k < this.temporal_window; k++) {
        // state
        w = w.concat(this.state_window[n-1-k]);
        // action, encoded as 1-of-k indicator vector. We scale it up a bit because
        // we dont want weight regularization to undervalue this information, as it only exists once
        var action1ofk = new Array(this.num_actions);
        for(var q = 0; q < this.num_actions; q++) action1ofk[q] = 0.0;
        action1ofk[this.action_window[n-1-k]] = 1.0 * this.num_states;
        w = w.concat(action1ofk);
    }
    return w;
},

Notes on the code:

The English comments are the original source comments;

getNetInput turns the state vector of the current time step into an input vector that also contains the state vectors and action vectors of the previous several time steps;

temporal_window is how many preceding $\langle s, a \rangle$ state-action pairs are looked back over;

After this transformation, the network input effectively turns the state from a static snapshot into a vector that also carries dynamic (temporal) information; the resulting layout is sketched below.
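A minimal sketch of the returned vector for the default temporal_window = 3 (the subscripts are my own notation; the source comment writes it as s = (x,a,x,a,x,a,xt)):

// w = [ x_t,                // current state cutout      : num_inputs values
//       x_{t-1}, a_{t-1},   // most recent history pair  : num_inputs + num_actions values
//       x_{t-2}, a_{t-2},
//       x_{t-3}, a_{t-3} ]
//
// length = num_inputs + temporal_window * (num_inputs + num_actions)
//        = num_inputs * temporal_window + num_actions * temporal_window + num_inputs
//        = network_size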

How should we understand the network architecture?

Taking the default network as an example, it is simply an MLP with three layers: an input layer, one hidden layer and an output layer. Since the network ultimately has to output action values, the output layer uses no activation function.

[Figure: the default MLP architecture]

The detailed construction of the network can be found in the convnet.js source; for illustration, an alternative hidden-layer definition is sketched below.
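For illustration only (this is not the default code), a wider hidden layer would be declared with the same layer_defs API; the width of 32 below is an arbitrary choice:

var layer_defs = [];
layer_defs.push({type: 'input', out_sx: 1, out_sy: 1, out_depth: network_size});
layer_defs.push({type: 'fc', num_neurons: 32, activation: 'relu'}); // hidden layer, width chosen arbitrarily
layer_defs.push({type: 'regression', num_neurons: num_actions});    // linear output: one value per action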

How should we read the average reward curve?

learn = function (state, lastReward) {
    brain.backward(lastReward);
    var action = brain.forward(state);

    console.log(lastReward);   // log the reward so we can inspect it

    draw_net();
    draw_stats();

    return action;
}

Through the browser console you can observe that the reward is 3 when the speed reaches 80 MPH and 0 when the speed is 40 MPH; the reward function is linear in the speed.
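Assuming nothing more than those two observed points and linearity, the reward would be (my reconstruction, not taken from the source):

$r(v) = 3 \cdot \dfrac{v - 40}{80 - 40}$, with $v$ the speed in MPH.

Inverting this gives the mapping used below: an average reward of 0 corresponds to 40 MPH, and an average reward of 3 corresponds to 80 MPH.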

[Figure: average reward curve]

The curve shown above is the average reward curve. When you modify the code and click "Apply Code", the curve continues from the previously computed point (this confused me for a while, but it does not seem to affect the result, so you can ignore it if it is unclear). The key point is simply this: a policy with an average reward of 0 means the car averages 40 MPH, and a policy with an average reward of 3 means it averages 80 MPH.

How is the policy generated?

From DQN theory we know that experience replay, fixed Q-targets and the other DQN refinements all have a significant effect on how well the final function approximation turns out. Does the default implementation take any of these into account? Time to dig through the source again:

// from deepqlearn.js
backward: function(reward) {
    this.latest_reward = reward;
    this.average_reward_window.add(reward);
    this.reward_window.shift();
    this.reward_window.push(reward);

    if(!this.learning) { return; }

    // various book-keeping
    this.age += 1;

    // it is time t+1 and we have to store (s_t, a_t, r_t, s_{t+1}) as new experience
    // (given that an appropriate number of state measurements already exist, of course)
    if(this.forward_passes > this.temporal_window + 1) {
        var e = new Experience();
        var n = this.window_size;
        e.state0 = this.net_window[n-2];
        e.action0 = this.action_window[n-2];
        e.reward0 = this.reward_window[n-2];
        e.state1 = this.net_window[n-1];
        if(this.experience.length < this.experience_size) {
            this.experience.push(e);
        } else {
            // replace. finite memory!
            var ri = convnetjs.randi(0, this.experience_size);
            this.experience[ri] = e;
        }
    }

    // learn based on experience, once we have some samples to go on
    // this is where the magic happens...
    // training only starts once the replay buffer holds more experiences than the threshold
    if(this.experience.length > this.start_learn_threshold) {
        var avcost = 0.0;
        for(var k = 0; k < this.tdtrainer.batch_size; k++) {
            // re is a random index into the replay buffer
            var re = convnetjs.randi(0, this.experience.length);
            // e is a randomly sampled experience tuple
            var e = this.experience[re];
            var x = new convnetjs.Vol(1, 1, this.net_inputs);
            x.w = e.state0;
            // use the policy function to pick the greedy action in the next state
            var maxact = this.policy(e.state1);
            // r is the TD target
            var r = e.reward0 + this.gamma * maxact.value;
            var ystruct = {dim: e.action0, val: r};
            var loss = this.tdtrainer.train(x, ystruct);
            avcost += loss.loss;
        }
        avcost = avcost / this.tdtrainer.batch_size;
        this.average_loss_window.add(avcost);
    }
},

Notes on the code:

The English comments are from the original source; the additional annotations are mine.

From the backward function we can see that experience replay is indeed used during training.
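For reference, the TD target r computed in the sampling loop corresponds to the standard one-step Q-learning target (notation mine, not from the source):

$y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a')$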

The essence of fixed Q-targets is that the network Q used for prediction and the network Q' used to compute the TD target are not the same network. So what does this source do? As the backward function above shows, each training step first obtains a maxact object from the policy function and then computes the target value ystruct.val from the TD-target formula. Looking at the policy function:

// from deepqlearn.js
policy: function(s) {
    // compute the value of doing any action in this state
    // and return the argmax action and its value
    var svol = new convnetjs.Vol(1, 1, this.net_inputs);
    svol.w = s;
    // evaluate with the current value network
    var action_values = this.value_net.forward(svol);
    var maxk = 0;
    var maxval = action_values.w[0];
    for(var k = 1; k < this.num_actions; k++) {
        if(action_values.w[k] > maxval) { maxk = k; maxval = action_values.w[k]; }
    }
    return {action: maxk, value: maxval};
},
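In other words, policy implements the greedy policy over the Q-values produced by the single value network (notation mine):

$\pi(s) = \arg\max_{a} Q(s, a)$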

Next, let's check what this.value_net actually is:

// from deepqlearn.js
this.value_net = new convnetjs.Net();
this.value_net.makeLayers(layer_defs);

// and finally we need a Temporal Difference Learning trainer!
var tdtrainer_options = {learning_rate: 0.01, momentum: 0.0, batch_size: 64, l2_decay: 0.01};
if(typeof opt.tdtrainer_options !== 'undefined') {
    tdtrainer_options = opt.tdtrainer_options; // allow user to overwrite this
}
this.tdtrainer = new convnetjs.SGDTrainer(this.value_net, tdtrainer_options);

So this.value_net is exactly the network wrapped by this.tdtrainer, which means we still have to dig into the internals of tdtrainer before drawing a conclusion.

// from convnet.js
train: function(x, y) {
    var start = new Date().getTime();
    this.net.forward(x, true); // also set the flag that lets the net know we're just training
    var end = new Date().getTime();
    var fwd_time = end - start;

    var start = new Date().getTime();
    var cost_loss = this.net.backward(y);
    var l2_decay_loss = 0.0;
    var l1_decay_loss = 0.0;
    var end = new Date().getTime();
    var bwd_time = end - start;

    this.k++;
    if(this.k % this.batch_size === 0) {

        var pglist = this.net.getParamsAndGrads();

        // initialize lists for accumulators. Will only be done once on first iteration
        if(this.gsum.length === 0 && (this.method !== 'sgd' || this.momentum > 0.0)) {
            // only vanilla sgd doesnt need either lists
            // momentum needs gsum
            // adagrad needs gsum
            // adadelta needs gsum and xsum
            for(var i = 0; i < pglist.length; i++) {
                this.gsum.push(global.zeros(pglist[i].params.length));
                if(this.method === 'adadelta') {
                    this.xsum.push(global.zeros(pglist[i].params.length));
                } else {
                    this.xsum.push([]); // conserve memory
                }
            }
        }

        // perform an update for all sets of weights
        for(var i = 0; i < pglist.length; i++) {
            var pg = pglist[i]; // param, gradient, other options in future (custom learning rate etc)
            var p = pg.params;
            var g = pg.grads;

            // learning rate for some parameters.
            var l2_decay_mul = typeof pg.l2_decay_mul !== 'undefined' ? pg.l2_decay_mul : 1.0;
            var l1_decay_mul = typeof pg.l1_decay_mul !== 'undefined' ? pg.l1_decay_mul : 1.0;
            var l2_decay = this.l2_decay * l2_decay_mul;
            var l1_decay = this.l1_decay * l1_decay_mul;

            var plen = p.length;
            for(var j = 0; j < plen; j++) {
                l2_decay_loss += l2_decay * p[j] * p[j] / 2; // accumulate weight decay loss
                l1_decay_loss += l1_decay * Math.abs(p[j]);
                var l1grad = l1_decay * (p[j] > 0 ? 1 : -1);
                var l2grad = l2_decay * (p[j]);

                var gij = (l2grad + l1grad + g[j]) / this.batch_size; // raw batch gradient

                var gsumi = this.gsum[i];
                var xsumi = this.xsum[i];
                if(this.method === 'adagrad') {
                    // adagrad update
                    gsumi[j] = gsumi[j] + gij * gij;
                    var dx = -this.learning_rate / Math.sqrt(gsumi[j] + this.eps) * gij;
                    p[j] += dx;
                } else if(this.method === 'windowgrad') {
                    // this is adagrad but with a moving window weighted average
                    // so the gradient is not accumulated over the entire history of the run.
                    // it's also referred to as Idea #1 in Zeiler paper on Adadelta. Seems reasonable to me!
                    gsumi[j] = this.ro * gsumi[j] + (1 - this.ro) * gij * gij;
                    var dx = -this.learning_rate / Math.sqrt(gsumi[j] + this.eps) * gij; // eps added for better conditioning
                    p[j] += dx;
                } else if(this.method === 'adadelta') {
                    // assume adadelta if not sgd or adagrad
                    gsumi[j] = this.ro * gsumi[j] + (1 - this.ro) * gij * gij;
                    var dx = -Math.sqrt((xsumi[j] + this.eps) / (gsumi[j] + this.eps)) * gij;
                    xsumi[j] = this.ro * xsumi[j] + (1 - this.ro) * dx * dx; // yes, xsum lags behind gsum by 1.
                    p[j] += dx;
                } else if(this.method === 'nesterov') {
                    var dx = gsumi[j];
                    gsumi[j] = gsumi[j] * this.momentum + this.learning_rate * gij;
                    dx = this.momentum * dx - (1.0 + this.momentum) * gsumi[j];
                    p[j] += dx;
                } else {
                    // assume SGD
                    if(this.momentum > 0.0) {
                        // momentum update
                        var dx = this.momentum * gsumi[j] - this.learning_rate * gij; // step
                        gsumi[j] = dx; // back this up for next iteration of momentum
                        p[j] += dx;    // apply corrected gradient
                    } else {
                        // vanilla sgd
                        p[j] += -this.learning_rate * gij;
                    }
                }
                g[j] = 0.0; // zero out gradient so that we can begin accumulating anew
            }
        }
    }

    // appending softmax_loss for backwards compatibility, but from now on we will always use cost_loss
    // in future, TODO: have to completely redo the way loss is done around the network as currently
    // loss is a bit of a hack. Ideally, user should specify arbitrary number of loss functions on any layer
    // and it should all be computed correctly and automatically.
    return {fwd_time: fwd_time, bwd_time: bwd_time,
            l2_decay_loss: l2_decay_loss, l1_decay_loss: l1_decay_loss,
            cost_loss: cost_loss, softmax_loss: cost_loss,
            loss: cost_loss + l1_decay_loss + l2_decay_loss};
},

From this look at the source, there is no handling of fixed Q-targets anywhere.
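For contrast only, a fixed-Q-target variant would keep a periodically refreshed copy of the value network and compute TD targets with that frozen copy. A minimal sketch of the idea, assuming convnetjs's toJSON/fromJSON can be used to clone a network, with an arbitrary refresh interval of 1000 steps (policyWith is a hypothetical helper; none of this is in deepqlearn.js):

// NOT part of deepqlearn.js: illustrative sketch of fixed Q-targets
var target_net = new convnetjs.Net();
target_net.fromJSON(this.value_net.toJSON());     // frozen copy of the current value network

// inside the training loop the target would be computed with the frozen copy:
//   var maxact = policyWith(target_net, e.state1);   // hypothetical helper
//   var r = e.reward0 + this.gamma * maxact.value;

// and every C steps (C = 1000 here, chosen arbitrarily) the copy is refreshed:
if (this.age % 1000 === 0) {
    target_net.fromJSON(this.value_net.toJSON());
}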

That said, this is the same for everyone, which also means we cannot gain an edge by optimizing the training procedure itself, so the issue can safely be ignored.

That concludes the analysis of the environment. With this understanding in place, we can go on to build our own DQN and see how much the score can be improved.

References:

DeepTraffic - About | MIT 6.S094: Deep Learning for Self-Driving Cars

Deep Learning in your browser

ConvNetJS Deep Q Learning Reinforcement Learning with Neural Network demo

The convnet.js and deepqlearn.js sources, viewed via F12 on the competition page

Driving Fast through Dense Traffic with Deep Reinforcement Learning