After working on sequence labeling for a while, I have noticed that many NER papers adopt an LSTM-CRF architecture. Adding a CRF as the final layer lets the model choose the most probable label path as a whole, which improves the scores.

Common deep learning frameworks do not ship with a CRF layer, so it has to be implemented by hand. I have recently been learning PyTorch, whose tutorials include a Bi-LSTM-CRF implementation. The PyTorch tutorials really are excellent, covering implementations of most of the popular models in NLP. Here I walk through that implementation from scratch and record some questions that came up along the way.

[Figure: the Bi-LSTM-CRF architecture for sequence labeling]

The Bi-LSTM-CRF architecture is generally as shown above: the final layer is a CRF that learns an optimal tag path. The output dimension of the Bi-LSTM layer is the tag set size, so its output can be read as emission scores for each word $w_i$ over the tags. Let the Bi-LSTM output matrix be $P$, where $P_{i,j}$ is the unnormalized probability of word $w_i$ mapping to $tag_j$. For the CRF, we assume a transition matrix $A$, where $A_{i,j}$ is the score of transitioning from $tag_i$ to $tag_j$.
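In the tutorial's implementation, this transition matrix is simply a learnable nn.Parameter, with transitions into the artificial START_TAG and out of STOP_TAG pinned to a large negative score so those paths are effectively impossible. One detail worth noting: the tutorial indexes transitions in the opposite order to the $A$ above, i.e. transitions[i, j] is the score of transitioning to $tag_i$ from $tag_j$. A minimal runnable sketch (the toy B/I/O tag set is just an example):

import torch
import torch.nn as nn

START_TAG, STOP_TAG = "<START>", "<STOP>"
tag_to_ix = {"B": 0, "I": 1, "O": 2, START_TAG: 3, STOP_TAG: 4}
tagset_size = len(tag_to_ix)

# transitions[i, j]: score of transitioning *to* tag i *from* tag j
# (note the reversed index order relative to A_{i,j} in the text above)
transitions = nn.Parameter(torch.randn(tagset_size, tagset_size))

# Never transition *to* START_TAG and never transition *from* STOP_TAG:
# a large negative score makes these paths effectively impossible.
transitions.data[tag_to_ix[START_TAG], :] = -10000
transitions.data[:, tag_to_ix[STOP_TAG]] = -10000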

For an input sequence $X$ with a corresponding output tag sequence $y$, the score is defined as

$$S(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$$

where $y_0$ and $y_{n+1}$ are the START and STOP tags.
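This is what the tutorial's _score_sentence method computes for the gold tag sequence: one pass over the sentence, accumulating one transition score and one emission score per step (feats is the Bi-LSTM output $P$, tags the gold tag indices):

def _score_sentence(self, feats, tags):
    # Gives the score S(X, y) of a provided tag sequence
    score = autograd.Variable(torch.Tensor([0]))
    # Prepend START_TAG so the first transition is START -> tags[0]
    tags = torch.cat([torch.LongTensor([self.tag_to_ix[START_TAG]]), tags])
    for i, feat in enumerate(feats):
        # transitions[to, from] plus the emission score of the gold tag
        score = score + self.transitions[tags[i + 1], tags[i]] + feat[tags[i + 1]]
    # Final transition from the last gold tag to STOP_TAG
    score = score + self.transitions[self.tag_to_ix[STOP_TAG], tags[-1]]
    return score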

Using the softmax function, we define a probability for each correct tag sequence $y$ ($Y_X$ denotes all tag sequences, including those that cannot occur):

$$p(y|X) = \frac{e^{S(X, y)}}{\sum_{\widetilde{y} \in Y_X} e^{S(X, \widetilde{y})}}$$

During training we then only need to maximize the likelihood $p(y|X)$; here we use the log-likelihood

$$\log(p(y|X)) = S(X, y) - \log\Big(\sum_{\widetilde{y} \in Y_X} e^{S(X, \widetilde{y})}\Big)$$

So we define the loss function as $-\log(p(y|X))$ and can then train the network with gradient descent.
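In the tutorial, this loss is assembled from exactly these two terms: the logsumexp over all paths (computed by the _forward_alg shown below) minus the score of the gold path:

def neg_log_likelihood(self, sentence, tags):
    feats = self._get_lstm_features(sentence)       # emission matrix P from the Bi-LSTM
    forward_score = self._forward_alg(feats)        # log sum exp over all tag paths
    gold_score = self._score_sentence(feats, tags)  # S(X, y) of the gold path
    return forward_score - gold_score               # = -log p(y|X)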

When computing the loss, $S(X, y)$ is easy, while $\log(\sum_{\widetilde{y}\in Y_X} e^{S(X, \widetilde{y})})$ (written logsumexp below) is a bit more involved, because it ranges over the score of every possible path. There is a convenient trick: for paths reaching word $w_{i+1}$, we can first compute the logsumexp of paths reaching word $w_i$. Writing $\alpha_{i,j}$ for the logsumexp of the scores of all partial paths that end at word $w_i$ with $tag_j$, the recursion is

$$\alpha_{i+1,j} = \log \sum_{k} e^{\alpha_{i,k} + A_{k,j} + P_{i+1,j}}$$

and at the last word (plus a final transition to the STOP tag, as in the code below)

$$\log \sum_{\widetilde{y} \in Y_X} e^{S(X, \widetilde{y})} = \log \sum_{j} e^{\alpha_{n,j}}$$

This works because $e^{\alpha_{i,k}}$ is exactly the sum of $e^{\text{score}}$ over all partial paths ending in $tag_k$, so expanding the recursion recovers the sum over every full path. Computing the path scores step by step like this is equivalent to computing the global score directly, but greatly reduces computation time. Below is the PyTorch code:

def _forward_alg(self, feats):
    # Do the forward algorithm to compute the partition function
    init_alphas = torch.Tensor(1, self.tagset_size).fill_(-10000.)
    # START_TAG has all of the score.
    init_alphas[0][self.tag_to_ix[START_TAG]] = 0.

    # Wrap in a variable so that we will get automatic backprop
    forward_var = autograd.Variable(init_alphas)

    # Iterate through the sentence
    for feat in feats:
        alphas_t = []  # The forward variables at this timestep
        for next_tag in range(self.tagset_size):
            # broadcast the emission score: it is the same regardless of
            # the previous tag
            emit_score = feat[next_tag].view(1, -1).expand(1, self.tagset_size)
            # the ith entry of trans_score is the score of transitioning to
            # next_tag from i
            trans_score = self.transitions[next_tag].view(1, -1)
            # The ith entry of next_tag_var is the value for the
            # edge (i -> next_tag) before we do log-sum-exp
            next_tag_var = forward_var + trans_score + emit_score
            # The forward variable for this tag is log-sum-exp of all the
            # scores.
            alphas_t.append(log_sum_exp(next_tag_var))
        forward_var = torch.cat(alphas_t).view(1, -1)
    terminal_var = forward_var + self.transitions[self.tag_to_ix[STOP_TAG]]
    alpha = log_sum_exp(terminal_var)
    return alpha
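The snippet above relies on two small helpers defined in the tutorial: argmax over a 1 x N tensor, and a numerically stable log_sum_exp that subtracts the maximum score before exponentiating, so that large scores cannot overflow:

def to_scalar(var):
    # returns a python int
    return var.view(-1).data.tolist()[0]

def argmax(vec):
    # return the argmax as a python int
    _, idx = torch.max(vec, 1)
    return to_scalar(idx)

def log_sum_exp(vec):
    # Compute log sum exp in a numerically stable way for the forward algorithm
    max_score = vec[0, argmax(vec)]
    max_score_broadcast = max_score.view(1, -1).expand(1, vec.size()[1])
    return max_score + \
        torch.log(torch.sum(torch.exp(vec - max_score_broadcast)))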

For decoding, the Viterbi algorithm is used:

def _viterbi_decode(self, feats):
    backpointers = []

    # Initialize the viterbi variables in log space
    init_vvars = torch.Tensor(1, self.tagset_size).fill_(-10000.)
    init_vvars[0][self.tag_to_ix[START_TAG]] = 0

    # forward_var at step i holds the viterbi variables for step i-1
    forward_var = autograd.Variable(init_vvars)
    for feat in feats:
        bptrs_t = []  # holds the backpointers for this step
        viterbivars_t = []  # holds the viterbi variables for this step

        for next_tag in range(self.tagset_size):
            # next_tag_var[i] holds the viterbi variable for tag i at the
            # previous step, plus the score of transitioning
            # from tag i to next_tag.
            # We don't include the emission scores here because the max
            # does not depend on them (we add them in below)
            next_tag_var = forward_var + self.transitions[next_tag]
            best_tag_id = argmax(next_tag_var)
            bptrs_t.append(best_tag_id)
            viterbivars_t.append(next_tag_var[0][best_tag_id])
        # Now add in the emission scores, and assign forward_var to the set
        # of viterbi variables we just computed
        forward_var = (torch.cat(viterbivars_t) + feat).view(1, -1)
        backpointers.append(bptrs_t)

    # Transition to STOP_TAG
    terminal_var = forward_var + self.transitions[self.tag_to_ix[STOP_TAG]]
    best_tag_id = argmax(terminal_var)
    path_score = terminal_var[0][best_tag_id]

    # Follow the back pointers to decode the best path.
    best_path = [best_tag_id]
    for bptrs_t in reversed(backpointers):
        best_tag_id = bptrs_t[best_tag_id]
        best_path.append(best_tag_id)
    # Pop off the start tag (we don't want to return that to the caller)
    start = best_path.pop()
    assert start == self.tag_to_ix[START_TAG]  # Sanity check
    best_path.reverse()
    return path_score, best_path
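For completeness, training is then a standard PyTorch loop that minimizes neg_log_likelihood. The following condensed sketch follows the tutorial (prepare_sequence maps tokens to an index tensor there, the hyperparameters are the tutorial's, and test_sentence is a hypothetical held-out example):

model = BiLSTM_CRF(len(word_to_ix), tag_to_ix, EMBEDDING_DIM, HIDDEN_DIM)
optimizer = optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

for epoch in range(300):
    for sentence, tags in training_data:
        model.zero_grad()  # PyTorch accumulates gradients, so reset each step
        sentence_in = prepare_sequence(sentence, word_to_ix)
        targets = torch.LongTensor([tag_to_ix[t] for t in tags])
        loss = model.neg_log_likelihood(sentence_in, targets)
        loss.backward()
        optimizer.step()

# Prediction runs _viterbi_decode via the model's forward()
score, tag_seq = model(prepare_sequence(test_sentence, word_to_ix))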

The complete implementation is available at Bi-LSTM-CRF.

References

Bidirectional LSTM-CRF Models for Sequence Tagging

Neural Architectures for Named Entity Recognition

Advanced: Making Dynamic Decisions and the Bi-LSTM CRF