After working on sequence labeling for a while, I have noticed that many NER papers adopt an LSTM-CRF architecture. Adding a CRF as the final layer lets the model choose the most probable label path as a whole, which improves the scores.

Common deep learning frameworks do not ship with a CRF layer, so it has to be implemented by hand. I have recently been learning PyTorch, whose tutorials include a Bi-LSTM-CRF implementation. The PyTorch tutorials really are excellent, covering implementations of most of the popular models in NLP. Here I walk through that implementation from scratch and record some questions that came up along the way.

[Figure: the Bi-LSTM-CRF architecture for sequence labeling]

The Bi-LSTM-CRF architecture is generally as shown above: the final layer is a CRF that learns an optimal tag path. The output dimension of the Bi-LSTM layer is the tag set size, so its output can be read as emission scores for each word $w_i$ over the tags. Let the Bi-LSTM output matrix be $P$, where $P_{i,j}$ is the unnormalized probability of word $w_i$ mapping to $tag_j$. For the CRF, we assume a transition matrix $A$, where $A_{i,j}$ is the score of transitioning from $tag_i$ to $tag_j$.
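In the tutorial's implementation, this transition matrix is simply a learnable nn.Parameter, with transitions into the artificial START_TAG and out of STOP_TAG pinned to a large negative score so those paths are effectively impossible. One detail worth noting: the tutorial indexes transitions in the opposite order to the $A$ above, i.e. transitions[i, j] is the score of transitioning to $tag_i$ from $tag_j$. A minimal runnable sketch (the toy B/I/O tag set is just an example):

import torch
import torch.nn as nn

START_TAG, STOP_TAG = "<START>", "<STOP>"
tag_to_ix = {"B": 0, "I": 1, "O": 2, START_TAG: 3, STOP_TAG: 4}
tagset_size = len(tag_to_ix)

# transitions[i, j]: score of transitioning *to* tag i *from* tag j
# (note the reversed index order relative to A_{i,j} in the text above)
transitions = nn.Parameter(torch.randn(tagset_size, tagset_size))

# Never transition *to* START_TAG and never transition *from* STOP_TAG:
# a large negative score makes these paths effectively impossible.
transitions.data[tag_to_ix[START_TAG], :] = -10000
transitions.data[:, tag_to_ix[STOP_TAG]] = -10000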

For an input sequence $X$ with a corresponding output tag sequence $y$, the score is defined as

$$S(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$$

where $y_0$ and $y_{n+1}$ are the START and STOP tags.
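This is what the tutorial's _score_sentence method computes for the gold tag sequence: one pass over the sentence, accumulating one transition score and one emission score per step (feats is the Bi-LSTM output $P$, tags the gold tag indices):

def _score_sentence(self, feats, tags):
    # Gives the score S(X, y) of a provided tag sequence
    score = autograd.Variable(torch.Tensor([0]))
    # Prepend START_TAG so the first transition is START -> tags[0]
    tags = torch.cat([torch.LongTensor([self.tag_to_ix[START_TAG]]), tags])
    for i, feat in enumerate(feats):
        # transitions[to, from] plus the emission score of the gold tag
        score = score + self.transitions[tags[i + 1], tags[i]] + feat[tags[i + 1]]
    # Final transition from the last gold tag to STOP_TAG
    score = score + self.transitions[self.tag_to_ix[STOP_TAG], tags[-1]]
    return score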

Using the softmax function, we define a probability for each correct tag sequence $y$ ($Y_X$ denotes all tag sequences, including those that cannot occur):

$$p(y|X) = \frac{e^{S(X, y)}}{\sum_{\widetilde{y} \in Y_X} e^{S(X, \widetilde{y})}}$$

During training we then only need to maximize the likelihood $p(y|X)$; here we use the log-likelihood

$$\log(p(y|X)) = S(X, y) - \log\Big(\sum_{\widetilde{y} \in Y_X} e^{S(X, \widetilde{y})}\Big)$$

So we define the loss function as $-\log(p(y|X))$ and can then train the network with gradient descent.
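In the tutorial, this loss is assembled from exactly these two terms: the logsumexp over all paths (computed by the _forward_alg shown below) minus the score of the gold path:

def neg_log_likelihood(self, sentence, tags):
    feats = self._get_lstm_features(sentence)       # emission matrix P from the Bi-LSTM
    forward_score = self._forward_alg(feats)        # log sum exp over all tag paths
    gold_score = self._score_sentence(feats, tags)  # S(X, y) of the gold path
    return forward_score - gold_score               # = -log p(y|X)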

When computing the loss, $S(X, y)$ is easy, while $\log(\sum_{\widetilde{y}\in Y_X} e^{S(X, \widetilde{y})})$ (written logsumexp below) is a bit more involved, because it ranges over the score of every possible path. There is a convenient trick: for paths reaching word $w_{i+1}$, we can first compute the logsumexp of paths reaching word $w_i$. Writing $\alpha_{i,j}$ for the logsumexp of the scores of all partial paths that end at word $w_i$ with $tag_j$, the recursion is

$$\alpha_{i+1,j} = \log \sum_{k} e^{\alpha_{i,k} + A_{k,j} + P_{i+1,j}}$$

and at the last word (plus a final transition to the STOP tag, as in the code below)

$$\log \sum_{\widetilde{y} \in Y_X} e^{S(X, \widetilde{y})} = \log \sum_{j} e^{\alpha_{n,j}}$$

This works because $e^{\alpha_{i,k}}$ is exactly the sum of $e^{\text{score}}$ over all partial paths ending in $tag_k$, so expanding the recursion recovers the sum over every full path. Computing the path scores step by step like this is equivalent to computing the global score directly, but greatly reduces computation time. Below is the PyTorch code:

def _forward_alg(self, feats):
    # Do the forward algorithm to compute the partition function
    init_alphas = torch.Tensor(1, self.tagset_size).fill_(-10000.)
    # START_TAG has all of the score.
    init_alphas[0][self.tag_to_ix[START_TAG]] = 0.

    # Wrap in a variable so that we will get automatic backprop
    forward_var = autograd.Variable(init_alphas)

    # Iterate through the sentence
    for feat in feats:
        alphas_t = []  # The forward variables at this timestep
        for next_tag in range(self.tagset_size):
            # broadcast the emission score: it is the same regardless of
            # the previous tag
            emit_score = feat[next_tag].view(1, -1).expand(1, self.tagset_size)
            # the ith entry of trans_score is the score of transitioning to
            # next_tag from i
            trans_score = self.transitions[next_tag].view(1, -1)
            # The ith entry of next_tag_var is the value for the
            # edge (i -> next_tag) before we do log-sum-exp
            next_tag_var = forward_var + trans_score + emit_score
            # The forward variable for this tag is log-sum-exp of all the
            # scores.
            alphas_t.append(log_sum_exp(next_tag_var))
        forward_var = torch.cat(alphas_t).view(1, -1)
    terminal_var = forward_var + self.transitions[self.tag_to_ix[STOP_TAG]]
    alpha = log_sum_exp(terminal_var)
    return alpha
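The snippet above relies on two small helpers defined in the tutorial: argmax over a 1 x N tensor, and a numerically stable log_sum_exp that subtracts the maximum score before exponentiating, so that large scores cannot overflow:

def to_scalar(var):
    # returns a python int
    return var.view(-1).data.tolist()[0]

def argmax(vec):
    # return the argmax as a python int
    _, idx = torch.max(vec, 1)
    return to_scalar(idx)

def log_sum_exp(vec):
    # Compute log sum exp in a numerically stable way for the forward algorithm
    max_score = vec[0, argmax(vec)]
    max_score_broadcast = max_score.view(1, -1).expand(1, vec.size()[1])
    return max_score + \
        torch.log(torch.sum(torch.exp(vec - max_score_broadcast)))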

For decoding, the Viterbi algorithm is used:

def _viterbi_decode(self, feats):
    backpointers = []

    # Initialize the viterbi variables in log space
    init_vvars = torch.Tensor(1, self.tagset_size).fill_(-10000.)
    init_vvars[0][self.tag_to_ix[START_TAG]] = 0

    # forward_var at step i holds the viterbi variables for step i-1
    forward_var = autograd.Variable(init_vvars)
    for feat in feats:
        bptrs_t = []  # holds the backpointers for this step
        viterbivars_t = []  # holds the viterbi variables for this step

        for next_tag in range(self.tagset_size):
            # next_tag_var[i] holds the viterbi variable for tag i at the
            # previous step, plus the score of transitioning
            # from tag i to next_tag.
            # We don't include the emission scores here because the max
            # does not depend on them (we add them in below)
            next_tag_var = forward_var + self.transitions[next_tag]
            best_tag_id = argmax(next_tag_var)
            bptrs_t.append(best_tag_id)
            viterbivars_t.append(next_tag_var[0][best_tag_id])
        # Now add in the emission scores, and assign forward_var to the set
        # of viterbi variables we just computed
        forward_var = (torch.cat(viterbivars_t) + feat).view(1, -1)
        backpointers.append(bptrs_t)

    # Transition to STOP_TAG
    terminal_var = forward_var + self.transitions[self.tag_to_ix[STOP_TAG]]
    best_tag_id = argmax(terminal_var)
    path_score = terminal_var[0][best_tag_id]

    # Follow the back pointers to decode the best path.
    best_path = [best_tag_id]
    for bptrs_t in reversed(backpointers):
        best_tag_id = bptrs_t[best_tag_id]
        best_path.append(best_tag_id)
    # Pop off the start tag (we don't want to return that to the caller)
    start = best_path.pop()
    assert start == self.tag_to_ix[START_TAG]  # Sanity check
    best_path.reverse()
    return path_score, best_path
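For completeness, training is then a standard PyTorch loop that minimizes neg_log_likelihood. The following condensed sketch follows the tutorial (prepare_sequence maps tokens to an index tensor there, the hyperparameters are the tutorial's, and test_sentence is a hypothetical held-out example):

model = BiLSTM_CRF(len(word_to_ix), tag_to_ix, EMBEDDING_DIM, HIDDEN_DIM)
optimizer = optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

for epoch in range(300):
    for sentence, tags in training_data:
        model.zero_grad()  # PyTorch accumulates gradients, so reset each step
        sentence_in = prepare_sequence(sentence, word_to_ix)
        targets = torch.LongTensor([tag_to_ix[t] for t in tags])
        loss = model.neg_log_likelihood(sentence_in, targets)
        loss.backward()
        optimizer.step()

# Prediction runs _viterbi_decode via the model's forward()
score, tag_seq = model(prepare_sequence(test_sentence, word_to_ix))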

The complete implementation is available at Bi-LSTM-CRF.

References

Bidirectional LSTM-CRF Models for Sequence Tagging

Neural Architectures for Named Entity Recognition

Advanced: Making Dynamic Decisions and the Bi-LSTM CRF