A PaddlePaddle Emotion Recognition Project with Detailed Comments

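Hello, everyone! I'm 劉聰NLP.

Recently I have been studying the PaddlePaddle framework, partly to support a domestic framework (and to use the free GPUs), and partly to build up my own knowledge. After reading a number of official and unofficial projects, though, I did not find the experience very good. So I spent a morning putting together the code for an emotion recognition project, with a large number of inline comments, similar to the GPT2 project I open-sourced earlier. I hope it helps anyone who is new to PaddlePaddle.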

The GPT2 project I open-sourced earlier is mainly based on PyTorch. Address:

Emotion recognition project address:

This post walks through the project code in five parts: data preprocessing, the dataset class, the model code, model training, and model testing.

Data preprocessing

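The dataset is the general Weibo dataset from the SMP2020 Weibo emotion classification evaluation. Each post is labeled with exactly one of six emotion categories: positive, angry, sad, fear, surprise, or no emotion.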

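Competition link: SMP2020 Weibo emotion classification evaluation, SMP2020-EWECT

The preprocessing code converts the raw data format and reports how the samples are distributed across the categories. A real project would usually add some data cleaning as well (this project skips the data cleaning step).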

    import json


    def sentiment_analysis_trans_data(path, save_path):
        """
        Data preprocessing: convert the raw data into the format the model needs
        and count the number of samples per label.
        Args:
            path: path to the raw data
            save_path: path where the converted data is saved
        Returns:
        """
        fin = open(save_path, "w", encoding="utf-8")
        data_number = {}
        with open(path, "r", encoding="utf-8") as fh:
            # Load the raw data
            data = json.load(fh)
            # Iterate over the raw data
            for i, line in enumerate(data):
                sample = {"text": line["content"], "label": line["label"]}
                # If the label is not in data_number yet, add it with a value of 1;
                # otherwise increase its value by 1.
                if line["label"] not in data_number:
                    data_number[line["label"]] = 1
                else:
                    data_number[line["label"]] += 1
                # Write each text and its label to the output file, one JSON object per line
                fin.write(json.dumps(sample, ensure_ascii=False) + "\n")
        fin.close()
        print("data_number: ", data_number)

See the data_helper.py file in the AIStudio project for the full code.
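To make the "check the class proportions" step concrete, here is a minimal sketch that reads a converted file (one JSON object per line with a "label" field, as described above) and reports each label's count and share. The function name label_distribution and the four demo rows are illustrative assumptions, not part of the project.

```python
import json
import os
import tempfile
from collections import Counter


def label_distribution(jsonl_path):
    """Return {label: (count, share)} for a JSON-lines file with a "label" field."""
    counts = Counter()
    with open(jsonl_path, "r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            counts[json.loads(line)["label"]] += 1
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.items()}


# Tiny made-up file, only for demonstration.
rows = [
    {"text": "a", "label": "happy"},
    {"text": "b", "label": "happy"},
    {"text": "c", "label": "sad"},
    {"text": "d", "label": "neutral"},
]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False, encoding="utf-8") as tmp:
    for row in rows:
        tmp.write(json.dumps(row, ensure_ascii=False) + "\n")
    demo_path = tmp.name

distribution = label_distribution(demo_path)
os.remove(demo_path)
for label, (n, share) in sorted(distribution.items()):
    print(f"{label}: {n} ({share:.0%})")
```

A strongly skewed distribution at this stage is usually a hint to consider resampling or class weights before training.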

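Dataset class

The dataset class converts the text data into index data that the model can consume and saves the result ahead of time, so the model does not repeat the same conversion at every training step.

(1) Check whether a cache file exists. If it does, load it directly; otherwise convert the text data into index data again and save it as the cache.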

    if os.path.exists(cached_feature_file) and not is_overwrite:
        logger.info("Cache file {} already exists, loading it directly".format(cached_feature_file))
        self.data_set = paddle.load(cached_feature_file)["data_set"]
    else:
        # If there is no cache file, call load_data to preprocess the data, then save the result as the cache file.
        logger.info("Cache file {} does not exist, running data preprocessing".format(cached_feature_file))
        self.data_set = self.load_data(path_file)
        logger.info("Data preprocessing finished, saving the processed data to {} as the cache file".format(cached_feature_file))
        paddle.save({"data_set": self.data_set}, cached_feature_file)
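The load-or-build pattern above can be tried without any framework. The sketch below is an assumption-laden stand-in: it uses pickle instead of paddle.save/paddle.load, and load_or_build and expensive_build are hypothetical names introduced only for illustration.

```python
import os
import pickle
import tempfile


def load_or_build(cache_path, build_fn, is_overwrite=False):
    """Load cached features if present; otherwise build them and write the cache."""
    if os.path.exists(cache_path) and not is_overwrite:
        with open(cache_path, "rb") as fh:
            return pickle.load(fh)["data_set"]
    data_set = build_fn()
    with open(cache_path, "wb") as fh:
        pickle.dump({"data_set": data_set}, fh)
    return data_set


calls = []


def expensive_build():
    calls.append(1)  # record every real conversion
    return [{"input_ids": [101, 102], "label": 0}]


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_file = os.path.join(tmp_dir, "cached_demo.pkl")
    first = load_or_build(cache_file, expensive_build)   # builds and caches
    second = load_or_build(cache_file, expensive_build)  # served from the cache
    third = load_or_build(cache_file, expensive_build, is_overwrite=True)  # forced rebuild

print("conversions actually run:", len(calls))
```

The is_overwrite flag matters whenever the tokenizer, vocabulary, or max_len changes, because a stale cache would otherwise be reused silently.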

(2) The function that converts text data into index data

    def convert_featrue(self, sample):
        """
        Convert a single sample into the id index form used by the model.
        Args:
            sample: a single sample
        Returns:
        """
        # Get the label index
        label = self.label2id[sample["label"]]
        # Tokenize the text
        tokens = self.tokenizer.tokenize(sample["text"])
        # Check the length and truncate if it exceeds the maximum length
        if len(tokens) > self.max_len - 2:
            tokens = tokens[:self.max_len - 2]
        # Add [CLS] at the head and [SEP] at the tail
        tokens = ["[CLS]"] + tokens + ["[SEP]"]
        # Convert tokens to ids
        input_ids = self.tokenizer.convert_tokens_to_ids(tokens)
        # Build the attention_mask the model needs, the same size as input_ids
        attention_mask = [1] * len(input_ids)
        assert len(input_ids) == len(attention_mask)
        return input_ids, attention_mask, label
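The reason for truncating to max_len - 2 is that two positions are reserved for [CLS] and [SEP]. The following sketch shows this with a toy character-level vocabulary; toy_vocab and convert_tokens are assumptions made for the example and do not reproduce the real BERT tokenizer.

```python
def convert_tokens(tokens, vocab, max_len):
    """Truncate, wrap with [CLS]/[SEP], and map tokens to ids (toy stand-in for the tokenizer)."""
    tokens = tokens[: max_len - 2]  # reserve two positions for [CLS] and [SEP]
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    input_ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    attention_mask = [1] * len(input_ids)
    return input_ids, attention_mask


# Hypothetical vocabulary, only for demonstration.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "我": 4, "好": 5, "難": 6, "過": 7}

# Five characters with max_len=6: only the first four survive truncation.
ids, mask = convert_tokens(list("我好難過啊"), toy_vocab, max_len=6)
print(ids)   # [2, 4, 5, 6, 7, 3]
print(mask)  # [1, 1, 1, 1, 1, 1]
```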

(3) The function that converts a batch of data into tensors during training

    import paddle
    from paddlenlp.data import Pad, Stack


    def collate_func_sentiment_analysis(batch_data):
        """
        The collate_fn function required by DataLoader; turns the data into tensors.
        Args:
            batch_data: a batch of data
        Returns:
        """
        # Get the size of the batch
        batch_size = len(batch_data)
        # If batch_size is 0, return an empty dict
        if batch_size == 0:
            return {}
        input_ids_list, attention_mask_list, labels_list = [], [], []
        # Iterate over the batch and convert each sample into tensor form
        for instance in batch_data:
            input_ids_temp = instance["input_ids"]
            attention_mask_temp = instance["attention_mask"]
            labels_temp = instance["label"]
            input_ids_list.append(paddle.to_tensor(input_ids_temp, dtype="int64"))
            attention_mask_list.append(paddle.to_tensor(attention_mask_temp, dtype="int64"))
            labels_list.append(labels_temp)
        # Pad the samples within the batch to the same length
        return {"input_ids": Pad(pad_val=0, axis=0)(input_ids_list),
                "attention_mask": Pad(pad_val=0, axis=0)(attention_mask_list),
                "label": Stack(dtype="int64")(labels_list)}

This is written the same way as in PyTorch, and it feels more extensible.
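What per-batch padding does can be sketched in plain Python: every sequence is extended to the length of the longest one in the same batch. The helper pad_batch is a hypothetical illustration of the idea and is not how PaddleNLP's Pad is implemented.

```python
def pad_batch(sequences, pad_val=0):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_val] * (max_len - len(seq)) for seq in sequences]


batch_input_ids = [[2, 4, 5, 3], [2, 6, 3]]
batch_mask = [[1] * len(ids) for ids in batch_input_ids]

padded_ids = pad_batch(batch_input_ids)
padded_mask = pad_batch(batch_mask)
print(padded_ids)   # [[2, 4, 5, 3], [2, 6, 3, 0]]
print(padded_mask)  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Padding to the batch maximum rather than a global max_len keeps short batches small, which is why it is done in the collate function instead of at caching time.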

Model code

The model is implemented mainly on top of the BertPretrainedModel class from PaddleNLP's transformers module.

    import paddle
    import paddle.nn as nn
    import paddle.nn.functional as F
    from paddlenlp.transformers import BertPretrainedModel


    class SentimentAnalysisModel(BertPretrainedModel):
        base_model_prefix = "bert"

        def __init__(self, bert, number_label=3):
            """
            Emotion recognition model, inheriting from paddlenlp.transformers.BertPretrainedModel.
            Args:
                bert: the BERT model
                number_label: number of labels
            """
            super(SentimentAnalysisModel, self).__init__()
            self.bert = bert
            self.classifier = nn.Linear(self.bert.config["hidden_size"], number_label)
            self.loss_fct = nn.CrossEntropyLoss(soft_label=False, axis=-1)

        def forward(self, input_ids, attention_mask, label=None):
            # Reshape attention_mask from 2 dimensions to 4. Unlike torch or tf,
            # paddlenlp.transformers does not expand the dimensions automatically.
            attention_mask = paddle.unsqueeze(attention_mask, axis=[1, 2])
            # Get the [CLS] vector, pooled_output
            pooled_output = self.bert(input_ids=input_ids, attention_mask=attention_mask)[1]
            # Apply a fully connected layer to pooled_output, mapping it to number_label
            logits = self.classifier(pooled_output)
            # Use softmax to get the probability of each label class
            probs = F.softmax(logits, axis=1)
            # Get the label with the highest probability
            pred_label = paddle.argmax(logits, axis=-1)
            outputs = (pred_label, probs)
            # If label is not None, compute the loss with CrossEntropyLoss
            if label is not None:
                loss = self.loss_fct(logits, label)
                outputs = (loss,) + outputs
            return outputs
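The classification head in the forward method above boils down to three small computations on the logits: softmax for probabilities, argmax for the predicted label, and cross-entropy for the loss. Here is a minimal pure-Python sketch for a single sample; the logits are made-up numbers and the helper names are assumptions, not project code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability of the gold label (hard label, single sample)."""
    return -math.log(softmax(logits)[label])


logits = [2.0, 0.5, -1.0]  # made-up scores for three classes
probs = softmax(logits)
pred_label = max(range(len(probs)), key=probs.__getitem__)  # argmax
loss = cross_entropy(logits, label=0)

print("probs:", [round(p, 4) for p in probs])
print("pred_label:", pred_label)
print("loss:", round(loss, 4))
```

Because softmax is monotonic, taking argmax over the logits (as the model does) gives the same label as taking it over the probabilities.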

Note: the code reshapes attention_mask from 2 dimensions to 4. Unlike the torch or tf implementations, paddlenlp.transformers does not expand the dimensions automatically.
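The shape change from [batch, seq_len] to [batch, 1, 1, seq_len] can be checked with nested lists, without installing Paddle. The helpers expand_mask and shape below are hypothetical and only mimic the effect of unsqueezing on axes 1 and 2.

```python
def expand_mask(attention_mask):
    """[batch, seq_len] -> [batch, 1, 1, seq_len], using nested lists."""
    return [[[list(row)]] for row in attention_mask]


def shape(nested):
    """Return the dimensions of a regular nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return dims


mask_2d = [[1, 1, 1, 0], [1, 1, 0, 0]]
mask_4d = expand_mask(mask_2d)
print(shape(mask_2d), "->", shape(mask_4d))  # [2, 4] -> [2, 1, 1, 4]
```

The two singleton axes let the mask broadcast over the attention heads and the query positions of the attention score tensor.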

Model training

The training parameters are shown in the figure below:

Run training with one of the following commands:

    python3 train.py

    python3 train.py --num_train_epochs 5 --train_batch_size 64 --test_batch_size 32 --max_len 256 --output_dir ./output_dir

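The training script consists mainly of the following functions: (1) set_args, which sets the parameters needed for training; (2) train, which trains the model; (3) evaluate, which tests the model on the test set; (4) main, the entry point.

See the train.py file in the AIStudio project for the full code.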
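For readers who have not written such a function before, a set_args along these lines would accept the flags used in the command above. This is only a sketch: the flag names come from that command, while the defaults, help texts, and the optional argv parameter are assumptions and may differ from the project's train.py.

```python
import argparse


def set_args(argv=None):
    """Parse training arguments; defaults here are assumptions, not the project's values."""
    parser = argparse.ArgumentParser(description="Emotion recognition training (sketch)")
    parser.add_argument("--num_train_epochs", type=int, default=5, help="number of training epochs")
    parser.add_argument("--train_batch_size", type=int, default=64, help="batch size for training")
    parser.add_argument("--test_batch_size", type=int, default=32, help="batch size for evaluation")
    parser.add_argument("--max_len", type=int, default=256, help="maximum sequence length")
    parser.add_argument("--output_dir", type=str, default="./output_dir", help="where to save the model")
    return parser.parse_args(argv)


# Passing a list makes the function easy to try without a command line.
args = set_args(["--num_train_epochs", "3", "--max_len", "128"])
print(args.num_train_epochs, args.train_batch_size, args.max_len, args.output_dir)
```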

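Model testing

For model testing, the project provides three variants: dynamic graph testing, ONNX model testing, and static graph testing.

PaddlePaddle 2.0 mainly promotes dynamic graph mode. As is well known, dynamic graphs make code easy to write and easy to debug, but the drawback is lower speed (the graph is loaded again on every run). In industry, speed matters as well as accuracy, so accelerating the model is an essential step. Without changing the model parameters, we can speed things up at the framework level, for example by converting the model to ONNX or by converting the dynamic graph to a static graph.

(1) Convert a single text into the id index data used by the model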

    def convert_featrue(sample, max_len, tokenizer):
        """
        Convert a single text into the id index data used by the model.
        Args:
            sample: a single text, of type str
            max_len: maximum length
            tokenizer: the tokenizer
        Returns:
        """
        # Tokenize the text
        tokens = tokenizer.tokenize(sample)
        # Check the length and truncate if it exceeds the maximum length
        if len(tokens) > max_len - 2:
            tokens = tokens[:max_len - 2]
        # Add [CLS] at the head and [SEP] at the tail
        tokens = ["[CLS]"] + tokens + ["[SEP]"]
        # Convert tokens to ids and build the attention_mask the model needs
        input_ids = tokenizer.convert_tokens_to_ids(tokens)
        attention_mask = [1] * len(input_ids)
        assert len(input_ids) == len(attention_mask)
        # Pad input_ids and attention_mask up to the maximum length.
        # Fixed-length input is needed because the dynamic graph is later converted to ONNX and to a static graph.
        if len(input_ids) < max_len:
            input_ids = input_ids + [0] * (max_len - len(input_ids))
            attention_mask = attention_mask + [0] * (max_len - len(attention_mask))
        return input_ids, attention_mask

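Note: input_ids and attention_mask are padded to the maximum length because the dynamic graph is later converted to ONNX and to a static graph, both of which need fixed-length input.

(2) Test the (dynamic graph) model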

    import json
    import os
    import time

    import paddle
    from paddlenlp.transformers import BertTokenizer

    # batch_data and SentimentAnalysisModel are project code; batch_data is not shown in this post.


    def predict_one_sample(sample_list, model, tokenizer, max_len, id2label):
        """
        Run batch prediction on the data and get the predicted label for each sample.
        Args:
            sample_list: the samples, as a list
            model: the model
            tokenizer: the tokenizer
            max_len: maximum length
            id2label: dict mapping label ids to label names
        Returns:
        """
        # Convert the data into tensors the model can use
        batch = batch_data(sample_list, max_len, tokenizer)
        # Turn off dropout
        model.eval()
        # Turn off gradient computation
        with paddle.no_grad():
            input_ids = batch["input_ids"]
            attention_mask = batch["attention_mask"]
            # Get the model predictions
            [pred_label, _] = model.forward(input_ids, attention_mask)
            pred_label = pred_label.numpy()
        # Convert the predictions into label names
        label_name = [id2label[pred] for pred in pred_label]
        return zip(sample_list, label_name)


    def test(args):
        """Test the (dynamic graph) model."""
        # Set the GPU information
        os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
        os.environ["CUDA_VISIBLE_DEVICES"] = args.device
        # Get the device to run on
        device = "gpu:{}".format(args.device) if paddle.is_compiled_with_cuda() and int(args.device) >= 0 else "cpu"
        paddle.device.set_device(device)
        # Load the saved model and initialize it
        model = SentimentAnalysisModel.from_pretrained(args.model_path, number_label=args.num_labels)
        # Instantiate the tokenizer
        tokenizer = BertTokenizer(args.vocab_path, do_lower_case=True)
        model.to(device)
        id2label = {0: "angry", 1: "happy", 2: "neutral", 3: "surprise", 4: "sad", 5: "fear"}
        # Timing: record the start time
        T1 = time.time()
        # Iterate over the test file, one sample at a time
        with open(args.test_path, "r", encoding="utf-8") as fh:
            for i, line in enumerate(fh):
                # Only the first 1000 samples are timed
                if i >= 1000:
                    break
                sample_list = [json.loads(line)["text"]]
                # Single-sample test
                # sample_list = ["媽媽說想和我聊天，她一定是有難過的事了。。。我要上課，所以我好難過。。"]
                result = predict_one_sample(sample_list, model, tokenizer, args.max_len, id2label)
                # Print the result for each sample
                # for sample, label in result:
                #     print("label: {}, text: {}".format(label, sample))
        # Timing: record the end time
        T2 = time.time()
        print("paddle model, running time for 1000 runs: {} s".format(T2 - T1))
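The same timing loop is repeated for all three test variants, so it can be factored into a small harness. The sketch below is an assumption-based illustration: time_predictions and dummy_predict are hypothetical names, and the dummy predictor stands in for any of the real models.

```python
import time


def time_predictions(predict_fn, samples, limit=1000):
    """Call predict_fn on one sample at a time and return (count, elapsed seconds)."""
    start = time.perf_counter()
    n = 0
    for sample in samples:
        if n >= limit:
            break  # stop once the limit is reached
        predict_fn([sample])
        n += 1
    elapsed = time.perf_counter() - start
    return n, elapsed


def dummy_predict(batch):
    """Placeholder for a real model call; always answers 'neutral'."""
    return ["neutral" for _ in batch]


n, elapsed = time_predictions(dummy_predict, (f"text {i}" for i in range(5000)), limit=1000)
print(f"{n} predictions in {elapsed:.4f} s ({elapsed / n * 1000:.4f} ms each)")
```

time.perf_counter is used here because it is a monotonic, high-resolution clock, which makes it a safer choice than time.time for short benchmarks.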

(3) Test the ONNX model

    import onnx
    import onnxruntime


    def save_onnx_model(args):
        """Convert the paddle model into an ONNX model."""
        # Load the saved model and initialize its parameters
        model = SentimentAnalysisModel.from_pretrained(args.model_path, number_label=args.num_labels)
        model.eval()
        # Define the input nodes input_ids and attention_mask
        input_ids = paddle.static.InputSpec([None, args.max_len], "int64", "input_ids")
        attention_mask = paddle.static.InputSpec([None, args.max_len], "int64", "attention_mask")
        # Use paddle.onnx.export to convert the model into an ONNX model and save it
        paddle.onnx.export(model, args.onnx_model_path, input_spec=[input_ids, attention_mask], opset_version=12)
        # Check that the ONNX model can be loaded
        onnx_model = onnx.load(args.onnx_model_path + ".onnx")
        onnx.checker.check_model(onnx_model)


    def test_onnx(args):
        """Test the ONNX model."""
        # Set the GPU information
        os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
        os.environ["CUDA_VISIBLE_DEVICES"] = args.device
        # Instantiate the tokenizer
        tokenizer = BertTokenizer(args.vocab_path, do_lower_case=True)
        id2label = {0: "angry", 1: "happy", 2: "neutral", 3: "surprise", 4: "sad", 5: "fear"}
        # Load the ONNX model
        ort_sess = onnxruntime.InferenceSession(args.onnx_model_path + ".onnx")
        # Timing: record the start time
        T1 = time.time()
        # Iterate over the test file, one sample at a time
        with open(args.test_path, "r", encoding="utf-8") as fh:
            for i, line in enumerate(fh):
                # Only the first 1000 samples are timed
                if i >= 1000:
                    break
                sample_list = [json.loads(line)["text"]]
                # sample_list = ["媽媽說想和我聊天，她一定是有難過的事了。。。我要上課，所以我好難過。。"]
                batch = batch_data(sample_list, args.max_len, tokenizer)
                input_ids = batch["input_ids"]
                input_ids = input_ids.numpy()
                attention_mask = batch["attention_mask"]
                attention_mask = attention_mask.numpy()
                # Build the feed dict required by ONNX
                ort_inputs = {ort_sess.get_inputs()[0].name: input_ids, ort_sess.get_inputs()[1].name: attention_mask}
                # Model prediction
                pred_label = ort_sess.run(None, ort_inputs)[0]
                # Convert predictions into label names
                label_name = [id2label[pred] for pred in pred_label]
                # Print the result for each sample
                # for sample, label in zip(sample_list, label_name):
                #     print("label: {}, text: {}".format(label, sample))
        # Timing: record the end time
        T2 = time.time()
        print("onnx model, running time for 1000 runs: {} s".format(T2 - T1))

(4) Test the static graph model

    def save_static_model(args):
        """Convert the paddle dynamic graph into a static graph."""
        # Load the saved model and initialize its parameters
        model = SentimentAnalysisModel.from_pretrained(args.model_path, number_label=args.num_labels)
        model.eval()
        # Define the input nodes input_ids and attention_mask
        input_ids = paddle.static.InputSpec(shape=[None, args.max_len], dtype='int64', name='input_ids')
        attention_mask = paddle.static.InputSpec(shape=[None, args.max_len], dtype='int64', name='attention_mask')
        # Use paddle.jit.to_static to convert the dynamic graph into a static graph
        model = paddle.jit.to_static(model, input_spec=[input_ids, attention_mask])
        # Run a prediction with the static graph
        sample_list = ["媽媽說想和我聊天，她一定是有難過的事了。。。我要上課，所以我好難過。。"]
        tokenizer = BertTokenizer(args.vocab_path, do_lower_case=True)
        batch = batch_data(sample_list, args.max_len, tokenizer)
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        outputs = model(input_ids, attention_mask)
        # Save the static graph
        paddle.jit.save(layer=model, path=args.static_model_path, input_spec=[input_ids, attention_mask])


    def test_static(args):
        """Test the static graph model."""
        # Set the GPU information
        os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
        os.environ["CUDA_VISIBLE_DEVICES"] = args.device
        device = "gpu:{}".format(args.device) if paddle.is_compiled_with_cuda() and int(args.device) >= 0 else "cpu"
        paddle.device.set_device(device)
        use_gpu = "gpu" in device
        # Wrap the model with InferenceModel (a project helper, not shown in this post)
        model = InferenceModel(modelpath=args.static_model_path, use_gpu=use_gpu, use_mkldnn=args.use_mkldnn)
        model.eval()
        # Instantiate the tokenizer
        tokenizer = BertTokenizer(args.vocab_path, do_lower_case=True)
        id2label = {0: "angry", 1: "happy", 2: "neutral", 3: "surprise", 4: "sad", 5: "fear"}
        # Timing: record the start time
        T1 = time.time()
        # Iterate over the test file, one sample at a time
        with open(args.test_path, "r", encoding="utf-8") as fh:
            for i, line in enumerate(fh):
                # Only the first 1000 samples are timed
                if i >= 1000:
                    break
                sample_list = [json.loads(line)["text"]]
                # sample_list = ["媽媽說想和我聊天，她一定是有難過的事了。。。我要上課，所以我好難過。。"]
                batch = batch_data(sample_list, args.max_len, tokenizer)
                input_ids = batch["input_ids"]
                attention_mask = batch["attention_mask"]
                pred_label = model(input_ids, attention_mask)[0]
                label_name = [id2label[pred] for pred in pred_label]
                # label_name = [id2label[pred] for pred in pred_label.numpy()]
                # Print the result for each sample
                # for sample, label in zip(sample_list, label_name):
                #     print("label: {}, text: {}".format(label, sample))
        # Timing: record the end time
        T2 = time.time()
        print("paddle static graph, running time for 1000 runs: {} s".format(T2 - T1))

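The test results are as follows:

The dynamic graph takes 27.93 s for 1000 runs, ONNX takes 10.89 s for 1000 runs, and the static graph takes 7.66 s for 1000 runs.

The dynamic graph is the slowest and the static graph is the fastest. This actually surprised me, because I had always assumed ONNX would be the fastest; I am not sure whether it has to do with the ONNX version. Converting a dynamic graph to ONNX also has plenty of pitfalls: many PaddlePaddle operations currently raise errors during ONNX conversion, so converting to a static graph is the safer route.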
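The three timings reported above are easier to compare as per-sample latency and speedup relative to the dynamic graph. The short calculation below uses only those reported numbers.

```python
# Total seconds for 1000 single-sample runs, as reported in the test results.
timings_s = {"dynamic graph": 27.93, "onnx": 10.89, "static graph": 7.66}
runs = 1000

baseline = timings_s["dynamic graph"]
per_sample_ms = {name: total / runs * 1000 for name, total in timings_s.items()}
speedups = {name: baseline / total for name, total in timings_s.items()}

for name in timings_s:
    print(f"{name}: {per_sample_ms[name]:.2f} ms/sample, {speedups[name]:.2f}x vs dynamic graph")
```

So ONNX is roughly 2.56 times faster than the dynamic graph in this measurement, and the static graph roughly 3.65 times faster.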

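Summary

Since version 2.0, PaddlePaddle has become very similar to torch, and it also has a package similar to transformers. There are still some pitfalls along the way, for example loading pretrained models, and attention_mask having to be 4-dimensional because there is no internal dimension conversion.

Still, it is a domestic framework and you can use a V100 for free, so every difficulty can be overcome, hahaha~

For those who are just getting started and do not have a GPU, PaddlePaddle is well worth a try (Baidu, remember to pay me my advertising fee~).

Putting this together took effort, so if you liked it, remember to like, follow, and share!

You are also welcome to follow my WeChat official account "NLP工作站". Our motto is "As long as life goes on, learning never stops."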

