[Deep Learning Series] Image Classification with PaddlePaddle and TensorFlow

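Last month I published four articles. They covered the "hello world" of deep learning, MNIST image recognition, and explained convolutional neural networks in detail: the basic principles, a hand-written CNN, and a walkthrough of the PaddlePaddle source code. This article shows how to do image classification with PaddlePaddle and TensorFlow. All of the code is in my GitHub repository and can be downloaded and trained as-is.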

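Among convolutional neural networks there are five classic models: LeNet-5, AlexNet, GoogLeNet, VGG, and ResNet. This article first designs a small CNN to classify images, then looks at how the LeNet-5 architecture handles the same task. It implements LeNet-5 with both the popular TensorFlow framework and Baidu's PaddlePaddle and compares the results.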

What is image classification

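Image classification distinguishes images of different categories according to their semantic content. It is a fundamental problem in computer vision and the basis for higher-level vision tasks such as object detection, image segmentation, object tracking, and behavior analysis. It is widely applied in many fields: face recognition and intelligent video analysis in security, traffic-scene recognition in transportation, content-based image retrieval and automatic photo-album organization on the internet, and image recognition in medicine (quoted from the official website).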

The CIFAR-10 dataset

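The CIFAR-10 classification problem is a common benchmark in machine learning. The dataset consists of 60,000 32×32 RGB color images in 10 classes, with 50,000 images in the training set and 10,000 in the test set. The task is to classify each 32×32-pixel RGB image into one of 10 categories: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck.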

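For more information, see the CIFAR-10 page and Alex Krizhevsky's technical report. Other common datasets include CIFAR-100, which has 100 object classes, and the ILSVRC competition dataset, which has 1,000 classes.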
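To make those numbers concrete, here is a minimal sketch of the shapes involved. It assumes the channel-major (CHW) flattening that the training code later in this article feeds to the network (datadim = 3 * 32 * 32). The class list is the standard CIFAR-10 label order, and flat_index is a hypothetical helper introduced only for illustration.

```python
# Standard CIFAR-10 class names, indexed by integer label 0..9.
CIFAR10_CLASSES = [
    "airplane", "automobile", "bird", "cat", "deer",
    "dog", "frog", "horse", "ship", "truck",
]

CHANNELS, HEIGHT, WIDTH = 3, 32, 32
DATADIM = CHANNELS * HEIGHT * WIDTH  # 3072 values per image


def flat_index(channel, row, col):
    """Position of one pixel value in a channel-major (CHW) flattened image."""
    return channel * HEIGHT * WIDTH + row * WIDTH + col


print(DATADIM)                 # 3072
print(flat_index(0, 0, 0))     # 0: first value of the first channel
print(flat_index(1, 0, 0))     # 1024: first value of the second channel
print(flat_index(2, 31, 31))   # 3071: last value of the last channel
print(CIFAR10_CLASSES[5])      # dog
```

Under this label order a correctly classified dog image would be reported as label 5.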

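Designing our own CNN

With the basic structure of a CNN in mind, we first design a simple CNN to classify the CIFAR-10 data.

Network structure

Code implementation

1. Network structure: simple_cnn.py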

# coding: utf-8
'''
Created by huxiaoman 2017.11.27
simple_cnn.py: a simple, self-designed CNN architecture
'''
import paddle.v2 as paddle


def simple_cnn(img):
    # Block 1: 5x5 convolution with 20 filters + ReLU, then 2x2 pooling (stride 2)
    conv_pool_1 = paddle.networks.simple_img_conv_pool(
        input=img,
        filter_size=5,
        num_filters=20,
        num_channel=3,
        pool_size=2,
        pool_stride=2,
        act=paddle.activation.Relu())
    # Block 2: 5x5 convolution with 50 filters + ReLU, then 2x2 pooling (stride 2)
    conv_pool_2 = paddle.networks.simple_img_conv_pool(
        input=conv_pool_1,
        filter_size=5,
        num_filters=50,
        num_channel=20,
        pool_size=2,
        pool_stride=2,
        act=paddle.activation.Relu())
    # Fully connected layer with 512 units
    fc = paddle.layer.fc(
        input=conv_pool_2, size=512, act=paddle.activation.Softmax())
    # Return the last layer so the training script can attach its classifier
    return fc
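The feature-map sizes this network produces can be checked by hand. The sketch below assumes unpadded convolutions with stride 1 and non-overlapping 2×2 pooling. conv_out, pool_out, and trace_simple_cnn are helper names introduced here for illustration, not PaddlePaddle functions. Under those assumptions the sizes match the c, h, w, and size values that PaddlePaddle prints in the training log.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output width/height of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, window, stride):
    """Output width/height of a pooling layer (no padding, exact division)."""
    return (size - window) // stride + 1


def trace_simple_cnn(input_size=32):
    """Return (name, channels, side, flattened size) for each stage of simple_cnn."""
    stages = []
    side = input_size
    for i, filters in enumerate([20, 50]):
        side = conv_out(side, kernel=5)
        stages.append(("conv_%d" % i, filters, side, filters * side * side))
        side = pool_out(side, window=2, stride=2)
        stages.append(("pool_%d" % i, filters, side, filters * side * side))
    return stages


for name, c, side, size in trace_simple_cnn():
    print("%s: c = %d, h = %d, w = %d, size = %d" % (name, c, side, side, size))
```

The spatial side shrinks 32 → 28 → 14 → 10 → 5, so the fully connected layer receives 50 × 5 × 5 = 1250 values.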

2. Training program: train_simple_cnn.py

# coding: utf-8
'''
Created by huxiaoman 2017.11.27
train_simple_cnn.py: train simple_cnn to classify the CIFAR-10 dataset
'''
import os
import sys

import numpy as np
import paddle.v2 as paddle
from PIL import Image

from simple_cnn import simple_cnn

# As written, GPU training is on by default and is turned off by WITH_GPU=1.
with_gpu = os.getenv('WITH_GPU', '0') != '1'


def main():
    datadim = 3 * 32 * 32
    classdim = 10

    # PaddlePaddle init
    paddle.init(use_gpu=with_gpu, trainer_count=7)

    image = paddle.layer.data(
        name="image", type=paddle.data_type.dense_vector(datadim))

    # Network config: the simple CNN defined in simple_cnn.py
    net = simple_cnn(image)

    out = paddle.layer.fc(
        input=net, size=classdim, act=paddle.activation.Softmax())

    lbl = paddle.layer.data(
        name="label", type=paddle.data_type.integer_value(classdim))
    cost = paddle.layer.classification_cost(input=out, label=lbl)

    # Create parameters
    parameters = paddle.parameters.create(cost)

    # Create optimizer
    momentum_optimizer = paddle.optimizer.Momentum(
        momentum=0.9,
        regularization=paddle.optimizer.L2Regularization(rate=0.0002 * 128),
        learning_rate=0.1 / 128.0,
        learning_rate_decay_a=0.1,
        learning_rate_decay_b=50000 * 100,
        learning_rate_schedule='discexp')

    # End batch and end pass event handler
    def event_handler(event):
        if isinstance(event, paddle.event.EndIteration):
            if event.batch_id % 100 == 0:
                print("\nPass %d, Batch %d, Cost %f, %s" % (
                    event.pass_id, event.batch_id, event.cost, event.metrics))
            else:
                sys.stdout.write('.')
                sys.stdout.flush()
        if isinstance(event, paddle.event.EndPass):
            # save parameters
            with open('params_pass_%d.tar' % event.pass_id, 'wb') as f:
                parameters.to_tar(f)

            # evaluate on the CIFAR-10 test set after every pass
            result = trainer.test(
                reader=paddle.batch(
                    paddle.dataset.cifar.test10(), batch_size=128),
                feeding={'image': 0,
                         'label': 1})
            print("\nTest with Pass %d, %s" % (event.pass_id, result.metrics))

    # Create trainer
    trainer = paddle.trainer.SGD(
        cost=cost, parameters=parameters, update_equation=momentum_optimizer)

    # Save the inference topology to protobuf.
    inference_topology = paddle.topology.Topology(layers=out)
    with open("inference_topology.pkl", 'wb') as f:
        inference_topology.serialize_for_inference(f)

    trainer.train(
        reader=paddle.batch(
            paddle.reader.shuffle(
                paddle.dataset.cifar.train10(), buf_size=50000),
            batch_size=128),
        num_passes=200,
        event_handler=event_handler,
        feeding={'image': 0,
                 'label': 1})

    # inference
    def load_image(file):
        im = Image.open(file)
        im = im.resize((32, 32), Image.ANTIALIAS)
        im = np.array(im).astype(np.float32)
        # The loaded array is stored as H(height), W(width), C(channel).
        # PaddlePaddle requires the CHW order, so transpose it.
        im = im.transpose((2, 0, 1))  # CHW
        # In the training phase, the channel order of CIFAR
        # images is B(Blue), G(Green), R(Red), but PIL opens
        # images in RGB mode, so the channel order must be swapped.
        im = im[(2, 1, 0), :, :]  # BGR
        im = im.flatten()
        im = im / 255.0
        return im

    test_data = []
    cur_dir = os.path.dirname(os.path.realpath(__file__))
    test_data.append((load_image(cur_dir + '/image/dog.png'), ))

    # users can remove the comments and change the model name
    # with open('params_pass_50.tar', 'rb') as f:
    #     parameters = paddle.parameters.Parameters.from_tar(f)

    probs = paddle.infer(
        output_layer=out, parameters=parameters, input=test_data)
    lab = np.argsort(-probs)  # probs and lab are the results of one batch data
    print("Label of image/dog.png is: %d" % lab[0][0])


if __name__ == '__main__':
    main()
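The optimizer settings in the training script are easier to read once the schedule is written out. The sketch below assumes that the 'discexp' schedule multiplies the base rate by learning_rate_decay_a once for every learning_rate_decay_b samples processed, that is lr = learning_rate * a ** floor(samples / b). That formula is an assumption here and should be checked against the PaddlePaddle documentation. discexp_lr is a hypothetical helper, not a PaddlePaddle function.

```python
import math

BASE_LR = 0.1 / 128.0        # learning_rate in the training script
DECAY_A = 0.1                # learning_rate_decay_a
DECAY_B = 50000 * 100        # learning_rate_decay_b, counted in samples
SAMPLES_PER_PASS = 50000     # size of the CIFAR-10 training set


def discexp_lr(samples_processed):
    """Assumed 'discexp' rule: scale by DECAY_A once per DECAY_B samples."""
    steps = math.floor(samples_processed / float(DECAY_B))
    return BASE_LR * DECAY_A ** steps


for pass_id in (0, 99, 100, 199):
    lr = discexp_lr(pass_id * SAMPLES_PER_PASS)
    print("pass %3d: lr = %.8f" % (pass_id, lr))
```

If the assumption holds, the rate stays at 0.1 / 128 for the first 100 passes and is ten times smaller for the remaining 100.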

3. Output

I1128 21:44:30.218085 14733 Util.cpp:166] commandline: --use_gpu=True --trainer_count=7
[INFO 2017-11-28 21:44:35,874 layers.py:2539] output for __conv_pool_0___conv: c = 20, h = 28, w = 28, size = 15680
[INFO 2017-11-28 21:44:35,874 layers.py:2667] output for __conv_pool_0___pool: c = 20, h = 14, w = 14, size = 3920
[INFO 2017-11-28 21:44:35,875 layers.py:2539] output for __conv_pool_1___conv: c = 50, h = 10, w = 10, size = 5000
[INFO 2017-11-28 21:44:35,876 layers.py:2667] output for __conv_pool_1___pool: c = 50, h = 5, w = 5, size = 1250
I1128 21:44:35.881502 14733 MultiGradientMachine.cpp:99] numLogicalDevices=1 numThreads=7 numDevices=8
I1128 21:44:35.928449 14733 GradientMachine.cpp:85] Initing parameters..
I1128 21:44:36.056259 14733 GradientMachine.cpp:92] Init parameters done.

Pass 0, Batch 0, Cost 2.302628, {'classification_error_evaluator': 0.9296875}
...................................................................................................
......
Pass 199, Batch 200, Cost 0.869726, {'classification_error_evaluator': 0.3671875}
...................................................................................................
Pass 199, Batch 300, Cost 0.801396, {'classification_error_evaluator': 0.3046875}
..........................................I1128 23:21:39.443141 14733 MultiGradientMachine.cpp:99] numLogicalDevices=1 numThreads=7 numDevices=8

Test with Pass 199, {'classification_error_evaluator': 0.5248000025749207}
Label of image/dog.png is: 9

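I ran 7 trainer threads on 8 Tesla K80 GPUs with batch_size = 128 for 200 passes. Training took 1 h 37 min and the classification error on the test set was 0.5248. That is, well, not a great result, but it can serve as a baseline to tune against later.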
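One sanity check on the log above: the very first cost, 2.302628, is almost exactly ln 10. That is what cross-entropy gives when an untrained softmax classifier spreads its probability evenly over the 10 classes. The sketch below works this out in plain Python. It assumes the classification cost is the usual cross-entropy, the negative log-probability of the true class.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[label])


num_classes = 10
uniform = softmax([0.0] * num_classes)    # all-equal scores give 0.1 per class
print(cross_entropy(uniform, label=3))    # about 2.302585, i.e. ln(10)
print(1.0 - 1.0 / num_classes)            # 0.9: expected error when guessing
```

The first-batch error of about 0.93 in the log is likewise close to the 0.9 expected from random guessing, so the run starts from an untrained state as it should.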

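LeNet-5 network structure

LeNet-5 was proposed by Yann LeCun in the paper "Gradient-based learning applied to document recognition", which validates the network on MNIST handwritten digits as input data (32 × 32). Let's look at the network structure.

LeNet-5 has 8 layers in total: 1 input layer + 3 convolutional layers (C1, C3, C5) + 2 subsampling layers (S2, S4) + 1 fully connected layer (F6) + 1 output layer. Each layer has multiple feature maps (groups of automatically extracted features).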

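Input layer

CIFAR-10 dataset, each image 32 × 32. The counts below treat the input as a single channel, as in the original paper; with three RGB channels each C1 kernel has 5 × 5 × 3 weights instead.

C1 convolutional layer

6 feature maps, kernel size 5 × 5, feature-map size 28 × 28

Parameters per convolutional neuron: 5 × 5 = 25 weights plus one bias

Number of connections: (5×5+1) × 6 × (28×28) = 122,304

Parameter sharing: parameters are shared within each feature map, so there are (5×5+1) × 6 = 156 parameters in total

S2 subsampling (pooling) layer

6 feature maps of size 14 × 14, pooling size 2 × 2

Each unit is connected to a non-overlapping 2 × 2 sliding window in the corresponding feature map of the previous layer, so each S2 feature map is 1/4 the size of a C1 feature map

Number of connections: (2×2+1) × 1 × 14 × 14 × 6 = 5,880

Parameter sharing: parameters are shared within each feature map, giving 2 × 6 = 12 trainable parameters

C3 convolutional layer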

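This layer is slightly more complex: S2 and C3 feature maps are connected many-to-many. The simplest scheme connects every S2 feature map to every C3 feature map (it is also possible to connect each C3 feature map to only a subset of the S2 feature maps). Under full connection, the 6 S2 feature maps are convolved with 6 separate 5×5 kernels to produce one C3 feature map, with one bias per output feature map. C3 has 16 feature maps, each 10×10 (14 − 5 + 1 = 10), so the layer has (5×5×6+1) × 16 = 2,416 trainable parameters and 2,416 × 10 × 10 = 241,600 connections.

S4 subsampling layer

Same as S2. With max pooling or mean pooling the layer has 0 trainable parameters. Its 16 feature maps are 5×5, so the number of connections is (2×2+1) × 16 × 5 × 5 = 2,000.

C5 convolutional layer

Similar to C3, every S4 feature map is connected to every C5 feature map. In the implementation used here, the 16 S4 feature maps are convolved with 16 separate 1×1 kernels to produce one C5 feature map, with one bias per output feature map. C5 has 120 feature maps, so the layer has (1×1×16+1) × 120 = 2,040 trainable parameters. A 1×1 kernel keeps the 5×5 spatial size, so the number of connections is 2,040 × 5 × 5 = 51,000.

F6 fully connected layer

Flattening C5 gives 5×5×120 = 3,000 nodes, which feed a fully connected layer with 84 units. Counting biases, the number of trainable parameters (and connections) is (3000+1) × 84 = 252,084.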

Output layer

This is a 10-class problem, so there are 10 output units, normalized into probabilities by softmax. Each output unit takes 84 inputs.
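The convolutional parameter counts and feature-map sizes can be reproduced with a few lines of arithmetic. The sketch below assumes a 32×32 single-channel input (as in the original paper), unpadded stride-1 convolutions, non-overlapping 2×2 pooling, and full connectivity between S2 and C3. The C5 kernel size of 1 follows lenet.py later in this article. LAYERS, conv_params, and lenet_summary are names introduced here for illustration only.

```python
def conv_params(kernel, in_maps, out_maps):
    """Trainable parameters of a conv layer: weights plus one bias per output map."""
    return (kernel * kernel * in_maps + 1) * out_maps


# (name, kind, kernel or window size, output feature maps)
LAYERS = [
    ("C1", "conv", 5, 6),
    ("S2", "pool", 2, 6),
    ("C3", "conv", 5, 16),
    ("S4", "pool", 2, 16),
    ("C5", "conv", 1, 120),
]


def lenet_summary(input_side=32, in_channels=1):
    """Return a list of (name, maps, side, conv params or None) per layer."""
    rows = []
    side, maps = input_side, in_channels
    for name, kind, size, out_maps in LAYERS:
        if kind == "conv":
            params = conv_params(size, maps, out_maps)
            side = side - size + 1      # unpadded, stride-1 convolution
        else:
            params = None               # pooling parameters are not modeled here
            side = side // size         # non-overlapping pooling window
        maps = out_maps
        rows.append((name, maps, side, params))
    return rows


for name, maps, side, params in lenet_summary():
    print(name, maps, "maps of", "%dx%d" % (side, side), "conv params:", params)
```

With in_channels=3, as for CIFAR-10 RGB input, the same function gives 456 parameters for C1. The spatial sizes (28, 14, 10, 5, 5) agree with the h and w values printed in the LeNet-5 training log.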

LeNet-5 in PaddlePaddle

1. Network structure: lenet.py

# coding: utf-8
'''
Created by huxiaoman 2017.11.27
lenet.py: LeNet-5
'''
import paddle.v2 as paddle
from paddle.trainer_config_helpers import img_conv_layer


def lenet(img):
    # C1 + S2: 5x5 convolution with 6 filters, then 2x2 pooling (stride 2)
    conv_pool_1 = paddle.networks.simple_img_conv_pool(
        input=img,
        filter_size=5,
        num_filters=6,
        num_channel=3,
        pool_size=2,
        pool_stride=2,
        act=paddle.activation.Relu())
    # C3 + S4: 5x5 convolution with 16 filters, then 2x2 pooling (stride 2)
    conv_pool_2 = paddle.networks.simple_img_conv_pool(
        input=conv_pool_1,
        filter_size=5,
        num_filters=16,
        pool_size=2,
        pool_stride=2,
        act=paddle.activation.Relu())
    # C5: 1x1 convolution with 120 filters
    conv_3 = img_conv_layer(
        input=conv_pool_2,
        filter_size=1,
        num_filters=120,
        stride=1)
    # F6: fully connected layer with 84 units
    fc = paddle.layer.fc(
        input=conv_3, size=84, act=paddle.activation.Sigmoid())
    return fc

2. Training code: train_lenet.py

# coding: utf-8
'''
Created by huxiaoman 2017.11.27
train_lenet.py: train LeNet-5 to classify the CIFAR-10 dataset
'''
import os
import sys

import numpy as np
import paddle.v2 as paddle
from PIL import Image

from lenet import lenet

# As written, GPU training is on by default and is turned off by WITH_GPU=1.
with_gpu = os.getenv('WITH_GPU', '0') != '1'


def main():
    datadim = 3 * 32 * 32
    classdim = 10

    # PaddlePaddle init
    paddle.init(use_gpu=with_gpu, trainer_count=7)

    image = paddle.layer.data(
        name="image", type=paddle.data_type.dense_vector(datadim))

    # Network config: LeNet-5 as defined in lenet.py
    net = lenet(image)

    out = paddle.layer.fc(
        input=net, size=classdim, act=paddle.activation.Softmax())

    lbl = paddle.layer.data(
        name="label", type=paddle.data_type.integer_value(classdim))
    cost = paddle.layer.classification_cost(input=out, label=lbl)

    # Create parameters
    parameters = paddle.parameters.create(cost)

    # Create optimizer
    momentum_optimizer = paddle.optimizer.Momentum(
        momentum=0.9,
        regularization=paddle.optimizer.L2Regularization(rate=0.0002 * 128),
        learning_rate=0.1 / 128.0,
        learning_rate_decay_a=0.1,
        learning_rate_decay_b=50000 * 100,
        learning_rate_schedule='discexp')

    # End batch and end pass event handler
    def event_handler(event):
        if isinstance(event, paddle.event.EndIteration):
            if event.batch_id % 100 == 0:
                print("\nPass %d, Batch %d, Cost %f, %s" % (
                    event.pass_id, event.batch_id, event.cost, event.metrics))
            else:
                sys.stdout.write('.')
                sys.stdout.flush()
        if isinstance(event, paddle.event.EndPass):
            # save parameters
            with open('params_pass_%d.tar' % event.pass_id, 'wb') as f:
                parameters.to_tar(f)

            # evaluate on the CIFAR-10 test set after every pass
            result = trainer.test(
                reader=paddle.batch(
                    paddle.dataset.cifar.test10(), batch_size=128),
                feeding={'image': 0,
                         'label': 1})
            print("\nTest with Pass %d, %s" % (event.pass_id, result.metrics))

    # Create trainer
    trainer = paddle.trainer.SGD(
        cost=cost, parameters=parameters, update_equation=momentum_optimizer)

    # Save the inference topology to protobuf.
    inference_topology = paddle.topology.Topology(layers=out)
    with open("inference_topology.pkl", 'wb') as f:
        inference_topology.serialize_for_inference(f)

    trainer.train(
        reader=paddle.batch(
            paddle.reader.shuffle(
                paddle.dataset.cifar.train10(), buf_size=50000),
            batch_size=128),
        num_passes=200,
        event_handler=event_handler,
        feeding={'image': 0,
                 'label': 1})

    # inference
    def load_image(file):
        im = Image.open(file)
        im = im.resize((32, 32), Image.ANTIALIAS)
        im = np.array(im).astype(np.float32)
        # The loaded array is stored as H(height), W(width), C(channel).
        # PaddlePaddle requires the CHW order, so transpose it.
        im = im.transpose((2, 0, 1))  # CHW
        # In the training phase, the channel order of CIFAR
        # images is B(Blue), G(Green), R(Red), but PIL opens
        # images in RGB mode, so the channel order must be swapped.
        im = im[(2, 1, 0), :, :]  # BGR
        im = im.flatten()
        im = im / 255.0
        return im

    test_data = []
    cur_dir = os.path.dirname(os.path.realpath(__file__))
    test_data.append((load_image(cur_dir + '/image/dog.png'), ))

    # users can remove the comments and change the model name
    # with open('params_pass_50.tar', 'rb') as f:
    #     parameters = paddle.parameters.Parameters.from_tar(f)

    probs = paddle.infer(
        output_layer=out, parameters=parameters, input=test_data)
    lab = np.argsort(-probs)  # probs and lab are the results of one batch data
    print("Label of image/dog.png is: %d" % lab[0][0])


if __name__ == '__main__':
    main()
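The array manipulation inside load_image is easy to get wrong, so here it is in isolation, applied to a tiny synthetic image instead of a file. This is a NumPy-only sketch: to_chw_bgr_vector is a hypothetical name, and the 2×2 all-red test image is made up for the demonstration. It follows the same steps load_image applies after resizing: HWC to CHW, RGB to BGR, flatten, scale to [0, 1].

```python
import numpy as np


def to_chw_bgr_vector(hwc_rgb):
    """Convert an H x W x 3 RGB array (values 0..255) to a flat CHW/BGR vector."""
    im = np.asarray(hwc_rgb, dtype=np.float32)
    im = im.transpose((2, 0, 1))      # HWC -> CHW
    im = im[[2, 1, 0], :, :]          # RGB -> BGR along the channel axis
    return im.flatten() / 255.0


# A 2x2 image whose pixels are all pure red: R=255, G=0, B=0.
red = np.zeros((2, 2, 3), dtype=np.uint8)
red[:, :, 0] = 255

vec = to_chw_bgr_vector(red)
print(vec.shape)                       # (12,)
# First pixel of each plane: the blue and green planes are 0, the red plane is 1.
print(vec.reshape(3, 2, 2)[:, 0, 0])
```

After the swap the red values sit in the last of the three planes, which is the channel order the comments in load_image say the CIFAR training data uses.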

3. Output

I1129 14:52:44.314946 15153 Util.cpp:166] commandline: --use_gpu=True --trainer_count=7
[INFO 2017-11-29 14:52:50,490 layers.py:2539] output for __conv_pool_0___conv: c = 6, h = 28, w = 28, size = 4704
[INFO 2017-11-29 14:52:50,491 layers.py:2667] output for __conv_pool_0___pool: c = 6, h = 14, w = 14, size = 1176
[INFO 2017-11-29 14:52:50,491 layers.py:2539] output for __conv_pool_1___conv: c = 16, h = 10, w = 10, size = 1600
[INFO 2017-11-29 14:52:50,492 layers.py:2667] output for __conv_pool_1___pool: c = 16, h = 5, w = 5, size = 400
[INFO 2017-11-29 14:52:50,493 layers.py:2539] output for __conv_0__: c = 120, h = 5, w = 5, size = 3000
I1129 14:52:50.498749 15153 MultiGradientMachine.cpp:99] numLogicalDevices=1 numThreads=7 numDevices=8
I1129 14:52:50.545882 15153 GradientMachine.cpp:85] Initing parameters..
I1129 14:52:50.651103 15153 GradientMachine.cpp:92] Init parameters done.

Pass 0, Batch 0, Cost 2.331898, {'classification_error_evaluator': 0.9609375}
......
Pass 199, Batch 300, Cost 0.004373, {'classification_error_evaluator': 0.0}
..........................................I1129 16:17:08.678097 15153 MultiGradientMachine.cpp:99] numLogicalDevices=1 numThreads=7 numDevices=8

Test with Pass 199, {'classification_error_evaluator': 0.39579999446868896}
Label of image/dog.png is: 7

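With the same 7 threads, 8 Tesla K80 GPUs, batch_size = 128, and 200 passes, training took 1 h 25 min and the test classification error was 0.3958, about 12.9 percentage points lower than simple_cnn's 0.5248. This is still not a good result. The detailed log shows the loss first falling and then rising during training, which indicates some overfitting. How to prevent overfitting will be covered in detail later.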
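To put the two runs side by side, the sketch below converts the logged test errors to accuracies and measures the gap between training and test error for LeNet-5. One caveat: the training figure is the error on a single mini-batch (the last line logged before the test), so the gap is only a rough indicator of overfitting.

```python
# Final figures copied from the two training logs in this article.
simple_cnn_test_error = 0.5248000025749207
lenet_test_error = 0.39579999446868896
lenet_last_batch_train_error = 0.0   # from the "Pass 199, Batch 300" line


def to_accuracy(error_rate):
    """Accuracy is the complement of the classification error."""
    return 1.0 - error_rate


improvement_points = (simple_cnn_test_error - lenet_test_error) * 100
generalization_gap = lenet_test_error - lenet_last_batch_train_error

print("simple_cnn accuracy: %.2f%%" % (to_accuracy(simple_cnn_test_error) * 100))
print("LeNet-5 accuracy:    %.2f%%" % (to_accuracy(lenet_test_error) * 100))
print("improvement:         %.1f points" % improvement_points)
print("train/test gap:      %.2f" % generalization_gap)
```

A training error near zero next to a test error near 0.40 is the pattern one would expect from a model that has memorized its training data.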

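There is a CNN visualization website that can visualize network structures for MNIST and CIFAR-10 classification. This is the network structure of the CIFAR-10 BaseCNN: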

LeNet-5 in TensorFlow

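For a TensorFlow version of LeNet-5, you can follow the steps in models/tutorials/image/cifar10/ (https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10). That code, however, includes a good deal of data processing, weight decay, and regularization to prevent overfitting. According to the official notes, with batch_size=128 it takes about 4 hours to run 100,000 iterations on a Tesla K40, reaching about 86% accuracy. Without the data processing, the results would probably not be as good. Still, it is worth studying how the distorted_inputs function in cifar10_input.py enlarges the dataset through preprocessing, and how cifar10.py sets the decay for weights and biases. My run is currently at about 10,000 iterations, with a cost of 0.98 and an accuracy of 78.4%.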
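The idea behind that preprocessing, enlarging the dataset by randomly distorting each training image, can be illustrated without TensorFlow. The following NumPy sketch is not the tutorial's code: random_crop_and_flip is a hypothetical helper, the 24×24 crop size is an assumption chosen for the example, and only two distortions (random crop and horizontal flip) are shown.

```python
import numpy as np


def random_crop_and_flip(image, crop_size, rng):
    """Return a randomly cropped, randomly mirrored copy of an HWC image."""
    height, width, _ = image.shape
    top = rng.randint(0, height - crop_size + 1)    # upper bound is exclusive
    left = rng.randint(0, width - crop_size + 1)
    patch = image[top:top + crop_size, left:left + crop_size, :]
    if rng.rand() < 0.5:
        patch = patch[:, ::-1, :]                   # mirror left-right
    return patch.copy()


rng = np.random.RandomState(0)                      # fixed seed for repeatability
image = rng.randint(0, 256, size=(32, 32, 3)).astype(np.uint8)

# Each call yields a different view of the same source image.
augmented = [random_crop_and_flip(image, 24, rng) for _ in range(4)]
print([a.shape for a in augmented])
```

Because the crop offset and the flip are redrawn on every call, the network rarely sees exactly the same pixels twice, which is one reason such pipelines tend to reduce overfitting.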

I also plan to run CIFAR-10 without any data processing to see how it does and compare it with the Paddle results, but that will have to wait until the weekend.

Summary

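This article used the standard CIFAR-10 dataset for image classification with three implementations: a simple self-designed CNN, LeNet-5 in PaddlePaddle, and LeNet-5 in TensorFlow. The figures reported above are: simple_cnn, 1 h 37 min with a test error of 0.5248; LeNet-5 on PaddlePaddle, 1 h 25 min with a test error of 0.3958. The TensorFlow run had not finished at the time of writing.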

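LeNet-5 improves on the original simple_cnn in both accuracy and speed. Once the TensorFlow version finishes, its results can be added for comparison. Even with LeNet-5, though, the result is still not satisfactory. The loss values in the log suggest overfitting, and how to avoid overfitting when training neural networks is the focus of the next installment. The next article will also introduce AlexNet and run a classification experiment to compare its performance.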

References

1. LeNet-5 paper: "Gradient-based learning applied to document recognition"

2. CNN visualization: http://shixialiu.com/publications/cnnvis/demo/

Author: Hu Xiaoman (胡曉曼), columnist for the Python愛好者社群 community. Please do not repost. Thank you.

Blog column: CharlotteDataMining's blog column

