GBDT Classification Example

Introduction

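GBDT (gradient boosting decision trees) is a machine learning algorithm widely used in the internet finance industry. Its advantages include strong generalization ability and good classification performance. If you are not familiar with how GBDT works, you can think of it as an enhanced version of a decision tree, which also gives you a rough feel for its parameters.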
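To make the "enhanced decision tree" view concrete, the short sketch below compares the constructor parameters of scikit-learn's `DecisionTreeClassifier` and `GradientBoostingClassifier`. It is only an illustration; the exact parameter lists depend on the installed scikit-learn version.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier

tree_params = set(DecisionTreeClassifier().get_params())
gbdt_params = set(GradientBoostingClassifier().get_params())

# Parameters that shape each individual tree appear in both estimators.
shared = sorted(tree_params & gbdt_params)
# Parameters that exist only because GBDT builds an ensemble in stages.
boosting_only = sorted(gbdt_params - tree_params)

print("shared with a single tree:", shared)
print("boosting-specific:", boosting_only)
```

Parameters such as `max_depth` and `max_features` mean the same thing as in a single tree, while `n_estimators` and `learning_rate` control the boosting process itself.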

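How GBDT Works

(Feel free to skip this part. It makes no difference to using the algorithm, which is a black box anyway.)

Machine Learning Series, Part 1: The Principles of the GBDT Algorithm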
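For readers who want a feel for the idea without reading the full derivation, here is a deliberately simplified sketch of gradient boosting for binary classification with log loss. It is not the scikit-learn implementation: the function names (`fit_toy_gbdt`, `predict_toy_gbdt`) and the synthetic data are made up for illustration, and the leaf-value optimization used by real libraries is omitted. Each round fits a small regression tree to the negative gradient of the loss (label minus predicted probability) and adds a scaled copy of it to the running score.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_toy_gbdt(X, y, n_estimators=50, learning_rate=0.3, max_depth=2):
    """Fit a toy boosted model for labels y in {0, 1}."""
    p = np.clip(y.mean(), 1e-6, 1 - 1e-6)
    f0 = np.log(p / (1 - p))              # initial score: log-odds of class 1
    scores = np.full(len(y), f0)
    trees = []
    for _ in range(n_estimators):
        residual = y - sigmoid(scores)    # negative gradient of the log loss
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residual)
        scores = scores + learning_rate * tree.predict(X)
        trees.append(tree)
    return {"f0": f0, "learning_rate": learning_rate, "trees": trees}


def predict_toy_gbdt(model, X):
    """Sum the initial score and all scaled tree outputs, then threshold at 0.5."""
    scores = np.full(X.shape[0], model["f0"])
    for tree in model["trees"]:
        scores = scores + model["learning_rate"] * tree.predict(X)
    return (sigmoid(scores) >= 0.5).astype(int)


# Synthetic two-feature data (illustrative only).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 2))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

toy_model = fit_toy_gbdt(X_toy, y_toy)
train_accuracy = (predict_toy_gbdt(toy_model, X_toy) == y_toy).mean()
print(train_accuracy)
```

The three arguments of `fit_toy_gbdt` correspond to the `n_estimators`, `learning_rate` and `max_depth` parameters of `GradientBoostingClassifier`.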

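Data Source

Ionosphere Dataset

The Ionosphere Dataset poses the task of predicting atmospheric structure from radar returns that target free electrons in the ionosphere.

It is a binary classification problem. The classes are imbalanced: there are 351 observations in total, with 34 input variables and 1 output variable. The variables are as follows: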

1. 17 pairs of radar return data.

2. ......

3. Class (g means good, b means bad).

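The baseline of always predicting the most prevalent class gives a classification accuracy of about 64%; the best reported results reach about 94%.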
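The 64% baseline is simply the share of the majority class. The sketch below reproduces the arithmetic under the assumption that the 351 observations split into 225 "g" and 126 "b" (the counts usually quoted for this dataset; the article itself does not state them).

```python
from collections import Counter

# Assumed class counts, not stated in the article: 225 "g" and 126 "b".
labels = ["g"] * 225 + ["b"] * 126

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]

# Accuracy obtained by always predicting the majority class.
baseline_accuracy = majority_count / len(labels)
print(majority_class, round(baseline_accuracy, 4))
```

Any model worth keeping should beat this number clearly, so it is a useful sanity check before looking at cross-validation scores.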

Download link: http://t.cn/Rf8GFY4

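Data Import

For files in .data or .txt format, importing without any arguments is risky: since the dataset has no header row, pandas will most likely treat the values of the first data row as column names. The header and column names therefore need to be set explicitly. The import code is as follows: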

import pandas as pd

## Generate the column names in bulk first
names = []
for i in range(34):
    names.append('Var' + str(i + 1))
names.append('gbflag')

df = pd.read_csv('C:/Users/Hanyee/Desktop/ionosphere.txt', header=None, names=names)

## View the first few rows
df.head(7)

     Var31     Var32     Var33     Var34 gbflag
0  0.42267  -0.54487   0.18641  -0.45300      g
1 -0.16626  -0.06288  -0.13738  -0.02447      b
2  0.60436  -0.24180   0.56045  -0.38238      g
3  0.25682   1.00000  -0.32382   1.00000      b
4 -0.05707  -0.59573  -0.04608  -0.65697      g
5  0.00000   0.00000  -0.00039   0.12011      b
6 -0.04262  -0.81318  -0.13832  -0.80975      g
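The header pitfall described in this section is easy to reproduce on a tiny in-memory file. The two rows below are made up for illustration and only mimic the shape of a headerless data file.

```python
import io

import pandas as pd

# Two made-up rows without a header line (values are illustrative).
raw = "1,0.5,g\n0,-0.3,b\n"

# Default behaviour: the first data row is consumed as the header.
wrong = pd.read_csv(io.StringIO(raw))

# header=None keeps every row as data; names= supplies the column labels.
right = pd.read_csv(io.StringIO(raw), header=None, names=["Var1", "Var2", "gbflag"])

print(wrong.shape, list(wrong.columns))
print(right.shape, list(right.columns))
```

With the defaults, one observation silently disappears into the column names, which is why `header=None` and `names=` are both passed in the import code.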

## View descriptive statistics; no missing values or outliers are found
df.describe()

             Var1   Var2        Var3        Var4        Var5        Var6  \
count  351.000000  351.0  351.000000  351.000000  351.000000  351.000000
mean     0.891738    0.0    0.641342    0.044372    0.601068    0.115889
std      0.311155    0.0    0.497708    0.441435    0.519862    0.460810
min      0.000000    0.0   -1.000000   -1.000000   -1.000000   -1.000000
25%      1.000000    0.0    0.472135   -0.064735    0.412660   -0.024795
50%      1.000000    0.0    0.871110    0.016310    0.809200    0.022800
75%      1.000000    0.0    1.000000    0.194185    1.000000    0.334655
max      1.000000    0.0    1.000000    1.000000    1.000000    1.000000
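Reading `describe()` output works, but missing values and constant columns (such as Var2 above, whose standard deviation is 0) can also be checked directly. The sketch below uses a small made-up frame named `demo` rather than the real data, and deliberately includes one missing value to show what the check reports.

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for df; Var2 is constant and Var3 has one gap.
demo = pd.DataFrame({
    "Var1": [1.0, 0.0, 1.0, 1.0],
    "Var2": [0.0, 0.0, 0.0, 0.0],
    "Var3": [0.99, -0.18, np.nan, 0.85],
})

# Number of missing values in each column.
missing_per_column = demo.isna().sum()

# Columns with at most one distinct value carry no information for the model.
constant_columns = [c for c in demo.columns if demo[c].nunique(dropna=True) <= 1]

print(missing_per_column)
print("constant columns:", constant_columns)
```

Tree-based models simply never split on a constant column, so leaving it in is harmless, but dropping it keeps the feature list honest.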

Cross-Validation

Because the dataset is small, cross-validation is used.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

# Features and labels taken from the DataFrame loaded above
X = df.drop(columns='gbflag')
y = df['gbflag']

kf = KFold(n_splits=10)
scores = []
for train, test in kf.split(X):
    train_X, test_X, train_y, test_y = X.iloc[train], X.iloc[test], y.iloc[train], y.iloc[test]
    gbdt = GradientBoostingClassifier(max_depth=4, max_features=9, n_estimators=100)
    gbdt.fit(train_X, train_y)
    predicted = gbdt.predict(test_X)
    print(accuracy_score(test_y, predicted))
    scores.append(accuracy_score(test_y, predicted))

## Mean score across the cross-validation folds
np.mean(scores)
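The same evaluation can be written more compactly with `cross_val_score`. The sketch below is a variant, not the article's exact procedure: it swaps the plain `KFold` for a shuffled `StratifiedKFold`, which keeps the class ratio roughly equal in every fold and is a common choice for imbalanced labels. The data is a synthetic stand-in generated with `make_classification`, so the printed scores say nothing about the ionosphere data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in with the same shape as the ionosphere data (assumption).
X_demo, y_demo = make_classification(
    n_samples=351, n_features=34, n_informative=10,
    weights=[0.36, 0.64], random_state=0,
)

# Shuffled, stratified folds: each fold keeps roughly the overall class ratio.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
model = GradientBoostingClassifier(
    max_depth=4, max_features=9, n_estimators=100, random_state=0
)

# One accuracy value per fold.
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
print(scores.mean(), scores.std())
```

Reporting the standard deviation next to the mean gives a sense of how much the score moves from fold to fold, which matters with only about 35 test rows per fold.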

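GBDT Parameter Tuning

GBDT has many parameters. They can be tuned by hand according to the size of the dataset and the number of features to improve the model. Alternatively, you can search over a parameter grid and pick the best combination according to the evaluation score.

The code is as follows: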

# Automatic parameter tuning
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

gbdt = GradientBoostingClassifier()
cross_validation = StratifiedKFold(n_splits=10)
parameter_grid = {'max_depth': [2, 3, 4, 5],
                  'max_features': [1, 3, 5, 7, 9],
                  'n_estimators': [10, 30, 50, 70, 90, 100]}
grid_search = GridSearchCV(gbdt, param_grid=parameter_grid, cv=cross_validation,
                           scoring='accuracy')
grid_search.fit(X, y)
grid_search.best_score_
grid_search.best_params_
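Before launching a grid search it helps to know how much work it implies. The sketch below counts the candidate combinations in the grid used here with scikit-learn's `ParameterGrid` and multiplies by the number of folds.

```python
from sklearn.model_selection import ParameterGrid

parameter_grid = {
    "max_depth": [2, 3, 4, 5],
    "max_features": [1, 3, 5, 7, 9],
    "n_estimators": [10, 30, 50, 70, 90, 100],
}
n_folds = 10

# Every combination of the listed values is one candidate model.
n_candidates = len(ParameterGrid(parameter_grid))
# Each candidate is trained once per fold (a final refit on all data comes on top).
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)
```

The cost grows multiplicatively with every parameter added to the grid, which is manageable on 351 rows but quickly becomes expensive on larger datasets.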

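Summary

1. The GBDT algorithm itself is not complicated; the complicated part is the feature engineering done before modeling.

2. Although parameters can be tuned automatically, manually narrowing the parameter ranges first and then running the grid search finds the best parameters more efficiently.

3. Measured by accuracy, the best parameters are {'max_depth': 2, 'max_features': 9, 'n_estimators': 100}, with accuracy improving from 92.02% to 93.45%.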
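One possible way to put point 2 into practice is a two-stage search: a coarse grid with a few widely spaced values, followed by a denser grid around the coarse winner. The sketch below is only an illustration under several assumptions: the data is synthetic, the helper `search` is hypothetical, and the grid values are arbitrary rather than the ones used in this article.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in data (assumption); replace with the real X and y.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=8, random_state=0
)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)


def search(grid):
    """Run a grid search over `grid` with a fixed model seed and fixed folds."""
    gs = GridSearchCV(
        GradientBoostingClassifier(random_state=0),
        param_grid=grid, cv=cv, scoring="accuracy",
    )
    return gs.fit(X_demo, y_demo)


# Stage 1: a few widely spaced values per parameter.
coarse = search({"max_depth": [2, 5], "n_estimators": [20, 100]})
best_depth = coarse.best_params_["max_depth"]
best_trees = coarse.best_params_["n_estimators"]

# Stage 2: a denser grid centred on the stage-1 winner only.
fine = search({
    "max_depth": sorted({max(1, best_depth - 1), best_depth, best_depth + 1}),
    "n_estimators": sorted({max(10, best_trees - 20), best_trees, best_trees + 20}),
})

print(coarse.best_params_, coarse.best_score_)
print(fine.best_params_, fine.best_score_)
```

The two stages together evaluate 13 combinations, whereas a single grid covering the same ranges at the finer spacing would need many more.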

GBDT Code

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Data import
## Generate the column names in bulk first
names = []
for i in range(34):
    names.append('Var' + str(i + 1))
names.append('gbflag')

df = pd.read_csv('C:/Users/Hanyee/Desktop/ionosphere.rawdata.txt', header=None, names=names)

## View the first five rows
df.head()

## Check the data for missing values and outliers
df.describe()

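# Every column except the label is a feature
# (compare with !=, since `x not in 'gbflag'` would be a substring test)
x_columns = [x for x in df.columns if x != 'gbflag']
X = df[x_columns]
y = df['gbflag']

# Split the dataset into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)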

# GBDT
## Important parameters: max_depth=4, max_features=10, n_estimators=80
from sklearn.ensemble import GradientBoostingClassifier

gbdt = GradientBoostingClassifier()
gbdt.fit(X_train, y_train)
pred = gbdt.predict(X_test)

# Confusion table: actual labels in rows, predicted labels in columns
pd.crosstab(y_test, pred)

# Evaluation metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

# Compare the predictions with the test labels (pred was computed on X_test)
print(classification_report(y_test, pred, digits=4))

# Cross-validation (useful when the dataset is small)
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

kf = KFold(n_splits=10)
scores = []
for train, test in kf.split(X):
    train_X, test_X, train_y, test_y = X.iloc[train], X.iloc[test], y.iloc[train], y.iloc[test]
    gbdt = GradientBoostingClassifier(max_depth=4, max_features=9, n_estimators=100)
    gbdt.fit(train_X, train_y)
    predicted = gbdt.predict(test_X)
    print(accuracy_score(test_y, predicted))
    scores.append(accuracy_score(test_y, predicted))

## Mean score across the cross-validation folds
np.mean(scores)

# Automatic parameter tuning
from sklearn.model_selection import GridSearchCV, StratifiedKFold

gbdt = GradientBoostingClassifier()
cross_validation = StratifiedKFold(n_splits=10)
parameter_grid = {'max_depth': [2, 3, 4, 5],
                  'max_features': [1, 3, 5, 7, 9],
                  'n_estimators': [10, 30, 50, 70, 90, 100]}
grid_search = GridSearchCV(gbdt, param_grid=parameter_grid, cv=cross_validation,
                           scoring='accuracy')
grid_search.fit(X, y)

# Print the best score
grid_search.best_score_

# Print the best parameters
grid_search.best_params_
