R Machine Learning: Using the caret Package and Explaining Black-Box Models (Predicting a Continuous Variable)

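Author: Huang Tianyuan, PhD candidate at Fudan University. He is passionate about data science and open-source tools (R) and works on using data science to build up industry experience and academic knowledge quickly. Zhihu column:

R語言資料探勘 (R Language Data Mining)

Email: [email protected]. Collaboration and discussion are welcome.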

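caret is one of the general-purpose machine learning packages in R. It lets you use many different models within a single framework, and it wraps every stage, from preprocessing and model fitting to prediction and evaluation, in convenient functions. DALEX, a package I learned recently, is a powerful tool for making black-box models interpretable. In fact, it is not limited to black-box models: for any model it can report useful information such as performance and variable importance. Following the official documentation, this post tries to work out a general routine for using DALEX to explain models built with caret.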

1 Loading packages and importing data

Load three packages (p_load also installs any that are missing).

library(pacman)

p_load(DALEX, caret, tidyverse)

Take a look at the data we are going to use:

apartments %>% as_tibble

# A tibble: 1,000 x 6
   m2.price construction.year surface floor no.rooms district
 1     5897              1953      25     3        1 Srodmiescie
 2     1818              1992     143     9        5 Bielany
 3     3643              1937      56     1        2 Praga
 4     3517              1995      93     7        3 Ochota
 5     3013              1992     144     6        5 Mokotow
 6     5795              1926      61     6        2 Srodmiescie
 7     2983              1970     127     8        5 Mokotow
 8     2346              1985     105     8        4 Ursus
 9     4745              1928     145     6        6 Srodmiescie
10     4284              1949     112     9        4 Srodmiescie
# ... with 990 more rows

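2 Building models quickly with caret

Here m2.price is the response variable and all remaining variables are predictors. The models tried are a random forest, a GBM, and a neural network. The random forest uses 100 trees and the GBM uses its default settings. For the neural network, the predictors are centered and scaled during preprocessing, the maximum number of iterations is 500, and the output unit is linear. The tuning grid is also specified, but it contains only a single combination (size = 2, meaning two units in the hidden layer, and a weight decay of 0), so no grid search is actually performed. The code is as follows: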

# The code below may take a while to run

set.seed(123)

regr_rf <- train(m2.price ~ ., data = apartments, method = "rf", ntree = 100)

regr_gbm <- train(m2.price ~ ., data = apartments, method = "gbm")

regr_nn <- train(m2.price ~ ., data = apartments,
                 method = "nnet",
                 linout = TRUE,
                 preProcess = c("center", "scale"),
                 maxit = 500,
                 tuneGrid = expand.grid(size = 2, decay = 0),
                 trControl = trainControl(method = "none", seeds = 1))
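To make the preprocessing step concrete, here is a minimal base R sketch of what centering and scaling do to the predictors. It is not how caret implements preProcess internally; the toy matrix simply reuses the surface and construction.year values from the first four rows printed above.

```r
# Toy predictors on very different scales (first four rows of the printed data).
toy <- cbind(surface = c(25, 143, 56, 93),
             construction.year = c(1953, 1992, 1937, 1995))

# Centering subtracts each column's mean; scaling divides by its standard deviation.
centers <- colMeans(toy)
spreads <- apply(toy, 2, sd)
toy_std <- sweep(sweep(toy, 2, centers, "-"), 2, spreads, "/")

# Base R's scale() performs the same two steps in one call.
toy_std2 <- scale(toy, center = TRUE, scale = TRUE)

print(round(colMeans(toy_std), 10))  # column means after centering
print(apply(toy_std, 2, sd))         # column standard deviations after scaling
```

After this transformation both columns are on a comparable scale, which generally helps the optimizer behind a neural network converge.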

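3 Explaining the models

Here we use the explain function from DALEX directly to prepare the three models for interpretation. Note that this step needs four pieces of information: 1. the model; 2. a label (if omitted, it is extracted from the model automatically); 3. a validation data set; 4. which column of the validation data set is the response variable. The code is as follows: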

data(apartmentsTest)

explainer_regr_rf <- DALEX::explain(regr_rf, label = "rf",
                                    data = apartmentsTest, y = apartmentsTest$m2.price)

explainer_regr_gbm <- DALEX::explain(regr_gbm, label = "gbm",
                                     data = apartmentsTest, y = apartmentsTest$m2.price)

explainer_regr_nn <- DALEX::explain(regr_nn, label = "nn",
                                    data = apartmentsTest, y = apartmentsTest$m2.price)

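Fitting the models can take a long time, but creating the explainers is very fast, because an explainer only wraps the black box's mapping from inputs to predictions.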
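The idea of an explainer can be sketched in a few lines of base R. The helper make_explainer below is hypothetical and is not the DALEX implementation; it uses a linear model on the built-in mtcars data as a stand-in, and only illustrates why the four pieces of information are enough for the later analyses.

```r
# Stand-in model on a built-in data set (mtcars, not the apartments data).
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Hypothetical helper: bundle the model, a label, the validation data,
# the observed response, and a uniform way of getting predictions.
make_explainer <- function(model, data, y, label) {
  list(model = model, data = data, y = y, label = label,
       predict_function = function(m, newdata) as.numeric(predict(m, newdata)))
}

expl <- make_explainer(fit, data = mtcars, y = mtcars$mpg, label = "lm")

# Later analyses only need this bundle: predictions and residuals follow directly.
pred <- expl$predict_function(expl$model, expl$data)
res <- expl$y - pred
print(summary(res))
```

Because every model is reduced to the same "data in, predictions out" interface, the same downstream code can treat a random forest, a GBM, or a neural network alike.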

4 Model performance

Next we assess how well each model performs:

mp_regr_rf <- model_performance(explainer_regr_rf)

mp_regr_gbm <- model_performance(explainer_regr_gbm)

mp_regr_nn <- model_performance(explainer_regr_nn)

Let's see what the result looks like:

mp_regr_rf

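This shows the distribution of the residuals over the samples. Let's visualize that distribution (a cumulative residual distribution plot):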

plot(mp_regr_rf, mp_regr_nn, mp_regr_gbm)

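The right way to read this plot is that a small number of samples (outliers) account for a large share of the residuals (the deviations from the true values). If a model's curve lies higher, a large share of its samples have large residuals. The plot shows that for the GBM model most samples have fairly small residuals, whereas for the neural network many samples have larger residuals than under the tree-based models. Let's try another visualization: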

plot(mp_regr_rf, mp_regr_nn, mp_regr_gbm, geom = "boxplot")

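The difference is obvious at a glance. The red dot summarizes the average size of the residuals (the DALEX documentation describes it as the root mean square of the residuals), and the box plot shows the quantiles.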
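For readers who want to see what the two plots summarize, here is a rough base R sketch. It assumes the curve is a reverse cumulative distribution of absolute residuals, and it again uses a linear model on mtcars as a stand-in rather than the models above.

```r
# Stand-in model and its absolute residuals.
fit <- lm(mpg ~ wt + hp, data = mtcars)
abs_res <- abs(mtcars$mpg - predict(fit, mtcars))

# For each threshold t, the share of observations whose |residual| exceeds t.
thresholds <- sort(abs_res)
share_above <- sapply(thresholds, function(t) mean(abs_res > t))
print(head(data.frame(threshold = thresholds, share_above = share_above)))

# The box plot view boils down to quantiles plus one overall summary.
rmse <- sqrt(mean(abs_res^2))
print(quantile(abs_res))
print(rmse)
```

A model whose share_above values drop quickly has mostly small residuals, which is the pattern described for the GBM curve.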

5 Variable importance

To see how important each variable is, relative to the others, for each model's predictions, use the following.

vi_regr_rf <- variable_importance(explainer_regr_rf, loss_function = loss_root_mean_square)

vi_regr_gbm <- variable_importance(explainer_regr_gbm, loss_function = loss_root_mean_square)

vi_regr_nn <- variable_importance(explainer_regr_nn, loss_function = loss_root_mean_square)

plot(vi_regr_rf, vi_regr_gbm, vi_regr_nn)

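The loss function used here is RMSE. The result can be read as: if the model lost the information in this variable, how much worse would its predictions of the response become? In practice DALEX answers this by shuffling the variable's values and measuring how much the loss increases.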
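The permutation idea can be sketched in base R as follows. This is a simplified illustration, not DALEX's code: the helper permutation_importance, the number of repetitions, and the stand-in linear model on mtcars are all assumptions made for the example.

```r
set.seed(123)

# Stand-in model and loss function.
fit <- lm(mpg ~ wt + hp, data = mtcars)
rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))
baseline <- rmse(mtcars$mpg, predict(fit, mtcars))

# Hypothetical helper: average increase in RMSE after shuffling one variable.
permutation_importance <- function(variable, n_rep = 50) {
  losses <- replicate(n_rep, {
    shuffled <- mtcars
    shuffled[[variable]] <- sample(shuffled[[variable]])  # break the link to the response
    rmse(mtcars$mpg, predict(fit, shuffled))
  })
  mean(losses) - baseline
}

importance <- sapply(c("wt", "hp"), permutation_importance)
print(sort(importance, decreasing = TRUE))
```

The larger the increase in loss, the more the model relies on that variable. Averaging over several shuffles reduces the noise from any single permutation.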

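6 Variable response

6.1 Continuous variables

Partial Dependence Plots (PDP) are a way to describe the relationship between a single continuous predictor and the response. There is a dedicated package and paper describing how the method works; for details, see the official documentation of the pdp package. For example, to study how the year of construction (construction.year) affects the response, the price, we do this: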

pdp_regr_rf <- variable_response(explainer_regr_rf, variable = "construction.year", type = "pdp")

pdp_regr_gbm <- variable_response(explainer_regr_gbm, variable = "construction.year", type = "pdp")

pdp_regr_nn <- variable_response(explainer_regr_nn, variable = "construction.year", type = "pdp")

plot(pdp_regr_rf, pdp_regr_gbm, pdp_regr_nn)
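As a rough illustration of what a partial dependence curve computes, the base R sketch below sets the variable to each grid value for every row and averages the predictions. The helper partial_dependence and the stand-in model on mtcars are assumptions for the example, not the code that DALEX or pdp runs.

```r
# Stand-in model with a non-linear term, so the curve is not a straight line.
fit <- lm(mpg ~ wt + I(wt^2) + hp, data = mtcars)

# Hypothetical helper: average prediction when `variable` is forced to each grid value.
partial_dependence <- function(model, data, variable, grid) {
  sapply(grid, function(value) {
    modified <- data
    modified[[variable]] <- value     # every row gets the same value
    mean(predict(model, modified))    # average over the other variables as observed
  })
}

grid <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 20)
pd <- partial_dependence(fit, mtcars, "wt", grid)
print(data.frame(wt = grid, average_prediction = pd))
```

Plotting average_prediction against the grid gives the kind of curve drawn for each model by the plot call above.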

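The random forest and GBM models show that the year of construction has a non-linear relationship with price: very old houses and newly built houses are both expensive, while houses built between the 1940s and the 1990s are cheaper. The neural network, however, does not capture this pattern well. There is also another method called Accumulated Local Effects (ALE), which was designed to deal with correlated variables and is essentially an extension of the PDP method. It is used as follows: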

ale_regr_rf <- variable_response(explainer_regr_rf, variable = "construction.year", type = "ale")

ale_regr_gbm <- variable_response(explainer_regr_gbm, variable = "construction.year", type = "ale")

ale_regr_nn <- variable_response(explainer_regr_nn, variable = "construction.year", type = "ale")

plot(ale_regr_rf, ale_regr_gbm, ale_regr_nn)
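The following base R sketch shows the core of the ALE idea in a simplified form: split the variable into intervals, measure how predictions change across each interval using only the rows that fall inside it, then accumulate those local changes. The helper ale_curve, the number of bins, and the plain centering at the end are assumptions of this sketch; real implementations handle binning and centering more carefully.

```r
# Stand-in model with an interaction, fitted on mtcars.
fit <- lm(mpg ~ wt * hp, data = mtcars)

# Hypothetical helper: simplified first-order ALE curve for one numeric variable.
ale_curve <- function(model, data, variable, n_bins = 5) {
  x <- data[[variable]]
  # Quantile-based edges, so each interval holds a similar number of rows.
  edges <- unique(quantile(x, probs = seq(0, 1, length.out = n_bins + 1)))
  bin <- cut(x, breaks = edges, include.lowest = TRUE, labels = FALSE)
  local_effect <- sapply(seq_len(length(edges) - 1), function(k) {
    rows <- data[bin == k, , drop = FALSE]
    lower <- rows
    lower[[variable]] <- edges[k]
    upper <- rows
    upper[[variable]] <- edges[k + 1]
    # Average change in prediction across the interval, for rows inside it only.
    mean(predict(model, upper) - predict(model, lower))
  })
  accumulated <- c(0, cumsum(local_effect))
  # Simplified centering: subtract the plain mean of the accumulated values.
  data.frame(value = edges, ale = accumulated - mean(accumulated))
}

print(ale_curve(fit, mtcars, "wt"))
```

Because each local effect uses only rows that actually lie in the interval, the method avoids predicting on unrealistic combinations of correlated variables, which is the weakness of PDP it was designed to address.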

6.2 Categorical variables

For categorical variables, DALEX currently relies on the mergeFactors function from the factorMerger package.

mpp_regr_rf <- variable_response(explainer_regr_rf, variable = "district", type = "factor")

mpp_regr_gbm <- variable_response(explainer_regr_gbm, variable = "district", type = "factor")

mpp_regr_nn <- variable_response(explainer_regr_nn, variable = "district", type = "factor")

plot(mpp_regr_rf, mpp_regr_gbm, mpp_regr_nn)

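In essence, this method clusters the levels of a single factor variable according to the distribution of the response. In the plot above we can see that prices differ between districts, and that the districts fall clearly into 6 groups.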
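To give a feel for grouping factor levels by the response, here is a deliberately simplified base R stand-in. It is not the factorMerger algorithm: it just clusters the mean response of each level hierarchically, using the built-in chickwts data (chick weight by feed type) in place of the apartments data, and the choice of three groups is arbitrary.

```r
# Mean response for each factor level (stand-in data: chick weight by feed type).
level_means <- sapply(split(chickwts$weight, chickwts$feed), mean)

# Hierarchical clustering on the level means; levels with similar means merge first.
tree <- hclust(dist(level_means), method = "average")

# Cut the tree into an arbitrary number of groups (3 here, chosen for illustration).
groups <- cutree(tree, k = 3)

print(sort(level_means))
print(groups)
```

Levels that end up in the same group have similar average responses, which mirrors how districts with similar predicted prices are merged in the plot.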

This post mainly follows the example on the official site:

If you are interested, please visit the official site.
