A Supplement on Evaluation Metrics
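1 Introduction
2 Preparing the Data and the Model
    2.1 Reading the Data
    2.2 Splitting into Training and Test Sets
    2.3 Model Prediction
3 Plotting the ROC Curve
    3.1 What Is the ROC Curve, and How Is It Drawn?
    3.2 Plotting the ROC Curve in Python
    3.3 Wrapping the ROC Plot in a Function
4 Two Ways to Compute AUC
    4.1 What Is AUC?
    4.2 Using the Ready-Made API
    4.3 Using the Definition
5 PR Curve
    5.1 What Is the PR Curve?
    5.2 PR Curve vs. ROC Curve
    5.3 Plotting the PR Curve in Python
6 K-S Curve and the KS Value
    6.1 Computing the KS Value
    6.2 K-S Curve
7 References
8 Data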
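1 Introduction
An earlier post, 机器学习 | 评价指标 (Machine Learning | Evaluation Metrics), already covered evaluation metrics for classification models. This post focuses on the ROC curve, the PR curve, and the KS value.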
2 Preparing the Data and the Model
2.1 Reading the Data
import pandas as pd

# Load the data and replace missing values with 0
df = pd.read_csv('data/accepts.csv')
df = df.fillna(0)
print(df.shape)
df.head()
(5845, 25)
   application_id  account_number  bad_ind  vehicle_year  vehicle_make  bankruptcy_ind  tot_derog  tot_tr  age_oldest_tr  tot_open_tr  ...
0         2314049           11613        1        1998.0          FORD               N        7.0     9.0           64.0          2.0  ...
1           63539           13449        0        2000.0        DAEWOO               N        0.0    21.0          240.0         11.0  ...
2         7328510           14323        1        1998.0      PLYMOUTH               N        7.0    10.0           60.0          0.0  ...
3         8725187           15359        1        1997.0          FORD               N        3.0    10.0           35.0          5.0  ...
4         4275127           15812        0        2000.0        TOYOTA               N        0.0    10.0          104.0          2.0  ...

   purch_price     msrp  down_pyt  loan_term  loan_amt    ltv  tot_income  veh_mileage  used_ind  weight
0     17200.00  17350.0      0.00         36  17200.00   99.0     6550.00      24000.0         1    1.00
1     19588.54  19788.0    683.54         60  19588.54   99.0     4666.67         22.0         0    4.75
2     13595.00  11450.0      0.00         60  10500.00   92.0     2000.00      19600.0         1    1.00
3     12999.00  12100.0   3099.00         60  10800.00  118.0     1500.00      10000.0         1    1.00
4     26328.04  22024.0      0.00         60  26328.04  122.0     4144.00         14.0         0    4.75

5 rows × 25 columns
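# Features: drop the ID column, the label, and the two string-valued columns
X = df.drop(['application_id', 'bad_ind', 'vehicle_make', 'bankruptcy_ind'], axis=1)
# Label: the bad_ind column
y = df['bad_ind'].values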
2.2 Splitting into Training and Test Sets
from sklearn.model_selection import train_test_split

# Hold out 30% of the rows for testing; the fixed seed makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=23)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
(4091, 21) (1754, 21) (4091,) (1754,)
2.3 Model Prediction
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

lr = LogisticRegression()
lr.fit(X_train, y_train)
pre = lr.predict(X_test)  # hard 0/1 class predictions
print(classification_report(y_test, pre))
              precision    recall  f1-score   support

           0       0.82      1.00      0.90      1430
           1       0.40      0.01      0.02       324

   micro avg       0.81      0.81      0.81      1754
   macro avg       0.61      0.50      0.46      1754
weighted avg       0.74      0.81      0.74      1754
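from sklearn import metrics  # needed for metrics.confusion_matrix below

# Rows are the true classes (0, 1); columns are the predicted classes (0, 1)
print(metrics.confusion_matrix(y_test, pre))
[[1424    6]
 [ 320    4]]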
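As a quick sanity check, the per-class figures in the classification report can be recomputed by hand from the confusion matrix. The sketch below hard-codes the four counts printed above and relies on scikit-learn's layout (rows are true labels, columns are predictions). The variable names are illustrative and not part of the original code.

```python
# Counts copied from the confusion matrix printed above.
tn, fp = 1424, 6    # actual class 0: predicted 0, predicted 1
fn, tp = 320, 4     # actual class 1: predicted 0, predicted 1

recall_1 = tp / (tp + fn)          # share of actual 1s that were predicted as 1
precision_1 = tp / (tp + fp)       # share of predicted 1s that really are 1
recall_0 = tn / (tn + fp)          # share of actual 0s that were predicted as 0
accuracy = (tp + tn) / (tp + tn + fp + fn)

print("recall(1)    = %.2f" % recall_1)
print("precision(1) = %.2f" % precision_1)
print("recall(0)    = %.2f" % recall_0)
print("accuracy     = %.2f" % accuracy)
```

The rounded values agree with the report: the model almost never predicts class 1, so its recall on class 1 is close to zero even though overall accuracy looks acceptable.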
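3 Plotting the ROC Curve
3.1 What Is the ROC Curve, and How Is It Drawn?
The ROC curve is drawn as follows:
First, sort the samples by predicted probability, from high to low.
Then sweep over different thresholds. Each threshold yields an FPR (x-axis: the share of actual 0s predicted as 1, which equals 1 minus the recall of class 0) and a TPR (y-axis: the share of actual 1s predicted as 1, that is, the recall of class 1).
Finally, plot the (FPR, TPR) point for every threshold and connect the points. The resulting curve is the ROC curve.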
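The procedure above can be traced by hand on a tiny example. The following is a minimal sketch in plain Python, not the scikit-learn implementation: the helper name `roc_points` and the four-sample toy data are made up for illustration, and every distinct score is used as a threshold.

```python
def roc_points(y_true, scores):
    """Return a list of (fpr, tpr) points, starting from (0, 0)."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    # Use each distinct score as a threshold, from highest to lowest.
    for thr in sorted(set(scores), reverse=True):
        # A sample is predicted as 1 when its score is at least the threshold.
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

# Toy data (hypothetical): two negatives and two positives.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
for fpr, tpr in roc_points(y_true, scores):
    print("FPR = %.1f, TPR = %.1f" % (fpr, tpr))
```

Lowering the threshold can only move the point up or to the right, which is why the curve runs from (0, 0) to (1, 1).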
3.2 Plotting the ROC Curve in Python
import matplotlib.pyplot as plt
from sklearn import metrics

# Predicted probability of class 1 for each test sample
scores = lr.predict_proba(X_test)[:, 1]

# fpr: false positive rate, i.e. 1 minus the recall of class 0 (x-axis; smaller is better)
# tpr: true positive rate, i.e. the recall of class 1 (y-axis; larger is better)
# thresholds: the threshold behind each (fpr, tpr) pair
fpr, tpr, thresholds = metrics.roc_curve(y_test, scores, pos_label=1)
auc = metrics.auc(fpr, tpr)

lw = 2
# Create the figure once; an extra bare plt.figure() call would leave an empty figure behind
plt.figure(figsize=(10, 10))
plt.plot(fpr, tpr, color='darkorange', lw=lw, label='ROC curve (area = %0.2f)' % auc)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')  # diagonal: random guessing
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.show()
3.3 Wrapping the ROC Plot in a Function
def Plot_ROC(model, X_test, y_test):
    '''
    Plot the ROC curve of a model on the test set.
    model: a fitted classifier that provides predict_proba
    X_test: test-set features
    y_test: true labels of the test set
    '''
    import matplotlib.pyplot as plt
    from sklearn import metrics

    # Predicted probability of class 1 for each test sample
    scores = model.predict_proba(X_test)[:, 1]

    # fpr: false positive rate, i.e. 1 minus the recall of class 0 (x-axis; smaller is better)
    # tpr: true positive rate, i.e. the recall of class 1 (y-axis; larger is better)
    # thresholds: the threshold behind each (fpr, tpr) pair
    fpr, tpr, thresholds = metrics.roc_curve(y_test, scores, pos_label=1)
    auc = metrics.auc(fpr, tpr)

    lw = 2
    # Create the figure once; an extra bare plt.figure() call would leave an empty figure behind
    plt.figure(figsize=(10, 10))
    plt.plot(fpr, tpr, color='darkorange', lw=lw, label='ROC curve (area = %0.2f)' % auc)
    plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')  # diagonal: random guessing
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve')
    plt.legend(loc="lower right")
    plt.show()

Plot_ROC(lr, X_test, y_test)
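4 Two Ways to Compute AUC
4.1 What Is AUC?
AUC is the area under the ROC curve. It generally falls between 0.5 and 1: 0.5 corresponds to random guessing, and a larger value indicates better classification performance.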
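AUC also has a ranking interpretation: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way in plain Python. The function name `auc_by_ranking` and the toy data are illustrative assumptions, and the double loop is meant for small inputs only.

```python
def auc_by_ranking(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # a tie counts as half a win
    return wins / (len(pos) * len(neg))

# Toy data (hypothetical): two negatives and two positives.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc_by_ranking(y_true, scores))
```

Three of the four positive-negative pairs are ordered correctly here, so the result is 0.75. A model that ranks no better than chance gets about half of the pairs right, which is where the 0.5 baseline comes from.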
4.2 Using the Ready-Made API
from sklearn import metrics

scores = lr.predict_proba(X_test)[:, 1]  # predicted probability of class 1
metrics.roc_auc_score(y_test, scores)
0.6989812656479323
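One pitfall to watch for: the first argument of roc_auc_score is the true labels, and the second is the predicted probability of class 1. Do not swap them. Official documentation for roc_auc_score: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html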
4.3 Using the Definition
from sklearn import metrics

scores = lr.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = metrics.roc_curve(y_test, scores, pos_label=1)
metrics.auc(fpr, tpr)  # trapezoidal area under the (fpr, tpr) points
0.6989812656479323
Summary: the first approach is recommended because it is simpler.
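5 PR Curve
5.1 What Is the PR Curve?
The PR curve has Recall (R) on the x-axis and Precision (P) on the y-axis.
It is drawn in much the same way as the ROC curve: sort all samples by predicted probability from high to low, sweep over different thresholds to obtain a (Recall, Precision) pair for each, and connect the points.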
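The same threshold sweep can be written out for the PR curve. This is a minimal plain-Python sketch rather than the scikit-learn implementation: `pr_points` and the toy data are hypothetical, and each distinct score serves as a threshold.

```python
def pr_points(y_true, scores):
    """Return a list of (recall, precision) points, one per distinct threshold."""
    n_pos = sum(y_true)
    points = []
    for thr in sorted(set(scores), reverse=True):
        # A sample is predicted as 1 when its score is at least the threshold.
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        recall = tp / n_pos
        precision = tp / (tp + fp)   # tp + fp >= 1 because thr is one of the scores
        points.append((recall, precision))
    return points

# Toy data (hypothetical): two negatives and two positives.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
for recall, precision in pr_points(y_true, scores):
    print("Recall = %.2f, Precision = %.2f" % (recall, precision))
```

Unlike TPR and FPR, precision is not monotonic in the threshold: it can rise or fall as more samples are predicted positive, which is why PR curves often look jagged.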
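5.2 PR Curve vs. ROC Curve
Similarities:
The PR curve plots Precision against Recall. Like the ROC curve, it uses TPR (Recall), and both curves can be summarized by the area under the curve to measure classifier quality.
Differences:
The ROC curve uses FPR where the PR curve uses Precision, so both PR metrics focus on the positive class. In class-imbalanced problems the positive class is usually the main concern, so in that setting the PR curve is widely regarded as preferable to the ROC curve.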
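A small piece of arithmetic shows why the two curves behave differently under imbalance. The sketch below fixes one operating point (TPR = 0.8, FPR = 0.1, numbers chosen purely for illustration) and only changes the number of negatives. The helper `precision_at` is a hypothetical name.

```python
def precision_at(n_pos, n_neg, tpr, fpr):
    """Precision implied by a fixed (FPR, TPR) operating point and class counts."""
    tp = tpr * n_pos   # positives correctly predicted as 1
    fp = fpr * n_neg   # negatives wrongly predicted as 1
    return tp / (tp + fp)

# Same TPR and FPR, so the point on the ROC curve is identical in both cases.
balanced = precision_at(n_pos=100, n_neg=100, tpr=0.8, fpr=0.1)
imbalanced = precision_at(n_pos=100, n_neg=10000, tpr=0.8, fpr=0.1)

print("precision with 1:1 classes:   %.3f" % balanced)
print("precision with 1:100 classes: %.3f" % imbalanced)
```

The ROC point does not move, but precision drops from about 0.889 to about 0.074, because the same 10% false positive rate now applies to a hundred times as many negatives. The PR curve exposes this, and the ROC curve does not.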
5.3 Plotting the PR Curve in Python
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

scores = lr.predict_proba(X_test)[:, 1]  # predicted probability of class 1
precision, recall, thresholds = precision_recall_curve(y_test, scores)

plt.figure(1)
plt.plot(recall, precision)  # x = recall, y = precision, matching the axis labels below
plt.title('Precision/Recall Curve')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.show()
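6 K-S Curve and the KS Value
6.1 Computing the KS Value
from sklearn import metrics

scores = lr.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = metrics.roc_curve(y_test, scores, pos_label=1)
ks = max(tpr - fpr)  # largest gap between TPR and FPR over all thresholds
print('KS value of lr: %.4f' % ks)
KS value of lr: 0.3135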
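6.2 K-S Curve
The K-S curve is the difference between the Lorenz curve of the positive samples and the Lorenz curve of the negative samples. It measures how well the model separates positives from negatives. The highest point of the K-S curve is defined as the KS value, and a larger KS value means better discrimination. In practice the KS value rarely reaches 0.6, and anything between 0.2 and 0.6 is considered decent.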
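The curve itself can be sketched as follows. This is one possible construction, assuming the two cumulative curves are TPR and FPR as functions of the threshold, which matches the `max(tpr - fpr)` computation of the KS value. The names `ks_curve` and `plot_ks` are hypothetical, and matplotlib is imported only inside the plotting helper.

```python
def ks_curve(y_true, scores):
    """Return (thresholds, tpr, fpr, ks), with thresholds sorted from high to low."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    thresholds = sorted(set(scores), reverse=True)
    tpr, fpr = [], []
    for thr in thresholds:
        # Cumulative share of positives / negatives with score >= thr.
        tpr.append(sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1) / n_pos)
        fpr.append(sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0) / n_neg)
    ks = max(t - f for t, f in zip(tpr, fpr))
    return thresholds, tpr, fpr, ks

def plot_ks(y_true, scores):
    """Plot TPR, FPR and their gap against the threshold."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting
    thresholds, tpr, fpr, ks = ks_curve(y_true, scores)
    gap = [t - f for t, f in zip(tpr, fpr)]
    plt.plot(thresholds, tpr, label='TPR (positives)')
    plt.plot(thresholds, fpr, label='FPR (negatives)')
    plt.plot(thresholds, gap, label='K-S curve (KS = %.4f)' % ks)
    plt.xlabel('Threshold')
    plt.ylabel('Cumulative rate')
    plt.title('K-S Curve')
    plt.legend()
    plt.show()

# Toy data (hypothetical): two negatives and two positives.
thresholds, tpr, fpr, ks = ks_curve([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print('KS = %.2f' % ks)
```

For the model in this post, something like `plot_ks(list(y_test), list(lr.predict_proba(X_test)[:, 1]))` would draw the curve. The pure-Python loop is quadratic in the number of samples, so for larger data the `fpr` and `tpr` arrays from `metrics.roc_curve` are a faster source for the same three lines.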
7 References
https://www.jianshu.com/p/2ca96fce7e81
https://blog.csdn.net/u014568921/article/details/53843311
https://blog.csdn.net/teminusign/article/details/51982877
https://www.jianshu.com/p/fec4105a60d7
https://blog.csdn.net/cymy001/article/details/79613787
如何向门外汉讲解ks值(风控模型术语)? (How to explain the KS value, a risk-control modeling term, to a layperson?): https://www.zhihu.com/question/34820996
https://blog.csdn.net/weixin_39750084/article/details/80558587
8 Data
accepts (the dataset read in section 2.1 as data/accepts.csv)