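Machine learning regression: a worked example of predicting online purchases of product X (sklearn), including training and prediction on the dataset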

xiaoxiao2023-10-21 161

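Predict who will buy product X, describe the process, and analyze the reasons

The assignment is as the title says: given some information about a user (including gender, age, financial situation, purchasing behavior, and so on), predict whether that user will buy product X.

Notes:
1 The printed version of the paper (two documents) must be submitted in full (do not bind the "Paper Cover Page"; bind the "Report" on the left side). See the website for where to submit.
2 Do not try to test the instructor's ability to spot copying. If two students are shown to be likely copies of each other, both will receive a final grade of 59.99.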

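2 Data description

The data (in the 6-数据集 dataset folder) and the required result set are described in detail below:

Train.csv: 6000 rows, 86 columns. The last column is whether the user has purchased product X: 1 = purchased, 0 = not purchased.
Test.csv: 3822 rows, 85 columns.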

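Related files:

Train.csv: the dataset used to train and build the prediction model (6000 consumer records). Each record contains 86 features, covering sociodemographic data (features 1-43) and product ownership (features 44-86). The sociodemographic features are derived from the postal code, so all consumers living in the same postal-code area share the same sociodemographic features. Feature 86 (PURCHASE) is the target feature.

Test.csv: the dataset used for testing (roughly 3800 consumer records; 3822 rows, as stated above). Apart from the missing target feature, it has the same format as Train.csv. You should return only the list of predictions for the target feature. All datasets are tab-delimited text.

The meaning and values of each feature are as follows: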
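As a minimal sketch of the layout described above, the snippet below reads tab-delimited text with no header row and splits off the last column as the target. The tiny in-memory sample and the helper name `split_features_target` are illustrative assumptions, not part of the assignment. With the real files one would pass the path to Train.csv in place of the `StringIO` object.

```python
import io

import pandas as pd


def split_features_target(df):
    """Return (features, target), treating the last column as the target."""
    features = df.iloc[:, :-1]
    target = df.iloc[:, -1]
    return features, target


# Hypothetical 3-row sample: 4 feature columns plus a 0/1 target column.
sample = "1\t2\t3\t4\t0\n5\t6\t7\t8\t1\n9\t1\t2\t3\t0\n"
train = pd.read_csv(io.StringIO(sample), sep="\t", header=None)

X, y = split_features_target(train)
print(X.shape, y.shape)   # for the real Train.csv this should be (6000, 85) and (6000,)
print(y.value_counts())   # class balance of the target
```

Checking the shapes and the class balance right after loading is a cheap way to catch a wrong delimiter or an unexpected header row before any model is trained.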

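No.  Name      Description                          Values
1    CUSTYPE   Customer type                        L0
2    NUMHOUSE  Number of houses                     1-10
3    AVGHOUHO  Household size                       1-6
4    AVGAGE    Average age                          L1
5    CUSMAITY  Customer main type                   L2

The following are proportions within each postal code; for values see L3.

No.  Name      Description                          Values
6    MRELIRK   Catholic                             L3
7    MRELIPK   Protestant                           L3
8    MRELIOV   Other religion                       L3
9    MRELIGE   No religion                          L3
10   MMARGE    Married                              L3
11   MMARSA    Living together                      L3
12   MMAROR    Other relationship                   L3
13   MFSING    Single                               L3
14   MFWOKD    Household without children           L3
15   MFWIKD    Household with children              L3
16   MEHIGH    High education level                 L3
17   MEMIDD    Medium education level               L3
18   MELOWE    Low education level                  L3
19   MBHIST    High social status                   L3
20   MBENTR    Entrepreneur                         L3
21   MBFARM    Farmer                               L3
22   MBMIMA    Middle management                    L3
23   MBSKLA    Skilled workers                      L3
24   MBUSLA    Unskilled workers                    L3
25   MSCA      Social class A                       L3
26   MSCB1     Social class B1                      L3
27   MSCB2     Social class B2                      L3
28   MSCC      Social class C                       L3
29   MSCD      Social class D                       L3
30   MHRENT    Rented home                          L3
31   MHOWNE    Home owner                           L3
32   MCAR1     One car                              L3
33   MCAR2     Two cars                             L3
34   MCAR0     No car                               L3
35   MHSFOND   Public health service                L3
36   MHSPRIV   Commercial health insurance          L3
37   MINCO30   Income < 30,000                      L3
38   MINC3045  30,000 < income < 45,000             L3
39   MINC4575  45,000 < income < 75,000             L3
40   MINC7512  75,000 < income < 122,000            L3
41   MINCO123  Income > 123,000                     L3
42   MAVEIN    Average income                       L3
43   MPURKL    Purchasing power class               L3

The following are totals of the variable within each postal code; for values see L4.

No.  Name      Description                          Values
44   PTAMOA    Spending on category A products      L4
45   PTAMOB    Spending on category B products      L4
46   PTAMOC    Spending on category C products      L4
47   PTAMOD    Spending on product D                L4
48   PTAMOE    Spending on product E                L4
49   PTAMOF    Spending on product F                L4
50   PTAMOG    Spending on product G                L4
51   PTAMOH    Spending on product H                L4
52   PTAMOI    Spending on product I                L4
53   PTAMOJ    Spending on product J                L4
54   PTAMOK    Spending on product K                L4
55   PTAMOL    Spending on product L                L4
56   PTAMOM    Spending on product M                L4
57   PTAMON    Spending on product N                L4
58   PTAMOO    Spending on product O                L4
59   PTAMOP    Spending on product P                L4
60   PTAMOQ    Spending on product Q                L4
61   PTAMOR    Spending on product R                L4
62   PTAMOS    Spending on product S                L4
63   PTAMOT    Spending on product T                L4
64   PTAMOU    Spending on product U                L4
65   NOAMOA    Number of category A products bought 1-12
66   NOAMOB    Number of category B products bought 1-12
67   NOAMOC    Number of category C products bought 1-12
68   NOAMOD    Number of product D bought           1-12
69   NOAMOE    Number of product E bought           1-12
70   NOAMOF    Number of product F bought           1-12
71   NOAMOG    Number of product G bought           1-12
72   NOAMOH    Number of product H bought           1-12
73   NOAMOI    Number of product I bought           1-12
74   NOAMOJ    Number of product J bought           1-12
75   NOAMOK    Number of product K bought           1-12
76   NOAMOL    Number of product L bought           1-12
77   NOAMOM    Number of product M bought           1-12
78   NOAMON    Number of product N bought           1-12
79   NOAMOO    Number of product O bought           1-12
80   NOAMOP    Number of product P bought           1-12
81   NOAMOQ    Number of product Q bought           1-12
82   NOAMOR    Number of product R bought           1-12
83   NOAMOS    Number of product S bought           1-12
84   NOAMOT    Number of product T bought           1-12
85   NOAMOU    Number of product U bought           1-12
86   PURCHASE  Whether product X was purchased      0-1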
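Since the reference script later reads the files without a header row, it can help to keep the feature blocks from the table as column positions. The sketch below is one possible way to do that; the constant names (`SOCIO_COLS`, `SPEND_COLS`, `COUNT_COLS`, `TARGET_COL`) are hypothetical, and the positions are simply the table's 1-based feature numbers shifted to 0-based indices.

```python
import string

# 0-based column positions for the 1-based feature numbers in the table.
SOCIO_COLS = list(range(0, 43))    # features 1-43 (sociodemographic)
SPEND_COLS = list(range(43, 64))   # features 44-64 (PTAMOA..PTAMOU)
COUNT_COLS = list(range(64, 85))   # features 65-85 (NOAMOA..NOAMOU)
TARGET_COL = 85                    # feature 86 (PURCHASE)

# The product suffixes run from A to U, i.e. the first 21 uppercase letters.
letters = string.ascii_uppercase[:21]
spend_names = ["PTAMO" + c for c in letters]
count_names = ["NOAMO" + c for c in letters]

# Pair each spending column with the count column for the same product.
pairs = list(zip(spend_names, count_names))
print(pairs[0], pairs[-1])
```

With these lists, a slice such as `X[:, SPEND_COLS]` selects one block of a NumPy feature matrix without hard-coding index ranges in several places.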

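The values represented by L0, L1, L2, L3, and L4 are as follows:

L0 (value, label):
1 High income, high childcare costs
2 Very important persons
3 High-status individuals
4 Luxury upscale villas
5 Mixed-ethnicity seniors
6 Successful career, with children
7 DINKs (dual income, no kids)
8 Middle-class families
9 Modern complete families
10 Stable families
11 Newly formed families
12 Affluent young families
13 Young non-transnational families
14 Young cosmopolitans
15 Senior cosmopolitans
16 Students in apartments
17 New city dwellers
18 Single youth
19 Suburban youth
20 Ethnically diverse
21 Young urban poor
22 Mixed apartment dwellers
23 Young people on the rise
24 Young people with low education
25 Young urban upper class
26 Home-owning seniors
27 Seniors in apartments
28 Elderly residents
29 Seniors without a front yard
30 Religious elderly singles
31 Low-income Catholics
32 Mixed-ethnicity seniors
33 Lower-class large families
34 Large families with working children
35 Village families
36 Married because of pregnancy
37 Mixed small-town residents
38 Traditional families
39 Large religious families
40 Large family farms
41 Mixed rural population

L1:
1 20-30 years
2 30-40 years
3 40-50 years
4 50-60 years
5 60-70 years
6 70-80 years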

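L2:
1 Successful hedonists
2 Hard-working growers
3 Average families
4 Career loners
5 Living well
6 Traveling seniors
7 Religious retirees
8 Families with children
9 Conservative families
10 Farmers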

L3:
0 0%
1 1 - 10%
2 11 - 23%
3 24 - 36%
4 37 - 49%
5 50 - 62%
6 63 - 75%
7 76 - 88%
8 89 - 99%
9 100%

L4:
0 0
1 1 - 49
2 50 - 99
3 100 - 199
4 200 - 499
5 500 - 999
6 1000 - 4999
7 5000 - 9999
8 10000 - 19999
9 20000 and above
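The L3 codes are ordinal bins rather than raw percentages, so a model sees the code 5, not "50 - 62%". If approximate percentages are wanted, one option is to map each code to the midpoint of its bin. The sketch below does this for L3; the names `L3_BOUNDS` and `l3_midpoint` are hypothetical, and using the midpoint is an assumption made here for illustration, not something the assignment prescribes.

```python
# Code -> (low, high) percentage bounds, copied from the L3 table.
L3_BOUNDS = {
    0: (0, 0),
    1: (1, 10),
    2: (11, 23),
    3: (24, 36),
    4: (37, 49),
    5: (50, 62),
    6: (63, 75),
    7: (76, 88),
    8: (89, 99),
    9: (100, 100),
}


def l3_midpoint(code):
    """Approximate percentage for an L3 code, taken as the midpoint of its bin."""
    low, high = L3_BOUNDS[code]
    return (low + high) / 2


print([l3_midpoint(c) for c in range(10)])
```

Tree-based models such as XGBoost depend only on the ordering of a feature, so this kind of decoding mainly matters for linear models or for reading coefficients in percentage terms.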

3 Result file

The result file is a single csv file with 3822 rows. Each row is 0, 1, or a probability between 0 and 1 (the probability that this user buys X). Reference code for saving the csv:

    # Note: predictY is a pandas DataFrame
    import pandas as pd
    import numpy as np
    predictY = pd.DataFrame(np.random.uniform(0, 1, 3822).reshape(3822, 1))
    predictY.to_csv('Results_1.csv', encoding='utf-8', index=False, header=False)

4 Evaluation criterion: AUC

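For a detailed introduction, see the site whose name starts with B (Baidu Baike): https://baike.baidu.com/item/AUC/19282953?fr=aladdin

Usage example: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.auc.html

    from sklearn import metrics
    # metrics.auc expects curve points such as (fpr, tpr);
    # roc_auc_score takes the true labels and predicted scores directly
    Score_AUC = metrics.roc_auc_score(y_true, y_predict)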

For example:

    import numpy as np
    from sklearn import metrics
    y = np.array([1, 1, 2, 2])
    pred = np.array([0.1, 0.4, 0.35, 0.8])
    fpr, tpr, thresholds = metrics.roc_curve(y, pred, pos_label=2)
    metrics.auc(fpr, tpr)
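To make the metric more concrete: AUC equals the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as half. The sketch below computes it that way on the same scores as the example above (relabelled to 0/1) and compares the result with `roc_auc_score`. The function name `pairwise_auc` is hypothetical, and the quadratic pairwise comparison is meant for illustration, not for large datasets.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # diff[i, j] > 0 means positive i is scored above negative j.
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg))


y = np.array([0, 0, 1, 1])
pred = np.array([0.1, 0.4, 0.35, 0.8])
print(pairwise_auc(y, pred))    # 0.75: three of the four pairs are ordered correctly
print(roc_auc_score(y, pred))   # 0.75
```

Because only the ordering of the scores matters, submitting probabilities rather than hard 0/1 labels generally gives the metric more to work with.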

5 Reference code

See the 5-参考案例-强力推荐 (reference cases, strongly recommended) folder in the archive.

    # -*- coding: utf-8 -*-
    import numpy as np
    import pandas as pd
    #from keras.utils import np_utils
    from sklearn.model_selection import train_test_split, KFold, cross_val_score
    from sklearn.preprocessing import LabelEncoder
    from sklearn.utils.class_weight import compute_class_weight
    import xgboost
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc
    import matplotlib.pyplot as plt
    from xgboost import plot_importance  #0116
    import plot  #0116 not a standard package; presumably a local helper module from the reference archive

    # load dataset
    # (the task description says the files are tab-delimited; if so, pass sep="\t")
    dataframe = pd.read_csv("Train.csv", header=None)
    dataset = dataframe.values
    X = dataset[:, 0:85].astype(float)   # features 1-85
    Y = dataset[:, 85].astype(float)     # feature 86 (PURCHASE)

    # encode class values as integers
    #encoder = LabelEncoder()
    #encoded_Y = encoder.fit_transform(Y)
    ## convert integers to dummy variables (one hot encoding)
    #dummy_y = np_utils.to_categorical(encoded_Y)

    #######[1] using full dataset to train
    X_train = X
    Y_train = Y
    dataframe = pd.read_csv("Test.csv", header=None)
    dataset = dataframe.values
    X_test = dataset.astype(float)

    #######[2]
    #split Train.csv 0.7 train, 0.3 test by random
    #X_train, X_test, Y_train, Y_test = train_test_split(X,dummy_y, test_size=0.3, random_state=True)

    # balanced class weights; recent scikit-learn versions require keyword arguments here
    weight = compute_class_weight('balanced', classes=np.array([0, 1]), y=Y_train)

    # In[1]: logistic regression
    # the l1 penalty needs a solver that supports it, such as liblinear
    lr_clf2 = LogisticRegression(penalty="l1", solver="liblinear", max_iter=100, C=1.5,
                                 class_weight={0: weight[0], 1: weight[1]})
    lr_clf2.fit(X_train, Y_train)
    y1_train_pred_LR = lr_clf2.predict(X_train)
    print("LR Confusion matrix (train):\n {0}\n".format(confusion_matrix(Y_train, y1_train_pred_LR)))
    print("LR Classification report (train):\n {0}".format(classification_report(Y_train, y1_train_pred_LR)))
    LR_pred = lr_clf2.predict_proba(X_train)
    fpr_lr2, tpr_lr2, thresholds_lr2 = roc_curve(Y_train, LR_pred[:, 1])
    roc_auc_lr2 = auc(fpr_lr2, tpr_lr2)  # note: AUC on the training data
    #0116
    plot.plot_feature_scores([i for i in range(len(lr_clf2.coef_[0]))],
                             (lr_clf2.coef_[0]),
                             ['f' + str(i) for i in range(len(lr_clf2.coef_[0]))])

    # In[2]: XGBoost
    xgb = xgboost.XGBClassifier(n_estimators=1000, learning_rate=0.003, scale_pos_weight=(weight[1]/weight[0]), random_state=1)
    #xgb =xgboost.XGBClassifier(n_estimators=10000, learning_rate=0.001, scale_pos_weight=(weight[1]/weight[0]), random_state=1)
    #xgb =xgboost.XGBClassifier(n_estimators=10000, learning_rate=0.0001, scale_pos_weight=(weight[1]/weight[0]), random_state=1)
    #xgb =xgboost.XGBClassifier(n_estimators=1000, learning_rate=0.0001, scale_pos_weight=(weight[1]/weight[0]), random_state=1)
    #xgb =xgboost.XGBClassifier(n_estimators=100, learning_rate=0.0001, scale_pos_weight=(weight[1]/weight[0]), random_state=1)
    xgb = xgb.fit(X_train, Y_train)
    y1_train_pred_XGB = xgb.predict(X_train)
    print("XGBoost Confusion matrix (train):\n {0}\n".format(confusion_matrix(Y_train, y1_train_pred_XGB)))
    print("XGBoost Classification report (train):\n {0}".format(classification_report(Y_train, y1_train_pred_XGB)))
    XGB_pred = xgb.predict_proba(X_train)
    fpr_xgb1, tpr_xgb1, thresholds_xgb1 = roc_curve(Y_train, XGB_pred[:, 1])
    roc_auc_xgb1 = auc(fpr_xgb1, tpr_xgb1)  # note: AUC on the training data
    #xgb.feature_importances_
    fig, ax = plt.subplots(figsize=(12, 8))  #0116
    plot_importance(xgb, ax=ax, height=0.5)  #0116
    plt.savefig('xgb.png')  #0116

    # In[plot]: ROC curves
    plt.figure()  # new figure, so the curves are not drawn on top of the importance plot
    plt.plot(fpr_lr2, tpr_lr2, lw=2, alpha=.6)
    plt.plot(fpr_xgb1, tpr_xgb1, lw=2, alpha=.6)
    plt.plot([0, 1], [0, 1], lw=2, linestyle="--")
    plt.xlim([0, 1])
    plt.ylim([0, 1.05])
    plt.xlabel("FPR")
    plt.ylabel("TPR")
    plt.title("ROC curve")
    plt.legend(["Logistic Reg (AUC {:.4f})".format(roc_auc_lr2),
                "XGBoost (AUC {:.4f})".format(roc_auc_xgb1)], fontsize=8, loc=2)
    plt.show()

    # In[3]: predict on the test set and write the features plus the predicted probability
    # (the ./dataset directory must already exist)
    y1_test_pred_LR = lr_clf2.predict_proba(X_test)
    y1_test_pred_XGB = xgb.predict_proba(X_test)

    with open('./dataset/TestLR.csv', 'w+') as f1:
        for i in range(len(X_test)):
            for j in range(len(X_test[0])):
                f1.write(str(int(X_test[i][j])) + ',')
            f1.write(str(y1_test_pred_LR[:, 1][i]) + '\n')

    with open('./dataset/TestXGB.csv', 'w+') as f2:
        for i in range(len(X_test)):
            for j in range(len(X_test[0])):
                f2.write(str(int(X_test[i][j])) + ',')
            f2.write(str(y1_test_pred_XGB[:, 1][i]) + '\n')
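The reference code computes its ROC curves and AUC on the same rows it trained on, which tends to give an optimistic figure. A hedged alternative is to hold out part of Train.csv and score on that part, along the lines of the commented-out `train_test_split` call. The sketch below uses synthetic data from `make_classification` as a stand-in for Train.csv so that it runs on its own; the 94/6 class imbalance, the sample size, and all variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for Train.csv: 85 features and an imbalanced 0/1 target (assumed ratio).
X, y = make_classification(n_samples=2000, n_features=85, n_informative=10,
                           weights=[0.94, 0.06], random_state=0)

# Stratify so that both splits keep roughly the same share of buyers.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# class_weight="balanced" applies the same balanced weighting as compute_class_weight.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.5, class_weight="balanced")
clf.fit(X_tr, y_tr)

auc_train = roc_auc_score(y_tr, clf.predict_proba(X_tr)[:, 1])
auc_val = roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1])
print("train AUC: {:.4f}  validation AUC: {:.4f}".format(auc_train, auc_val))
```

The validation AUC is the more useful number when comparing models or hyperparameters, since the final grading is done on rows the model has not seen.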
