[Machine Learning Case Study 2] Predicting Product Purchase Amounts (Regression)

    xiaoxiao2022-07-02  214

    Predicting purchase amounts with regression analysis

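    Background
    Data preprocessing
    Lasso regression
        Requirements
        Model parameters
        Grid search for the best penalty factor
        Training the model with the best penalty factor
        Training results
    Decision tree
        Requirements
        Model parameters
        Grid search for the best parameters
        Training the model with the best parameters
        Evaluating feature importance with the fitted decision tree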

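    Background

    The BlackFriday dataset lists factors related to the purchase amount (Purchase): Gender, Age, City_Category, Stay_In_Current_City_Years, Marital_Status, and Product_Category (the column Product_Category_1 in the data). All of these are categorical variables. Split the original dataset into a training set (80%) and a test set (20%), and build a model to predict the purchase amount.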

    Data preprocessing

    Import libraries

        import pandas as pd
        import numpy as np
        import matplotlib.pyplot as plt
        import seaborn as sns
        import os
        from sklearn.model_selection import train_test_split
        from sklearn.model_selection import cross_val_score
        from sklearn.model_selection import GridSearchCV
        from sklearn import linear_model
        import warnings

        # Suppress warnings
        warnings.filterwarnings('ignore')

    Read the data

        df = pd.read_csv('BlackFriday')

    Inspect the data

        df.columns

    ['User_ID', 'Product_ID', 'Gender', 'Age', 'Occupation', 'City_Category', 'Stay_In_Current_City_Years', 'Marital_Status', 'Product_Category_1', 'Purchase']

        df.dtypes

    Handle missing values

    1. Check for missing values

        df.isna().sum()

    2. Drop unused columns

        df.drop('User_ID', axis=1, inplace=True)
        df.drop('Product_ID', axis=1, inplace=True)

    Change data types

        df['Product_Category_1'] = df['Product_Category_1'].astype('str')

    Split into training and test sets

        y = df['Purchase']
        x = df.drop('Purchase', axis=1)
        # One-hot encode the string-valued columns
        x = pd.get_dummies(x)
        x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)
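
    The task statement treats every predictor as categorical, yet only Product_Category_1 is cast to str before encoding. If columns such as Occupation or Marital_Status are stored as integers (an assumption; the dtypes output is not shown above), pd.get_dummies leaves them as numeric features by default. A minimal sketch on made-up data, showing the default behaviour next to the columns argument, which forces one-hot encoding regardless of dtype:

```python
import pandas as pd

# Toy frame with hypothetical values; the real BlackFriday data is not used here.
toy = pd.DataFrame({
    "Gender": ["F", "M", "M", "F"],
    "Occupation": [10, 16, 10, 7],      # integer-coded category
    "Marital_Status": [0, 1, 0, 1],     # integer-coded category
})

# By default only string/category columns are expanded,
# so the integer columns pass through unchanged as numeric features.
default_encoded = pd.get_dummies(toy)

# Naming the columns explicitly one-hot encodes them whatever their dtype.
all_categorical = pd.get_dummies(toy, columns=list(toy.columns))

print(default_encoded.columns.tolist())
print(all_categorical.columns.tolist())
```

    Whether integer codes should be expanded depends on whether their order carries meaning; for an occupation code it usually does not.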

    Lasso regression

    Requirements

    Build a Lasso regression model and use grid search with 5-fold cross-validation to determine the best penalty factor; with that penalty factor, evaluate the Lasso model's predictive performance on the training set and on the test set.

    Model parameters

        from sklearn.linear_model import Lasso
        lasso = Lasso()
        lasso.get_params()

    {'alpha': 1.0, 'copy_X': True, 'fit_intercept': True, 'max_iter': 1000, 'normalize': False, 'positive': False, 'precompute': False, 'random_state': None, 'selection': 'cyclic', 'tol': 0.0001, 'warm_start': False}
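
    For reference, the alpha parameter is the penalty factor tuned in the next step. The scikit-learn documentation states the objective minimized by Lasso as follows, where w is the coefficient vector and n_samples is the number of training rows:

```latex
\min_{w} \; \frac{1}{2\, n_{\text{samples}}} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
```

    A larger alpha shrinks more coefficients to exactly zero; alpha = 0 reduces the objective to ordinary least squares.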

    Grid search for the best penalty factor

        lasso = Lasso()
        parameters = {'alpha': np.arange(0.1, 1, 0.1)}
        lasso_cv = GridSearchCV(lasso, param_grid=parameters, cv=5)
        lasso_cv.fit(x, y)
        print(lasso_cv.best_params_)
        print(lasso_cv.best_score_)

    best_params_: {'alpha': 0.1}  best_score_: 0.628258

        lasso = Lasso()
        parameters = {'alpha': np.arange(0.01, 0.1, 0.01)}
        lasso_cv = GridSearchCV(lasso, param_grid=parameters, cv=5)
        lasso_cv.fit(x, y)
        print(lasso_cv.best_params_)
        print(lasso_cv.best_score_)

    best_params_: {'alpha': 0.01}  best_score_: 0.628261

    Refining the grid raised the cross-validated score only from 0.628258 to 0.628261, so tuning stops here; the best parameter is alpha=0.01.
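
    Both searches above select the smallest alpha in their grid (0.1, then 0.01), and both fit on the full x and y rather than on the training split. A minimal sketch of an alternative, assuming a log-spaced grid and a search fitted on the training split only; the data is synthetic (make_regression) and the grid bounds are illustrative assumptions, not values from the original experiment:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the encoded feature matrix (not the BlackFriday data).
X_demo, y_demo = make_regression(n_samples=300, n_features=20, n_informative=5,
                                 noise=10.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

# Log-spaced candidates cover several orders of magnitude in one pass.
param_grid = {"alpha": np.logspace(-4, 1, 6)}

# Fit the search on the training split only, so the test split stays unseen.
search = GridSearchCV(Lasso(max_iter=10000), param_grid=param_grid, cv=5)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.best_score_)         # mean cross-validated R^2
print(search.score(X_te, y_te))   # R^2 on the held-out test split
```

    If the best value again sits at an edge of the grid, the range can be extended in that direction before refining.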

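    Training the model with the best penalty factor

        lasso = Lasso(alpha=0.01)
        lasso.fit(x_train, y_train)
        print(lasso.score(x_train, y_train))
        print(lasso.score(x_test, y_test))

    Training-set score (R²): 0.631  Test-set score (R²): 0.628

    Training results

    Root mean squared error (RMSE)

        y_pre = lasso.predict(x_test)
        y_hat = lasso.predict(x_train)
        # RMSE on the training set
        rmse_lasso = ((y_hat - y_train).T.dot(y_hat - y_train) / len(y_train)) ** 0.5
        # RMSE on the test set
        rfmse_lasso = ((y_pre - y_test).T.dot(y_pre - y_test) / len(y_test)) ** 0.5

    rmse_lasso = 3026  rfmse_lasso = 3030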
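
    The dot-product expression above is the square root of the mean squared residual. A small sketch with made-up numbers (not model output) that checks the manual form against scikit-learn's mean_squared_error:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical targets and predictions, used only to illustrate the formula.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 320.0, 380.0])

# Manual form: sqrt(sum of squared residuals / n).
residual = y_pred - y_true
rmse_manual = (residual.dot(residual) / len(y_true)) ** 0.5

# Same quantity via scikit-learn's mean squared error.
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(rmse_manual, rmse_sklearn)
```

    RMSE is in the same units as Purchase, which makes it easier to interpret than R² alone.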

    Plot

        # Plot 1500 randomly sampled points
        df_random = df.sample(1500)
        # With too few samples, some category of a feature may be absent,
        # and get_dummies would then produce fewer columns
        x_rand = df_random.drop('Purchase', axis=1)
        x_rand = pd.get_dummies(x_rand)
        y_rand = df_random['Purchase']
        y_rand_pre = lasso.predict(x_rand)
        plt.figure()
        plt.plot(list(range(len(y_rand_pre))), y_rand, color='blue', label='label')
        plt.scatter(list(range(len(y_rand_pre))), y_rand_pre, color='red', label='predict')
        plt.title('blackfriday')
        plt.legend()
        plt.show()
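
    Relying on a large sample to cover every category works only by chance. A minimal sketch, on hypothetical toy data, of aligning a sample's dummy columns to the columns seen at training time with DataFrame.reindex:

```python
import pandas as pd

# Hypothetical toy data standing in for the full frame and a small sample.
full = pd.DataFrame({"City_Category": ["A", "B", "C", "A"],
                     "Gender": ["F", "M", "M", "F"]})
full_encoded = pd.get_dummies(full)

# A sample that happens to miss City_Category "C".
sample = full.iloc[[0, 1]]
sample_encoded = pd.get_dummies(sample)

# Reindex to the training-time columns; dummies absent from the sample become 0.
aligned = sample_encoded.reindex(columns=full_encoded.columns, fill_value=0)

print(sample_encoded.shape[1], aligned.shape[1])
```

    In the setting above, where x already holds the encoded full dataset with the same index as df, selecting x.loc[df_random.index] would avoid re-encoding altogether.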

    Decision tree

    Requirements

    Decision trees can also be used for regression. Referring to the documentation of the DecisionTreeRegressor class in the sklearn.tree module, build a model to predict the purchase amount (the model still needs to be tuned), and evaluate the predictive performance of the selected best model on the training set and on the test set.

    Model parameters

        from sklearn.tree import DecisionTreeRegressor
        tree = DecisionTreeRegressor()
        tree.get_params()

    {'criterion': 'mse', 'max_depth': None, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'presort': False, 'random_state': None, 'splitter': 'best'}

    Grid search for the best parameters

        tree = DecisionTreeRegressor()
        parameters = {'max_depth': np.arange(10, 20, 1)}
        tree_cv = GridSearchCV(tree, param_grid=parameters, cv=5)
        tree_cv.fit(x, y)
        print(tree_cv.best_params_)
        print(tree_cv.best_score_)

    best_params_: {'max_depth': 17}  best_score_: 0.64

    Training the model with the best parameters

        tree = DecisionTreeRegressor(max_depth=17)
        tree.fit(x_train, y_train)
        print(tree.score(x_train, y_train))
        print(tree.score(x_test, y_test))

    Training-set score (R²): 0.67  Test-set score (R²): 0.63
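
    The training score (0.67) is somewhat higher than the test score (0.63), and max_depth is the only parameter searched. A sketch of searching depth together with min_samples_leaf, which may narrow that gap; the data is synthetic and the candidate values are illustrative assumptions rather than settings from the original experiment:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (not the BlackFriday dataset).
X_demo, y_demo = make_regression(n_samples=400, n_features=10, n_informative=4,
                                 noise=20.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

# Search tree depth and minimum leaf size jointly.
param_grid = {
    "max_depth": [3, 5, 8, 12],
    "min_samples_leaf": [1, 5, 20],
}
search = GridSearchCV(DecisionTreeRegressor(random_state=1),
                      param_grid=param_grid, cv=5)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.score(X_tr, y_tr))   # R^2 on the training split
print(search.score(X_te, y_te))   # R^2 on the test split
```

    Fixing random_state keeps the fitted tree reproducible between runs, which makes the two scores comparable across parameter settings.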

    Evaluating feature importance with the fitted decision tree

        stat = pd.DataFrame(columns=['importance', 'feature'])
        stat['importance'] = tree.feature_importances_
        stat['feature'] = x.columns
        # Most important features first
        stat.sort_values(by='importance', ascending=False, inplace=True)
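
    Because the features are one-hot encoded, the table above ranks individual dummy columns rather than the original attributes. A minimal sketch, on made-up data, of summing the importances of all dummies that come from the same source column; the "=" separator is an assumption chosen so that column names containing underscores can be split unambiguously:

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data; column names mimic the dataset but the values are made up.
toy = pd.DataFrame({
    "Gender": ["F", "M", "M", "F", "M", "F", "M", "F"],
    "City_Category": ["A", "B", "C", "A", "B", "C", "A", "B"],
    "Purchase": [100, 220, 310, 120, 210, 330, 90, 240],
})
features = ["Gender", "City_Category"]
X_toy = pd.get_dummies(toy[features], prefix_sep="=")
y_toy = toy["Purchase"]

model = DecisionTreeRegressor(max_depth=3, random_state=1).fit(X_toy, y_toy)

# One importance value per dummy column.
importance = pd.Series(model.feature_importances_, index=X_toy.columns)

# Group dummy columns by the source feature name before "=" and sum each group.
by_feature = (importance
              .groupby(lambda col: col.split("=")[0])
              .sum()
              .sort_values(ascending=False))

print(by_feature)
```

    The per-attribute totals still sum to 1, so they can be read as each attribute's share of the tree's total impurity reduction.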
