Alibaba Written Test: Data Analysis and Modeling
Please read the following text and answer the question.
Field Descriptions:
isbuyer - Past purchaser of product
buy_freq - How many times purchased in the past
visit_freq - How many times visited website in the past
buy_interval - Average time between purchases
sv_interval - Average time between website visits
expected_time_buy - ?
expected_time_visit - ?
last_buy - Days since last purchase
last_visit - Days since last website visit
multiple_buy - ?
multiple_visit - ?
uniq_url - Number of unique URLs we observed the web browser on
num_checkins - Number of times we observed the web browser
y_buy - Outcome variable of interest: did they purchase in the period of interest?

Question:
Each observation in the provided training/test dataset is a web browser (or cookie) in our observed universe. The goal is to model future purchase behavior and classify cookies into those that will purchase in the future and those that will not. y_buy is the outcome variable indicating whether a cookie made a purchase in the period of interest. All of the remaining columns were recorded prior to this purchase and may be used to predict it. Please use 'ads_train.csv' as training data to create at least two different classes of models (e.g. logistic regression, random forest) to classify these cookies into future buyers or not. Explain your choice of models, how you performed model selection, how you validated model quality, and which variables are most informative of purchase. Also comment on any general trends or anomalies you can identify in the data, and propose a meaning for the fields that are not defined. The deliverable is a document with text and figures illustrating your thought process, how you began to explore the data, and a comparison of the models you created. When evaluating your models, consider metrics such as the AUC of the precision-recall curve, precision, and recall. This should take about 6 hours and can be done using any programming language or statistical package (R or Python are preferred).
Finally, perform prediction on the test dataset 'ads_test.csv' using your chosen model(s), and report the predicted probabilities and predicted labels of future purchase.
Please also include your code with the document (Python Jupyter / R knitr is recommended).
Field Meaning Analysis
isbuyer - Past purchaser of the product
buy_freq - Number of times purchased in the past
visit_freq - Number of times the website was visited in the past
buy_interval - Average time between purchases
sv_interval - Average time between website visits
expected_time_buy - ? Proposed meaning: expected time until the next purchase
expected_time_visit - ? Proposed meaning: expected time until the next website visit
last_buy - Days since the last purchase
last_visit - Days since the last website visit
multiple_buy - ? Proposed meaning: whether the cookie purchased the product more than once in the past
multiple_visit - ? Proposed meaning: whether the cookie visited the website more than once in the past
uniq_url - Number of unique URLs the web browser was observed on
num_checkins - Number of times the web browser was observed
y_buy - Outcome variable of interest: whether a purchase was made during the period of interest
1. Day 1: a quick look at the data showed that the positive and negative classes are extremely imbalanced, so I diversified the evaluation metrics, reporting precision, recall, and F1 separately for the positive and negative classes. Missing values were simply filled with 0, and several models were tried; the best result came from logistic regression:
As can be seen, no positive samples were predicted at all; the model learned essentially nothing.
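The "report per-class metrics" idea above can be illustrated with a toy example (synthetic labels, not the actual dataset): a majority-class predictor looks fine on accuracy while its positive-class recall is zero.

```python
import numpy as np
from sklearn.metrics import classification_report, precision_recall_fscore_support

# Toy imbalanced labels: a model that always predicts the majority class
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)

# Accuracy looks fine, but the positive class is never found
accuracy = (y_true == y_pred).mean()
p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, labels=[1], zero_division=0)
print(classification_report(y_true, y_pred, target_names=['no', 'yes'], zero_division=0))
```

Here accuracy is 0.95 while positive-class recall is 0.0, which is exactly the failure mode seen in step 1.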
2. Tried the SMOTE method to address the class imbalance, oversampling the positive class by a factor of 10 while keeping the negative class at 1. Best result:
Although the AUC dropped, precision on the positive class improved slightly, so this was still useful.
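The hand-rolled Smote class in the source listing implements the standard interpolation idea; a condensed sketch of that idea, on hypothetical toy points:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

def smote_oversample(X_pos, n_new, k=3):
    """Generate n_new synthetic minority samples by interpolating
    between each sample and one of its k nearest neighbors."""
    # +1 because the nearest neighbor of a point in its own set is itself
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_pos)
    synthetic = np.empty((n_new, X_pos.shape[1]))
    for j in range(n_new):
        i = rng.integers(len(X_pos))
        neighbors = nn.kneighbors(X_pos[i:i + 1], return_distance=False)[0][1:]
        diff = X_pos[rng.choice(neighbors)] - X_pos[i]
        synthetic[j] = X_pos[i] + rng.random() * diff  # random point on the segment
    return synthetic

# Toy minority-class points (corners of the unit square)
X_pos = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_samples = smote_oversample(X_pos, n_new=40)
```

Because each synthetic point lies on a segment between two real minority points, oversampling stays inside the minority region instead of duplicating samples exactly.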
3. Applied threshold moving to further handle the class imbalance. Based on the positive-class prior p = m/(m+n) = 0.005 and the model's own predictive power, the classification threshold was set to 0.1. Results:
Although overall accuracy dropped, every metric on the positive class improved, so the class imbalance can be considered further mitigated.
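Threshold moving as used above is just a comparison of predicted probabilities against a value below 0.5; a minimal sketch on synthetic imbalanced data (stand-in for the real features):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy data with ~1% positives, mimicking the severe imbalance in ads_train.csv
X, y = make_classification(n_samples=4000, weights=[0.99], flip_y=0, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

probs = clf.predict_proba(X)[:, 1]
default_pos = (probs > 0.5).sum()  # default threshold: few predicted buyers
moved_pos = (probs > 0.1).sum()    # moved threshold: more candidates flagged as buyers
```

Lowering the threshold can only grow the set of predicted positives, trading some precision (and overall accuracy) for positive-class recall.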
4. Continued exploring the data: removed features with low Pearson correlation to y_buy, as well as features where a single category accounted for over 90% of the values. Also found that last_buy and last_visit contain the same values, so one of the two was dropped. Results:
Although recall on the positive class improved, the overall metrics dropped, so this change was abandoned.
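The two checks in step 4 (Pearson correlation with the target, and the suspected duplicate column) can be sketched like this; the toy frame below only borrows the column names from the task and is not the real data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'last_buy': rng.integers(0, 200, 500)})
df['last_visit'] = df['last_buy']  # suspected duplicate of last_buy
df['num_checkins'] = rng.integers(0, 50, 500)
df['y_buy'] = (np.arange(500) % 100 == 0).astype(int)  # sparse positives

# Pearson correlation of each feature with the target
corr_with_target = df.corr()['y_buy'].drop('y_buy').abs()

# Verify two columns are literally identical before dropping one of them
is_duplicate = df['last_buy'].equals(df['last_visit'])
```

Sorting `corr_with_target` gives a quick screening order for candidate features to drop, and `Series.equals` confirms the duplicate before deletion.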
5. Tried feature engineering: binned all features except id and the four strongest ones, then one-hot encoded the bins. Results:
The effect was mediocre: accuracy improved, but the positive-class metrics did not improve significantly. (Performance actually dropped after switching to the ensemble model, so this change was abandoned.)
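The binning + one-hot transform of step 5 amounts to `pd.cut` followed by `pd.get_dummies`; a minimal sketch on a hypothetical wide-range feature:

```python
import numpy as np
import pandas as pd

# Toy wide-range feature, standing in for e.g. uniq_urls
s = pd.Series(np.arange(0, 200))

binned = pd.cut(s, 10, labels=False)                    # 10 equal-width bins -> codes 0..9
onehot = pd.get_dummies(binned, prefix='uniq_urlsBin')  # one indicator column per bin
```

Each original value becomes one of 10 indicator columns, which discards within-bin ordering; this loss of resolution is one plausible reason the transform did not help the tree-based models.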
6. Hyperparameter tuning
Since KNN performed very poorly, three models were selected for tuning: LR, RF, and DTree.
roc_auc was used as the evaluation metric.
Tuning results:
Comparison (DTree):
Before tuning:
After tuning:
The other two models did not improve either. My initial guess is that the choice of evaluation metric was the problem; this change was abandoned.
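If the metric choice was indeed the problem, one alternative worth trying is scoring the grid search with average precision (which approximates the area under the precision-recall curve and tracks minority-class performance more closely than roc_auc). A hedged sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy imbalanced data standing in for the real features
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)

# 'average_precision' rewards ranking the rare positives highly,
# unlike roc_auc which can look good while positives are still missed
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={'C': [0.1, 1.0, 10.0]},
                    scoring='average_precision', cv=5)
grid.fit(X, y)
best_C = grid.best_params_['C']
```

This is a sketch of an untested alternative, not a change the report actually evaluated.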
7. Model ensembling
Results:
First, the best-performing single model, LR:
Ensemble:
Most metrics improved, so the ensemble was worthwhile.
8. Feature selection: used recursive feature elimination (RFE) with cross-validation (CV) to select effective features.
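The RFE + CV combination in step 8 corresponds to scikit-learn's `RFECV`; a minimal sketch on synthetic data (the real run, on the actual features, is in the source listing):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.tree import DecisionTreeClassifier

# Toy data: 5 informative features out of 10
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_redundant=0, random_state=0)

# Recursively drop the weakest feature, keeping the subset with the best CV roc_auc
rfe = RFECV(DecisionTreeClassifier(random_state=0), step=1,
            scoring='roc_auc', cv=5).fit(X, y)
n_selected = rfe.n_features_
mask = rfe.get_support()  # boolean mask over the 10 columns
```

`get_support()` is what maps the selection back to column names when X is a DataFrame.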
9. Replaced the earlier weak classifiers with a LightGBM model plus 5-fold cross-validation; the improvement was dramatic.
Finally, the same data-processing steps were applied to the test file, predictions were made, and the results were saved to a CSV.
Summary: before switching to the LightGBM model I tried quite a few methods, none of which felt very effective. Class imbalance this severe turned out to be harder to handle than expected; I suspected the methods themselves, yet in theory they are all effective against imbalance, which puzzled me. As soon as I switched to LightGBM, performance jumped, which suggests the earlier methods were useful after all and simply had no visible effect because the models were too weak. Possible further improvements:
1. Stack the LightGBM and XGBoost models to improve generalization.
2. Re-try some of the previously rejected operations, such as binning + one-hot.
3. Tune LightGBM's hyperparameters.
Due to other commitments, I did not have time to try these, so I had to stop here.
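For reference, the stacking idea in item 1 can be sketched with scikit-learn's `StackingClassifier`. The two base learners below are stand-ins; in practice they would be the LightGBM and XGBoost models:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (StackingClassifier, RandomForestClassifier,
                              GradientBoostingClassifier)
from sklearn.linear_model import LogisticRegression

# Toy imbalanced data standing in for the real features
X, y = make_classification(n_samples=500, weights=[0.9], random_state=0)

# Base learners' out-of-fold predictions feed a logistic-regression meta-learner
stack = StackingClassifier(
    estimators=[('gbdt', GradientBoostingClassifier(random_state=0)),
                ('rf', RandomForestClassifier(n_estimators=50, random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3)
stack.fit(X, y)
probs = stack.predict_proba(X)[:, 1]
```

Unlike soft voting, the meta-learner here learns how much to trust each base model, which is where the hoped-for generalization gain would come from.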
Source code:
import numpy as np
import pandas as pd
from scipy import sparse

# Common model algorithms
from sklearn import metrics, svm, tree
from sklearn.utils import shuffle
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors
from sklearn.metrics import classification_report
from xgboost import XGBClassifier
import lightgbm as lgb

# Common model helpers
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn import feature_selection
from sklearn import model_selection
from sklearn.model_selection import KFold, train_test_split
from sklearn.metrics import mean_squared_error, log_loss
import joblib  # sklearn.externals.joblib is deprecated; use the joblib package directly

# Visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns
from pandas.plotting import scatter_matrix  # pandas.tools.plotting was removed from pandas

# Configure visualization defaults
# %matplotlib inline shows plots inside the Jupyter notebook
%matplotlib inline
mpl.style.use('ggplot')
sns.set_style('white')
pylab.rcParams['figure.figsize'] = 12, 8

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

import random

# SMOTE oversampling of the minority class
class Smote:
    def __init__(self, samples, N=10, k=3):
        self.n_samples, self.n_attrs = samples.shape
        self.N = N
        self.k = k
        self.samples = samples
        self.newindex = 0

    def over_sampling(self):
        N = int(self.N)
        self.synthetic = np.zeros((self.n_samples * N, self.n_attrs))
        neighbors = NearestNeighbors(n_neighbors=self.k).fit(self.samples)
        for i in range(len(self.samples)):
            nnarray = neighbors.kneighbors(self.samples[i].reshape(1, -1),
                                           return_distance=False)[0]
            self._populate(N, i, nnarray)
        return self.synthetic

    # For each minority-class sample, choose N of the k nearest neighbors
    # and generate N synthetic samples.
    def _populate(self, N, i, nnarray):
        for j in range(N):
            nn = random.randint(0, self.k - 1)
            dif = self.samples[nnarray[nn]] - self.samples[i]
            gap = random.random()
            self.synthetic[self.newindex] = self.samples[i] + gap * dif
            self.newindex += 1

# Load data
columns = ['id', 'isbuyer', 'buy_freq', 'visit_freq', 'buy_interval', 'sv_interval',
           'expected_time_buy', 'expected_time_visit', 'last_buy', 'last_visit',
           'multiple_buy', 'multiple_visit', 'uniq_urls', 'num_checkins', 'y_buy']
train = pd.read_csv('ads_train.csv', header=0, names=columns, encoding='utf-8')
test = pd.read_csv('ads_test.csv', header=0, names=columns[:-1], encoding='utf-8')
# (Data preview omitted)
Data Analysis
# Per-column summary for the training set
stats = []
for col in train.columns:
    stats.append((col, train[col].nunique(),
                  train[col].isnull().sum() * 100 / train.shape[0],
                  train[col].value_counts(normalize=True, dropna=False).values[0] * 100,
                  train[col].dtype))
stats_df = pd.DataFrame(stats, columns=['Feature', 'Unique_values',
                                        'Percentage of missing values',
                                        'Percentage of values in the biggest category', 'type'])
stats_df.sort_values('Percentage of missing values', ascending=False)

# Same summary for the test set
stats = []
for col in test.columns:
    stats.append((col, test[col].nunique(),
                  test[col].isnull().sum() * 100 / test.shape[0],
                  test[col].value_counts(normalize=True, dropna=False).values[0] * 100,
                  test[col].dtype))  # original mistakenly read train[col].dtype here
stats_df = pd.DataFrame(stats, columns=['Feature', 'Unique_values',
                                        'Percentage of missing values',
                                        'Percentage of values in the biggest category', 'type'])
stats_df.sort_values('Percentage of missing values', ascending=False)

# Discrete-variable correlation with y_buy
Target = ['y_buy']
for x in train.columns:
    if train[x].nunique() < 100 and x != 'y_buy':
        print('y_buy correlation by:', x)
        print(train[[x, Target[0]]].groupby(x, as_index=False).mean())
        print('-' * 10, '\n')

# Correlation heatmap of the dataset
def correlation_heatmap(df):
    _, ax = plt.subplots(figsize=(14, 12))
    colormap = sns.diverging_palette(220, 10, as_cmap=True)
    _ = sns.heatmap(df.corr(), cmap=colormap, square=True,
                    cbar_kws={'shrink': .9}, ax=ax, annot=True,
                    linewidths=0.1, vmax=1.0, linecolor='white',
                    annot_kws={'fontsize': 12})
    plt.title('Pearson Correlation of Features', y=1.05, size=15)

correlation_heatmap(train)

Data Cleaning
# Merge train and test for convenient processing
target = train['y_buy']
del train['y_buy']
data = pd.concat([train, test], axis=0, ignore_index=True)

# Dropping features with very high duplication or low Pearson correlation
# hurt performance, so this change was abandoned:
# data.drop(['id', 'buy_interval', 'sv_interval', 'expected_time_buy', 'num_checkins'],
#           axis=1, inplace=True)
data.drop(['id'], axis=1, inplace=True)

# Missing-value imputation: the plan was to first fill buy_freq with 0 where
# isbuyer == 0 and impute the rest with a model, but after the first step
# no missing values remained.
data.loc[data['buy_freq'].isnull(), 'buy_freq'] = data[data['buy_freq'].isnull()]['isbuyer']

Feature Engineering
# # New features: binning + one-hot; performance was worse, so abandoned
# column = train.columns
# for x in column:
#     if max(train[x]) - min(train[x]) > 80 and x != 'id':
#         s = x + 'Bin'
#         train[s] = pd.cut(train[x], 10, labels=False)
#         train.drop(x, axis=1, inplace=True)
#         train = pd.get_dummies(train, columns=[s])

Model Building
train = data[:train.shape[0]]
test = data[train.shape[0]:]
train['y_buy'] = target

# Split the data
def split_data(data):
    data_len = data['y_buy'].count()
    split1 = int(data_len * 0.76)
    train_data = data[:split1]
    test_data = data[split1:]
    return train_data, test_data

# SMOTE resampling of the training data
def resample_train_data(train_data, n, frac):
    numeric_attrs = train_data.drop('y_buy', axis=1).columns
    pos_train_data_original = train_data[train_data['y_buy'] == 1]
    pos_train_data = train_data[train_data['y_buy'] == 1]
    new_count = n * pos_train_data['y_buy'].count()
    neg_train_data = train_data[train_data['y_buy'] == 0].sample(frac=frac)
    train_list = []
    if n != 0:
        pos_train_X = pos_train_data[numeric_attrs]
        pos_train_X2 = pd.concat([pos_train_data.drop(numeric_attrs, axis=1)] * n)
        pos_train_X2.index = range(new_count)
        s = Smote(pos_train_X.values, N=n, k=3)
        pos_train_X = s.over_sampling()
        pos_train_X = pd.DataFrame(pos_train_X, columns=numeric_attrs, index=range(new_count))
        pos_train_data = pd.concat([pos_train_X, pos_train_X2], axis=1)
        pos_train_data = pd.DataFrame(pos_train_data, columns=pos_train_data_original.columns)
        train_list = [pos_train_data, neg_train_data, pos_train_data_original]
    else:
        train_list = [neg_train_data, pos_train_data_original]
    print("Size of positive train data: {} * {}".format(
        pos_train_data_original['y_buy'].count(), n + 1))
    print("Size of negative train data: {} * {}".format(
        neg_train_data['y_buy'].count(), frac))
    train_data = pd.concat(train_list, axis=0)
    return shuffle(train_data)  # shuffle the data

def train_evaluate(train_data, test_data, classifier, n=1, frac=1.0,
                   threshold=0.5, save=False):
    train_data = resample_train_data(train_data, n, frac)
    train_X = train_data.drop('y_buy', axis=1)
    train_y = train_data['y_buy']
    test_X = test_data.drop('y_buy', axis=1)
    test_y = test_data['y_buy']
    print(classifier)
    classifier = classifier.fit(train_X, train_y)
    predict_prob_y = classifier.predict_proba(test_X)[:, 1]
    report = classification_report(test_y, predict_prob_y > threshold,
                                   target_names=['no', 'yes'])
    predict_y = (predict_prob_y > threshold).astype(int)
    accuracy = np.mean(test_y.values == predict_y)
    print("Accuracy: {}".format(accuracy))
    print(report)
    fpr, tpr, thresholds = metrics.roc_curve(test_y, predict_prob_y)
    precision, recall, thresholds = metrics.precision_recall_curve(test_y, predict_prob_y)
    test_auc = metrics.auc(fpr, tpr)
    plot_pr(test_auc, precision, recall, "yes")
    if save:
        joblib.dump(classifier, 'vote_soft.model')
    return predict_y

def plot_pr(auc_score, precision, recall, label=None):
    pylab.figure(num=None, figsize=(6, 5))
    pylab.xlim([0.0, 1.0])
    pylab.ylim([0.0, 1.0])
    pylab.xlabel('Recall')
    pylab.ylabel('Precision')
    pylab.title('P/R (AUC=%0.2f) / %s' % (auc_score, label))
    pylab.fill_between(recall, precision, alpha=0.2)
    pylab.grid(True, linestyle='-', color='0.75')
    pylab.plot(recall, precision, lw=1)
    pylab.show()

def select_model(train_data, cv_data):
    forest = RandomForestClassifier(n_estimators=400, oob_score=True)
    lr = LogisticRegression(max_iter=100, C=1, random_state=0)
    tree = DecisionTreeClassifier(max_depth=3)
    # knn = KNeighborsClassifier(n_neighbors=5, p=2, metric='minkowski')
    # Model ensembling: soft vote / majority rules
    vote_est = [('rfc', forest), ('lr', lr), ('dtc', tree)]
    vote_soft = VotingClassifier(estimators=vote_est, voting='soft')
    # train_evaluate(train_data, cv_data, knn, n=9, frac=1.0, threshold=0.1)
    train_evaluate(train_data, cv_data, tree, n=9, frac=1.0, threshold=0.1)
    train_evaluate(train_data, cv_data, forest, n=9, frac=1.0, threshold=0.1)
    train_evaluate(train_data, cv_data, lr, n=9, frac=1.0, threshold=0.1)
    train_evaluate(train_data, cv_data, vote_soft, n=9, frac=1.0, threshold=0.1, save=True)

train_data, cv_data = split_data(train)
select_model(train_data, cv_data)

# Model optimization: hyperparameter tuning

# Resample the full training data first
train_data = resample_train_data(train, n=9, frac=1.0)
X_train = train_data.drop('y_buy', axis=1)
y_train = train_data['y_buy']
# Base model 1: random forest
clf = RandomForestClassifier(random_state=0)
cv_split = 5
base_results = model_selection.cross_validate(clf, X_train, y_train, cv=cv_split)
clf.fit(X_train, y_train)
print('BEFORE RF Parameters: ', clf.get_params())
print("BEFORE RF Training w/bin score mean: {:.2f}".format(base_results['train_score'].mean() * 100))
print("BEFORE RF Test w/bin score mean: {:.2f}".format(base_results['test_score'].mean() * 100))
print("BEFORE RF Test w/bin score 3*std: +/- {:.2f}".format(base_results['test_score'].std() * 100 * 3))
print('-' * 10)

# Tune hyperparameters
param_grid = {'criterion': ['gini'],
              'max_depth': [None],       # max depth the tree can grow; default is None
              'min_samples_split': [2],  # minimum subset size BEFORE a new split; default is 2
              'min_samples_leaf': [1],   # minimum subset size AFTER a new split; default is 1
              'max_features': ['auto'],  # max features to consider per split; default is all
              'n_estimators': [10, 100, 200],
              'random_state': [0]}

# Choose the best model with grid search
tune_model = model_selection.GridSearchCV(clf, param_grid=param_grid,
                                          scoring='roc_auc', cv=cv_split)
tune_model.fit(X_train, y_train)
print('AFTER RF Parameters: ', tune_model.best_params_)
print("AFTER RF Training w/bin score mean: {:.2f}".format(
    tune_model.cv_results_['mean_train_score'][tune_model.best_index_] * 100))
print("AFTER RF Test w/bin score mean: {:.2f}".format(
    tune_model.cv_results_['mean_test_score'][tune_model.best_index_] * 100))
print("AFTER RF Test w/bin score 3*std: +/- {:.2f}".format(
    tune_model.cv_results_['std_test_score'][tune_model.best_index_] * 100 * 3))
print('-' * 10)
clf.set_params(**tune_model.best_params_)

# Base model 2: decision tree
dtree = DecisionTreeClassifier(random_state=0)
base_results = model_selection.cross_validate(dtree, X_train, y_train, cv=cv_split)
dtree.fit(X_train, y_train)
print('BEFORE DT Parameters: ', dtree.get_params())
print("BEFORE DT Training w/bin score mean: {:.2f}".format(base_results['train_score'].mean() * 100))
print("BEFORE DT Test w/bin score mean: {:.2f}".format(base_results['test_score'].mean() * 100))
print("BEFORE DT Test w/bin score 3*std: +/- {:.2f}".format(base_results['test_score'].std() * 100 * 3))
print('-' * 10)

# Tune hyperparameters
param_grid = {'criterion': ['entropy'],        # information-gain criterion; default is gini
              'splitter': ['best'],            # splitting strategy; default is best
              'max_depth': [None],             # max depth the tree can grow; default is None
              'min_samples_split': [2],        # minimum subset size BEFORE a new split; default is 2
              'min_samples_leaf': [5],         # minimum subset size AFTER a new split; default is 1
              'max_features': [None, 'auto'],  # max features to consider per split; default is all
              'random_state': [0]}

tune_model = model_selection.GridSearchCV(tree.DecisionTreeClassifier(), param_grid=param_grid,
                                          scoring='roc_auc', cv=cv_split)
tune_model.fit(X_train, y_train)
print('AFTER DT Parameters: ', tune_model.best_params_)
print("AFTER DT Training w/bin score mean: {:.2f}".format(
    tune_model.cv_results_['mean_train_score'][tune_model.best_index_] * 100))
print("AFTER DT Test w/bin score mean: {:.2f}".format(
    tune_model.cv_results_['mean_test_score'][tune_model.best_index_] * 100))
print("AFTER DT Test w/bin score 3*std: +/- {:.2f}".format(
    tune_model.cv_results_['std_test_score'][tune_model.best_index_] * 100 * 3))
print('-' * 10)
dtree.set_params(**tune_model.best_params_)

# Base model 3: logistic regression
lr = LogisticRegression(random_state=0)
base_results = model_selection.cross_validate(lr, X_train, y_train, cv=cv_split)  # original mistakenly evaluated dtree here
lr.fit(X_train, y_train)
print('BEFORE LR Parameters: ', lr.get_params())
print("BEFORE LR Training w/bin score mean: {:.2f}".format(base_results['train_score'].mean() * 100))
print("BEFORE LR Test w/bin score mean: {:.2f}".format(base_results['test_score'].mean() * 100))
print("BEFORE LR Test w/bin score 3*std: +/- {:.2f}".format(base_results['test_score'].std() * 100 * 3))
print('-' * 10)

# Tune hyperparameters
param_grid = {'max_iter': [50, 100, 200],
              'C': [0.1, 0.5, 1.0],
              'penalty': ['l1', 'l2'],
              'random_state': [0]}

tune_model = model_selection.GridSearchCV(LogisticRegression(), param_grid=param_grid,
                                          scoring='roc_auc', cv=cv_split)
tune_model.fit(X_train, y_train)
print('AFTER LR Parameters: ', tune_model.best_params_)
print("AFTER LR Training w/bin score mean: {:.2f}".format(
    tune_model.cv_results_['mean_train_score'][tune_model.best_index_] * 100))
print("AFTER LR Test w/bin score mean: {:.2f}".format(
    tune_model.cv_results_['mean_test_score'][tune_model.best_index_] * 100))
print("AFTER LR Test w/bin score 3*std: +/- {:.2f}".format(
    tune_model.cv_results_['std_test_score'][tune_model.best_index_] * 100 * 3))
print('-' * 10)
lr.set_params(**tune_model.best_params_)

# Evaluate the tuned models
train_data, cv_data = split_data(train)
train_evaluate(train_data, cv_data, dtree, n=9, frac=1.0, threshold=0.1)
train_evaluate(train_data, cv_data, clf, n=9, frac=1.0, threshold=0.1)
train_evaluate(train_data, cv_data, lr, n=9, frac=1.0, threshold=0.1)

# Model ensembling
vote_est = [('rfc', clf), ('lr', lr), ('dtc', dtree)]
vote_soft = VotingClassifier(estimators=vote_est, voting='soft')  # soft vote / majority rules
vote_soft_cv = model_selection.cross_validate(vote_soft, X_train, y_train, cv=cv_split)
vote_soft.fit(X_train, y_train)
print("Soft Voting Training w/bin score mean: {:.2f}".format(vote_soft_cv['train_score'].mean() * 100))
print("Soft Voting Test w/bin score mean: {:.2f}".format(vote_soft_cv['test_score'].mean() * 100))
print("Soft Voting Test w/bin score 3*std: +/- {:.2f}".format(vote_soft_cv['test_score'].std() * 100 * 3))
print('-' * 10)

train_evaluate(train_data, cv_data, vote_soft, n=9, frac=1.0, threshold=0.1)

# Feature selection
# More predictors do not make a better model, but the right predictors do,
# so another modeling step is feature selection. sklearn has several options;
# here we use recursive feature elimination (RFE) with cross-validation (CV).

# Base model
print('BEFORE DT RFE Training Shape Old: ', X_train.shape)
print('BEFORE DT RFE Training Columns Old: ', X_train.columns.values)
print("BEFORE DT RFE Training w/bin score mean: {:.2f}".format(base_results['train_score'].mean() * 100))
print("BEFORE DT RFE Test w/bin score mean: {:.2f}".format(base_results['test_score'].mean() * 100))
print("BEFORE DT RFE Test w/bin score 3*std: +/- {:.2f}".format(base_results['test_score'].std() * 100 * 3))
print('-' * 10)

# Feature selection
dtree_rfe = feature_selection.RFECV(dtree, step=1, scoring='roc_auc', cv=cv_split)
dtree_rfe.fit(X_train, y_train)

# Reduce to the selected features and refit
# (alternative: a Pipeline can combine the fit and transform steps:
#  http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)
X_rfe = X_train.columns.values[dtree_rfe.get_support()]
rfe_results = model_selection.cross_validate(dtree, X_train, y_train, cv=cv_split)
print('AFTER DT RFE Training Shape New: ', train[X_rfe].shape)
print('AFTER DT RFE Training Columns New: ', X_rfe)
print("AFTER DT RFE Training w/bin score mean: {:.2f}".format(rfe_results['train_score'].mean() * 100))
print("AFTER DT RFE Test w/bin score mean: {:.2f}".format(rfe_results['test_score'].mean() * 100))
print("AFTER DT RFE Test w/bin score 3*std: +/- {:.2f}".format(rfe_results['test_score'].std() * 100 * 3))
print('-' * 10)

# Use LightGBM
X_train = X_train.values
y_train = y_train.values

# 5-fold cross-validation
from sklearn.metrics import roc_auc_score
param = {'num_leaves': 120,
         'min_data_in_leaf': 30,
         'objective': 'binary',
         'max_depth': -1,
         'learning_rate': 0.1,
         'min_child_samples': 30,
         'boosting': 'gbdt',
         'feature_fraction': 0.9,
         'bagging_freq': 1,
         'bagging_fraction': 0.9,
         'bagging_seed': 11,
         'metric': {'l2', 'auc'},
         'lambda_l1': 0.1,
         'verbosity': -1}

folds = KFold(n_splits=5, shuffle=True, random_state=2018)
oof_lgb = np.zeros(len(X_train))
predictions_lgb = np.zeros(len(test))
for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train, y_train)):
    print("fold n°{}".format(fold_ + 1))
    trn_data = lgb.Dataset(X_train[trn_idx], y_train[trn_idx])
    val_data = lgb.Dataset(X_train[val_idx], y_train[val_idx])
    num_round = 10000
    clf = lgb.train(param, trn_data, num_round, valid_sets=[trn_data, val_data],
                    verbose_eval=200, early_stopping_rounds=100)
    oof_lgb[val_idx] = clf.predict(X_train[val_idx], num_iteration=clf.best_iteration)
    predictions_lgb += clf.predict(test, num_iteration=clf.best_iteration) / folds.n_splits

# Evaluate the out-of-fold predictions at the moved threshold
report = classification_report(y_train, oof_lgb > 0.1, target_names=['no', 'yes'])
oof_labels = (oof_lgb > 0.1).astype(int)
accuracy = np.mean(y_train == oof_labels)
print("Accuracy: {}".format(accuracy))
print('The ROC AUC on train is:', roc_auc_score(y_train, oof_lgb))
print(report)

# Save results
predict_y = (predictions_lgb > 0.1).astype(int)
final = pd.read_csv('ads_test.csv')
del final['Unnamed: 0']
final['y_buy'] = predict_y
final.to_csv('ads_pre.csv')