Start by working out which features exist and what each one means (read the feature documentation alongside `head()`). Then run `describe()` on the continuous variable to be predicted to get an intuitive feel for it.
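As a minimal sketch (assuming the Kaggle House Prices training data lives in `train.csv`; these imports are also assumed by all the snippets below):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the training data (path assumed)
df_train = pd.read_csv('train.csv')

# Inspect the first rows alongside the feature documentation
print(df_train.head())

# Summary statistics of the continuous target
print(df_train['SalePrice'].describe())
```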
**First,** guided by intuition, plot the numeric features and the categorical features separately to see how each relates to the label. For numeric features, draw scatter plots of feature versus label to gauge how important each feature is, e.g.
```python
var = 'GrLivArea'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0, 800000));
```

For categorical features, look at how the label changes across categories, e.g.
```python
var = 'OverallQual'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);
```

Next, use a correlation heatmap (`df_train.corr()` computes correlations, not covariances) to see how the variables relate to one another:
```python
corrmat = df_train.corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True);
```

Then pick out the most important features and analyze them again with a zoomed-in correlation heatmap:
```python
k = 10  # number of variables for heatmap
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(df_train[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f',
                 annot_kws={'size': 10}, yticklabels=cols.values,
                 xticklabels=cols.values)
plt.show()
```

Draw pairwise scatter plots for the features of interest; some of the panels are visibly symmetric:
```python
sns.set()
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars',
        'TotalBsmtSF', 'FullBath', 'YearBuilt']
sns.pairplot(df_train[cols], height=2.5)  # 'size' was renamed to 'height' in seaborn 0.9
plt.show();
```

Get a basic picture of the missing values: which features have them, and how much is missing:
```python
total = df_train.isnull().sum().sort_values(ascending=False)
percent = (df_train.isnull().sum() / df_train.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)
```

When more than 15% of a feature is missing, don't try to fill it in; drop the whole column instead, e.g. 'PoolQC', 'MiscFeature' and 'FireplaceQu' (debatable). Among the remaining features with missing values, those that can be ruled out on other grounds are not imputed either: the 'GarageX' and 'BsmtX' families, for example, are judged replaceable because other features already capture the same information. (So the feature importance observed in the earlier analysis is what drives how each feature is handled.) And when a feature is missing in only a single sample, delete that sample rather than imputing the feature.
```python
# Dealing with missing data
df_train = df_train.drop((missing_data[missing_data['Total'] > 1]).index, axis=1)
df_train = df_train.drop(df_train.loc[df_train['Electrical'].isnull()].index)
df_train.isnull().sum().max()  # just checking that there's no missing data missing...
```

Univariate analysis: standardize the label and display the ten lowest and ten highest values, checking whether the extremes show a clear outlier trend.
```python
# Standardizing data
from sklearn.preprocessing import StandardScaler

saleprice_scaled = StandardScaler().fit_transform(df_train['SalePrice'].values[:, np.newaxis])
low_range = saleprice_scaled[saleprice_scaled[:, 0].argsort()][:10]
high_range = saleprice_scaled[saleprice_scaled[:, 0].argsort()][-10:]
print('outer range (low) of the distribution:')
print(low_range)
print('\nouter range (high) of the distribution:')
print(high_range)
```

Bivariate analysis: the two points in the first scatter plot (GrLivArea vs. SalePrice) are clearly outliers and can be deleted.
```python
# Deleting points
df_train.sort_values(by='GrLivArea', ascending=False)[:2]
df_train = df_train.drop(df_train[df_train['Id'] == 1299].index)
df_train = df_train.drop(df_train[df_train['Id'] == 524].index)
```

The data should satisfy four assumptions: (1) normality; (2) homoscedasticity; (3) linearity of most feature relationships; (4) absence of correlated errors.

1. Inspect how the data are distributed:
```python
# Histogram and normal probability plot
from scipy import stats
from scipy.stats import norm

sns.distplot(df_train['SalePrice'], fit=norm);
fig = plt.figure()
res = stats.probplot(df_train['SalePrice'], plot=plt)
```

The plots show severe skewness (positive, i.e. shifted with a long right tail) along with peakedness. In this case a log transform can be applied to bring the distribution close to normal (this is just the author's device; whether it generalizes is debatable).
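To put numbers on what the plots show before transforming, the skewness and kurtosis of the target can be printed directly (a small sketch, assuming `df_train` as above):

```python
# Quantify the deviation from normality seen in the plots
print("Skewness: %f" % df_train['SalePrice'].skew())
print("Kurtosis: %f" % df_train['SalePrice'].kurt())
```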
```python
# Applying log transformation
df_train['SalePrice'] = np.log(df_train['SalePrice'])

# Transformed histogram and normal probability plot
sns.distplot(df_train['SalePrice'], fit=norm);
fig = plt.figure()
res = stats.probplot(df_train['SalePrice'], plot=plt)
```

This transformation also largely resolves the homoscedasticity problem: the resulting scatter plots no longer have a cone shape.
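A quick way to verify this claim (a sketch; it assumes skewed features such as 'GrLivArea' have been log-transformed in the same way where needed) is to redraw the feature-vs-label scatter and check that the spread looks roughly constant:

```python
# Re-draw the scatter; with the transformed variables the spread
# should no longer widen into a cone as GrLivArea grows
plt.scatter(df_train['GrLivArea'], df_train['SalePrice'])
plt.xlabel('GrLivArea')
plt.ylabel('log SalePrice')
plt.show()
```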
The analysis reveals skew in the data, which motivates several optimizations of the learning setup. First, optimize the data itself (e.g. represent age in discrete bins), and fill in the `Age` feature from other attributes. (In the accompanying figure, the colored areas show the age distributions and the black line marks the degree of skew.)
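The snippets below assume a `Title` column distilled from the passengers' names (`df_raw` is the raw frame, `df` the working copy). A hypothetical sketch of how the title could be derived, together with the age binning mentioned above (the regex, bin edges, and labels are illustrative, not the original author's):

```python
# Extract a title (Mr, Mrs, Miss, ...) from the Name column: take the
# word ending in '.' that follows the surname
df['Title'] = df_raw['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

# Represent age in discrete bins instead of a raw number
df['AgeBin'] = pd.cut(df['Age'], bins=[0, 12, 18, 35, 60, 100],
                      labels=['child', 'teen', 'young', 'adult', 'senior'])
```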
Plotting two variables against each other also works for observing their relation to the output variable:

```python
# Plot bar plot (titles, age and sex)
plt.figure(figsize=(15, 5))
sns.barplot(x=df['Title'], y=df_raw['Age']);
```

Fill in the samples with missing age (by title):

```python
# Means per title
df_raw['Title'] = df['Title']  # to simplify data handling
means = df_raw.groupby('Title')['Age'].mean()

# Transform means into a dictionary for future mapping
map_means = means.to_dict()

# Impute ages based on titles (single .loc assignment; the original
# chained indexing would not write back to df)
idx_nan_age = df.loc[np.isnan(df['Age'])].index
df.loc[idx_nan_age, 'Age'] = df['Title'].loc[idx_nan_age].map(map_means)
```

Examine how one feature affects the others:
```python
# Compare with other variables
df.groupby(['Embarked']).mean()
```

Second, optimize the features. Convert all non-numeric columns to categorical, then create dummy variables:
```python
# Transform object into categorical
df['Embarked'] = pd.Categorical(df['Embarked'])
df['Pclass'] = pd.Categorical(df['Pclass'])

# Transform categorical features into dummy variables
df = pd.get_dummies(df, drop_first=True)
df.head()
```

Use a Box-Cox transformation for non-linear variables:
```python
# Apply Box-Cox transformation
from scipy.stats import boxcox

X_train_transformed = X_train.copy()
X_train_transformed['Fare'] = boxcox(X_train_transformed['Fare'] + 1)[0]
X_test_transformed = X_test.copy()
X_test_transformed['Fare'] = boxcox(X_test_transformed['Fare'] + 1)[0]
```

Create more features through polynomials. First, rescale the existing features:
```python
# Rescale data
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train_transformed_scaled = scaler.fit_transform(X_train_transformed)
X_test_transformed_scaled = scaler.transform(X_test_transformed)
```

Then generate new features via polynomials:
```python
# Get polynomial features (fit on the same scaled data that will be transformed)
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2).fit(X_train_transformed_scaled)
X_train_poly = poly.transform(X_train_transformed_scaled)
X_test_poly = poly.transform(X_test_transformed_scaled)
```

Finally, select the features:
```python
# Select features using chi-squared test
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

## Get score using original model
logreg = LogisticRegression(C=1)
logreg.fit(X_train, y_train)
scores = cross_val_score(logreg, X_train, y_train, cv=10)
print('CV accuracy (original): %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))
highest_score = np.mean(scores)
std = np.std(scores)  # initialize so the tie-breaking branch below is safe

## Get score using models with feature selection
for i in range(1, X_train_poly.shape[1] + 1):
    # Select i features
    select = SelectKBest(score_func=chi2, k=i)
    select.fit(X_train_poly, y_train)
    X_train_poly_selected = select.transform(X_train_poly)

    # Model with i features selected
    logreg.fit(X_train_poly_selected, y_train)
    scores = cross_val_score(logreg, X_train_poly_selected, y_train, cv=10)
    print('CV accuracy (number of features = %i): %.3f +/- %.3f' %
          (i, np.mean(scores), np.std(scores)))

    # Save results if best score (ties broken by lower standard deviation)
    if np.mean(scores) > highest_score:
        highest_score = np.mean(scores)
        std = np.std(scores)
        k_features_highest_score = i
    elif np.mean(scores) == highest_score and np.std(scores) < std:
        std = np.std(scores)
        k_features_highest_score = i

# Print the number of features
print('Number of features when highest score: %i' % k_features_highest_score)
```

Third, optimize the algorithm.
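The notes stop here. As a hedged illustration of what optimizing the algorithm could look like in this setup (the grid values and CV splits are assumptions for illustration, not the original author's choices):

```python
# A minimal sketch: tune the regularization strength of the same
# logistic regression with a grid search over assumed candidate values
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=10)
grid.fit(X_train, y_train)  # original features, for simplicity
print('Best C: %s, CV accuracy: %.3f' % (grid.best_params_['C'], grid.best_score_))
```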