Chapter 2: End-to-End Machine Learning Project, Part II


    This post is the second installment of my reading notes on Hands-On Machine Learning with Scikit-Learn and TensorFlow. I am writing them up to consolidate my own learning, and in the hope that they help others who want to study the book. I am still a beginner, so some of my interpretations and translations may be off; please bear with me and feel free to leave suggestions.

    4. Prepare the Data for Machine Learning Algorithms

    Separate the predictors from the target values, so that feature transformations can later be applied to the predictors alone.

    housing = strat_train_set.drop('median_house_value', axis=1)   # drop the target column
    housing_labels = strat_train_set['median_house_value'].copy()  # keep the labels separately

    Data Cleaning

    The total_bedrooms attribute contains missing values. There are three options for handling them (see the code below):

    1. Drop the rows that have missing values.
    2. Drop the whole total_bedrooms attribute.
    3. Fill the missing values with some value (0, the mean, the median, etc.).

    housing.dropna(subset=['total_bedrooms'])    # option 1
    housing.drop('total_bedrooms', axis=1)       # option 2
    median = housing['total_bedrooms'].median()
    housing['total_bedrooms'].fillna(median)     # option 3

    If you choose option 3, remember to fill missing values in the test set with the same median computed on the training set. Scikit-Learn's SimpleImputer (called Imputer before version 0.20) takes care of this for you.

    try:
        from sklearn.impute import SimpleImputer  # Scikit-Learn 0.20+
    except ImportError:
        from sklearn.preprocessing import Imputer as SimpleImputer

    imputer = SimpleImputer(strategy='median')             # create an imputer instance with the median strategy
    housing_num = housing.drop("ocean_proximity", axis=1)  # drop the non-numerical attribute
    imputer.fit(housing_num)                               # fit the imputer instance to the training data
    X = imputer.transform(housing_num)                     # replace missing values with the learned medians
    housing_tr = pd.DataFrame(X, columns=housing_num.columns,
                              index=housing.index)         # convert the NumPy array back into a pandas DataFrame
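    You can check the medians that the imputer learned against the ones computed directly with pandas; both lines below follow the book's notebook.

    imputer.statistics_          # the medians learned during fit
    housing_num.median().values  # the same values computed directly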

    Handling Text and Categorical Attributes

    Use Scikit-Learn's OrdinalEncoder to convert text categories into numbers (the book's first edition used LabelEncoder, but that class is meant for target labels, not input features).

    try:
        from sklearn.preprocessing import OrdinalEncoder
    except ImportError:
        from future_encoders import OrdinalEncoder  # Scikit-Learn < 0.20

    housing_cat = housing[['ocean_proximity']]  # the categorical attribute as a 2D DataFrame
    ordinal_encoder = OrdinalEncoder()
    housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
    ordinal_encoder.categories_
    # [array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'], dtype=object)]

    Scikit-Learn also provides OneHotEncoder, which converts categorical text values into one-hot vectors.

    try:
        from sklearn.preprocessing import OrdinalEncoder  # just to raise an ImportError if Scikit-Learn < 0.20
        from sklearn.preprocessing import OneHotEncoder
    except ImportError:
        from future_encoders import OneHotEncoder  # Scikit-Learn < 0.20

    cat_encoder = OneHotEncoder()
    housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
    housing_cat_1hot
    # <16512x5 sparse matrix of type '<class 'numpy.float64'>'
    #  with 16512 stored elements in Compressed Sparse Row format>

    The resulting housing_cat_1hot is a SciPy sparse matrix, not a NumPy array; call its toarray() method to convert it into a dense NumPy array.

    housing_cat_1hot.toarray()

    Custom Transformers

    Scikit-Learn's FunctionTransformer class builds a transformer from an ordinary transformation function.

    import numpy as np
    from sklearn.preprocessing import FunctionTransformer

    # column indices of the attributes used below (their positions in housing.values)
    rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

    def add_extra_features(X, add_bedrooms_per_room=True):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
        return np.c_[X, rooms_per_household, population_per_household]

    attr_adder = FunctionTransformer(add_extra_features, validate=False,
                                     kw_args={'add_bedrooms_per_room': False})
    housing_extra_attribs = attr_adder.fit_transform(housing.values)

    Feature Scaling

    Machine learning algorithms rarely perform well when the input features have very different scales. In our dataset, total_rooms ranges over roughly [6, 39320] while median_income ranges over [0, 15]. Note that the target values generally do not need to be scaled. The two most common scaling methods are min-max scaling and standardization.

    Min-max scaling: (num - min) / (max - min), which rescales values into the range (0, 1). Scikit-Learn provides MinMaxScaler for this.

    Standardization: (num - mean) / standard_deviation, which gives the data zero mean and unit variance. Unlike min-max scaling, it does not bound values to a fixed range, which can be a problem for models such as neural networks that often expect inputs in the 0-1 range. On the other hand, standardization is much less affected by outliers: if median_income contained an erroneous value of 100, min-max scaling would squeeze all the other values into (0, 0.15) and distort the overall distribution. Scikit-Learn provides StandardScaler for standardization.
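    As a minimal sketch (not from the book's notebook), both scalers share the usual fit/transform API; here they are applied to the imputed numerical attributes housing_tr from above, with illustrative variable names:

    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    min_max_scaler = MinMaxScaler()   # rescales each feature into [0, 1]
    housing_num_minmax = min_max_scaler.fit_transform(housing_tr)

    std_scaler = StandardScaler()     # zero mean, unit variance per feature
    housing_num_std = std_scaler.fit_transform(housing_tr)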

    Transformation Pipelines

    Scikit-Learn provides the Pipeline class to chain transformations, so that each one runs in sequence.

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # reuse the SimpleImputer and add_extra_features defined above
    num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('attribs_adder', FunctionTransformer(add_extra_features, validate=False)),
        ('std_scaler', StandardScaler()),
    ])
    housing_num_tr = num_pipeline.fit_transform(housing_num)

    Similarly, we can set up a transformer pipeline for the categorical attributes. Scikit-Learn's ColumnTransformer class combines numerical and categorical transformers: you pass it a list of (name, transformer, columns) triples, and when fit() or transform() is called it runs each transformer on its columns in parallel, waits for their results, and finally concatenates them into a single output.

    try:
        from sklearn.compose import ColumnTransformer
    except ImportError:
        from future_encoders import ColumnTransformer

    num_attribs = list(housing_num)
    cat_attribs = ['ocean_proximity']
    full_pipeline = ColumnTransformer([
        ('num', num_pipeline, num_attribs),
        ('cat', OneHotEncoder(), cat_attribs),
    ])
    housing_prepared = full_pipeline.fit_transform(housing)

    5. Train and Evaluate on the Training Set

    from sklearn.linear_model import LinearRegression

    lin_reg = LinearRegression()
    lin_reg.fit(housing_prepared, housing_labels)

    Check the predictions on a few training instances.

    # try it out on some training instances
    some_data = housing.iloc[:5]
    some_labels = housing_labels.iloc[:5]
    some_data_prepared = full_pipeline.transform(some_data)  # apply the full preparation pipeline
    print('Predictions:\t\t', lin_reg.predict(some_data_prepared))
    print('Labels:\t\t', list(some_labels))

    Compute the RMSE with Scikit-Learn's mean_squared_error function.

    # calculate the mean squared error, then take the square root
    from sklearn.metrics import mean_squared_error

    housing_predictions = lin_reg.predict(housing_prepared)
    lin_mse = mean_squared_error(housing_labels, housing_predictions)
    lin_rmse = np.sqrt(lin_mse)

    Compute the MAE with Scikit-Learn's mean_absolute_error function.

    # calculate the mean absolute error
    from sklearn.metrics import mean_absolute_error

    lin_mae = mean_absolute_error(housing_labels, housing_predictions)

    Both the RMSE and the MAE turn out to be quite large, which suggests the model is underfitting. Let's try fitting the data with a decision tree instead.

    from sklearn.tree import DecisionTreeRegressor

    tree_reg = DecisionTreeRegressor(random_state=42)
    tree_reg.fit(housing_prepared, housing_labels)
    housing_predictions = tree_reg.predict(housing_prepared)
    tree_mse = mean_squared_error(housing_labels, housing_predictions)
    tree_rmse = np.sqrt(tree_mse)

    The resulting tree_rmse is 0. Clearly, the decision tree has badly overfit the training data.

    6. Fine-Tune Your Model

    Compute cross-validation scores. Note that Scikit-Learn's cross-validation expects a utility function (greater is better) rather than a cost function, so the scoring parameter is the negative of the MSE; negate the scores before taking the square root.

    from sklearn.model_selection import cross_val_score

    scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                             scoring='neg_mean_squared_error', cv=10)
    tree_rmse_scores = np.sqrt(-scores)
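    The display_scores helper used in the following snippets is not defined in this excerpt; the book defines it like this:

    def display_scores(scores):
        print('Scores:', scores)
        print('Mean:', scores.mean())
        print('Standard deviation:', scores.std())

    display_scores(tree_rmse_scores)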

    Compute the cross-validation scores for linear regression.

    # cross-validation scores for linear regression
    lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels,
                                 scoring='neg_mean_squared_error', cv=10)
    lin_rmse_scores = np.sqrt(-lin_scores)
    display_scores(lin_rmse_scores)

    Try a random forest for prediction, and compute its cross-validation scores as well.

    # choose Random Forest as the regressor
    from sklearn.ensemble import RandomForestRegressor

    forest_reg = RandomForestRegressor(n_estimators=10, random_state=42)
    forest_reg.fit(housing_prepared, housing_labels)

    # training-set RMSE for the Random Forest regressor
    housing_predictions = forest_reg.predict(housing_prepared)
    forest_mse = mean_squared_error(housing_labels, housing_predictions)
    forest_rmse = np.sqrt(forest_mse)

    # cross-validation scores
    forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
                                    scoring='neg_mean_squared_error', cv=10)
    forest_rmse_scores = np.sqrt(-forest_scores)
    display_scores(forest_rmse_scores)

    Fit a support vector machine regressor with a linear kernel, and compute its RMSE.

    from sklearn.svm import SVR

    svm_reg = SVR(kernel='linear')
    svm_reg.fit(housing_prepared, housing_labels)
    housing_predictions = svm_reg.predict(housing_prepared)
    svm_mse = mean_squared_error(housing_labels, housing_predictions)
    svm_rmse = np.sqrt(svm_mse)

    Use Scikit-Learn's GridSearchCV to help select good hyperparameters.

    from sklearn.model_selection import GridSearchCV

    param_grid = [
        # try 12 (3x4) combinations of hyperparameters
        {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
        # then try 6 (2x3) combinations with bootstrap set to False
        {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
    ]
    forest_reg = RandomForestRegressor(random_state=42)
    # train across 5 folds, for a total of (12 + 6) * 5 = 90 rounds of training
    grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                               scoring='neg_mean_squared_error',
                               return_train_score=True)
    grid_search.fit(housing_prepared, housing_labels)

    Print the best hyperparameter combination and the best estimator, then look at the score of every combination tried during the search.

    grid_search.best_params_     # the best hyperparameter combination
    grid_search.best_estimator_  # the estimator refit with those hyperparameters

    # look at the score of each hyperparameter combination tested during the grid search
    cvres = grid_search.cv_results_
    for mean_score, params in zip(cvres['mean_test_score'], cvres['params']):
        print(np.sqrt(-mean_score), params)

    # display the results as a DataFrame
    pd.DataFrame(grid_search.cv_results_)

    Use randomized search for hyperparameter selection instead; it samples a fixed number of combinations from the given distributions.

    from sklearn.model_selection import RandomizedSearchCV
    from scipy.stats import randint

    param_distribs = {
        'n_estimators': randint(low=1, high=200),
        'max_features': randint(low=1, high=8),
    }
    forest_reg = RandomForestRegressor(random_state=42)
    rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                    n_iter=10, cv=5,
                                    scoring='neg_mean_squared_error', random_state=42)
    rnd_search.fit(housing_prepared, housing_labels)
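    As with grid search, you can then inspect the score of each sampled combination (this mirrors the book's notebook):

    cvres = rnd_search.cv_results_
    for mean_score, params in zip(cvres['mean_test_score'], cvres['params']):
        print(np.sqrt(-mean_score), params)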

    Print the relative importance of each attribute for making accurate predictions.

    feature_importances = grid_search.best_estimator_.feature_importances_
    extra_attribs = ['rooms_per_hhold', 'pop_per_hhold', 'bedrooms_per_room']
    cat_encoder = full_pipeline.named_transformers_['cat']
    cat_one_hot_attribs = list(cat_encoder.categories_[0])
    attributes = num_attribs + extra_attribs + cat_one_hot_attribs
    sorted(zip(feature_importances, attributes), reverse=True)

    With this information, you may consider dropping some of the less important features.

    Evaluate the System on the Test Set

    final_model = grid_search.best_estimator_

    X_test = strat_test_set.drop('median_house_value', axis=1)
    y_test = strat_test_set['median_house_value'].copy()

    X_test_prepared = full_pipeline.transform(X_test)  # transform only; never fit on the test set
    final_predictions = final_model.predict(X_test_prepared)

    final_mse = mean_squared_error(y_test, final_predictions)
    final_rmse = np.sqrt(final_mse)

    Compute a 95% confidence interval for the test-set RMSE.

    # we can compute a 95% confidence interval for the test RMSE
    from scipy import stats

    confidence = 0.95
    squared_errors = (final_predictions - y_test) ** 2
    mean = squared_errors.mean()
    m = len(squared_errors)

    np.sqrt(stats.t.interval(confidence, m - 1,
                             loc=np.mean(squared_errors),
                             scale=stats.sem(squared_errors)))

    # we could also compute the interval manually like this
    tscore = stats.t.ppf((1 + confidence) / 2, df=m - 1)
    tmargin = tscore * squared_errors.std(ddof=1) / np.sqrt(m)
    np.sqrt(mean - tmargin), np.sqrt(mean + tmargin)

    # alternatively, we could use z-scores rather than t-scores
    zscore = stats.norm.ppf((1 + confidence) / 2)
    zmargin = zscore * squared_errors.std(ddof=1) / np.sqrt(m)
    np.sqrt(mean - zmargin), np.sqrt(mean + zmargin)

    A Full Pipeline Combining Data Preparation and Prediction

    full_pipeline_with_predictor = Pipeline([
        ('preparation', full_pipeline),
        ('linear', LinearRegression()),
    ])
    full_pipeline_with_predictor.fit(housing, housing_labels)
    full_pipeline_with_predictor.predict(some_data)

    Save the Model with joblib

    my_model = full_pipeline_with_predictor

    from sklearn.externals import joblib
    joblib.dump(my_model, 'my_model.pkl')          # save the model
    my_model_loaded = joblib.load('my_model.pkl')  # load the model
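    Note that sklearn.externals.joblib was deprecated in Scikit-Learn 0.21 and later removed; with recent versions, install and import joblib directly:

    import joblib  # pip install joblib

    joblib.dump(my_model, 'my_model.pkl')
    my_model_loaded = joblib.load('my_model.pkl')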

    7. Launch, Monitor, and Maintain Your System

    Write monitoring code to check your system's live performance and raise alerts when it drops. To judge prediction quality you will need to sample the system's predictions and evaluate them, which may require human analysis. Also evaluate the quality of the system's input data regularly, and retrain your models on fresh data periodically.

    8. Exercise Solutions

    1. Question: Build an SVM regressor, trying several hyperparameter settings, such as kernel='linear' (with various values of C) or kernel='rbf' (with various values of C and gamma).

    from sklearn.model_selection import GridSearchCV

    param_grid = [
        {'kernel': ['linear'],
         'C': [10., 30., 100., 300., 1000., 3000., 10000., 30000.0]},
        {'kernel': ['rbf'],
         'C': [1.0, 3.0, 10., 30., 100., 300., 1000.0],
         'gamma': [0.01, 0.03, 0.1, 0.3, 1.0, 3.0]},
    ]
    svm_reg = SVR()
    grid_search = GridSearchCV(svm_reg, param_grid, cv=5,
                               scoring='neg_mean_squared_error',
                               verbose=2, n_jobs=4)
    grid_search.fit(housing_prepared, housing_labels)

    negative_mse = grid_search.best_score_
    rmse = np.sqrt(-negative_mse)

    2. Question: Replace GridSearchCV with RandomizedSearchCV.

    from sklearn.model_selection import RandomizedSearchCV
    from scipy.stats import expon, reciprocal

    param_distribs = {
        'kernel': ['linear', 'rbf'],
        'C': reciprocal(20, 200000),
        'gamma': expon(scale=1.0),
    }
    svm_reg = SVR()
    rnd_search = RandomizedSearchCV(svm_reg, param_distributions=param_distribs,
                                    n_iter=50, cv=5,
                                    scoring='neg_mean_squared_error',
                                    verbose=2, n_jobs=4, random_state=42)
    rnd_search.fit(housing_prepared, housing_labels)

    negative_mse = rnd_search.best_score_
    rmse = np.sqrt(-negative_mse)

    3. Question: Add a transformer to the preparation pipeline that selects only the most important features.

    from sklearn.base import BaseEstimator, TransformerMixin

    def indices_of_top_k(arr, k):
        # np.argpartition(arr, -k) partially orders the indices so that the
        # k largest elements (ties included) come last; take those k indices
        # and return them sorted in ascending order
        return np.sort(np.argpartition(np.array(arr), -k)[-k:])

    class TopFeatureSelector(BaseEstimator, TransformerMixin):
        def __init__(self, feature_importances, k):
            self.feature_importances = feature_importances
            self.k = k
        def fit(self, X, y=None):
            self.feature_indices = indices_of_top_k(self.feature_importances, self.k)
            return self
        def transform(self, X):
            return X[:, self.feature_indices]

    Find the indices of the k most important features:

    k = 5
    top_k_feature_indices = indices_of_top_k(feature_importances, k)
    top_k_feature_indices
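    To sanity-check, you can list the names of these top-k features (as in the book's notebook):

    np.array(attributes)[top_k_feature_indices]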

    preparation_and_feature_selection_pipeline = Pipeline([
        ('preparation', full_pipeline),
        ('feature_selection', TopFeatureSelector(feature_importances, k)),
    ])
    housing_prepared_top_k_features = preparation_and_feature_selection_pipeline.fit_transform(housing)

    4. Question: Create a single pipeline that does the full data preparation plus the final prediction.

    prepare_select_and_predict_pipeline = Pipeline([
        ('preparation', full_pipeline),
        ('feature_selection', TopFeatureSelector(feature_importances, k)),
        ('svm_reg', SVR(**rnd_search.best_params_)),
    ])
    prepare_select_and_predict_pipeline.fit(housing, housing_labels)
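    You can then try the full pipeline on a few instances (following the book's notebook):

    some_data = housing.iloc[:4]
    some_labels = housing_labels.iloc[:4]
    print('Predictions:\t', prepare_select_and_predict_pipeline.predict(some_data))
    print('Labels:\t\t', list(some_labels))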

    5. Question: Automatically explore some data preparation options using GridSearchCV.

    param_grid = [{
        'preparation__num__imputer__strategy': ['mean', 'median', 'most_frequent'],
        'feature_selection__k': list(range(1, len(feature_importances) + 1)),
    }]
    grid_search_prep = GridSearchCV(prepare_select_and_predict_pipeline, param_grid, cv=5,
                                    scoring='neg_mean_squared_error',
                                    verbose=2, n_jobs=4)
    grid_search_prep.fit(housing, housing_labels)
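    The winning preparation options can then be read off the fitted search object:

    grid_search_prep.best_params_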

    Code

    I ran all of the book's code under Python 3, made sure it runs without bugs, and added comments throughout to make it easier to follow. The original dataset and code are available on the book's website, and my own runs are on my GitHub.
