In his book Think and Grow Rich, Napoleon Hill recounts the story of Darby, who mined for gold for years only to walk away one step short of the vein, missing the treasure.
Douban Read link for the Chinese edition of Think and Grow Rich:
http://read.douban.com/reader/ebook/10954762/
The retelling here is adapted from the book.
Now, I don't know whether that story is true, but I do know there are plenty of "data Darbys" around me. These people understand the purpose and mechanics of machine learning, yet apply only two or three algorithms to every research problem. They don't update themselves with better algorithms and techniques, either because they are too stubborn or because they are simply putting in time without seeking progress.
Like Darby, these people give up just short of the finish line. They end up abandoning machine learning with excuses such as heavy computation, too much difficulty, or the inability to set a proper threshold to tune the model. What is the point of that? Have you met people like this?
The cheat sheet presented today aims to change these "data Darbys'" attitude toward machine learning and turn them into hands-on practitioners. It collects the 10 most commonly used machine learning algorithms, with Python and R code for each.
Given how widely machine learning methods are now used in modeling, the cheat sheet below can serve as a code guide to help you put these algorithms to work. Good luck!
For the truly lazy data Darbys, we will make your life even easier: you can download a PDF version of the cheat sheet here and copy-paste the code directly.
Machine Learning Algorithms
Types:
- Supervised learning: Decision Tree, K-Nearest Neighbors, Random Forest, Logistic Regression
- Unsupervised learning: Apriori algorithm, K-Means, Hierarchical Clustering
- Reinforcement learning: Markov Decision Process, Q-Learning
Linear Regression
Python code:
#Import Library
#Import other necessary libraries like pandas, numpy...
from sklearn import linear_model
#Load Train and Test datasets
#Identify feature and response variable(s);
#values must be numeric and numpy arrays
x_train = input_variables_values_training_datasets
y_train = target_variables_values_training_datasets
x_test = input_variables_values_test_datasets
#Create linear regression object
linear = linear_model.LinearRegression()
#Train the model using the training sets and check score
linear.fit(x_train, y_train)
linear.score(x_train, y_train)
#Equation coefficient and intercept
print('Coefficient: \n', linear.coef_)
print('Intercept: \n', linear.intercept_)
#Predict Output
predicted = linear.predict(x_test)
R code:
#Load Train and Test datasets
#Identify feature and response variable(s);
#values must be numeric
x_train <- input_variables_values_training_datasets
y_train <- target_variables_values_training_datasets
x_test <- input_variables_values_test_datasets
x <- cbind(x_train, y_train)
#Train the model using the training sets and check score
linear <- lm(y_train ~ ., data = x)
summary(linear)
#Predict Output
predicted <- predict(linear, x_test)
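The snippets above use placeholder variable names (input_variables_values_training_datasets and friends). As a sanity check, here is a minimal runnable Python sketch; the synthetic data and its coefficients are invented purely for illustration:

import numpy as np
from sklearn import linear_model

#Synthetic data: y = 3x + 2 plus noise (invented for illustration)
rng = np.random.RandomState(0)
x_train = rng.rand(100, 1)
y_train = 3 * x_train.ravel() + 2 + 0.1 * rng.randn(100)
x_test = rng.rand(10, 1)

linear = linear_model.LinearRegression()
linear.fit(x_train, y_train)
print('R^2 on training data:', linear.score(x_train, y_train))
print('Coefficient:', linear.coef_, 'Intercept:', linear.intercept_)
predicted = linear.predict(x_test)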
Logistic Regression
Python code:
#Import Library
from sklearn.linear_model import LogisticRegression
#Assumed you have, X (predictor) and y (target) for
#training data set and x_test (predictor) of test_dataset
#Create logistic regression object
model = LogisticRegression()
#Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)
#Equation coefficient and intercept
print('Coefficient: \n', model.coef_)
print('Intercept: \n', model.intercept_)
#Predict Output
predicted = model.predict(x_test)
R code:
x <- cbind(x_train, y_train)
#Train the model using the training sets and check score
logistic <- glm(y_train ~ ., data = x, family = 'binomial')
summary(logistic)
#Predict Output (use type = "response" for probabilities)
predicted <- predict(logistic, x_test)
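To see the Python snippet end to end, here is a minimal runnable sketch on synthetic binary-classification data (the labels are just the sign of a linear combination; everything here is invented for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression

#Synthetic two-feature data; class is 1 when the features sum to a positive number
rng = np.random.RandomState(0)
X = rng.randn(200, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)
x_test = rng.randn(5, 2)

model = LogisticRegression()
model.fit(X, y)
print('Training accuracy:', model.score(X, y))
predicted = model.predict(x_test)
probabilities = model.predict_proba(x_test)  #per-class probabilities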
Decision Tree
Python code:
#Import Library
#Import other necessary libraries like pandas, numpy...
from sklearn import tree
#Assumed you have, X (predictor) and y (target) for
#training data set and x_test (predictor) of test_dataset
#Create tree object
model = tree.DecisionTreeClassifier(criterion='gini')
#for classification, here you can set the criterion to
#gini or entropy (information gain); by default it is gini
#model = tree.DecisionTreeRegressor() for regression
#Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)
#Predict Output
predicted = model.predict(x_test)
R code:
#Import Library
library(rpart)
x <- cbind(x_train, y_train)
#grow tree
fit <- rpart(y_train ~ ., data = x, method = "class")
summary(fit)
#Predict Output
predicted <- predict(fit, x_test)
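A minimal runnable version of the Python snippet, using the iris dataset that ships with scikit-learn (the max_depth value is an arbitrary choice for illustration):

from sklearn import tree
from sklearn.datasets import load_iris

#iris ships with scikit-learn, so this runs as-is
iris = load_iris()
X, y = iris.data, iris.target

model = tree.DecisionTreeClassifier(criterion='gini', max_depth=3)
model.fit(X, y)
print('Training accuracy:', model.score(X, y))
predicted = model.predict(X[:5])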
Support Vector Machine (SVM)
Python code:
#Import Library
from sklearn import svm
#Assumed you have, X (predictor) and y (target) for
#training data set and x_test (predictor) of test_dataset
#Create SVM classification object
model = svm.SVC()
#there are various options associated with it;
#this is simple classification.
#Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)
#Predict Output
predicted = model.predict(x_test)
R code:
#Import Library
library(e1071)
x <- cbind(x_train, y_train)
#Fitting model
fit <- svm(y_train ~ ., data = x)
summary(fit)
#Predict Output
predicted <- predict(fit, x_test)
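Here is a runnable sketch of the Python snippet on synthetic data; the kernel and C values are shown explicitly only for illustration (the defaults would also work):

from sklearn import svm
from sklearn.datasets import make_classification

#Synthetic classification data from scikit-learn's own helper
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

model = svm.SVC(kernel='rbf', C=1.0)
model.fit(X, y)
print('Training accuracy:', model.score(X, y))
predicted = model.predict(X[:5])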
Naive Bayes
Python code:
#Import Library
from sklearn.naive_bayes import GaussianNB
#Assumed you have, X (predictor) and y (target) for
#training data set and x_test (predictor) of test_dataset
#Create Gaussian Naive Bayes object
model = GaussianNB()
#there are other distributions for multinomial classes,
#such as Bernoulli Naive Bayes
#Train the model using the training sets and check score
model.fit(X, y)
#Predict Output
predicted = model.predict(x_test)
R code:
#Import Library
library(e1071)
x <- cbind(x_train, y_train)
#Fitting model
fit <- naiveBayes(y_train ~ ., data = x)
summary(fit)
#Predict Output
predicted <- predict(fit, x_test)
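A minimal runnable sketch of the Gaussian Naive Bayes snippet; the two Gaussian blobs below are invented purely for illustration:

import numpy as np
from sklearn.naive_bayes import GaussianNB

#Two synthetic classes: Gaussian blobs centered at (2, 2) and (-2, -2)
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(100, 2) + 2, rng.randn(100, 2) - 2])
y = np.array([0] * 100 + [1] * 100)

model = GaussianNB()
model.fit(X, y)
predicted = model.predict([[1.5, 1.5], [-2.0, -1.0]])
print(predicted)  #expect class 0 then class 1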
K-Nearest Neighbors (kNN)
Python code:
#Import Library
from sklearn.neighbors import KNeighborsClassifier
#Assumed you have, X (predictor) and y (target) for
#training data set and x_test (predictor) of test_dataset
#Create KNeighbors classifier object
model = KNeighborsClassifier(n_neighbors=6)
#default value for n_neighbors is 5
#Train the model using the training sets and check score
model.fit(X, y)
#Predict Output
predicted = model.predict(x_test)
R code:
#Import Library (the knn function lives in the class package)
library(class)
#knn() takes the training matrix, test matrix and class labels
#directly, and returns predictions for the test set
predicted <- knn(train = x_train, test = x_test, cl = y_train, k = 5)
summary(predicted)
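A runnable sketch of the Python snippet on the bundled iris dataset (n_neighbors=6 simply mirrors the cheat sheet; 5 is the default):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

model = KNeighborsClassifier(n_neighbors=6)
model.fit(X, y)
print('Training accuracy:', model.score(X, y))
predicted = model.predict(X[:5])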
K-Means Clustering
Python code:
#Import Library
from sklearn.cluster import KMeans
#Assumed you have, X (attributes) for training data set
#and x_test (attributes) of test_dataset
#Create KMeans object
model = KMeans(n_clusters=3, random_state=0)
#Train the model using the training sets
model.fit(X)
#Predict Output
predicted = model.predict(x_test)
R code:
#Import Library
library(cluster)
#3 cluster solution
fit <- kmeans(X, 3)
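To see the Python snippet run end to end, here is a sketch on three synthetic blobs (the blob centers and the n_init value are arbitrary choices for illustration):

import numpy as np
from sklearn.cluster import KMeans

#Three synthetic clusters around (0,0), (5,5) and (0,5)
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + c for c in ([0, 0], [5, 5], [0, 5])])

model = KMeans(n_clusters=3, random_state=0, n_init=10)
model.fit(X)
print('Cluster centers:\n', model.cluster_centers_)
predicted = model.predict([[4.8, 5.2]])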
Random Forest
Python code:
#Import Library
from sklearn.ensemble import RandomForestClassifier
#Assumed you have, X (predictor) and y (target) for
#training data set and x_test (predictor) of test_dataset
#Create Random Forest object
model = RandomForestClassifier()
#Train the model using the training sets and check score
model.fit(X, y)
#Predict Output
predicted = model.predict(x_test)
R code:
#Import Library
library(randomForest)
x <- cbind(x_train, y_train)
#Fitting model
fit <- randomForest(y_train ~ ., data = x, ntree = 500)
summary(fit)
#Predict Output
predicted <- predict(fit, x_test)
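A runnable sketch of the Python snippet on the bundled iris dataset (n_estimators and random_state are set explicitly only to make the illustration reproducible):

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)
print('Training accuracy:', model.score(X, y))
predicted = model.predict(X[:5])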
Dimensionality Reduction Algorithms
Python code:
#Import Library
from sklearn import decomposition
#Assumed you have training and test data sets as train and test
#Create PCA object
pca = decomposition.PCA(n_components=k)
#default value of k = min(n_sample, n_features)
#For Factor analysis:
#fa = decomposition.FactorAnalysis()
#Reduce the dimension of the training dataset using PCA
train_reduced = pca.fit_transform(train)
#Reduce the dimension of the test dataset
test_reduced = pca.transform(test)
R code:
#Import Library
library(stats)
pca <- princomp(train, cor = TRUE)
train_reduced <- predict(pca, train)
test_reduced <- predict(pca, test)
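In the snippet above, k is a placeholder you must set yourself. Here is a minimal runnable sketch with synthetic 10-dimensional data, keeping 3 components (all numbers invented for illustration):

import numpy as np
from sklearn import decomposition

#Synthetic 10-dimensional data
rng = np.random.RandomState(0)
train = rng.randn(100, 10)
test = rng.randn(20, 10)

pca = decomposition.PCA(n_components=3)
train_reduced = pca.fit_transform(train)  #fit on train, then transform
test_reduced = pca.transform(test)        #reuse the fitted components on test
print('Explained variance ratio:', pca.explained_variance_ratio_)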
GBDT (Gradient Boosting)
Python code:
#Import Library
from sklearn.ensemble import GradientBoostingClassifier
#Assumed you have, X (predictor) and y (target) for
#training data set and x_test (predictor) of test_dataset
#Create Gradient Boosting Classifier object
model = GradientBoostingClassifier(n_estimators=100,
    learning_rate=1.0, max_depth=1, random_state=0)
#Train the model using the training sets and check score
model.fit(X, y)
#Predict Output
predicted = model.predict(x_test)
R code:
#Import Library
library(caret)
x <- cbind(x_train, y_train)
#Fitting model
fitControl <- trainControl(method = "repeatedcv", number = 4, repeats = 4)
fit <- train(y_train ~ ., data = x, method = "gbm",
             trControl = fitControl, verbose = FALSE)
predicted <- predict(fit, x_test, type = "prob")[,2]
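A runnable sketch of the Python snippet with the same hyperparameters, on synthetic data from scikit-learn's helper (the dataset itself is invented for illustration):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification

#Synthetic classification data
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

model = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
                                   max_depth=1, random_state=0)
model.fit(X, y)
print('Training accuracy:', model.score(X, y))
predicted = model.predict(X[:5])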
Originally published: 2015-12-02
This article comes from Yunqi Community partner "Big Data Digest"; for more, follow the "BigDataDigest" WeChat official account.