1. Preface

After laying the groundwork for so long, we have finally arrived at sklearn. sklearn (scikit-learn) is the go-to Python module for machine learning.

2. The General Learning Pattern

sklearn provides many classic machine-learning models, such as KNN and SVM. The point of the "general learning pattern" is that most of these models can be trained and evaluated with one and the same template.

The steps of the template:
Step 1: import the libraries we need

#encoding:utf-8
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

Step 2: load a prepared dataset

iris=datasets.load_iris()
iris_X=iris.data
iris_Y=iris.target

Step 3: split the data into training and test sets
Signature: X_train,X_test,Y_train,Y_test=train_test_split(train_data,train_target,test_size=,random_state=)
Here train_test_split randomly partitions the data into a training subset and a test subset, returning the samples and labels of each. test_size is usually given as a fraction: the share of the whole dataset that ends up in the test set. random_state is the random seed: with the same seed, the same dataset always yields the same split. By default random_state=None, so every call produces a different split.
Example:

X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.3)
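To see random_state in action, here is a small self-contained check (the toy array and the seed 42 are arbitrary choices): splitting twice with the same seed gives the identical partition.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 toy samples with 2 features each
Y = np.arange(10)

# Same seed -> exactly the same split, call after call
a_train, a_test, _, _ = train_test_split(X, Y, test_size=0.3, random_state=42)
b_train, b_test, _, _ = train_test_split(X, Y, test_size=0.3, random_state=42)

print(np.array_equal(a_train, b_train))  # True
print(len(a_test))  # 3, i.e. 30% of the 10 samples
```

With random_state=None (the default) the two calls would almost always disagree.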

Step 4: build the model

knn = KNeighborsClassifier()    # here we use a KNN model; other estimators plug into the same template
knn.fit(X_train, Y_train)    # train the model on X_train and Y_train

Step 5: test the model
The point of testing is to run the test set through the model and compare its predictions with the true labels, to see how accurate the model is.

print("this is final predict data==")
print(knn.predict(X_test))
print("this is final true data==")
print(Y_test)
"""
this is final predict data==
[the predicted labels for X_test]
this is final true data==
[the true labels, for comparison]
"""

Step 6: check the model's score
Signature: print(knn.score(X_test,Y_test))

"""
this is final score==
0.9380952380952381
"""

3. Data Normalization

This is a preprocessing step. If the individual features are nowhere near a standard normal distribution (zero mean, unit variance), many models will not perform well on them. preprocessing.scale lets us rescale the data first.

Signature:

X=preprocessing.scale(x)
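Concretely, scale standardizes every feature column to zero mean and unit variance. A tiny check with made-up numbers:

```python
import numpy as np
from sklearn import preprocessing

x = np.array([[1.0, -100.0],
              [2.0,    0.0],
              [3.0,  100.0]])

X = preprocessing.scale(x)

# Every column now has mean 0 and standard deviation 1
print(X.mean(axis=0))  # ~[0. 0.]
print(X.std(axis=0))   # [1. 1.]
```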

Example:

#encoding:utf-8
from sklearn import preprocessing
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digit=load_digits()
X=digit.data
Y=digit.target

X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.3)
model=SVC()
model.fit(X_train,Y_train)
print("this is original predict data==")
print(model.predict(X_test))
print("this is original true data==")
print(Y_test)
print("this is original score==")
print(model.score(X_test,Y_test))

X1=preprocessing.scale(X)    # scaled copy of the features
X_train,X_test,Y_train,Y_test=train_test_split(X1,Y,test_size=0.3)    # note: split the scaled X1, not the raw X
model=SVC()
model.fit(X_train,Y_train)
print("this is final predict data==")
print(model.predict(X_test))
print("this is final true data==")
print(Y_test)
print("this is final score==")
print(model.score(X_test,Y_test))    # score on the test set, just like the unscaled run

With the scaled features in use, the predicted labels line up with the true ones almost everywhere, and the test score typically lands in the high 0.9s. The accuracy is, frankly, a little scary.

4. Cross-Validation

Cross-validation repeatedly splits the dataset into different training and test partitions, so that every sample gets a turn in both roles and the evaluation does not hinge on one lucky split.
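The repeated splitting is easiest to see with KFold, one of sklearn's cross-validation splitters. A minimal sketch on six toy samples:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(12).reshape(6, 2)  # 6 toy samples

# 3 folds: every sample lands in the test set exactly once
kf = KFold(n_splits=3)
for train_idx, test_idx in kf.split(X):
    print("train:", train_idx, "test:", test_idx)
# train: [2 3 4 5] test: [0 1]
# train: [0 1 4 5] test: [2 3]
# train: [0 1 2 3] test: [4 5]
```

cross_val_score drives exactly this kind of split loop for you and returns one score per fold.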

1. cross_val_score
Signature: cross_val_score(knn,X,y,cv=n,scoring='')
Parameters: knn: the model to evaluate
cv: split into n folds
scoring: the scoring method for the evaluator (many are available; see sklearn's scoring documentation)
Example:

#encoding:utf-8
from matplotlib.pyplot import plot, show
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()
X = digits.data
y = digits.target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)

k_range=range(1,31)
k_scores=[]
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores=cross_val_score(knn,X,y,cv=50,scoring='accuracy')
    k_scores.append(scores.mean())
print("this is k's scores==")
print(k_scores)


plot(k_range,k_scores)
show()
"""
this is k's scores==
[0.987114709851552, 0.9846147098515519, 0.9864418285335094, 0.986086989823832, 0.9857536564904986, 0.9829203231571654, 0.9831133056133057, 0.982586989823832, 0.9804075026443446, 0.9802536564904984, 0.9812522522522521, 0.9794594594594593, 0.9796133056133055, 0.976100485100485, 0.9744553238101623, 0.9732751618668426, 0.9739418285335093, 0.9727623413540221, 0.9714147832696218, 0.9709147832696219, 0.9704147832696219, 0.9694147832696219, 0.9682352960901348, 0.9677352960901348, 0.9677481166029552, 0.9672481166029553, 0.9672481166029553, 0.965475389330228, 0.964475389330228, 0.964975389330228]
#the numbers above are the mean cross-validated accuracy for each value of n_neighbors
"""

(plot: mean cross-validated accuracy against n_neighbors)

2. learning_curve
This function computes cross-validated training and test scores for training sets of different sizes: a cross-validation generator splits the whole dataset k times, and the estimator is then trained on growing fractions of each training split.
Signature: learning_curve(estimator,X,y,train_sizes=[],cv=,scoring='')
Parameters: estimator: the classifier to use
X: the sample data
y: the sample labels
train_sizes: the fractions of the training set to train on, each in (0,1]
Returns: train_sizes_abs: the absolute numbers of training samples used
train_scores: the scores on the training sets
test_scores: the scores on the test sets

Example:

#encoding:utf-8
from numpy import mean
from matplotlib.pyplot import legend, plot, show
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve, train_test_split
from sklearn.svm import SVC

digits = load_digits()
X = digits.data
y = digits.target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)

#learning_curve
train_sizes,train_accuracy,test_accuracy=learning_curve(SVC(gamma=0.001),X,y,cv=10,scoring='accuracy',train_sizes=[0.1,0.25,0.5,0.75,1])
print("this is train_sizes==")
print(train_sizes)
print("this is train_accuracy==")
print(train_accuracy)
print("this is test_accuracy==")
print(test_accuracy)

train_mean=mean(train_accuracy,axis=1)
test_mean=mean(test_accuracy,axis=1)
print("this is train's mean==")
print(train_mean)
print("this is test's mean==")
print(test_mean)

plot(train_sizes,train_mean,"o",color='blue',linestyle='--',label='train')
plot(train_sizes,test_mean,"o-",color='red',label='test')
legend(loc='best')

show()

"""
this is train_sizes==
[ 161  403  806 1209 1612]
this is train_accuracy==
[[1.         0.99378882 0.99378882 0.99378882 0.99378882 0.99378882
  0.99378882 0.99378882 0.99378882 0.99378882]
 [1.         0.99751861 0.99751861 0.99751861 0.99751861 0.99751861
  0.99751861 0.99751861 0.99751861 0.99751861]
 [1.         0.99875931 0.99875931 0.99875931 0.99875931 0.99875931
  0.99875931 0.99875931 0.99875931 0.99875931]
 [1.         0.99834574 0.99917287 0.99917287 0.99917287 0.99917287
  0.99917287 0.99917287 0.99917287 0.99917287]
 [0.99937965 0.99875931 0.99875931 0.99875931 0.99875931 0.99875931
  0.99875931 0.99875931 0.99875931 0.99937965]]
this is test_accuracy==
[[0.93513514 0.91803279 0.77348066 0.69444444 0.7877095  0.78212291
  0.91061453 0.85955056 0.83615819 0.82954545]
 [0.91351351 0.93989071 0.87845304 0.88888889 0.91061453 0.87709497
  0.97206704 0.94382022 0.93785311 0.93181818]
 [0.94054054 0.9726776  0.93370166 0.98888889 0.97206704 0.97765363
  0.97765363 0.97191011 0.94350282 0.9375    ]
 [0.94594595 0.9726776  0.95027624 0.98888889 0.96089385 0.98324022
  0.99441341 0.98876404 0.97175141 0.95454545]
 [0.95135135 1.         0.95027624 0.99444444 0.98324022 0.98882682
  0.99441341 0.99438202 0.96610169 0.96022727]]
this is train's mean==
[0.99440994 0.99776675 0.99888337 0.99917287 0.99888337]
this is test's mean==
[0.83267942 0.91940142 0.96160959 0.97113971 0.97832635]
"""

(plot: training and cross-validation accuracy against the number of training samples)

3. validation_curve
This function shows how the model's accuracy changes as one hyperparameter of the classifier is varied.
Signature: validation_curve(estimator,X,y,param_name='',param_range=[],cv=,scoring='')
Parameters: param_name: the name of the hyperparameter being varied
param_range=[]: the range of values to try for it
Returns: the training scores and the test scores
Example:

#encoding:utf-8
import numpy as np
from matplotlib.pyplot import legend, plot, show, xlabel, ylabel
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split, validation_curve
from sklearn.svm import SVC

digits = load_digits()
X = digits.data
y = digits.target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)

#validation_curve
param_range = np.logspace(-6, 0, 5)
print("param_range")
print(param_range)

train_scores, test_scores = validation_curve(SVC(), X, y, param_name='gamma', param_range=param_range, cv=10, scoring='accuracy')

# scoring='accuracy' returns scores directly (higher is better), so no sign flip into a "loss" is needed
train_scores_mean = np.mean(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)

plot(param_range, train_scores_mean, 'o-', color="r",
     label="Training")
plot(param_range, test_scores_mean, 'o-', color="g",
     label="Cross-validation")

xlabel("gamma")
ylabel("Accuracy")
legend(loc="best")
show()

"""
param_range
[1.00000000e-06 3.16227766e-05 1.00000000e-03 3.16227766e-02
 1.00000000e+00]
 """

5. Saving a Model

To save a model we use joblib. It used to ship inside sklearn as sklearn.externals.joblib; in current versions it is a standalone package imported directly.

import joblib    # the joblib module (older sklearn: from sklearn.externals import joblib)

#save the model (note: the save folder must exist beforehand, otherwise this raises an error)
joblib.dump(clf, 'save/clf.pkl')

#load the model back
clf3 = joblib.load('save/clf.pkl')

#test the reloaded model
print(clf3.predict(X[0:1]))
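The snippet above presumes an already-trained clf and an existing save/ folder. Here is a self-contained round trip instead: a sketch that trains a throwaway model and writes to a temporary directory, so nothing needs to be set up beforehand.

```python
import os
import tempfile

import joblib  # standalone package; installed alongside sklearn as a dependency
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
clf = SVC().fit(iris.data, iris.target)

# Dump the fitted model to disk, then load it back
path = os.path.join(tempfile.mkdtemp(), "clf.pkl")
joblib.dump(clf, path)
clf2 = joblib.load(path)

# The reloaded model predicts exactly like the original
print((clf2.predict(iris.data) == clf.predict(iris.data)).all())  # True
```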

6. Postscript

After grinding through machine learning for all these days, it is finally done. Time to celebrate!