Using Cross-Validation Correctly

Preface: while wrapping the cross-validation functionality of the cell prediction module, a question came up: should I persist the best-performing model from cross-validation and use it directly for prediction, or, as I do now, retrain a model on the full dataset and use that for later predictions?

The original text

You do cross-validation when you want to do either of these two things:

  • Model Selection
  • Error Estimation of a Model

Model selection can come in different scenarios:

Selecting one algorithm vs others for a particular problem/dataset
Selecting hyper-parameters of a particular algorithm for a particular problem/dataset
(Please note that if you are both selecting an algorithm - better to call it a model - and also doing a hyper-parameter search, you need to do nested cross-validation. See: Is Nested-CV really necessary?)
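Nested CV can be sketched with scikit-learn by putting a hyper-parameter search inside an outer cross-validation loop (the dataset, the SVC model, and the parameter grid below are my illustrative assumptions, not part of the answer):

```python
# Sketch of nested cross-validation: the inner loop picks hyper-parameters,
# the outer loop estimates the error of the whole search procedure.
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, cross_val_score

iris = datasets.load_iris()

# Inner loop: grid search over SVC's C on each outer training fold.
inner = GridSearchCV(svm.SVC(), param_grid={"C": [0.1, 1, 10]}, cv=3)

# Outer loop: each outer test fold is never seen by the inner search,
# so the resulting scores are an unbiased estimate of the procedure.
outer_scores = cross_val_score(inner, iris.data, iris.target, cv=5)
```

A plain (non-nested) grid search would report the score of the best parameter combination on the same folds used to choose it, which is optimistically biased; the outer loop removes that bias.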

Cross-validation ensures, to some degree, that the error estimate is as close as possible to the generalization error of that model (although this is very hard to approximate). By observing the average error across folds you get a good projection of the expected error of a model built on the full dataset. It is also important to observe the variance of the predictions, that is, how much the error varies from fold to fold. If the variation is too high (considerably different values), the model will tend to be unstable. Bootstrapping is another method that provides a good approximation in this sense. I suggest carefully reading Section 7 of the book “The Elements of Statistical Learning”, freely available at: ELS-Standford
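Both quantities are easy to read off the per-fold scores; a minimal sketch (the dataset and model choice are illustrative assumptions):

```python
# Judge a model by both the mean fold score and its spread across folds.
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
scores = cross_val_score(svm.SVC(C=1), iris.data, iris.target, cv=5)

mean_score = scores.mean()  # projection of performance on the full dataset
std_score = scores.std()    # a large spread suggests an unstable model
```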

As mentioned before, you must not keep the model built in any of the folds. Instead, you have to rebuild the model on the full dataset (the one that was split into folds). If you have a separate test set, you can use it to try this final model, obtaining a similar (and most surely higher) error than the one obtained by CV. You should, however, rely on the estimated error given by the CV procedure.

After performing CV with different models (algorithm combinations, etc.), choose the one that performed best with regard to error and its variance across folds. You will then need to rebuild the model on the whole dataset. Here comes a common confusion in terms: we commonly speak of model selection thinking that the model is the ready-to-predict model built on data, but in this case it refers to the combination of algorithm + preprocessing procedures you apply. So, to obtain the actual model you need for making predictions/classifications, you build the winning combination on the whole dataset.
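The select-then-rebuild step can be sketched as follows (the two candidate combinations and all names here are my illustrative assumptions):

```python
# Pick the best algorithm + preprocessing combination by CV score,
# then rebuild that combination on ALL the data.
from sklearn import datasets, preprocessing, svm
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

iris = datasets.load_iris()
candidates = {
    "scaled_svc": make_pipeline(preprocessing.StandardScaler(), svm.SVC(C=1)),
    "scaled_logreg": make_pipeline(preprocessing.StandardScaler(),
                                   LogisticRegression(max_iter=1000)),
}

# CV selects the winning *combination*, not a fitted model.
cv_means = {name: cross_val_score(m, iris.data, iris.target, cv=5).mean()
            for name, m in candidates.items()}
winner = max(cv_means, key=cv_means.get)

# The ready-to-predict model: the winning combination fit on the full dataset.
final_model = candidates[winner].fit(iris.data, iris.target)
```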

One last thing to note: if you are applying any kind of preprocessing that uses the class information (feature selection, LDA dimensionality reduction, etc.), it must be performed within every fold, not beforehand on the whole dataset. This is a critical aspect. You should do the same for preprocessing methods that use direct information from the data (PCA, normalization, standardization, etc.). You can, however, apply preprocessing that does not depend on the data (e.g. deleting a variable following expert opinion, but this is kind of obvious). This video can help you in that direction: CV the right and the wrong way
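The wrong way and the right way can be contrasted in a short sketch (dataset and model are illustrative assumptions; a Pipeline is one way to keep preprocessing inside each fold):

```python
# Wrong way vs right way to combine scaling with cross-validation.
from sklearn import datasets, preprocessing, svm
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

iris = datasets.load_iris()

# Wrong: the scaler is fit on the full dataset, so every training fold
# has already "seen" statistics of its test fold (information leakage).
X_leaky = preprocessing.StandardScaler().fit_transform(iris.data)
leaky_scores = cross_val_score(svm.SVC(C=1), X_leaky, iris.target, cv=5)

# Right: the Pipeline refits the scaler inside each fold, on the
# training portion only.
pipe = make_pipeline(preprocessing.StandardScaler(), svm.SVC(C=1))
clean_scores = cross_val_score(pipe, iris.data, iris.target, cv=5)
```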

Finally, a nice explanation of the subject: CV and model selection

My understanding

  1. When to use cross-validation:
    • when doing model selection
    • when estimating a model's error
  2. Model selection comes in two kinds:
    • algorithm selection - plain CV
    • hyper-parameter selection for an algorithm - grid search and other hyper-parameter optimization
  3. What cross-validation shows you:
    • a fairly close estimate of a model's error
    • the variance across folds, which indicates how stable the model is

Here is the key point: do not take the best-performing model from cross-validation and use it directly for the final predictions; instead, retrain a model on the full dataset. (So I was not doing it wrong after all; this is exactly the question that started this exploration =。=)

Unexpected bonus: a major additional finding

If the dataset needs any of the following operations, perform them after cross-validation has split the dataset:

  • preprocessing steps that depend on the scale of the data, such as standardization and MinMax scaling;
  • class-label-dependent processing such as feature selection and LDA.

But how can cross_val_score apply such processing to each split of the dataset?
The official documentation gives the answer:

from sklearn import datasets, preprocessing, svm
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

iris = datasets.load_iris()
cv = KFold(n_splits=5)  # the docs define cv earlier; any CV splitter works
clf = make_pipeline(preprocessing.StandardScaler(), svm.SVC(C=1))
cross_val_score(clf, iris.data, iris.target, cv=cv)