You do cross-validation when you want to do either of these two things:
- Model Selection
- Error Estimation of a Model
Model selection can come in different scenarios:
- Selecting one algorithm over others for a particular problem/dataset
- Selecting the hyper-parameters of a particular algorithm for a particular problem/dataset
(Please note that if you are both selecting an algorithm, better called a model, and also searching over hyper-parameters, you need to do nested cross-validation. See: Is Nested-CV really necessary?)
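The nested procedure mentioned above can be sketched as follows. This is a minimal illustration with scikit-learn, not a prescribed recipe: the iris data, the SVC pipeline, the grid of `C` values and the fold counts are all assumptions chosen for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumed example data; any (X, y) classification dataset would do.
X, y = load_iris(return_X_y=True)

# Inner loop: picks C using only the training part of each outer fold.
inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(
    make_pipeline(StandardScaler(), SVC()),
    param_grid={"svc__C": [0.1, 1, 10]},  # assumed grid, for illustration only
    cv=inner_cv,
)

# Outer loop: estimates the error of the whole "search, then fit" procedure.
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)
nested_scores = cross_val_score(search, X, y, cv=outer_cv)
print(nested_scores.mean(), nested_scores.std())
```

The outer scores describe the procedure "tune, then fit", not one fixed value of `C`, which is why the hyper-parameter search sits inside the estimator passed to `cross_val_score`.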
Cross-validation ensures, to some degree, that the error estimate is as close as possible to the generalization error for that model (although this is very hard to approximate). The average error across folds gives a good projection of the expected error for a model built on the full dataset. It is also important to observe the variance of the prediction, that is, how much the error varies from fold to fold. If the variation is too high (considerably different values), the model will tend to be unstable. Bootstrapping is the other method that provides a good approximation in this sense. I suggest carefully reading chapter 7 of the book “The Elements of Statistical Learning”, freely available at: ESL-Stanford
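To make the "average error and its variation from fold to fold" concrete, here is a small sketch with scikit-learn. The dataset, the classifier and the choice of 10 stratified folds are assumptions for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Assumed example data and model.
X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# cross_val_score returns one accuracy per fold; error = 1 - accuracy.
accuracy = cross_val_score(SVC(C=1), X, y, cv=cv)
fold_errors = 1.0 - accuracy

print("mean error:", fold_errors.mean())        # projection of the expected error
print("std across folds:", fold_errors.std())   # large values hint at instability
```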
As has been mentioned before, you must not use the model built in any of the folds. Instead, you have to rebuild the model with the full dataset (the one that was split into folds). If you have a separate test set, you can use it to try this final model, obtaining a similar (and most surely higher) error than the one obtained by CV. You should, however, rely on the estimated error given by the CV procedure.
After performing CV with different models (algorithm combinations, etc.), choose the one that performed best in terms of error and its variance among folds. You will need to rebuild the model with the whole dataset. Here comes a common confusion in terms: we commonly refer to model selection thinking that the model is the ready-to-predict model built on data, but in this case it refers to the combination of algorithm + preprocessing procedures you apply. So, to obtain the actual model you need for making predictions/classifications, you need to build it using the winning combination on the whole dataset.
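A minimal sketch of this "select the combination, then rebuild on everything" workflow, assuming scikit-learn. The two candidate pipelines and the dataset are made up for the example, and for brevity the winner is picked on mean accuracy alone, whereas the text above also recommends looking at the spread across folds.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Each candidate is a full "preprocessing + algorithm" combination (assumed examples).
candidates = {
    "scaler+svc": make_pipeline(StandardScaler(), SVC(C=1)),
    "scaler+logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}

# Same folds for every candidate, so the comparison is like for like.
results = {name: cross_val_score(est, X, y, cv=cv) for name, est in candidates.items()}
for name, scores in results.items():
    print(name, scores.mean(), scores.std())

# The fold models are discarded; the winning combination is refit on all the data.
best_name = max(results, key=lambda name: results[name].mean())
final_model = candidates[best_name].fit(X, y)
```

`cross_val_score` works on clones of each estimator, so `final_model` is the only fitted object here, and it has seen the whole dataset.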
The last thing to note is that if you are applying any kind of preprocessing that uses the class information (feature selection, LDA dimensionality reduction, etc.), it must be performed within every fold, and not beforehand on the whole data. This is a critical aspect. You should do the same if you are applying preprocessing methods that are fitted directly on the data (PCA, normalization, standardization, etc.). You can, however, apply preprocessing that does not depend on the data (deleting a variable following expert opinion, but this is kind of obvious). This video can help you in that direction: CV the right and the wrong way
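The difference between the two ways can be sketched on synthetic data. Everything here is an assumption made for the illustration: pure-noise features, random labels, univariate selection of 20 features via `SelectKBest`, and logistic regression as the classifier. Since the labels carry no signal, an honest estimate should sit near chance.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))   # pure noise features
y = rng.integers(0, 2, size=100)   # labels unrelated to X
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Wrong: features are chosen using the labels of ALL samples, then cross-validated.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
wrong_way = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=cv)

# Right: selection is a pipeline step, refit on the training part of each fold.
pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000))
right_way = cross_val_score(pipe, X, y, cv=cv)

print("selection before CV:", wrong_way.mean())
print("selection inside CV:", right_way.mean())
```

One would expect the first number to look clearly better than chance even though there is nothing to learn, while the pipeline version stays close to 50%; that gap is the optimism introduced by preprocessing outside the folds.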
Finally, here is a nice explanation of the subject: CV and model selection
- Algorithm selection – plain CV
- Hyper-parameter selection for an algorithm – grid search and similar hyper-parameter optimization
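from sklearn import datasets, preprocessing, svm
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
iris, cv = datasets.load_iris(), 5  # cv=5 is assumed; the original leaves cv undefined
clf = make_pipeline(preprocessing.StandardScaler(), svm.SVC(C=1))  # scaler is refit in every fold
cross_val_score(clf, iris.data, iris.target, cv=cv)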