DISCOVER: Cross-validation

Estimate the performance of a machine learning model over new, unseen samples of data using TeselaGen's DISCOVER module.

Written by Eduardo Abeliuk
Updated over a week ago

Cross-validation is a method for estimating the performance of a machine learning model over new, unseen samples of data.

In the DISCOVER module, after a predictive or evolutive machine learning model is trained, the model's results page displays cross-validation (CV) results in the summary card:

The score inside the red rectangle can be read as an estimate of the R² metric over a novel dataset. This cross-validated metric is especially helpful because all the data provided at the model's creation was used for training, and training scores are typically more optimistic than scores calculated on new samples.
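To see why the cross-validated score is the more trustworthy number, the short sketch below compares R² computed on the training data with a cross-validated R² for the same model. It uses scikit-learn and a synthetic dataset purely for illustration; it is not DISCOVER's internal code.

    # Illustrative sketch (not DISCOVER internals): training R^2 vs cross-validated R^2.
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    # Synthetic regression data, used only as a placeholder.
    X, y = make_regression(n_samples=100, n_features=20, noise=10.0, random_state=0)

    model = Ridge(alpha=1.0)
    model.fit(X, y)

    # R^2 on the same data used for training: typically optimistic.
    train_r2 = model.score(X, y)

    # Cross-validated R^2: a better estimate of performance on unseen samples.
    cv_r2 = cross_val_score(Ridge(alpha=1.0), X, y, scoring="r2", cv=5).mean()

    print(f"training R^2:        {train_r2:.3f}")
    print(f"cross-validated R^2: {cv_r2:.3f}")

In most runs the training R² will be noticeably higher than the cross-validated R², which is exactly the optimism the CV score is meant to correct.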

A detailed view of CV results is displayed within the Performance Statistics card. This card shows all scoring metrics calculated with the CV approach, as seen below:

The view above shows the statistics of the different metrics that were calculated using the CV approach. A more detailed explanation of how those statistics are calculated is given in the following sections.

What is Cross-validation?

Cross-validation (CV) [1] is any of various similar model validation techniques for assessing how the results of a statistical analysis will generalize to an independent data set. In a prediction problem, a model is usually given a dataset of known data on which training is run (the training dataset), and a dataset of unknown (or first-seen) data against which the model is tested (called the validation dataset or testing set). The goal of cross-validation is to test the model's ability to predict new data that was not used in training, in order to flag problems like overfitting or selection bias and to give insight into how the model will generalize to an independent dataset (i.e., an unknown dataset, for instance from a real-world problem).

One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set or testing set). To reduce variability, in most methods multiple rounds of cross-validation are performed using different partitions, and the validation results are combined (e.g. averaged) over the rounds to give an estimate of the model's predictive performance.

In k-fold cross-validation, the original sample is randomly partitioned into k equal sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times, with each of the k subsamples used exactly once as the validation data. The k results can then be averaged to produce a single estimation. The advantage of this method over repeated random sub-sampling is that all observations are used for both training and validation, and each observation is used for validation exactly once. 10-fold cross-validation is commonly used, but in general k remains an unfixed parameter.
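As an illustration of the procedure just described, the following sketch runs 5-fold cross-validation by hand with scikit-learn. The dataset and model are placeholders chosen for the example, not what DISCOVER uses internally.

    # Illustrative sketch of k-fold cross-validation (k = 5) with scikit-learn.
    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import KFold

    X, y = make_regression(n_samples=100, n_features=20, noise=10.0, random_state=0)

    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    fold_scores = []
    for train_idx, val_idx in kf.split(X):
        model = Ridge(alpha=1.0)
        model.fit(X[train_idx], y[train_idx])                      # train on k - 1 folds
        fold_scores.append(model.score(X[val_idx], y[val_idx]))    # validate on the held-out fold

    # Each sample was used for validation exactly once; average the k scores.
    print(f"k-fold R^2 scores: {np.round(fold_scores, 3)}")
    print(f"mean R^2 estimate: {np.mean(fold_scores):.3f}")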

In repeated k-fold CV, the data is randomly split into k partitions r times, and the performance of the model is averaged over the resulting runs. Repetition helps reduce the variance of the score estimate and is especially useful when datasets are small (≤ 200 samples).

How is Cross-validation applied in DISCOVER?

By default DISCOVER uses k = 5 and r = 5.
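A minimal way to reproduce this scheme outside the platform, assuming scikit-learn is available, is RepeatedKFold with n_splits=5 and n_repeats=5. The model and data below are placeholders for illustration; DISCOVER's own pipeline is not shown here.

    # Sketch of repeated k-fold CV mirroring the stated defaults (k = 5, r = 5).
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import RepeatedKFold, cross_val_score

    X, y = make_regression(n_samples=150, n_features=20, noise=10.0, random_state=0)

    cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=0)   # 5 folds x 5 repetitions
    scores = cross_val_score(Ridge(alpha=1.0), X, y, scoring="r2", cv=cv)

    print(f"number of evaluation sets: {len(scores)}")            # 25 scores (5 x 5)
    print(f"mean cross-validated R^2:  {scores.mean():.3f}")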

In DISCOVER, CV metrics are displayed in terms of their statistics over the many CV partitions. These statistics are listed below, followed by a short computational sketch:

  • μ ± σ: The mean (μ) and standard deviation (σ) of the metric over all evaluation sets. The standard deviation is calculated with 0 degrees of freedom.

  • Median: The median of the metric for all sets in evaluation. It is the value separating the higher half from the lower half of the calculated score values. The median is a location estimator that is commonly considered to be more robust to outliers than the mean.

  • IQR: The interquartile range, also called the midspread, middle 50%, or H‑spread, is a measure of statistical dispersion, being equal to the difference between 75th and 25th percentiles. It is usually considered more robust to outliers than the standard deviation.
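For reference, these statistics could be computed from a set of CV scores as in the sketch below. The score values are invented for illustration, and DISCOVER's exact implementation may differ in detail.

    # Sketch of the summary statistics above, computed from a set of CV scores.
    import numpy as np

    # Made-up cross-validated R^2 scores, one per evaluation set.
    scores = np.array([0.81, 0.78, 0.84, 0.69, 0.80, 0.77, 0.83, 0.75, 0.79, 0.82])

    mu = scores.mean()
    sigma = scores.std(ddof=0)              # standard deviation with 0 degrees of freedom
    median = np.median(scores)              # robust location estimate
    q75, q25 = np.percentile(scores, [75, 25])
    iqr = q75 - q25                         # interquartile range (robust dispersion)

    print(f"mu ± sigma: {mu:.3f} ± {sigma:.3f}")
    print(f"median:     {median:.3f}")
    print(f"IQR:        {iqr:.3f}")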

FAQ:

  • Can I change cross-validation parameters used by the platform?

Not right now, but you can ask TeselaGen's Data Science team to do it for you. In future releases, we plan to make CV parameters configurable for our users.

