The "No Free Lunch" Theorem argues that, without having prior knowledge, there is no single model that will always do better than any other model.
Hence it is important to evaluate different approaches or techniques and select the best performing approach.
In this lecture:
Best practices for model evaluation
Overfitting vs Underfitting
Evaluation metrics (classification)
Model Evaluation
Feature vector x with known label y_true -> Learned model f -> Predicted label (output): y_pred = f(x)
To evaluate the performance of a machine learning model on a dataset (consisting of feature vectors and known labels), we need to quantify how close the model's predictions are to the true labels of the given dataset.
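As a rough illustration of this flow (not from the lecture), the sketch below uses a scikit-learn-style model; the toy data and the choice of DecisionTreeClassifier are assumptions made only to show the predict-then-compare step.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy feature vectors and known labels (illustrative values, not lecture data).
X = np.array([[0, 0], [1, 1], [1, 0], [0, 1]])
y_true = np.array([0, 1, 1, 0])

model = DecisionTreeClassifier().fit(X, y_true)  # learned model f
y_pred = model.predict(X)                        # predicted labels: y_pred = f(x)
# An evaluation metric then quantifies how close y_pred is to y_true
# (here predictions are made on the training data purely to show the flow).
```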
Evaluation Metrics
Regression
MSE (mean squared error)
MAE (mean absolute error)
Classification
Accuracy: proportion of correctly predicted labels
Classification metrics are what we focus on in this lecture.
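As a minimal sketch (the arrays below are illustrative values, not lecture data), these metrics can be computed directly with NumPy:

```python
import numpy as np

# Illustrative regression targets and predictions.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])
mse = np.mean((y_true - y_pred) ** 2)        # mean squared error  -> 0.375
mae = np.mean(np.abs(y_true - y_pred))       # mean absolute error -> 0.5

# Illustrative classification labels and predictions.
labels_true = np.array([0, 1, 1, 0, 1])
labels_pred = np.array([0, 1, 0, 0, 1])
accuracy = np.mean(labels_true == labels_pred)  # proportion correctly predicted -> 0.8
```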
Model Evaluation - Final Model
Happens in the third step (Evaluation)
Data needs to be split into training and testing sets:
The training set should be used in the training phase to select the most promising model.
The testing set should be used in the evaluation phase to evaluate the final model.
No observations are present in both sets.
Train-Test Data Splitting
Evaluating the final model on the training set may result in an optimistic performance estimate: the final model might have picked up on patterns that exist only in the training data.
We're interested in evaluating the performance of the model on unseen data: estimating the generalization error. How well will the model generalize to unseen data?
Our experimental setup needs to match (simulate) how we intend to use the model.
Random Sampling without Replacement
Dataset split into two sets:
Training set (e.g., 75%), which can be further split into training and validation sets
Testing set (e.g., 25%)
Simplest approach: appropriate for regression problems that are not time dependent
Procedure:
For each sample, flip a weighted coin to decide whether the sample goes into the training or the testing set (see the sketch after this list)
For example, 75% of records may be used for training and the remaining 25% for testing.
Common splits are 60-40, 70-30, or 80-20, depending on the number of samples in the original dataset; 90-10 and 99-1 splits are also common for very large datasets.
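A minimal sketch of this weighted-coin procedure, assuming NumPy; the function name, the records, and the 75-25 ratio are illustrative:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def random_split(samples, train_fraction=0.75):
    """Flip a weighted coin per sample: True -> training set, False -> testing set."""
    coin_flips = rng.random(len(samples)) < train_fraction
    train = [s for s, is_train in zip(samples, coin_flips) if is_train]
    test = [s for s, is_train in zip(samples, coin_flips) if not is_train]
    return train, test

# Example with illustrative records: roughly 75% land in the training set.
records = list(range(100))
train, test = random_split(records)
```

In practice, a library helper such as scikit-learn's train_test_split performs the same kind of random split.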
Stratification Followed by Random Sampling without Replacement
Ensures that the class ratios in the training and testing sets match those of the original dataset
Appropriate for classification problems
Procedure:
Samples are divided by their class labels
Each class is divided into a training and testing set using random sampling without replacement
All training sets are merged to produce a single training set
All testing sets are merged to produce a single testing set (a minimal sketch of this procedure follows)
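A minimal sketch of stratification followed by per-class random sampling without replacement, assuming NumPy; the function name, the toy labels, and the 75-25 ratio are illustrative:

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(seed=0)

def stratified_split(y, train_fraction=0.75):
    """Split sample indices class by class, then merge, preserving class ratios."""
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)                  # group sample indices by class label

    train_idx, test_idx = [], []
    for label, indices in by_class.items():
        indices = np.array(indices)
        rng.shuffle(indices)                       # random sampling without replacement
        cut = int(round(train_fraction * len(indices)))
        train_idx.extend(indices[:cut].tolist())   # this class's training portion
        test_idx.extend(indices[cut:].tolist())    # this class's testing portion
    return train_idx, test_idx                     # merged train/test index lists

# Example: a toy label vector with a 2:1 class ratio (illustrative values).
y = [0, 0, 0, 0, 0, 0, 1, 1, 1]
train_idx, test_idx = stratified_split(y)
```

scikit-learn's train_test_split provides similar behavior through its stratify parameter.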