The "No Free Lunch" Theorem argues that, without having prior knowledge, there is no single model that will always do better than any other model.
Hence it is important to evaluate different approaches or techniques and select the best performing approach.
In this lecture:
Best practices for model evaluation
Overfitting vs Underfitting
Evaluation metrics (classification)
Model Evaluation
Feature vector x with known label y_true -> Learned model f -> Predicted label (output): y_pred = f(x)
To evaluate the performance of a machine learning model on a dataset (consisting of feature vectors and known labels), we need to quantify how close the model's predictions are to the true labels of the given dataset.
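As a rough illustration of this flow (not from the lecture), the sketch below uses a scikit-learn-style model; the toy data and the choice of DecisionTreeClassifier are assumptions made only to show the predict-then-compare step.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy feature vectors and known labels (illustrative values, not lecture data).
X = np.array([[0, 0], [1, 1], [1, 0], [0, 1]])
y_true = np.array([0, 1, 1, 0])

model = DecisionTreeClassifier().fit(X, y_true)  # learned model f
y_pred = model.predict(X)                        # predicted labels: y_pred = f(x)
# An evaluation metric then quantifies how close y_pred is to y_true
# (here predictions are made on the training data purely to show the flow).
```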
Evaluation Metrics
Regression
MSE (mean squared error)
MAE (mean absolute error)
Classification
Accuracy: proportion of correctly predicted labels
Classification metrics are what we focus on in this lecture.
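As a minimal sketch (the arrays below are illustrative values, not lecture data), these metrics can be computed directly with NumPy:

```python
import numpy as np

# Illustrative regression targets and predictions.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])
mse = np.mean((y_true - y_pred) ** 2)        # mean squared error  -> 0.375
mae = np.mean(np.abs(y_true - y_pred))       # mean absolute error -> 0.5

# Illustrative classification labels and predictions.
labels_true = np.array([0, 1, 1, 0, 1])
labels_pred = np.array([0, 1, 0, 0, 1])
accuracy = np.mean(labels_true == labels_pred)  # proportion correctly predicted -> 0.8
```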
Model Evaluation - Final Model
Happens in the third step (Evaluation)
Data needs to be split into training and testing sets:
The training set should be used in the training phase to select the most promising model.
The testing set should be used in the evaluation phase to evaluate the final model.
No observations are present in both sets.
Train-Test Data Splitting
Evaluating the final model on the training set may result in an optimistic performance estimate: the final model might have picked up on patterns that exist only in the training data.
We're interested in evaluating the performance of the model on unseen data: estimating the generalization error. How well will the model generalize to unseen data?
Our experimental setup needs to match (simulate) how we intend to use the model.
Random Sampling without Replacement
Dataset split into two sets:
Training set (e.g., 75%), which can be further split into training and validation sets
Testing set (e.g., 25%)
Simplest approach: appropriate for regression problems that are not time dependent
Procedure:
For each sample, flip a weighted coin to decide whether the sample goes into the training or the testing set (see the sketch after this list)
For example, 75% of records may be used for training and the remaining 25% for testing.
Common splits are 60-40, 70-30, or 80-20, depending on the number of samples in the original dataset; 90-10 and 99-1 splits are also common for very large datasets.
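A minimal sketch of this weighted-coin procedure, assuming NumPy; the function name, the records, and the 75-25 ratio are illustrative:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def random_split(samples, train_fraction=0.75):
    """Flip a weighted coin per sample: True -> training set, False -> testing set."""
    coin_flips = rng.random(len(samples)) < train_fraction
    train = [s for s, is_train in zip(samples, coin_flips) if is_train]
    test = [s for s, is_train in zip(samples, coin_flips) if not is_train]
    return train, test

# Example with illustrative records: roughly 75% land in the training set.
records = list(range(100))
train, test = random_split(records)
```

In practice, a library helper such as scikit-learn's train_test_split performs the same kind of random split.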
Stratification Followed by Random Sampling without Replacement
Ensures that the class ratios in the training and testing sets match those of the original dataset
Appropriate for classification problems
Procedure:
Samples are divided by their class labels
Each class is divided into a training and testing set using random sampling without replacement
All training sets are merged to produce a single training set
All testing sets are merged to produce a single testing set (a minimal sketch of this procedure follows)
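A minimal sketch of stratification followed by per-class random sampling without replacement, assuming NumPy; the function name, the toy labels, and the 75-25 ratio are illustrative:

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(seed=0)

def stratified_split(y, train_fraction=0.75):
    """Split sample indices class by class, then merge, preserving class ratios."""
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)                  # group sample indices by class label

    train_idx, test_idx = [], []
    for label, indices in by_class.items():
        indices = np.array(indices)
        rng.shuffle(indices)                       # random sampling without replacement
        cut = int(round(train_fraction * len(indices)))
        train_idx.extend(indices[:cut].tolist())   # this class's training portion
        test_idx.extend(indices[cut:].tolist())    # this class's testing portion
    return train_idx, test_idx                     # merged train/test index lists

# Example: a toy label vector with a 2:1 class ratio (illustrative values).
y = [0, 0, 0, 0, 0, 0, 1, 1, 1]
train_idx, test_idx = stratified_split(y)
```

scikit-learn's train_test_split provides similar behavior through its stratify parameter.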