Week 1 Notes
Intro to machine learning (Tuesday Sept 6th, 2022)
| Date | 09/06/2022 |
| Topic | Introduction |
| Professor | Dr. John Bukowy |
| Week | 1 |
- Lecture notes due every Friday @ 11:59 PM
- Attempt each problem in problem set (soft %)
What is machine learning? How is it different from traditional programming?
- A set of methods that allow computers to learn from data.
- It helps detect patterns or structures directly from data, transforming data into knowledge.
- "Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed."
- Pervasive today!
  - Speech recognition
  - Face recognition
  - Social media
    - Relevant content
    - User engagement
  - Digital advertising
  - Medical imaging
  - Spam emails
  - Approve/deny credit card applications
Types of machine learning problems
The goal is to make decisions.
Terminology
- Supervised learning
- Unsupervised Learning
- Reinforcement Learning
Continued Lecture (Wednesday Sept 7th, 2022)
| Date | 09/07/2022 |
| Topic | Lab 1 |
| Professor | Dr. John Bukowy |
| Week | 1 |
Supervised learning
- Classification: predict a category
- Regression: predict a continuous value (e.g., the price of a house; see the sketch below)
Link to article/image:
- Regression problem
- Classification problem
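A minimal sketch (my own toy example, not from lecture) showing both problem types with scikit-learn's KNN estimators - the same features, but a categorical label vs. a continuous one:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# toy data: 4 samples, 1 feature each (values are made up)
X = np.array([[1.0], [2.0], [3.0], [4.0]])

# classification: the label is a category
y_class = np.array(["a", "a", "b", "b"])
clf = KNeighborsClassifier(n_neighbors=1).fit(X, y_class)
print(clf.predict([[1.2]]))  # -> ['a'] (a category)

# regression: the label is a continuous value
y_reg = np.array([100.0, 200.0, 300.0, 400.0])
reg = KNeighborsRegressor(n_neighbors=1).fit(X, y_reg)
print(reg.predict([[1.2]]))  # -> [100.] (a number)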
Unsupervised Learning
- Clustering problems
- Dimension Reduction
Terms
- Features: independent variables, predictors, attributes
- Label: dependent variable, response, target
- Sample: instance, observation, record, example
Steps of building machine learning systems
- Data Preprocessing
  - Feature Engineering
  - Feature scaling
- Training
  - Use the training set to perform:
    - Model fitting
    - Model selection
- Evaluation
- Prediction
Data Preprocessing
- Split the data into a training set and a testing set (see the sketch below)
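A minimal sketch of the split, assuming scikit-learn's train_test_split (the arrays are toy data):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 samples x 2 features
y = np.arange(10)                 # one label per sample

# hold out 30% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)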
Training
- The goal is to learn a mapping from features to labels using the labeled data
- A training algorithm is implemented to learn this mapping from the training data.
- Training assigns values to the model's parameters so the model fits the data
Models
Regression
- Linear Regression
- K-Nearest Neighbors (KNN)
- Decision Trees
- Neural Networks
Classification
- Logistic Regression
- K-Nearest Neighbors (KNN)
- Decision Trees
- Support Vector Machines (SVMs)
- Neural Networks
Evaluation Metrics
- Regression
  - Root mean-squared error (RMSE)
  - Mean absolute error (MAE)
  - R-squared
- Classification
  - Accuracy
  - Precision
  - Recall
  - F1-score
  - ROC curves
  - Precision-recall curves
  - Confusion matrices
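A minimal sketch (toy numbers, not course data) computing several of these metrics with sklearn.metrics:

import numpy as np
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             r2_score, accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

# regression metrics
y_true = np.array([3.0, 5.0, 2.5])
y_pred = np.array([2.5, 5.0, 3.0])
print(np.sqrt(mean_squared_error(y_true, y_pred)))  # RMSE
print(mean_absolute_error(y_true, y_pred))          # MAE
print(r2_score(y_true, y_pred))                     # R-squared

# classification metrics
y_true_c = [1, 0, 1, 1, 0]
y_pred_c = [1, 0, 0, 1, 0]
print(accuracy_score(y_true_c, y_pred_c))    # 0.8
print(precision_score(y_true_c, y_pred_c))   # 1.0
print(recall_score(y_true_c, y_pred_c))      # ~0.667
print(f1_score(y_true_c, y_pred_c))          # 0.8
print(confusion_matrix(y_true_c, y_pred_c))  # [[2 0] [1 2]]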
Tools we will use in this class
Python - NumPy and Scikit-Learn
- High-quality software with regular releases
- Great user documentation
- Scikit-Learn:
  - Fits nearly all models into a unified API (see the sketch below)
  - Includes a large array of algorithms - hard to find something missing!
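A minimal sketch of that unified API - two very different models, identical calls (iris is scikit-learn's bundled demo dataset):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

for model in (LogisticRegression(max_iter=1000), DecisionTreeClassifier()):
    model.fit(X, y)           # the same call for every estimator
    print(model.score(X, y))  # mean accuracy on the data it was fit on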
Data Structuring & Notations
| Date | 09/09/2022 |
| Topic | Data Structuring |
| Professor | Dr. John Bukowy |
| Week | 1 |
Tabular Data
- The easiest data to work with
- Records are organized into rows
- Variables are organized into columns (fields)
- Each variable has a type (e.g. int, text, etc.)
- Commonly stored in relational databases or spreadsheets
Semi-Structured Data
- Tabular data is often stored in JSON or similar files
- These files do not enforce a schema
- We cannot guarantee that each record will have the same fields
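A minimal sketch of the problem with made-up JSON records - nothing stops the second record from dropping a field:

import json

raw = '''[
  {"name": "a", "price": 10, "color": "red"},
  {"name": "b", "price": 12}
]'''

records = json.loads(raw)
for rec in records:
    # no schema guarantees "color" exists, so read it defensively
    print(rec.get("color", "MISSING"))  # red, then MISSING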
Unstructured Data
- Some data, such as free text, is completely unstructured
- Cannot be used directly with classic ML models but can be with newer deep learning models
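One common bridge (my aside, not from lecture) is to engineer numeric features from raw text first, e.g. with scikit-learn's bag-of-words CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the dog sat", "the cat ran"]
vec = CountVectorizer()
X = vec.fit_transform(docs)         # a SciPy sparse matrix, shape (3, 5)
print(vec.get_feature_names_out())  # ['cat' 'dog' 'ran' 'sat' 'the']
print(X.toarray())                  # word counts per document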
Data format for classic ML models
- Preprocessing
  - Classic ML models expect data to be in tabular form with numerical features
  - Semi-structured data needs to be processed first
- Training
- Evaluation
- Prediction
- Features in tabular data that are not numerical should also be processed:
  - Categorical: color, vehicle type
  - Boolean: married, automatic engine
  - Ordinal: star ratings
- Engineering numerical features from non-numerical ones will be discussed in CS3300 Data Science (a minimal encoding sketch follows below)
- We work mostly with tabular data containing only numerical variables
Categories -> Rank order -> Equal spacing -> Ratio (each level adds a property to the one before it)
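A minimal sketch (details deferred to CS3300; the data is made up) of encoding categorical vs. ordinal features with scikit-learn:

import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

colors = np.array([["red"], ["blue"], ["red"]])  # categorical: no order
print(OneHotEncoder().fit_transform(colors).toarray())

stars = np.array([["low"], ["high"], ["medium"]])  # ordinal: has rank order
enc = OrdinalEncoder(categories=[["low", "medium", "high"]])
print(enc.fit_transform(stars))  # [[0.] [2.] [1.]]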
Feature vector, feature matrix, label vector
Feature vector
- Features represent the measurements or properties of an object for which we want to predict a label.
- We can collect these features into a vector: feature vector.
- Each flower can be represented by a vector.
x = [x_1, x_2, ..., x_m]^T  (a column vector of m features)
Feature matrix
- A training, testing, or future/naive data set consists of a set of feature vectors.
- For each set, these feature vectors are collected into a matrix that we call the feature matrix.
- Each feature vector must be transposed (turned into a row) before being added to the matrix.
- Subscript -> attribute (feature) index
- Superscript -> object (sample) index
X = [x^(1)T; x^(2)T; ...; x^(n)T]  (an n x m matrix: one row per sample, one column per feature)
Label Vector
y is the label vector; ŷ ("y-hat") denotes a prediction
Feature Space
The set of all feature vectors. A feature vector can be depicted as a point in the feature space.
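A minimal sketch tying these together in NumPy - three made-up feature vectors stacked as rows of X, with one label per sample:

import numpy as np

x1 = np.array([5.1, 3.5])  # feature vector for sample 1 (m = 2 features)
x2 = np.array([4.9, 3.0])
x3 = np.array([6.2, 3.4])

X = np.vstack([x1, x2, x3])  # each row is a transposed feature vector
y = np.array([0, 0, 1])      # label vector: one label per row of X
print(X.shape, y.shape)      # (3, 2) (3,)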
What is array reshaping? When is it useful?
- The MNIST data set stores each digit as a 2D image
- Each image can be reshaped (flattened) into one row of a feature matrix (sketch below)
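A minimal sketch with random stand-in images (real MNIST digits are 28 x 28 grayscale):

import numpy as np

images = np.random.rand(100, 28, 28)  # stand-in for 100 MNIST digits
X = images.reshape(100, 28 * 28)      # one 784-feature row per image
print(X.shape)                        # (100, 784)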
Tooling (NumPy and SciPy)
- NumPy and SciPy are libraries for numerical and scientific computing
- Arrays - NumPy
- Sparse matrices - SciPy
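A minimal sketch of a SciPy sparse matrix - only the nonzero entries are stored:

import numpy as np
from scipy import sparse

dense = np.zeros((4, 4))
dense[0, 1] = 3.0
dense[2, 3] = 5.0

sp = sparse.csr_matrix(dense)  # keeps just the 2 nonzero entries
print(sp.nnz)                  # 2
print(sp.toarray())            # back to a dense array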
NumPy Array Data Structure
- Multi-dimensional
- All values have the same type
- Allocates one large block of contiguous memory
- Very little memory overhead
- A single Python object for the whole array
- Faster to iterate through memory in order due to CPU caching
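A quick check of these claims on a real array (my own example):

import numpy as np

a = np.arange(12, dtype=np.float32).reshape(3, 4)
print(a.dtype)                  # float32: every value has the same type
print(a.nbytes)                 # 48 = 12 elements x 4 bytes each
print(a.flags['C_CONTIGUOUS'])  # True: one contiguous block of memory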
NumPy Basics
import numpy as np

# empty array with an explicit element type
array = np.array([], dtype=np.float32)
# four zeros of a given integer type
array = np.zeros(4, dtype=np.int32)
# a 4 x 5 array of ones
array = np.ones((4, 5))
# number of dimensions
array.ndim
# shape
array.shape
# data type
array.dtype
# indexing: one index per dimension
array[0]
array[1, 2]
array[1, 2, 3, 4, 5]  # indexing into a 5-dimensional array
# selecting a single dimension (here: every entry along column 5)
array[:, 5, :]
# reversing a 1D array
array[::-1]
# fancy indexing: returns elements at the indices in the list
subset = array[[1, 8, 4, 5, 2]]
subset = array[another_array]  # another_array holds integer indices
# boolean mask
subset = array[[True, False, False, True, True]]
# a comparison returns a boolean array
mask = array == 1
selected = array[mask]
# sorting
sorted_array = np.sort(array)
# maximum
max_val = np.amax(array)
# minimum
min_val = np.amin(array)
# indices that would sort the array
sorted_idx = np.argsort(array)
# smallest element of the array
array[sorted_idx[0]]
# largest element of the array
array[sorted_idx[-1]]
# reshape returns a new array; assign the result (images defined elsewhere)
images = images.reshape((3, 4))
# stack two feature matrices side by side (column-wise)
X1 = np.hstack([X1, X2])