Week 1 Notes
Intro to machine learning (Tuesday Sept 6th, 2022)
| Date | 09/06/2022 |
| Topic | Introduction |
| Professor | Dr. John Bukowy |
| Week | 1 |
- Lecture notes due every Friday @ 11:59 PM
- Attempt each problem in problem set (soft %)
What is machine learning? How is it different from traditional programming?
- A set of methods that allow computers to learn from data.
- It helps detect patterns or structures directly from data, transforming data into knowledge.
- "Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed."
- Pervasive today!
  - Speech recognition
  - Face recognition
  - Social media
    - Relevant content
    - User engagement
  - Digital advertising
  - Medical imaging
  - Spam emails
  - Approve/deny credit card applications
Types of machine learning problems
The goal is to make decisions.
Terminology
- Supervised learning
- Unsupervised Learning
- Reinforcement Learning
Continued Lecture (Wednesday Sept 7th, 2022)
| Date | 09/07/2022 |
| Topic | Lab 1 |
| Professor | Dr. John Bukowy |
| Week | 1 |
Supervised learning
- Classification: predict a category
- Regression: predict a continuous value (e.g., the price of a house; see the sketch below)
Link to article/image:
- Regression problem
- Classification problem
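A minimal sketch (my own toy example, not from lecture) showing both problem types with scikit-learn's KNN estimators - the same features, but a categorical label vs. a continuous one:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# toy data: 4 samples, 1 feature each (values are made up)
X = np.array([[1.0], [2.0], [3.0], [4.0]])

# classification: the label is a category
y_class = np.array(["a", "a", "b", "b"])
clf = KNeighborsClassifier(n_neighbors=1).fit(X, y_class)
print(clf.predict([[1.2]]))  # -> ['a'] (a category)

# regression: the label is a continuous value
y_reg = np.array([100.0, 200.0, 300.0, 400.0])
reg = KNeighborsRegressor(n_neighbors=1).fit(X, y_reg)
print(reg.predict([[1.2]]))  # -> [100.] (a number)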
Unsupervised Learning
- Clustering problems
- Dimension Reduction
Terms
- Features: independent variables, predictors, attributes
- Label: dependent variable, response, target
- Sample: instance, observation, record, example
Steps of building machine learning systems
- Data Preprocessing
  - Feature Engineering
  - Feature scaling
- Training
  - Use the training set to perform:
    - Model fitting
    - Model selection
- Evaluation
- Prediction
Data Preprocessing
- Split the data into a training set and a testing set (see the sketch below)
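A minimal sketch of the split, assuming scikit-learn's train_test_split (the arrays are toy data):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 samples x 2 features
y = np.arange(10)                 # one label per sample

# hold out 30% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)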
Training
- The goal is to learn a mapping from features to labels using the labeled data
- A training algorithm is implemented to learn this mapping from the training data.
- Training assigns values to the model's parameters so the model fits the data
Models
Regression
- Linear Regression
- K-Nearest Neighbors (KNN)
- Decision Trees
- Neural Networks
Classification
- Logistic Regression
- K-Nearest Neighbors (KNN)
- Decision Trees
- Support Vector Machines (SVMs)
- Neural Networks
Evaluation Metrics
- Regression
  - Root mean-squared error (RMSE)
  - Mean absolute error (MAE)
  - R-squared
- Classification
  - Accuracy
  - Precision
  - Recall
  - F1-score
  - ROC curves
  - Precision-recall curves
  - Confusion matrices
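A minimal sketch (toy numbers, not course data) computing several of these metrics with sklearn.metrics:

import numpy as np
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             r2_score, accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

# regression metrics
y_true = np.array([3.0, 5.0, 2.5])
y_pred = np.array([2.5, 5.0, 3.0])
print(np.sqrt(mean_squared_error(y_true, y_pred)))  # RMSE
print(mean_absolute_error(y_true, y_pred))          # MAE
print(r2_score(y_true, y_pred))                     # R-squared

# classification metrics
y_true_c = [1, 0, 1, 1, 0]
y_pred_c = [1, 0, 0, 1, 0]
print(accuracy_score(y_true_c, y_pred_c))    # 0.8
print(precision_score(y_true_c, y_pred_c))   # 1.0
print(recall_score(y_true_c, y_pred_c))      # ~0.667
print(f1_score(y_true_c, y_pred_c))          # 0.8
print(confusion_matrix(y_true_c, y_pred_c))  # [[2 0] [1 2]]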
Tools we will use in this class
Python - NumPy and Scikit-Learn
- High-quality software with regular releases
- Great user documentation
- Scikit-Learn:
  - Fits nearly all models into a unified API (see the sketch below)
  - Includes a large array of algorithms - hard to find something missing!
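A minimal sketch of that unified API - two very different models, identical calls (iris is scikit-learn's bundled demo dataset):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

for model in (LogisticRegression(max_iter=1000), DecisionTreeClassifier()):
    model.fit(X, y)           # the same call for every estimator
    print(model.score(X, y))  # mean accuracy on the data it was fit on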
Data Structuring & Notations
| Date | 09/09/2022 |
| Topic | Data Structuring |
| Professor | Dr. John Bukowy |
| Week | 1 |
Tabular Data
- The easiest data to work with
- Records are organized into rows
- Variables are organized into columns (fields)
- Each variable has a type (e.g. int, text, etc.)
- Commonly stored in relational databases or spreadsheets
Semi-Structured Data
- Tabular data is often stored in JSON or similar files
- These files do not enforce a schema
- We cannot guarantee that each record will have the same fields
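A minimal sketch of the problem with made-up JSON records - nothing stops the second record from dropping a field:

import json

raw = '''[
  {"name": "a", "price": 10, "color": "red"},
  {"name": "b", "price": 12}
]'''

records = json.loads(raw)
for rec in records:
    # no schema guarantees "color" exists, so read it defensively
    print(rec.get("color", "MISSING"))  # red, then MISSING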
Unstructured Data
- Some data, such as free text, is completely unstructured
- Cannot be used directly with classic ML models but can be with newer deep learning models
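One common bridge (my aside, not from lecture) is to engineer numeric features from raw text first, e.g. with scikit-learn's bag-of-words CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the dog sat", "the cat ran"]
vec = CountVectorizer()
X = vec.fit_transform(docs)         # a SciPy sparse matrix, shape (3, 5)
print(vec.get_feature_names_out())  # ['cat' 'dog' 'ran' 'sat' 'the']
print(X.toarray())                  # word counts per document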
Data format for classic ML models
- Preprocessing
  - Classic ML models expect data to be in tabular form with numerical features
  - Semi-structured data needs to be processed first
- Training
- Evaluation
- Prediction
- Features in tabular data that are not numerical should also be processed:
  - Categorical: color, vehicle type
  - Boolean: married, automatic engine
  - Ordinal: star ratings
- Engineering numerical features from non-numerical ones will be discussed in CS3300 Data Science (a minimal encoding sketch follows below)
- We work mostly with tabular data containing only numerical variables
Categories -> Rank order -> Equal spacing -> Ratio (each level adds a property to the one before it)
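A minimal sketch (details deferred to CS3300; the data is made up) of encoding categorical vs. ordinal features with scikit-learn:

import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

colors = np.array([["red"], ["blue"], ["red"]])  # categorical: no order
print(OneHotEncoder().fit_transform(colors).toarray())

stars = np.array([["low"], ["high"], ["medium"]])  # ordinal: has rank order
enc = OrdinalEncoder(categories=[["low", "medium", "high"]])
print(enc.fit_transform(stars))  # [[0.] [2.] [1.]]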
Feature vector, feature matrix, label vector
Feature vector
- Features represent the measurements or properties of an object for which we want to predict a label.
- We can collect these features into a vector: feature vector.
- Each flower can be represented by a vector.
x = [x_1, x_2, ..., x_m]^T  (a column vector of m features)
Feature matrix
- A training, testing, or future/naive data set consists of a set of feature vectors.
- For each set, these feature vectors are collected into a matrix that we call the feature matrix.
- Each feature vector must be transposed (turned into a row) before being added to the matrix.
- Subscript -> attribute (feature) index
- Superscript -> object (sample) index
X = [x^(1)T; x^(2)T; ...; x^(n)T]  (an n x m matrix: one row per sample, one column per feature)
Label Vector
y is the label vector; ŷ ("y-hat") denotes a prediction
Feature Space
The set of all feature vectors. A feature vector can be depicted as a point in the feature space.
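A minimal sketch tying these together in NumPy - three made-up feature vectors stacked as rows of X, with one label per sample:

import numpy as np

x1 = np.array([5.1, 3.5])  # feature vector for sample 1 (m = 2 features)
x2 = np.array([4.9, 3.0])
x3 = np.array([6.2, 3.4])

X = np.vstack([x1, x2, x3])  # each row is a transposed feature vector
y = np.array([0, 0, 1])      # label vector: one label per row of X
print(X.shape, y.shape)      # (3, 2) (3,)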
What is array reshaping? When is it useful?
- The MNIST data set stores each digit as a 2D image
- Each image can be reshaped (flattened) into one row of a feature matrix (sketch below)
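A minimal sketch with random stand-in images (real MNIST digits are 28 x 28 grayscale):

import numpy as np

images = np.random.rand(100, 28, 28)  # stand-in for 100 MNIST digits
X = images.reshape(100, 28 * 28)      # one 784-feature row per image
print(X.shape)                        # (100, 784)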
Tooling (NumPy and SciPy)
- NumPy and SciPy are libraries for numerical and scientific computing
- Arrays - NumPy
- Sparse matrices - SciPy
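A minimal sketch of a SciPy sparse matrix - only the nonzero entries are stored:

import numpy as np
from scipy import sparse

dense = np.zeros((4, 4))
dense[0, 1] = 3.0
dense[2, 3] = 5.0

sp = sparse.csr_matrix(dense)  # keeps just the 2 nonzero entries
print(sp.nnz)                  # 2
print(sp.toarray())            # back to a dense array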
NumPy Array Data Structure
- Multi-dimensional
- All values have the same type
- Allocates one large block of contiguous memory
- Very little memory overhead
- A single Python object for the whole array
- Faster to iterate through memory in order due to CPU caching
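A quick check of these claims on a real array (my own example):

import numpy as np

a = np.arange(12, dtype=np.float32).reshape(3, 4)
print(a.dtype)                  # float32: every value has the same type
print(a.nbytes)                 # 48 = 12 elements x 4 bytes each
print(a.flags['C_CONTIGUOUS'])  # True: one contiguous block of memory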
NumPy Basics
import numpy as np

# empty array with an explicit element type
array = np.array([], dtype=np.float32)
# four zeros of a given integer type
array = np.zeros(4, dtype=np.int32)
# a 4 x 5 array of ones
array = np.ones((4, 5))
# number of dimensions
array.ndim
# shape
array.shape
# data type
array.dtype
# indexing: one index per dimension
array[0]
array[1, 2]
array[1, 2, 3, 4, 5]  # indexing into a 5-dimensional array
# selecting a single dimension (here: every entry along column 5)
array[:, 5, :]
# reversing a 1D array
array[::-1]
# fancy indexing: returns elements at the indices in the list
subset = array[[1, 8, 4, 5, 2]]
subset = array[another_array]  # another_array holds integer indices
# boolean mask
subset = array[[True, False, False, True, True]]
# a comparison returns a boolean array
mask = array == 1
selected = array[mask]
# sorting
sorted_array = np.sort(array)
# maximum
max_val = np.amax(array)
# minimum
min_val = np.amin(array)
# indices that would sort the array
sorted_idx = np.argsort(array)
# smallest element of the array
array[sorted_idx[0]]
# largest element of the array
array[sorted_idx[-1]]
# reshape returns a new array; assign the result (images defined elsewhere)
images = images.reshape((3, 4))
# stack two feature matrices side by side (column-wise)
X1 = np.hstack([X1, X2])