Machine Learning Introduction

CIS1902 Python Programming

Reminders

  • HW1 is due tonight!
  • HW2 will be released later this week.
  • Download the datasets posted on Ed for lab later today.
  • Download scikit-learn: pip install -U scikit-learn

Agenda

  1. Defining a Model
  2. Types of Machine Learning
  3. Basic Models
  4. Titanic Lab

Defining a Model

  • Problem Type: What is the goal of the output? Classification or value prediction?
  • Feature selection: What are some insights about the data that are useful?
  • Model selection: Given what I know above, what is the best class of model to pick for my problem?

Machine Learning

There are a few different types of machine learning you may have heard of:

  • Supervised
    • Labeled data, e.g. past exams with solutions
  • Unsupervised
    • Unlabeled data, e.g. just past exams
  • Reinforcement
    • Getting feedback as you take the exam

Basic Models

Disclaimer

These next two lectures may get a bit involved mathematically. Basic linear algebra and probability are assumed knowledge.

I will not be testing you on any of this; it is purely for your own understanding!

Decision Tree

  • One of the simplest and most intuitive models.

    • Basically what the game 20 Questions does!
  • Machine learning "magic" optimizes ways to determine the best splits.

[Figure: example decision tree]
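As a rough illustration, here is a minimal sketch of fitting a decision tree with scikit-learn. The tiny feature matrix X and labels y are made-up toy data, and max_depth=2 is an arbitrary choice to keep the tree small.

```python
# Minimal sketch: fit a decision tree on toy data (hypothetical values).
from sklearn.tree import DecisionTreeClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]  # toy features
y = [0, 0, 1, 1]                      # toy labels (here, the label follows the first feature)

clf = DecisionTreeClassifier(max_depth=2)  # limit depth so the tree stays small
clf.fit(X, y)                              # the "magic": learn the best splits
print(clf.predict([[1, 1]]))               # classify a new point -> [1]
```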

Random Forest

  • Decision trees are simple but can easily "overfit" to their training set
  • What if instead we trained a bunch of decision trees, not necessarily all the same?
  • This is a random forest! Typically, the majority vote (or mean prediction) across the trees is taken
  • Machine learning "magic" helps generate lots of diverse trees (see the sketch below)
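A minimal sketch of the same idea with scikit-learn's RandomForestClassifier; the synthetic dataset from make_classification is just a stand-in for real data, and n_estimators=100 (the number of trees) is the library default written out explicitly.

```python
# Minimal sketch: an ensemble of decision trees that vote on the final class.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 200 samples with 5 features each.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)  # 100 trees
clf.fit(X, y)
print(clf.predict(X[:3]))  # majority vote across the trees
```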

Linear Regression

A linear regression assumes that the dependent variable (y) has a linear relationship with one or more of the independent variables (x). These independent variables are also called regressors or predictors.

Specifically, if we have $p$ regressors, then for the $i$-th label we model

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i,$$

which we can represent succinctly as a matrix equation

$$y = X\beta + \varepsilon$$

Linear Regression

In the simple one-variable case:

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$

Linear Regression

  • The goal of a linear regression is to determine the coefficient vector $\beta$ that minimizes the error $\lVert y - X\beta \rVert^2$.

  • Visually, machine learning "magic" is picking the line that minimizes the total length of the green residual lines.

[Figure: linear regression fit with residuals shown in green]
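As a sketch of how this looks in code, the example below fits an ordinary least-squares line with scikit-learn's LinearRegression. The noisy data is invented (true slope 3, intercept 2), so the fitted coefficients should land near those values.

```python
# Minimal sketch: least-squares fit on invented noisy data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))             # one regressor
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 1, 50)   # y = 3x + 2 + noise

model = LinearRegression()
model.fit(X, y)                       # finds the beta minimizing squared error
print(model.coef_, model.intercept_)  # should be close to 3 and 2
```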

Logistic Regression

A logistic regression predicts a probability.

Given the amount of rain, what's the probability of a flood?

You can think of the problem as "What parameters maximize the likelihood of our (training) data occurring?"

As a caveat, logistic regression is only good for tasks where there is a binary outcome.

Logistic Regression

Pick parameters $\beta_0, \beta_1$ for the function

$$p(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$$

such that the likelihood of the training data is maximized.

[Figure: logistic regression curve]
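A minimal sketch of the rain/flood example using scikit-learn's LogisticRegression; the rainfall amounts and flood labels below are made up purely for illustration.

```python
# Minimal sketch: logistic regression for a binary outcome (made-up data).
import numpy as np
from sklearn.linear_model import LogisticRegression

rain = np.array([[0.1], [0.5], [1.0], [2.0], [3.0], [4.0], [5.0]])  # inches of rain
flood = np.array([0, 0, 0, 0, 1, 1, 1])                             # did it flood?

model = LogisticRegression()
model.fit(rain, flood)               # fits parameters by maximizing the likelihood
print(model.predict_proba([[2.5]]))  # [P(no flood), P(flood)] for 2.5 inches
```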

K-Means

Given $n$ points, how do we "optimally" group them into $k$ sets $S = \{S_1, \ldots, S_k\}$? Formally, we want to find

$$\underset{S}{\arg\min} \; \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2,$$

where $\mu_i$ is the mean (center) of the points in $S_i$.

Conceptually, the best split is where each set is chosen such that the total distance between each point and the center of that set is minimized.

K-Means

[Figure: k-means clustering example]

K-Means

This is actually an NP-hard problem! Machine learning instead typically uses an approximation algorithm. It's quite good in practice, but is prone to finding local minima. To get around this, we simply repeat the process a few times with different starting points and keep the best result.
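A minimal sketch with scikit-learn's KMeans. The two blobs of points are synthetic so the clusters are easy to see; n_init is the number of random restarts used to avoid getting stuck in a bad local minimum.

```python
# Minimal sketch: k-means on two synthetic blobs of points.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 1, (50, 2)),   # blob around (0, 0)
                    rng.normal(5, 1, (50, 2))])  # blob around (5, 5)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)  # 10 random restarts
labels = kmeans.fit_predict(points)  # cluster assignment for each point
print(kmeans.cluster_centers_)       # centers should be near (0, 0) and (5, 5)
```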

Recap

  • Decision Trees / Random Forests
    • Good for classification tasks
  • Linear Regression
    • Good for value prediction tasks
  • Logistic Regression
    • Good for binary outcomes (e.g. True/False, pass/fail)
  • K-Means Clustering
    • Good for unsupervised classification tasks
  • Note: most tasks can be modeled in multiple different ways; picking the "right" model is usually trial and error.

Lab: Titanic Dataset