Data Science Introduction: Numpy, Pandas, & Matplotlib

CIS1902 Python Programming

Reminders

  • HW1 has been released! This one is a bit longer than HW0.
    • Due in 2 weeks, Oct 8

Agenda

  1. Data Science Overview
  2. Notebooks
  3. Numpy Basics
  4. Pandas Basics
  5. Matplotlib

Data Science Overview

Data Science

  • In short, data science is the practice of answering questions about the world using collected data
  • Data scientists use insights about data to build models to automate certain tasks
    • Machine learning = tuning models
  • These models are driven by statistical patterns found in the data

Data Science

  • Machine learning is typically treated like magic, but its foundations lie in statistics
  • At the end of the day, it's an experimental science
    • Pick some parameters, see how "good" they are, repeat

Machine Learning Meme

Data Science

Typically, the flow for training a model is something like:

  1. Data Collection
  2. Data Preprocessing
  3. Model Training
  4. Model Evaluation
  5. Model Deployment

where steps 3-4 could be run many times with different classes of models.

Data Science

  • Arguably the best tools to use for steps 1-4 are the numpy and pandas Python modules
  • We won't be focusing on optimizing steps 3-4, but we will go over a few types of models you can use
    • If you want to learn more about this, take a machine learning course!
  • Step 5 is a problem for a data engineer, someone who productionizes models and data pipelines
    • This could be a good final project idea!

Notebooks

  • Notebooks are another way of running Python code, but provide some properties that are highly useful for data science
  • Code can be logically grouped, outputs can be saved without rerunning code, and plotting data is very quick and easy
  • If you haven't already, install Jupyter with pip install jupyterlab

Numpy

  • Numpy is Python's premier scientific computing package
  • Think complex mathematical functions, matrices, probability, etc.
  • We mentioned that Python is relatively slow, so Numpy leverages underlying C code to ensure fast computation
    • However, everything is abstracted away and the code you have to write is usually very minimal!

Numpy

  • The main data structure in Numpy is an ndarray
  • In scientific computing, you almost always deal with vectorized computation
  • As a result, these types of operations are first-class citizens in Numpy
  # regular python
  c = []
  for i in range(len(a)):
      c.append(a[i] * b[i])

  # using Numpy ndarrays!
  c = a * b
  # and it's super fast!
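
As a rough, machine-dependent illustration of that speed difference (not from the original slides), here is a minimal sketch that times the loop version against the vectorized version using Python's built-in timeit module:

  import timeit

  import numpy as np

  a = np.arange(1_000_000)
  b = np.arange(1_000_000)

  # time the pure-Python loop vs. the vectorized multiply, 10 runs each
  loop_time = timeit.timeit(
      "[a[i] * b[i] for i in range(len(a))]", globals=globals(), number=10
  )
  numpy_time = timeit.timeit("a * b", globals=globals(), number=10)

  print(f"loop:  {loop_time:.3f} s")
  print(f"numpy: {numpy_time:.3f} s")  # typically orders of magnitude faster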

Numpy

Some more examples

  >>> import numpy as np
  >>> a = np.array([20, 30, 40, 50])
  >>> b = np.arange(4)
  >>> b
  array([0, 1, 2, 3])
  >>> c = a - b
  >>> c
  array([20, 29, 38, 47])
  >>> b**2
  array([0, 1, 4, 9])
  >>> 10 * np.sin(a)
  array([ 9.12945251, -9.88031624,  7.4511316 , -2.62374854])
  >>> a < 35
  array([ True,  True, False, False])

Pandas

  • Pandas is basically Excel for Python
  • Pandas provides a streamlined way to manage row and column based datasets (think CSVs!)
  • When used in conjunction with Numpy, it provides a powerful environment where one can easily load and manipulate data for scientific computing

Pandas

  • Labeled data is stored as a DataFrame, i.e. each row has named columns
  • A column within the DataFrame is called a Series, i.e. a list of values
  • Each row can be thought of as a dictionary
    • Similarly, accessing a single column of a DataFrame gives you back a Series

dataframe

Pandas

  import pandas as pd

  df = pd.DataFrame(
      {
          "Name": [
              "Braund, Mr. Owen Harris",
              "Allen, Mr. William Henry",
              "Bonnell, Miss. Elizabeth",
          ],
          "Age": [22, 35, 58],
          "Sex": ["male", "male", "female"],
      }
  )

excel
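
Continuing with the df built above, here is a small sketch of how column and row access behaves (standard pandas accessors, nothing beyond what the slide defines):

  # a column comes back as a Series
  ages = df["Age"]
  print(ages.mean())        # 38.33...

  # a row can be pulled out by position with .iloc and behaves like a dict
  first_row = df.iloc[0]
  print(first_row["Name"])  # Braund, Mr. Owen Harris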

Pandas

  # reading csv data is easy!
  data = pd.read_csv('my_csv.csv')
  # we can get an overview of the data
  data.info()
  # accessing and setting is just like a dictionary
  data['my_col']
  data['new_col'] = pd.Series([1, 2, 3, 4])
  # pandas makes it easy to run various functions on my data
  data['my_col'].max()
  # and calculate derived values!
  data['col'] = data['my_col'] * data['new_col']
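
Two operations that come up constantly when cleaning data (including in the lab below) are boolean filtering and dropping missing values. A minimal sketch, reusing the data DataFrame and column names from above:

  # keep only the rows where 'my_col' is greater than 10 (boolean mask)
  filtered = data[data['my_col'] > 10]

  # drop any rows that have NaN in a particular column
  cleaned = data.dropna(subset=['new_col'])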

Matplotlib

  • A library that provides easy ways to plot data!
  • Can get very complex, so I recommend learning it as you need it
  • Typically, you plot a series of points represented as an array of x-values and an array of y-values (for 2D graphs)

Matplotlib

  import matplotlib.pyplot as plt

  xs = [1, 2, 3, 4]
  ys = [1, 4, 9, 16]
  plt.plot(xs, ys)
  # or scatter
  plt.scatter(xs, ys)
  plt.ylabel('my y-axis')
  plt.xlabel('my x-axis')
  # or to save as a file (call before plt.show(), or the saved figure is empty)
  plt.savefig('myfig.png')
  plt.show()

matplotlib example
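
Pandas also plugs into Matplotlib directly, so you can often plot a Series or DataFrame without building x/y lists by hand. A small sketch, reusing the hypothetical data DataFrame from the Pandas slides:

  import matplotlib.pyplot as plt

  # line plot of a numeric column (Series.plot() uses Matplotlib underneath)
  data['my_col'].plot()
  plt.show()

  # counts of each distinct value in a column, drawn as a bar chart
  data['my_col'].value_counts().plot(kind='bar')
  plt.show()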

Data Analysis Lab

UFO Sightings 👽

Download the dataset posted on Ed. First, let's clean the data:

  • Remove any row whose duration (seconds) is not a number
    • Then convert the column to a float
  • Remove any row with an unspecified country (is NaN)
  • Remove any row with an unspecified shape (is NaN)
  • Replace any timestamp that has 24:00 with 23:59 instead
    • Then convert the column to a timestamp type

You should have 69001 rows after this.
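
If you want a starting point, one possible sketch of the cleaning steps is below. The file name and column names ('duration (seconds)', 'country', 'shape', 'datetime') are assumptions about the dataset, so check them against the actual CSV:

  import pandas as pd

  # file and column names here are assumptions -- adjust to match the real CSV
  ufo = pd.read_csv('ufo_sightings.csv')

  # keep only rows whose duration parses as a number, then convert to float
  numeric_duration = pd.to_numeric(ufo['duration (seconds)'], errors='coerce')
  ufo = ufo[numeric_duration.notna()]
  ufo['duration (seconds)'] = ufo['duration (seconds)'].astype(float)

  # drop rows with an unspecified country or shape
  ufo = ufo.dropna(subset=['country', 'shape'])

  # replace impossible 24:00 times, then convert to a real timestamp type
  ufo['datetime'] = ufo['datetime'].str.replace('24:00', '23:59')
  ufo['datetime'] = pd.to_datetime(ufo['datetime'])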

UFO Sightings 👽

Now, let's answer the following questions

  1. How many sightings were there in the United States?
  2. How many sightings were there in the state of Washington?
  3. How many different types of shapes did people report?
  4. How many sightings were there before the year 2000?
  5. Plot a sighting timeline where the x-axis is years and the y-axis is the total number of sightings that year
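
For the plotting question, one possible sketch (continuing from the cleaned ufo DataFrame and the assumed 'datetime' column above):

  import matplotlib.pyplot as plt

  # total sightings per year, plotted as a timeline
  per_year = ufo['datetime'].dt.year.value_counts().sort_index()
  per_year.plot()
  plt.xlabel('year')
  plt.ylabel('sightings')
  plt.show()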

UFO Sightings 👽

Did you get them all right?

  1. 63561
  2. 3708
  3. 28
  4. 12262

ufo sightings plot