What to Watch?
*Update 11/26: Parts 2 & 3 are now out! In order to get the starter files, you need to scroll down and download `recommending.py` and `readme_movies.txt`. The autograder will be updated on Wednesday, 11/27, but you can get started in the meantime.*
Summary of Deliverables
This assignment is broken up into two major parts. In the first, you will be responsible for scraping reviews of movies from a set of pages on the internet. In the second, you will visualize your collection of movie reviews and build some simple tools for algorithmic recommendation.
By the end of this, here’s what you’ll need to submit to Gradescope:

- `scraping.py`
- `recommending.py`
- `readme_movies.txt`
- your screenshot of your visualization
Notes & Advice
We are intentionally providing you much less guidance than you have received on previous assignments. This is for a couple of reasons. For one, part of the “joy” of scraping is poking around on a webpage until you understand its structure. For another, we want you to have the opportunity to work on a more open-ended assignment so that you can confirm your own coding maturity and independence.
Since there are fewer step-by-step instructions, you should feel free to take an approach that makes the most sense to you. Because you will be making a number of independent decisions, it’s vital that you start this assignment as soon as you can. You should also go to Office Hours proactively rather than reactively: it is much easier for a TA to confirm or adjust your understanding of an approach before you try it than to try to “dig you out” of an overly complicated approach after the fact.
Part 1: Scraping
If you need to restart your project, you can find the scraping starter file here.
Structure of the Webpages
If you navigate to the following link (https://www.cis.upenn.edu/~cis110/movies/large_movies/page_1.html), you’ll see a simple HTML page containing two main components: a table containing information about a bunch of movies, and a pagination element for navigating to other pages in the movie dataset.
The table has a regular structure where one row represents a few key pieces of information about a single movie. Those information attributes are stored in the columns `Movie ID`, `Title`, `Number of Ratings`, and `Genres`.
Observe that there are actually two links in each row of the table: one is a link to that movie’s IMDB page, and another is a link to the list of reviews that we have collected for that movie. The IMDB link is there for the reader’s convenience, and the link to the movie’s reviews will be useful for you in a future step.
Above the table, you’ll observe a widget that tells you which movies of the dataset are displayed on the current page. This widget also features a forward and a backward arrow that contain links to the next and previous page of results, respectively. (When there is no previous or next page, that arrow will link to the current page instead.)
The URL linked above, https://www.cis.upenn.edu/~cis110/movies/large_movies/page_1.html, represents Page 1 of the `large_movies` review dataset. If we click the “Next” arrow, we’re taken to https://www.cis.upenn.edu/~cis110/movies/large_movies/page_2.html, which represents Page 2 of the `large_movies` review dataset.
Quick check:

- What would be the URL for Page 37 of the `large_movies` review dataset?
- What would be the URL for Page 2 of the `tiny_movies` review dataset?
If we navigate to the link under the `Number of Ratings` column, we’ll end up on a page that looks like the following:

This page is for the movie Ghostbusters (1984). This movie has the ID `2716` in the `large_movies` dataset, and so it lives at the URL https://www.cis.upenn.edu/~cis110/movies/large_movies/ratings_2716.html.
Quick check:

- What would be the URL for ratings of movie 2337 of the `large_movies` review dataset?
This page lists the genres for the movie, as well as a link back to Page 1 of movies in this dataset and a table of all ratings available for this movie. Each rating was made by a user with a unique ID, and all ratings are numbers between 0 and 5. Some users provided tags for their ratings as well. (One user helpfully tagged Ghostbusters (1984) with the tag “ghosts”. Thank you, very cool!)
Scraping Goals
Your job is to implement two functions. You will do so in `scraping.py`.
Collecting Movie Info
The first function to implement is `scrape_movie_info()`. The signature is as follows:

```python
def scrape_movie_info(slug: str) -> dict[int, tuple[str, tuple[str]]]:
    ...
```
Given a `slug` that represents the prefix used for all pages in the current dataset, return a dictionary that maps each movie ID to that movie’s info. The movie’s info is modeled as a tuple containing the name of the movie as well as a tuple of its genres. Here is an example of the structure that `scrape_movie_info` should return:
```python
{
    780: ("Ocean's Eleven", ("Crime", "Thriller")),
    1214: ("Alien", ("Horror", "Sci-Fi"))
}
```
Note that this doesn’t correspond to a full result and it is only provided as an example of the correct structure of the answer.
When we refer to “each movie ID” and “all pages in the current dataset”, we are referring to every movie on every page reachable by clicking the next button starting on `page_1.html`. That page links to `page_2.html`, which therefore should have its movies included. This second page also contains a link to `page_3.html`, which should have its movies included. And so on, and so on. The last page in each dataset has an empty string as its next page link; this behavior can be used as a sign to end the scraping process.
Keep in mind the following:

- Different datasets have different numbers of movie pages. `tiny_movies` has three pages, whereas `small_movies` has five. There is no way to know a priori how many movie pages each dataset has.
- Different datasets have different numbers of movies per page. `tiny_movies` has 2, whereas `small_movies` has 8. There is no way to know a priori how many movies per page there will be.
- The table on each page listing movies will always have exactly the same structure.
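To make the loop structure concrete, here is a minimal sketch of the pagination pattern. The `fetch_page` and `parse_movie_table` helpers are hypothetical stand-ins for whatever downloading and HTML-parsing approach you choose; here they serve hard-coded data so the control flow can run on its own.

```python
# Sketch of the scrape_movie_info pagination loop. fetch_page and
# parse_movie_table are hypothetical stand-ins: instead of downloading
# and parsing real HTML, they serve hard-coded data.

FAKE_SITE = {
    "page_1.html": ({780: ("Ocean's Eleven", ("Crime", "Thriller"))}, "page_2.html"),
    "page_2.html": ({1214: ("Alien", ("Horror", "Sci-Fi"))}, ""),  # "" marks the last page
}

def fetch_page(url):
    """Pretend to download a page; really just a dictionary lookup."""
    return FAKE_SITE[url]

def parse_movie_table(page):
    """Pretend to extract (movies_on_page, next_page_name) from a page."""
    return page

def scrape_movie_info_sketch(slug):
    movie_info = {}
    page_name = "page_1.html"     # every dataset starts here
    while page_name:              # an empty next-page link ends the loop
        page = fetch_page(slug + page_name)
        movies, page_name = parse_movie_table(page)
        movie_info.update(movies) # accumulate results across pages
    return movie_info
```

The important part is the `while page_name:` loop, which keeps following next-page links until the empty string shows up.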
There are a couple of unit tests made available to you in `test_scrape_movie_info.py`. There is a “tiny” test and a “small” test that is commented out; don’t try to run the small one until you pass the tiny one! Since you’re actually loading data over the internet, this code takes some time to run!
Collecting User Ratings
The second function to implement is `scrape_ratings()`. The signature is as follows:

```python
def scrape_ratings(slug: str, movie_ids: set[int]) -> dict[int, dict[int, float]]:
    ...
```
Given a `slug` and a set `movie_ids` that contains all of the movie IDs available in a dataset, return a dictionary that stores all ratings made by all users. The dictionary returned should map user IDs to dictionaries containing all of that user’s ratings. Each inner dictionary will map movie IDs to the score that the user provided for that movie ID.
Keep in mind that the ratings for a movie with ID `n` are found at the URL `f"{slug}ratings_{n}.html"`. Here is an example of the structure of the nested dictionary that `scrape_ratings` should return:
should return:
```python
{
    514: {
        2716: 5.0,
        780: 2.0
    },
    279: {
        780: 4.0,
        300: 2.5,
        1010: 0.5
    }
}
```
In the example dictionary above, we have modeled a dataset that contains two different users, four different movies, and five different ratings.
Quick check:

- What rating did the user with ID `279` give to the movie with ID `300`?
- What is the ID of the movie that both users rated? Who gave it the higher rating?
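One wrinkle worth noticing: each ratings page is organized by movie, but the required output is organized by user, so part of the job is inverting that nesting. The sketch below assumes you have already parsed each movie’s page into a `{user_id: score}` dictionary; the `ratings_by_user` name is made up for illustration.

```python
def ratings_by_user(per_movie_ratings):
    """Invert {movie_id: {user_id: score}} into {user_id: {movie_id: score}}."""
    by_user = {}
    for movie_id, ratings in per_movie_ratings.items():
        for user_id, score in ratings.items():
            # setdefault creates the user's inner dict on first sight
            by_user.setdefault(user_id, {})[movie_id] = score
    return by_user

# Mirrors the example above: two users, four movies, five ratings.
per_movie = {
    2716: {514: 5.0},
    780: {514: 2.0, 279: 4.0},
    300: {279: 2.5},
    1010: {279: 0.5},
}
print(ratings_by_user(per_movie))
# → {514: {2716: 5.0, 780: 2.0}, 279: {780: 4.0, 300: 2.5, 1010: 0.5}}
```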
Check Your Work
Here is the intended result of calling `scrape_movie_info("https://www.cis.upenn.edu/~cis110/movies/tiny_movies/")`:
```python
{1210: ('Star Wars: Episode VI - Return of the Jedi',
        ('Action', 'Adventure', 'Sci-Fi')),
 2028: ('Saving Private Ryan', ('Action', 'Drama', 'War')),
 1307: ('When Harry Met Sally...', ('Comedy', 'Romance')),
 5418: ('Bourne Identity, The', ('Action', 'Mystery', 'Thriller')),
 56367: ('Juno', ('Comedy', 'Drama', 'Romance')),
 3751: ('Chicken Run', ('Animation', 'Children', 'Comedy'))}
```
If we save the above result into a variable, we could get all of the movie IDs into a set like so:

```python
movie_info = scrape_movie_info("https://www.cis.upenn.edu/~cis110/movies/tiny_movies/")
movie_ids = set(movie_info.keys())
```
We could then call `scrape_ratings("https://www.cis.upenn.edu/~cis110/movies/tiny_movies/", movie_ids)`. This would return a dictionary with 331 entries, representing the reviews of 331 users. You should spot-check your output to make sure that you see the matching ratings from each user associated with each movie. For example, you should verify that your output has a review for “When Harry Met Sally…” and “Star Wars: Episode VI - Return of the Jedi” from user #600.
You can compare your output for the tiny movies dataset with the correct movie info and user ratings if you like. There are a couple of unit tests made available to you in `test_scrape_ratings.py`. There is a “tiny” test and a “small” test that is commented out; don’t try to run the small one until you pass the tiny one! Since you’re actually loading data over the internet, this code takes some time to run!
Correct outputs for other datasets are also linked here:
|        | Movie Info | Ratings |
|--------|------------|---------|
| small  | 🔗         | 🔗      |
| medium | 🔗         | 🔗      |
| large  | 🔗         | 🔗      |
(or, you can also download all of them altogether here.)
Part 2: Recommendation
Because this assignment was released in parts, Codio will not contain the files necessary for completing Part 2. All you need to do to rectify this is to download `recommending.py` and `readme_movies.txt` below.
⚠️⚠️⚠️⚠️ Starter File Is Here ⚠️⚠️⚠️⚠️
⚠️⚠️⚠️⚠️ Readme File Is Here ⚠️⚠️⚠️⚠️
In this part of the assignment, we’ll do some recommendation based on the data that can be scraped from these movie review sites. You will be able to complete tasks from this part of the assignment even if your work from the previous part is not finished or correct. That is, we make available to you a bunch of JSON files that contain all of the correctly scraped data from the sources described in the previous section.
To the extent that we’ve implemented recommendation strategies so far in this class—most prominently for restaurants in HW4—we’ve done so in a straightforward, rule-based manner. That is, the user of the Restaurant Recommender needed to know to some degree what it was that they wanted to eat. They needed to select based on price to filter out restaurants that were too expensive, or they needed to have an idea in mind of what kind of cuisine they were looking for.
In this part of this assignment, your job is to implement a `MovieRecommender` class that works by user-based collaborative filtering. Although the name is long, the idea is simple: find a way to model each user’s taste, and then recommend a movie to a given person by finding the other person who has the most similar taste and picking something that they liked. The great thing about this kind of system is that the person looking for a recommendation doesn’t have to have any idea ahead of time about what they’d like to see. Just by knowing examples of what they have liked and disliked in the past, our system can present them with reasonable options. This idea powers actual recommendation systems implemented by Netflix (for movies and shows) and Amazon (for just about anything).
A: Creating a MovieRecommender
We will implement our movie recommendation logic inside of the `MovieRecommender` class, which can be found in `recommending.py`. The initializer method is completed for you so that you have your object attributes set up properly. One method that `__init__` calls is not implemented yet, and so this will be your first task: implement `ratings_to_preferences` to turn a single user’s ratings dictionary into a dictionary that models that user’s movie preferences. You shouldn’t change anything about `__init__`.
We want to turn a set of the user’s reviews into a representation of their “taste.” Anyone who’s been asked to describe their taste in movies, music, literature, or fashion will know how hard it is to convey something so detailed in a few short words. We will attempt to quantify taste by turning each user’s ratings collection into a dictionary that maps each genre to the average score that the user awards to movies that feature that genre. (You will find an example of this transformation in the next subsection.) This is not a perfect representation of taste, but it will allow us to compare users even when they have not seen the same set of movies!
ratings_to_preferences
Implement `ratings_to_preferences`. This method takes in a single user’s `ratings`, a dictionary mapping movie IDs to ratings for those movies. The method should return a dictionary that maps strings of genre names to that user’s average rating for movies of that genre. A few things to keep in mind:

- `ratings` only contains IDs and scores; to look up the genres associated with a movie, you’ll need to use that movie’s ID as a lookup key in the class’s movie information dictionary.
- Each movie can have multiple genres.
- Each movie will have at least one genre.
- There are 19 total genres listed in the dataset, but each user may not have rated an example of each genre. The output dictionary should only include mappings for genres that were rated by that user.
You can inspect the following example of how this calculation should work.
If the movies in the Movie Recommender are these:
```python
movie_info = {
    1: ("Harry's Adventure", ("Comedy", "Adventure")),
    2: ("Travis' Tragedy", ("Drama", "IMAX", "Comedy")),
}
```
And the user’s ratings are these:

```python
ratings = {1: 3, 2: 4}
```
Then we would expect their preferences dictionary to look like this:

```python
{"Comedy": 3.5, "Adventure": 3, "Drama": 4, "IMAX": 4}
```
Observe how both movies are comedies, and so the average rating for comedy movies is influenced by both of them. Each of the other genres appears in only one movie, so its average rating comes from that single movie.
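One way to compute these averages is to track a running total and a count per genre. Below is a standalone sketch of that idea; in the actual assignment this is a method that reads the movie info from the object’s attributes, so the explicit `movie_info` parameter here is a stand-in.

```python
from collections import defaultdict

def ratings_to_preferences_sketch(ratings, movie_info):
    totals = defaultdict(float)  # genre -> sum of scores for that genre
    counts = defaultdict(int)    # genre -> number of rated movies with that genre
    for movie_id, score in ratings.items():
        _, genres = movie_info[movie_id]  # look up genres via the movie ID
        for genre in genres:
            totals[genre] += score
            counts[genre] += 1
    # Only genres the user actually rated appear in totals
    return {genre: totals[genre] / counts[genre] for genre in totals}

movie_info = {
    1: ("Harry's Adventure", ("Comedy", "Adventure")),
    2: ("Travis' Tragedy", ("Drama", "IMAX", "Comedy")),
}
print(ratings_to_preferences_sketch({1: 3, 2: 4}, movie_info))
# → {'Comedy': 3.5, 'Adventure': 3.0, 'Drama': 4.0, 'IMAX': 4.0}
```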
B: Quantifying Similarity
Once we have a collection of dictionaries that represent individual users’ preferences, we want to start making comparisons among our users. We will do this using the cosine similarity metric, which will output a value between 0 (for totally unrelated users) and 1 (for users with “identical” taste).
We can pick two users with preference dictionaries $A$ and $B$. Given these two dictionaries, we can calculate the cosine similarity using the following formula.
\[\frac{\sum_{\texttt{genre in A.keys().intersection(B.keys())}}A[\texttt{genre}]B[\texttt{genre}]}{\sqrt{\sum_{\texttt{genre in A.keys()}}A[\texttt{genre}]^2}\sqrt{\sum_{\texttt{genre in B.keys()}}B[\texttt{genre}]^2}}\]

If this is daunting, then consider the following English explanation.
- First, sum over the products of the ratings for genres that are found in both $A$ and $B$. This is the numerator of this formula.
- Then, sum over the squares of the ratings for genres that are found in $A$ and take the square root of that sum.
- Repeat the calculation for $B$, and then multiply these two values together; this gives you the denominator.
- Divide the numerator by the denominator, and you have the cosine similarity!
We can study a specific example modeled by the unit test below. The work for arriving at an expected value of `0.5366563146` can be found below that.

```python
def test_average_rating(self):
    A = {"Comedy": 4, "Drama": 3}
    B = {"Drama": 2, "Action": 1}
    expected = 0.5366563146
    res = MovieRecommender.cosine_similarity(A, B)
    self.assertAlmostEqual(expected, res)
```
cosine_similarity
Implement the `cosine_similarity` method. As a static method, `cosine_similarity` will not be able to access any information stored within the `MovieRecommender` object that it’s called from. We define the method in this way because it only needs its inputs (two dictionaries representing two users’ preferences) to do its work.
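Translated term by term, the formula can be sketched as a plain function like the one below; this is a sketch of the math, not necessarily how your method must be structured.

```python
import math

def cosine_similarity(A, B):
    # Numerator: products of ratings for genres present in both users
    shared = A.keys() & B.keys()
    numerator = sum(A[g] * B[g] for g in shared)
    # Denominator: product of the two users' vector magnitudes
    magnitude_a = math.sqrt(sum(v ** 2 for v in A.values()))
    magnitude_b = math.sqrt(sum(v ** 2 for v in B.values()))
    return numerator / (magnitude_a * magnitude_b)

A = {"Comedy": 4, "Drama": 3}
B = {"Drama": 2, "Action": 1}
print(round(cosine_similarity(A, B), 10))  # → 0.5366563146
```

Working the example by hand: the only shared genre is Drama, so the numerator is 3 × 2 = 6; the magnitudes are √(16 + 9) = 5 and √(4 + 1) = √5, giving 6 / (5√5) ≈ 0.5366563146.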
C: Finding the Most Similar User
Our work in the previous two steps allows us to quantify any individual’s preferences and to compare those preferences to those of another user. We are therefore now able to take a given user and find the other user in the dataset whose preferences are the most similar.
find_similar_user_by_id
First, implement `find_similar_user_by_id`. Given the input `user_id`, which corresponds to a user whose preference dictionary can be found inside of the `all_user_preferences` attribute of the `MovieRecommender` object, find the ID of the most similar different user present in the dataset. In order to do this, you will need to calculate the cosine similarity between the input user and each other user present in the dataset. Remember that with cosine similarity, greater values indicate higher similarity.

If there happen to be multiple other users who tie for most similar to the input user, select the one with the highest ID. Make sure that you do not consider the similarity of the input user to themself.
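The search described above amounts to tracking a best candidate while skipping the input user. Here is a sketch, with a copy of `cosine_similarity` included so the snippet stands alone; in your actual code you would call the method you wrote in Part B, and the preference dictionaries would come from the object’s `all_user_preferences` attribute.

```python
import math

def cosine_similarity(A, B):
    numerator = sum(A[g] * B[g] for g in A.keys() & B.keys())
    mag_a = math.sqrt(sum(v ** 2 for v in A.values()))
    mag_b = math.sqrt(sum(v ** 2 for v in B.values()))
    return numerator / (mag_a * mag_b)

def find_similar_user(target_id, all_user_preferences):
    target = all_user_preferences[target_id]
    best_id, best_score = None, -1.0
    for user_id, prefs in all_user_preferences.items():
        if user_id == target_id:
            continue  # never compare the input user to themself
        score = cosine_similarity(target, prefs)
        # strictly greater wins; exact ties go to the higher user ID
        if score > best_score or (score == best_score and user_id > best_id):
            best_id, best_score = user_id, score
    return best_id

prefs = {1: {"Comedy": 4.0}, 2: {"Comedy": 2.0}, 3: {"Comedy": 5.0}}
print(find_similar_user(1, prefs))  # users 2 and 3 tie at 1.0, so 3 wins → 3
```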
find_similar_user_by_preferences
Next, implement `find_similar_user_by_preferences`. The behavior of this method is nearly identical to that of `find_similar_user_by_id`. The difference is that the input is a dictionary of preferences rather than the ID of a user already present in the dataset. In this way, we can make recommendations for users (like you!) who are not included in the list of users we originally scraped.
D: Make Recommendations!
make_recommendations_for_id
Given the ID of a user who will be used to source recommendations and the ID of a user who will receive the recommendations, generate a set of movie titles that serve as recommendations. These movies must meet the criteria that
- the recommendee has not rated them
- they are tagged with at least one of the recommendee’s top two rated genres
Among all of these movies (there may be quite a few of them), you should select up to five: the movies most highly rated by the recommender from the possible set.
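The filtering-then-ranking logic can be sketched as below. The function name and parameters are hypothetical stand-ins for the object attributes your method will actually read, and how ties among equally rated candidates are broken is left to your implementation.

```python
def pick_recommendations(recommender_ratings, recommendee_ratings,
                         recommendee_prefs, movie_info):
    # Recommendee's top two genres, ranked by their average rating
    top_two = sorted(recommendee_prefs, key=recommendee_prefs.get, reverse=True)[:2]
    candidates = []
    for movie_id, score in recommender_ratings.items():
        if movie_id in recommendee_ratings:
            continue  # criterion 1: the recommendee has not rated it
        title, genres = movie_info[movie_id]
        if any(g in top_two for g in genres):  # criterion 2: a top-two genre
            candidates.append((score, title))
    # Keep up to five of the recommender's most highly rated candidates
    candidates.sort(reverse=True)
    return {title for _, title in candidates[:5]}
```

Here the recommender’s score is carried along with each candidate so that sorting by score directly yields the most highly rated choices.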
make_recommendations_for_preferences
Given the ID of a user who will be used to source recommendations and the preferences of a user who will receive the recommendations, generate a set of movie titles that serve as recommendations. These movies must meet the same criteria as those returned by `make_recommendations_for_id`.
Part 3: Data Viz
In this section, you are tasked with getting a bit creative! You’ll need to create some figure/chart/graph that reveals something about the user & movie information we have at hand. For both of the next tasks, you can choose to implement your method using any combination of PennDraw and Pandas. (You can refresh your memory about the myriad delights of Pandas with its documentation here.) The graphics that you create do not have to be complex, but they should be thoughtfully chosen and designed. For both graphics, you’ll be asked to upload a screenshot of your output and a brief explanation of why you chose the visualization elements that you did, in addition to the code itself.
You are only required to implement one of `visualize_user` or `visualize_dataset`, and there is no extra credit for doing both.
A User’s Taste (Option 1)
Write the method `visualize_user` that takes in the ID of a user whose ratings are included in the dataset and creates a graphic that displays information about that user. Things that might be interesting to display:
- their breakdown of movies watched in different genres
- the timeline of when their watched movies were released
- information about their favorite and least favorite movies & genres
- comparison of the user to the most and least similar other users in the dataset
The Dataset Itself (Option 2)
Write the method `visualize_dataset` that creates a graphic that displays some information about the collection of ratings & movies. This is broad, and you won’t be able to include everything; think about how you can aggregate & summarize instead. Things that might be interesting to display:
- The most and least common genre combinations among movies
- The most and least highly rated genre combinations among movies
- Best/worst/most reviewed movies
- Trends of movie genres over time
- Summarizing the entire dataset with information about the number of ratings, users, movies, ratings per user, ratings per movie