This assignment is due on Monday, September 16, 2024 before 11:59PM. This assignment may be done with a partner.

Text Classification: Assignment 1

For this assignment, we’ll be building a text classifier. The goal of our text classifier will be to distinguish between words that are simple and words that are complex. Examples of simple words are heard, sat, feet, shops, and town; examples of complex words are abdicate, detained, liaison, and vintners. Distinguishing between simple and complex words is the first step in a larger NLP task called text simplification, which aims to replace complex words with simpler synonyms. Text simplification is potentially useful for rewriting texts so that they can be more easily understood by younger readers, people learning English as a second language, or people with learning disabilities.

The learning goals of this assignment are:

  • Understand an important class of NLP evaluation methods (precision, recall and F1), and implement them yourself.
  • Employ common experimental design practices in NLP. Split the annotated data into training/development/test sets, implement simple baselines to determine how difficult the task is, and experiment with a range of features and models.
  • Get an introduction to sklearn, an excellent machine learning Python package.
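
To make the first learning goal concrete, here is a minimal sketch of how precision, recall and F1 can be computed for a binary task. The function name `precision_recall_f1`, the convention that label 1 means "complex", and the toy label lists are assumptions made for illustration; they are not the interface the assignment requires.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one positive class.

    y_true and y_pred are equal-length sequences of labels.
    `positive` is the label treated as the positive class
    (assumed here to be 1 = complex).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a denominator is empty.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Toy labels, invented for illustration only.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

In this toy example there are two true positives, one false positive and one false negative, so precision, recall and F1 all come out to 2/3.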

We will provide you with training and development data that has been manually labeled. We will also give you a test set without labels. You will build a classifier to predict the labels on our test set. You can upload your classifier’s predictions to Gradescope. We will score its predictions and maintain a leaderboard showing whose classifier has the best performance.
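
As a rough illustration of this workflow (fit something on labeled data, then predict labels for unlabeled words), here is a sketch of one possible simple baseline that labels a word as complex when it is long enough. The word-length rule, the helper names `length_baseline` and `pick_threshold`, the use of accuracy to choose the threshold, and the two unlabeled words are all assumptions made for illustration. The labeled words are the examples from the introduction above, not the real training data.

```python
def length_baseline(words, threshold):
    """Label a word 1 (complex) if it has at least `threshold` characters, else 0 (simple)."""
    return [1 if len(word) >= threshold else 0 for word in words]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def pick_threshold(words, labels, candidates=range(1, 21)):
    """Return the candidate threshold with the best accuracy on the labeled words.

    Ties are broken in favor of the smallest threshold, because max()
    returns the first maximal element.
    """
    return max(candidates, key=lambda t: accuracy(labels, length_baseline(words, t)))


# Toy labeled data built from the example words in the introduction.
train_words = ["heard", "sat", "feet", "shops", "town",
               "abdicate", "detained", "liaison", "vintners"]
train_labels = [0, 0, 0, 0, 0, 1, 1, 1, 1]

threshold = pick_threshold(train_words, train_labels)

# Hypothetical unlabeled words standing in for the test set.
test_predictions = length_baseline(["cat", "ameliorate"], threshold)
```

On real data a threshold would normally be chosen on the training set and checked on the development set, and a metric such as F1 may be a better selection criterion than accuracy.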

Here are the materials that you should download for this assignment:

This assignment has several deliverables:

  • Your Colab notebook
  • Your model’s output for the test set (your model will be ranked on a leaderboard against the other students’ outputs)