Hail, Caesar!

Summary of Deliverables

By the end of HW03, here’s what you’ll need to submit to Gradescope:

  • caesar.py
  • readme_caesar.txt

0. Getting started

A. Goals

The goal of this assignment is to familiarize yourself with functions, reinforce topics covered in previous assignments, and learn a little bit about the field of cryptography. The specific goals are to:

  • To write and use functions
  • To understand the benefit of splitting a program into helper functions
  • Strengthen familiarity with using lists and loops
  • Learn about string manipulation and ASCII encoding

At the end of this assignment, you will produce a program that can encrypt, decrypt and crack the Caeser cipher!

B. Background

Cryptography is the study of secure communications and secret codes. It helps you, say, send a secret message, knowing that even if the message is intercepted, it could not be read. People have been using ciphers (encrypted messages) for thousands of years, but only in the last century have computers come into the field.

One of the oldest ways to hide a message is to use a substitution cipher. One classic example of a substitution cipher is the Caesar cipher, named after the first recorded (and most famous) user, Julius Caesar. If you’d like to learn more about the Caesar cipher, you can check out the wikipedia page to read about its history and usage.

caesar_img

Julius Caesar

C. Understanding the Caesar Cipher

The Caesar cipher is a basic encryption technique where all letters in a message are shifted down the alphabet by a certain number (determined by the key). In order to encrypt or decrypt a message, you will need a secret key that should, in practice, be hard to find if you don’t already know it. In the Caesar cipher, the key is the number of places to shift each character. This number could be specified numerically (e.g., 4) or it could be specified as a character (e.g., 'E' – which is 4 places over from 'A'). For simplicity, we typically convert the entire message to uppercase, and may omit punctuation.

For instance, consider the message (and famous Julius Caesar quote):

"CAESAR’S WIFE MUST BE ABOVE SUSPICION"

If we encrypt the above message with the key 'E' then we change each letter in the message to the letter four places to the right in the alphabet. For example, 'C' becomes 'G', 'A' becomes 'E', etc.

Note the letter 'W' in 'WIFE', when shifted four down, goes beyond the last letter in the alphabet 'Z'. This is handled by wrapping back to the front of the alphabet, and so 'W' becomes 'A'.

In total, if we encrypt the previous message with the key 'E', it becomes:

"GEIWEV’SW AMJI QYWX FI EFSZI WYWTMGMSR"

D. Data Representation

In this assignment we need to manipulate some text that we will either encrypt/decrypt/crack. We need to store that text somewhere in our program, and we need to choose how we represent that text. The obvious choice would be to store them as a string. This would certainly work, but may not be the easiest approach.

Consider the word 'COMRADE'. If we stored it into a string and wanted to shift it by the letter 'B' (shifting all letters up by 1), it may not be immediately clear how to do this.

Lets instead represent the word 'COMRADE' in a format that is easier to manipulate by our program. We could replace each letter with its position in the alphabet. For example 'A' corresponds with the number 0, 'B' is 1, 'C' is 2, etc. We can now represent our word 'COMRADE' with a list of integers: [2, 14, 12, 17, 0, 3, 4]. This list may be harder to read as a human, but our program can easily shift our “word” up by the letter 'B' by adding 1 to each number in the list so that it becomes [3, 15, 13, 18, 1, 4, 5].

Fortunately for us, characters are already sort of stored as numbers in the computer. These characters are usually stored following the ASCII encoding:

ascii_table

We do not need to memorize this table as we can easily extract the numberical encoding of a character by using the ord() function:

# get the underlying encoding of the character A
# NOTE: this function only works on a single character
encoding = ord('A')

# print the encoding, which should print 65
print(encoding)

Unfortunately for us, in ASCII the letter 'A' has the value 65. While we could certainly use the innate ASCII values of characters to encode characters as integers in this program, the math is a bit easier if we instead represent 'A' as 0 instead of 65.

We will need to adjust the ASCII value so that 'A' can be converted to 0. This can easily be done by subtracting 65 from the value returned by ord(). Fortunately each uppercase letter follows in sequnce (e.g. ord('B') is 66, ord('C') is 67, etc.) so we can subtract 65 to get exactly the number we need for the previously mentioned strategy:

# get the underlying encoding of the character A
our_encoding = ord('A') - 65

# should print 0
print(our_encoding)

To go in the opposite direction, use chr():

back_to_character = chr(our_encoding + 65)

# should print A
print(back_to_character)

Make sure you fully understand these operations and why we use it before moving on to the next section.

E. Your Program Requirements

By the end of this assignment, you will have a caesar.py program which takes in command line arguments that will tell it whether to:

  • Encrypt a message using a particular key
  • Decrypt a cipher using a particular key to get the original message, or
  • Crack an encrypted message, giving us back the secret key

The message will be stored in a file, which will be read by the program.

To encrypt the message stored in plaintext.txt with a key of ‘G’, you would call

python caesar.py encrypt plaintext.txt G

To decrypt the message stored in cipher.txt with a key of ‘G’, you would call

python caesar.py decrypt cipher.txt G

To crack the message stored in cipher.txt, you would call

python caesar crack cipher.txt english.txt

cipher.txt and plaintext.txt are not files that we provide. You must construct them on your own. plaintext.txt can be any message whereas cipher.txt should be the result of encrypting your plaintext.txt with your program.

NOTE: this can seem like a lot, but take a deep breath. You will implement each of these functionalities one at a time. As long as you understand the general idea of how a Caesar cipher works and our chosen data representation, that should be fine enough to get started.

F. Starter Files & Background Knowledge

You should be able to go to the assignment on codio and see a few files present already:

  • caesar.py: Where you will be writing all of our code in for this assignment.
  • encrypted_message.txt: A file containing an encrypted message that we will use for evaluating the correctness of your crack functionality.
  • english.txt: a file containing frequencies that we will use for our crack functionality.
  • readme_caesar.txt: the readme you will submit for this assignment, similar to the ones you have done for previous assignments.

If you need to download the original files again, you can find them here.

In order to complete this assignment, you will need the following Python skills:

  • Manipulating strings as sequences of characters
  • Converting values between different types
  • manipulating lists and iterating over them by both index and element
  • using the modulo % operator to keep a number within a valid range
  • Reading command line arguments using sys.argv
  • Implementing functions and handling main
  • Using values from one list as indicies in another list
  • Reading information from a file object using open()

If you feel uncomfortable with these skills, we highly recommend reviewing:

G. Advice & Hints

Advice

Some of this advice is very similar to advice from previous assignments. That is because this is good advice that is worth repeating.

  • You can test your program easily by adding in temporary print() statements and function calls to the main() function. For instance, if you read in a line of text, you can print its value to check that it contains the information that you expect! When you’re done testing & debugging, you should remove these temporary print() statements.
  • Run and debug frequently! Make one small change at a time, ensuring that your program works as you expect it to after each change. We espeically recommend that you test each function (at least briefly) after you write it. Try editing main() to call the function you just wrote for various inputs and print out the results. Make sure it prints what you expect.
    • For example, make after implementing shift() add this to main() to make sure it prints the expected value: print(shift(24, 3)). Don’t forget to remove these prints when you are done with the assignment!
    • If you are not getting the outputs you expect, print out the values of variables at different points in the program (i.e. before a loop, inside the body of a loop, after the loop)
  • Comment frequently (this helps you and others understand the code) and use descriptive variable names.

Hints

These hints are specific to the homework that may help you debug. In particular, some of these things are just things that people frequently forget while working on the assignment. You may want to revisit these when you are stuck. Don’t hesitate to post on Ed or go to Office hours if these do not help though!

  • Don’t forget to subtract or add 65 when converting a character to/from a number
  • Don’t forget to convert things to upper case! This includes the text we want to encrypt/decrypt/crack and the key we use for encryption/decryption
  • Sometimes it may help to instead loop over a sequences by index instead of by element (e.g. for i in range(...): instead of for ele in seq:). You can also do both at once using enumerate().
  • A big part of this assignment is utilizing helper functions. Don’t forget that you should use the functions you write in one part of this assignment to implement later parts of the assignment.
  • Remember that printing and returning are NOT the same thing. Printing is a side effect of your program, whereas a return statement allows your output to propogate back to the code that called that function, at which point the returned value can be saved and used for whatever you need to do.

1. Functions & main():

A major part of this assignment is being familiar with functions. So before we write code, it is important to understand what they are and how they affect the program we are writing for this assignment.

Recall that functions allow us to intelligently break up our program into a collection of logical units that we can develop and test independently. We can often identify these major units by considering the high level structure of our program and what we will need to do.

Our caesar program is going to have to do a lot of separate tasks to get the desired functionality. We can take note of these “tasks” (which we described previously in this specification) and make a function for each of them. When you open the provided caesar.py, you can see that we have provided most function declarations for you.

For example, our basic encryption/decryption process will need to:

  • take in text as a string and convert each character in the string into our symbol encoding
  • shift those encoded characters either right (to encrypt) or left (to decrypt)
  • convert the encoded characters back into text for output (string)

Our task will then be to write functions for each of these major tasks (which we can do and test independently), and then combine those functions together into a working program.

One function that stands out from the others is the main() function. This function is special since this contains the code that we previously would have stored in our python file outside of any function. Having a main() function is not something that is always done in python, but is typically done when our program has functions that could be useful for other programs. Our main() should process command-line args, read in the text from the file we want to encrypt/decrypt/crack and call the appropriate functions for whatever task the user wants to do (encrypt/decrypt/crack).

For this assignment: all code you write should be inside a function unluess we specify otherwise


2. Converting and manipulating Symbol Lists

We will first tackle the lowest-level functions to handle encoding/decoding of strings by writing two functions: one to convert from a string to our symbol representation (a list of integers), and one to convert from our symbol representation (a list of integers) back to a string.

These should be the first two functions in caesar.py, but we have copied the function declaration and docstring here for convenience:

def string_to_symbol_list(string):
    """
    Description: converts a string to a symbol list, where each element of the
                 list is an integer encoding of the corresponding element of
                 the string.
    Input:  the message text (stored in a string) to be converted
    Output: the encoding of the message into a list of integers
    """

    # TODO: implement
def symbol_list_to_string(symbol_list):
    """
    Description: converts a list of symbols to a string, where each element
                 of the list is an integer encoding of the corresponding
                 element of the string.
    Input:  integer encoding of the message, stored in a list of integers.
    Output: the message text as a string
    """

    # TODO: implement

A. Implementing Our Conversion Functions

You should implement both of these functions according to their specification. We recommend starting with string_to_symbol_list() and then move on to symbol_list_to_string().

For string_to_symbol_list() you should create a list and iterate through each character of string to properly fill in the list. Recall that strings are sequences of characters and can be iterated over like we can with lists. Also recall our previous discussion above on how we represent our data as symbols.

Implementing symbol_list_to_string() should be very similar, except that you will be doing the conversion in the other direction. However, we encourage you to test and make sure your string_to_symbol_list() function works properly (see below) before implementing this symbol_list_to_string().

Note: you will need to call the upper() method on a string before converting it to a symbol list. This will make a copy of the string where all alphabet characters in the string are upercase. For example:

message = 'midi'
other = message.upper()

# prints 'midi'
print(message)

# prints 'MIDI'
print(other)

B. Testing Our Converesion Functions

Once we have implemented a function, we can test it by editing main to call our function and inspect the result.

To test string_to_symbol_list(), we can change our main() to call string_to_symbol_list('COMRADE') and print the returned value. Something like this:

def main():
    symbols = string_to_symbol_list('COMRADE')
    print(symbols)

Once you do that, you can run your program and see if it gets the correct output. If you do get the correct output, you may want to test to make sure it works for other values (such as ‘comrade’ which is all lowercase).

To test symbol_list_to_string() you can further modify main() by taking the symbol lists you printed when testing string_to_symbol_list(), feeding that list back into symbol_list_to_string() and then printing the resulting string. Be sure to try it on multiple inputs! Another example would be to try converting a string that has non-alphabetic characters such as 'BBF3' and making sure we can get the same string back when we convert to symbols and then connvert those symbols back to a string.

C. Shifting Symbols

The next step is to write a function to handle the shifting required to encrypt messages. This is the function in the base file with the function signature:

def shift(symbol, offset):
    """
    TODO: write docstring
    """
    
    # TODO: IMPLEMENT ME

symbol and offset should both be integers. If symbol is an English letter ( an encoded integer value between 0 and 25), then it should be shifted down the alphabet by offset amount (an integer between 0 and 25).

Remember to wrap symbols that go off the alphabet; for example, 22 (‘W’) shifted by 4 (‘E’) should return 0 representing ‘A’)! If you are confused and think it should return ‘B’, keep in mind that we zero-index our offsets. ‘A’ is a shift of 0, so ‘E’ is a shift of 4. Hint: Modulo (%) will be helpful here!

If symbol is not between 0 and 25, meaning it is some sort of whitespace or punctuation, then it should just be returned as is. (In our implementation, punctuation is not encoded and does not change during the encryption/decryption process.) We do this so all whitespace and punctuation is simply ignored in our encryption and decryption process.

Remember to write an appropriate docstring for your shift() function (as well as your other functions going forward).


D. Unshifting Symbols

We also must handle the unshifting (i.e., left shifting) of characters when we need to decrypt an already encrypted message. This function has the following function declaration in your caesar.py file:

def unshift(symbol, offset):
    """
    TODO: write docstring
    """
    
    # TODO: IMPLEMENT ME

This should simply do the inverse of what the shift() function did. Try thinking about how you will do this and do some examples before writing any code! Also, try to think how you can very easily do this by building on code you’ve already written.

This function will also handle symbols outside of the range 0 to 25 the same way the shift() function does (by just returning the symbol)

Reminder: You should write an appropriate docstring for this function as well

E. Testing Shifts

You should now take some time to test shift() and unshift(). You can do this by modifying the main() function similar to how we tested other functions earlier in the assignments. Try calling the functions on various inputs, printing the results, and verify that it prints the expected characters. Some tests you may want to try are:

  • A relatively basic case, e.g. shifting/unshifting 0 by 7
  • Shifting/unshifting past the end of the alphabet
  • Shifting/unshifting a symbol that is not an alphabetic character (e.g. 32)

In addition to testing each function on your own, you can also test by taking the output of shift() as input to unshift() - for example, unshift(shift(3, 5), 5) should be 3. You should also test the functions on their own though.

3. Higher Level Functions: Encrypting and Decrypting

Now that we’ve created functions to handle the encoding/decoding and shifting/unshifting of our messages, we can focus on the three high-level functions in our program outline. We will start with encrypt() before moving on to decrypt(), and finally crack().

Once we write these functions, the main() will then serve as the overall control for our program, calling one of the three functions above as indicated by the given command-line argument.

We will start by defining encrypt():

A. Encrypt

Using the functions you have already written, you should implement the function with the function signature of

def encrypt(message, key):
    """
    TODO: Add function docstring
    """

    # TODO: implement
    #       you probably want to change the return statement

    return ''

The basic operations carried out in the function should be:

  • Convert the message to a list of encoded symbols (which we represent as ints)
  • For each symbol in the list, shift it by the given offset
  • Return a string version of the encrypted symbol list

From the above description, we can see that a message is considered “encrypted” once it has been encoded into integer symbols, shifted, and converted back to a string representation.

Note: It is VERY important that your program take in the numerical representation of the key, and NOT a character. If you do not do this, you will fail most of the tests we provide.

B. Testing Encrypt

Once you have written encrypt(), you should be able to test encrypt() in main() and be able to encrypt the string, “ET TU, BRUTE?” using 6 (letter G) as the offset to get “KZ ZA, HXAZK?”. Be sure to write a comprehensive docstring for the function, and test it thoroughly (try encrypting other strings with various offsets and analyze the output) before proceeding to the next part.


C. Decrypt

Now that you can encrypt your message, you need to be sure you can decrypt it. Write the decrypt function, which has this method header in your file:

def decrypt(cipher, key):
    """
    TODO: Add function docstring
    """

    # TODO: implement
    #       you probably want to change the return statement

    return ''

The structure of this function is very similar to encrypt(), but instead of using shift(), you should be using unshift() and then you must decode the unshifted symbols before converting back to a string.


D. Testing Decrypt

Once you have written decrypt(), you should be able to use decrypt() in main() with “KZ ZA, HXAZK?” and 6 (letter G) as the key to obtain “ET TU, BRUTE?”

You should add some code in main() to test it. At this point, if you run a string through encrypt() and then decrypt(), the output should be the same as the input string.

Here is a sample test that you can use for reference. Please write your own test.

message = 'I AM CONSTANT AS THE NORTHERN STAR'
cipher = encrypt(message, 6)
decrypted = decrypt(cipher, 6)
print(decrypted) # should be the original message (in upper case)

E. main() and reading from a file

Once you are confident that encrypt() and decrypt() are working, you should comment out anything you’ve written in your main(). In order to obtain the message you will be encrypting you will need to read in data (the message) from a file.

Look back at your personality_quiz.py from last week and make use of open() to do this.

  • HINT: The function that will be most helpful is file.read() which returns the entire text of the file as a string. You may also find file.readlines() to be helpful elsewhere in the assignment (that function returns a list of all lines in the file). Remember that when dealing with lines, you’ll need to use .strip() to get rid of leading/trailing whitespace.

Write code in main() to:

  • Handle the command line arguments
  • Load the specified text file
  • Call the appropriate function to encrypt/decrypt/crack the message
  • Print out the result

F. Testing main

Before you move on to the next part, you should be able to run the following operations. For this testing, you will need to create a text file with a message of your choice (come up with your own, or take an example from elsewhere in this specification). Once you encrypt the message, add the encrypted output to the same text file or another. Call decrypt (see command line example below) with the text file containing the message you encrypted and see if you get the original message you created.

To encrypt a message in filename.txt using the symbol value of 6 (character G):

# (filename.txt is not a file we provide)
python caesar.py encrypt filename.txt G

To decrypt the message in another file filename_two.txt using the symbol value of 6 (character G):

# (filename_two.txt is not a file we provide)
python caesar.py decrypt filename_two.txt G

If you’d like a refresher on command-line arguments or reading from a file, you can look back to the PersonalityQuiz writeup.

You should also test that your functions “work together” before moving forward (that decrypting gets you the same message you encrypted, ignoring case). You can test this by creating a test file called message.txt and putting some English sentences in it. Run encrypt() on this file through the command line, saving the output to a new file called test_encryption.txt, and then run decrypt() on that through the command line. If you get back the same original message, then you’re probably in a good place to keep moving forward.


4. Higher Level Functions: Cracking the Cipher

A. Understanding the Cracking Process

The first recorded instance of a systematic breaking of the Caesar cipher came from Al-Kindi in Mesopotamia in the 9th Century: almost 1000 years after Caesar’s death!

Now that you have worked out encoding, decoding, encrypting, and decrypting messages, we will add one more step to reach the final goal of cracking the Caesar cipher. This final step involves letter frequency analysis. This simply means you will be taking a text and counting how often each letter in the alphabet appears throughout the text. English text will have lots of “E”s and “A”s, while an encrypted cipher may have an unusually high number of “X”s or “B”s.

B. Letter Frequency Analysis

Before you can analyze the Caesar cipher text to crack it, you will need the frequency with which the 26 letters in the alphabet appear in the English language. We have given these frequencies for each letter in english.txt. The first line will have the frequency of the letter ‘A’, the second ‘B’, and so on. You should take a look at the file, and note that it has 26 lines.

Implement the function get_letter_frequencies() that takes as input the name of the file containing frequencies to be read and returns a list of floating point numbers with the frequencies contained in that text file. The first element (index 0) in the list should be the frequency of ‘A’, the second should be the frequency of ‘B’, and so on.

Once you have the frequencies of characters in English it is time to obtain the frequencies in the cipher text file. In order to do this, implement the function find_frequencies(). The input will be the int list representation of the cipher text and it should return a list of floating point numbers where the first element in the list is the frequency of ‘A’ in the ciphertext, and so on.

To find the frequency of each letter, first find the number of times each letter appears in the ciphertext. Then, divide all of these numbers by the total number of letters in the ciphertext. Ignore any non-letter symbols.

Write code in main() to test your functions. (Hint: Make a simple string with a known letter frequency, encode it, and make sure your find_frequencies() gives you back the appropriate frequency counts.)

C. Scoring the Frequency

The next step in cracking the cipher will be scoring the frequencies you found in the last step. At a high level, this is done by comparing the letter frequencies of the text with the frequency they are expected to show up in the english language. We calculate this score by finding the absolute value of the difference between each letter in the ciphertext frequencies and its corresponding letter in the English frequencies.

For example, given a list for frequencies of A, B, and C in a message as [ 0.2, 0.1, 0.0 ] and the English language frequencies are [ 0.1, 0.1, 0.05 ]. The total score of the messsage’s frequencies would be:

abs(0.2 - 0.1) + abs(0.1 - 0.1) + abs(0.0 - 0.05) = 0.15

This score essentially tells us how well the letter frequency in the message matches what we would expect if the message were in English, rather than encrypted. The lower the score, the closer the message is to decrypted English. We can use this later to crack the Caesar cipher by trying different keys and picking the one that gives us the lowest score!

Implement the function score_frequencies() that takes in two frequencies lists and returns the difference between each value pair in the lists, as described above.

We would recommend testing your function to make sure that it is calculating the correct absolute value. Think about how you could utilize your main function with lists to help you achieve this.

D. Crack

Now for the fun part! You should be writing a function called crack() to perform these steps outlined below. Note that this function declaration has not been provided for you in caesar.py and is something you must write yourself. We’d like you to think about what the inputs to this function should be, what it should return, and whether you’d like any other functions to help you out. There are several correct ways to do this, so really think about the benefits and drawbacks to your design!

The written decryption and encryption code allows us to be able to hide and decipher cool messages but what if we don’t know the key? In order to crack a cipher with an unknown key we can guess the key via brute force until we find the correct one. We will implement this in crack().

One method to do this is to try to decrypt the cipher message with all possible keys (‘A’ through ‘Z’). At each decryption find the frequencies of the letters in our newly “decrypted” message and use this to determine the frequency score (using score_frequencies()). You want to find which key will give us a message that looks most similar to English, so its frequency score should be the lowest among all keys.

Finding the minimum score key will be a similar process as finding the closest pairs in a list, so looking back at that example from lecture may be helpful.

Once you’ve tried decrypting with all possible keys, return the message found using the key with the lowest score. This should be the original message!

5. Final touches

Once you have written crack(), you should add code to your main so that you can run caesar.py using:

python caesar.py crack cipher.txt english.txt

Note cipher.txt is not a file that we provide. It is an example of a generic cipher. See encrypted_message.txt for a test case.

This command will crack the cipher in cipher.txt, using the letter frequencies stored in english.txt, which your program must also load. It should print out the cracked message.

Once you have this working, and tested crack(), you have a program that can encrypt a message, decrypt it with the key, and break encryption. You’ve just cracked Caesar!

Try out your program by cracking the provided file, encrypted_message.txt. You will need the answer to this for your readme_caesar.txt.

Before you submit, make sure that you can perform decrypt, encrypt, and crack using command line arguments. In addition, make sure that all of your methods have docstrings and that your program header has all possible executions of the caesar.py file, including encrypting, decrypting, and cracking, as they may require different command line arguments.


6. Submission

Submit caesar.py and readme_caesar.txt on Gradescope.

Important: Your code will automatically be tested for correctness when you submit and display a score out of 35 points. This will account for most of your grade for the assignment. We don’t display details on failing tests, but please use the title of the tests as a hint for how to correct your code. If you encounter any autograder-related issues, please make a private post on Ed.