Network Construction and Analysis Project
Networked Life
Spring 2005
Prof. Michael Kearns
PROJECT DESCRIPTION
In the early portion of Networked Life, one of your main tasks will be to construct a network based on a source of real-world data of your own choosing, to measure and analyze various properties of your network, and to compare these properties with those arising in other networks and models we will examine in the course. The hope is that this will be a creative and fun project that will help illustrate, ground and possibly challenge some of the abstract concepts about networks we will be studying.
The project will be broken over time into a number of manageable assignments. More details will be provided below as they develop.
DESCRIPTION AND TIMELINE OF ASSIGNMENTS
Task 1: Identification of data source and definition of network. Due as a hardcopy write-up at the start of class, Thursday, Jan 27.
In this first assignment, you should identify a specific source of real-world data, the precise definition of the network (vertices and edges) you plan to extract from this data, and the methodology by which you will extract it.
We will be generous with the term "real-world", which could include data from the domains of biology, sociology, economics and finance, technology, etc. However, it must be a well-defined, objective data source gathered by a third party. An example of an entirely acceptable data source is the recently released corpus of emails exchanged by Enron executives, where it would be natural to examine the network of whom exchanged email with whom. An example of an unacceptable data source and network would be "I wrote down a list of all my friends and then connected any pair of them that I thought shared a lot of common interests". This example is too subjective and the data is not gathered by a third party.
To be sure there is some minimal level of complexity to your network, we require that the number of vertices in the network be at least 25, and the total number of edges in the network to be at least 25. However, considerably more ambitious networks are encouraged.
By the "methodology" by which you will extract your network, we mean how you plan to go from the raw data source and your defined network to an acutal representation of your network in our simple format (see below, but essentially nothing more than a list of all the vertices in your network, followed by a list of all those pairs of vertices that are connected by an edge). Examples would be the careful application of your definition to the data "by hand", or via the use of Excel macros, or via the creation of a Perl script that computes the network from the raw data, and so on.
For Task 1, you should submit a brief write-up detailing the information described above for your network. If your data source is online, please provide the URLs for the source; feel free to include a small portion of the raw data in your write-up if it would be helpful to do so. Be sure to be as precise as possible in all aspects of your write-up, from network definition to methodology. As an informal test, your write-up should be sufficiently precise that a third party could independently create the same network you will from your description.
As a rough guideline, I would expect most write-ups to be between 1 and 4 pages long.
Tips on finding and downloading data sources: Here are a couple of simple tips. First of all, to find a data source that is closest to your idea, it might be helpful to append the words "database" or "statistics" to your basic Google query --- i.e. search for "movie credits database" as opposed to just "movie credits".
For those of you considering automating or partially automating the compilation of your data source or network extraction, a very handy utility is the program known as wget, which takes any URL and downloads the html found at that URL into a local file, thus letting you download many web pages without tediously having to surf them manually. The wget program is installed on Eniac and most of the other servers around, and there are also Windows versions available on the web. On Eniac, just typing "wget" at the command line will tell you how to get more info on usage.
Task 2: Extraction of your network from your data source. Due by electronic mail to Elliot and Prof Kearns (fengyi@seas.upenn.edu, mkearns@cis.upenn.edu) by midnight on Tuesday, March 1, in the exact form specified below. The subject line of your email MUST be "Task 2".
Once we have approved your choices in Task 1, you will then be asked to go ahead and use the precise definition of vertices and edges you gave and extract the network from the data. Note that while Task 1 was entirely conceptual, here's where you will actually need to manipulate your data source. Depending on how large and complex the network you defined in Task 1 is, this manipulation may require, or be greatly aided by, the use of various software tools like Excel. However, the lower bound on network size of 25 vertices and 25 edges was chosen to allow the possibility of doing all steps of this project "by hand" if you so prefer.
You will be asked to submit your network as a file in a specific format that is extremely straightforward --- essentially simply listing the names or IDs of your vertices, followed by a list of pairs of vertices that are connected by an edge.
More precisely, in the email you send by March 1 to receive credit for the assignment, the body of the email should be plain text (no attachments), and should have the following (sample) form:
graph G {
Alice;
Bob;
Chuck;
Dora;
Alice -- Bob;
Alice -- Dora;
Bob -- Chuck;
Bob -- Dora;
}
In this example, we first "declare" there to be four vertices that will have the names or labels "Alice", "Bob", "Chuck" and "Dora"; and then we declare the four edges listed above. Of course, the names or labels can be anything you like, so we suggest you make them meaningful in the context of your network. But you should probably avoid strange characters, underscores, etc. The syntax above is important --- i.e. you need semicolons after each vertex and edge declaration, you need to use "--", not "-", etc.
We will then provide you with web access to a program that will generate a nice visualization of your network from this formatted file, and we will also create and examine a gallery of all the networks for the class.
Task 3: FINAL REPORTS: Quantitative and Qualitative Analysis. Exact due date to
be determined, but sometime towards the end of finals week.
The culmination of the project will be a written report analyzing your
network from the perspective of the themes of the entire class. I anticipate
that the typical length of these reports, including all figures, diagrams,
references, etc. will be between 10 and 15 pages.
You should think of the
overarching goal of the report to be twofold: (a) Demonstrating your
mastery of the technical and conceptual themes of the class, and
(b) Demonstrating your ability to apply these themes to a specific
real-world network.
You should view the inputs to your report as including at least
the following items:
While I am flexible about the specific format of the reports, there will
be some required elements, and a number of themes and topics that you
should be sure to touch on.
EVALUATION CRITERIA
Your efforts on the project will be evaluated according to
a few criteria:
SAMPLE RESOURCES
To stoke the creative fires, below we provide some sample sources of real-world
data, and networks that one might extract from them.
These are only examples, and you are encouraged to find
your own data source.
In particular, nobody should propose extracting precisely one of the
networks suggested below, though you are free to use the data sources below
to extract other networks of interest.
We will add to this table as other sources come to mind.
Data Source
Description
Comments and Sample Networks
The Internet Movie Data Base
An extensive data base on films that has been the subject of many
social network research projects.
The vertices could be all horror movies, with an edge between any
pair that had the same film editor.
United Nations Statistics Division
Extensive collection of data sets collected by the U.N. over many years. Includes
economic and trade data, demographic data, housing data, etc.
Later in the course, we will examine a network extracted from U.N. data by letting
the vertices be sovereign nation, with an edge between two if they had a "significant"
amount of foreign exchange in the year 2001.
Bureau of Labor Statistics
Labor-related data collected by the U.S. government.
Vertices could be various quantities tracked over time by BLS, like
unemployment, consumer prices, productivity, wages, etc. There could be
an edge between two such quantities if historically they are significantly correlated.
National Institute of Standards and Technology
NIST provides data on all kinds of topics, including (for example)
the
Reuters Corpus,
a large collection of news articles.
Many researchers have extracted interesting networks from the Reuters Corpus,
such as the network over articles in which edges represent pairs of articles with a common
topic(s), according to some objective criterion.
CAIDA data
Data collected by this well-known Internet research consortium.
Since this organization studies the Internet, there are many "obvious" networks
one could extract, but probably more subtle ones that would be interesting as well.
CERT data
CERT is a well-known consortium for documenting network and computer security vulnerabilities.
There may be links other than the one given where data can be obtained from CERT.
One could let the vertices be operating systems and their versions, with an edge between
two if they ever were vulnerable to a common security exploit.
Yahoo! finance
Here you can download all kinds of historical data on the financial markets.
Vertices could be the stocks in the S&P 500. There is an edge between every pair
of stocks that hit their 2004 within 30 days of each other.