Network Construction and Analysis Project

Network Construction and Analysis Project
Networked Life
Spring 2005
Prof. Michael Kearns

PROJECT DESCRIPTION

In the early portion of Networked Life, one of your main tasks will be to construct a network based on a source of real-world data of your own choosing, to measure and analyze various properties of your network, and to compare these properties with those arising in other networks and models we will examine in the course. The hope is that this will be a creative and fun project that will help illustrate, ground and possibly challenge some of the abstract concepts about networks we will be studying.

The project will be broken over time into a number of manageable assignments. More details will be provided below as they develop.

DESCRIPTION AND TIMELINE OF ASSIGNMENTS

Task 1: Identification of data source and definition of network. Due as a hardcopy write-up at the start of class, Thursday, Jan 27.

In this first assignment, you should identify a specific source of real-world data, the precise definition of the network (vertices and edges) you plan to extract from this data, and the methodology by which you will extract it.

We will be generous with the term "real-world", which could include data from the domains of biology, sociology, economics and finance, technology, etc. However, it must be a well-defined, objective data source gathered by a third party. An example of an entirely acceptable data source is the recently released corpus of emails exchanged by Enron executives, where it would be natural to examine the network of whom exchanged email with whom. An example of an unacceptable data source and network would be "I wrote down a list of all my friends and then connected any pair of them that I thought shared a lot of common interests". This example is too subjective and the data is not gathered by a third party.

To be sure there is some minimal level of complexity to your network, we require that the number of vertices in the network be at least 25, and the total number of edges in the network to be at least 25. However, considerably more ambitious networks are encouraged.

By the "methodology" by which you will extract your network, we mean how you plan to go from the raw data source and your defined network to an acutal representation of your network in our simple format (see below, but essentially nothing more than a list of all the vertices in your network, followed by a list of all those pairs of vertices that are connected by an edge). Examples would be the careful application of your definition to the data "by hand", or via the use of Excel macros, or via the creation of a Perl script that computes the network from the raw data, and so on.

For Task 1, you should submit a brief write-up detailing the information described above for your network. If your data source is online, please provide the URLs for the source; feel free to include a small portion of the raw data in your write-up if it would be helpful to do so. Be sure to be as precise as possible in all aspects of your write-up, from network definition to methodology. As an informal test, your write-up should be sufficiently precise that a third party could independently create the same network you will from your description.

As a rough guideline, I would expect most write-ups to be between 1 and 4 pages long.

Tips on finding and downloading data sources: Here are a couple of simple tips. First of all, to find a data source that is closest to your idea, it might be helpful to append the words "database" or "statistics" to your basic Google query --- i.e. search for "movie credits database" as opposed to just "movie credits".

For those of you considering automating or partially automating the compilation of your data source or network extraction, a very handy utility is the program known as wget, which takes any URL and downloads the html found at that URL into a local file, thus letting you download many web pages without tediously having to surf them manually. The wget program is installed on Eniac and most of the other servers around, and there are also Windows versions available on the web. On Eniac, just typing "wget" at the command line will tell you how to get more info on usage.

Task 2: Extraction of your network from your data source. Due by electronic mail to Elliot and Prof Kearns (fengyi@seas.upenn.edu, mkearns@cis.upenn.edu) by midnight on Tuesday, March 1, in the exact form specified below. The subject line of your email MUST be "Task 2".

Once we have approved your choices in Task 1, you will then be asked to go ahead and use the precise definition of vertices and edges you gave and extract the network from the data. Note that while Task 1 was entirely conceptual, here's where you will actually need to manipulate your data source. Depending on how large and complex the network you defined in Task 1 is, this manipulation may require, or be greatly aided by, the use of various software tools like Excel. However, the lower bound on network size of 25 vertices and 25 edges was chosen to allow the possibility of doing all steps of this project "by hand" if you so prefer.

You will be asked to submit your network as a file in a specific format that is extremely straightforward --- essentially simply listing the names or IDs of your vertices, followed by a list of pairs of vertices that are connected by an edge.

More precisely, in the email you send by March 1 to receive credit for the assignment, the body of the email should be plain text (no attachments), and should have the following (sample) form:

graph G {
Alice;
Bob;
Chuck;
Dora;
Alice -- Bob;
Alice -- Dora;
Bob -- Chuck;
Bob -- Dora;
}

In this example, we first "declare" there to be four vertices that will have the names or labels "Alice", "Bob", "Chuck" and "Dora"; and then we declare the four edges listed above. Of course, the names or labels can be anything you like, so we suggest you make them meaningful in the context of your network. But you should probably avoid strange characters, underscores, etc. The syntax above is important --- i.e. you need semicolons after each vertex and edge declaration, you need to use "--", not "-", etc.

We will then provide you with web access to a program that will generate a nice visualization of your network from this formatted file, and we will also create and examine a gallery of all the networks for the class.

Task 3: FINAL REPORTS: Quantitative and Qualitative Analysis. Exact due date to be determined, but sometime towards the end of finals week.

The culmination of the project will be a written report analyzing your network from the perspective of the themes of the entire class. I anticipate that the typical length of these reports, including all figures, diagrams, references, etc. will be between 10 and 15 pages. You should think of the overarching goal of the report to be twofold: (a) Demonstrating your mastery of the technical and conceptual themes of the class, and (b) Demonstrating your ability to apply these themes to a specific real-world network.

You should view the inputs to your report as including at least the following items:

The original data used to form your network.

Visualizations of your network or fragments of it. If you would like different visualizations than those we have already generated for you, you should communicate with the TAs and they will help you.

Quantitative and numerical analyses performed on your network.

Your knowledge of the domain from which the network comes.

Any articles or other materials you feel are relevant.

While I am flexible about the specific format of the reports, there will be some required elements, and a number of themes and topics that you should be sure to touch on.

Your report should begin with a precise recounting of the exact definition of your network (edges and vertices), as well as of your methodology in constructing the network, including all data sources. Again, the goal here is to be precise enough that a third party could replicate your efforts.

Your report should analyze various structural properites of your network, including (but not restricted to) things like its diameter, clustering coefficient(s), degree distribution, and so on. You should consider doing computations (either by hand or computer) of these quantities, or at least make an effort to estimate or bound them. You are free to approach the TAs for help here as well, but please be aware that they are busy and will help you on their own schedules.

You should further discuss these structural properties in light of what you know about what actually happens on your network, or what aspect of the real world it represents. For instance, if there are vertices with extraordinarily high degree, you should discuss why it makes sense that they have high degree (if in fact it does make sense), how they might have come to have this property, and so on. Ditto for the other structural properties you discuss.

You should discuss plausible generative models for your network, and contrast them with those we examined in class. You should discuss which of the generative models we examined seem like better or worse explanations for how your network formed.

You should discuss how structural properties of your network either do or do not influence what actually happens on your network.

You should make liberal use of network visualizations, graphs, plots, charts, tables, etc. Anything that helps make your points more precise and understandable is encouraged.

Other guidance will be added to this list as it occurs to me.

EVALUATION CRITERIA

Your efforts on the project will be evaluated according to a few criteria:

Creativity. How novel and imaginative your data source and network are will be considered. Try to both have fun and examine a network whose structural properties might actually show something interesting about some aspect of the world. It might be helpful to define and examine a network from an area you have a personal interest in.

Scale and Complexity. While it will be possible to succeed on the project examining a small but interesting network, we will also appreciate those who tackle the extraction and analysis of large-scale networks.

Depth of Analysis. Your grasp of the main principles of networks discussed in class and your ability to apply them in a compelling way to your network (including if your network seems to violate some of the principles) will be considered.

SAMPLE RESOURCES

To stoke the creative fires, below we provide some sample sources of real-world data, and networks that one might extract from them. These are only examples, and you are encouraged to find your own data source. In particular, nobody should propose extracting precisely one of the networks suggested below, though you are free to use the data sources below to extract other networks of interest.

We will add to this table as other sources come to mind.

Data Source Description Comments and Sample Networks
The Internet Movie Data Base An extensive data base on films that has been the subject of many social network research projects. The vertices could be all horror movies, with an edge between any pair that had the same film editor.
United Nations Statistics Division Extensive collection of data sets collected by the U.N. over many years. Includes economic and trade data, demographic data, housing data, etc. Later in the course, we will examine a network extracted from U.N. data by letting the vertices be sovereign nation, with an edge between two if they had a "significant" amount of foreign exchange in the year 2001.
Bureau of Labor Statistics Labor-related data collected by the U.S. government. Vertices could be various quantities tracked over time by BLS, like unemployment, consumer prices, productivity, wages, etc. There could be an edge between two such quantities if historically they are significantly correlated.
National Institute of Standards and Technology NIST provides data on all kinds of topics, including (for example) the Reuters Corpus, a large collection of news articles. Many researchers have extracted interesting networks from the Reuters Corpus, such as the network over articles in which edges represent pairs of articles with a common topic(s), according to some objective criterion.
CAIDA data Data collected by this well-known Internet research consortium. Since this organization studies the Internet, there are many "obvious" networks one could extract, but probably more subtle ones that would be interesting as well.
CERT data CERT is a well-known consortium for documenting network and computer security vulnerabilities. There may be links other than the one given where data can be obtained from CERT. One could let the vertices be operating systems and their versions, with an edge between two if they ever were vulnerable to a common security exploit.
Yahoo! finance Here you can download all kinds of historical data on the financial markets. Vertices could be the stocks in the S&P 500. There is an edge between every pair of stocks that hit their 2004 within 30 days of each other.

Data Source	Description	Comments and Sample Networks
The Internet Movie Data Base	An extensive data base on films that has been the subject of many social network research projects.	The vertices could be all horror movies, with an edge between any pair that had the same film editor.
United Nations Statistics Division	Extensive collection of data sets collected by the U.N. over many years. Includes economic and trade data, demographic data, housing data, etc.	Later in the course, we will examine a network extracted from U.N. data by letting the vertices be sovereign nation, with an edge between two if they had a "significant" amount of foreign exchange in the year 2001.
Bureau of Labor Statistics	Labor-related data collected by the U.S. government.	Vertices could be various quantities tracked over time by BLS, like unemployment, consumer prices, productivity, wages, etc. There could be an edge between two such quantities if historically they are significantly correlated.
National Institute of Standards and Technology	NIST provides data on all kinds of topics, including (for example) the Reuters Corpus, a large collection of news articles.	Many researchers have extracted interesting networks from the Reuters Corpus, such as the network over articles in which edges represent pairs of articles with a common topic(s), according to some objective criterion.
CAIDA data	Data collected by this well-known Internet research consortium.	Since this organization studies the Internet, there are many "obvious" networks one could extract, but probably more subtle ones that would be interesting as well.
CERT data	CERT is a well-known consortium for documenting network and computer security vulnerabilities. There may be links other than the one given where data can be obtained from CERT.	One could let the vertices be operating systems and their versions, with an edge between two if they ever were vulnerable to a common security exploit.
Yahoo! finance	Here you can download all kinds of historical data on the financial markets.	Vertices could be the stocks in the S&P 500. There is an edge between every pair of stocks that hit their 2004 within 30 days of each other.