Corrections and Clarifications
- Actually handed out Jan 22 (Jan 22)
- Number of experiments to run in problem 1 is thirty, not twenty. (Jan 22)
- Question 4 clarified: You should compare your list to Google's list; GoogleBrowser's relationship to a search on Google.com is explained. (Jan 29)
This problem set has seven problems for a total of 100 points. Some problems have several parts.
This assignment is to be handed in at the beginning of class on paper, not submitted by email.
The instructions to "answer in about X words" are guidelines. You are not required to write any particular number of words, but we expect that complete answers will be about that long.
Visit the epidemic simulation provided by Sangkyum Kim: http://www.cis.upenn.edu/~sangkyum/applet_codes/epidemic/Epidemic.html
The simulator draws a two-dimensional lattice. Each square represents a node in the network. Each node has eight neighbors to which it is directly connected. A non-zero rewiring probability allows for a chance that a node is also connected to some other random node on the graph.
("Random seed" is just a number that is used by the applet in a function that generates pseudo-random numbers. You can type some arbitrary number in there, if you like.)
Set the model speed to "hyper" and run thirty simulations with a fixed infection probability of .19. Only the rewiring probability will vary during your experiments.
Run five simulations for each of the following rewiring probability values: 0, .2, .4, .6, .8, 1.
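If you want a feel for what the applet is doing before you start collecting data, the model can be sketched in a few lines of Python. The applet's exact update rule is not documented here, so this is a guess at its semantics: an SI (susceptible/infected) process on a wrapped lattice with eight-neighbor (Moore) neighborhoods, where "rewiring" adds one extra random link per node with the given probability.

```python
import random

def build_lattice(n, rewiring_prob, rng):
    """n x n lattice with wraparound; every node links to its 8
    surrounding squares (Moore neighborhood). With probability
    rewiring_prob a node also gets one link to a uniformly random
    node -- one plausible reading of the applet's "rewiring"."""
    neighbors = {}
    for r in range(n):
        for c in range(n):
            nbrs = {((r + dr) % n, (c + dc) % n)
                    for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                    if (dr, dc) != (0, 0)}
            if rng.random() < rewiring_prob:
                extra = (rng.randrange(n), rng.randrange(n))
                if extra != (r, c):
                    nbrs.add(extra)
            neighbors[(r, c)] = nbrs
    return neighbors

def run_epidemic(n=20, infection_prob=0.19, rewiring_prob=0.2, seed=42):
    """SI process: each step, every infected node independently
    infects each susceptible neighbor with infection_prob. Runs until
    no new infections occur; returns (steps, final number infected)."""
    rng = random.Random(seed)
    neighbors = build_lattice(n, rewiring_prob, rng)
    infected = {(n // 2, n // 2)}       # one initial case in the center
    steps = 0
    while True:
        new = {v for u in infected for v in neighbors[u]
               if v not in infected and rng.random() < infection_prob}
        if not new:
            return steps, len(infected)
        infected |= new
        steps += 1
```

Running this for several rewiring probabilities mirrors the experiment above, though your reported numbers must come from the applet itself, not this sketch.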
Describe a statistical process by which large networks are generated that might lead to a heavy-tailed distribution of degrees (in which most vertices have relatively small degree, but a small fraction have inordinately large degree). This may be an informal English description of how new vertices are added, and the process by which they acquire links to other vertices, or it may be a more mathematical, formal description (e.g. an algorithm for generating networks). In either case you should strive to be as precise as you can. Try to make your process as decentralized as possible. You should then argue, again as precisely as possible, why your process generates heavy-tailed distributions, and discuss what types of natural networks it might model well.
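For concreteness, one well-known process of the kind this question is after is preferential attachment ("rich get richer"); it is offered here as an illustration of the level of precision expected, not as the only acceptable answer. A minimal sketch:

```python
import random
from collections import Counter

def preferential_attachment(num_nodes, m=2, seed=0):
    """Grow a network one vertex at a time. Each newcomer attaches m
    edges to existing vertices chosen with probability proportional
    to their current degree: sampling uniformly from `endpoints`,
    where each vertex appears once per incident edge, does exactly
    that degree-proportional sampling."""
    rng = random.Random(seed)
    edges = [(0, 1), (1, 2), (0, 2)]        # seed graph: a triangle
    endpoints = [v for e in edges for v in e]
    for new in range(3, num_nodes):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        for t in targets:
            edges.append((new, t))
            endpoints.extend((new, t))
    return edges

# Most vertices stay near the minimum degree m, while a few early
# vertices accumulate inordinately many links.
degrees = Counter(v for e in preferential_attachment(200) for v in e)
```

The intuition for the heavy tail: a vertex that happens to gain links early becomes proportionally more likely to gain further links, so small initial advantages compound.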
Consider a network in which person A has ten friends, the ten friends of person A each have ten friends, and those ten friends each have ten friends. There are no other people in this network besides the ones that have been mentioned. We assume that people cannot be their own friends.
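Before reasoning about this network, it helps to pin down how many people the description can involve. The tally below rests on assumptions that are mine, not the problem's: friendship is symmetric, and apart from the forced link back to the person one level up, everyone mentioned is a distinct person.

```python
def network_size():
    # Assumption (not stated in the problem): friendship is symmetric,
    # and aside from the link back to one's "introducer", all the
    # people mentioned are distinct.
    a = 1                  # person A
    friends = 10           # A's ten friends
    fof = friends * 9      # each friend's ten friends include A
    fofof = fof * 9        # each of those ten include their introducer
    return a + friends + fof + fofof
```

Under other assumptions (overlapping friend circles, asymmetric "friendship") the count would differ, which is worth saying explicitly in an answer.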
Before consulting any Web resources on the subject, list the ten universities that seem to you to be most similar to Penn. This is "your list."
Now, visit TouchGraph's GoogleBrowser: http://www.touchgraph.com/TGGoogleBrowser.html Enter "www.upenn.edu" as your starting URL. You should also visit Google and simply type "related:www.upenn.edu" as your search query. The GoogleBrowser and the "related" search on Google.com are giving you two different visualizations of the same information that is in Google's database.
Take a look at the top ten universities whose Web sites seem to be the most related to Penn's, according to Google's analysis. (This is probably easier in Google itself, since results are listed in order of similarity.) The ranking of "related" pages in Google's database is "Google's list" and can be observed several ways, but just look at the Google search results for "related:www.upenn.edu."
Pick two or three of the most surprising differences between the top results on both lists, that is, "your list" and "Google's list." What factors could account for these differences? (Looking at the GoogleBrowser may help you to determine this, since it gives some visual information about who links to whom.) What things were you thinking about when you wrote your list, and what factors are important to Google's analysis?
Pick a university on your list that was not "similar" to Penn according to Google. Find a path of similarity, either using the TouchGraph GoogleBrowser or using Google itself, between Penn and that university. Write down the path that you found. How long did it take you to find it? What strategy did you use to find that path?
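The path-hunting in this question is really a graph-search problem, and it can help to think of your manual clicking that way. Here is a breadth-first sketch in which `related(url)` is a hypothetical helper standing in for however you read off Google's related sites (it is not a real API call):

```python
from collections import deque

def find_path(start, goal, related, max_depth=4):
    """Breadth-first search over the 'related sites' relation.
    `related(url)` is assumed to return the sites reported as related
    to `url`. Returns the shortest path found, or None."""
    frontier = deque([[start]])
    seen = {start}
    while frontier:
        path = frontier.popleft()
        if path[-1] == goal:
            return path
        if len(path) > max_depth:
            continue
        for nxt in related(path[-1]):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(path + [nxt])
    return None
```

Whether you searched breadth-first, depth-first, or by intuition about which schools would be "between" the two is exactly the kind of strategy the question asks you to describe.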
Hasbro developed an electronic game called Pox a few years ago. People marketing the game went to schools in Chicago and conducted research in an attempt to determine which kids were coolest. They asked kids to rank others in terms of coolness, asked the coolest of these to name the people they thought were the coolest, and so on, until they found the "top dogs" who were seen as the coolest kids. Then they gave free units of their game to these "top dogs." Leaving aside for a moment the social atrocity that the company committed, what are some reasons that this marketing strategy might be ineffective in transmitting Pox, given Gladwell's discussion of the way that product "epidemics" spread?
A few years ago, almost overnight, Napster became the biggest peer-to-peer file sharing network in the world. However, soon after the company was bought by Bertelsmann it was shut down for legal reasons. (A new service has relaunched under the Napster name.) Now Kazaa (using a Gnutella-like scheme) has taken over and seems to have remained successful, despite similar challenges. Outline the network dynamics and statics of both networks. What are the fundamental differences and where do they overlap? In what sense is Kazaa's network structure responsible for its long survival? (This exercise might require some online research on the different file sharing companies and technologies.)
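The structural contrast at the heart of this question can be sketched in toy form. Real protocol details (supernodes, time-to-live tuning, and so on) are glossed over; this only contrasts a Napster-style central index, a single point of failure, with Gnutella-style query flooding, where no one node is essential:

```python
from collections import deque

def napster_lookup(central_index, filename):
    """Napster-style: one central server maps each filename to the
    peers holding it. Shut down the server and all lookups fail."""
    return central_index.get(filename, [])

def gnutella_lookup(peers, start, filename, ttl=3):
    """Gnutella-style: flood the query outward from `start` for up to
    `ttl` hops. `peers[p]` = (set of neighbor ids, set of files held).
    Removing any single peer only prunes part of the search."""
    hits, seen = [], {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        neighbors, files = peers[node]
        if filename in files:
            hits.append(node)
        if depth < ttl:
            for n in neighbors:
                if n not in seen:
                    seen.add(n)
                    frontier.append((n, depth + 1))
    return hits
```

In your answer, relate this difference in lookup structure to each network's legal and technical vulnerability.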
For each of the following general categories of networks discussed in class, give a real-world example of such a network other than those given in class. You should be as clear as possible about what the vertices (nodes) are and about what relationship between two vertices determines whether or not there is an edge (link) between them. For each of your networks, comment in a sentence or two on its static structure, its dynamics, and its formation: