You are What You Link

Lada A. Adamic
Xerox PARC
Stanford University
ladamic@parc.xerox.com
Eytan Adar
HP Sand Hill Laboratory
eytan@hpl.hp.com

Abstract

The Web has become a massive repository of a wide variety of data. Hidden in this pile of data is information about web users, in the form of the text on their personal homepages, links to and from those homepages, and users' mailing lists. Our study of the publicly available user information revealed a complex network of interpersonal links and a rich mining ground for studying social phenomena. We have developed a web application to gather such data and present it in an easily browsable form.

1. Introduction

Before the advent of web e-commerce, even before the advent of large scale web directories such as Yahoo, web users were busily utilizing the Web to broadcast information about themselves through personal homepages. The practice spread fast, as users became aware of others' homepages and realized that they could easy make one of their own. Homepages are so prevalent on the web and reveal so much about the individuals who write them, yet no study has fully explored them.

The hypertext markup language allows authors not only to write about their friends and family, but to link to their homepages as well. Hence valuable information is hidden both in the text of the homepages, and in the links which are created between them. This study focuses on universities, a setting where homepage writing flourished early on and continues to be a regular part of student life. It is quite common for student homepages to include lists of friends. Since many people initially write their homepages using a friend's homepage as a guide [15], they are likely to imitate the practice of listing friends. The spread of the phenomenon is understandable when we consider the disappointment of a friend who has just included a link to our homepage, but whom we didn't give a link in return.

A link from one homepage to another does not necessarily mean that the two individuals are friends. It can indicate that the two people are coworkers or collaborators or it can simply mean that one person found the content of other's homepage interesting, without being personally acquainted with the other. However, for the most part, a link does usually convey a social connection between the two individuals.

Surprisingly, a few links added here and there between homepages generate a large community structure. Such a structure can be mined for relationships between people, and the activities and preferences which bring about these relationships. The pages vary as much as the individuals themselves do. Some pages are written in a professional manner, including the author's projects, class assignments, or a resume. Some pages include photographs and other content meant to entertain. Still others include very personal items such as poetry, essays, or diary entries.

Regardless of content, personal homepages are usually written with an intended audience: someone acquainted with the author, a possible employer, or someone interested in specific content. Our use of homepages goes beyond their original purpose to study a tangental and large scale phenomenon, the social structure of homepage networks. The fact that this information can be mined for a goal different from its original intent is an example of an information side effect. A ingenious use of information side effects is the RadioCamera system [4]. RadioCamera reads road traffic conditions from the load on cell phone towers. Congested roadways will show a increased load on base stations than roads with no traffic.

Just as global traffic patterns emerge out of cellphone use, large social networks emerge from individualized homepages. Users linking to one another form a giant social network which is easy to harvest and is rich in information about the context of a link between individuals. The contexts can range from cohabitation (i.e. fraternities) to shared interests (i.e. basketball). Acquiring such data on social networks used to be an arduous task involving time-consuming phone or live interviews. We are now able to harvest this information easily and automatically because it is already available as a side effect of people living a digital life. This presents an unprecedented opportunity to discover new and interesting social and cultural phenomena.

Related work [6] [7] [11] has attempted to use analysis of link topology to find web communities: web page collections with a shared topic. For example, any page dealing with 'data mining' and linking to other pages on the same topic would be part of the data mining web community. Such a page is not necessarily a homepage or even associated with a particular individual. In contrast, our work focuses on individuals' homepages and the connections between them, essentially allowing us to tap into both virtual and real world communities of people. Although personal homepage identification has been researched as a separate problem [9][13], networks of homepages have not previously been studied.

The data we use in our study, as described below and in Figure 1, comes from the following four different sources:

  1. Text on a user’s home page provides semantic insight into the content of the page. Co-occurrence of text between users who link to each other usually indicates a common interest. In our study we use multi-word “things” such as organization names and noun phrases, rather than single words .
  2. Out-links are links from a user's homepage to other pages.
  3. In-links are links from other pages to the user's homepage. For example, a web page listing all members of a fraternity links to individual homepages.
  4. Mailing lists provide us with valuable community structure that may not necessarily be apparent in homepages.

Figure 1: The four sources of information, in-links, out-links, mailing lists and text can be used as a means of inferring relationships between the users.

The data was obtained from individual home pages and other data sources at Stanford University and the Massachusetts Institute of Technology (MIT). In the process of analyzing and collecting this data we created a number of useful tools. These tools provided us with insights on the two networks which we discuss briefly below and in detail in a separate paper [2]. The tools have been converted into web based applications which allow others to further explore these web based communities.

Paper Roadmap

In Section 2 of the paper we discuss community web page structures in terms of the small world phenomenon. Section 3 describes the user interface we developed to browse the homepage networks. In Section 4 we discuss potential applications of our technique, and draw general conclusions.

2. Homepage link structure

Anyone who has been surprised to find a common acquaintance with someone they just met is familiar with the 'small world' phenomenon. It may appear that any two people in the world must be connected through a short chain of acquaintances, in spite of the fact that most people move in small social circles defined by their work or the place where they live.

The concept was tested experimentally by Harvard sociologist Stanley Milgram in the 1960's [12]. He asked a group of 160 people in Omaha, Nebraska to deliver a message to a stockbroker in Boston, Massachusetts. The participants could pass the message only onto people they knew on a first name basis, and yet the messages made it to the target in only five or six steps on average. Another interesting fact is that of the millions of people who could have been selected in passing the message, half of the messages passed through a key three people. These people are connectors [8], individuals who maintain many social contacts and who are key in transmitting trends and information in society.

We have tested these concepts on the network of homepages. This network is expected to have far fewer links than a real social network. After all, the people one lists on a homepage, if any are listed at all, are only a tiny fraction of the entire list of one's acquaintances. Nevertheless, we expect to find common properties between the real and virtual networks, such as cliqueishness, short average paths, and a disproportionate allocation of links.

We measured these properties by mapping the homepage network to a graph, with the nodes representing homepages and links representing hyperlinks between the homepages. For the purposes of this study we treated the links as undirected. That is, we take a link from one homepage to another as evidence that the two individuals are acquainted. By manually sampling the links we found that links between individuals who are not acquainted are relatively rare, and can be removed automatically. For example, many template pages distributed in html instructional courses at Stanford contained links to the instructor's homepages. Since these links were not intentionally placed by the users, we removed them.

The two data sets were obtained from the homepages found on the www.stanford.edu server and homepages on the {www,web}.mit.edu servers. Many homepages exist on departmental and personal web servers, but for simplicity, we did not include them in our study. Table 1 gives a summary picture of our data set. One observes that although the Stanford server holds more home pages, the MIT server has more links between homepages. At both institutions, those who receive links are more numerous than those who give them, indicating that many people who are linked to did not create links to others.

Stanford MIT
Users with non-empty WWW directories 7473 2302
Percent who link to at least one other person 14% 33%
Percent who are linked to by at least one other person 22% 58%
Percent with links in either direction 29% 69%
Percent with links in both directions 7% 22%
Table 1 Summary of links given and received among personal
homepages at Stanford and at MIT.

A startling result is that Stanford users, who link to only 2.5 other people on average, create a virtual connected social network of 1,265 people accounting for 58% of the users. A few smaller networks making up the remainder. At MIT, 1281 users, a full 85.6% of those having links on their homepages, belong to the giant component. This is due to a higher percentage of MIT users linking to one another as listed in Table 1. Figures 2 and 3 show graph visualizations of the largest connected components of the Stanford and MIT social networks. In the layout, each point is a homepage, and each link is a hyperlink between two homepages. The layout algorithm places connected nodes close together, while spreading unconnected nodes apart.

Figure 2: Graph layout of the Stanford social web.

Figure 3: Graph layout of the MIT social web.

The first property we measured was clustering or cliquishness. A high degree of clustering indicates that people form small groups, and within these small groups, most individuals are acquainted with one another. To measure the degree of clustering, we used the clustering coefficient C defined by of Watts and Strogatz [14]: C is fraction of neighbors of a node who are linked to one another. It measures how many of one's friends are friends themselves. At Stanford, C = 0.22 while at MIT, it is 0.21. Both clustering coefficients are 70 times greater than they would have been had homepages been linked to one another at random.

The second property we measured was the distribution of in and out links, shown for Stanford in Figure 4. These distributions are nearly power-law, meaning that they appear as nearly straight lines on log-log plots. A power-law distribution implies that a handful of people have a disproportionate number of links, while the majority of people receive very few links. For example, one person had 22 other people link to him, while over 1,000 people received only one link. This is consistent with the real world experience of some individuals being more popular than others, and many people wanting to claim an association with them. On the other hand, one person at Stanford had put links to 35 others, while over 500 users had linked to only one other person. This is true to real social networks as well. While most people keep their acquaintances at arm's length, preferring a small group of close friends, there are those who put a lot of energy into actively keeping up contacts with a great number of people.

Figure 4 Distribution of in-,out-, and undirected links for the Stanford social web.

Finally, we measured the average shortest path using links between any two homepages. Because our set of links is much smaller than real world contacts, we did not expect to reproduce Milgram's result of 6 steps for millions of people. However, we did measure small paths, 9.2 on average for the Stanford network, and 6.4 for the better connected MIT network. The difference in average shortest path is consistent with visualizations in Figures 2 and 3. The visualizations show MIT as a more tightly knit network of homepages for which one would intuitively expect a smaller separation between the nodes.

In summary, we found that the network of homepages consitutes a small world network, one which has short distances between any two individuals, and yet contains many local contacts. In addition to these two qualities of small worlds proposed by Watts and Strogatz, we found that the homepage networks, much like real world networks, contain a few highly connected members. These results reflect the nature of real social networks, while fitting with the structure of the web in general, which itself happens to be a small world network [1][3].

3. Homepage Analysis Tools

The web-based suite of tools we developed provides a number of functions useful for analyzing web communities. A demonstration of this application for the Stanford community is available at http://negotiation.parc.xerox.com/web10. Some functions our tools provide are:

A. View user details Find detailed information on individuals with homepages.

B. Visualization View graphical representations of the relationships between users in both a local and global context.

C. Social network analysis List whom the user links to and who links to them, as well as discover what those users have in common.

D. Matchmaking Match a specific user to others based on links, text, and mailing lists. The resulting ranks can be used for predicting connections between users.

E. Cluster analysis Determine which features are common to sub-groups of users.

The majority of the tools provided are implemented as web scripts or java applets. Each feature of the system is described below.

3.1 User details

Our analysis is based on four types of data: text, out-links, in-links and mailing lists. Text and out-links (including links to other users) were extracted from crawls of the users' homepages. ThingFinder [8] was used to extract the words and phrases in the text in the following categories: persons, places, cities, states, countries, organizations, companies, miscellaneous proper nouns, and noun groups. Although using categorized words and phrases was an improvement over using single words, ThingFinder was designed with commercial applications in mind. Thus, it fares better in recognizing companies and organizations than hobbies or majors which might be more relevant to students. It is also fairly sensitive to capitalization, so that it might pick up "Social Networks", but not "social networks". Despite these minor shortcomings, ThingFinder worked well for the homepage data we obtained.

In-links to Stanford homepages were collected by querying Google. Within the MIT websites multiple URLs correspond to the same page, requiring us to use AltaVista's wildcard search to gather in-links for MIT. Finally, complete lists of subscribers to mailing lists were obtained from the main mailing list server of each institution. At the time of our study, information on 95% of the lists at Stanford was publicly accessible from outside of Stanford. The remaining 5% of the mailing lists were excluded from our study. All information about the MIT mailing lists is internal to MIT. Because of this, and because users have some expectation about the privacy of their e-mail subscriptions, our public tools do not display list names.

Figure 5 provides an example of the output generated by the user description page for one specific Stanford user.

Things found on Sarah Emily Perez's page
CITIES: stratford
PERSONS: adam fingerhut, faulkner, jeannie kim, marcela muniz, mark branom, morrison, robert c. byrd, ryan tacorda
NOUN GROUPS: academic excellence freshman, admission counselor, business enterprise trust, gown women, research assistant, undergraduate admission, business inquiries, computer databases, history class, implement undergraduate recruitment strategy, language instructor, transfer applications, yield program
MISC: admit weekend, assistant, b.a., chappell-lougee grant, cuernavaca, designer, east palo alto tennis, education, gpa, her own words, jarrid whitney, marlon evans, masque, oxford, stanford admit weekend, student affairs, tutoring, vice provost, yahoo
ORGANIZATIONS: office of undergraduate admission, stanford english department, stanford office
POSITIONS: admission counselor, coordinator, editor, president

Out Links
http://admit.stanford.edu/admit
http://www.stanford.edu/~markb/
http://www.altavista.digital.com/
http://www.stanford.edu/~tacorda/
http://www.stanford.edu/
http://www.yahoo.com/

In Links
no in links

Sarah subscribes to 5 mailing lists (names are suppressed)
Figure 5

3.2 Visualization

Graphical representation of the network is generated by a java applet that lays out the largest connected component of the graph. Figures 2 and 3 are snapshots of the result. The applet interface allows one to locate specific users within the global graph. Users may also drag and move individual nodes in order to better see the link structure around an individual.

An alternative to the global view are the images generated for each user individually. Figure 6 below illustrates such an image. In it the user, as well as their first and second degree neighbors, are illustrated and labeled.

3.3 Social network analysis

In network analysis one is interested in the connections between users. Our interface allows researchers to get details about web links and hopefully understand why they exist. Figure 6 illustrates this facility for another Stanford student. Individuals that are linked to are listed in one section (for example, Dwayne Virnau) and those who link to the user are listed in another (Anne L. Gordon). Clicking on the name of a friend shows in turn that friend's list of connections. Thus one is able to browse the entire social network by following person-to-person links.

[ social web home | user directory ]

Mark Branom

homepage: http://www.stanford.edu/~markb
things and links found on Mark's web pages
people who have things in common with Mark


Here are the people whom Mark links to.
Click on the name to find out who that person is linked to

Dwayne Virnau   What do Mark and Dwayne have in common?
Sarah Colleen Brandel   What do Mark and Sarah have in common?

Here are the people who link to Mark.
Click on the name to find out who that person is linked to

Anne L Gordon   What do Mark and Anne have in common?
Elizabeth Maureen Hills   What do Mark and Elizabeth have in common?
Sarah Colleen Brandel   What do Mark and Sarah have in common?
Sarah Emily Perez   What do Mark and Sarah have in common?


[ social web home | user directory ]
Figure 6


Clicking on the "What do x and y have in common" link produces a page listing the items shared by x and y. Figure 7 shows an example for two users, kpsounis and stoumpis.


user 1: kpsounis
Konstantinos Psounis
user 2: stoumpis
Stavros Toumpis

Things in Common
CITIES: escondido, athens, cambridge
NOUN GROUPS: birth date, student association, undergraduate studies
MISC: computer science, electrical engineering, ntua, ph.d., general lyceum, toefl, computer
COUNTRIES: greece
ORGANIZATIONS: national technical university of athens

Pages pointed to by both users (out-links in common)
http://www.stanford.edu
http://www.kathimerini.gr
ht tp://www.stanford.edu/group/hellas
http://www.ntua.gr
http://ee.sta nford.edu

Pages pointing to both users (in-links)
http://www.stanford.edu/~dkarali/
http://171.64.54.173/fil arakia.html

4 mailing lists in common (name supressed)
Figure 7

3.4 Matchmaking

The tool described in the previous section is an attempt to understand links which have been made explicit by users. These links respresent only a small subset of all the real world social interactions. We would like to fill in the missing connections by using the information gathered on the users. Intuitively, the matchmaking algorithm guesses that the more similar a person is, the more likely they are to be a friend. If the two people matched have a lot in common, but are not friends already, the algorithm becomes a true matchmaker by suggesting which users should get to know each other based on their shared interests.

Similarity is measured by analyzing text, links, and mailing list. If we are trying to evaluate the likelihood that user A is linked to user B, we sum the number of items the two users have in common. Items that are unique to a few users are weighted more than commonly occurring items. The weighting scheme we use is the inverse log frequency of their occurrence. For example, if only two people mention an item, then the weight of that item is 1/log(2) or 1.4, if 5 people mention the item, then its weight drops down to 1/log(5) or 0.62. The tool, shown in Figure 8, lists the best matches for a given user and indicates where explicit homepage links are present.

anakken: Clifford Hsiang Chao
Linked Likeness Score
Person Things in common
NO 8.25 Eric Winston Liao What do Clifford and Eric have in common?
NO 5.76 Armen Norick Berjikly What do Clifford and Armen have in common?
YES 4.15 Andrew Chunghun Jhun What do Clifford and Andrew have in common?
YES 3.96 John Andrew Vestal What do Clifford and John have in common?
YES 3.65 Stanley Hsinheng Lin What do Clifford and Stanley have in common?
NO 3.27 desiree dawn ong What do Clifford and desiree have in common?
YES 3.08 Charles Ming Lin What do Clifford and Charles have in common?
YES 2.74 Justin Karl Dunscombe What do Clifford and Justin have in common?
NO 2.66 Daniel Sunil Chai What do Clifford and Daniel have in common?
NO 2.55 Wei Nan Hsu What do Clifford and Wei have in common?
Figure 8

3.5 Cluster analysis

Although only currently available offline, we have created algorithms that allow researchers to determine which "things" have high incidence within specific groups. Using this facility one may learn why groups are clustered together.

Our system ranks terms, lists, and links by evaluating the number of users who link to each other who use the same term. We consider all users associated with a particular item, for example the term "pearl tea". Among the N users who mention the item we take the number of pairs who link to each other and divide it by the total possible number of pairs, given by N/(N-1). Ranking the results of this algorithm allows us to see which items are highly identifying for a given group.

Table 2 shows the result of this tool for all MIT users and illustrates some of the key things (in this case terms) which "bind" groups together. This corresponds well to the reality of the MIT population where a bulk of the student population reside in fraternities or living groups (6 of the top 10 terms are of this type).

Term (description) Pairs Users Ratios
Union Chicana (student group) 5 5 .5
Phi Beta Epsilon (fraternity) 6 7 .286
Bhangra (traditional dance, practiced within a club at MIT) 4 6 .267
neurosci (appears to be the journal Neuroscience) 5 7 .238
Phi Sigma Kappa (fraternity) 5 7 .238
PBE (fraternity) 8 9 .222
Chi Phi (fraternity) 7 11 .127
Alpha Chi Omega (sorority) 4 11 .727
Stuyvesant High School 5 13 .064
Russian House (living group) 5 14 .055
Table 2 The most "indicative" terms in the MIT data set.
The descriptions were manually added.

4. Conclusions

We have shown that personal homepages provide a glimpse into the social structure of university communities. Not only do they reveal to us who knows whom, but they give us a context, whether it be a shared dorm, hobby, or research lab. Obtaining data on social networks used to be a tedious process of conducting a series of phone or live interviews. Studying social networks online can give us rich insight into how social bonds are created, but requires no more effort than running a crawler on home pages.

Among the numerous applications of these results is the mining of correlations between groups of people, which can be done simply by looking at co-occurrence in homepages of terms associated with each group. Using these techniques in combination with community discovery algorithms yields labeled clusters of users. Thus, not only is it possible to find communities, but we can describe them in a non-obvious way.

Another possible application is the facilitation of networking within a community. Knowing which friend of a friend is involved in a particular activity can help users find a chain of acquaintances to reach the people they need to reach. Finally, networks of homepages open a whole range of possibilities in marketing research, from identifying which groups might be interested in a product to relying on the social network to propagate information about that product.

Acknowledgements

The authors would like to thank Rajan Lukose, Bernardo Huberman, and TJ Giuli for their valuable advice and comments.

References

  1. L. Adamic, "The small world Web," Proceedings of the European Conf. on Digital Libraries, 1999.
  2. L. Adamic and Eytan Adar, "Friends and Neighbors on the Web".
  3. R. Albert, H. Jeong, A.-L. Barabasi, ``The diameter of the World Wide Web,'' Nature 401, 130 (1999).
  4. S. Diaz, "Cell Phone Signals Touted to Fight Traffic Wars," San Jose Mercury News, Jan. 20, 2000, .
  5. P. Erods, and A. Renyi, Publ. Math. Inst. Hung. Acad. Sci. 5 (1960) 17; B. Bollobas, Random Graphs (Academic Press, London, 1985).
  6. G. Flake, S. Lawrence, and C. Lee Giles. "Efficient identification of web communities". In Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, MA, August 20-23 2000, pp.150-160.
  7. D. Gibson, J. Kleinberg, and P. Raghavan. "Inferring Web communities from link topology," Proceedings 9th ACM Conference on Hypertext and Hypermedia, 1998
  8. Michael Gladwell, "The Tipping Point", Little & Brown, 2000.
  9. HomePageSearch
  10. InXight ThingFinder product page.
  11. R. R. Larson, "Bibliometrics of the World Wide Web: an exploratory analysis of the intellectual structure of cyberspace," Global Complexity: Information, Chaos and Control, the 1996 Annual Meeting of the American Society for Information Science, October 21-26, 1996, Baltimore, Maryland, USA.
  12. S. Milgram, ``The small world problem,'' Psychology Today 1, 61 (1967).
  13. J. Sharkes, M. Langheinrich, and O. Etzioni, Dynamic Reference Sifting: a Case Study in the Homepage Domain, Proceedings of the Sixth International World Wide Web Conference, pp.189-200 (1997).
  14. D. Watts and S. Strogatz, "Collective dynamics of small-world networks", Nature 393, 440 (1998).
  15. Patricia Wallace, "The Psychology of the Internet", Cambridge Univeristy Press, Cambridge, 1999