| Lada A. Adamic Xerox PARC Stanford University ladamic@parc.xerox.com |
Eytan Adar HP Sand Hill Laboratory eytan@hpl.hp.com |
Before the advent of web e-commerce, even before the advent of large scale web directories such as Yahoo, web users were busily utilizing the Web to broadcast information about themselves through personal homepages. The practice spread fast, as users became aware of others' homepages and realized that they could easy make one of their own. Homepages are so prevalent on the web and reveal so much about the individuals who write them, yet no study has fully explored them.
The hypertext markup language allows authors not only to write about their friends and family, but to link to their homepages as well. Hence valuable information is hidden both in the text of the homepages, and in the links which are created between them. This study focuses on universities, a setting where homepage writing flourished early on and continues to be a regular part of student life. It is quite common for student homepages to include lists of friends. Since many people initially write their homepages using a friend's homepage as a guide [15], they are likely to imitate the practice of listing friends. The spread of the phenomenon is understandable when we consider the disappointment of a friend who has just included a link to our homepage, but whom we didn't give a link in return.
A link from one homepage to another does not necessarily mean that the two individuals are friends. It can indicate that the two people are coworkers or collaborators or it can simply mean that one person found the content of other's homepage interesting, without being personally acquainted with the other. However, for the most part, a link does usually convey a social connection between the two individuals.
Surprisingly, a few links added here and there between homepages generate a large community structure. Such a structure can be mined for relationships between people, and the activities and preferences which bring about these relationships. The pages vary as much as the individuals themselves do. Some pages are written in a professional manner, including the author's projects, class assignments, or a resume. Some pages include photographs and other content meant to entertain. Still others include very personal items such as poetry, essays, or diary entries.
Regardless of content, personal homepages are usually written with an intended audience: someone acquainted with the author, a possible employer, or someone interested in specific content. Our use of homepages goes beyond their original purpose to study a tangental and large scale phenomenon, the social structure of homepage networks. The fact that this information can be mined for a goal different from its original intent is an example of an information side effect. A ingenious use of information side effects is the RadioCamera system [4]. RadioCamera reads road traffic conditions from the load on cell phone towers. Congested roadways will show a increased load on base stations than roads with no traffic.
Just as global traffic patterns emerge out of cellphone use, large social networks emerge from individualized homepages. Users linking to one another form a giant social network which is easy to harvest and is rich in information about the context of a link between individuals. The contexts can range from cohabitation (i.e. fraternities) to shared interests (i.e. basketball). Acquiring such data on social networks used to be an arduous task involving time-consuming phone or live interviews. We are now able to harvest this information easily and automatically because it is already available as a side effect of people living a digital life. This presents an unprecedented opportunity to discover new and interesting social and cultural phenomena.
Related work [6] [7] [11] has attempted to use analysis of link topology to find web communities: web page collections with a shared topic. For example, any page dealing with 'data mining' and linking to other pages on the same topic would be part of the data mining web community. Such a page is not necessarily a homepage or even associated with a particular individual. In contrast, our work focuses on individuals' homepages and the connections between them, essentially allowing us to tap into both virtual and real world communities of people. Although personal homepage identification has been researched as a separate problem [9][13], networks of homepages have not previously been studied.
The data we use in our study, as described below and in Figure 1, comes from the following four different sources:
The data was
obtained from individual home pages and other data sources at Stanford
University and the Massachusetts Institute of Technology (MIT). In the
process
of analyzing and collecting this data we created a number of useful
tools. These
tools provided us with insights on the two networks which we discuss
briefly
below and in detail in a separate paper [2]. The tools have been
converted into
web based applications which allow others to further explore these web
based
communities.
Figure 1:
The four sources of information, in-links, out-links, mailing lists and text
can be used as a means of inferring relationships between the users.
Paper Roadmap
In Section 2 of the paper we discuss community web
page
structures in terms of the small world phenomenon. Section 3
describes the user interface we developed to browse the homepage networks. In Section 4 we
discuss
potential applications of our technique, and draw general conclusions.
Anyone who has been surprised to find a common acquaintance with someone they just met is familiar with the 'small world' phenomenon. It may appear that any two people in the world must be connected through a short chain of acquaintances, in spite of the fact that most people move in small social circles defined by their work or the place where they live.
The concept was tested experimentally by Harvard sociologist Stanley Milgram in the 1960's [12]. He asked a group of 160 people in Omaha, Nebraska to deliver a message to a stockbroker in Boston, Massachusetts. The participants could pass the message only onto people they knew on a first name basis, and yet the messages made it to the target in only five or six steps on average. Another interesting fact is that of the millions of people who could have been selected in passing the message, half of the messages passed through a key three people. These people are connectors [8], individuals who maintain many social contacts and who are key in transmitting trends and information in society.
We have tested these concepts on the network of homepages. This network is expected to have far fewer links than a real social network. After all, the people one lists on a homepage, if any are listed at all, are only a tiny fraction of the entire list of one's acquaintances. Nevertheless, we expect to find common properties between the real and virtual networks, such as cliqueishness, short average paths, and a disproportionate allocation of links.
We measured these properties by mapping the homepage network to a graph, with the nodes representing homepages and links representing hyperlinks between the homepages. For the purposes of this study we treated the links as undirected. That is, we take a link from one homepage to another as evidence that the two individuals are acquainted. By manually sampling the links we found that links between individuals who are not acquainted are relatively rare, and can be removed automatically. For example, many template pages distributed in html instructional courses at Stanford contained links to the instructor's homepages. Since these links were not intentionally placed by the users, we removed them.
The two data sets were obtained from the homepages found on the www.stanford.edu server and homepages on the {www,web}.mit.edu servers. Many homepages exist on departmental and personal web servers, but for simplicity, we did not include them in our study. Table 1 gives a summary picture of our data set. One observes that although the Stanford server holds more home pages, the MIT server has more links between homepages. At both institutions, those who receive links are more numerous than those who give them, indicating that many people who are linked to did not create links to others.
A startling result is that Stanford users, who link to only 2.5 other
people on average,
create a virtual connected social network of 1,265 people accounting for
58% of
the users. A few smaller networks making up the remainder. At MIT,
1281 users, a full
85.6% of those having links on their homepages, belong to the giant
component. This is
due to a higher percentage of MIT users linking to one another as listed
in
Table 1. Figures 2 and 3 show graph visualizations of the largest
connected
components of the Stanford and MIT social networks. In the layout, each
point is a homepage, and each link is a hyperlink between two
homepages. The layout algorithm places connected nodes close together,
while spreading unconnected nodes apart.
The first property we measured was clustering or cliquishness. A high
degree of clustering indicates that people form small groups, and within
these small groups, most individuals are acquainted with one another. To
measure the degree of clustering, we used the clustering coefficient
C defined by of Watts and Strogatz [14]: C
is fraction of neighbors of a node who
are linked to one another. It measures how many of one's friends are
friends themselves. At Stanford, C = 0.22 while at MIT,
it is 0.21. Both
clustering coefficients are 70 times greater than they would have been
had homepages been
linked to one another at random.
The second property we measured was the distribution of in and out
links,
shown for Stanford in Figure 4. These distributions are nearly
power-law,
meaning that they appear as nearly straight lines on log-log plots. A
power-law
distribution implies that a handful of people have a disproportionate
number of
links, while the majority of people receive very few links. For example, one
person had 22
other people link to him, while over 1,000 people received only one link. This
is
consistent with the real world experience of some individuals being more
popular
than others, and many people wanting to claim an association with them.
On the
other hand, one person at Stanford had put links to 35 others, while
over 500
users had linked to only one other person. This is true to real social
networks
as well. While most people keep their acquaintances at arm's length,
preferring
a small group of close friends, there are those who put a lot of energy
into
actively keeping up contacts with a great number of people.
Finally, we measured the average shortest path using links between
any two
homepages. Because our set of links is much smaller than real world
contacts, we
did not expect to reproduce Milgram's result of 6 steps for millions of
people.
However, we did measure small paths, 9.2 on average for the Stanford
network,
and 6.4 for the better connected MIT network. The difference in average
shortest path is consistent with visualizations in Figures 2
and 3. The
visualizations show MIT as a more tightly knit
network of homepages for which one would intuitively expect a smaller
separation
between the nodes.
In summary, we found that the network of homepages consitutes a small
world
network, one which has short distances between any two individuals, and
yet
contains many local contacts. In addition to these two qualities of
small
worlds proposed by Watts and Strogatz, we found that the homepage
networks,
much like real world networks, contain a few highly connected members.
These results reflect the nature of real social networks, while
fitting with
the structure of the web in general, which itself happens to be a small
world network
[1][3].
A. View user details Find detailed information on individuals
with
homepages.
B. Visualization View graphical representations of the
relationships between users in both a
local and global context.
C. Social network analysis List whom the user links to and who
links
to them, as well as discover what those users have in common.
D. Matchmaking Match a specific user to others based on links,
text,
and mailing lists. The resulting ranks can be used for predicting
connections between users.
E. Cluster analysis Determine which features are common to
sub-groups
of users.
The majority of the tools provided are implemented as web scripts or
java
applets. Each feature of the system is described below.
Our analysis is based on four types of data: text, out-links,
in-links and
mailing lists. Text and out-links (including links to other users) were
extracted from crawls of the users' homepages.
ThingFinder [8] was used to extract the words and phrases in the text in
the
following categories: persons, places, cities, states, countries,
organizations,
companies, miscellaneous proper nouns, and noun groups. Although using categorized
words and phrases was an improvement over using single words, ThingFinder
was designed with commercial applications in mind. Thus, it fares better in
recognizing companies and organizations than hobbies or majors which might be
more relevant to students. It is also fairly sensitive to capitalization, so
that it might pick up "Social Networks", but not "social networks". Despite
these minor shortcomings, ThingFinder worked well for the homepage data we
obtained.
In-links to Stanford homepages were collected by querying Google.
Within the MIT websites multiple URLs correspond to the same page, requiring us
to use AltaVista's wildcard search to gather in-links for MIT.
Finally, complete lists of subscribers to mailing lists were obtained
from the main mailing list server of each institution. At the time of our
study, information on 95% of the lists at Stanford was publicly accessible
from outside of Stanford. The remaining 5% of the mailing lists were excluded
from our study. All information about the MIT mailing lists is internal to MIT.
Because of this, and because users have some expectation about the privacy
of their e-mail subscriptions, our public tools do not display list names.
Figure 5 provides an example of the output generated by the user
description
page for one specific Stanford user.
Graphical representation of the network is generated by a java applet
that
lays out the largest connected component of the graph.
Figures 2 and 3 are snapshots of the result. The applet interface allows
one to locate specific users within the global graph. Users may
also drag and move individual nodes in order to better see the link
structure
around an individual.
An alternative to the global view are the images generated for each
user
individually. Figure 6 below illustrates such an image. In it the user,
as well
as their first and second degree neighbors, are illustrated and labeled.
In network analysis one is interested in the connections between
users. Our
interface allows researchers to get details about web links and
hopefully
understand why they exist. Figure 6 illustrates this facility for
another
Stanford student. Individuals that are linked to are listed in one
section (for
example, Dwayne Virnau) and those who link to the user are listed in
another
(Anne L. Gordon). Clicking on the name of a friend shows in turn that friend's list of connections. Thus one is able to browse the entire social network by following
person-to-person links.
Clicking on the "What do x and y have in common" link produces
a page listing the items shared by x and y. Figure 7 shows
an example for two users, kpsounis and stoumpis.
The tool described in the previous section is an attempt to
understand
links which have been made explicit by users. These links
respresent only a small subset of all
the real world social interactions. We would like to fill in the missing
connections
by using the information gathered on the users. Intuitively, the
matchmaking algorithm guesses that the more similar a person is, the
more likely
they are to be a friend. If the two people matched have a lot in common,
but are
not friends already, the algorithm becomes a true matchmaker by
suggesting which users should get to know each other based on their
shared interests.
Similarity is measured by analyzing text, links, and mailing list. If
we are
trying to evaluate the likelihood that user A is linked to user B, we
sum the
number of items the two users have in common. Items that are unique to a
few
users are weighted more than commonly occurring items. The weighting
scheme we
use is the inverse log frequency of their occurrence. For example, if
only two
people mention an item, then the weight of that item is 1/log(2) or 1.4,
if 5
people mention the item, then its weight drops down to 1/log(5) or 0.62.
The tool, shown in Figure 8, lists the best matches for a given user and indicates where
explicit homepage links are present.
Although only currently available offline, we have created algorithms
that
allow researchers to determine which "things" have high incidence within
specific groups. Using this facility one may learn why groups are
clustered
together.
Our system ranks terms, lists, and links by evaluating the number of
users
who link to each other who use the same term. We consider all users
associated with a particular item, for example the term "pearl tea". Among the N users who mention the item we take the number of pairs who link to each other and divide it by
the total possible number of pairs, given by
N/(N-1).
Ranking the
results of this algorithm allows us to see which items are highly
identifying
for a given group.
Table 2 shows the result of this tool for all MIT users and
illustrates some
of the key things (in this case terms) which "bind" groups together.
This
corresponds well to the reality of the MIT population where a bulk of
the
student population reside in fraternities or living groups (6 of the top
10
terms are of this type).
Among the numerous applications of these results is the mining of
correlations between groups of people, which can be done simply by
looking at
co-occurrence in homepages of terms associated with each group. Using
these
techniques in combination with community discovery algorithms yields
labeled
clusters of users. Thus, not only is it possible to find communities,
but we can
describe them in a non-obvious way.
Another possible application is the facilitation of networking within
a
community. Knowing which friend of a friend is involved in a particular
activity
can help users find a chain of acquaintances to reach the people they
need to reach.
Finally, networks of homepages open a whole range of possibilities in
marketing
research, from identifying which groups might be interested in a product
to
relying on the social network to propagate information about that
product.
Stanford
MIT
Users with non-empty WWW directories
7473
2302
Percent who link to at least one other person
14%
33%
Percent who are linked to by at least one other
person
22%
58%
Percent with links in either direction
29%
69%
Percent with links in both directions
7%
22%
Table 1 Summary of links given and
received among
personal
homepages at Stanford and at MIT.
Figure 2: Graph layout of the Stanford
social web.
Figure 3: Graph layout of the MIT social
web.
Figure 4 Distribution of in-,out-, and
undirected
links for the Stanford social web.
3. Homepage Analysis Tools
The web-based suite of tools we
developed provides a
number of functions useful for analyzing web communities. A
demonstration of this
application for the Stanford community is available at
http://negotiation.parc.xerox.com/web10. Some functions our tools
provide are:
3.1 User details
Things found on Sarah Emily Perez's
page
CITIES:
stratford
PERSONS:
adam fingerhut, faulkner, jeannie kim, marcela muniz, mark
branom, morrison, robert c. byrd, ryan tacorda
NOUN GROUPS:
academic excellence freshman, admission counselor,
business
enterprise trust, gown women, research assistant,
undergraduate
admission, business inquiries, computer databases, history
class,
implement undergraduate recruitment strategy, language
instructor,
transfer applications, yield program
MISC:
admit weekend, assistant, b.a., chappell-lougee grant,
cuernavaca, designer, east palo alto tennis, education, gpa,
her own
words, jarrid whitney, marlon evans, masque, oxford,
stanford admit
weekend, student affairs, tutoring, vice provost,
yahoo
ORGANIZATIONS:
office of undergraduate admission, stanford english
department,
stanford office
POSITIONS:
admission counselor, coordinator, editor,
president
Out Links
http://admit.stanford.edu/admit
http://www.stanford.edu/~markb/
http://www.altavista.digital.com/
http://www.stanford.edu/~tacorda/
http://www.stanford.edu/
http://www.yahoo.com/
In Links
no in
links
Sarah subscribes to 5 mailing lists
(names
are
suppressed)
3.2 Visualization
3.3 Social network analysis
[ social
web
home | user
directory ]
Mark Branom 
homepage: http://www.stanford.edu/~markb
things and links found on Mark's web pages
people
who have things in common with Mark
Here are the
people
whom Mark links to.
Click on the name to find out who that
person is
linked to
Dwayne
Virnau What do Mark and Dwayne have in
common?
Sarah
Colleen Brandel What do Mark and Sarah have in
common?
Here are the people who link to
Mark.
Click on the name to find out who that person is linked
to
Anne
L Gordon What do Mark and Anne have in
common?
Elizabeth
Maureen Hills What do Mark and Elizabeth have in
common?
Sarah
Colleen Brandel What do Mark and Sarah have in
common?
Sarah
Emily Perez What do Mark and Sarah have in
common?
[ social web
home | user
directory ]
Figure 6
user 1:
kpsounis
Konstantinos
Psounisuser 2:
stoumpis
Stavros
Toumpis
Things in
Common
CITIES:
escondido, athens, cambridge
NOUN GROUPS:
birth date, student association, undergraduate
studies
MISC:
computer science, electrical engineering, ntua, ph.d.,
general
lyceum, toefl, computer
COUNTRIES:
greece
ORGANIZATIONS:
national technical university of
athens
Pages pointed to by
both users
(out-links in common)
http://www.stanford.edu
http://www.kathimerini.gr
ht
tp://www.stanford.edu/group/hellas
http://www.ntua.gr
http://ee.sta
nford.edu
Pages pointing to both
users
(in-links)
http://www.stanford.edu/~dkarali/
http://171.64.54.173/fil
arakia.html
4 mailing lists in
common (name
supressed)
Figure 7
3.4 Matchmaking
anakken: Clifford Hsiang Chao
Figure
8 3.5 Cluster analysis
Term (description)
Pairs
Users
Ratios
Union Chicana (student group)
5
5
.5
Phi Beta Epsilon (fraternity)
6
7
.286
Bhangra (traditional dance, practiced within a club at
MIT)
4
6
.267
neurosci (appears to be the journal Neuroscience)
5
7
.238
Phi Sigma Kappa (fraternity)
5
7
.238
PBE (fraternity)
8
9
.222
Chi Phi (fraternity)
7
11
.127
Alpha Chi Omega (sorority)
4
11
.727
Stuyvesant High School
5
13
.064
Russian House (living group)
5
14
.055
Table 2 The most "indicative" terms in the
MIT
data set.
The descriptions were manually
added.4. Conclusions
Acknowledgements
The authors would like to thank Rajan Lukose,
Bernardo
Huberman, and TJ Giuli for their valuable advice and comments.
References