PicASHOW: Pictorial Authority Search by Hyperlinks on the Web

Ronny Lempel   Aya Soffer
Department of Computer Science   IBM Research Lab in Haifa
The Technion   Matam
Haifa 32000, Israel   Haifa 31905, Israel
rlempel@cs.technion.ac.il   ayas@il.ibm.com


Copyright is held by the author/owner(s).

WWW10, May 1-5, 2001, Hong Kong.

ACM 1-58113-348-0/01/0005.



We describe PicASHOW, a fully automated WWW image retrieval system that is based on several link-structure analyzing algorithms. Our basic premise is that a page p displays (or links to) an image when the author of p considers the image to be of value to the viewers of the page. We thus extend some well known link-based WWW page retrieval schemes to the context of image retrieval.

PicASHOW's analysis of the link structure enables it to retrieve relevant images even when those are stored in files with meaningless names. The same analysis also allows it to identify image containers and image hubs. We define these as Web pages that are rich in relevant images, or from which many images are readily accessible.

PicASHOW requires no image analysis whatsoever and no creation of taxonomies for pre-classification of the Web's images. It can be implemented by standard WWW search engines with reasonable overhead, in terms of both computations and storage, and with no change to user query formats. It can thus be used to easily add image retrieving capabilities to standard search engines.

Our results demonstrate that PicASHOW, while relying almost exclusively on link analysis, compares well with dedicated WWW image retrieval systems. We conclude that link analysis, a bona-fide effective technique for Web page search, can improve the performance of Web image retrieval, as well as extend its definition to include the retrieval of image hubs and containers.

Keywords: Image Retrieval; Link Structure Analysis; Hubs and Authorities; Image Hubs; PicASHOW.

1. Introduction

The WWW is host to millions of images on almost every conceivable topic. Finding effective methods to retrieve these images has attracted many research efforts over the past few years. Such research has led to academic image retrieval systems ([10]), to general search engines which have developed image retrieval capabilities ([7], [17]), and to search engines dedicated to WWW image retrieval ([9],[19]).

There are three main approaches for WWW image search and retrieval:

  1. Text-based retrieval. This approach annotates images with text derived from the containing (displaying) HTML documents, and then applies text-based retrieval algorithms to the annotated collection of images. The derived text can include the caption of the image, text surrounding the image, the entire text of the containing page, the filename of the containing HTML document and the filename of the image itself.
  2. Content-based image retrieval (CBIR). This approach applies image analysis techniques in order to extract visual features (e.g., color, texture, orientation, shapes) from the images. The features are extracted in a preprocessing stage, and stored in the retrieval system's database. The extracted features (e.g., the color histogram of the image) are usually of high dimensionality, and in order to allow scalability of these systems (in terms of storage space and query processing times), some sort of dimension reduction is usually performed on the data [e.g., 21].
  3. Manually annotated image collections. There are several companies that specialize in providing visual content to a diverse range of image consumers. The two largest companies are Getty Images [15] and Corbis [14]. The images are indexed and retrieved by keywords, which are manually assigned to each image. While end users may use these services, they are especially geared toward companies and professionals which require high volumes of diverse images.
One implication of the difference between the retrieval approaches is their support of different types of queries. Text-based retrieval systems, as well as the commercial image providers, support natural, topic-descriptive queries. These queries are friendly and familiar to the typical surfer of the Web. On the other hand, CBIR supports either queries which are formulated in terms of the extracted visual features, or similarity queries, in which a sample image is presented and the system is required to retrieve images with similar visual features.
Web Image Search
PicToSeek ([12]) is an example of a pure content-based image retrieval system. It classifies images into photographed/synthetic images, and further classifies photographs into portraits and indoor/outdoor scenes. It extracts many visual features from the images, and supports queries by image-examples and by image features.

Many WWW image search engines combine text-based retrieval and CBIR into an integrated system. In webSeek ([27]), for example, each image is processed and its visual features are extracted. Each image is also associated with the text in its containing page. The images are additionally classified into topics from a specialized taxonomy that was developed for this purpose. The WebSeer system ([11]) uses associated text and feature extraction to support complex queries, which state both the search topic and some visual properties of the desired images. The system depicted in [3] unifies the textual representations (derived from the containing documents) and visual representations of images into a single representative vector. This enables the utilization of possible statistical couplings between the textual contents of the containing documents and the visual properties of the images.

Harmandas et al. in [13] suggest a text-based image retrieval system in which connectivity information is used to induce textual annotations of images. Each image i in their approach is assigned a weighted vector of representing terms, which is some function of the combined text of all of the pages that contain i and of the pages that link to pages containing i. While this scheme does consider hyperlinks, it is essentially a text-based retrieval scheme in which hyperlinks are used to induce textual annotations, without any analysis being done on the link-structure per se.
Link Structure Analysis in Web Page Search
Recent work in Web search has demonstrated that link structure analysis is a very effective technique for finding authoritative Web pages. Information such as which pages are linked to others is commonly used to augment search algorithms, and has significantly improved the ability of search engines to rank quality pages at the top of their search results. Link-structure analysis is based on the notion that a link from page p to page q can be viewed as an endorsement of q by p, and as some form of positive judgment by p of q's content.

Two important types of techniques in link-structure analysis are co-citation based schemes, and random-walk based schemes. The main idea behind co-citation based schemes is the notion that when two pages p1 and p2 both point to some page q, it is reasonable to assume that p1 and p2 share a mutual topic of interest. Likewise, when p links to both q1 and q2, it is probable that q1 and q2 share some mutual topic. An important work in the context of co-citation based schemes was Jon Kleinberg's introduction of the notions of hubs and authorities ([23]) as two distinct types of Web pages. Authorities, or authoritative pages, are Web pages that contain high-quality information regarding some topic. Hubs, on the other hand, may not directly contain information but are rather resource lists, linking to authorities on a topic without necessarily displaying the information itself. Kleinberg devised an algorithm aimed at finding authoritative pages, and researchers from IBM's Almaden Research Center have implemented Kleinberg's algorithm in various projects, most notably CLEVER [4].

Random walk based schemes model the Web (or part of it) as a graph (where pages are nodes and links are edges), and apply some random walk model to the graph. Pages are then ranked by the probability of visiting them in the modeled random walk. The most notable algorithm of this type is PageRank [2], which is an important part of the ranking function and of the success of the Google search engine ([16]). Both Kleinberg's algorithm and PageRank are described in detail in Section 2.

Co-citation reasoning was combined with random walk theory in SALSA ([24]), to separate the random walk based rankings of hubs and authorities. Rafiei and Mendelzon ([25]) have also integrated co-citation and random walks in their work on computing page reputations.
Our Approach: Link Structure Analysis in Web Image Search
In this paper we present PicASHOW, a pictorial retrieval system that searches for images (pictures) on the Web using hyperlink-structure analysis. PicASHOW applies co-citation based approaches and PageRank influenced methods. Our basic premise is that a page p displays (or links to) an image when the author of p considers the image to be of value to the viewers of the page. We further contend that the standard reasoning behind the co-citation measure applies to images just as it does to HTML pages: In addition, in the spirit of PageRank, we assume that images which are contained in authoritative pages on topic t are good candidates to be quality images on that topic.

In the next sections, we describe several link-structure based WWW image retrieval schemes. Following are the highlights of the PicASHOW approach:

The remainder of this paper is organized as follows. In Section 2 we provide some background on link-structure analysis when searching for Web pages. In Section 3 we formally define the image collections which are to be analyzed, explain how such collections are assembled from a given query, and present our proposed image ranking schemes. Section 4 introduces the concept of image hubs and image containers and describes how we identify such hubs and containers. In Section 5 we discuss the pros and cons of our method, and suggest interesting extensions of this research direction. Appendix A lists the URLs of the images which are displayed in this paper.

We do not provide any formal evaluation section in this paper since there are no benchmarks for testing such systems and thus most evaluations are qualitative. Rather, we include results of sample queries throughout the paper. For comparison purposes, we also show the results of some of these queries on commercial Web search engines. The purpose of this comparison is to show that different search techniques yield different images and to highlight the benefits of link analysis in the context of image search.

2. Link Analysis for Finding Authoritative Web Pages

This section provides some technical background on applications of WWW link-structure analysis when searching for Web pages. Specifically, we provide a brief overview of two link-structure analyzing approaches: PageRank ([2]) and Kleinberg's Mutual Reinforcement approach ([23]). This background is required in order to describe our image ranking schemes which are inspired by these approaches. Indeed, for each of the two approaches described, we also describe the main points which we adapted and evolved in our image ranking schemes.

2.1 PageRank

PageRank [2] is an important part of the ranking function of the Google search engine ([16]). The PageRank of a page p is the probability of visiting p in a random walk of the entire Web, where the set of states of the random walk is the set of pages, and each random step is of one of the following two types:

  1. From the given state s, choose at random an outgoing link of s, and follow that link to the destination page.
  2. Choose a Web page uniformly at random, and jump to it.
PageRank chooses a parameter d, 0 < d < 1, and each state transition is of the first transition type with probability d, and of the second type with probability 1-d. The PageRanks obey the following formula (where page p has incoming links from pages q1 , . . . , qk):
\begin{displaymath}\mbox{PageRank}(p)=(1-d)+d(\sum_{i=1}^k \frac{\mbox{PageRank}(q_i)}{\mbox{out degree of }q_i})\end{displaymath}

Thus, the PageRank of a page grows with the importance (=PageRanks) of the pages which point to it. An endorsement (=link) from a prominent (high ranking) site, like Yahoo! ([20]), contributes to a page's PageRank much more than an incoming link from some obscure personal homepage. Our image ranking schemes will imitate this property. In particular, the rankings of images will grow with the importance of the pages which contain them.

Table 1: PicASHOW/Scour images for the query "Michael Jordan"

2.2 The Mutual Reinforcement Approach

Kleinberg's Mutual Reinforcement approach, introduced in [23], aims to find hubs and authorities which pertain to a given topic t. The key observation behind the approach is that hubs and authorities which pertain to t display a mutually reinforcing relationship: For a page to be considered a good t-hub, it must point to many t-authorities, while a page is considered to be an authority on topic t only if many hubs deem it as such (and point to it). Because of this relationship, prominent t-hubs and t-authorities tend to form communities, which can be seen as densely inter-connected bipartite portions of the Web-graph.

The algorithm starts by assembling a collection C of Web pages, which should contain many high quality Web pages which pertain to a given topic t. It then analyzes the link structure induced by that collection, in order to find the authoritative pages (and the hubs) on topic t.

Denote by q a term-based search query which describes the topic of interest t. The collection C is assembled as follows: The collection C and its link structure induce a directed graph G whose nodes are the pages in C and for all i , j\in$C, the directed edge i$\rightarrow$j appears in G if and only if page i contains a hyperlink to page j. Let W denote the |C| × |C| adjacency matrix of G.

Each page s$\in$C is then assigned a pair of weights, a hub-weight h(s) and an authority weight a(s), based on the following two principles: The top ranking pages, according to both types of weights, form the Mutually Reinforcing communities of hubs and authorities.

Kleinberg uses the following iterative algorithm to assign the weights:

  1. Initialize a(s)$\leftarrow$1 , h(s)$\leftarrow$1 for all pages s$\in$C.
  2. Repeat the following three operations until convergence:
Note that applying the I operation is equivalent to assigning authority weights according to the result of multiplying the vector of all hub weights by the matrix WT. The O operation is equivalent to assigning hub weights according to the result of multiplying the vector of all authority weights by the matrix W.

Kleinberg showed that this algorithm converges, and that the resulting authority weights [hub weights] are the coordinates of the normalized principal eigenvector 2 of WTW [of WWT]. The pages which correspond to the largest coordinates of these eigenvectors are returned by the algorithm as the principal community of authorities[hubs].

The two matrices WTW and WWT are well known in the field of bibliometrics:

  1. WTW is the co-citation matrix ([26]) of the collection. [WTW]i,j is the number of pages which jointly point at (cite) pages i and j.
  2. WWT is the bibliographic coupling matrix ([22]) of the collection. [WWT]i,j is the number of pages jointly referred to (pointed at) by pages i and j.

Table 2: PicASHOW (left)/AltaVista images for the query Vincent Van Gogh
It is important to note that the outcome of the algorithm, namely the communities of hubs and authorities which the algorithm will identify, are determined solely by the adjacency matrix W of the collection C. The adjacency matrix implies the co-citation and bibliographic coupling matrices, and it is the eigenvectors of these matrices, in turn, which uniquely determine the principal communities of hubs and authorities.

Our co-citation based image retrieval schemes basically imitate this algorithm, albeit with different adjacency matrices. Defining the adjacency matrices to be used will suffice to uniquely define our schemes.

3. Link Structure Analysis for Finding Authoritative Images

We now describe our method for finding authoritative images given a query. First, we formally define the image collections which are analyzed. Next, we explain how such collections are assembled from a given query. Finally, we present our image ranking schemes.

3.1 Formal Definition of the Model

A page p is said to contain an image i (denoted by p$p \leadsto i$i) in either of the following two cases:

  1. p displays i: When page p is loaded in a Web browser, i is displayed. For example, either of the following html directives: <IMG source="foo.gif" alt="nice image"> or <A HREF="foo.html"> <IMG source="foo.gif"> </A>
  2. p points to i's image file (in some image file format such as .gif or .jpeg). For example,
    <A HREF="foo.jpeg"> nice image</A>. Note that p does not contain i when p points to an HTML file which contains i, even if i is the only visible object in the HTML file.
A topical WWW image collection is a quadruple IC=(P,I,L,E), where P is a set of Web pages (many of which deal with a certain topic t), I is the set of images which are contained in P, L $\subseteqP×P is the set of (directed) links which exist on the Web between the pages of P, and E$\subseteq$P×I is the relation page p contains image i.

We denote by W the adjacency matrix of the page-to-page relation L, and by M=[mi,j] the |P|×|I| adjacency matrix of the page-to-image relation E.
Page-Image Adjacency
Two of the most important observations of link-structure analysis are the following:

  1. The notion of authority being conferred through links from a pointing resource to a pointed resource. In our context, the pointing resources are Web pages, while the pointed resources are the images.
  2. The topical similarity between resources (images, in our case) which is inferred through co-citation.
Both of these principles are reflected in the adjacency relation which exists in the data. The adjacency matrix conveys the flow of authority, while the entries of the co-citation matrix, which the adjacency matrix implies, define the strength of the topical affinities which exist between the resources.

It is our goal, therefore, to define an adjacency relation between Web pages and images, in a manner which will best reflect both the flow of authority from pages to images, and the topical affinities between the various images.

There are several reasonable definitions for such adjacency relations. The intuition behind these definitions is perhaps best explained through an example. Consider the scenario depicted in Figure 1, which consists of five Web pages P1,...,P5 and four images, one of them replicated.

Figure 1: An example page/image case

The most basic adjacency relation which comes to mind is to adopt the page-to-image relation E, defined above, as the adjacency relation (and M as the adjacency matrix). The outright manner for a page to endorse an image is simply to display it or point to it, and that is the information which is represented in M. This approach also reflects some topical affinities between images, through the corresponding co-citation matrix MTM. Pairs of images which are co-contained in the same page are considered to be topically related. For example, note the entry which corresponds to the runner and tennis player in Figure 2.

Figure 2: The k=0 adjacency and co-citation matrices for the example case

However, this approach fails to convey some other fairly intuitive topical relations, such as between the soccer player and the tennis player. These images appear in pages which are co-cited, and assuming we agree that co-cited pages are topically related, then perhaps so are the images which are contained in them. To reflect such a connection, we need to consider the adjacency matrix WM, which associates each page p with the images which are displayed in pages to which p links. Using WM as the adjacency matrix, the co-citation matrix MTWTWM (Figure 3) now reflects some topical affinity between the soccer image and the tennis image.

Figure 3: The k=1 adjacency and co-citation matrices for the example case

This approach also suffers from some obvious setbacks. The affinity between the soccer image and the runner image is considered as strong as the affinity between the tennis image and the runner, although it seems logical that the tennis and runner images are more tightly coupled, since they appear in the same page. A greater problem exists with the connection between the image of the baseball player and the images of the soccer and tennis players. Why is a linkage between the soccer image and the tennis image inferred by P2 co-citing pages P4 and P5, while no linkage is inferred between those two images and the image of the baseball image, contained in P2 itself? Perhaps the answer to these issues lies in using the matrix (W+I|P|)M as the adjacency matrix (where I|P| is the |P|×|P| identity matrix).

Figure 4: Two times the k=0.5 adjacency matrix (and the corresponding co-citation matrix) for the example case
The matrix (W+I|P|)M defines each page to be adjacent both to the images which are contained in it, and to the images which are contained in pages to which it points. The co-citation measure which this adjacency matrix implies (the corresponding co-citation matrix is [(W+I|P|)M]T(W+I|P|)M ), now addresses the concerns raised in the previous case, as shown in Figure 4.

A closer look at our three proposed adjacency matrices (M, WM and (W+I|P|)M ) reveals that they are all members of the following parametric family of adjacency matrices (up to, perhaps, a constant factor):

AIC = { [kW + (1-k)I|P|]M : 0 k 1 }
By denoting AIC(k) = [kW + (1-k)I|P|]M , we have M = AIC(0), WM = AIC(1), and
(W+I|P|)M = 2AIC(0.5) . In general, choosing large values of k will introduce bias towards relations between pages and images contained in pages linked from them, while small values of k will boost the relationship between pages and the images which they themselves contain.

Our experiments, and our sample results, were derived using the three adjacency matrices defined above, although we do not claim that any of these choices is in any way optimal.

Weighted Relations
The definitions of the previous subsection are easily extended to the case where both L and E are weighted relations, that is L$\subseteq$P×P×R+3 is the set of weighted page to page links and E$\subseteq$P×I×R+ is the weighted page-image relation. Weighted relations are derived by assigning weights to the links (relations) which reflect the amount of authority that the pointing (containing) page confers to the pointed page (image). Possible factors which may contribute to the weight of a link or relation include the following (the first factor is considered in PicASHOW):

3.2 Assembling a Topical Collection

Our assumption in applying link-structure analysis in search of quality images on topic t is that t-relevant pages will contain quality images on t. Thus, by examining a large enough set of t-relevant pages, we should be able to identify high quality t-images. Therefore, the first step in assembling a topical collection of images is to assemble a large collection of t-relevant pages. This collection is assembled in the same manner as described in Section 2.2. That is, for a query q which describes the topic t, we assemble a q-induced collection of Web pages by submitting q first to traditional search engines, and adding pages that point to or are pointed by pages in the resultant set. This provides us with the page set P and the page-to-page link set L. Note that we do not utilize any image search engine in this step -- we are using only standard (HTML-page finding) search engines. Once we compile the page set P, we define the set I as the set of images which are contained in P (this also implies the page-to-image relation E).

This scheme for assembling the topical image collection can be implemented, with reasonable overhead in terms of both computations and storage, by many standard WWW search engines. All search engines continuously crawl the Web with robots, which collect the textual content of the pages to be indexed. Many engines (such as AltaVista[7], Google[16] and Lycos[18]) additionally collect connectivity information that captures the information regarding the links between the pages as they crawl the Web. Our scheme requires the engines to also catalog, for each page, which images are contained (or pointed to) in the page. However, the images themselves need not be stored. As explained below, each image requires only the storage of a 32 byte signature, plus a URL where it can be found.

When building the image collection IC=(P,I,L,E) we must consider a common authoring technique of Web pages. Specifically, when a Web site creator encounters an image of his liking on a remote server, the usual course of action would be to copy the image file to the local server, thus replicating the image. The motivation behind this practice is to enable the author's page to load faster from within the author's domain/organization, as the displayed images are stored locally.

This behavior of authors with respect to images is different from the corresponding behavior with respect to HTML pages. In most cases, authors will not copy a remote page (or some portion of its contents) to the local servers, but rather provide links from their site to the remote page. There are exceptions to this rule (such as the replication of system manuals or software APIs), but for most types of content, HTML pages are not replicated. This authoring mode has two important implications for link-based image search techniques, which are in contrast to the corresponding link based techniques for searching Web pages. We expand on these now.
Identifying replicated images
We must identify that multiple pages contain a certain image, even when the pages contain different copies of the image. Thus, images cannot be identified by their URIs, but rather must be identified by their content. In contrast, when applying link-analysis in search of authoritative pages, identifying replications is less crucial. Satisfactory results can be obtained even when the issue of page-replications is ignored.

Fortunately, it is possible to decide whether two images are identical with a relatively high probability by examining a small portion of the image. In PicASHOW, we only download the first 1024 bytes of the image and apply a double hash function to these bytes, so that each image is represented by a signature consisting of 32 bytes. Two images with the same signature are considered identical. Our experience shows that very rarely do different images result in the same signature since the first 1024 bytes usually capture the header information as well as the first part of the image itself. The storage overhead which is associated with each image is thus quite minimal. Note that replications of the same image result in only one 32-byte signature (and one URL) in terms of storage requirements.
Filtering non-informative images
The basic assumption behind link-analysis based methods is that a link between two pages p and q usually means that page p suggests, or even recommends, that surfers visiting p follow the link and visit page q. This may reflect the fact that pages p and q share a common topic of interest, and that the author of p thinks highly of q's contents. Such a link, called an informative link, is p's way to confer authority on q [23].

Unfortunately, not all links are informative. There are many kinds of links which confer little or no authority [6], such as intra-domain (inner) links (whose purpose is to provide navigational aid in a complex Web site of some organization), commercial/sponsor links, and links which result from link-exchange agreements. A crucial task which should be completed prior to analyzing the link structure of a given collection, is to filter out as many of the non-informative links as possible.

When building the set of page-to-page links L, we identify and filter out intra-domain page-to-page links since they are ruled to be navigational links, not informative links.

Devising a similar strategy for the page-to-image relation is crucial for successful link-based image retrieval. Site banners and logos can be thought of as the image equivalents of non-informative links. These images introduce a large amount of noise into image collections, which we would like to be able to filter. However, the described practice of image replication implies that filtering out the intra-domain page-to-image links of E will be destructive as we may also lose informative images in this fashion. We thus introduce a few heuristics, that can mitigate the noise that is introduced by non-informative images: These heuristics do not filter out all of the noise caused by non-informative images. Some non-informative images survive this process, and thus introduce noise into our page-to-image adjacency matrices. We found that this noise affects the image rankings more than usually happens in the corresponding page ranking schemes (where page-to-page adjacency matrices are used). As a consequence, link-based image search seems to be more noisy than link-based page search, and specifically may be easier to spam. Devising more elaborate and effective filtering schemes is left for future work.

3.3 Image Ranking Schemes

After assembling a topical image collection IC=(P,I,L,E) which pertains to a certain topic t, we need to rank the images of I with respect to t. We assume that every page p$p\in P$P is associated with a t-relevance score, denoted by rt(p). Note that this is not a limiting assumption, since we can always calculate the authority scores of the pages and use them as relevance scores. In particular, the collection IC contains the linkage information which is required to calculate authority scores by the Mutual Reinforcement approach.

Below is a list of the ranking schemes with which we have experimented. They are divided into three categories: A naive image in-degree approach, PageRank-influenced ranking schemes, and co-citation based analyses. These ranking schemes are based on the matrices which were defined in section 3.1.

  1. In-degree rank according to the matrix M. Here, the score of image i equals $\sum_{\{ p \in P \vert p \mbox{\small\ contains }
i \}}m_{p,i}$, where mp,i is the weight associated with the page-image relation p$p \leadsto i$i. That is, the score of image i is the sum of the weights of all relations p$p \leadsto i$i for all pages p which contain image i in the collection.
  2. PageRank ([2]) influenced ranking schemes. Under the hypothesis that images which are contained in t-relevant pages should be of higher quality (with respect to t) than images contained in t-irrelevant pages, we factor the relevance score of page p (rt(p)) into the score of image i. In particular, we set the score of image i to equal $\sum_{\{ p \in P \vert p \mbox{\small\ contains }i \}}r_t(p)$. In the case that M is a weighted matrix (and not simply a binary matrix), the straightforward variant of this score is $\sum_{\{ p \in P \vert p \mbox{\small\ contains }i \}}r_t(p) m_{p,i}$.
  3. Co-citation based analyses, with each of the matrices M, WM and (W+I|P|)M serving as the citation (adjacency) matrix, and with the co-citation matrices MTM, (WM)T WM, and [(W+I|P|)M]T(W+I|P|)M used for the purpose of the analysis. Specifically, we tested the rankings that are produced by the Mutual Reinforcement approach and by SALSA 4.
    Note that we can also utilize the t-relevance score { rt(p) , p$\{ r_t(p),p\in P \}$P } to enhance our co-citation analysis and boost the rankings of images that are cited by highly t-relevant pages. As an example, consider applying co-citation analysis to the |P|×|I| page-to-image matrix MR that is defined as follows: $[M_R]_{i,j} \stackrel{\triangle}{=}m_{i,j} \sqrt{r_t(i)}$ .   By examining the t-relevance image co-citation matrix MRTMR, we note that when M is unweighted (a binary adjacency matrix), [MRTMR]i,j sums the relevance weight of all pages which co-display images i and j:
    \begin{eqnarray*}[M_R^T M_R]_{i,j}&=&
\sum_{\{ k:k\leadsto i,k\leadsto j \}}[M_...
...rt{r_t(k)})^2 =
\sum_{\{ k:k\leadsto i,k\leadsto j \}}r_t(k)\\

    Thus, not all image co-citations are considered equal - highly relevant pages endorse their co-contained images to a larger extent than do less relevant pages.


Table 3: PicASHOW/Ditto images for the query Solar System

4. Image Containers and Image Hubs

One of the major benefits of hyperlink-based image search is that in addition to finding good images which pertain to a certain query, it can also identify Web pages that are rich in relevant images, or from which many images are readily accessible. Our proposed co-citation based ranking schemes naturally allow for such Web pages to be found.

While we have concentrated, in the previous section, on how to rank the authoritative images in IC, we can similarly find Web pages whose role corresponds to that of hubs. Just as hubs were defined as pages which link to many authoritative pages, in our context image hubs should be pages which are, in some sense, linked to many authoritative images.

However, the notion of an image hub is somewhat ambiguous. Do we, by calling p an "image hub", mean that many authoritative images are displayed in p, or do we mean that p points to many pages which contain quality images? We claim that both possible interpretations are of value, and so we define them separately as follows: Pages which contain high-quality images are called image containers, while pages which point to good image containers are called image hubs. Thus, image hubs are once removed from the authoritative images themselves, which are contained (as the name implies) in the image containers.

Table 4: PicASHOW/Lycos images for the query Jaguar car

Our co-citation based image retrieval schemes can find both image containers and image hubs, either separately or in some mixed manner. The outcome depends on the type of adjacency matrix used to describe the collection IC, which, in turn, implies the bibliographic coupling matrix which governs the ranks of pages as image hubs/containers (the technical details of how hub ranks are derived from the adjacency matrix were given in section 2.2).

When using the adjacency matrix M (or MR), the p'th row describes which images are contained in page p. The pages whose coordinates will stand out in the principal eigenvector of the matrix MMT will, accordingly, form a community of image containers. Image hubs are likely to be found when using the adjacency matrix WM, whose p'th row describes which images are contained in pages to which p links (the corresponding bibliographic coupling matrix will be WMMTWT). Co-citation analysis using the matrix (W+I|P|)M allows us to find communities of pages which both contain images and which point to other image containers, since the p'th row there details the images which are contained either in p or in pages to which p links. In general, when using the adjacency matrix AIC(k), high values of k shift the analysis towards image hubs, while lower values of k accentuate image containers.

One of the most striking examples of image containers which we found throughout our experiments with PicASHOW is the following fractal image container: http://sprott.physics.wisc.edu/fractals.htm . The page contains over a hundred images of fractals. Table 5 shows a few of those striking images.

Table 5: Images of fractals contained in http://sprott.physics.wisc.edu/fractals.htm

Another example of an image container is the url http://www.xs4all.nl/~renebos/magritte.html  for the query Magritte. The page itself contains just 5 words: "Rene Magritte : 10 Genius Paintings". Underneath that phrase are displayed 10 of Magritte's masterpieces, five of which are shown in Table 6

Table 6: Magritte paintings contained in http://www.xs4all.nl/~renebos/magritte.html

Here are a couple of examples for image hubs:

  1. For the query pyramids, many relevant images are just a click away from the following URLs:
  2. For the query "Yosemite", the URL titled "Photos by Rick Ellis -- Yosemite" (http://www.fnet.net/~ellis/photo/yosemite.html)  is both an image container and an image hub. Actually, it is a Yosemite hub in general, since it points to many valuable resources on Yosemite National Park. Michael Gordon's High Sierra page (http://home.earthlink.net/~mgordon324/sierra_new.htm) is an image container and hub for the Sierra Nevada mountain range (which includes the Yosemite Valley and the mountains around it). A great deal of Yosemite images are accessible from this page.

5. Discussion and Future Work

The examples showcased throughout the paper show that PicASHOW, while relying on very little besides link-structure analysis, demonstrates retrieval abilities comparable to those of available WWW image search engines. In addition, PicASHOW's retrieval of image containers and hubs is a natural and useful extension of the image search paradigm, which, to the best of our knowledge, has not been pursued before.

We have shown results for queries from diverse fields such as science, art, geographical landmarks, transportation and celebrities. The following list gives more details on the ranking scheme that was applied in each example, and highlights other noteworthy points which follow from the example. The reader should recall the definition of the ranking schemes from section 3.3.

  1. Table 1 brings PicASHOW's results for the query "Michael Jordan". The image collection consisted of 1083 images, and the results shown were derived by applying the PageRank-influenced ranking scheme. The URLs of these images (see Table 8) exemplify the practice of image replication among Web authors. Note that two of those images are animated GIFs.
  2. Table 2 brings PicASHOW's results for the query "Vincent Van Gogh". The image collection consisted of 582 images, and the results shown were derived by applying Kleinberg's algorithm using the adjacency matrix (W+I|P|)M .
  3. Table 3 brings our results for the query "Solar System". The image collection consisted of 682 images, and the images were ranked by weighted in-degree according to the relation E .
  4. Table 4 has our results for the query "Jaguar car". The image collection in this case was very small, and consisted of just 67 images. The ranking scheme was Kleinberg's algorithm, using the adjacency matrix M. Note that while PicASHOWs results all come from different servers, most of the Lycos images are from a single server (see Table 12).
  5. Table 7 displays PicASHOW's results for the query "Kilimanjaro". The 309 images of this example were ranked by SALSA, using the adjacency matrix M. Note how some of the image names do not contain anything resembling the query, but rather the name "Kibo", which is the name of one of Mt. Kilimanjaro's peaks (see Table 10).

Table 7: PicASHOW images for the query Kilimanjaro

Our ranking techniques require that the search topic be one of wide interest on the Web, where image replication is likely to occur. For example, people are likely to display replications of publicly released images of celebrities. Natural landmarks may also induce image replications. However, a query such as Paris is less likely to achieve quality results, since people will often display images of their own vacation in Paris, rather than replicated versions of the city's landmarks. Also, as in the case of link analysis in Web page search, queries on obscure topics of little interest will most likely fail to produce quality results.

Since PicASHOW performs no image analysis whatsoever, it cannot handle queries that contain image qualifiers such as color, orientation, and other specific features. For example, PicASHOW can retrieve images of Michael Jordan, but not of Michael Jordan wearing a suit. It can find images of Jaguar cars, but not of red Jaguar cars, and while it will rank nicely images of Mount Kilimanjaro, one cannot ask for those images not to contain trekkers, or to be taken from below. Note however, that link-analysis based techniques could still be of value for such queries. For example, PicASHOW could be used as an initial filter to find candidate images on the topic of interest (e.g., Jaguar cars). Some form of image analysis could then be performed on these candidate images in order to select those that further satisfy the image qualifications (e.g. which of the resulting Jaguar cars are red).

Several interesting extensions of the image search schemes are feasible:


We would like to thank Shlomo Moran and Yoelle Maarek for useful discussions on the ideas presented in this paper.

Appendix A - URLs of images

We bring here the URLs of the showcased images. For each result set, the URLs are given in clockwise order, starting from the upper left corner. For PicASHOW's results, multiple URLs are given for replications of the same image, when applicable.

URLs of PicASHOW's "Michael Jordan" images
http://sesd.sk.ca/scp/images/AIR_JORDAN.gif     (animated gif)
http://metal.chungnam.ac.kr/~myoungho/1991.gif     (animated gif)
URLs of Scour's "Michael Jordan" images

Table 8: URLs of "Michael Jordan" images

URLs of PicASHOW's Van Gogh images
URLs of AltaVista's Vincent Van Gogh images

Table 9: URLs of Vincent Van Gogh images

URLs of PicASHOW's Kilimanjaro images

Table 10: URLs of Kilimanjaro images

URLs of PicASHOW's Solar System images
http://www.geosci.unc.edu/classes/Geo15/SNsmall.gif  5
http://www.solarviews.com/images/rocketvision.gif (animated gif)
URLs of Ditto's Solar System images

Table 11: URLs of Solar System images

URLs of PicASHOW's Jaguar car images
http://www.ferrari-transmissions.co.uk/home2.jpg  6
URLs of the Lycos Jaguar car images

Table 12: URLs of Jaguar car images


  1. Yariv Aridor, David Carmel, Ronny Lempel, Yoelle S. Maarek, and Aya Soffer. Knowledge agents on the web. Proc. 4th International Workshop on Cooperative Information Agents, 2000.
  2. Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine. Proc. 7th International WWW Conference, 1998.
  3. Marco La Cascia, Saratendu Sethi, and Stan Sclaroff. Combining textual and visual cues for content-based image retrieval on the world wide web. Proc. IEEE Workshop on Content-based Access of Image and Video Libraries, 1998.
  4. IBM Corporation Almaden Research Center. CLEVER, http://www.almaden.ibm.com/cs/k53/clever.html.
  5. S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, P. Raghavan, and S. Rajagopalan. Automatic resource list compilation by analyzing hyperlink structure and associated text. Proc. 7th International WWW Conferenc, 1998.
  6. S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, S.R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Hypersearching the web. Scientific American, June 1999.
  7. AltaVista Company, Altavista Search Home. http://www.altavista.com/.
  8. Jeffrey Dean and Monika R. Henzinger. Finding related pages in the world wide web. Proc. 8'th International World Wide Web Conference, 1999.
  9. ditto.com. ditto.com visual search engine. http://www.ditto.com/.
  10. Columbia University Center for Telecommunications Research. webSEEk, http://www.ctr.columbia.edu/webseek/.
  11. C. Frankel, M. Swain, and V. Athitsos. WebSeer: An image search engine for the world wide web. Technical Report TR-96-14, Computer Science Department, University of Chicago, 1996.
  12. Theo Gevers and Arnold W. M. Smeulders. The PicToSeek www image search system. Proc. IEEE International Conference on Multimedia Computing and Systems, 1:264-269, 1999.
  13. V. Harmandas, M. Sanderson, and M.D. Dunlop. Image retrieval by hypertext links. Proc. 20'th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 296-303, 1997.
  14. Corbis Inc., Corbis images. http://www.corbis.com/.
  15. Getty Images Inc., Getty images. http://www.getty-images.com/.
  16. Google Inc., Google search engine. http://www.google.com/.
  17. Lycos Inc., Lycos internet guide. http://multimedia.lycos.com/.
  18. Lycos Inc., Lycos internet guide. http://www.lycos.com/.
  19. Scour Inc., Scour. http://www.scour.com/.
  20. Yahoo! Inc., Yahoo! http://www.yahoo.com/.
  21. K. Kanth, D. Agrawal, and A. Singh. Dimensionality reduction for similarity searching in dynamic databases. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 166-176, 1998.
  22. M.M. Kessler. Bibliographic coupling between scientific papers. American Documentation, 14:10-25, 1963.
  23. Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. J. ACM 46:5, 604-632, September 1999.
  24. Ronny Lempel and Shlomo Moran. The stochastic approach for link-structure analysis (SALSA) and the TKC effect. Proc. 9th International WWW Conference, 2000.
  25. D. Rafiei and A.O. Mendelzon. What is this page known for? computing web page reputations. Proc. 9th International WWW Conference, 2000.
  26. H. Small. Co-citation in the scientific literature: A new measure of the relationship between two documents. J. American Soc. Info. Sci., 24:265-269, 1973.
  27. J. R. Smith and S.-F. Chang. Searching for images and videos on the world-wide web,. Technical Report 459-96-25, Columbia University/CTR, 1996.
  28. C.J. van Rijsbergen. Information Retrieval.  Butterworths, 1979.


... similarity queries1
Note that this is different than the similarity queries mentioned in the context of CBIR systems - our method will find images on the same topic as the sample images rather than images with similar visual properties
... principal eigenvector 2
The eigenvector which corresponds to the eigenvalue of highest magnitude of the matrix.
... R+ 3
R+ is the set of non-negative real numbers
... SALSA 4
An in-depth description of SALSA([24]) is out of this paper's scope. However, SALSA is based on co-citation, and its analysis, just like Kleinberg's Mutual Reinforcement approach, is completely governed by the adjacency matrix which is used.
... .gif 5
Although two of the URLs have a suffix of .gif, all three contain the same jpeg image. The suffix, in these cases, does not describe the file type correctly.
... home2.jpg 6
When enlarged, this image reads "Michael Ferrari is my name, but Jaguars are my game". Mr. Ferrari claims to be an independent Jaguar transmission specialist...