Traditionally, search algorithms have been studied in the Information Retrieval (IR) research community. Most traditional algorithms are keyword-based(*): given a user query, they use word frequencies, word importance, document length, and other statistical cues to assign a potential importance to each document. However, with the emergence of the web, many new algorithms for web search have been proposed and are in use in various web search engines today. Many of these algorithms incorporate the link structure of pages in their ranking schemes and differ notably from the traditional keyword-based document ranking algorithms.
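As a rough illustration of how the statistical cues mentioned above (word frequency, word importance, document length) can be combined into a keyword-based ranking, consider the following minimal sketch. It is not the algorithm evaluated in this study. The function name `rank_documents`, the whitespace tokenizer, the logarithmic term-frequency and inverse-document-frequency weights, and the length-normalization constant 0.5 are all assumptions made for illustration.

```python
import math
from collections import Counter


def rank_documents(query, docs):
    """Return document indices ordered from highest to lowest score.

    Illustrative sketch only: a tf-idf style score with a simple
    length normalization, not the method used in this study.
    """
    tokenized = [d.lower().split() for d in docs]  # naive whitespace tokenizer (assumption)
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    # Average document length; fall back to 1.0 if every document is empty.
    avg_len = sum(len(t) for t in tokenized) / n_docs or 1.0

    # Document frequency: in how many documents does each word occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scored = []
    for idx, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        # Longer-than-average documents are penalized; 0.5 is an arbitrary choice.
        length_norm = 0.5 + 0.5 * len(tokens) / avg_len
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            word_frequency = 1.0 + math.log(tf[term])           # dampened term frequency
            word_importance = math.log((n_docs + 1) / df[term])  # rarer words weigh more
            score += word_frequency * word_importance / length_norm
        scored.append((score, idx))

    # Highest score first; ties are broken by original document order.
    return [idx for score, idx in sorted(scored, key=lambda p: (-p[0], p[1]))]
```

Note that a scheme of this kind looks only at the text of each document and the query; it uses no information about links between documents, which is the main point of contrast with the link-based web search algorithms.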
One conference that has been quite influential in the advancement of traditional keyword-based IR ranking algorithms is the Text REtrieval Conference, or TREC. TREC is a series of annual conferences run by DARPA and NIST with the aim of objectively evaluating text search and related technologies in independently run evaluations. Valuable benchmark test collections are produced as a by-product of TREC. The document search problem is called ad-hoc search in TREC. The ad-hoc scenario parallels what happens in web search: a user provides the search system with a (usually short) query, and the system ranks potentially relevant documents in response to the query. Traditionally, TREC has used documents from newswires and other non-web text collections, for example AP Newswire, the Wall Street Journal, the LA Times, the San Jose Mercury News, and the Federal Register. More recently, there has been a shift towards using a web document collection.
During the last eight years, TREC participants have developed new document ranking algorithms that have been shown to be quite effective for searching the document collections used in the TREC ad-hoc tasks. One difference between the traditional IR or TREC environment and the web environment is the presence of hyperlinks between web documents. Several search techniques that exploit these links have been proposed for the web environment [1,9]. Major web search engines do not disclose all details of their ranking schemes; however, it is widely known that several of them do incorporate link information in some form [1,25]. How much more effective are link-based methods in the web environment than a state-of-the-art keyword-based method developed for the TREC ad-hoc task? This question has been examined in only a limited number of studies, mostly under TREC's web track [5,6,7]. The results from these studies indicate that, for web search, link-based methods do not hold any advantage over the state-of-the-art keyword-based methods developed for TREC ad-hoc search. These results are quite counter-intuitive given the general wisdom in the web search community that some kind of linkage analysis does improve web page/site ranking. Our work is motivated by this discrepancy between the results presented in [5,6,7] and the general belief in the web search community.
Different web search engines make competing claims regarding their coverage and search effectiveness. In this study, we do not concentrate on comparing the search effectiveness of different web search engines; several studies have already made such comparisons [4,11]. Instead, our aim is to study how a state-of-the-art keyword-based document ranking algorithm (emerging from the TREC ad-hoc task) performs on a realistic web search task, and how that performance compares to the performance of some popular web search engines that use link structure in their ranking schemes. Previous studies have shown that link-based methods do not hold much advantage over keyword-based TREC ad-hoc algorithms; however, these studies accompany their results with several caveats, which we discuss in detail in Section 2. This work aims to study the above question in an environment that is closer to real web search and does not have these caveats. Again, the details are discussed in Section 2.
The rest of the study is organized as follows. Section 2 discusses the TREC ad-hoc and web tracks and points out some of the shortcomings of the web search evaluation studies done under TREC. Section 3 discusses our experimental environment and explains how this environment removes the problems associated with the previous studies. In Section 4, we describe our implementation of a state-of-the-art TREC ad-hoc algorithm and show that it is indeed competitive with the best TREC results. In Section 5, we present and discuss our results. Section 6 concludes the study.