Ideally, the queries for a web search evaluation should be a random sample of the queries posed to a search engine by a large population of users, and the pages retrieved by each engine should be judged by the person who posed the query. However, the two goals of a) using a large population of users, and b) asking the original user to make the relevance judgments, are hard to reconcile in a lab setting. One possible fix is to use only the limited set of users available for an experiment, along with their queries and their judgments. This approach does not yield as wide a variety of query types as a real search engine query log. The other fix is to use a sample of queries from a real search engine query log and ask a human subject, necessarily different from the person who posed the query, to judge pages for someone else's query. This is the approach taken in some of the TREC studies [5,7]. In essence, the human subject is told: ``make your judgments based on what you would have been looking for, had you posed this query''. This approach suffers from the problem that two people can interpret the same query quite differently. For example, a human subject can interpret the query ``who wants to be a millionaire'' as looking for ratings/reviews of the famous TV show, whereas the original user who posed the query might have been looking for the home-page for the show.
Given these problems in obtaining extensive relevance judgments, and given that collecting relevance judgments from humans for a large set of queries takes considerable time and labor, we decided to experiment with only one popular type of web query for which relevance judgments are relatively easy to make: queries of the type find a web page/site. A large proportion of users pose such queries to web search engines every day, and relevance judgments for these queries are less expensive to obtain. This selection also allows us to do our evaluation using a relatively large set of test queries.
From two real user query logs, one our internal log for our own engine and the other made available by Excite (www.excite.com), we select queries that are explicitly seeking a home-page or a web-site. The Excite log(++) contains 2,477,283 queries posed to Excite during a few hours on Dec. 20, 1999. To avoid the query interpretation problem mentioned above, we first find all queries in these logs that contain the string home followed by the string page, or the string web followed by the string page or site. These strict selection criteria give us 14,603 queries from this log, for example ``Aces High homepage'', or ``Champion Nutrition web site''. There are many more queries in the log that seem to be seeking a web page/site (e.g., ``Panache communications'' or ``Office Depot''), but we do not want to get engaged in a query interpretation exercise. We then use a human subject to go through these 14,603 filtered queries, and a) eliminate the ones that are not seeking an explicit page, e.g., ``web site administration'', and b) link the remaining queries to their respective web pages, e.g., link ``Purdue University Homepage'' to www.purdue.edu. Using this process, we generate a set of 100 queries, and their corresponding relevant pages, for use in our evaluation.
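The string-matching filter described above can be sketched as follows. This is a minimal illustration rather than the code used in the study: the function names are hypothetical, and allowing only optional whitespace or a hyphen between the two strings is an assumption about what ``followed by'' means.

```python
import re

# "home" followed by "page", or "web" followed by "page" or "site".
# Assumption: the two strings may be separated only by optional
# whitespace or hyphens (e.g. "homepage", "home page", "web-site").
EXPLICIT_PAGE_PATTERN = re.compile(
    r"home[\s-]*page|web[\s-]*(?:page|site)", re.IGNORECASE
)


def seeks_explicit_page(query: str) -> bool:
    """Return True if the query contains one of the explicit page/site strings."""
    return EXPLICIT_PAGE_PATTERN.search(query) is not None


def filter_log(queries):
    """Keep only the queries that pass the string filter."""
    return [q for q in queries if seeks_explicit_page(q)]
```

Under these assumptions, ``Aces High homepage'' and ``Champion Nutrition web site'' pass the filter and ``Office Depot'' does not. A query such as ``web site administration'' also passes, which is why the manual elimination step is still needed.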
Since the keyword-based TREC algorithms are quite sensitive to the presence of extraneous words (like homepage) in a query, the human subject generating the (query, relevant page) pairs also removed these extraneous words from the queries. So the query ``Champion Nutrition web site'' was reduced to just ``Champion Nutrition''. To our knowledge, most web search engines have such a stop-list (list of words to remove) for query processing. Despite our instructions, eight of the 100 queries were left as is by our human subject and do contain these extraneous words.
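In the study this reduction was done by hand. An automated equivalent might look like the sketch below, where the pattern and the function name are illustrative assumptions and the list of extraneous words is limited to the strings used in the selection filter.

```python
import re

# Assumed stop-list: the same page/site strings used to select the queries.
EXTRANEOUS = re.compile(
    r"\b(?:home[\s-]*page|web[\s-]*(?:page|site))\b", re.IGNORECASE
)


def strip_extraneous(query: str) -> str:
    """Remove the extraneous page/site words and normalize whitespace."""
    cleaned = EXTRANEOUS.sub(" ", query)
    return " ".join(cleaned.split())
```

With this sketch, ``Champion Nutrition web site'' is reduced to ``Champion Nutrition'', and a query without any of these words is returned unchanged.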
Our query selection process eliminates the first, second and fourth problems (mentioned in Section 2) with the previous studies done in [5,6,7]. Since we have only one page that is relevant to a query, the fourth problem of differences in quality of two relevant pages does not exist. Also, the larger problem (problem 2 in Section 2) of page-based, instead of site-based, evaluation disappears since there is only one correct site for a query.
We realize that for queries that seek a web site, it is possible for an engine to use some URL-based heuristics to improve its chances of finding the relevant site. For example, for the query ``IBM'', it is a reasonable guess that the user is looking for the site www.ibm.com. If the commercial web search engines use such URL-based heuristics, they will have an unfair advantage over the TREC algorithms. For this reason, in our query selection process, we take extra care to make sure that the desired site is not a URL formed easily by using query words. For example, we reject queries like ``IBM'', or ``AOL'', or the query ``williams sonoma homepage'', as the desired page (www.williams-sonoma.com) has query words in the URL. Even though there is nothing wrong with using such URL cues to rank pages for a query, we want to limit the advantage the commercial engines might have due to using such cues. For the queries used in this study, if the commercial engines do use some URL cues to promote certain pages, they must do some non-trivial processing of the query string to match it to a page URL. One variable that we did not account for in our query selection process was keyword navigation services like RealNames. We discuss the impact of this on our results in Section 5.
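The text does not give an exact rule for when a URL is ``formed easily'' from query words, and the check was made manually. As one possible reading, the sketch below flags a (query, URL) pair when the query words, joined directly or with hyphens, equal one of the labels of the host name. The function name, the token list, and the rule itself are assumptions made for illustration.

```python
import re
from urllib.parse import urlsplit

# Assumed list of page/site tokens to ignore when forming URL guesses.
EXTRANEOUS_TOKENS = {"home", "homepage", "page", "web", "website", "site"}


def url_is_easily_guessed(query: str, url: str) -> bool:
    """Return True if joining the query words yields a label of the URL's host."""
    words = [
        w for w in re.findall(r"[a-z0-9]+", query.lower())
        if w not in EXTRANEOUS_TOKENS
    ]
    if not words:
        return False
    # Candidate guesses: words concatenated directly or joined with hyphens.
    guesses = {"".join(words), "-".join(words)}
    # Accept URLs written without a scheme, such as "www.ibm.com".
    host = urlsplit(url if "//" in url else "//" + url).hostname or ""
    return any(label in guesses for label in host.split("."))
```

Under this assumed rule, ``IBM'' with www.ibm.com and ``williams sonoma homepage'' with www.williams-sonoma.com are both flagged and would be rejected. A stricter or looser rule, for example one that flags any single query word appearing in the host name, would reject a different set of queries.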