aIBM Haifa Research Laboratory
MATAM, Haifa 31905, Israel
bDept of Computer Science
Technion, Haifa, Israel
One of the first dynamic search heuristics was the "fish search" [De Bra et al. 94], that capitalizes on the intuition that relevant documents often have relevant neighbours. Thus, it searches deeper under documents that have been found to be relevant to the search query, and stops searching in "dry" areas. However, the fish-search algorithm presents some limitations. Under time limit constraints, it does not discover "enough" relevant search directions. In this paper, we propose an improved and more "aggressive" version of that algorithm, that we call the "shark search" algorithm. The shark-search algorithm overcomes the limitations of its ancestor, by better estimating the relevance of neighbouring pages, even before they are accessed and analyzed.
We have embodied the shark-search algorithm into Mapuccino (previously known as WebCutter), a system for dynamically generating Web maps tailored to the user's interests, that was described at the previous WWW6 conference [Maarek et al. 97] and that is provided as a free service to Web users at the Mapuccino home page.
Section 2 briefly reviews related work and
describes in detail the fish-search algorithm and its limitations. Section
3 introduces the shark-search algorithm. Section
4 describes the Mapuccino system, a system for dynamic
site mapping, in which the shark-search was embodied. Section
5 shows some experimental results obtained with Mapuccino
when using these search heuristics and in particular
shows the improvements achieved by the shark-search algorithm over the
fish-search algorithm. Section 6 summarizes the
contributions of this paper. Note that while the shark-search algorithm
has been implemented in and for the Mapuccino system, i.e., for tailorable
site mapping, it can in principle be used in other dynamic search systems.
Children whose depth is greater than 0 are inserted in the priority list according to the following heuristics:
The fish-search algorithm, while being attractive because of the simplicity of its paradigm, and its dynamic nature, presents some limitations.
First, it assigns a relevance score in a discrete manner (1 for relevant, 0 or 0.5 for irrelevant) using primitive string- or regular-expression match. More generally, the key problem of the fish-search algorithm is the very low differentiation of the priority of pages in the list. When many documents have the same priority, and the crawler is restricted to a fairly short time, arbitrary pruning occurs the crawler devotes its time to the documents at the head of the list. Documents which are further along the list whose scores may be identical to some further along may be more relevant to the query. In addition, cutting down the number of addressed children by using the width parameter is arbitrary, and may result in loosing valuable information. Clearly, the main issue that needs to be addressed is a finer grained scoring capability. This is problematic because it is difficult to assign a more precise potential score to documents which have not yet been fetched.
One immediate improvement is to use, instead of the binary (relevant/irrelevant) evaluation of document relevance, what we will call hereafter a similarity engine in order to evaluate the relevance of documents to a given query. Such an engine analyzes two documents dynamically and returns a "fuzzy" score, i.e., a score between 0 and 1 (0 for no similarity whatsoever, 1 for perfect "conceptual" match) rather than a binary value. A straightforward method for building such an engine is to apply the usual vector space model [Salton & McGill 83]. It has been shown that these kinds of techniques give better results than simple string- or regular-expression matching [Salton 89, p. 306]. The similarity algorithm can be seen as orthogonal to the fish-search algorithm. We will assume in the rest of the paper that such an engine is available and that for any pair query, document, (q,d), it returns a similarity score sim(q,d) between 0 and 1.
This first improvement has a direct impact on the priority list. We use a "fuzzy" relevance score, giving the child an inherited score, thus preferring the children of a node that has a better score. This information can be propagated down the descendants chain, thus boosting the importance of the grandchildren of a relevant node over the grandchildren of an irrelevant node. The children of an irrelevant node use their parent's inherited score for calculating their own inherited score by multiplying it by some decay factor d, whose role is comparable to Marchiori's "fading" factor [Marchiori 97]. Thus, suppose documents X and Y were analyzed and that X has higher score. Suppose further that the children of both X and Y have a null score, and the algorithm now has to select the most promising of their grandchildren. Since both their scores are multiplied by d 2, the shark search chooses X's grandchildren first (which seems intuitively correct). Stepping further away from X, into the great-grandchildren zone, would bring Y's descendants back into consideration, given an appropriate selection of d. This continuous valued behavior of score inheritance is much more natural than the Boolean approach of the fish-search.
A more significant improvement consists of refining
the calculation of the potential score of the children not only by propagating
ancestral relevance scores deeper down the hierarchy, but also by making
use of the meta-information contained in the links to documents.
We propose to use the hints that appear in the parent document, regarding
a child. The anchor text of the link is the author's way to hint as to
the subject of the linked document. A surfer on the Web, encountering a
page with a large amount of links, will use the anchor text of the links
in order to decide how to proceed. Some automated tools (see [Iwazume
et al. 96]) also take advantage of that information. This approach
however can fail in poorly styled HTML documents, in which anchor texts
consist only of "click here" or "jump there", or when the anchor information
is a picture without ALT information.
To remedy this problem we suggest using the close textual context of the link, and combining the information extracted from it with the information extracted from the anchor text. To reduce the risk of mistakenly giving a high potential score to a node due to a textual context that actually belongs to another node linked in the same paragraph (context boundaries are difficult to identify), we suggest a small fix. If a node has a relevant anchor text, the score of the textual context is set to the maximum (value of 1), thus giving it an edge over neighbouring links. We claim that a policy that selects the children with the most promising anchor and anchor context information is preferable to arbitrarily selecting the first set of children.
These heuristics are so effective in increasing the differentiation of the scores of the documents on the priority list, that they make the use of the width parameter redundant there is no need to arbitrarily prune the tree. Therefore, no mention is made of the width parameter in the shark-search algorithm that is more formally described in Fig. 2 (note that only the delta from the fish-search algorithm is given).
In the fish-search algorithm, replace step 3.1, for computing the child's potential score, with the
neighbourhood_score = b * anchor_score + (1 b) * anchor_context_score
where b is a predefined constant
potential_score(child_node) = g * inherited_score(child_node) + (1 g) * neighbourhood_score(child_node)
where g is a predefined constant.
Tailored maps have "longer arms" in relevant directions. They display not only pages actually relevant to the topic of interest (highlighted in blue, shaded lighter or darker according to the relevance in Fig. 3) but also their immediate neighbourhood (not highlighted, i.e., coloured in white) so as to show relevant information in context.
In a typical scenario, the user accesses a Mapuccino server via his/her Web browser, specifies what site (or seed URL) s/he wishes to map, how long s/he is ready to wait, and more importantly what topics s/he is more interested in. The server then works in three steps:
The pages relevant to the user's interests are shown in context together with pages that might not be relevant but are hyperlinked to them. Thus, if a user generated a map tailored to the topic "Java", s/he may see a highlighted page about "Java education", which itself is related to some pages dealing with education on the object-oriented paradigm. While the latter pages are not directly related to the original user's query, and thus are not highlighted in the map, they can still present some interest. The context as defined by the Web authors thus brings additional information.
Note that the purpose of this tool is not to simply search for information in a site, but rather to generate a global view of a Web area where the user knows that relevant information can be found. In other terms, the prime intention of the user is to browse rather than to search.
The Mapuccino system has been fully implemented, and comprises two main components:
To simplify the evaluation, we therefore propose to use a measure that
reflects what most users expect from a "good" map: namely, getting
as many relevant documents in the shortest delays.
We define this measure as the sum of similarity scores of all documents forming the maps, which we will refer to as the "sum of information" measure in the rest of this paper. Note that we do not use an average measure (for instance dividing the sum by the total number of nodes in the map) since irrelevant documents that were identified in the same process do not hurt the quality of the map. On the contrary, they give context information and can be pruned from the map if so desired.
So in summary, if we define as sim(d,q), the similarity score (between 0 and 1) between any document d in the map and query q directing the mapping, and M the set of documents in a given map, we have
Note that sim(d,q) is the actual similarity score as computed by the similarity engine for documents whose contents was fetched by the crawler. Potential scores, which are speculative, do not affect this value.
Coming to include fish-search as a control group in the comparison tests with the shark-search algorithm, we decided to slightly refine the tested fish-search component by using the same retrieval engine as for the shark-search component so as not to bias the experiments and unfairly penalize the fish-search algorithm.
Throughout the experiments, the shark-search runs obtained significantly higher sum-of-information values than the fish-search algorithm. In average, the best results were obtained at d = 0.5, b = 0.8 and g = 0, as the results were systematically in the top three sum-of-information values (note that it happened that the maximum was reached a few times at g = 1, and once at g = 0.5).
Table 1 shows the performance of the shark-search against that of the fish-search, for four test cases. We see that the shark-search performed between 1.5 and 3 times better than its ancestor.
sum of info
sum of info
|query: "computational biology course"
seed URL: http://www.cs.technion.ac.il
|query: " Global survey Mars exploration
seed URL: http://www.cnn.com
|query: "breast feeding nutrition"
seed URL: http://www.parentsplace.com
|query: "semi-definite programming and Cholesky
seed URL: http://www.mcs.anl.gov/otc/Guide/
Note that these test cases having been designed experimentally (rather than formally over a statistically significant sample, and using testing guidelines procedure), they should be considered as a positive indicator of the effectiveness of the algorithm rather than as formal results.
We used three variations of the Mapuccino engine, (all using the same similarity engine) to generate the maps below. The seed in all examples was the CNN home page (depicted by a small house icon), using "San Francisco 49ers" as search query for directing the crawling process. Relevant pages are highlighted in blue (irrelevant ones in white). The darker the shade of blue, the more relevant the document is. The same amount of time was affected in all three cases.
Case 1: fish-search algorithm (with similarity engine)
The map in Fig. 4 shows that 3 directions were identified by the fish-search algorithm from the CNN home page, and one got explored somewhat deeper in the given time span. The total of relevant pages thus identified amounts to 7.
Case 2: shark-search algorithm without context analysis
As seen in Fig. 5, the same directions were explored at the first level, but thanks to the queue prioritization, the server did not waste time exploring irrelevant pages identified at the first level in the previous case, and could go deeper in two relevant directions, identifying more relevant pages as indicated by the darker shade of blue. A total of 14 relevant documents was found, and 3 of them were particularly relevant (no document with this degree of relevance was found in the previous case).
Case 3: complete shark-search algorithm
The tendency shown in the previous example gets more pronounced. The first-level descendant that seemed to induce the best direction (after the fact) is immediately identified as a highly relevant page (darker blue), see Fig. 6, and more sub-directions are identified in the same time span. The number of relevant pages found increased to 17, 4 of them being highly relevant.
|Michael Herscovici is a Research Staff Member at the IBM Haifa Research Lab in Haifa, Israel and belongs to the "Information Retrieval and Organization" Group. His research interests include Internet applications and parsing techniques. Mr. Herscovici will receive his B.Sc. in Computer Science from the Technion, Israel institute of technology in Haifa, in 1998. He joined IBM in 1997 and has since worked on the dedicated robot component of Mapuccino, a Web site mapping tool.|
|Michal Jacovi is a Research Staff Member at the IBM Haifa Research Lab in Haifa, Israel, and belongs to the "Information Retrieval and Organization" Group. Her research interests include Internet applications, user interfaces, and visualization. She received her M.Sc. in Computer Science from the Technion, Haifa, Israel, in 1993. Ms. Jacovi has joined IBM in 1993, and worked on several projects involving user interfaces and Object Oriented, some of which have been published in journals and conferences. Since the emergence of Java, she has been involved in the conception and implementation of Mapuccino, a Web site mapping tool, written in Java, that is being integrated into several IBM products.|
|Yoelle S. Maarek is a Research Staff Member at the IBM Haifa Research Lab in Haifa, Israel and manages the "Information Retrieval and Organization" Group that counts about 10 staff members. Her research interests include information retrieval, Internet applications, and software reuse. She graduated from the "Ecole Nationale des Ponts et Chaussees", Paris, France, as well as received her D.E.A (graduate degree) in Computer Science from Paris VI University in 1985. She received a Doctor of Science degree from the Technion, Haifa, Israel, in January 1989. Before joining IBM Israel, Dr Maarek was a research staff member at the IBM T.J. Watson Research Center for about 5 years. She is the author of Guru, an advanced information retrieval system, widely used within IBM, and lead the team that conceived and implemented Mapuccino, a Web site mapping tool, written in Java, that is being integrated into several IBM products. She has published over 20 papers in referred journals and conferences.|
|Dan Pelleg received his B.A. from the Department of Computer Science, Technion, Haifa, Israel, in 1995. He is currently an M.Sc. student in the same department, finishing his Master's thesis, "Phylogeny Approximation via Semidefinite Programming". His research interests include computational biology, combinatorial optimization and Web-based software agents. During the summer of 1997, Dan worked as an extern student in IBM Haifa Research Laboratory, where he focused on the Shark-search algorithm.|
|Menachem Shtalhaim is a Research Staff Member at the IBM Haifa Research Lab in Haifa, Israel and belongs to the "Information Retrieval and Organization" Group. His research interests include Internet applications, communication protocols and heuristic algorithms. Mr. Shtalhaim joined IBM in 1993, and worked on several projects involving morphological analysis tools, Network Design and analysis tool (IBM product NetDA/2) and the AS400 logical file system layer. In the past, Mr. Shtalhaim have worked on medical diagnostic systems. He is the author of the dedicated robot component of Mapuccino, a Web site mapping tool.|
|Sigalit Ur is a Research Staff Member at the IBM Haifa Research Lab in Haifa, Israel, working on Mapuccino, a Web site mapping tool, written in Java, that is being integrated into several IBM products. She received a Master of Science degree in Intelligent Systems from the University of Pittsburgh in 1993. Before joining IBM, Ms. Ur was involved in projects in a wide variety of fields, including data processing, databases, cognitive science, multi-agent planning and image processing, some of which have been published in journals and conferences.|