IEPAD: Information Extraction based on Pattern Discovery

 

Chia-Hui Chang

Dept. of Computer Science and Information Engineering

National Central University, Chung-Li, Taiwan 320
Tel: +886-3-4227151x4523

chia@csie.ncu.edu.tw

 

Shao-Chen Lui

Dept. of Computer Science and Information Engineering

National Central University, Chung-Li, Taiwan 320

anyway@db.csie.ncu.edu.tw

Copyright is held by the author/owner.
WWW10, May 2-5, 2001, Hong Kong
ACM 1-58113-348-0/01/0005.

 

ABSTRACT

The research in information extraction (IE) regards the generation of wrappers that can extract particular information from semi-structured Web documents. Similar to compiler generation, the extractor is actually a driver program, which is accompanied with the generated extraction rule. Previous work in this field aims to learn extraction rules from users’ training example. In this paper, we propose IEPAD, a system that automatically discover extraction rules from Web pages. The system can automatically identify record boundary by repeated pattern mining and multiple sequence alignment. The discovery of repeated patterns are realized through a data structure call PAT trees. Additionally, repeated patterns are further extended by pattern alignment to comprehend all record instances. This new track to IE involves no human effort and content-dependent heuristics. Experimental result shows that the constructed extraction rules can achieves 97 percent extraction over fourteen popular search engines.

Keywords

Information extraction, extraction rule, PAT tree, multiple string alignment

1.     INTRODUCTION

Current Web sites present information on various topics in various formats.  A great amount of effort is often required for a user to manually locate and extract useful data from the Web sites. Therefore, there is a great need for value-added service that integrates information from multiple sources. For example, customizable Web information gathering robots/crawlers, comparison-shopping agents, meta-search engines and newsbots, etc. To facilitate the development of these information integration systems, we need good tools for information gathering and extraction. Suppose the data has been collected from different Web sites, a conventional approach for extracting data from various Web pages would have to write programs, called “wrappers” or “extractors”, to extract the contents of the Web pages based on a priori knowledge of their format. In other words, we have to observe the extraction rules in person and write programs for each Web site. However, programming wrappers requires manual coding which generally entails extensive debugging and is, therefore, labor-intensive. In addition, since the format of Web pages is often subject to change, maintaining the wrapper can be expensive and impractical.

Fortunately, researchers have built tools that can generate wrappers automatically. For example, WIEN [11], Softmealy [7], Stalker [13] etc. are three famous works in this field. Similar to scanner or parser generator for compilers where users provide the syntax grammar and get the transition tables for scanner or parser drivers, these wrapper construction systems actually output extraction rules from training examples provided by the designer of the wrapper. The common idea involved is the machine learning techniques to summarize extraction rules, while the difference is the extractor architectures presumed in each system. For example, the single-pass, LR structure in WIEN and the multi-pass, hierarchical structure in Stalker. Nevertheless, the designer must manually label the beginning and the end of the training examples for generating the rules. Manual labeling, in general, is time-consuming and not efficient enough. 

Recently, researchers are exploring new approaches to fully automate wrapper construction. That is, without users’ training examples. For example, Embley et. al. describe a heuristic approach to discover record boundaries in Web documents by identifying candidate separator tags using five independent heuristics and selecting a consensus separator tag based on a heuristic combination [4]. However, one serious problem in this one-tag separator approach arises when the separator tag is used elsewhere among a record other than the boundary.

On the other hand, our work here attempts to eliminate human intervention by pattern mining [1]. The motivation is from the observation that useful information in a Web page is often placed in a structure having a particular alignment and order. Particularly, Web pages produced by search engines generally present search results in regular and repetitive patterns. Mining repetitive patterns, therefore, may discover the extraction rules for wrappers.

In this paper, we introduce IEPAD, an information extraction system applying pattern discovery techniques. In section 2, we present the system overview of IEPAD, including a pattern viewer, a rule generator and an extractor module. In section 3, we present the details of rule generator, followed by the embodiment of extractor in Section 4. Finally, we present experimental results in section 5 and make the conclusion in section 6.

 

2.     SYSTEM OVERVIEW

The system IEPAD includes three components, an extraction rule generator which accepts an input Web page, a graphical user interface, called pattern viewer, which shows repetitive patterns discovered, and an extractor module which extracts desired information from similar Web pages according to the extraction rule chosen by the user.

The core techniques of pattern mining are implemented in the rule generator. The extraction rule generator includes a translator, a PAT tree constructor, a pattern discoverer, a pattern validator, and an extraction rule composer. The results of rule extractor are extraction rules discovered in a Web page. The graphical user interface can then enable users to view the information extracted by each extraction rule. Once the user selects a target extraction rule conforming to his information desire, the extractor module can use it to extract information from other pages having similar structure with the input page.

Referring to Figure. 1, a flowchart of the rule extraction process is shown. The flowchart presents an overview of the mining process; more details will follow as to how a target pattern for extraction is generated. When a user submits an HTML page to IEPAD, the translator will receive the HTML page and translate it into a string of abstract representations, referred to here as tokens.  Each token is represented by a binary code of fixed length l.  The PAT tree constructor receives the binary file to construct a PAT tree.  The pattern discoverer then uses the PAT tree to discover repetitive patterns, called maximal repeats. The maximal repeats are forwarded to validator, which filters out undesired patterns and produces candidate patterns. Finally, the rule composer revises each candidate pattern to form an extraction rule in regular expression.


Once the extraction rules are discovered, the user may select from these candidate patterns his target pattern that contains desired information. The extractor receives a target pattern and a Web page as input and applies a pattern-matching algorithm to recognize and extract all occurrences of the target pattern in the token sequence of the encoded Web page. The extractor is a C++ module that can be linked to other information integration system for information extraction. A typical pattern-matching algorithm, for example, is the Knuth-Morris-Pratt’s algorithm. Each occurrence of the target pattern may represent a desired data record, and all of the desired data records form a useful information block.

 


3.     EXTRACTION RULE GENERATOR

The motivation of IEPAD is based on the observation that desired information in a Web page is often placed in a structure having a particular alignment and forms repetitive patterns. The repetitive patterns, therefore, may constitute the extraction rules for wrappers. For a program to discover such repetitive patterns in a Web page, we need some abstraction mechanism to translate the HTML source to distinguish the display format (translator) and a pattern discovery algorithm to find out all repetitive patterns. By repetitive patterns, we generally mean any substring that occurs at least twice in the encoded token string. However, such a definition will definitely include too many patterns fitting this requisite. Therefore, we define maximal repeats to uniquely identify the longest pattern as follows.

Definition: Given an input string S, we define maximal repeat a as a substring of S that occurs in k distinct positions p1, p2, … , pk in S, such that the (pi-1)th token in S is different from the (pj-1)th token for at least one i, j pair, 1£i<j£k (called left maximal), and the (pi’+|a|)th token is different from the (pj’+|a|)th token for at least one i', j’ pair, 1£i'<j’£k (called right maximal).

The definition of maximal repeats is necessary for identifying the well used and popular term repeats. Besides, it also captures all interesting repetitive structures in a clear way and avoids generating overwhelming outputs. However, not every maximal repeat represents a good one. Maximal repeats have to be further verified by the validator to filter interesting ones.

3.1     Translator

Since HTML tags are the basic components for document presentation and the tags themselves carry a certain structure information, it is intuitive to examine the tag token string formed by HTML tags and regard other non-tag text content between two tags as one single token called TEXT. Tokens often seen in an HTML page include tag tokens and text tokens, denoted as Html(<tag_name>) and Text(_), respectively. For example, Html(</a>) is a tag token, where </a> is the tag. Text (_) is a text token, which includes a contiguous text string located between two HTML tags. 

Tags tokens can be classified in many ways.  The user can choose a classification depending on the desired level of information to be extracted. For example, tags in the BODY section of a document can be divided into two distinct groups: block-level tags and text-level tags. The former defines the structure of a document, and the latter defines the characteristics, such as format and style, of the contents of the text. 

 


In Figure 2, a classification of block-level tags and text-level tags is shown. Block level tags are divided into different categories including headings, text containers, lists, and other classifications, such as tables and forms. Text-level tags are similarly divided into categories including logical markups, physical markups, and special markups for marking up texts in a text block.

The many different tag classifications allow different HTML translations to be generated. With these different abstraction mechanisms, different patterns can be produced. For example, skipping all text-level tags will result in higher abstraction from the input Web page than all tags are included. In addition, different patterns can be discovered and extracted when different encoding schemes are translated. The experiment results also indicate that block-level tags, while being a small percentage of the input size, contain a significant amount of useful information as shown in Section 5. 

For example, the translation process will be described using a simple Web page that contains only two lines of HTML source code (i.e., the Congo code):

      <B>Congo</B><I>242</I><BR>

      <B>Egypt</B><I>20</I><BR>$

The corresponding translation produced by the translator is a token string: Html(<B>)Text(_)Html(</B>)Html(<I>)Text(_)Html(</I>)  Html(<BR>)Html(<B>)Text(_)Html(</B>)Html(<I>)Text(_)Html(</I>)Html(<BR>)$, where each token is encoded as a binary strings of  “0”s and “1”s with length l. For example, suppose three bits encode the tokens in the Congo code:

Html(<B>)                         000

Html(</B>)                        001

Html(<I>)                          010

Html(</I>)                         011

Html(<BR>)                      100

Text(_)                               110

The encoded binary string for the token string of the Congo code is “000110001010110011100000110001010110011100$” of 3*14 bits, where “$” represents the ending of the encoded string.

3.2     PAT Tree Construction

Our approach for pattern discovery uses a PAT tree to discover repeated patterns in the encoded token string. A PAT tree is a Patricia tree (Practical Algorithm to Retrieve Information Coded in Alphanumeric [12]) constructed over all the possible suffix strings [5]. A Patricia tree is a particular implementation of a compressed binary (0,1) digital tree such that each internal node in the tree has two branches: zero goes to left and one goes to right. Like a suffix tree [6], the Patricia tree stores all its data at the external nodes and keeps one integer, the bit-index, in each internal node as an indication of which bit is to be used for branching. For a character string with n indexing point (or n suffix), there will be n external nodes in the PAT tree and n-1 internal nodes. This makes the tree O(n) in size.

Referring to Figure. 3, a PAT tree is constructed from the encoded binary string of the Congo example. The tree is constructed from fourteen sequences of bits, with each sequence of bits starting from each of the encoded tokens and extending to the end of the token string. Each sequence is called a “semi-infinite string” or “sistring” in short. Each leaf, or external node, is represented by a square labeled by a number that indicates the starting position of a sistring. For example, leaf 2 corresponds to sistring 2 that starts from the second token in the token string. Each internal node is represented by a circle, which is labeled by a bit position in the encoded bit string. The bit position is used when locating a given sistring in PAT tree. As shown in Figure 3, all suffix strings with the same prefix will be located in the same subtree.


 


Each edge in the PAT tree has a virtual edge label, which is the substring between two bit positions of the two nodes. For example, the edge label between node k and m is f=11100, i.e. the substring from the fifth bit to the ninth bit of sistring (external node) 5 and 12. Edges that are visited when traversing downward from root to a leave form a path that leads to a sistring corresponding to the leave. In fact, we can uniquely identify a sistring by a prefix, which is formed by concatenating all the edge labels along the path from root to the corresponding leave. The concatenated edge labels along the path form a virtual path label.  For example, the edge labels “1”, “00”, and “0…” on the path that leads from root to leave 7 form a path label “1000…”, which is a unique prefix for sistring 7.

The PAT tree shown here is actually depicted for understanding. In implementation, there are only two pointers at each internal node. The index bit alone can distinguish the branches, zero goes left and one goes right. The edges labeled with a dollar sign are an indication of the end of the string. This edge uses the right link and pointed to the internal node itself. At the construction phase, sistrigns are inserted one by one. To insert a sisting, we first search the sistring in the PAT tree and find the proper node that can distinguish the sistrings with others. Such an implementation requires more time than the optimal algorithm described in Gusfield’s book, chapter 3 to 5 (linear), since we are using binary tree to store character string and only the character suffixes need to be indexed in stead of all bit suffixes. However, binary trees have the advantage of simpler implementation.

3.3     Pattern Discoverer

All the leaves in a subtree share a common prefix, which is the path label for the path that leads from root to the root of the subtree. Each path label of an internal node represents a repeated sequence in the input.  Therefore, to discover repetitive patterns, discoverer only needs to examine path labels to determine whether or not they are maximal repeats. Since every internal node in a PAT tree indicates a branch, it implies a different bit following immediately the common prefix between two suffixes. Hence, every path label is right maximal. For a path-label of an internal node v to be left maximal, at least two leaves in the v's subtree should have different left characters. Let's call such a node left diverse. Followed by definition, the property of being left diverse propagates upward in T. Therefore all maximal repeats in S can be found in linear time based on the following lemma [6].

Lemma  The path labels of an internal node v in a PAT tree T is a maximal repeat if and only if v is left diverse.

The essence of a PAT tree is a binary suffix tree, which has also been applied in several research fields for pattern discovery. For example, Kurtz and Schleiermacher have used suffix trees in bioinformatics for finding repeated substring in genome [9]. As for PAT trees, they have been applied for indexing in the field of information retrieval since a long time ago [5]. It has also been used in Chinese keyword extraction [2] for its simpler implementation than suffix trees and its great power for pattern discovery. However, in the application of information extraction, we are not only interested in repeats but also repeats that appear regularly in vicinity.

By recording the occurrence counts and the reference positions in the leaf nodes of a PAT tree, we can easily know how many times a pattern is repeated and where it occurs. Thus, given the pattern length, occurrence count, we can traverse the PAT tree to enumerate all maximal repeats. For example, to find all patterns with the length greater than three tokens. Since each token is encoded with three bits, discoverer only needs to consider the internal nodes with index bit greater than 3*3.  Therefore, only node “d”, “e”, “g”, “l”, and “m” are qualified. 

The path labels and their respective token representations are candidates for useful repetitive patterns. Applying the definition of maximal repeat to the Congo example, only node “d” qualifies for a left diverse, since the left tokens for other leaves rooted at the same node all have the same token. For example, sistring 3 and 10 which are rooted at node “e” has the same token Text(_). The token representation for the corresponding maximal repeat is, therefore, Html(<B>)Text(_)Html(</B>)Html(<I>)Text(_)Html(</I>) Html(<BR>).

The use of maximal repeats reduces the amount of unnecessary output produced by the discoverer. Besides, during the process of discovering the maximal repeats, the number of occurrences of the repeat and their respective positions in the input sequence can be easily derived for the validator to use.

In the above example where the path label of node “d” is a maximal repeat, the subtree having node d as its root contains two leaves: leave 1 and leave 8. Because leaves 1 and 8 represent the only two sistrings that share the maximal repeat as their common prefix, the number of occurrences are two, and the positions of the maximal repeat in the input are one and eight in the input token sequence.

3.4     Pattern Validator

The above Congo source code is a simple example that shows how a maximal repeat can be discovered for a given input.  However, a typical web page usually contains a large number of maximal repeats, not all of which contain useful information.  To eliminate undesired maximal repeats, IEPAD uses the validator to determine whether or not the maximal repeats contain useful information.  In addition to the occurrence frequency and pattern length of a maximal repeat, the validator employs a number of criteria, including regularity, compactness, and coverage.  A user of IEPAD may choose to use only one of the criteria, or to use multiple of them in combination.  Each of the criteria has a threshold that can either have a default value, or can be determined by the user.  Suppose a maximal repeat a are ordered by its position such that suffix p1 < p2 < p3… < pk, where pi denotes the position of each suffix in the encoded token sequence, we define regularity, compactness, and coverage as follows.

Regularity of a pattern is measured by computing the standard deviation of the interval between two adjacent occurrences (pi+1-pi). That is, the sequence of spacing between two adjacent occurrences is (p2 - p1), (p3p2), …, (pkpk-1).  Regularity of the maximal repeat a is equal to the standard derivation of the sequence divided by the mean of the sequence.  A commonly used bound for regularity is 0.5.

Compactness is a measure of the density of a maximal repeat. It can be used to eliminate maximal repeats that are scattered far apart beyond a given bound.  Using the example of maximal repeat a in the previous paragraph, density, is defined as (k-1)*|a|/{pk-p1}, where |a| is the length of a in number of tokens. The criterion of compactness requires that only maximal repeats with a density greater than a given bound (default 0.5) be qualified for extraction.

Coverage measures the volume of content in a maximal repeat.  Suppose the function P(i) returns the position of the ith sistring in the input Web page in number of bytes.  Coverage is defined as [P(pk+|a|)–P(p1)]/|Webpage|, where |Webpage| is the number of bytes of the input Web page.

These three measures are proposed because most information we would like to extract is presented in regular and contiguous format. Ideally, the extraction pattern should have regularity equal to zero, density equal to one and large coverage. To filter potentially good patterns, a simple approach will be to use a threshold for each of these measures above. Only patterns with regularity less than the regularity threshold and density between the density thresholds are considered candidate extraction patterns. Additionally, we can see that not every candidate maximal repeat is useful. For example, patterns those do not contain any text tokens have no avail to users.

In this paper, we have utilized regularity and compactness to filter the discovered patterns. These two measures, together with coverage, can further be used to measure patterns’ fitness for pattern ranking. The validation proceeds as follows: all discovered maximal repeats with regularity less than the regularity threshold g (=0.5) are considered potential and are forwarded to rule composer if they have density greater than d (=0.5), while others (patterns with density less than d) are discarded. As for patterns with regularity greater than g, special care is taken by the partition module described below.

3.5     Occurrence Partition

The regularity threshold 0.5 can extract information that is placed in a structure having a particular alignment and order. However, in examples such as Lycos (as will be discussed later), where the pattern of the target information occurs as three blocks in the Web pages (and forms three information blocks), the regularity can be large since it is measured over all instances of the pattern. Nonetheless, the regularity of the occurrences in each single block is still small. Therefore, the idea here is to partition the occurrences into segments so that we can analyze each partition respectively.

To handle such Web pages, these occurrences of such a pattern are carefully segmented to see if any partition of the pattern's occurrences satisfies the requirement for regularity. Generally, the regularity of a pattern is computed through all occurrences of the pattern. For example, if there are k occurrences, the k-1 intervals (between two adjacent instances) are the statistics we use to compute the standard deviation and the mean. To partition those occurrences, the occurrences are first sorted by their position. Let Gi,j denotes the ordered occurrences pi, pi+1, ..., pj and initialize variables s=1, j=1. For each instance pj+1, if the regularity of Gs,j+1 is greater than g  then we further consider whether Cs,j is a good partition and assign j+1 to s. The pseudo code is as follows:

文字方塊: S=f, s=1;
For j=1 to k-1 do
	If  Regularity(Gs,j+1) > g then
	     If Regularity(Gs,j) > gm then
		S= S È{Gs,j};
	     endif
	     s= j+1;
	endif
endf

If a partition includes occurrences more than the minimum count and has regularity less than threshold gm, the pattern as well as the occurrences in this partition is outputted. Note that the threshold gm is set to a smaller value close to zero to control the number of outputted patterns. With this modification, more Web pages such as Lycos can easily be handled.

3.6     Rule Composer

The main application of PAT tree is on the domain of exact match. To allow inexact, or approximate, matching (matching with some errors permitted), the technique for multiple string alignment is borrowed to extend the discovered exact repeats. The idea is to find a good presentation of the critical common features of multiple strings. For example, suppose “adc” is the discovered pattern for token string “adcwbdadcxbadcxbdadcb”. If we have the following multiple alignment for strings ``adcwbd", ``adcxb'' and ``adcxbd":

a  d  c  w b  d

a  d  c  x  b  -

a  d  c  x  b  d

The extraction pattern can be generalized as “adc[w|x]b[d|-]” to cover these three examples where “-“ represents a missing character. Specifically, for a pattern which has k+1 occurrence, p1, p2, ..., pk+1 in the encoded token string. Let string Pi denotes the string starting at pi and ending at pi+1-1. The problem is transformed to find the multiple alignment of the k strings S=P1, P2, ..., Pk. so that the generalized pattern can be used to extract all records we need.

Multiple string comparison is a natural generalization of alignment for two strings which can be solved in O(n*m) by dynamic programming where n and m are string lengths. Extending dynamic programming to multiple string alignment yields an O(nk) algorithm. Instead, an approximation algorithm is used such that the score of the multiple alignment is no greater than twice the score of optimal multiple alignment (see [5], Chapter 14). The approximation algorithm starts by computing the center string Sc in k strings S that minimizes consensus error åPiÎS D(Sc, Pi). Once the center string is found, each string is then iteratively aligned to the center string to construct multiple alignment, which is in turn used to construct the extraction pattern.

For each pattern with density less than 1, the center star approximation algorithm for multiple string alignment is applied to generalize the extraction pattern. Note that the success of this approach lies in the assumption of contiguous repeats. If alignment of substrings between two adjacent occurrences results in extraction rules with alternatives at more than ten positions, such an alignment is ignored. In such cases, the original maximal repeats are kept instead of the aligned patterns.

Another problem regarding the aligned extraction pattern is that the pattern is not necessarily located at the boundary of the information record. Suppose the generalized extraction pattern is expressed as “c1c2c3...cn”, where each ci is either a symbol x Î S or a symbol set indicating more than one symbols can appear at position i (e.g. subset of SÈ{-}). If the actual starting position of a record is at cj then the correct extraction pattern should be “cjcj+1cj+2...cnc1c2...cj-1”. Therefore, an additional step is taken to consider whether “cjcj+1cj+2... cnc1c2...cj-1” is the pattern we need. Generally, such position j often has only one symbol. Besides, since extraction patterns often begin with tags like <DL>, <DT>, <TR> etc. and end up with tags such as <P>, <BR>, <HR> etc., we therefore uses this guide[1] to generate possible extraction rules. However, the disadvantage here is that a large number of patterns can be produced during this rule generation step.

 

4.     THE EXTRACTOR

After the extraction rules are constructed, the user may select from the pattern viewer one or more target patterns that contain desired information. Figure 4 is a demonstration of the pattern viewer where two patterns are discovered for SavvySearch. The upper frame shows the patterns discovered and the lower frame shows the details measures of the selected pattern. The selected pattern is then forwarded to the extractor for pattern recognition and extraction. A screen shot of the extracted data when double click the pattern is shown in Figure 5.

The extractor is implemented as a module that can be linked into other information integration systems. There are two ways to complete the extraction work depending on whether a PAT tree is available for the input Web page. The extractor can search the PAT tree to find all occurrences of the extraction pattern. Or, the extractor can also apply a pattern-matching algorithm to recognize and extract all occurrences of the target pattern in the translated token string of unseen Web pages. Searching in a PAT is fast, since every subtree of a PAT tree has all its sistrings with a common prefix. Therefore, it allows efficient, linear-time solutions to complex string search problem. For example, string prefix searching, proximity searching, range searching, regular expression search etc.

 

 

Figure 4. Pattern Viewer

 



Figure 5. Screen shot of the data extracted

 


On the other hand, if the pattern is used to extract unseen Web page where no PAT tree has been constructed, pattern-matching algorithms can be applied. Typical pattern matching algorithms, for example, are the Knuth-Morris-Pratt’s algorithm or Boyer-Moore’s algorithm [15]. Alternatively, since the extraction rule is expressed as a regular expression (with concatenation and alternative only), it is easier to construct a finite-state machine for such an extraction rule.

Note that each extraction rule composed by multiple string alignment actually represents several patterns. That is, there are alternatives at each state of the non-derterministic finite-state machine. Therefore, several patterns can apply when matching the rule against the translated token sequence. In such case, the longest match is considered. For example, extracting occurrences of the rule “ab[a|-][b|-]” in the string “abababab” will find only two occurrences of “abab”.

The input to the extractor module is the Web pages to be extracted and the selected rule together with the encoding scheme used to discover the extraction rule. The extractor module first translates the Web pages using the encoding scheme specified. In addition to the encoded token string, we also record the starting positions in the source Web page for each token. Once the occurrences of the pattern in the encoded token string are found, the corresponding positions in the Web pages can be counted.

 

5.     EXPERIMENTS

Extraction results of a rule are often evaluated by retrieval rate and accuracy rate. Retrieval rate is defined as the ratio of the number of desired data records enumerated by a maximal repeat to the number of desired data records contained in the input text. Accuracy rate is defined as the ratio of the number of desired data records enumerated by a maximal repeat to the number of occurrence of the maximal repeat. A data record is said to be enumerated by a maximal repeat if the matching percentage is greater than a bound determined by the user. The matching percentage is used because the pattern may contain only a portion of the data record. For example, suppose an input text contains 20 desired data records, and a maximal repeat that occurs 25 times enumerates 18 of them. Then the retrieval rate is 18/20 and the accuracy rate is 18/25. A higher retrieval rate indicates that more records are extracted.

We first show the number of maximal repeats validated by our system using fourteen state-of-the-art search engines, each with ten Web pages. The number of validated maximal repeats is sometimes an indicator of the extraction result. If the number of validated maximal repeats is 1, the maximal repeat is very likely the target pattern. When the number of the maximal repeats is increased, it implies that there is a great variance in the records and each discovered maximal repeat only captures part of the target information block.

 

Table 1. Number of maximal repeats validated

Search Engine

Maximal

Repeats

Regularity

< 0.5

Density

>0.25

<1.5

AltaVista     

Cora

Excite

Galaxy

Hotbot

Infoseek

Lycos

Magellan

Metacrawler

NorthernLight

Openfind

Savvysearch

Stpt.com

Webcrawler

213.1

100.5

120.9

72.5

74.8

143.9

210.0

42.8

190.7

119.3

76.2

119.2

72.5

82.7

49.4

38.4

11.7

26.6

20.4

17.1

18.0

12.7

34.5

21.2

36.9

44.3

39.5

20.5

43.5

14.6

8.0

11.7

14.2

7.5

5.6

6.0

14.6

15.3

9.5

35.1

33.2

14.3

6.8

3.5

1.0

5.0

9.4

7.2

3.9

1.0

6.8

3.8

2.4

8.2

3.7

1.3

Average

117.1

27.9

16.7

4.7

 

There are several control parameters which can affect the number of maximal repeats validated, including the regularity threshold, the density thresholds, the minimum length and frequency of a pattern, and the encoding scheme, etc. The effect of different encoding scheme will be discussed in next section.