STARTS - Stanford Proposal for Internet Meta-Searching
Reference implementation - description, download, instructions,
notes:
http://www2.cs.cornell.edu/lagoze/starts/starts_reference.htm
Input form linked to a running version of the reference implementation:
http://www2.cs.cornell.edu/lagoze/starts/StartsInput.htm
- (They restarted the server for me, but it might go down again any
time. Mohamed Kholief will install the server locally. 1/20/98.)When used
Filter Expression ([basic-1 author] "lagoze"), others all default,
the following was returned:
@SQResults{Version{10}: STARTS 1.1
Sources{4}: cstr
ActualFilterExpression{17}: (author "lagoze")
ActualRankingExpression{0}:
NumDocSOIFs{1}: 1
}
@SQRDocument{Version{10}: STARTS
1.1
RawScore{6}: 2552.0
Sources{5}: cstr
[basic-1 title]{72}: A protocol and server for a
distributed digital technical report library
[basic-1 author]{28}: Davis, James R. Lagoze, Carl
[basic-1 linkage]{20}: CORNELLCS//TR94-1418
DocSize{5}: 21933
DocCount{4}: 3427
}
In another test run, with Filter Expression ([basic-1 author]
!= "lagoze"), I obtained nearly exactly the same result, with the only
difference ActualFilterExpression being {18}, rather than {17}. After
I asked for an explanation, the response was that although legal, they
suspected that != was not recognized by the compiler yet. They would
further investigate the reason. (Another test with ([basic-1
author] = "lagoze") resulted in {20}.)
The following is extracted and edited from:
http://www-db.stanford.edu/~gravano/starts.html
(final writeup of STARTS protocol)
http://www-db.stanford.edu/pub/gravano/1996/sigmod97.ps
(sigmod paper)
This draft is based on feedback from people from various organizations,
and it developed a simple protocol that text search engines should follow
to facilitate searching and indexing multiple collections of text documents.
The goal of STARTS - to facilitate metasearchers
Given a query, a metasearcher (i.e., services that provide a unified
view of the multiple sources) has to perform (at least) three tasks
to provide a unified interface over a (large) number of document sources:
-
Choose the best sources to evaluate the query
-
Evaluate the query at these sources
-
Merge the query results from these sources
Some background of STARTS
In the architecture, there is a (potentially large) number of resources.
Each resource consists of one or more sources, and simply exports contact
information for its sources. A source is a collection of flat documents
(e.g., we do not consider any nesting of documents) with an associated
search engine that accepts queries from clients and produces results.
The protocol is meant for machine-to-machine communication, not
for users - information needs between sources and metasearchers,
not so much on how formatted or transported. It does not deal with
security issues or with error reporting.
Query language
A query consists of two parts:
-
a filter expression, and
-
a ranking expression.
A filter expression is Boolean in nature and defines the documents that
qualify for the answer. The ranking expression associates a score with
these documents and ranks them accordingly.
Example:
Consider the following query with filter expression:
((author "Garcia Molina") and (title "databases"))
and ranking expression:
list((body-of-text "distributed") (body-of-text "databases"))
This query returns documents having "Garcia Molina" as one
of the authors and the word "databases" in their title. The documents
that match the filter expression are then ranked according to how well
their text matches the words "distributed" and "databases."
Atomic Terms
An l-string is either a string (e.g., "Ullman") or a string
qualified with its associated language, and optionally, with its associated
country (e.g., [en-US "behavior"]). A term is an l-string
modified by an unordered list of attributes (e.g., (author "Ullman")).
An attribute is either a field or a modifier.
The "Basic-1" set of attributes
The attributes not marked with (New) are from the GILS
(Global Information Locator Service / Government Information Locator Service)
attribute set (which in turn inherits all of the Z39.50-1995
Bib-1 use attributes).
-
Fields: What portion of the text is associated with the term (e.g.,
the author portion, the title portion, etc.). At most
one should be specified for each term. If no field is specified, "Any"
is assumed. Those fields marked with (Req) must be supported. ("Supported"
means that the source must recognize these fields. However, the source
may freely interpret them.) The rest of the fields are optional.
(Our fields correspond to the Z39.50 "use attributes.")
-
Title (Req)
-
Author
-
Body-of-text
-
Document-text (New)
(For relevance feedback, similar documents.)
-
Date/time-last-modified (Req)
(Formatted according to the International
Standard ISO 8601 (e.g., "1996-12-31"))
-
Any (Req)
-
Linkage (Req)
(URL of the document)
-
Linkage-type
(MIME type of the document)
-
Cross-reference-linkage
(List of URLs in document)
-
Language
(The language(s) of the document, as a list of language tags as defined
in RFC 1766.)
(For example, a query term (language "en-US") matches a document
with value for the language field "en-US es". This document has
parts in American English and in Spanish.)
-
Free-form-text (New)
(A string, maybe representing a query in some query language not in
the protocol, that the source somehow knows how to interpret)
-
Modifiers: What values the term represents (e.g., treat the term
as a stem, as its phonetics (soundex), etc.).
Zero or more modifiers can be specified for each term. (Our modifiers correspond
to the Z39.50 "relation attributes.")
-
<, <=, =, >=, >, !=
(If applicable (e.g., for fields like "Date/time-last-modified"),
default: =)
-
Phonetic (soundex)
(Default: no soundex)
-
Stem
(Default: no stemming)
-
Thesaurus (New)
(Default: no thesaurus expansion)
-
Right-truncation
(Default: the term "as is," without right-truncating it)
-
Left-truncation
(Default: the term "as is," without left-truncating it)
-
Case-sensitive (New)
(Default: case insensitive)
Complex Filter Expressions
We use operators to build complex filter expressions from the terms.
The "Basic-1"-type filter expressions use the following operators (which
are borrowed from the Z39.50-1995
type-101 queries). If a source supports filter expressions, it must support
all these operators.
-
and
-
or
-
and-not
-
prox, specifying two terms, the required distance between them,
and whether the order of the terms matters.
Example:
Consider two terms t1 and t2 and the following filter
expression:
(t1 prox[3,T] t2)
The documents that match this filter expression contain t1
followed by t2 with at most three words in between them. "T"
(for "true") indicates that the word order matters (i.e., that t1
has to be appear before t2).
Complex Ranking Expressions
We also use operators to build complex ranking expressions from the terms.
The "Basic-1"-type ranking expressions use the operators above ("and,"
"or," "and-not," and "prox") plus a new operator,
list, which simply groups together a set of terms. (The "Boolean"
operators would most likely be interpreted as "fuzzy-logic"operators by
the search engines in order to rank the documents.) If a source supports
ranking expressions, it must support all these operators.
Each term in a ranking expression may have a weight (a number between
0 and 1) associated with it, indicating the relative importance of the
term in the query.
Example:
The following ranking expression indicates that the term "distributed"
is more important than the term "databases":
list(("distributed" 0.7) ("databases" 0.3))
The default field for both terms is Any, as we explained
above.
Further Result Specification
When we query a source, we also specify the following together with the
query.
-
Drop stop words: whether the source should delete the stop words
from the query or not. (A metasearcher knows if it can turn off the use
of stop words at a source from the source's metadata) (Default: Drop the
stop words.)
-
Default attribute set (optional, for notational convenience).
(Default: Basic-1.)
-
Default language (optional, for notational convenience, and overridden
by the specifications at the l-string level).
(No default.)
-
Sources (in the same resource) where to evaluate the query in addition
to the source where the query is submitted.
(Default: no other source.)
-
Answer specification:
-
Fields returned in the query answer
(Default: Title, Linkage)
-
Fields used to sort the query results, and whether the order is ascending
("a") or descending ("d")
(Default: Score of the documents for the query, in descending
order.)
-
Documents returned:
-
Minimum acceptable document score
(No default.)
-
Maximum acceptable number of documents
(Default: 20 documents.)
Example of a Complete Query
- The numbers inside the braces are the lengths of the corresponding
values in bytes.
@SQuery{
Version{10}: STARTS 1.0
FilterExpression{48}: ((author "Ullman") and (title stem "databases"))
RankingExpression{61}: list((body-of-text "distributed")(body-of-text
"databases"))
DropStopWords{1}: T
DefaultLanguage{5}: en-US
AnswerFields{12}: title author
MinDocumentScore{3}: 0.5
MaxNumberDocuments{2}: 10
}
Merging of Ranks
Ranking results from multiple sources may be based on executing different
queries and using different ranking algorithms, and definitely are based
on different collections in the databases. To merge the query results
from multiple sources into a single, meaningful rank, a source should
return the following information for each document in the query result:
-
The unnormalized score of the document for the query
-
The id of the source(s) where the document appears
-
Statistics about each query term in the ranking expression (as modified
by the query fields, if possible):
-
Term-frequency: the number of times that the query term appears
in the document
-
Term-weight: the weight of the query term in the document, as
assigned by the search engine associated with the source (e.g., the normalized
tf.idf weight for the query term in the document, or whatever
other weighing of terms in documents the search engine might use)
-
Document-frequency: the number of documents in the source that
contain the term (this information is also provided as part of the metadata
for the source)
-
Document-size: the size of the document (in
bytes)
-
Document-count: the number of tokens (as
determined by the source) in the document
A metasearcher can discard the sources' scores and use these results to
calibrate document scores from different sources.
Source Metadata
To select right sources for a query and to query them we need information
about the sources' contents and capabilities. Every source is required
to export: a list of metadata attribute-value pairs, describing properties
of the source, and a content summary of the source.
Source metadata attributes - MBasic-1
Each source exports information about itself by giving values to the
metadata attributes below. For example, the value for the metadata attribute
"FieldsSupported" is a list of the fields that can be used for
querying the source (e.g., "Linkage-type").
Below we define the "MBasic-1" set of metadata attributes, borrowing
from the Z39.50-1995
Exp-1 and the GILS attribute
set. The attributes not marked with (New) are from these two attribute
sets. The attributes marked (Req) are required, and the sources
must support them.
-
FieldsSupported (New) (Req)
(What optional fields are supported in addition to the required ones.
Also, each field is optionally accompanied by a list of the languages that
are used in that field in the source. Required fields can also be listed
here with their corresponding language list.)
-
ModifiersSupported (New) (Req)
(What modifiers are supported. Also, each modifier is optionally accompanied
by a list of the languages for which it is supported at the source. (Modifiers
like stem are language dependent.))
-
FieldModifierCombinations (New)
(What field-modifier combinations are supported. For example, stem
might not be supported for the author field at a source.)
-
QueryPartsSupported (New)
(Whether the source supports ranking expressions only ("R"),
filter expressions only ("F"), or both ("RF"). Default:
"RF.")
-
ScoreRange (New) (Req)
(This is the minimum and maximum score that a document can get for
a query; we use this information for merging ranks. Valid bounds include
-infinity and +infinity, respectively.)
-
RankingAlgorithmID (e.g., Acme-1) (New) (Req)
(Even when we do not know the actual algorithm used it is useful to
know that two sources use the same algorithm.)
-
TokenizerIDList (New)
(E.g., (Acme-1 en-US) (Acme-2 es), meaning that tokenizer
Acme-1 is used for strings in American English, and tokenizer
Acme-2 is used for strings in Spanish. Even when we do not know
how the actual tokenizer works, it is useful to know that two sources use
the same tokenizer.)
-
SampleDatabaseResults (New) (Req)
(The URL to get the query results for a sample document collection;
see Section 3.)
-
StopWordList (New) (Req)
-
TurnOffStopWords (New) (Req)
(Whether we can turn off the use of stop words at the source or not.)
-
SourceLanguage
(List of languages present at the source.)
-
SourceName
-
Linkage (Req)
(URL where the source should be queried.)
-
ContentSummaryLinkage (New) (Req)
(The URL of the source content summary; see below.)
-
DateChanged
(The date when the source metadata was last modified.)
-
DateExpires
(The date when the source metadata will be reviewed, and therefore,
when the source metadata should be extracted again.)
-
Abstract (of the source)
-
AccessConstraints
(A description of the constraints or legal prerequisites for accessing
the source.)
-
Contact
(Contact information of the administrator of the source.)
Source content summary
Each source exports data about its contents. This data can be used to
decide if the source is relevant for a given query, for example, and includes:
-
List of words that appear in the source, specifying whether:
-
The words listed are stemmed or not.
-
The words listed include stop words or not.
-
The words listed are case sensitive or not.
-
The words listed are accompanied by the field corresponding to where in
the documents they occurred (e.g., (title "databases"), (abstract
"databases"), etc.).
If possible, the words listed should not be stemmed, and should include
the stop words. Also, the words should be case sensitive, and be accompanied
by their corresponding field information, as shown above.
In addition, the words might be qualified with their corresponding language
(e.g., [en-US "behavior"]).
-
Statistics for each word listed, including at least one of the following:
-
Total number of postings for each word (i.e., the number of times that
the word appears in the source)
-
Document frequency for each word (i.e., the number of documents that contain
the word)
-
Total number of documents in source
Resource definition
Each resource exports contact information about the sources that
it contains. More specifically, a resource simply exports its list
of sources, together with the URLs where the metadata attributes for the
sources can be accessed and the format of this data. (Currently there is
only one format for this metadata, as Sections
5.3 and 5.4 describe.) Using this information,
a metasearcher learns how and where to contact each of the sources in the
resource.
Sample Syntax and Communication
The final
writeup gives a sample syntax of the query language, query results,
and metadat using attribute-value pairs in Harvest's Summary Object Interchange
Format (SOIF). It also gives suggestions on how to extend the sample
syntax by each source.