STARTS - Stanford Proposal for Internet Meta-Searching

Reference implementation - description, download, instructions, notes:
http://www2.cs.cornell.edu/lagoze/starts/starts_reference.htm

Input form linked to a running version of the reference implementation:
http://www2.cs.cornell.edu/lagoze/starts/StartsInput.htm
- (They restarted the server for me, but it might go down again any time. Mohamed Kholief will install the server locally. 1/20/98.)When used Filter Expression  ([basic-1 author] "lagoze"), others all default, the following was returned:
@SQResults{Version{10}: STARTS 1.1
Sources{4}:     cstr
ActualFilterExpression{17}:     (author "lagoze")
ActualRankingExpression{0}:
NumDocSOIFs{1}: 1
}
@SQRDocument{Version{10}:       STARTS 1.1
RawScore{6}:    2552.0
Sources{5}:     cstr
[basic-1 title]{72}:    A protocol and server for a distributed digital technical report library
[basic-1 author]{28}:   Davis, James R. Lagoze, Carl
[basic-1 linkage]{20}:  CORNELLCS//TR94-1418
DocSize{5}:     21933
DocCount{4}:    3427
}

In another test run, with Filter Expression  ([basic-1 author] != "lagoze"), I obtained nearly exactly the same result, with the only difference ActualFilterExpression being {18}, rather than {17}.  After I asked for an explanation, the response was that although legal, they suspected that != was not recognized by the compiler yet.  They would further investigate the reason.   (Another test with ([basic-1 author] = "lagoze") resulted in {20}.)

The following is extracted and edited from:
http://www-db.stanford.edu/~gravano/starts.html (final writeup of STARTS protocol)
http://www-db.stanford.edu/pub/gravano/1996/sigmod97.ps (sigmod paper)

This draft is based on feedback from people from various organizations, and it developed a simple protocol that text search engines should follow to facilitate searching and indexing multiple collections of text documents.

The goal of STARTS - to facilitate metasearchers

Given a query, a metasearcher (i.e., services that provide a unified view of the multiple sources) has to perform (at least) three tasks to provide a unified interface over a (large) number of document sources:

Some background of STARTS

In the architecture, there is a (potentially large) number of resources. Each resource consists of one or more sources, and simply exports contact information for its sources. A source is a collection of flat documents (e.g., we do not consider any nesting of documents) with an associated search engine that accepts queries from clients and produces results.

The protocol is meant for machine-to-machine communication, not for users - information needs between sources and metasearchers, not so much on how formatted or transported. It does not deal with security issues or with error reporting.

Query language

A query consists of two parts:

A filter expression is Boolean in nature and defines the documents that qualify for the answer. The ranking expression associates a score with these documents and ranks them accordingly.

Example:

Consider the following query with filter expression:
((author "Garcia Molina") and (title "databases"))

and ranking expression:

list((body-of-text "distributed") (body-of-text "databases"))

This query returns documents having "Garcia Molina" as one of the authors and the word "databases" in their title. The documents that match the filter expression are then ranked according to how well their text matches the words "distributed" and "databases."

Atomic Terms

An l-string is either a string (e.g., "Ullman") or a string qualified with its associated language, and optionally, with its associated country (e.g., [en-US "behavior"]).  A term is an l-string modified by an unordered list of attributes (e.g., (author "Ullman")).  An attribute is either a field or a modifier.

The "Basic-1" set of attributes

The attributes not marked with (New) are from the GILS (Global Information Locator Service / Government Information Locator Service) attribute set (which in turn inherits all of the Z39.50-1995 Bib-1 use attributes).

Complex Filter Expressions

We use operators to build complex filter expressions from the terms. The "Basic-1"-type filter expressions use the following operators (which are borrowed from the Z39.50-1995 type-101 queries). If a source supports filter expressions, it must support all these operators.

Example:

Consider two terms t1 and t2 and the following filter expression:
(t1 prox[3,T] t2)

The documents that match this filter expression contain t1 followed by t2 with at most three words in between them. "T" (for "true") indicates that the word order matters (i.e., that t1 has to be appear before t2).

Complex Ranking Expressions

We also use operators to build complex ranking expressions from the terms. The "Basic-1"-type ranking expressions use the operators above ("and," "or," "and-not," and "prox") plus a new operator, list, which simply groups together a set of terms. (The "Boolean" operators would most likely be interpreted as "fuzzy-logic"operators by the search engines in order to rank the documents.) If a source supports ranking expressions, it must support all these operators.

Each term in a ranking expression may have a weight (a number between 0 and 1) associated with it, indicating the relative importance of the term in the query.

Example:

The following ranking expression indicates that the term "distributed" is more important than the term "databases":
list(("distributed" 0.7) ("databases" 0.3))

The default field for both terms is Any, as we explained above.

Further Result Specification

When we query a source, we also specify the following together with the query. Example of a Complete Query

 - The numbers inside the braces are the lengths of the corresponding values in bytes.

@SQuery{
Version{10}: STARTS 1.0
FilterExpression{48}: ((author "Ullman") and (title stem "databases"))
RankingExpression{61}: list((body-of-text "distributed")(body-of-text "databases"))
DropStopWords{1}: T
DefaultLanguage{5}: en-US
AnswerFields{12}: title author
MinDocumentScore{3}: 0.5
MaxNumberDocuments{2}: 10
}

Merging of Ranks

Ranking results from multiple sources may be based on executing different queries and using different ranking algorithms, and definitely are based on different collections in the databases.  To merge the query results from multiple sources into a single, meaningful rank, a source should return the following information for each document in the query result:

A metasearcher can discard the sources' scores and use these results to calibrate document scores from different sources.

Source Metadata

To select right sources for a query and to query them we need information about the sources' contents and capabilities.  Every source is required to export: a list of metadata attribute-value pairs, describing properties of the source, and a content summary of the source.

Source metadata attributes - MBasic-1

Each source exports information about itself by giving values to the metadata attributes below. For example, the value for the metadata attribute "FieldsSupported" is a list of the fields that can be used for querying the source (e.g., "Linkage-type").

Below we define the "MBasic-1" set of metadata attributes, borrowing from the Z39.50-1995 Exp-1 and the GILS attribute set. The attributes not marked with (New) are from these two attribute sets. The attributes marked (Req) are required, and the sources must support them.

Source content summary

Each source exports data about its contents. This data can be used to decide if the source is relevant for a given query, for example, and includes:

Resource definition

Each resource exports contact information about the sources that it contains. More specifically, a resource simply exports its list of sources, together with the URLs where the metadata attributes for the sources can be accessed and the format of this data. (Currently there is only one format for this metadata, as Sections 5.3 and 5.4 describe.) Using this information, a metasearcher learns how and where to contact each of the sources in the resource.

Sample Syntax and Communication

The final writeup gives a sample syntax of the query language, query results, and metadat using attribute-value pairs in Harvest's Summary Object Interchange Format (SOIF).  It also gives suggestions on how to extend the sample syntax by each source.