Graphs from the Search Engine API vs WUI study

From 2006-05-28 to 2006-10-30 we submitted four types of queries (3500 total) to each interface:

  1. General search terms. We queried for the top 100 results and the total number of results using 50 popular search terms and 50 computer science (CS) terms.
  2. URL backlinks. We queried for the number of backlinks to 100 randomly selected URLs. (link: parameter)
  3. Pages indexed for a website. We asked how many pages were indexed for 100 randomly selected websites. (site: parameter)
  4. URL indexing and caching. We queried to see if 100 randomly selected URLs were indexed and/or cached. (info: or url: parameter)
The graphs from our study are shown below. A paper summarizing our findings has been submitted to a conference and is under review.



Menu:

  1. Distance between API and API, WUI and WUI
  2. Distance between API and WUI
  3. WUI vs. API at Day Offsets
  4. Decay of Search Results (Top 100)
  5. Decay of Search Results (Top 10)
  6. Decay Model
  7. Popular vs. CS search terms
  8. Indexed and Cached
  9. Backlinks
  10. Site
  11. Total Terms




Distance between API and API, WUI and WUI

The following graphs show the distance, on a scale from 1 (exact match) to 0 (completely different), between the top 100 search engine results, averaged over all search terms. The following distance measures are used:

  1. Kendall tau distance (K) or bubble distance (with a small variation for top k lists proposed by Fagin et al.)
  2. Overlap (P) (or percentage of shared URLs)
  3. Bar-Ilan et al. measure (M), which weights changes at the top of the list more heavily than changes at the bottom of the list
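
As a rough illustration of the first two measures, the sketch below computes the overlap P and a Fagin-style Kendall tau distance for two top-k lists. The function names, the choice p = 0 for pairs that appear in only one list, and the normalization by k^2 (the distance between two disjoint lists when p = 0) are assumptions made here for illustration; the exact variation and normalization used for the graphs are not specified on this page.

```python
from itertools import combinations

def overlap(a, b):
    """Overlap P: fraction of URLs shared by two equal-length top-k lists."""
    return len(set(a) & set(b)) / len(a)

def kendall_topk(a, b, p=0.0):
    """Kendall tau distance for top-k lists (assumed form of the Fagin et al. variant).

    Counts a penalty for every unordered pair of URLs in the union of the lists.
    Assumes neither list contains duplicates.
    """
    rank_a = {url: i for i, url in enumerate(a)}
    rank_b = {url: i for i, url in enumerate(b)}
    penalty = 0.0
    for i, j in combinations(set(a) | set(b), 2):
        in_a = (i in rank_a) + (j in rank_a)  # how many of the pair appear in a
        in_b = (i in rank_b) + (j in rank_b)  # how many of the pair appear in b
        if in_a == 2 and in_b == 2:
            # Both URLs in both lists: penalize if their order is reversed.
            if (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j]) < 0:
                penalty += 1
        elif in_a + in_b == 3:
            # Both URLs in one list, only one of them in the other. The missing
            # URL is implicitly ranked below the present one in the shorter view.
            full, part = (rank_a, rank_b) if in_a == 2 else (rank_b, rank_a)
            present, absent = (i, j) if i in part else (j, i)
            if full[absent] < full[present]:
                penalty += 1
        elif in_a == 1 and in_b == 1:
            # Each URL appears in a different list: always a disagreement.
            penalty += 1
        else:
            # Both URLs in one list, neither in the other: order unknown.
            penalty += p
    return penalty

def kendall_similarity(a, b):
    """Map the distance onto the 1 (exact match) to 0 (disjoint) scale."""
    k = len(a)
    return 1.0 - kendall_topk(a, b) / (k * k)

day1 = ["u1", "u2", "u3"]
day2 = ["u2", "u3", "u4"]
print(overlap(day1, day2), kendall_similarity(day1, day2))
```

The Bar-Ilan et al. measure (M) is left out of the sketch because its formula is not given here.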

MSN API data is missing for 17 days because the API key being used was invalidated for some reason. I had to replace it with a new API key to fix the problem. Other developers have also experienced this.

Results per search term

The following graphs compare WUI results at day n with day n - 1 and API results at day n with day n - 1 using overlap (P), Kendall tau (K), and Bar-Ilan (M):

Popular and CS results using P measure (top 100):



Popular and CS results using P measure (top 10):



All results using P,K,M measures (top 100):



All results using P,K,M measures (top 10):

Distance Between WUI vs. API Results

The following graphs show the distance between the WUI and API results on each day.

API vs WUI results for each search engine (top 100):



API vs WUI results for each search engine (top 10):


WUI vs. API at Day Offsets

The following graphs show the K distance between the top 100 search results (averaged together) when comparing day n API results with day n + offset WUI results. The graphs show that the WUI and API results are most similar on the same day (offset 0). This implies that the API results are not older than the WUI data; both result sets are updated at the same rate. The API results appear to just have a slightly different ranking order.
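
The offset comparison can be sketched as follows for a single search term. The function name and the toy data are hypothetical, and overlap stands in for the K distance purely to keep the example short; the study additionally averages over all search terms.

```python
def mean_similarity_at_offset(api, wui, offset, similarity):
    """Average similarity of API day n vs. WUI day n + offset over all usable n.

    api and wui map a day number to that day's ranked result list.
    Returns None when no pair of days exists for the requested offset.
    """
    scores = [
        similarity(api[n], wui[n + offset])
        for n in api
        if n + offset in wui
    ]
    return sum(scores) / len(scores) if scores else None

def overlap(a, b):
    """Stand-in similarity: fraction of shared URLs."""
    return len(set(a) & set(b)) / len(a)

# Hypothetical data: both interfaces return the same lists on the same day.
api = {1: ["a", "b", "c"], 2: ["a", "b", "d"], 3: ["a", "d", "e"]}
wui = {1: ["a", "b", "c"], 2: ["a", "b", "d"], 3: ["a", "d", "e"]}

for offset in (-1, 0, 1):
    print(offset, mean_similarity_at_offset(api, wui, offset, overlap))
```

With data like this the score peaks at offset 0, which is the pattern the graphs show.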



Decay of Search Results Over Time

The following graphs show the decay of the top 100 search results over time. Decay is determined by computing the overlap between the search results from day 1 and those from each subsequent day. The decay scores are averaged separately for the popular terms and for the CS terms.
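
A minimal sketch of this computation for one search term, assuming the results are stored as a mapping from day number to result list (the function name and data layout are hypothetical):

```python
def decay_from_first_day(results_by_day):
    """Overlap of the first day's results with each day's results.

    Returns a list of (day, overlap) pairs in day order.
    """
    days = sorted(results_by_day)
    base = set(results_by_day[days[0]])
    return [
        (day, len(base & set(results_by_day[day])) / len(base))
        for day in days
    ]

# Hypothetical top-4 lists for three days.
results = {
    1: ["a", "b", "c", "d"],
    2: ["a", "b", "c", "e"],
    3: ["a", "b", "f", "g"],
}
print(decay_from_first_day(results))
```

Averaging these per-term curves over the popular terms and over the CS terms gives one decay line per group.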

Top 100

Top 100 search result decay for all terms
Decay of top 100 popular search results



Decay of top 100 CS search results

Top 100 averaged over time

Here we examine the top 100 results for all search terms. These graphs were composed by taking the URL results for day n and comparing them to day n+1, n+2, etc. for every day we obtained search results. The distances at each offset (n to n+offset) were then averaged together over all days n. For example, d(Oct 12, Oct 13) was averaged with d(Oct 13, Oct 14), d(Oct 14, Oct 15), etc. This reduces bias by not relying on any particular day as the start day with which to compare all other results.
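
The averaging over start days can be sketched as below. The function name and data layout are hypothetical, and overlap is used as the distance d.

```python
from collections import defaultdict

def averaged_decay(results_by_day):
    """Average overlap d(n, n + offset) over every start day n, per offset."""
    days = sorted(results_by_day)
    by_offset = defaultdict(list)
    for idx, n in enumerate(days):
        first = set(results_by_day[n])
        for m in days[idx + 1:]:
            later = set(results_by_day[m])
            by_offset[m - n].append(len(first & later) / len(first))
    return {off: sum(v) / len(v) for off, v in sorted(by_offset.items())}

# Hypothetical top-4 lists for three consecutive days.
results = {
    1: ["a", "b", "c", "d"],
    2: ["a", "b", "c", "e"],
    3: ["a", "b", "f", "g"],
}
print(averaged_decay(results))
```

Here offset 1 averages d(1, 2) = 0.75 with d(2, 3) = 0.5, while offset 2 has only d(1, 3) = 0.5, so larger offsets rest on fewer pairs of days.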

Averaged decay of top 100 results over time (Popular terms)



Averaged decay of top 100 results over time (CS terms)

Top 10

Decay of top 10 URL search results for all terms

Decay of top 10 popular results when comparing all results to day 1.



Decay of top 10 CS results when comparing all results to day 1.

Top 10 averaged over time

Averaged decay of top 10 results over time (Popular terms)



Averaged decay of top 10 results over time (CS terms)


Decay Model

The following model was fitted to the decay lines of the previous graphs:
f(day) = a - b * log(day)
The values of a, b, and R-squared, along with the half-life of each fitted curve, are reported below:

Search engine  Type     Top k  Interface  a      b      R-squared  Half-life
google         cs       10     wui        0.994  0.175  0.957      672
google         cs       10     api        1.009  0.161  0.959      1480
google         cs       100    wui        1.011  0.219  0.982      215
google         cs       100    api        1.017  0.218  0.960      235
google         popular  10     wui        1.002  0.195  0.954      376
google         popular  10     api        1.020  0.191  0.956      529
google         popular  100    wui        0.988  0.250  0.985      89
google         popular  100    api        0.971  0.259  0.987      66
msn            cs       10     wui        0.982  0.204  0.994      228
msn            cs       10     api        0.975  0.196  0.991      264
msn            cs       100    wui        1.034  0.212  0.965      327
msn            cs       100    api        1.014  0.203  0.965      338
msn            popular  10     wui        0.912  0.227  0.981      66
msn            popular  10     api        0.947  0.233  0.972      84
msn            popular  100    wui        1.017  0.287  0.981      64
msn            popular  100    api        1.006  0.281  0.981      63
yahoo          cs       10     wui        1.104  0.323  0.960      74
yahoo          cs       10     api        1.121  0.325  0.955      81
yahoo          cs       100    wui        1.114  0.376  0.955      43
yahoo          cs       100    api        1.167  0.393  0.952      50
yahoo          popular  10     wui        1.079  0.200  0.951      783
yahoo          popular  10     api        1.092  0.224  0.954      433
yahoo          popular  100    wui        1.102  0.356  0.963      49
yahoo          popular  100    api        1.131  0.372  0.961      50
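
The sketch below shows one way such a fit and half-life could be computed. It assumes that log is the base-10 logarithm and that the half-life is the day at which the fitted curve reaches 0.5; neither is stated on this page, but both are consistent with the table (for example, a = 1.006 and b = 0.281 give roughly 63). The function names are hypothetical, and the three data points are made up.

```python
import math

def fit_log_decay(days, scores):
    """Least-squares fit of score = a - b * log10(day).

    Returns (a, b, r_squared). Assumes at least two distinct days >= 1
    and scores that are not all identical.
    """
    xs = [math.log10(d) for d in days]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    slope = sxy / sxx
    a = mean_y - slope * mean_x
    b = -slope  # the model subtracts b * log10(day)
    ss_res = sum((y - (a - b * x)) ** 2 for x, y in zip(xs, scores))
    ss_tot = sum((y - mean_y) ** 2 for y in scores)
    return a, b, 1.0 - ss_res / ss_tot

def half_life(a, b):
    """Day at which a - b * log10(day) equals 0.5 (assumed definition)."""
    return 10 ** ((a - 0.5) / b)

# Made-up decay points lying exactly on 1.0 - 0.2 * log10(day).
a, b, r2 = fit_log_decay([1, 10, 100], [1.0, 0.8, 0.6])
print(a, b, r2)
print(half_life(1.006, 0.281))  # compare with the msn popular 100 api row
```

Because the table rounds a and b to three decimals, half-lives recomputed from it can differ slightly from the reported values.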


Popular vs. CS search terms

The following graphs show the density of the WUI vs. API bubble distance for the popular and CS search terms. Distributions closer to 1.0 mean that the WUI and API are giving more similar results.

Note that in Google, the WUI produces very different results than the API for popular terms, but CS results are more similar. The opposite is true for Yahoo. MSN appears to serve mostly the same results regardless of the type of search term.

When examining the WUI differences each day and the API differences each day, we don't see much variation between the two types of terms.

Distribution of WUI vs. API bubble distance for CS and popular search results

Distribution of WUI day n vs. WUI day n - 1 bubble distance for CS and popular search results

Distribution of API day n vs. API day n - 1 bubble distance for CS and popular search results


Indexed and Cached

The following graphs show the result of asking each search engine if a randomly selected URL was indexed and cached. White dots indicate that the URL was not indexed/cached on that particular day.



Backlinks

Each day we asked how many backlinks were reported for each URL.

Backlink totals for each URL

Total disagreements each day when asking the WUI and API how many backlinks were returned for each URL.

Loose disagreements per day (when the API value is less than 90% of the WUI value or greater than 110% of the WUI value)
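
The two disagreement counts can be sketched as follows. The function name and the sample counts are hypothetical, and treating a "total disagreement" as any difference between the two values is an assumption.

```python
def count_disagreements(api_counts, wui_counts, tolerance=0.10):
    """Count strict and loose API/WUI disagreements for one day.

    strict: the two values differ at all (assumed meaning of a disagreement).
    loose:  the API value is below 90% or above 110% of the WUI value.
    """
    strict = loose = 0
    for key, wui in wui_counts.items():
        api = api_counts[key]
        if api != wui:
            strict += 1
        if api < (1 - tolerance) * wui or api > (1 + tolerance) * wui:
            loose += 1
    return strict, loose

# Hypothetical backlink counts for four URLs on one day.
api_day = {"u1": 100, "u2": 95, "u3": 50, "u4": 0}
wui_day = {"u1": 100, "u2": 100, "u3": 100, "u4": 0}
print(count_disagreements(api_day, wui_day))
```

In this example u2 counts as a strict disagreement only, because 95 is within 10% of 100, while u3 counts as both.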

The following scatterplots show the number of backlinks the APIs report for each of the 100 URLs on a daily basis vs. what the WUI reports.

For the most part, the API and WUI of every search engine report the same number of backlinks for each URL. Yahoo began redirecting link: and site: queries to their Site Explorer (beta) on 2006-08-15. Site Explorer apparently operates on a different data set than the API, because the API continued to report numerous backlinks for URLs for which Site Explorer had none recorded.

For only 1 URL does Google report more than 35K backlinks. MSN and Yahoo report much larger numbers.



Site

Each day we asked each search engine how many pages were indexed for each website.

Totals for each website

Total disagreements each day when asking the WUI and API how many total results were returned for each website.

Loose disagreements per day (when the API value is less than 90% of the WUI value or greater than 110% of the WUI value)

The following scatterplots show the number of resources indexed from each of the 100 websites (using the site: parameter).

  1. Google's API typically underestimates the site size in comparison to the WUI.
  2. MSN reports far fewer pages indexed in general than Google or Yahoo.
  3. The MSN API reported 0 pages indexed while the WUI reported a non-zero value for nearly every website for a period of 16 days (2006-08-29 to 2006-09-13).
  4. Yahoo's API consistently returns 0 indexed pages for www.sexy-pamela-anderson.com when the WUI reports 2-4. There are only 4 other sites where this happened, and only on one or two days.



Total Terms

Each day we asked each search engine how many results each search term produced.

Totals for each term

Total disagreements each day when asking the WUI and API how many total results were returned for each search term.

Loose disagreements per day (when the API value is less than 90% of the WUI value or greater than 110% of the WUI value)

The following scatterplots show the estimated total results for each search term.

  1. CS terms score higher in general than popular terms.
  2. The Google API and WUI vary frequently on the number of results they indicate are indexed.
  3. Yahoo varies less, but the WUI tends to produce a larger number than the API more frequently.
  4. MSN consistently reports the same number for both API and WUI.
  5. Google and MSN never reported 0 for their API or WUI. The Yahoo API reported 0 only 4 times, and the Yahoo WUI never reported 0.
  6. For Google there are only a few results that are greater than 2.5 billion.
  7. MSN never reports having anything past 600 million.

Scatterplot of total results obtained daily from API and WUI interfaces.
