From 2006-05-28 to 2006-10-30 we submitted
four types of queries (3500 total) to each interface:
- General search terms. We queried for the top 100
results and the total number of results using 50 popular
search terms [1] and 50 computer science (CS) terms [2].
- URL backlinks. We queried for the number of backlinks
to 100 randomly selected URLs (link: parameter).
- Pages indexed for a website. We asked how many
pages were indexed for 100 randomly selected websites (site: parameter).
- URL indexing and caching. We queried to see if
100 randomly selected URLs were indexed and/or cached (info: or url: parameter).
The graphs from our study are shown below. A paper summarizing our
findings has been submitted to a conference and is under review.
Menu:
- Distance between API and API, WUI and WUI
- Distance between API and WUI
- WUI vs. API at Day Offsets
- Decay of Search Results (Top 100)
- Decay of Search Results (Top 10)
- Decay Model
- Popular vs. CS search terms
- Indexed and Cached
- Backlinks
- Site
- Total Terms
Distance between API and API, WUI and WUI
The following graphs show the similarity, on a scale from 1 to 0 (1 = exact match, 0 = completely different),
of the top 100 search engine results, averaged over all search terms. The following
distance measures are used:
- Kendall tau distance (K) or bubble distance (with a small variation for top k lists proposed by Fagin et al.)
- Overlap (P) (or percentage of shared URLs)
- Bar-Ilan et al. measure (M) that weights changes near the top of the list more heavily than
changes near the bottom of the list
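As a rough illustration of the three measures, the sketch below scores two ranked URL lists so that 1 means identical and 0 means completely different. It is not the code used in the study. The function names are hypothetical, the Kendall tau variant follows one reading of the Fagin et al. case analysis with the penalty parameter p assumed to be 0, and the M measure follows one reading of the Bar-Ilan et al. definition (reciprocal ranks, with a missing URL treated as rank k + 1). Both K and M are normalized by the score of two disjoint lists, which is assumed to be the worst case. Lists are assumed to contain no duplicate URLs.

```python
from itertools import combinations


def overlap(a, b):
    """P: share of URLs common to both lists (1 = same URLs, 0 = none shared)."""
    k = max(len(a), len(b))
    return len(set(a) & set(b)) / k if k else 1.0


def kendall_topk(a, b, p=0.0):
    """K: 1 minus a normalized Kendall tau distance between two top-k lists.

    p is the penalty for a pair ranked by one list and entirely absent from
    the other (p = 0 is an assumption of this sketch).
    """
    pos_a = {url: i for i, url in enumerate(a)}
    pos_b = {url: i for i, url in enumerate(b)}

    def misordered(pos, present, absent):
        # `present` is ranked by the other list and `absent` is not, so the
        # other list implicitly puts `present` first; penalize a disagreement.
        return 1.0 if pos[absent] < pos[present] else 0.0

    penalty = 0.0
    for x, y in combinations(set(a) | set(b), 2):
        in_a = (x in pos_a, y in pos_a)
        in_b = (x in pos_b, y in pos_b)
        if all(in_a) and all(in_b):
            # Ranked by both lists: penalize opposite relative order.
            if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0:
                penalty += 1.0
        elif all(in_a) and any(in_b):
            present, absent = (x, y) if in_b[0] else (y, x)
            penalty += misordered(pos_a, present, absent)
        elif all(in_b) and any(in_a):
            present, absent = (x, y) if in_a[0] else (y, x)
            penalty += misordered(pos_b, present, absent)
        elif all(in_a) or all(in_b):
            # Both URLs appear in one list only: order in the other is unknown.
            penalty += p
        else:
            # Each URL appears in a different list only.
            penalty += 1.0
    ka, kb = len(a), len(b)
    # Penalty of two disjoint lists, used as the normalizer.
    worst = ka * kb + p * (ka * (ka - 1) + kb * (kb - 1)) / 2
    return 1.0 - penalty / worst if worst else 1.0


def bar_ilan_m(a, b):
    """M: reciprocal-rank measure that weights the top of the list more heavily."""
    rank_a = {url: i + 1 for i, url in enumerate(a)}
    rank_b = {url: i + 1 for i, url in enumerate(b)}
    k = max(len(a), len(b))
    missing = 1.0 / (k + 1)  # weight assigned to a URL absent from a list
    total = sum(
        abs((1.0 / rank_a[u] if u in rank_a else missing)
            - (1.0 / rank_b[u] if u in rank_b else missing))
        for u in set(a) | set(b)
    )
    worst = (sum(1.0 / r - missing for r in range(1, len(a) + 1))
             + sum(1.0 / r - missing for r in range(1, len(b) + 1)))
    return 1.0 - total / worst if worst else 1.0
```

Each function takes two lists of URLs ordered by rank, for example the top 10 results of the WUI and the API for the same query on the same day.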
MSN API data is missing for 17 days because the API key being used was invalidated for some reason.
I had to replace it with a new API key to fix the problem. Other developers have also experienced this.
Results per search term
The following graphs compare WUI results at day n with day n-1 and API results at day n with day n - 1
using overlap (P), Kendall tau (K), and Bar-Ilan (M):
[Graphs: Popular and CS results using P measure (top 100)]
[Graphs: Popular and CS results using P measure (top 10)]
[Graphs: All results using P, K, M measures (top 100)]
[Graphs: All results using P, K, M measures (top 10)]
Distance Between WUI vs. API Results
The following graphs show the distance between the WUI and API results on each day.
[Graphs: API vs. WUI results for each search engine (top 100)]
[Graphs: API vs. WUI results for each search engine (top 10)]
WUI vs. API at Day Offsets
The following graphs show the K distance between the top 100 search results
(averaged together) when comparing day n API results with day n + offset WUI results.
The graphs show that the WUI and API results are most similar on the same day (offset 0).
This implies that the API results are not older than the WUI data; both
result sets are updated at the same rate. The API results appear to just have a
slightly different ranking order.
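The offset comparison can be sketched as follows. This is a minimal illustration, not the study's code: the function and variable names are hypothetical, results are assumed to be stored as a dictionary from day number to ranked URL list, and a simple overlap score stands in for the K distance to keep the example short.

```python
def overlap(a, b):
    """Share of URLs common to both lists (stand-in for the K measure)."""
    k = max(len(a), len(b))
    return len(set(a) & set(b)) / k if k else 1.0


def mean_offset_similarity(api_by_day, wui_by_day, offset, sim=overlap):
    """Average sim(API on day n, WUI on day n + offset) over all usable days."""
    scores = [
        sim(api_by_day[day], wui_by_day[day + offset])
        for day in api_by_day
        if day + offset in wui_by_day
    ]
    return sum(scores) / len(scores) if scores else None


# Toy data: both interfaces return the same list each day, and the list drifts.
api = {0: ["a", "b", "c"], 1: ["a", "b", "d"], 2: ["a", "e", "d"]}
wui = {0: ["a", "b", "c"], 1: ["a", "b", "d"], 2: ["a", "e", "d"]}
by_offset = {off: mean_offset_similarity(api, wui, off) for off in (-1, 0, 1)}
```

If the API lagged the WUI by some number of days, the highest score would be expected at a non-zero offset rather than at offset 0.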
Decay of Search Results Over Time
The following graphs show the decay of the top 100 search results over time.
Decay is measured as the overlap between the day 1 results and the results of each
subsequent day. The decay scores are averaged separately for the popular terms
and for the CS terms.
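A minimal sketch of this computation, assuming each term's results are stored as a list of ranked URL lists with index 0 holding day 1. The function names and the toy data are hypothetical.

```python
def decay_from_first_day(results_by_day):
    """Overlap of each day's results with the day 1 results (index 0)."""
    first = set(results_by_day[0])
    k = len(results_by_day[0])
    return [len(first & set(day)) / k for day in results_by_day]


def average_curves(curves):
    """Average several per-term decay curves day by day."""
    return [sum(values) / len(values) for values in zip(*curves)]


# Two toy terms observed on three days, top 2 results each.
term_1 = [["a", "b"], ["a", "c"], ["c", "d"]]
term_2 = [["x", "y"], ["x", "y"], ["x", "z"]]
group_curve = average_curves(
    [decay_from_first_day(term_1), decay_from_first_day(term_2)]
)
```

In this layout one averaged curve would be built from the popular terms and another from the CS terms.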
Top 100
Top 100 search result decay for all terms
[Graphs: Decay of top 100 popular search results]
[Graphs: Decay of top 100 CS search results]
Top 100 averaged over time
Here we examine the top 100 results for all search terms.
These graphs were composed by taking the URL results for day n and comparing them to day n+1, n+2, etc.
for every day we obtained search results. The distances at each offset (day n to day n + offset) were averaged over all start days.
For example, d(Oct 12, Oct 13) was averaged with d(Oct 13, Oct 14), d(Oct 14, Oct 15), etc.
This reduces bias by not relying on any particular day as the start day with which to
compare all other results.
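The averaging over start days can be sketched as below. The names are hypothetical, and overlap is used for d purely as an assumption; any of the measures could be substituted.

```python
def overlap(a, b):
    """Share of URLs common to both lists."""
    k = max(len(a), len(b))
    return len(set(a) & set(b)) / k if k else 1.0


def averaged_decay(results_by_day, max_offset, d=overlap):
    """For each offset, average d(day n, day n + offset) over every start day n."""
    curve = []
    for offset in range(1, max_offset + 1):
        scores = [
            d(results_by_day[n], results_by_day[n + offset])
            for n in range(len(results_by_day) - offset)
        ]
        if not scores:  # offset longer than the observation period
            break
        curve.append(sum(scores) / len(scores))
    return curve


# Four days of toy results for one term, top 2 each.
days = [["a", "b"], ["a", "c"], ["c", "d"], ["d", "e"]]
curve = averaged_decay(days, max_offset=3)
```

Note that larger offsets are averaged over fewer day pairs, so the tail of the curve rests on less data than the head.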
[Graphs: Averaged decay of top 100 results over time (Popular terms)]
[Graphs: Averaged decay of top 100 results over time (CS terms)]
Top 10
Decay of top 10 URL search results for all terms
[Graphs: Decay of top 10 popular results when comparing all results to day 1]
[Graphs: Decay of top 10 CS results when comparing all results to day 1]
Top 10 averaged over time
[Graphs: Averaged decay of top 10 results over time (Popular terms)]
[Graphs: Averaged decay of top 10 results over time (CS terms)]
Decay Model
The following model was fitted to the decay lines of the previous graphs:
f(day) = a - b * log(day)
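A model of this form is linear in log(day), so it can be fitted with ordinary least squares. The sketch below is one way to do that with the standard library only. It is not the fitting procedure used in the study, the function name is hypothetical, and the logarithm is assumed to be base 10 since the base is not stated.

```python
import math


def fit_log_decay(days, scores):
    """Least-squares fit of score = a - b * log10(day); returns (a, b, r_squared)."""
    xs = [math.log10(day) for day in days]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    slope = sxy / sxx
    a = mean_y - slope * mean_x
    b = -slope  # the model subtracts b * log10(day)
    ss_res = sum((y - (a - b * x)) ** 2 for x, y in zip(xs, scores))
    ss_tot = sum((y - mean_y) ** 2 for y in scores)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot else 1.0
    return a, b, r_squared


# Synthetic decay curve generated from known coefficients.
days = list(range(1, 51))
scores = [1.0 - 0.2 * math.log10(day) for day in days]
a, b, r_squared = fit_log_decay(days, scores)
```

Days must start at 1 or later, because log(0) is undefined.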
The values of a, b, and R-squared are reported below:
| Search engine | Type | Top k | Interface | a | b | R-squared | Half-life (days) | Graph |
| google | cs | 10 | wui | 0.994 | 0.175 | 0.957 | 672 | Graph |
| google | cs | 10 | api | 1.009 | 0.161 | 0.959 | 1480 | Graph |
| google | cs | 100 | wui | 1.011 | 0.219 | 0.982 | 215 | Graph |
| google | cs | 100 | api | 1.017 | 0.218 | 0.960 | 235 | Graph |
| google | popular | 10 | wui | 1.002 | 0.195 | 0.954 | 376 | Graph |
| google | popular | 10 | api | 1.020 | 0.191 | 0.956 | 529 | Graph |
| google | popular | 100 | wui | 0.988 | 0.250 | 0.985 | 89 | Graph |
| google | popular | 100 | api | 0.971 | 0.259 | 0.987 | 66 | Graph |
| msn | cs | 10 | wui | 0.982 | 0.204 | 0.994 | 228 | Graph |
| msn | cs | 10 | api | 0.975 | 0.196 | 0.991 | 264 | Graph |
| msn | cs | 100 | wui | 1.034 | 0.212 | 0.965 | 327 | Graph |
| msn | cs | 100 | api | 1.014 | 0.203 | 0.965 | 338 | Graph |
| msn | popular | 10 | wui | 0.912 | 0.227 | 0.981 | 66 | Graph |
| msn | popular | 10 | api | 0.947 | 0.233 | 0.972 | 84 | Graph |
| msn | popular | 100 | wui | 1.017 | 0.287 | 0.981 | 64 | Graph |
| msn | popular | 100 | api | 1.006 | 0.281 | 0.981 | 63 | Graph |
| yahoo | cs | 10 | wui | 1.104 | 0.323 | 0.960 | 74 | Graph |
| yahoo | cs | 10 | api | 1.121 | 0.325 | 0.955 | 81 | Graph |
| yahoo | cs | 100 | wui | 1.114 | 0.376 | 0.955 | 43 | Graph |
| yahoo | cs | 100 | api | 1.167 | 0.393 | 0.952 | 50 | Graph |
| yahoo | popular | 10 | wui | 1.079 | 0.200 | 0.951 | 783 | Graph |
| yahoo | popular | 10 | api | 1.092 | 0.224 | 0.954 | 433 | Graph |
| yahoo | popular | 100 | wui | 1.102 | 0.356 | 0.963 | 49 | Graph |
| yahoo | popular | 100 | api | 1.131 | 0.372 | 0.961 | 50 | Graph |
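The page does not say how the half-life column was computed. The reported values are roughly reproduced by assuming that the half-life is the day on which the fitted model drops to 0.5 and that log is base 10, which gives day = 10^((a - 0.5) / b). The sketch below applies that assumed formula; the function name is hypothetical.

```python
def half_life(a, b, level=0.5):
    """Day on which a - b * log10(day) falls to `level` (assumed definition)."""
    return 10 ** ((a - level) / b)


# Coefficients copied from the table rows for google/popular/100/wui
# and google/cs/10/wui.
google_popular_100_wui = half_life(0.988, 0.250)
google_cs_10_wui = half_life(0.994, 0.175)
```

For the google, popular, top 100, wui row this gives about 89.5 days against the 89 reported. Small differences are expected because the table shows a and b rounded to three decimals.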
Popular vs. CS search terms
The following graphs show the density of the WUI vs. API
bubble distance for the popular and CS search terms.
Distributions concentrated near 1.0 mean that the WUI and API are
giving more similar results.
Note that in Google, the WUI produces very different results
than the API for popular terms, but CS results are more similar.
The opposite is true for Yahoo. MSN appears to serve mostly
the same results regardless of the type of search term.
When examining the WUI differences each day and the API differences
each day, we don't see much variation between the two types of terms.
[Graphs: Distribution of WUI vs. API bubble distance for CS and popular search results]
[Graphs: Distribution of WUI day n vs. WUI day n - 1 bubble distance for CS and popular search results]
[Graphs: Distribution of API day n vs. API day n - 1 bubble distance for CS and popular search results]
Indexed and Cached
The following graphs show the result of asking each search engine if a
randomly selected URL was indexed and cached. White dots indicate
that the URL was not indexed/cached on that particular day.
Backlinks
Each day we asked how many backlinks were reported for each URL.
Backlink totals for each URL
[Graphs: Total disagreements each day when asking the WUI and API how many backlinks were returned for each URL]
[Graphs: Loose disagreements per day (when the API value is less than 90% of the WUI value or greater than 110% of the WUI value)]
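The two disagreement counts can be sketched as follows. The function names are hypothetical, and a "total" disagreement is assumed to mean any difference between the API and WUI values. The loose rule follows the 90% and 110% bounds given above.

```python
def is_loose_disagreement(api_value, wui_value, tolerance=0.10):
    """True when the API value falls outside 90%..110% of the WUI value."""
    lower = (1 - tolerance) * wui_value
    upper = (1 + tolerance) * wui_value
    return api_value < lower or api_value > upper


def count_disagreements(pairs):
    """Return (total, loose) disagreement counts for one day's (api, wui) pairs."""
    total = sum(1 for api, wui in pairs if api != wui)
    loose = sum(1 for api, wui in pairs if is_loose_disagreement(api, wui))
    return total, loose


# One day of toy (API, WUI) counts for five URLs.
day_pairs = [(100, 100), (95, 100), (80, 100), (0, 5), (120, 100)]
total, loose = count_disagreements(day_pairs)
```

The same check applies unchanged to the site: page counts and the total result counts for search terms.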
The following scatterplots show the number of backlinks the APIs
report for each of the 100 URLs on a daily basis vs. what the
WUI reports.
For the most part, the API and WUI of each search engine report the same number of
backlinks for each URL. Yahoo began redirecting link: and site:
queries to their Site Explorer (beta) on 2006-08-15. Apparently it
operates on a different data set than the API, because the API
continued to report numerous backlinks for URLs for which
Site Explorer had no backlinks recorded.
Google reports more than 35K backlinks for only 1 URL. MSN
and Yahoo report much larger numbers.
Site
Each day we asked each search engine how many pages were indexed for each website.
Totals for each website
[Graphs: Total disagreements each day when asking the WUI and API how many total results were returned for each website]
[Graphs: Loose disagreements per day (when the API value is less than 90% of the WUI value or greater than 110% of the WUI value)]
The following scatterplots show the number of resources indexed from
each of the 100 websites (using the site: parameter).
- Google's API typically underestimates the site size in comparison to the WUI.
- MSN reports far fewer pages indexed in general than Google or Yahoo.
- The MSN API reported 0 pages indexed while the WUI reported a non-zero value for nearly every
website for a period of 16 days (2006-08-29 to 2006-09-13).
- Yahoo's API consistently returns 0 indexed pages for www.sexy-pamela-anderson.com
when the WUI reports 2-4. There are only 4 other sites where this happened, and only on
one or two days.
Total Terms
Each day we asked each search engine how many results each search term produced.
Totals for each term
[Graphs: Total disagreements each day when asking the WUI and API how many total results were returned for each search term]
[Graphs: Loose disagreements per day (when the API value is less than 90% of the WUI value or greater than 110% of the WUI value)]
The following scatterplots show the estimated total results for each search term.
- CS terms score higher in general than popular terms.
- The Google API and WUI vary frequently on the number of results they indicate are indexed.
- Yahoo varies less, but the WUI tends to produce a larger number than the API more frequently.
- MSN consistently reports the same number for both API and WUI.
- Google and MSN never reported 0 for their API or WUI. Yahoo's API reported 0 only 4 times,
and their WUI never reported 0.
- For Google there are only a few results that are greater than 2.5 billion.
- MSN never reports having anything past 600 million.
[Graphs: Scatterplot of total results obtained daily from API and WUI interfaces]