
# 4,000 Sample URIs

The following is derived from section 3.1 of our "How Much of the Web Is Archived?" (arXiv.org extended version) paper.

The sample URI lists are simple text files. The links below point to the ancillary file copies on arXiv.org. Each sample source contains 1,000 URIs.

## Open Directory Project (DMOZ) Sample Description

The Open Directory Project (DMOZ) as a URI sample source has a long history. Although it is an imperfect source for many reasons (e.g. its contents appear to be driven by commercial motives and are likely biased in favor of commercial sites), DMOZ was included for comparability with previous studies and because it is one of the oldest sources available. In particular, DMOZ archives dating back to 2000 are readily available, which makes DMOZ a reliable source for old URIs that may no longer exist.

We extracted URIs from every DMOZ archive that was available in December 2010, which includes 100 snapshots of DMOZ made from July 20, 2000 through October 3, 2010. First, a combined list of all unique URIs was produced by merging the 100 archives. During this process, the date of the DMOZ archive in which each URI first existed was captured. This date is indirect evidence of the URI's creation date. From this combined list, 3,806 invalid URIs (RFC 3986 violations) were excluded, 1,807 non-HTTP URIs were excluded, and 17,681 URIs with character set encoding errors were excluded. This resulted in 9,415,486 unique, valid URIs from which to sample. The order of these URIs was randomized by sorting on their CRC and the first 1,000 were selected as the sample.
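The merge, filter, and CRC-ordering steps above can be sketched roughly as follows. This is a minimal illustration, not the code used for the study: the function names, the `(date, uris)` snapshot representation, the choice of CRC-32 via `zlib.crc32`, and the decision to keep both `http` and `https` schemes are all assumptions. Full RFC 3986 validation and the character-set check are omitted.

```python
import zlib
from urllib.parse import urlsplit


def merge_snapshots(snapshots):
    """Merge (iso_date, uris) snapshots into {uri: earliest date seen}.

    Dates are assumed to be ISO 8601 strings (YYYY-MM-DD), so plain
    string comparison orders them chronologically.
    """
    first_seen = {}
    for date, uris in snapshots:
        for uri in uris:
            if uri not in first_seen or date < first_seen[uri]:
                first_seen[uri] = date
    return first_seen


def is_http_uri(uri):
    """Keep only URIs with an http(s) scheme and a host (assumed filter)."""
    try:
        parts = urlsplit(uri)
    except ValueError:
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def crc_sample(uris, n):
    """Order URIs by the CRC-32 of their UTF-8 bytes and take the first n.

    The URI itself is used as a tie-breaker so the order is deterministic
    even when two URIs share a checksum.
    """
    ordered = sorted(uris, key=lambda u: (zlib.crc32(u.encode("utf-8")), u))
    return ordered[:n]


if __name__ == "__main__":
    snapshots = [
        ("2001-05-01", ["http://a.example/", "ftp://b.example/"]),
        ("2000-07-20", ["http://a.example/", "http://c.example/"]),
    ]
    first_seen = merge_snapshots(snapshots)
    valid = [u for u in first_seen if is_http_uri(u)]
    for uri in crc_sample(valid, 2):
        print(first_seen[uri], uri)
```

Sorting on a checksum gives a repeatable pseudo-random order: the same input list always yields the same sample, regardless of the order in which the snapshots were read.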

## Delicious Sample Description

Delicious is a social bookmarking service started in 2003; it allows users to tag, save, manage, and share links from a centralized source. Delicious provides two main types of bookmarks. Delicious recent bookmarks are URIs that have been recently added. Delicious popular bookmarks are the currently most popular bookmarks in the Delicious bookmarks set. We retrieved 1,000 URIs from the Delicious Recent Random URI Generator on Nov. 22, 2010. We also considered the Delicious Popular Random URI Generator; however, its small set of distinct URIs could not provide a good sample.

## Bitly Sample Description

The Bitly project is a web-based service for URI shortening. Its popularity grew as a result of being the default URI shortening service on the microblogging service Twitter (from 2009-2010), and it now enjoys a significant user base of its own. Any link posted on Twitter is automatically shortened. Bitly creates a short URI that, when dereferenced, issues an HTTP 301 redirect to a target URI. The shortened URI consists of a short string of alphanumeric characters appended to http://bit.ly/. For example, http://bit.ly/A redirects to http://www.wieistmeineip.de/ip-address:

    % curl -I http://bit.ly/A
    HTTP/1.1 301 Moved
    Date: Sun, 30 Jan 2011 16:00:48 GMT
    Server: nginx
    ...


Shortened URIs provide an entry point for tracking clicks by appending a "+" to the URI, for example http://bit.ly/A+. This tracking page reveals when the short URI was created, as well as the dereferences and associated contexts for the dereferences. The creation time of the Bitly is assumed to be greater than or equal to the creation time of the target URI to which the Bitly redirects.

To sample Bitly, we randomly created a series of alphanumeric strings, dereferenced the corresponding Bitly URIs, and recorded the target URIs (i.e., the URI in the Location: response header). The first 1,000 Bitlys that returned HTTP 301 responses were used. We also recorded the creation time of the Bitlys via their associated "+" pages.
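The sampling loop described above might look something like the following sketch. It is illustrative only: the function names, the maximum key length, the `max_tries` cap, and the pluggable `resolve` argument are assumptions, and the length distribution of the random strings used in the study is not stated. `head_redirect` uses Python's standard `http.client`, which does not follow redirects on its own, so the 301 status and the `Location` header can be read directly.

```python
import http.client
import random
import string
from urllib.parse import urlsplit

ALPHABET = string.ascii_letters + string.digits


def random_key(rng, max_len=6):
    """Random alphanumeric string of 1..max_len characters (max_len assumed)."""
    length = rng.randint(1, max_len)
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def head_redirect(uri, timeout=10):
    """Send a HEAD request without following redirects.

    Returns (status_code, Location header or None).
    """
    parts = urlsplit(uri)
    conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        conn.request("HEAD", parts.path or "/")
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def sample_targets(n, resolve=head_redirect, rng=None, max_tries=100000):
    """Collect the first n short URIs that answer 301, with their targets.

    `resolve` is any callable mapping a URI to (status, location); passing
    a stub makes the loop testable without network access.
    """
    rng = rng or random.Random()
    targets = {}
    tries = 0
    while len(targets) < n and tries < max_tries:
        tries += 1
        short = "http://bit.ly/" + random_key(rng)
        if short in targets:
            continue  # already sampled this key
        status, location = resolve(short)
        if status == 301 and location:
            targets[short] = location
    return targets
```

Passing a stub resolver, for example `sample_targets(5, resolve=lambda uri: (301, "http://example.org/"))`, exercises the loop offline. Recording creation times from the "+" pages is left out because the layout of those pages is not described here.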

## Search Engine Sample Description

Search engines play an important role in web page discovery for most casual users of the Web. Previous studies have examined the relationship between the Web as a whole and the portion indexed by search engines. A search engine sample should be an excellent representation of the Web as a whole. However, the sample must be random, representative, and unbiased. One way to tackle the randomness of this sample is to provide the search engines with multiple random queries, collect the results, and choose again at random from them. This intuitive approach is feasible but suffers from several deficiencies and is extremely biased. The deficiencies stem from the need to create a completely diverse query list covering all topics and keywords. Also, search engines are normally limited to providing only about the first 1,000 results. Bias, on the other hand, comes from the fact that search engines present results with preference to their page rank. The higher the popularity of a certain URI, and the better it matches the query, the more likely it is to appear first in the returned results.

It is necessary to sample the search engine index efficiently, at random, covering most aspects, while also removing ranking bias and popularity completely. Several studies have investigated solving different aspects of this problem. The most suitable solution was presented by Bar-Yossef and Gurevich.

As described by Bar-Yossef and Gurevich, there are two methods to implement this unbiased random URL sampler from a search engine's index. The first utilizes a pool of phrases to assemble queries that are later fed to the search engine. The other approach is based on random walks and does not need a preparation step. The first method was utilized to collect our sample.

We implemented the Bar-Yossef and Gurevich sampler with a small modification to the first phase, pool preparation. Bar-Yossef and Gurevich specified that a huge corpus should be assembled, from which the query pool is created. Instead, the Google N-grams query list was used to create a pool of 1,176,470,663 queries (using 5-grams). A random sampling of the queries was provided to the URI sampler as the second phase. A huge number of URIs were produced; 1,000 were selected at random to be utilized as the unbiased, random, and representative sample of the indexed web.
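To see why drawing results from random queries is biased, and how a pool-based sampler can correct for it, consider the toy sketch below. It is not the Bar-Yossef and Gurevich implementation and not the code used for this sample. It assumes a tiny in-memory `search` function and computes each document's degree (the number of pool queries that return it) exactly, which is not practical against a real search engine, where such quantities have to be estimated. All names are hypothetical.

```python
import random


def build_degrees(pool, search):
    """Run every pool query once; return (results per query, degree per doc)."""
    results = {q: sorted(set(search(q))) for q in pool}
    degree = {}
    for docs in results.values():
        for doc in docs:
            degree[doc] = degree.get(doc, 0) + 1
    return results, degree


def sample_documents(pool, search, n, rng=None):
    """Draw n documents approximately uniformly via rejection sampling.

    Trial step: pick a query with probability proportional to its result
    count, then a result uniformly. A document with degree d is proposed
    with probability proportional to d, so it is accepted with probability
    min_degree / d to cancel that bias.
    """
    rng = rng or random.Random()
    results, degree = build_degrees(pool, search)
    if not degree:
        raise ValueError("no pool query returned any document")
    queries = [q for q in results if results[q]]
    weights = [len(results[q]) for q in queries]
    min_degree = min(degree.values())
    sample = []
    while len(sample) < n:
        query = rng.choices(queries, weights=weights, k=1)[0]
        doc = rng.choice(results[query])
        if rng.random() < min_degree / degree[doc]:
            sample.append(doc)
    return sample


if __name__ == "__main__":
    # Toy index: d2 matches three queries, so naive sampling over-selects it.
    index = {"a b": ["d1", "d2"], "b c": ["d2"], "c d": ["d2", "d3"]}
    picks = sample_documents(list(index), lambda q: index.get(q, []), 3000,
                             rng=random.Random(0))
    for doc in sorted(set(picks)):
        print(doc, picks.count(doc))
```

In this toy index, skipping the acceptance step would propose `d2` three times as often as `d1` or `d3`; the acceptance test is what brings the three counts close together.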