Telecon With LANL

Date: Wednesday, 27 May, 2015

One Million DMOZ Sample URI Profiling

The list of one million sample URIs from DMOZ used in this analysis is available here.

Relative Cost

For a given archive and profiling policy, the number of unique keys generated determines the cost of the profile for that archive. The complete knowledge profile, URIR, has as many keys as there are unique URI-Rs in the archive. If its cost is taken to be 1, the relative cost of any other profile can be calculated as follows:

Relative Cost (profile, archive) = No. of keys in the given profile for the archive / No. of keys in URIR profile for the archive
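
As a small illustration of the ratio above, the following Python sketch computes relative cost from two key counts. The function name relative_cost is hypothetical; the numbers are the ArchiveIt key counts for the HxP1 and URIR profiles from the Cost Precision Table.

```python
def relative_cost(profile_keys: int, urir_keys: int) -> float:
    """Relative cost: key count of a profile divided by the URIR profile's key count."""
    if urir_keys <= 0:
        raise ValueError("URIR key count must be positive")
    return profile_keys / urir_keys


# ArchiveIt key counts taken from the Cost Precision Table.
ARCHIVEIT_URIR_KEYS = 1_873_600_422
ARCHIVEIT_HXP1_KEYS = 84_720_364

hxp1_cost = relative_cost(ARCHIVEIT_HXP1_KEYS, ARCHIVEIT_URIR_KEYS)
print(f"HxP1 relative cost on ArchiveIt: {hxp1_cost:.4f}")
```

By this measure the HxP1 profile costs roughly 4.5% of the complete knowledge profile on ArchiveIt.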

Search Precision

For a given archive and profile, and a set of sample lookup URIs, the accuracy with which the profile predicts the presence or absence of the sample URIs in the archive determines the precision of the profile for that archive. The complete knowledge profile, URIR, predicts presence or absence with 100% accuracy, so its search precision is 1. The search precision of any other profile can be calculated as follows:

Search Precision (profile, archive, sample) = No. of correctly predicted URIs from the sample / No. of all the predicted URIs

So far there are no false negatives in our case, so the formula above is simple. It will need a slight change once false negatives become possible.
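
A minimal sketch of this calculation in Python, assuming a prediction is "correct" when a URI predicted to be present really is in the archive. The function name search_precision and the toy example.org URIs are hypothetical. The final check uses the ArchiveIt counts from the Cost Precision Table, reading the URIR row's 40,969 as the number of sample URIs actually present.

```python
def search_precision(predicted_present: set, actually_present: set) -> float:
    """Fraction of URIs predicted present that really are present in the archive."""
    if not predicted_present:
        raise ValueError("the profile predicted no URIs as present")
    correct = predicted_present & actually_present
    return len(correct) / len(predicted_present)


# Toy sample: the profile predicts all four URIs, only one is archived.
predicted = {"a.example.org/", "b.example.org/", "c.example.org/", "d.example.org/"}
archived = {"a.example.org/"}
toy_precision = search_precision(predicted, archived)

# Same ratio from the table counts (ArchiveIt, H1P0 profile).
ARCHIVEIT_ACTUALLY_PRESENT = 40_969   # URIR row, precision 1.000
ARCHIVEIT_H1P0_PREDICTED = 999_912    # H1P0 row
h1p0_precision = ARCHIVEIT_ACTUALLY_PRESENT / ARCHIVEIT_H1P0_PREDICTED

print(toy_precision, round(h1p0_precision, 3))
```

The set intersection only equals the full set of archived sample URIs because there are no false negatives; with false negatives the numerator would shrink accordingly.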

Sample Intermediate Values

Sample URI: https://www.news.BBC.co.uk/Images/Logo.png?width=200&height=80&rotate=90#fragment
Canonical URL: news.bbc.co.uk/Images/Logo.png?height=80&rotate=90&width=200
SURT: uk,co,bbc,news)/Images/Logo.png?height=80&rotate=90&width=200
Registered Domain: uk,co,bbc)/
Subdomain Count: 1
Path Count: 2
Query Count: 3
PathQuery Count: 2 + 3 = 5
Path Initial: I
Path Initial (tight): i
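
The values above can be reproduced with a short Python sketch. It is an illustration under assumptions, not the profiler's actual code: the function name intermediate_values is hypothetical, the PUBLIC_SUFFIXES set is a stand-in for a real public suffix list and covers only this example, and the "tight" path initial is assumed to be the lowercased initial.

```python
from urllib.parse import urlsplit

# Hypothetical stand-in for a real public suffix list; covers only this example.
PUBLIC_SUFFIXES = {"co.uk"}


def intermediate_values(uri: str) -> dict:
    parts = urlsplit(uri)                 # the fragment is split off and ignored
    host = parts.hostname or ""           # urlsplit lowercases the hostname
    if host.startswith("www."):
        host = host[4:]
    path = parts.path or "/"
    query_params = sorted(p for p in parts.query.split("&") if p)
    path_and_query = path + ("?" + "&".join(query_params) if query_params else "")

    # Registered domain = public suffix plus one more label (default: last two labels).
    labels = host.split(".")
    reg_len = min(2, len(labels))
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in PUBLIC_SUFFIXES:
            reg_len = len(labels) - i + 1
            break

    segments = [s for s in path.split("/") if s]
    initial = segments[0][0] if segments else ""
    return {
        "canonical": host + path_and_query,
        "surt": ",".join(reversed(labels)) + ")" + path_and_query,
        "registered_domain": ",".join(reversed(labels[-reg_len:])) + ")/",
        "subdomain_count": len(labels) - reg_len,
        "path_count": len(segments),
        "query_count": len(query_params),
        "pathquery_count": len(segments) + len(query_params),
        "path_initial": initial,
        "path_initial_tight": initial.lower(),  # assumption: tight = lowercased
    }


values = intermediate_values(
    "https://www.news.BBC.co.uk/Images/Logo.png?width=200&height=80&rotate=90#fragment"
)
for name, value in values.items():
    print(f"{name}: {value}")
```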

Cost Precision Table

Existence Prediction counts are out of the 1M DMOZ sample URIs.

| Policy | Example | ArchiveIt Keys (Cost) | ArchiveIt Existence Prediction | ArchiveIt Precision | UKWA Keys (Cost) | UKWA Existence Prediction | UKWA Precision |
|---|---|---|---|---|---|---|---|
| H1P0 | uk)/ | 282 | 999,912 | 0.041 | 162 | 996,311 | 0.019 |
| onlydom | uk,co,bbc)/ | 2,086,552 | 507,017 | 0.081 | 2,011,203 | 352,347 | 0.054 |
| tillsubdom | uk,co,bbc)/1 | 2,199,576 | 497,847 | 0.082 | 2,080,278 | 335,301 | 0.057 |
| H3P0 | uk,co,bbc)/ | 3,049,211 | 445,886 | 0.092 | 2,034,487 | 281,505 | 0.068 |
| tillpath | uk,co,bbc)/1/2 | 4,681,511 | 467,861 | 0.088 | 4,056,659 | 260,112 | 0.074 |
| pathquery* | uk,co,bbc)/1/5 | 4,966,968 | 464,631 | 0.088 | 4,437,415 | 251,766 | 0.076 |
| tillquery | uk,co,bbc)/1/2/3 | 5,607,513 | 457,586 | 0.090 | 5,087,497 | 245,101 | 0.078 |
| pathquerysuppinit** | uk,co,bbc)/1/5/i | 10,381,970 | 349,678 | 0.117 | 18,481,581 | 180,518 | 0.106 |
| tillinit | uk,co,bbc)/1/2/3/I | 11,247,577 | 343,620 | 0.119 | 19,332,428 | 175,530 | 0.109 |
| HxP1 | uk,co,bbc,news)/Images | 84,720,364 | 233,825 | 0.175 | 91,433,257 | 110,177 | 0.174 |
| URIR | uk,co,bbc,news)/Images/Logo.png?height=80&rotate=90&width=200 | 1,873,600,422 # | 40,969 | 1.000 | 673,796,541 # | 19,121 | 1.000 |
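
The example keys in the table can be rebuilt from the sample intermediate values with the Python sketch below. The policy semantics are inferred from the example keys, not taken from a specification, and the helper names hmpn_key and count_key are hypothetical. HmPn is read as "first m host labels and first n path segments of the SURT" (with x meaning all host labels). The till* and pathquery* policies are read as the registered domain followed by the listed counts and initial.

```python
def hmpn_key(surt: str, host_labels=None, path_segments: int = 0) -> str:
    """Keep the first host_labels host labels (None = all) and the first path_segments path segments."""
    host, _, rest = surt.partition(")/")
    labels = host.split(",")
    if host_labels is not None:
        labels = labels[:host_labels]
    path = rest.split("?", 1)[0]
    segments = [s for s in path.split("/") if s][:path_segments]
    return ",".join(labels) + ")/" + "/".join(segments)


def count_key(registered_domain: str, *parts) -> str:
    """Append counts (and optionally a path initial) to the registered-domain key."""
    return registered_domain + "/".join(str(p) for p in parts)


# Values from the Sample Intermediate Values section.
SURT = "uk,co,bbc,news)/Images/Logo.png?height=80&rotate=90&width=200"
DOMAIN = "uk,co,bbc)/"
SUBDOMAINS, PATHS, QUERIES = 1, 2, 3

keys = {
    "H1P0": hmpn_key(SURT, 1, 0),
    "onlydom": count_key(DOMAIN),
    "tillsubdom": count_key(DOMAIN, SUBDOMAINS),
    "H3P0": hmpn_key(SURT, 3, 0),
    "tillpath": count_key(DOMAIN, SUBDOMAINS, PATHS),
    "pathquery": count_key(DOMAIN, SUBDOMAINS, PATHS + QUERIES),
    "tillquery": count_key(DOMAIN, SUBDOMAINS, PATHS, QUERIES),
    "pathquerysuppinit": count_key(DOMAIN, SUBDOMAINS, PATHS + QUERIES, "i"),
    "tillinit": count_key(DOMAIN, SUBDOMAINS, PATHS, QUERIES, "I"),
    "HxP1": hmpn_key(SURT, None, 1),
    "URIR": SURT,
}
for policy, key in keys.items():
    print(f"{policy}: {key}")
```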

Notes

Analysis

The following two figures show the search precision of each profile listed in the table above for the two archives, Archive-It and UK Web Archive, respectively. (The URIR/Complete Knowledge profile has search precision 1.)

The search precision values illustrated above can be seen on the relative cost scale below. (The URIR/Complete Knowledge profile has cost 1.)

The cost and precision values can be seen relative to those of the complete knowledge profile below.