Date: Wednesday, 27 May, 2015
The list of one million sample URIs from DMOZ used in this analysis is available here.
For a given pair of an archive and a profiling policy, the number of unique keys generated determines the cost of the profile for the archive. If the cost of the complete knowledge profile URIR, in which the number of keys equals the number of unique URI-Rs in the archive, is considered to be 1, then the relative cost of other profiles can be calculated as follows:

$$\text{Cost} = \frac{\text{number of keys in the profile}}{\text{number of unique URI-Rs in the archive}}$$
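As a rough illustration of the relative cost described above, the sketch below divides a profile's key count by the key count of the URIR profile. The key counts are copied from the Archive-It columns of the table in this post; the function and variable names are hypothetical and are not part of the original analysis.

```python
# Key counts copied from the Archive-It columns of the table in this post.
ARCHIVE_IT_KEYS = {
    "H1P0": 282,
    "H3P0": 3_049_211,
    "HxP1": 84_720_364,
    "URIR": 1_873_600_422,  # complete knowledge profile: one key per unique URI-R
}


def relative_cost(profile_keys: int, unique_urirs: int) -> float:
    """Return the profile size as a fraction of the complete knowledge profile."""
    if unique_urirs <= 0:
        raise ValueError("unique_urirs must be positive")
    return profile_keys / unique_urirs


# Relative cost of each profile, with URIR as the reference (cost 1).
costs = {
    policy: relative_cost(keys, ARCHIVE_IT_KEYS["URIR"])
    for policy, keys in ARCHIVE_IT_KEYS.items()
}

for policy, cost in costs.items():
    print(f"{policy}: {cost:.2e}")
```

On this scale even the HxP1 profile, the largest one short of URIR, costs under 5% of the complete knowledge profile for Archive-It.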
For a given archive, profile, and set of sample lookup URIs, the accuracy with which the profile predicts the presence or absence of the sample URIs in the archive determines the precision of the profile for the archive. If the search precision of the complete knowledge profile URIR, which predicts presence or absence with 100% accuracy, is considered to be 1, then the search precision of other profiles can be calculated as follows:

$$\text{Precision} = \frac{\text{number of sample URIs present in the archive}}{\text{number of sample URIs predicted as present by the profile}}$$
In our case so far there are no false negatives, hence the formula above is simple. It will change slightly once false negatives become possible.
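The precision values in the table below can be reproduced with a short calculation. This is a minimal sketch that assumes, as stated above, that there are no false negatives, so the number of sample URIs actually present equals the count reported by the URIR profile. The counts are copied from the table; the names `search_precision`, `PRESENT`, and `PREDICTED` are hypothetical.

```python
# Sample URIs (out of 1M) actually present, taken from the URIR row of the table.
PRESENT = {"ArchiveIt": 40_969, "UKWA": 19_121}

# Sample URIs predicted as present by selected profiles, copied from the table.
PREDICTED = {
    "H1P0": {"ArchiveIt": 999_912, "UKWA": 996_311},
    "HxP1": {"ArchiveIt": 233_825, "UKWA": 110_177},
}


def search_precision(present: int, predicted_present: int) -> float:
    """Fraction of predicted-present URIs that are really in the archive.

    Assumes no false negatives, so every present URI is also predicted present.
    """
    if predicted_present <= 0:
        raise ValueError("predicted_present must be positive")
    return present / predicted_present


# Precision per policy and archive, rounded to three places as in the table.
precision = {
    policy: {
        archive: round(search_precision(PRESENT[archive], count), 3)
        for archive, count in by_archive.items()
    }
    for policy, by_archive in PREDICTED.items()
}

for policy, by_archive in precision.items():
    print(policy, by_archive)
```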
Policy | Example | ArchiveIt Keys (Cost) | ArchiveIt Existence Prediction (Out of 1M DMOZ Sample URIs) | ArchiveIt Precision | UKWA Keys (Cost) | UKWA Existence Prediction (Out of 1M DMOZ Sample URIs) | UKWA Precision |
---|---|---|---|---|---|---|---|
H1P0 | uk)/ | 282 | 999,912 | 0.041 | 162 | 996,311 | 0.019 |
onlydom | uk,co,bbc)/ | 2,086,552 | 507,017 | 0.081 | 2,011,203 | 352,347 | 0.054 |
tillsubdom | uk,co,bbc)/1 | 2,199,576 | 497,847 | 0.082 | 2,080,278 | 335,301 | 0.057 |
H3P0 | uk,co,bbc)/ | 3,049,211 | 445,886 | 0.092 | 2,034,487 | 281,505 | 0.068 |
tillpath | uk,co,bbc)/1/2 | 4,681,511 | 467,861 | 0.088 | 4,056,659 | 260,112 | 0.074 |
pathquery* | uk,co,bbc)/1/5 | 4,966,968 | 464,631 | 0.088 | 4,437,415 | 251,766 | 0.076 |
tillquery | uk,co,bbc)/1/2/3 | 5,607,513 | 457,586 | 0.090 | 5,087,497 | 245,101 | 0.078 |
pathquerysuppinit** | uk,co,bbc)/1/5/i | 10,381,970 | 349,678 | 0.117 | 18,481,581 | 180,518 | 0.106 |
tillinit | uk,co,bbc)/1/2/3/I | 11,247,577 | 343,620 | 0.119 | 19,332,428 | 175,530 | 0.109 |
HxP1 | uk,co,bbc,news)/Images | 84,720,364 | 233,825 | 0.175 | 91,433,257 | 110,177 | 0.174 |
URIR | uk,co,bbc,news)/Images/Logo.png?height=80&rotate=90&width=200 | 1,873,600,422 | # 40,969 | 1.000 | 673,796,541 | # 19,121 | 1.000 |
(*) The pathquery policy is similar to the "no first char" summarization of LANL.

(**) The pathquerysuppinit policy is similar to the "with first char" summarization of LANL.

H1P0 is the TLD-only profile, URIR is the complete knowledge profile, H3P0 is a profile with three segments from the host and no segments from the path, and HxP1 is a profile with all the host segments and one path segment.

The following two figures show the search precision of each profile listed in the above table for the two archives, Archive-It and UK Web Archive, respectively. (The URIR/Complete Knowledge profile has search precision 1.)
The search precision values illustrated above can be seen on the relative cost scale below. (The URIR/Complete Knowledge profile has Cost 1.)
The cost and precision values can be seen in perspective of the complete knowledge profile below.
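To make the H1P0, H3P0, and HxP1 policies from the table more concrete, the following sketch builds keys in the same reversed-host style as the Example column and uses a set of such keys to predict existence. It is an illustration under stated assumptions, not the code used for this analysis: `profile_key` and `predicted_present` are hypothetical helpers, the archived and lookup URIs are made up, and details such as query handling or `www` stripping are left out.

```python
from urllib.parse import urlsplit


def profile_key(uri, host_segments=None, path_segments=None):
    """Build a reversed-host key; None keeps all segments of that part."""
    parts = urlsplit(uri if "://" in uri else "http://" + uri)
    # Reverse the host labels: news.bbc.co.uk -> uk, co, bbc, news
    labels = (parts.hostname or "").split(".")[::-1]
    if host_segments is not None:
        labels = labels[:host_segments]
    segments = [s for s in parts.path.split("/") if s]
    if path_segments is not None:
        segments = segments[:path_segments]
    return ",".join(labels) + ")/" + "/".join(segments)


def predicted_present(profile, uri, host_segments=None, path_segments=None):
    """Predict existence: the lookup URI's key must appear in the profile."""
    return profile_key(uri, host_segments, path_segments) in profile


uri = "http://news.bbc.co.uk/Images/Logo.png?height=80&rotate=90&width=200"
h1p0 = profile_key(uri, host_segments=1, path_segments=0)
h3p0 = profile_key(uri, host_segments=3, path_segments=0)
hxp1 = profile_key(uri, host_segments=None, path_segments=1)
print(h1p0, h3p0, hxp1)

# A tiny H3P0 profile built from made-up archived URIs.
archived = ["http://news.bbc.co.uk/Images/Logo.png", "http://www.bl.uk/about"]
profile = {profile_key(u, 3, 0) for u in archived}

# Same three host segments as an archived URI, so it is predicted present,
# even though this exact URI was never archived (a false positive).
hit = predicted_present(profile, "http://sport.bbc.co.uk/cricket", 3, 0)
miss = predicted_present(profile, "http://example.org/", 3, 0)
print(hit, miss)
```

Shorter keys collapse many URIs into one entry, which is why the smaller profiles in the table predict far more of the sample URIs as present than the archive actually holds.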