Overall Progress |
|
Deliverable |
Sep14 |
Oct14 |
Nov14 |
Dec14 |
Jan15 |
Feb15 |
Mar15 |
Apr15 |
May15 |
Jun15 |
Jul15 |
Aug15 |
Sep15 |
Oct15 |
Nov15 |
Dec15 |
Jan16 |
Feb16 |
D1: baseline development, testing with IIPC members |
|
|
- defined various profiling policies
- defined metrics, structure, and terminology for profiles
- implemented various ways to generate profiles from CDX
- added a configuration mechanism to describe the archive
- wrote data cleanup code
- wrote analysis code
- added a simple entry point for archives to generate profiles
- Ahmed AlSum ran the code and generated various profiles for the Stanford archive
|
|
D2: sample URI collection, dissemination and feedback from IIPC |
|
|
|
- collected three million URIs from DMOZ, IA Wayback logs, and Memento Proxy log
- collected one million URIs from UKWA access logs
- scraped all Reddit post data and extracted one million URIs from it
- generated profiles using fulltext search for top keywords and random sampling
- implemented a language- and collection-agnostic random searcher for sampling URIs and profiling
- generating profiles via URI sampling where no fulltext search is available
|
- get feedback from IIPC members
|
D3: collecting/sampling query logs from IIPC members |
|
|
|
- acquired one year of anonymized IA Wayback access logs
- acquired 2+ years of LANL Memento aggregator query logs
- have many years of ODU Memento aggregator query logs
- acquired an anonymized UKWA access log sample
- formally asked IIPC members for their access/query logs via email and a blog post
- acquired access logs from OldWeb.today, UK National Archives, and Stanford University Archive
|
|
D3: instrumenting Memento aggregator |
|
|
|
- wrote initial code to consume profiles and return an ordered list of archives
- implemented my own Memento aggregator called MemGator
- implemented a feature in MemGator to use the ordered list of archives
|
- improve the code to produce a ranked list of archives based on profiles
|
D3: other dimensions for profiling |
|
|
|
- started working on time profiles
- performed an analysis of suitable sample size
- started working on language profiles
|
- implement language profiles
- implement hybrid profiles
|
D3: internal crawler |
|
|
|
- discussed possible ways to implement this
- discussed alternate approaches to surface dark archive holdings
- generalized the CDX profiler so that it also works on a list of URIs, which can be used to generate profiles for dark and private archives
|
|
D3: analysis, simulation, validation |
|
|
- performed resource requirement analysis
- performed growth analysis
- performed cost and precision analysis
- validated the effect of various profiling policies on predicting the presence of Mementos in archives
- analyzed the tradeoff among precision, specificity, and recall
|
- analyze the effect of hybrid profiles
|
D4: serialization, transfer, collecting IIPC feedback |
|
|
- implemented JSON-LD serialization, but discarded it due to scale-related issues
- defined CDXJ format for serialization
- generated 23 different profiles for each of the two archives and three sample query sets
- implemented a way to push profiles to a GitHub repository automatically
- verified file size limits on GitHub and elsewhere
- discussed profile storage and dissemination options
- formally introduced the CDXJ and ORS serialization formats
- implemented a GitHub fork-based workflow to upload profiles to a public location and make them discoverable
|
- get feedback from IIPC members
|
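
The precision, specificity, and recall analysis listed under the analysis, simulation, and validation deliverable can be illustrated with a minimal sketch. It assumes that, for each sample URI and archive pair, the profile gives a yes/no prediction of whether the archive holds a Memento, and that ground truth is known from the archive itself. The names `RoutingCounts`, `count_outcomes`, and `routing_metrics` are hypothetical and are not taken from the project code.

```python
from dataclasses import dataclass


@dataclass
class RoutingCounts:
    """Confusion counts for profile-based presence prediction (hypothetical)."""
    true_positive: int = 0   # predicted present, archive holds a Memento
    false_positive: int = 0  # predicted present, archive holds nothing
    true_negative: int = 0   # predicted absent, archive holds nothing
    false_negative: int = 0  # predicted absent, archive holds a Memento


def count_outcomes(predicted, actual):
    """Tally outcomes from two equal-length sequences of booleans."""
    counts = RoutingCounts()
    for p, a in zip(predicted, actual):
        if p and a:
            counts.true_positive += 1
        elif p and not a:
            counts.false_positive += 1
        elif not p and not a:
            counts.true_negative += 1
        else:
            counts.false_negative += 1
    return counts


def _ratio(numerator, denominator):
    # Return 0.0 instead of raising when a class is empty.
    return numerator / denominator if denominator else 0.0


def routing_metrics(c):
    """Standard definitions of precision, recall, and specificity."""
    return {
        "precision": _ratio(c.true_positive, c.true_positive + c.false_positive),
        "recall": _ratio(c.true_positive, c.true_positive + c.false_negative),
        "specificity": _ratio(c.true_negative, c.true_negative + c.false_positive),
    }


if __name__ == "__main__":
    predicted = [True, True, False, False, True]
    actual = [True, False, False, True, True]
    print(routing_metrics(count_outcomes(predicted, actual)))
```

A more detailed profile would be expected to raise specificity (fewer needless requests to archives) at the cost of a larger profile, which is the kind of tradeoff the analysis item refers to.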