Deliverable |
Sep14 |
Oct14 |
Nov14 |
Dec14 |
Jan15 |
Feb15 |
Mar15 |
Apr15 |
May15 |
Jun15 |
Jul15 |
Aug15 |
Sep15 |
Oct15 |
Nov15 |
Dec15 |
Jan16 |
Feb16 |
D1: baseline development, testing with IIPC members |
|
|
- defined various profiling policies
- defined metrics, structure, and terminology for profiles
- implemented various ways to generate profiles from CDX
- added a configuration mechanism to describe the archive
- wrote data cleanup code
- wrote analysis code
- wrote a simple entry point for archives to generate profiles
- Ahmed AlSum ran the code and generated various profiles for the Stanford archive
|
- ask IIPC members to run the profiler on their collections
|
D2: sample URI collection, dissemination and feedback from IIPC |
|
|
|
- collected three million URIs from DMOZ, IA Wayback logs, and Memento Proxy logs
- collected one million URIs from UKWA access logs
- scraped the entire Reddit post data and extracted one million URIs from it
|
- generate URIs from top keywords of various languages
- generate profiles via sampling instead of CDX analysis
- get feedback from IIPC members
|
D3: collecting/sampling query logs from IIPC members |
|
|
|
- acquired one year of anonymized IA Wayback access logs
- acquired 2+ years of LANL Memento aggregator query logs
- have many years of ODU Memento aggregator query logs
- acquired an anonymized UKWA access log sample
|
- need to formally ask other IIPC members for their access/query logs
|
D3: instrumenting Memento aggregator |
|
|
|
- wrote initial code to consume profiles and return an ordered list of archives
- implemented my own Memento aggregator, MemGator
- implemented a feature in MemGator to use the ordered list of archives
|
- improve the code to produce a ranked list of archives based on profiles
|
D3: other dimensions for profiling |
|
|
|
- started working on time profiles
- performed an analysis of suitable sample sizes
- started working on language profiles
|
- implement language profiles
- implement hybrid profiles
|
D3: internal crawler |
|
|
|
- discussed possible ways to implement this
- discussed alternative approaches to surface dark archive holdings
|
- the goal of crawling dark archives can possibly be achieved by CDX analysis or some other means
|
D3: analysis, simulation, validation |
|
|
- performed resource requirement analysis
- performed growth analysis
- performed cost and precision analysis
- validated the effect of various profiling policies on predicting the presence of Mementos in archives
|
- analyze the precision and recall tradeoff
- analyze the effect of hybrid profiles
|
D4: serialization, transfer, collecting IIPC feedback |
|
|
- implemented JSON-LD serialization, but discarded it due to scale-related issues
- defined CDXJ format for serialization
- generated 23 different profiles for each of the two archives and three sample query sets
- implemented a way to push profiles to a GitHub repository automatically
- verified file size limits on GitHub and elsewhere
- discussed profile storage and dissemination options
- formally introduced the CDXJ and ORS serialization formats
|
- need a more polished workflow to upload profiles to a public location
- get feedback from IIPC members
|
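
To make the profile-related items above more concrete (generating profiles, serializing them line by line, and using them to order archives), here is a minimal Python sketch. It is an illustration under assumptions, not the project's code and not how MemGator works: the host-level key, the `frequency` field, and every function name (`host_key`, `build_profile`, `to_cdxj_lines`, `parse_cdxj_lines`, `rank_archives`) are hypothetical. The only thing taken from the general idea of CDXJ is that each line holds a sortable key followed by a JSON object.

```python
import json
from urllib.parse import urlsplit


def host_key(uri):
    """Reduce a URI to a reversed-host key such as 'uk,co,bbc)/'.

    Assumption: a host-level profiling policy; other policies could keep
    path segments or collapse to the registered domain.
    """
    host = urlsplit(uri).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return ",".join(reversed(host.split("."))) + ")/"


def build_profile(uris):
    """Count how many of an archive's URIs fall under each key."""
    counts = {}
    for uri in uris:
        key = host_key(uri)
        counts[key] = counts.get(key, 0) + 1
    return counts


def to_cdxj_lines(profile):
    """Serialize a profile as sorted 'key JSON' lines (CDXJ-like)."""
    return [
        f"{key} {json.dumps({'frequency': count})}"
        for key, count in sorted(profile.items())
    ]


def parse_cdxj_lines(lines):
    """Read the lines written by to_cdxj_lines back into a dict."""
    profile = {}
    for line in lines:
        key, _, payload = line.partition(" ")
        profile[key] = json.loads(payload)["frequency"]
    return profile


def rank_archives(uri, profiles):
    """Return archive names ordered by how often the URI's key appears.

    Archives whose profile lacks the key are dropped, so an aggregator
    could skip querying them. Ties are broken by archive name.
    """
    key = host_key(uri)
    scored = [(profile.get(key, 0), name) for name, profile in profiles.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in scored if score > 0]


# Hypothetical holdings for three archives.
profiles = {
    "archive_a": build_profile([
        "http://www.bbc.co.uk/news",
        "http://bbc.co.uk/sport",
        "http://example.com/",
    ]),
    "archive_b": build_profile(["http://bbc.co.uk/"]),
    "archive_c": build_profile(["http://example.org/"]),
}

ranked = rank_archives("http://bbc.co.uk/weather", profiles)
print(ranked)
```

Counting by key is only one possible scoring rule. The remaining work listed above (ranked ordering, hybrid profiles, the precision and recall tradeoff) would likely change both the key policy and the score.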