ODU Archive Profiling Project Progress

Author: Sawood Alam, Computer Science Department, Old Dominion University, Norfolk, Virginia - 23508
Date: April 10, 2016
Color Key: DONE; REMAINING

Deliverable	Sep14	Oct14	Nov14	Dec14	Jan15	Feb15	Mar15	Apr15	May15	Jun15	Jul15	Aug15	Sep15	Oct15	Nov15	Dec15	Jan16	Feb16
Overall Progress	90%
D1: baseline development, testing with IIPC members	100%
D1: baseline development, testing with IIPC members	defined various profiling policies; defined metrics, structure, and terminology for profiles; implemented various ways to generate profiles from CDX; a configuration mechanism to describe the archive; wrote data cleanup code; wrote various analysis code; a simple entry point for archives to generate profiles; Ahmed AlSum ran the code and generated various profiles for the Stanford archive
D2: sample URI collection, dissemination and feedback from IIPC			95%
D2: sample URI collection, dissemination and feedback from IIPC	collected three million URIs from DMOZ, IA Wayback logs, and Memento Proxy log; collected one million URIs from UKWA access logs; scraped entire Reddit posts data and extracted one million URIs from it; generated profiles using fulltext search for top keywords and random sampling; implemented a language- and collection-agnostic random searcher for sampling URIs and profiling; generating profiles via URI sampling where no fulltext search is available									get feedback from IIPC members
D3: collecting/sampling query logs from IIPC members				100%
D3: collecting/sampling query logs from IIPC members	acquired anonymized IA Wayback access logs of one year; acquired LANL Memento aggregator query logs of 2+ years; have ODU Memento aggregator query logs of many years; acquired anonymized UKWA access log sample; formally asked IIPC members for their access/query logs via email and a blog post; acquired access logs from OldWeb.today, UK National Archives, and Stanford University Archive
D3: instrumenting Memento aggregator						80%
D3: instrumenting Memento aggregator	initial code to consume profiles and return an ordered list of archives; implemented my own Memento aggregator called MemGator; implemented a feature in MemGator to utilize the ordered list of archives									improve code to produce a ranked list of archives based on profiles
D3: other dimensions for profiling						70%
D3: other dimensions for profiling	started working on time profiles; performed analysis of suitable sample size; started working on language profiles									implement language profiles; implement hybrid profiles
D3: internal crawler								100%
D3: internal crawler	discussed possibilities to implement this; discussed alternate approaches to surface dark archive holdings; generalized the CDX profiler so it can even work on a list of URIs, which can be used to generate profiles for dark and private archives
D3: analysis, simulation, validation								90%
D3: analysis, simulation, validation	performed resource requirement analysis; performed growth analysis; performed cost and precision analysis; validated the effect of various profiling policies in predicting the presence of Mementos in archives; analyzed the precision, specificity, and recall tradeoff									analyze effect of hybrid profiles
D4: serialization, transfer, collecting IIPC feedback										95%
D4: serialization, transfer, collecting IIPC feedback	implemented JSON-LD serialization, but discarded it due to scale-related issues; defined CDXJ format for serialization; generated 23 different profiles for each of the two archives and three sample query sets; implemented a way to push profiles to a GitHub repository automatically; verified file size limits in GitHub and other places; discussed profile storage and dissemination options; formally introduced the CDXJ and ORS serialization formats; implemented a GitHub fork-based workflow to upload profiles to a public place and support their discovery									get feedback from IIPC members
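
Illustrative note on the D3 analysis metrics: the report mentions a precision, specificity, and recall tradeoff but does not give formulas. The sketch below assumes the standard definitions, applied to per-lookup predictions of whether an archive holds a Memento for a URI. The function name `routing_metrics` and the sample data are hypothetical and are not taken from the project code.

```python
def routing_metrics(predicted, actual):
    """Compute precision, recall, and specificity for profile-based routing.

    predicted[i] is True if the profile says the archive holds the URI;
    actual[i] is True if the archive really returned a Memento for it.
    (Hypothetical helper; standard metric definitions are assumed.)
    """
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)          # correctly routed
    fp = sum(1 for p, a in pairs if p and not a)      # wasted requests
    fn = sum(1 for p, a in pairs if not p and a)      # missed holdings
    tn = sum(1 for p, a in pairs if not p and not a)  # correctly skipped

    def ratio(num, den):
        # Return None when the metric is undefined (empty denominator).
        return num / den if den else None

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }


# Made-up example: five lookups against one archive.
predicted = [True, True, False, False, True]
actual = [True, False, False, True, True]
metrics = routing_metrics(predicted, actual)
```

Under these assumptions, a more permissive profiling policy tends to raise recall at the cost of specificity, since more lookups are routed to the archive.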

Notable Changes since the Last Update

Introduced a GitHub fork-based profile dissemination mechanism
Polished the code that automates profile upload to GitHub
A submission to the IJDL special issue was accepted

Links