ODU Archive Profiling Project Progress

Author: Sawood Alam, Computer Science Department, Old Dominion University, Norfolk, Virginia - 23508
Date: September 09, 2015
Key: for each deliverable, Done lists completed items and Remaining lists open items.

Deliverable	Sep14	Nov14	Dec14	Feb15	Apr15	Jun15
D1: baseline development, testing with IIPC members	95%
	Done: defined various profiling policies; defined metrics, structure, and terminology for profiles; implemented various ways to generate profiles from CDX; a configuration mechanism to describe the archive; wrote data cleanup code; wrote various analysis code; a simple entry point for archives to generate profiles; Ahmed AlSum ran the code and generated various profiles for the Stanford archive
	Remaining: ask IIPC members to run the profiler on their collections
D2: sample URI collection, dissemination and feedback from IIPC		60%
	Done: collected three million URIs from DMOZ, IA Wayback logs, and Memento Proxy log; collected one million URIs from UKWA access logs; scraped the entire Reddit posts data and extracted one million URIs from it
	Remaining: generate URIs from top keywords of various languages; generate profiles via sampling instead of CDX analysis; get feedback from IIPC members
D3: collecting/sampling query logs from IIPC members			80%
	Done: acquired anonymized IA Wayback access logs of one year; acquired LANL Memento aggregator query logs of 2+ years; have ODU Memento aggregator query logs of many years; acquired anonymized UKWA access log sample
	Remaining: need to formally ask other IIPC members for their access/query logs
D3: instrumenting Memento aggregator				80%
	Done: wrote initial code to consume profiles and return an ordered list of archives; implemented my own Memento aggregator called MemGator; implemented a feature in MemGator to utilize the ordered list of archives
	Remaining: improve the code to produce a ranked list of archives based on profiles
D3: other dimensions for profiling				60%
	Done: started working on time profiles; performed analysis of suitable sample size; started working on language profiles
	Remaining: implement language profiles; implement hybrid profiles
D3: internal crawler					60%
	Done: discussed possibilities to implement this; discussed alternate approaches to surface dark archive holdings
	Remaining: the goal of crawling dark archives can possibly be achieved by CDX analysis or some other means
D3: analysis, simulation, validation					60%
	Done: performed resource requirement analysis; performed growth analysis; performed cost and precision analysis; validated the effect of various profiling policies in predicting the presence of Mementos in archives
	Remaining: analyze the precision and recall tradeoff; analyze the effect of hybrid profiles
D4: serialization, transfer, collecting IIPC feedback						90%
	Done: implemented JSON-LD serialization, but discarded it due to scale-related issues; defined the CDXJ format for serialization; generated 23 different profiles for each of the two archives and three sample query sets; implemented a way to push profiles to a GitHub repository automatically; verified file size limits in GitHub and other places; discussed profile storage and dissemination options; formally introduced the CDXJ and ORS serialization formats
	Remaining: needs a more polished workflow to upload the profile to a public place; get feedback from IIPC members

Notable Changes since the Last Update

Updated the archive profiler code to have a single entry point for profile generation, which simplifies the process
Andy provided an anonymized access log; we generated a sample set of one million URIs from it and repeated the analyses performed on earlier samples
Ahmed AlSum helped us by running the Archive Profiler on the Stanford archive to verify its functionality and ease of use
Ahmed AlSum provided the profiles he generated, and we analyzed them
Generating profiles for the remaining UKWA CDX files (earlier, only ten years of CDX files were profiled)
Performed some analysis to identify a suitable size for the sample URI sets used to generate profiles of live archives via sampling
Discussed language profiles along with the profiling based on sampling
Discussed the internal crawler and other approaches to surface dark archive holdings
Implemented a Memento aggregator called MemGator, usable as a CLI or a server, and added support for consuming archive profile information
Wrote a blog post to formally introduce the CDXJ (or CDX-JSON) format as well as the more generic Object Resource Stream (ORS) format
Discussed profile storage and dissemination options
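
To illustrate how an aggregator might consume profiles to order archives, here is a minimal Python sketch. It is not MemGator's actual code. It assumes a CDXJ-like layout in which each line is a lookup key followed by a single-line JSON object, and in which lines beginning with "!" carry metadata. The "count" field, the archive names, the sample keys, and the host-level lookup are all hypothetical choices made for illustration.

```python
import json


def parse_profile(lines):
    """Map each lookup key to its JSON value; skip blank and '!' metadata lines."""
    profile = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("!"):
            continue
        # Assumed layout: "<key> <single-line JSON object>"
        key, _, payload = line.partition(" ")
        profile[key] = json.loads(payload)
    return profile


def host_key(uri_key):
    """Reduce a SURT-like key such as 'com,cnn)/world' to its host part 'com,cnn)/'."""
    return uri_key.split(")", 1)[0] + ")/"


def rank_archives(profiles, uri_key):
    """Return archive names ordered by descending count for the key's host.

    Archives whose profile has no entry for the host are left out.
    """
    scores = {
        name: profile.get(host_key(uri_key), {}).get("count", 0)
        for name, profile in profiles.items()
    }
    return sorted((n for n in scores if scores[n] > 0), key=lambda n: -scores[n])


# Hypothetical profile data for two made-up archives.
archive_a = parse_profile([
    '!meta {"name": "archive-a"}',
    'com,cnn)/ {"count": 120}',
    'uk,co,bbc)/ {"count": 15}',
])
archive_b = parse_profile([
    'uk,co,bbc)/ {"count": 900}',
])
profiles = {"archive-a": archive_a, "archive-b": archive_b}

print(rank_archives(profiles, "uk,co,bbc)/news"))
print(rank_archives(profiles, "com,cnn)/world"))
```

An aggregator could query archives in the returned order and skip those that are absent from the list. A real ranking would likely weigh more than a single count per host.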
