ODU Archive Profiling Project Progress

Sawood Alam, Computer Science Department, Old Dominion University, Norfolk, Virginia - 23508
April 10, 2016
Overall Progress
Deliverable Sep14 Oct14 Nov14 Dec14 Jan15 Feb15 Mar15 Apr15 May15 Jun15 Jul15 Aug15 Sep15 Oct15 Nov15 Dec15 Jan16 Feb16
D1: baseline development, testing with IIPC members
  • defined various profiling policies
  • defined metrics, structure, and terminology for profiles
  • implemented various ways to generate profiles from CDX
  • a configuration mechanism to describe the archive
  • wrote data cleanup code
  • wrote various analysis scripts
  • a simple entry point script for archives to generate profiles
  • Ahmed AlSum ran the code and generated various profiles for the Stanford archive
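The idea of generating a profile from CDX under a profiling policy can be illustrated with a minimal sketch. It is not the project's code: the function names `profile_key` and `build_profile`, the policy parameters `host_segments` and `path_segments`, and the assumption that the original URI is the third whitespace-separated CDX field are all hypothetical choices made for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit

def profile_key(uri, host_segments=None, path_segments=0):
    """Build a SURT-like key such as 'uk,co,bbc)/news' from a URI.

    host_segments and path_segments are hypothetical policy knobs that
    control how much of the host and path is kept in the key.
    """
    parts = urlsplit(uri if "://" in uri else "//" + uri)
    host = (parts.hostname or "").lower().split(".")
    host.reverse()                        # most significant label first
    if host_segments is not None:
        host = host[:host_segments]
    path = [p for p in parts.path.split("/") if p][:path_segments]
    return ",".join(host) + ")/" + "/".join(path)

def build_profile(cdx_lines, **policy):
    """Count captures per key (assumes the original URI is field three)."""
    counts = Counter()
    for line in cdx_lines:
        fields = line.split()
        if len(fields) < 3 or line.lstrip().startswith("CDX"):
            continue                      # skip header and malformed lines
        counts[profile_key(fields[2], **policy)] += 1
    return counts
```

Calling `build_profile(lines, host_segments=3, path_segments=0)` would collapse every capture under a registered domain into one key, while a larger `path_segments` gives a more detailed (and larger) profile.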
D2: sample URI collection, dissemination and feedback from IIPC
  • collected three million URIs from DMOZ, IA Wayback logs, and Memento Proxy log
  • collected one million URIs from UKWA access logs
  • scraped the entire Reddit post data and extracted one million URIs from it
  • generated profiles using fulltext search for top keywords and random sampling
  • implemented a language- and collection-agnostic random searcher for sampling URIs and profiling
  • generating profiles via URI sampling where no fulltext search is available
  • get feedback from IIPC members
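The report does not say how the fixed-size URI samples were drawn from the logs. One common way to take a uniform sample of k items from a stream of unknown length is reservoir sampling, sketched below. The function name `sample_uris` and its parameters are hypothetical and not taken from the project.

```python
import random

def sample_uris(uris, k, seed=None):
    """Return up to k URIs chosen uniformly at random from an iterable.

    Uses reservoir sampling, so the input can be streamed from large
    log files without being loaded into memory.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, uri in enumerate(uris):
        if i < k:
            reservoir.append(uri)         # fill the reservoir first
        else:
            j = rng.randint(0, i)         # inclusive on both ends
            if j < k:
                reservoir[j] = uri        # replace with probability k/(i+1)
    return reservoir
```

Passing a generator that yields one URI per log line keeps memory use proportional to k rather than to the size of the log.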
D3: collecting/sampling query logs from IIPC members
  • acquired anonymized IA Wayback access logs of one year
  • acquired LANL Memento aggregator query logs of 2+ years
  • have ODU Memento aggregator query logs of many years
  • acquired anonymized UKWA access log sample
  • formally asked IIPC members for their access/query logs via email and a blog post
  • acquired access logs from OldWeb.today, UK National Archives, and Stanford University Archive
D3: instrumenting Memento aggregator
  • initial code to consume profiles and return an ordered list of archives
  • implemented my own Memento aggregator called MemGator
  • implemented a feature in MemGator to utilize the ordered list of archives
  • improve code to produce a rank-ordered list of archives based on profiles
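Consuming profiles to return an ordered list of archives might look roughly like the sketch below. This is not MemGator's code. It assumes, for illustration only, that each profile is a mapping from lookup key to capture count and that the caller supplies candidate keys for the requested URI from most to least specific; `rank_archives` is a hypothetical name.

```python
def rank_archives(lookup_keys, profiles):
    """Order archives by how strongly their profile matches a URI.

    lookup_keys: candidate keys for the URI, most specific first.
    profiles:    {archive_name: {key: capture_count}} (assumed layout).
    Archives whose profile matches none of the keys are left out, so an
    aggregator could skip querying them.
    """
    scores = {}
    for archive, profile in profiles.items():
        for key in lookup_keys:
            if key in profile:
                scores[archive] = profile[key]   # first (most specific) hit wins
                break
    # Highest count first; ties broken by archive name for stable output.
    return sorted(scores, key=lambda a: (-scores[a], a))
```

An aggregator could then query only the first few archives in the returned list, trading some recall for fewer outgoing requests.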
D3: other dimensions for profiling
  • started working on time profiles
  • performed analysis of suitable sample size
  • started working on language profiles
  • implement language profiles
  • implement hybrid profiles
D3: internal crawler
  • discussed possibilities to implement this
  • discussed alternate approaches to surface dark archive holdings
  • generalized the CDX profiler so that it can also work on a plain list of URIs, which can be used to generate profiles for dark and private archives
D3: analysis, simulation, validation
  • performed resource requirement analysis
  • performed growth analysis
  • performed cost and precision analysis
  • validated effect of various profiling policies in predicting presence of Mementos in archives
  • analyzed precision, specificity, and recall tradeoff
  • analyze effect of hybrid profiles
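The precision, specificity, and recall tradeoff mentioned above can be made concrete with a small sketch. It assumes a simple evaluation setup that is not described in the report: for one archive, `predicted` is the set of sample URIs the profile says are present, `actual` is the set that really has mementos, and `universe` is the full sample. The function name `routing_metrics` is hypothetical.

```python
def routing_metrics(predicted, actual, universe):
    """Compute precision, recall, and specificity for one archive."""
    predicted, actual, universe = set(predicted), set(actual), set(universe)
    tp = len(predicted & actual)                 # predicted and present
    fp = len(predicted - actual)                 # predicted but absent
    fn = len(actual - predicted)                 # present but not predicted
    tn = len(universe - predicted - actual)      # correctly not predicted

    def ratio(num, den):
        return num / den if den else 0.0         # avoid division by zero

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }
```

A coarse profile (for example, one keyed on top-level domains only) tends to predict presence for many URIs, which raises recall but lowers precision and specificity; a detailed profile tends to do the opposite at a higher storage cost.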
D4: serialization, transfer, collecting IIPC feedback
  • implemented JSON-LD serialization, but discarded it due to scale-related issues
  • defined CDXJ format for serialization
  • generated 23 different profiles for each of the two archives and three sample query sets
  • implemented a way to push profiles to a GitHub repository automatically
  • verified file size limits in GitHub and other places
  • discussed profile storage and dissemination options
  • formally introduced the CDXJ and ORS serialization formats
  • implemented a GitHub fork-based workflow to upload profiles to a public place and make them discoverable
  • get feedback from IIPC members
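As a rough illustration of a line-oriented serialization in the spirit of CDXJ, the sketch below writes one sorted lookup key per line followed by a single-line JSON block. The exact layout and the `frequency` field used in the usage note are assumptions for illustration, not the format definition from the project; `to_cdxj` and `from_cdxj` are hypothetical names.

```python
import json

def to_cdxj(profile):
    """Serialize {key: dict} as sorted 'key JSON' lines (assumed layout)."""
    lines = []
    for key in sorted(profile):
        block = json.dumps(profile[key], sort_keys=True)   # single-line JSON
        lines.append(f"{key} {block}")
    return "\n".join(lines) + "\n"

def from_cdxj(text):
    """Parse lines produced by to_cdxj back into a dictionary."""
    profile = {}
    for line in text.splitlines():
        if not line.strip():
            continue                                       # ignore blank lines
        key, _, block = line.partition(" ")                # key has no spaces
        profile[key] = json.loads(block)
    return profile
```

For example, `to_cdxj({"uk,co,bbc)/": {"frequency": 2}})` yields one line holding the key and its JSON block. Keeping the lines sorted means a consumer can binary-search or stream-merge large profiles instead of parsing one large JSON document, which is the kind of scale concern that motivated moving away from JSON-LD.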

Remarkable Changes since the Last Update