ODU Archive Profiling Project Progress

Author
Sawood Alam, Computer Science Department, Old Dominion University, Norfolk, Virginia - 23508
Date
September 09, 2015
Color Key
DONE
REMAINING
Deliverable Sep14 Oct14 Nov14 Dec14 Jan15 Feb15 Mar15 Apr15 May15 Jun15 Jul15 Aug15 Sep15 Oct15 Nov15 Dec15 Jan16 Feb16
D1: baseline development, testing with IIPC members
95%
  • defined various profiling policies
  • defined metrics, structure, and terminology for profiles
  • implemented various ways to generate profiles from CDX
  • a configuration mechanism to describe the archive
  • wrote data cleanup code
  • wrote various analysis code
  • a simple entry point for archives to generate profiles
  • Ahmed AlSum ran the code and generated various profiles for the Stanford archive
  • ask IIPC members to run the profiler on their collections
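As a rough illustration of what generating a profile from CDX can look like, the sketch below counts captures per host key. It assumes space-delimited CDX lines whose first field is a SURT-style URL key (for example `com,example)/about`). The function names and the `max_host_segments` policy knob are hypothetical and are not taken from the project's code.

```python
from collections import Counter

def surt_key(urlkey, max_host_segments=None):
    """Reduce a SURT-style urlkey to its host part, optionally truncated."""
    host = urlkey.split(")", 1)[0]          # drop the path after ")"
    segments = host.split(",")              # e.g. ["uk", "co", "bbc", "news"]
    if max_host_segments is not None:
        segments = segments[:max_host_segments]
    return ",".join(segments) + ")/"

def profile_from_cdx(lines, max_host_segments=None):
    """Count CDX records per (possibly truncated) host key."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("CDX"):  # skip blanks and the CDX header
            continue
        urlkey = line.split(" ", 1)[0]
        counts[surt_key(urlkey, max_host_segments)] += 1
    return counts

sample = [
    " CDX N b a m s k r M S V g",
    "com,example)/ 20150101000000 http://example.com/ text/html 200",
    "com,example)/about 20140101000000 http://example.com/about text/html 200",
    "uk,co,bbc,news)/world 20130101000000 http://news.bbc.co.uk/world text/html 200",
]
full_profile = profile_from_cdx(sample)
short_profile = profile_from_cdx(sample, max_host_segments=2)
```

Truncating the host to fewer segments yields a smaller but less specific profile, which is the kind of trade-off a profiling policy would control.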
D2: sample URI collection, dissemination and feedback from IIPC
60%
  • collected three million URIs from DMOZ, IA Wayback logs, and Memento Proxy log
  • collected one million URIs from UKWA access logs
  • scraped the entire Reddit post data and extracted one million URIs from it
  • generate URIs from top keywords of various languages
  • generate profiles via sampling instead of CDX analysis
  • get feedback from IIPC members
D3: collecting/sampling query logs from IIPC members
80%
  • acquired one year of anonymized IA Wayback access logs
  • acquired 2+ years of LANL Memento aggregator query logs
  • have many years of ODU Memento aggregator query logs
  • acquired anonymized UKWA access log sample
  • need to formally ask other IIPC members for their access/query logs
D3: instrumenting Memento aggregator
80%
  • initial code to consume profiles and return an ordered list of archives
  • implemented my own Memento aggregator called MemGator
  • implemented a feature in MemGator to use the ordered list of archives
  • improve the code to produce a ranked list of archives based on profiles
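A minimal sketch of how an aggregator could turn profiles into an ordered list of archives is shown below. It assumes each profile is a mapping from a URI key to a capture count and ranks archives by that count. The scoring rule, the function name, and the archive names are hypothetical and do not describe how MemGator actually ranks archives.

```python
def rank_archives(profiles, key):
    """Return archive names that list `key`, highest count first."""
    scored = [(name, profile.get(key, 0)) for name, profile in profiles.items()]
    hits = [(name, count) for name, count in scored if count > 0]
    # Sort by descending count, then by name for a stable, repeatable order.
    hits.sort(key=lambda pair: (-pair[1], pair[0]))
    return [name for name, _ in hits]

profiles = {
    "archive_a": {"com,example)/": 5},
    "archive_b": {"com,example)/": 12, "org,example)/": 1},
    "archive_c": {"org,example)/": 3},
}
ranked = rank_archives(profiles, "com,example)/")
```

Archives whose profile has no entry for the key are left out entirely, so the aggregator can skip querying them.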
D3: other dimensions for profiling
60%
  • started working on time profiles
  • performed analysis of suitable sample size
  • started working on language profiles
  • implement language profiles
  • implement hybrid profiles
D3: internal crawler
60%
  • discussed possibilities to implement this
  • discussed alternate approaches to surface dark archive holdings
  • the goal of crawling dark archives can possibly be achieved by CDX analysis or some other means
D3: analysis, simulation, validation
60%
  • performed resource requirement analysis
  • performed growth analysis
  • performed cost and precision analysis
  • validated effect of various profiling policies in predicting presence of Mementos in archives
  • analyze precision and recall tradeoff
  • analyze effect of hybrid profiles
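The precision and recall trade-off mentioned above can be illustrated with a small helper. It compares the set of archives a profile predicts for a URI with the set that actually holds Mementos for it. The function name and the convention of returning 0.0 for an empty set are assumptions made for this sketch.

```python
def precision_recall(predicted, actual):
    """Precision and recall of predicted archives against actual holders."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    # Assumed convention: report 0.0 when a denominator would be zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall

# Four archives predicted, three actually hold the URI, two overlap.
precision, recall = precision_recall(["a", "b", "c", "d"], ["a", "b", "e"])
```

A coarser profile tends to predict more archives, which usually raises recall and lowers precision. A more detailed profile tends to do the opposite at a higher storage cost.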
D4: serialization, transfer, collecting IIPC feedback
90%
  • implemented JSON-LD serialization, but discarded it due to scale-related issues
  • defined CDXJ format for serialization
  • generated 23 different profiles for each of the two archives and three sample query sets
  • implemented a way to push profiles to a GitHub repository automatically
  • verified file size limits in GitHub and other places
  • discussed profile storage and dissemination options
  • formally introduced the CDXJ and ORS serialization formats
  • need a more polished workflow to upload profiles to a public location
  • get feedback from IIPC members
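To give a feel for a line-oriented serialization such as CDXJ, the sketch below writes one record per line as a lookup key followed by a single-line JSON value, sorted by key. This layout is an assumption for illustration only. The `frequency` field name and both function names are hypothetical, and the actual profile format may carry additional fields or metadata lines.

```python
import json

def to_cdxj(profile):
    """Serialize {key: count} as sorted 'key JSON' lines."""
    lines = []
    for key in sorted(profile):
        value = json.dumps({"frequency": profile[key]}, sort_keys=True)
        lines.append(f"{key} {value}")
    return "\n".join(lines) + "\n"

def from_cdxj(text):
    """Parse lines produced by to_cdxj back into {key: count}."""
    profile = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, value = line.split(" ", 1)   # key first, JSON is the rest
        profile[key] = json.loads(value)["frequency"]
    return profile

original = {"org,example)/": 1, "com,example)/": 2}
serialized = to_cdxj(original)
restored = from_cdxj(serialized)
```

Keeping one sorted record per line means a large profile can be searched or merged with ordinary text tools without loading it all into memory, which a single large JSON document does not allow as easily.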

Notable Changes since the Last Update

Links