Profile Storage And Dissemination Options

Author: Sawood Alam

Date: Wednesday, 22 July, 2015

After generating profiles and serializing them in CDXJ format, we want to store them where they are publicly accessible. Preferably, a script should automate the storage process. We also want profiles to be versioned so that we can track how a profile changes over time, which calls for some form of revision control system. Profiles will not be accessed frequently for live use, but they need to be archived for a long time and remain publicly accessible. With these goals in mind, this document discusses some storage and dissemination options.

File Storage Services

Below we discuss some popular file storage services and their pros and cons.

GitHub or BitBucket

GitHub and BitBucket are both code hosting services with free and paid plans. Both support Git as the revision control system. BitBucket offers private repositories in its free plan, but since we are focusing on public repositories here, GitHub may be the better choice because it is more popular and has a larger community.

Pros

Cons

Google Drive

Google Drive is a good option for storing large static files. It also keeps track of old revisions of a file, although revision handling is not as prominent as in GitHub, where version management is a first-class feature.

Pros

Cons

Amazon Simple Storage Service (S3)

Amazon Simple Storage Service, commonly known as S3, is a very popular service among web developers for hosting static content. It is also good for archiving and backup purposes.

Pros

Cons

DropBox

DropBox is a popular file backup and synchronization service for personal and business usage.

Pros

Cons

Internet Archive

The Internet Archive hosts various types of publicly contributed static files and datasets, such as scanned books, audio files, and structured public datasets. We can ask them to host archive profiles as a central repository and to make provisions for public, moderated, or whitelisted contributions.

Pros

Cons

Amazon Public Dataset

The Amazon Public Dataset initiative is used by Common Crawl to host terabytes of archived data and metadata. I asked Amazon for access to host archive profiles as a public dataset, but did not hear back from them.

Google Public Data

It provides some automatic visualizations, but I think it accepts data only in specific formats.

Google BigTable and BigQuery

We can possibly transform the data into a form that can be loaded into BigTable and queried using BigQuery, but this option is not free.

Proposed Workflow

All storage options described above, except the Internet Archive, should be used in a distributed manner, so that individuals can maintain their own repositories and register them in a central location to make discovery easy. An archive can have its profiles available in different places, maintained or generated by different people or scripts, as long as the profiles carry a common ID in their meta section. A structured directory layout should be proposed to arrange profiles. Large profiles can be split into smaller pieces based on logical divisions, such as one per TLD or registered domain (in a nested directory structure), or simply by a file size limit.
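
As a rough illustration of the per-TLD split, the following Python sketch groups the lines of a profile by TLD and derives a nested path for each piece. It rests on several assumptions that this document does not fix: that each data line starts with a SURT-style key such as "com,example)/", that metadata lines start with "@", and that pieces are laid out as <archive-id>/<tld>/profile.cdxj. The function names and the layout are hypothetical, not a proposed standard.

```python
from collections import defaultdict


def tld_of(cdxj_line):
    """Return the TLD of a data line, assuming a SURT-style key
    such as 'com,example)/path' at the start of the line."""
    key = cdxj_line.split(" ", 1)[0]   # text before the first space
    host = key.split(")", 1)[0]        # 'com,example'
    return host.split(",", 1)[0]       # 'com'


def split_by_tld(lines):
    """Separate metadata lines from data lines and group data lines by TLD.

    Assumption: metadata lines start with '@'; everything else is data.
    """
    meta = []
    groups = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            meta.append(line)
        else:
            groups[tld_of(line)].append(line)
    return meta, dict(groups)


def shard_path(archive_id, tld):
    """Hypothetical nested layout: one directory per archive, then per TLD."""
    return f"{archive_id}/{tld}/profile.cdxj"
```

Each piece would presumably repeat the metadata lines, so that the common archive ID stays attached to every file. Splitting by registered domain or by file size would follow the same pattern with a different grouping key.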

We currently propose the following workflow: