Profile Storage And Dissemination Options
Author: Sawood Alam
Date: Wednesday, 22 July, 2015
After generating profiles and serializing them in CDXJ format, we want to store them in a place where they are publicly accessible. Preferably, the storage process should be automated with the help of a script. We also want a sense of versioning in profiles to track how a profile changes over time, hence we want to use some sort of revision control system. Profiles will not be accessed frequently for live usage, but they need to be archived for a long time and remain publicly accessible. With these goals in mind, we discuss some storage and dissemination options in this document.
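To make the format concrete, below is a minimal sketch of serializing a profile to CDXJ in Python. The key structure, the urir statistic, and the @meta convention here are illustrative assumptions, not the exact profile schema.

```python
import json

# Hypothetical profile: SURT-form URI keys mapped to summary statistics.
# The "urir" (number of distinct original URIs) field is an assumption.
profile = {
    "com,cnn)/": {"urir": 4251},
    "edu,odu,cs)/": {"urir": 378},
    "uk,ac,bl)/": {"urir": 1092},
}

meta = {"name": "Example Archive", "updated_at": "2015-07-22T00:00:00Z"}

with open("profile.cdxj", "w") as f:
    # Metadata lines are prefixed with a special key (here "@meta");
    # the exact convention may differ in the actual profile format.
    f.write("@meta %s\n" % json.dumps(meta, sort_keys=True))
    # CDXJ body: one sorted lookup key followed by a JSON block per line.
    for key in sorted(profile):
        f.write("%s %s\n" % (key, json.dumps(profile[key], sort_keys=True)))
```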
File Storage Services
Below we discuss some widely popular file storage services and their pros and cons.
GitHub or BitBucket
GitHub and BitBucket are both code hosting services with free and paid plans. Both use Git as the revision control system. BitBucket offers private repositories in its free plan, but since we are focusing on public repositories here, GitHub may be the better choice as it is more popular and has a bigger community.
Pros
- Allows easy collaboration.
- Provides RESTful API.
- Maintains history.
- Widely popular in developer community.
- GitHub allows hosting an unlimited number of binary files with unlimited bandwidth under releases, with the restriction that no individual file can be larger than 2 GB (a release-upload sketch appears at the end of this section).
Cons
- Git is not well suited for binary files (such as compressed archives).
- Git does not play well with large text files when individual files are a few hundred MB or a few GB in size.
- GitHub imposes a soft limit of 1 GB per repository, a soft limit of 50 MB per file, and a hard limit of 100 MB per file.
- The Git LFS (Large File Storage) extension improves the ability to version large files, but it still caps individual files at about a couple of GB, and the beta service offers only 1 GB of storage and 1 GB of bandwidth under the free plan.
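As an example, pushing a compressed profile as a release asset can be scripted against the GitHub Releases API. Below is a rough sketch using the requests library; the repository name, tag, and token are placeholders.

```python
import requests

TOKEN = "YOUR_GITHUB_TOKEN"             # placeholder access token
REPO = "example-user/archive-profiles"  # placeholder fork name
headers = {"Authorization": "token %s" % TOKEN}

# Create a release to act as a dated, versioned container for profiles.
release = requests.post(
    "https://api.github.com/repos/%s/releases" % REPO,
    json={"tag_name": "profiles-2015-07-22", "name": "Profiles 2015-07-22"},
    headers=headers,
).json()

# Upload the compressed profile as a release asset (each asset must stay
# under the 2 GB per-file limit mentioned above).
upload_url = release["upload_url"].split("{")[0]  # strip URI-template suffix
asset_headers = dict(headers)
asset_headers["Content-Type"] = "application/gzip"
with open("profile.cdxj.gz", "rb") as f:
    requests.post(upload_url, params={"name": "profile.cdxj.gz"},
                  data=f, headers=asset_headers)
```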
Google Drive
Google Drive is a good option for storing large static files. It also keeps track of file revisions, although versioning is not as prominent as in GitHub, where it is a first-class feature.
Pros
- Provides various level of sharing options.
- Allows collaboration with OAuth if the directory permissions are set up properly.
- Provides an API to access and contribute files (see the sketch at the end of this section).
- File change history is preserved.
- Gives unlimited storage with 2 TB single file size limit on academic accounts (.edu domain).
Cons
- Hot-linking is not supported officially; hence linking a file based on the directory structure is not easy, as directories and files are identified by long, randomly generated strings.
- Hot-linking is possible via reverse proxy, but someone has to run and maintain a reverse proxy service.
- Discovery of files is not easy.
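As a rough sketch, uploading a profile with the google-api-python-client library (Drive v3 API) might look like the following; obtaining OAuth2 credentials is elided, and the folder ID is a placeholder.

```python
from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload

def upload_profile(creds, path, folder_id):
    """Upload a compressed profile into a shared Drive folder."""
    service = build("drive", "v3", credentials=creds)
    metadata = {"name": "profile.cdxj.gz", "parents": [folder_id]}
    media = MediaFileUpload(path, mimetype="application/gzip")
    created = service.files().create(
        body=metadata, media_body=media, fields="id").execute()
    # Drive assigns an opaque random ID, which is exactly why stable
    # hot-linking by directory path is hard.
    return created["id"]
```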
Amazon Simple Storage Service (S3)
Amazon Simple Storage Service, commonly known as S3, is a very popular service among web developers for hosting static content. It is also good for archiving and backup purposes.
Pros
- Provides various level of sharing options.
- Allows collaboration with OAuth if the bucket permissions are set up properly.
- Provides API to access and contribute files.
- Provides hot-linking and custom domain support.
- File versioning is possible when explicitly enabled (see the sketch at the end of this section).
Cons
- The service is cheap, but not free.
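Here is a minimal sketch of enabling versioning and uploading a profile with the boto3 library; the bucket name and key layout are placeholders.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "archive-profiles"  # placeholder bucket name

# Enable versioning once per bucket so that re-uploading a profile under
# the same key preserves the older versions.
s3.put_bucket_versioning(
    Bucket=BUCKET,
    VersioningConfiguration={"Status": "Enabled"},
)

# Upload a profile; a public-read ACL allows hot-linking at a stable URL.
with open("profile.cdxj.gz", "rb") as f:
    s3.put_object(
        Bucket=BUCKET,
        Key="profiles/example-archive/profile.cdxj.gz",
        Body=f,
        ACL="public-read",
        ContentType="application/gzip",
    )
```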
Dropbox
Dropbox is a popular file backup and synchronization service for personal and business use.
Pros
- File or folder sharing is possible.
- Collaboration is possible by explicitly adding accounts or via an OAuth key.
- Provides an API to access and contribute files (see the sketch at the end of this section).
- File versions are automatic, but have some limitations.
Cons
- Free storage quota is only 2 GB.
- Hot-linking is possible only via a special directory called "Public" that has limited collaboration options.
- File versions are kept for only 30 days, which is good for recovery from accidental data loss, but not suitable for long-term preservation of the history.
- Extended versioning is only supported in the paid (Pro) plan.
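Below is a minimal sketch of uploading a profile through the Dropbox HTTP API (the v2 files/upload endpoint) using the requests library; the access token and destination path are placeholders.

```python
import json
import requests

TOKEN = "YOUR_DROPBOX_TOKEN"  # placeholder access token

# Upload a profile via the Dropbox v2 "files/upload" endpoint.
with open("profile.cdxj.gz", "rb") as f:
    requests.post(
        "https://content.dropboxapi.com/2/files/upload",
        headers={
            "Authorization": "Bearer %s" % TOKEN,
            "Dropbox-API-Arg": json.dumps(
                {"path": "/profiles/profile.cdxj.gz", "mode": "overwrite"}),
            "Content-Type": "application/octet-stream",
        },
        data=f,
    )
```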
Internet Archive
Internet Archive hosts various types of publicly contributed static files and datasets, such as scanned books, audio files, and structured public datasets. We can ask them to host archive profiles as a central repository and make provisions to allow public/moderated/whitelisted contributions (a rough upload sketch follows the pros and cons below).
Pros
- Centralized repository makes discovery easy.
- Archive services are long lasting by nature.
- Unlimited storage.
Cons
- Need to contact IA for special provision.
- Version management will probably be manual unless they have some system in place.
- Avoiding spam while keeping the service open for public contribution can be challenging.
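If such a provision were made, contributions could be scripted with the internetarchive Python library, as in the sketch below; the item identifier and metadata fields are hypothetical placeholders.

```python
from internetarchive import upload

# Item identifier and metadata are hypothetical placeholders.
upload(
    "archive-profiles-example",
    files=["profile.cdxj.gz"],
    metadata={
        "title": "Example Archive Profile",
        "mediatype": "data",
        "date": "2015-07-22",
    },
)
```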
Amazon Public Dataset
The Amazon Public Dataset initiative is used by Common Crawl to host terabytes of archived data and metadata. I requested access from Amazon to host archive profiles as a public dataset, but have not heard back from them.
Google Public Data
It provides some automatic visualizations, but I think it only accepts data in specific formats.
Google Bigtable and BigQuery
We could possibly transform the data into a form that can be loaded into Bigtable and queried using BigQuery, but this option is not free.
Proposed Workflow
All the storage options described above, except the Internet Archive, should be used in a distributed fashion: individuals can maintain their own repositories and register them in a central location to make discovery easy. An archive can have its profiles available in different places, maintained/generated by different people/scripts, as long as they share a common ID in the meta section of the profiles. A structured directory layout should be proposed to arrange profiles. Large profiles can be split into smaller pieces based on logical divisions, such as one for each TLD or registered domain (in a nested directory structure), or simply by some file size limit, as sketched below.
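Here is a minimal sketch (Python 3) of splitting a CDXJ profile into per-TLD files, assuming SURT-form lookup keys in which the TLD is the first comma-separated token; the file and directory names are illustrative.

```python
import os

def split_profile_by_tld(cdxj_path, out_dir):
    """Split a large CDXJ profile into per-TLD files under out_dir.

    Assumes SURT-form lookup keys (e.g., "com,example)/ {...}"), so the
    TLD is the first comma-separated token of each key.
    """
    handles = {}
    with open(cdxj_path) as f:
        for line in f:
            if line.startswith("@"):
                continue  # meta lines would need to be copied separately
            tld = line.split(" ", 1)[0].split(",", 1)[0]
            if tld not in handles:
                os.makedirs(os.path.join(out_dir, tld), exist_ok=True)
                handles[tld] = open(
                    os.path.join(out_dir, tld, "profile.cdxj"), "w")
            handles[tld].write(line)
    for handle in handles.values():
        handle.close()
```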
We currently propose the following workflow:
- Fork the main Archive Profiler repository.
- Checkout your fork locally.
- Update the config.ini file in the local copy to reflect changes according to your collection/archive and push the changes back to your GitHub fork, so that the changes are not lost and are discoverable by other maintainers later.
- Run the main script by pointing it to the input files such as CDX, keyword samples, or URL samples.
- The script should take care of properly organizing the profiles in sub-directories.
- Once done, generated profiles should automatically be compressed (and split in parts if necessary to deal with the file size limits) and pushed to your GitHub fork as a release.
- All the releases are kept forever in GitHub with the release date and any specified tags, hence they can serve as versions.
- Periodically run the profiler on new datasets to create more incremental releases.
- Pull changes from the upstream repository and merge periodically to update any changes in the profiler code.
- Services that consume profiles can start their discovery by listing all the forks of the upstream (main) repository.
- From there, they can go to each fork and list all of its releases to download compressed profiles locally, combine them, and consume them as desired, as sketched below.
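The discovery step could look like the following rough sketch using the GitHub API via the requests library; the upstream repository name is a placeholder, and API pagination is ignored for brevity.

```python
import requests

UPSTREAM = "example-org/archive-profiler"  # placeholder upstream repository

def discover_profile_assets():
    """Yield download URLs of release assets across all forks."""
    forks = requests.get(
        "https://api.github.com/repos/%s/forks" % UPSTREAM).json()
    for fork in forks:
        releases = requests.get(
            "https://api.github.com/repos/%s/releases"
            % fork["full_name"]).json()
        for release in releases:
            for asset in release.get("assets", []):
                yield asset["browser_download_url"]

for url in discover_profile_assets():
    print(url)
```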