Archive Profile Serialization

Sawood Alam | @ibnesayeed

Computer Science Department, Old Dominion University
Norfolk, Virginia - 23529

Archive Profile

  • High-level digest of an archive
  • Predicts presence of mementos of a URI-R in an archive
  • Provides various statistics about the holdings
  • Small in size
  • Publicly available
  • Easy to update and partially patch
  • Useful for Memento query routing and other things

Profile Contents

  • How to organize contents?
  • What goes in it?
  • How to serialize it?

Flat Organization


{
    "...": {},
    "stats": {
        "suburi": {
            "edu)/": {
                "urim": {
                    "max": 3,
                    "min": 1,
                    "total": 32
                },
                "urir": 12
            },
            "edu,harvard)/": {
                "urim": {
                    "max": 1,
                    "min": 1,
                    "total": 2
                },
                "urir": 2
            },
            "edu,harvard,law,blogs)/": {
                "urim": {
                    "max": 1,
                    "min": 1,
                    "total": 1
                },
                "urir": 1
            },
            "edu,harvard,law,blogs)/tech": {
                "urim": {
                    "max": 1,
                    "min": 1,
                    "total": 1
                },
                "urir": 1
            },
            "edu,harvard,law,blogs)/tech/rss": {
                "urim": {
                    "max": 1,
                    "min": 1,
                    "total": 1
                },
                "urir": 1
            },
            "...": {}
        },
        "...": {}
    }
}
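
The keys in the flat layout are SURT-style sub-URI prefixes of a URI-R. As a rough sketch of how such keys could be derived, the hypothetical helper `suburi_keys` below reverses the host labels and emits one key per host depth and per path depth. The function name and the simplified canonicalization (no query handling, no depth limits) are assumptions for illustration, not the profiler's actual code.

```python
from urllib.parse import urlsplit


def suburi_keys(uri):
    """Return SURT-style prefix keys for a URI, shortest first (illustrative only)."""
    parts = urlsplit(uri)
    # blogs.law.harvard.edu -> ["edu", "harvard", "law", "blogs"]
    labels = parts.hostname.lower().split(".")[::-1]
    segments = [s for s in parts.path.split("/") if s]

    keys = []
    # One key per host depth: edu)/, edu,harvard)/, ...
    for depth in range(1, len(labels) + 1):
        keys.append(",".join(labels[:depth]) + ")/")
    # One key per path depth under the full host.
    host = ",".join(labels) + ")/"
    for depth in range(1, len(segments) + 1):
        keys.append(host + "/".join(segments[:depth]))
    return keys


for key in suburi_keys("http://blogs.law.harvard.edu/tech/rss"):
    print(key)
```

A profiler could increment the `urim` and `urir` counters stored under each generated key while scanning an archive's index.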
          

Grouped Organization


{
    "...": {},
    "stats": {
        "tld": {
            "com)/": {
                "urim": {
                    "max": 10,
                    "min": 2,
                    "total": 72
                },
                "urir": 34
            },
            "edu)/": {
                "urim": {
                    "max": 3,
                    "min": 1,
                    "total": 32
                },
                "urir": 12
            },
            "...": {}
        },
        "domain": {
            "com,cnn)/": {
                "urim": {
                    "max": 1,
                    "min": 1,
                    "total": 2
                },
                "urir": 2
            },
            "...": {},
            "edu,harvard)/": {
                "urim": {
                    "max": 1,
                    "min": 1,
                    "total": 2
                },
                "urir": 2
            },
            "...": {}
        },
        "...": {}
    }
}
          

Nested Organization


{
    "...": {},
    "stats": {
        "tld": {
            "com)/": {
                "domain": {
                    "com,adobe)/": {
                        "urim": {
                            "max": 3,
                            "min": 3,
                            "total": 6
                        },
                        "urir": 2
                    },
                    "...": {}
                },
                "urim": {
                    "max": 3,
                    "min": 1,
                    "total": 17
                },
                "urir": 13
            },
            "edu)/": {
                "domain": {
                    "edu,harvard)/": {
                        "urim": {
                            "max": 1,
                            "min": 1,
                            "total": 2
                        },
                        "urir": 2
                    },
                    "...": {}
                },
                "urim": {
                    "max": 3,
                    "min": 1,
                    "total": 32
                },
                "urir": 12
            },
            "...": {}
        },
        "urim": {
            "max": 6,
            "min": 1,
            "total": 18891
        },
        "urir": 6852
    }
}
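
The nested layout carries the same counters as the flat one, arranged as a tree. The sketch below walks a nested `stats` tree and collects the records into a flat mapping. The helper name `flatten_stats` and the assumption that `tld` and `domain` are the only container groups are illustrative choices, not part of the profile format.

```python
def flatten_stats(node, group_names=("tld", "domain")):
    """Collect {key: {"urim": ..., "urir": ...}} from a nested stats tree."""
    flat = {}
    for group in group_names:
        for key, child in node.get(group, {}).items():
            if key == "...":  # placeholder for elided entries in the slides
                continue
            flat[key] = {"urim": child["urim"], "urir": child["urir"]}
            # Descend into any groups nested under this record.
            flat.update(flatten_stats(child, group_names))
    return flat


nested = {
    "tld": {
        "edu)/": {
            "domain": {
                "edu,harvard)/": {
                    "urim": {"max": 1, "min": 1, "total": 2},
                    "urir": 2,
                },
            },
            "urim": {"max": 3, "min": 1, "total": 32},
            "urir": 12,
        },
    },
}

for key, record in sorted(flatten_stats(nested).items()):
    print(key, record["urir"], record["urim"]["total"])
```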
          

Frequency Metrics


{
    "...": {},
    "stats": {
        "suburi": {
            "com)/": {
                "urim": {
                    "1stqu": 4.2,
                    "3rdqu": 7.13,
                    "max": 12,
                    "mean": 6.52,
                    "median": 8,
                    "min": 1,
                    "sd": 4.18,
                    "total": 86
                },
                "urir": 15
            },
            "...": {}
        },
        "...": {}
    }
}
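
As a minimal sketch of how the frequency fields could be computed, the hypothetical `urim_metrics` function below summarizes a list of memento counts, one count per URI-R under a key. The quartile method (the `statistics.quantiles` default) and the use of the sample standard deviation are assumptions; the slides do not say which conventions the profiler follows.

```python
import statistics


def urim_metrics(counts):
    """Summarize memento counts (one integer per URI-R) under a sub-URI key."""
    # Needs at least two counts; quartile method and sample standard
    # deviation are assumptions, not confirmed by the slides.
    q1, _, q3 = statistics.quantiles(counts, n=4)
    return {
        "1stqu": round(q1, 2),
        "3rdqu": round(q3, 2),
        "max": max(counts),
        "mean": round(statistics.mean(counts), 2),
        "median": statistics.median(counts),
        "min": min(counts),
        "sd": round(statistics.stdev(counts), 2),
        "total": sum(counts),
    }


counts = [1, 2, 2, 3, 12]  # made-up counts for five URI-Rs
record = {"urim": urim_metrics(counts), "urir": len(counts)}
print(record)
```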
          

JSON Serialization

  • Can have complex nested data structures
  • JSON-LD for linked data
  • No partial key lookup
  • Unsuitable for text processing tools
  • Allows processing only when fully loaded
  • A single malformed character makes it unparsable
  • Difficult to patch
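
The fragility point can be illustrated with Python's standard `json` module. In this sketch a two-record profile fragment (values borrowed from the sample profile, structure trimmed for brevity) parses fine, but changing a single character makes the parser reject the whole document, including the record that was untouched.

```python
import json

good = '{"stats": {"suburi": {"uk)/": {"urir": 867817}, "uk,co)/": {"urir": 378686}}}}'
bad = good.replace("867817}", "867817]")  # one wrong character

# The intact document gives access to every record.
print(json.loads(good)["stats"]["suburi"]["uk,co)/"]["urir"])

try:
    json.loads(bad)
except json.JSONDecodeError as err:
    # Nothing is recoverable through the parser, not even the "uk,co)/" record.
    print("whole document rejected:", err.msg)
```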

Sample JSON Profile


{
    "@context": "https://oduwsdl.github.io/context/archprofile.jsonld",
    "@id": "http://www.webarchive.org.uk/ukwa/",
    "about": {
        "accesspoint": "http://www.webarchive.org.uk/wayback/",
        "mechanism": "http://oduwsdl.github.io/terms/mechanism#cdx",
        "name": "UKWA 1996 Collection",
        "profile_updated": "2015-01-20T17:25:30Z",
        "suburi_class": "http://oduwsdl.github.io/terms/suburi#H3P1",
        "more_meta_data": "..."
    },
    "stats": {
        "language": {
            "en-US": {
                "urim": {
                    "max": 13,
                    "min": 1,
                    "total": 47529
                },
                "urir": 25621
            },
            "more_languages": "..."
        },
        "suburi": {
            "uk)/": {
                "urim": {
                    "max": 8,
                    "min": 1,
                    "total": 932432
                },
                "urir": 867817
            },
            "uk,co)/": {
                "urim": {
                    "max": 8,
                    "min": 1,
                    "total": 410979
                },
                "urir": 378686
            },
            "uk,co,bbc)/": {
                "urim": {
                    "max": 2,
                    "min": 1,
                    "total": 128
                },
                "urir": 115
            },
            "uk,co,bbc)/images": {
                "urim": {
                    "max": 1,
                    "min": 1,
                    "total": 3
                },
                "urir": 3
            },
            "more_suburis": "..."
        },
        "time": {
            "199603": {
                "urim": {
                    "max": 5,
                    "min": 1,
                    "total": 124
                },
                "urir": 62
            },
            "more_dates": "..."
        },
        "urim": {
            "max": 1185,
            "min": 1,
            "total": 934747
        },
        "urir": 868513
    }
}
          

CDXJSON Serialization

  • Fusion of CDX and JSON file formats
  • A key followed by a strict single-line JSON value
  • Unlike CDX, values can have arbitrary attributes
  • Friendly to text-processing tools
  • No single root node or single document restrictions
  • Enables binary search
  • Enables partial key lookup
  • Error resilient

Sample CDXJSON Profile

Key String SPACE Single Line JSON NEWLINE

@context "https://oduwsdl.github.io/contexts/archiveprofile.jsonld"
@id "http://www.webarchive.org.uk/ukwa/"
@about {"name": "UKWA 1996 Collection", "type": "suburi#H3P1", "...": "..."}
uk)/ {"urim": {"max": 8, "min": 1, "total": 932432}, "urir": 867817}
uk,co)/ {"urim": {"max": 8, "min": 1, "total": 410979}, "urir": 378686}
uk,co,bbc)/ {"urim": {"max": 2, "min": 1, "total": 128}, "urir": 115}
uk,co,bbc)/images {"urim": {"max": 1, "min": 1, "total": 3}, "urir": 3}
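
A reader for this line format can be sketched in a few lines: split each line at the first space, then parse the remainder as JSON. The hypothetical `parse_cdxjson` below also shows the error-resilience property, since a damaged line is skipped while the other records still load. The function name and the choice to report bad line numbers are assumptions for illustration.

```python
import json


def parse_cdxjson(text):
    """Return (records, bad_lines): parsed values by key, and numbers of bad lines."""
    records, bad_lines = {}, []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        key, _, payload = line.partition(" ")
        try:
            records[key] = json.loads(payload)
        except json.JSONDecodeError:
            bad_lines.append(number)  # skip this line, keep the rest
    return records, bad_lines


# The third line is deliberately truncated (missing its closing brace).
sample = """\
@id "http://www.webarchive.org.uk/ukwa/"
uk)/ {"urim": {"max": 8, "min": 1, "total": 932432}, "urir": 867817}
uk,co)/ {"urim": {"max": 8, "min": 1, "total": 410979}, "urir": 378686
uk,co,bbc)/ {"urim": {"max": 2, "min": 1, "total": 128}, "urir": 115}
"""

records, bad_lines = parse_cdxjson(sample)
print(sorted(records), bad_lines)
```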
          

Conclusions and Future Work

  • CDXJSON offers scalability and failure resilience
  • Reduces the profile size as it allows partial key lookup
  • TODO: Update profiler script to output in CDXJSON
  • TODO: Formalize CDXJSON format
  • Implementation code is available at:
