Hashing service

David Megginson edited this page Feb 13, 2020 · 4 revisions

The HXL Proxy's Hashing service at /api/hash generates an MD5 hash for either an entire HXL dataset, or just the header and hashtag rows. The service takes the following parameters:

| Parameter | Required? | Description |
| --- | --- | --- |
| url | yes | URL of the HXL dataset to hash |
| headers_only | no | If provided ("on"), hash only the last header row and the hashtag row |
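
As a rough illustration of how these parameters fit together, the sketch below builds a request URL for the service. The host name in PROXY_BASE and the helper name build_hash_url are assumptions made for this example, not part of the service documentation; substitute the address of the HXL Proxy deployment you actually use.

```python
from urllib.parse import urlencode

# Assumption: host of the HXL Proxy deployment; replace with your own.
PROXY_BASE = "https://proxy.hxlstandard.org"


def build_hash_url(dataset_url, headers_only=False, base=PROXY_BASE):
    """Return a request URL for the /api/hash endpoint (hypothetical helper)."""
    # "url" is the only required parameter.
    params = {"url": dataset_url}
    if headers_only:
        # The parameter is only sent when wanted, using the value "on".
        params["headers_only"] = "on"
    # urlencode percent-escapes the dataset URL so it survives as a query value.
    return base + "/api/hash?" + urlencode(params)


if __name__ == "__main__":
    print(build_hash_url("https://example.org/data.csv", headers_only=True))
```

The resulting URL can then be fetched with any HTTP client; the example stops short of making the request so that it runs without network access.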

Output

The output is a JSON report with the 32-character hex-encoded MD5 digest, along with supporting metadata:

{
    "hash": "6da2a59520de5c48549a7572b289c528",
    "url": "https://docs.google.com/spreadsheets/d/1ytPD-f4a8CbNKTfMS3EqZOpBo9LWCk_NDKxJCgmpXA8/edit#gid=1101521524",
    "date": "2018-11-20T16:22:22.836026",
    "headers_only": true,
    "headers": [
        "Registro",
        "Sector/Cluster",
        "Organizaci\u00f3n",
        "Hombres",
        "Mujeres",
        "Pa\u00eds",
        "ISO",
        "Dato"
    ],
    "hashtags": [
        "#meta+id",
        "#sector+name+es",
        "#org+name+es",
        "#targeted+m",
        "#targeted+f",
        "#country+name+es",
        "#country+code",
        "#date"
    ]
}
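
A report like the one above could be consumed along the following lines. This is only a sketch: the function name summarize_report is hypothetical, and it assumes the "hash", "headers", and "hashtags" keys are present as in the sample output.

```python
import json
import string


def summarize_report(report_text):
    """Parse a hash report and pair each header with its hashtag (hypothetical helper)."""
    report = json.loads(report_text)
    digest = report["hash"]
    # An MD5 digest is 32 hexadecimal characters; reject anything else early.
    if len(digest) != 32 or any(c not in string.hexdigits for c in digest):
        raise ValueError("not a hex-encoded MD5 digest: %r" % digest)
    # Headers and hashtags are parallel lists, one entry per column.
    columns = list(zip(report["headers"], report["hashtags"]))
    return digest, columns
```

Pairing the two lists makes it easy to see at a glance which hashtag belongs to which text header, for example ("Registro", "#meta+id").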

Use cases

With headers_only specified, matching MD5 hash values indicate that two datasets are essentially of the same type, i.e. they share the same headers and hashtags in the same order (e.g. HXL-hashtagged API output of the same humanitarian dataset for different countries or time periods).

With headers_only unspecified, the MD5 hash value can tell you whether a dataset has changed in any meaningful way since the last time you hashed it.
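
Both use cases come down to comparing digests from two reports. The sketch below works on already-parsed reports (Python dicts); the function names are hypothetical, and the checks on the "headers_only" field are an assumption about how a caller might guard against mixing the two kinds of hash.

```python
def same_structure(report_a, report_b):
    """True if two headers-only reports carry the same digest."""
    for report in (report_a, report_b):
        # A full-dataset hash says nothing about structure alone.
        if not report.get("headers_only"):
            raise ValueError("expected reports generated with headers_only")
    return report_a["hash"] == report_b["hash"]


def has_changed(previous_hash, current_report):
    """True if a full-dataset report differs from a previously stored digest."""
    if current_report.get("headers_only"):
        # A headers-only hash would miss changes in the data rows.
        raise ValueError("expected a report generated without headers_only")
    return previous_hash != current_report["hash"]
```

A monitoring job might store the digest from each run and call has_changed on the next report to decide whether the dataset needs reprocessing.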

Methodology

  • Order of columns and HXL attributes is significant for hashing (the same columns in a different order will produce a different MD5 digest).
  • Differences in whitespace are not significant.
  • The hashes are generated over a UTF-8 encoding of the data.
  • All text headers are hashed first, then all hashtags (breadth-first).
  • Null values are treated as empty strings.
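
The rules above can be illustrated for the headers-only case with a small sketch. It is not the service's actual implementation and is not expected to reproduce its digests: the whitespace normalization (trimming and collapsing runs of whitespace) and the plain concatenation of values are assumptions made for this example, and the function names are hypothetical.

```python
import hashlib


def normalize(value):
    """Treat null as an empty string and make whitespace differences irrelevant."""
    if value is None:
        return ""
    # Assumption: trim the value and collapse internal runs of whitespace.
    return " ".join(str(value).split())


def headers_hash(headers, hashtags):
    """MD5 over all text headers first, then all hashtags, encoded as UTF-8."""
    md5 = hashlib.md5()
    for header in headers:
        md5.update(normalize(header).encode("utf-8"))
    for hashtag in hashtags:
        md5.update(normalize(hashtag).encode("utf-8"))
    # hexdigest() yields the 32-character hex form used in the report.
    return md5.hexdigest()
```

Under these assumptions, [" País ", None] and ["País", ""] hash identically, while swapping two columns changes the digest.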