
Deprecate remaining package metadata and add bulk data format #1084

Open
jpmckinney opened this issue Oct 6, 2020 · 18 comments
Labels
Focus - Packages (relating to release packages and record packages) · Schema (relating to other changes in the JSON Schema: renamed fields, schema properties, etc.)

Comments

@jpmckinney
Member

jpmckinney commented Oct 6, 2020

There are issues with packages as discussed in open-contracting/infrastructure#89, #605, and CRM-4282 (all relevant comments reflected here), and the current packaging formats confer very few benefits.

Benefits

The benefits of the current packaging format are:

  1. A standardized way to publish multiple releases/records as a single file
  2. Easy access to metadata:
    • publisher
    • version and extensions
    • license and publicationPolicy

A package also sets uri and publishedDate, but this is metadata about the package itself, not about the releases/records it contains.

Discussion

Metadata

Regarding license and publicationPolicy, paraphrasing open-contracting/infrastructure#89:

  • License and publication policy metadata are important, but it isn't critical that they be distributed as data; that said, they can be expressed in the machine-readable description of the OCDS dataset in a data registry, using DCAT for example (DCAT has a property for license, and a property for publication policy can be added as an extension, which DCAT-US does with other properties); see the sketch after this list.
  • Most open data (CSVs, etc.) have no means of declaring their license or publication policy, but this poses no major problem to reuse – these are instead declared on the HTML pages that serve or link to the data. Users generally only need to refer to these once, so it's not a challenge to data workflows.
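
For illustration only, a sketch of how this could look in DCAT's JSON-LD serialization; the ex: prefix and its publicationPolicy property are hypothetical extensions, since DCAT itself only defines the license property:

{
  "@context": {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
    "ex": "https://example.com/ns#"
  },
  "@type": "dcat:Dataset",
  "dct:license": {"@id": "https://creativecommons.org/licenses/by/4.0/"},
  "ex:publicationPolicy": {"@id": "https://example.com/publication-policy"}
}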

See similar comments in #325 (comment)

As such, all metadata provided by the package can be omitted or moved to the release-level, without major issue.

Format

We still want a standardized way to publish multiple releases/records as a single file. A minimal package in the current format with all metadata removed would be:

{
  "releases": [
    // big list of releases
  ]
}

The problem with this format is that naive applications will load the entire file into memory. Because bulk download OCDS files can be very large (GBs), doing so exhausts memory on most consumer hardware. Iterative JSON parsers like ijson can be used to index to the releases array and yield one release at a time (as is done in OCDS Kit, for example); however, relatively few users are aware of such libraries, and many common data analysis tools don't use them (Pandas, for example). Indeed, no OCDS software written by ODS uses iterative parsing, leading to memory being exhausted in critical tools like the Data Review Tool on medium-to-large datasets; retrofitting these tools to parse iteratively is not trivial.
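
For example, a minimal sketch of iterative parsing with ijson, assuming a release package in file.json; process() is a hypothetical per-release handler:

import ijson

with open('file.json', 'rb') as f:
    # yields one release at a time, without loading the whole array into memory
    for release in ijson.items(f, 'releases.item'):
        process(release)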

Any JSON format that puts releases/records in JSON arrays will suffer the same issue. The only reasonable options are:

  1. Line-delimited JSON
  2. ZIP files containing individual releases/records

There are other JSON streaming options besides line-delimited JSON, but:

  1. Line-delimited JSON has the widest support and is easy to publish and use, using common JSON libraries
  2. Record separator-delimited JSON is an eccentric format that uses rarely-used record separator characters
  3. Concatenated JSON requires specialized JSON libraries

An advantage of a ZIP file is that it can contain additional information, e.g. a LICENSE.txt or publicationPolicy.pdf. However, OCDS datasets can contain millions of releases/records. Unless the publisher organizes them into directories somehow, the ZIP file will expand into millions of files, which is a barrier to use for many users.

A single (large) line-delimited JSON file is comparatively easier to work with.
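
For illustration, the same data in line-delimited form puts one compact release per line (the values here are invented):

{"ocid": "ocds-213czf-000-00001", "id": "1", "date": "2020-10-06T00:00:00Z", "tag": ["tender"], "initiationType": "tender"}
{"ocid": "ocds-213czf-000-00002", "id": "1", "date": "2020-10-06T00:00:00Z", "tag": ["tender"], "initiationType": "tender"}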

Proposal

Deprecate packages, and recommend publication of OCDS releases/records as line-delimited JSON.

@jpmckinney jpmckinney added the Schema label Oct 6, 2020
@jpmckinney jpmckinney added this to the 1.2.0 milestone Oct 6, 2020
@kindly
Contributor

kindly commented Oct 20, 2020

Very happy about deprecating packages; however, I want to add a few other downsides of line-delimited JSON (JSONL).

  • The JSONL data in its raw form becomes unreadable, as there can be no newlines within the data. So opening a JSONL file needs some preprocessing just to read over the raw data. I consider this a usability concern for open data and for people approaching OCDS data. It requires knowledge of jq or a JSONL library in a programming language just to be able to read the data.
  • JSONL is still much less common than just raw JSON. It is a developer and user barrier to have to write something that supports JSONL.
  • The files should be labelled releases.jsonl, not releases.json, as the content is not JSON. Also, JSONL is not really standardized yet, so it feels wrong to have it as the main data format.

My preference for OCDS would be to have JSON (with a top-level array) as the canonical format, but to allow JSONL as a possible format as well. Without accepting the former, even writing documentation would be hard and unreadable (unless there was also a JSON format for a single release).

Converting from JSON (with a top-level array) to JSONL and vice versa is trivial with jq, and that would be a task mainly for developers, not something that all data users need to understand.

Another possible option is to use YAML, as it has a well-established record delimiter (---) and (most likely) even wider library support than JSONL. It would accept standard JSON objects between the delimiters, as it is a superset of JSON. There could also be big readability gains for data in YAML. My main concern is that YAML has too many other features that we may not want to be used.

@jpmckinney
Member Author

jpmckinney commented Oct 20, 2020

Thanks, @kindly. JSONL is not without disadvantages, but your alternative proposals don't resolve the major issues from the issue description.

Presently, all options (including status quo) have disadvantages. We want, at minimum, a packaging format that is "least bad". It's therefore important to consider and weigh all advantages/disadvantages together.

I'll respond to your specific points.

JSONL

  1. I collected a file from each of the APIs reachable by Kingfisher Collect. Out of 42 JSON files, 30 had the releases/records on a single line. Digiwhist's bulk downloads for 35 jurisdictions are tar.gz files of JSONL files, and Colombia's and Portugal's bulk downloads are ZIP files of JSONL files. As such, most OCDS data is not human-readable in its raw form.
  2. The barrier to JSONL is just to learn to split on newlines, which can be solved, whereas the barrier to JSON arrays (discussed in the next section) is totally unsolvable for most people.
  3. I agree that the proposal should use a different file extension.

On the first point, given that most OCDS data is published as a single line of JSON, JSONL would actually be more human-readable, since it at least puts one release/record per line.

Perhaps you didn't mean that users would read the JSON in its actual raw form, but would view it using an application that reformats it for readability. In that case:

  • Browser extensions like JSONView can indent JSON, but not JSONL (I tried with a sample file; JSONView in Chrome displays the raw data and a parse error). However, I don't know (1) how many OCDS users have such browser extensions installed and (2) how frequently OCDS users read OCDS data in the browser.
  • Extensions for code editors (like JSON Reindent for Sublime Text) can indent JSON, but not JSONL. However, as above, I don't know how many users actually do this.
  • cat file | python -m json.tool is a common way for Python users to re-indent a file. With JSONL, you can do something like while read LINE; do echo $LINE | python -m json.tool; done < file. People unfamiliar with the command line might have memorized the former, but will need to look up the latter. Update: In Python 3.8, you can do cat file | python -m json.tool --json-lines.

That said, given that many OCDS datasets are either (1) too large to visually inspect or (2) too disaggregated over thousands or millions of files, I doubt many users are productively reading the data in its raw form without pre-processing it or loading it into an analytical environment. We can confirm these and other assumptions with users directly, of course.

I didn't mention that JSONL works very well with line-oriented data tools (e.g. many command-line tools). For example, if you want to filter specific releases and then only load those into an analytical environment, you can, for example, grep for a buyer name or supplier ID. If you want to filter a JSON array, you'll have to use a JSON-specific tool like jq. If you want to take a subset, you can use head or tail. To count releases/records, you can use wc -l.

We can also consider the other streaming formats mentioned in the issue description, which don't have the same downsides.

JSON arrays

One of the big challenges is that large JSON arrays are difficult to work with. Authors of OCDS-specific tools (including long-time authors like ODS) struggle to write code that can run against large datasets on consumer hardware. Most generic JSON tools (including Pandas) have similar issues, and if they support JSON streaming at all, it is with JSONL, not with JSON arrays.

Changing from {"releases": [...]} to [...] doesn't resolve these issues. Note that jq .[] does not stream. You need to use the special incantation jq -cn --stream 'fromstream(1|truncate_stream(inputs))' to convert a large JSON array to JSONL without running out of memory, which most users won't know. OCDS Kit makes this conversion easy, e.g. echo '[1,2,3]' | ocdskit echo --root-path item or echo '{"releases":[1,2,3]}' | ocdskit echo --root-path releases.item. However, requiring people to install and use specific tools and learn specific options to be able to read large OCDS data stored as JSON arrays is not desirable. JSONL, in comparison, can be easily split into items (e.g. using the split command on macOS and *nix, or using a string-split method in any programming language), which can then be read by any JSON reader; there is no need for JSONL-specific libraries.
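
To illustrate the last point, a sketch of reading JSONL with only the standard library, with no JSONL-specific dependency:

import json

with open('releases.jsonl') as f:
    for line in f:
        # each line is a complete JSON text, readable by any JSON parser
        release = json.loads(line)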

YAML

YAML (YML) is not a good format for data interchange for many reasons, some of which are:

  • YAML supports serializing arbitrary native data structures, a feature that has caused many severe security issues across its many implementations. We don't want it to be possible to craft OCDS data that runs arbitrary code when parsed.
  • YAML is sensitive to indentation in its non-compact format, which can give unexpected results (without error or warning) if indented improperly. JSON doesn't have this issue.
  • A token's type is based on its content, rather than surrounding syntax (e.g. no is a boolean, 24 is a number, text is a string). To make 24 into a string, you need to put it in quotation marks or use !!str. JSON doesn't have this issue.
  • With YAML being a superset of JSON, we could use a restricted subset of YAML (always quote strings, don't support arbitrary native data structures, etc.), similar to StrictYAML. However, this won't be as well supported.
  • A JSON parser is part of far more programming languages' standard libraries than a YAML parser is.

I also note that YAML is not a formal standard (there is no IETF, W3C, ISO, etc. specification).

While a YAML parser could be used to iteratively parse JSON documents delimited by ---, I don't think that this is clearly better than using a string-split method and a JSON parser to iteratively parse newline-delimited JSON.

And, while this is not a sufficient reason to not adopt a new file format, there has already been a lot of investment in OCDS tools for JSON data, which will not work with YAML data.

Some more here: https://en.wikipedia.org/wiki/YAML#Criticism


For reference, here is how to iteratively read documents in PyYAML. I don't know that YAML parsers in other languages offer an easy way to read iteratively. Even with PyYAML, you need to be lucky enough to see this sentence in the docs: "Loader.get_data() constructs and returns a Python object corresponding to the next document in the stream."

from yaml import Loader

with open('file.yml') as f:
    loader = Loader(f)

    # check_data() returns True while another document remains in the stream;
    # get_data() constructs and returns the Python object for the next document
    while loader.check_data():
        print(loader.get_data())

@jpmckinney
Member Author

jpmckinney commented Oct 20, 2020

For publishers, another benefit of JSONL is that it is easier to create bulk downloads.

The least efficient implementation would load all the data into memory, dump it to a single string, then write it. For example:

import json

# data is the full package (or list of releases), already built in memory
with open('file.json', 'w') as f:
    f.write(json.dumps(data))

An easy optimization is to dump it iteratively, e.g. substituting this line:

    json.dump(data, f)  # uses json.JSONEncoder.iterencode() internally

Note: Not all programming languages' standard libraries for JSON do iterative writing (I don't think Ruby's does).

To eliminate the need to load all the data into memory, you need to do something like OCDS Kit's SerializableGenerator borrowed from this StackOverflow answer. This method can be used, for example, to yield a single release at a time to the JSON writer. Most publishers won't write such code on their own.

On the other hand, with JSONL, the publisher would just write each release as it was yielded, without ever having to load all releases into memory.
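
A minimal sketch of that approach, assuming generate_releases() is a hypothetical generator that yields one release dict at a time:

import json

with open('releases.jsonl', 'w') as f:
    for release in generate_releases():
        # compact output; json.dumps emits no newlines unless indent is set
        f.write(json.dumps(release, separators=(',', ':')) + '\n')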


Another advantage for publishers and users is that compact JSON files have much smaller file sizes than indented JSON files. While HTTP compression can largely eliminate the difference, its implementation depends on a savvy publisher.

@jpmckinney
Member Author

In terms of which media type to use, there is some discussion here: spring-projects/spring-framework#21283

That thread mentions RFC 7464, whose format is RS JSON-text LF, i.e. a record separator character, any JSON text (can be compact or not), and a line feed (newline). RFC 8142 for GeoJSON Text Sequences builds on that RFC.

However, I'm not confident in users' ability to split on record separators (a character most have never heard of, and which isn't rendered or supported in most browsers, text editors, etc.), and I think there is less software support than for JSONL.
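
For comparison, a sketch of writing such a JSON text sequence, assuming an iterable of release dicts; the .json-seq file name is illustrative:

import json

RS = '\x1e'  # the ASCII record separator character (U+001E)

with open('releases.json-seq', 'w') as f:
    for release in releases:
        f.write(RS + json.dumps(release) + '\n')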

@kindly
Contributor

kindly commented Oct 21, 2020

Agreed about YAML, but I find it a shame there is no well-supported safe subset.

I still think we should support both JSON (with a top-level array) and JSONL, with a debate about which should be the canonical format. The data model is the same either way, and we should give guidance about when it is best to use each (say, over 1,000 releases for JSONL).

To my eyes, JSON is still clearly the most ergonomic and useful format in every case except data that does not fit in memory.

My main concerns with using JSONL as the "only" format are:

  • You will have to explain in the docs, for any example, that it is not real OCDS data and that it needs to be pre-processed before it is valid.
  • You will need to preprocess any hand-written OCDS data before it is valid. I understand that in the long run hand-written JSON is not good, but it is probably a first step in learning how to create OCDS data, or for examples.
  • For people just starting to write OCDS in code, it is an initial barrier to getting any valid data and iterating on the work.
  • I'm not sure it is an easy API format to create and consume; pagination is more common.

So JSON is the more usable format for most use cases; JSONL is more usable for the very last stage of publication, where the data is too large for memory, and for large-scale data consumption.

So I see the trade-off as follows. Is it more user-friendly to have:

  1. Two (or more) acceptable formats, with guidance as to when to use which and how to convert between them? Not ideal, as having multiple formats could add confusion.
  2. One and only one format? But that format is not as usable in various cases.

I personally feel 1 is the better trade-off, but I can understand why somebody might see 2 as more optimal.

@jpmckinney
Member Author

jpmckinney commented Oct 21, 2020

One option is to not deprecate the package schemas, but to do one or both of:

  1. Relax the requirement to publish OCDS data following the package schemas
  2. Offer a bulk download package format (JSONL)

If both are done, then a publisher can publish only JSONL, if they choose. If we omit (1), then a publisher must at least publish data following the package schemas, and can optionally publish the bulk download format. Since some publishers only offer bulk downloads, I think it's better to do both.

I think a simple distinction is to describe each as the "API format" and the "bulk download format", respectively. Web API responses should be relatively small (I don't think we've seen any that are large), in which case the current package schemas are fine. Bulk downloads tend to be large, and therefore should use JSONL.

So, the canonical format would be context-dependent. I think this is better than choosing one for both contexts, since the package schemas cause major headaches for bulk data, and JSONL has disadvantages for API responses.


With respect to the package schemas, I think it's best to leave them as-is rather than change them to JSON arrays. That way, we have 2 package formats (the current embedded releases/records, and JSONL) rather than 3 (since the deprecated format still counts). A top-level JSON array is a little simpler than an embedded JSON array, but not by much.

We can also make all package-level metadata optional, so that {"releases": [...]} validates. This also makes it possible to convert from JSONL to the package schemas, without having to put in placeholder values for package-level metadata.
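
A sketch of that conversion, streaming a JSONL file into a minimal release package without loading it all into memory (file names are illustrative):

with open('releases.jsonl') as infile, open('release-package.json', 'w') as outfile:
    outfile.write('{"releases":[')
    for i, line in enumerate(infile):
        if i:
            outfile.write(',')
        outfile.write(line.rstrip('\n'))
    outfile.write(']}')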

@kindly
Contributor

kindly commented Oct 21, 2020

@jpmckinney that suggestion sounds good to me and so does not breaking backwards compatibility with the current package schemas.

I would consider deprecating/removing the package level fields though, so the model will be more consistent in future.

It would also be good to remove any validation constraints (even types) on the package-level fields, especially "uniqueItems": true. https://github.com/open-contracting/standard/blob/1.1-dev/schema/release-package-schema.json#L50

This means validation tools (like the DRT) would have no need to load the whole package into memory in any circumstance, and could just stream data from the releases key, validating each release separately. This was always considered a constraint on why the DRT never tried to stream the data. Getting the top-level keys of a package out in a stream-like fashion is not really supported by any JSON library (even though it is theoretically possible). Also, "uniqueItems": true pretty much requires all the data to be loaded into memory. It would be great if this was never required.
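
A sketch of what that streaming validation could look like, using ijson and jsonschema; release_schema is assumed to be the release schema already loaded as a dict:

import ijson
from jsonschema import Draft4Validator

validator = Draft4Validator(release_schema)  # assumed: release schema as a dict

with open('release-package.json', 'rb') as f:
    # validate each release separately, without loading the whole package
    for release in ijson.items(f, 'releases.item'):
        for error in validator.iter_errors(release):
            print(error.message)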

@jpmckinney jpmckinney added the Focus - Packages label Oct 24, 2020
@jpmckinney jpmckinney changed the title Deprecate packages Deprecate package metadata and add bulk data format Oct 24, 2020
@yolile
Member

yolile commented Oct 29, 2020

> @jpmckinney that suggestion sounds good to me and so does not breaking backward compatibility with the current package schemas.
> I would consider deprecating/removing the package level fields though, so the model will be more consistent in the future.

I totally agree with this, as even in API responses, when the publisher uses a lot of extensions, the extension list can be really long. E.g. https://contrataciones.gov.py/datos/api/v3/doc/ocds/record/ocds-03ad3f-331547-2 https://apiocds.colombiacompra.gov.co:8443/apiCCE2.0/rest/releases

@jpmckinney
Member Author

Noting that the extension template should also be updated to only allow extending the release schema, and not packages.

Only the pagination extension extends the package schema. That extension might be merged into OCDS in #928.

@duncandewhurst
Contributor

From open-contracting/lib-cove-ocds#68 (comment), the AusTender API extension also extends the package schema. Colombia is also mentioned in that issue, but I don't know if that's because they use the pagination extension.

@jpmckinney
Member Author

That extension is nearly identical to the pagination extension, so its use case should be satisfied via #928.

@fmatzdorf

As a publisher trying to offer bulk downloads of over 4 million records, we definitely welcome the possibility of using JSONL as a format for these files. Most of our ETL tooling uses JSONL already, since it is much easier to connect the stages of a data pipeline by streaming records from one place to another, one line at a time. For quick analyses, it is also much easier to use command-line tools such as grep, to count lines using wc -l, and, as mentioned previously, to use head and tail to gather a subset of the data.

We agree with the proposal to offer JSONL as an alternative format for bulk downloads in 1.2 and to keep the current way of doing things for backward compatibility, but we think it should be deprecated in future releases for consistency and simplicity.

@ColinMaudry
Member

ColinMaudry commented Nov 6, 2020

> I think a simple distinction is to describe each as the "API format" and the "bulk download format", respectively. Web API responses should be relatively small (I don't think we've seen any that are large), in which case the current package schemas are fine. Bulk downloads tend to be large, and therefore should use JSONL.

@jpmckinney I agree with this approach and the way it is worded. The problem is how to publish MANY records/releases; for a small number of objects, the current format is good. As a matter of fact, we observe that small arrays come from API responses or data samples, and big arrays are bulk data files.

@jpmckinney
Member Author

Thank you, @fmatzdorf, for confirming those use cases!

@duncandewhurst
Contributor

It seems like we have agreement on this issue. Is any further discussion or consultation required before preparing a PR?

@jpmckinney
Member Author

Since this issue affects tools, we will need to review whether we have the capacity to modify tools to accept a bulk data format. We will be doing a review of open issues in the new year to decide which to include in 1.2.0 and which to postpone. This might be one of them.

@jpmckinney
Member Author

jpmckinney commented Jul 29, 2022

In #1327 (comment) I suggested: "we can move the release/record package schema pages under [a page for APIs], to keep the Reference section somewhat organized."

#1327 didn't end up adding a new page for APIs (instead updating a guidance page), but noting here in case there's some other way to reduce the length of the Reference menu.

@jpmckinney
Member Author

jpmckinney commented Jun 7, 2023

Moving to 1.3.0/2.0.0 as we don't have the capacity to assist this transition with tooling, etc.

I split off a workable subset into #1621.

@jpmckinney jpmckinney modified the milestones: 1.2.0, 1.3.0 or 2.0.0 Jun 7, 2023
@jpmckinney jpmckinney changed the title Deprecate package metadata and add bulk data format Deprecate remaining package metadata and add bulk data format Jun 7, 2023