Deprecate remaining package metadata and add bulk data format #1084
Very happy about deprecating packages; however, I want to add a few other downsides to line-delimited JSON (JSONL).
My preference for OCDS would be to have JSON (with a top-level array) as the canonical format, but to allow JSONL as a possible format as well. Without accepting the former, even writing documentation would be hard and unreadable (unless there was a JSON format for a single release also). Converting from JSON (with a top-level array) to JSONL and vice versa is also trivial. Another possible option is to use YAML, as it has a well-established record delimiter (---).
Thanks, @kindly. JSONL is not without disadvantages, but your alternative proposals don't resolve the major issues from the issue description. At present, all options (including the status quo) have disadvantages; we want, at minimum, a packaging format that is "least bad". It's therefore important to consider and weigh all the advantages and disadvantages together. I'll respond to your specific points below.

JSONL
On the first point, given that most OCDS data is published as a single line of JSON, JSONL would actually be more human-readable, since it at least puts one release/record per line. Perhaps you didn't mean that users would read the JSON in its actual raw form, but would view it using an application that reformats it for readability. In that case:
That said, given that many OCDS datasets are either (1) too large to visually inspect or (2) too disaggregated across thousands or millions of files, I doubt many users are productively reading the data in its raw form without pre-processing it or loading it into an analytical environment. We can confirm these and other assumptions with users directly, of course.

One thing I didn't mention is that JSONL works very well with line-oriented data tools (e.g. many command-line tools). For example, if you want to filter specific releases and then load only those into an analytical environment, you can do so one line at a time (see the sketch at the end of this comment). We can also consider the other streaming formats mentioned in the issue description, which don't have the same downsides.

JSON arrays

One of the big challenges is that large JSON arrays are difficult to work with. Authors of OCDS-specific tools (including long-time authors like ODS) struggle to write code that can run against large datasets on consumer hardware. Most generic JSON tools (including Pandas) have similar issues, and if they support JSON streaming at all, it is with JSONL, not with JSON arrays.

YAML

YAML (YML) is not a good format for data interchange, for many reasons, some of which are:
I also note that YAML is not a formal standard (there is no IETF, W3C, ISO, etc. specification). While a YAML parser could be used to iteratively parse a multi-document stream, … And, while this is not a sufficient reason not to adopt a new file format, there has already been a lot of investment in OCDS tools for JSON data, which will not work with YAML data. Some more here: https://en.wikipedia.org/wiki/YAML#Criticism

For reference, here is how to iteratively read documents in PyYAML. I don't know that YAML parsers in other languages offer an easy way to read iteratively. Even with PyYAML, you need to be lucky enough to see this sentence in the docs: "Loader.get_data() constructs and returns a Python object corresponding to the next document in the stream."

```python
from yaml import Loader

with open('file.yml') as f:
    loader = Loader(f)
    # Read one YAML document at a time, instead of loading the whole stream.
    while loader.check_data():
        print(loader.get_data())
```
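To illustrate the earlier point about line-oriented processing: a minimal sketch, assuming a hypothetical releases.jsonl file with one release per line, of filtering releases before loading them into an analytical environment.

```python
import json

# Keep only the releases of interest, one line at a time,
# without ever parsing the whole file into memory.
filtered = []
with open('releases.jsonl') as f:
    for line in f:
        release = json.loads(line)
        if 'award' in release.get('tag', []):  # hypothetical filter
            filtered.append(release)
```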
For publishers, another benefit of JSONL is that it is easier to create bulk downloads. The least efficient implementation would load all the data into memory, dump it to a single string, then write it. For example:

```python
import json

with open('file.json', 'w') as f:
    f.write(json.dumps(data))
```

An easy optimization is to dump it iteratively, e.g. substituting this line:

```python
json.dump(data, f)  # uses json.JSONEncoder.iterencode() internally
```

Note: not all programming languages' standard JSON libraries do iterative writing (I don't think Ruby's does). To eliminate the need to load all the data into memory, you need to do something like OCDS Kit's SerializableGenerator, borrowed from this StackOverflow answer. This method can be used, for example, to yield a single release at a time to the JSON writer. Most publishers won't write such code on their own. On the other hand, with JSONL, the publisher would just write each release as it is yielded, without ever having to load all releases into memory.

Another advantage for publishers and users is that compact JSON files have much smaller file sizes than indented JSON files. While HTTP compression can largely eliminate the difference, its implementation depends on a savvy publisher.
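To contrast, a minimal sketch (with a hypothetical generator of releases) of how a publisher could write a JSONL bulk file without ever holding all releases in memory:

```python
import json

def generate_releases():
    # Hypothetical: yield one release dict at a time, e.g. from a database cursor.
    yield {'ocid': 'ocds-213czf-000-00001', 'id': '1', 'tag': ['planning']}

with open('releases.jsonl', 'w') as f:
    for release in generate_releases():
        # One compact JSON document per line; only one release is in memory at a time.
        f.write(json.dumps(release, separators=(',', ':')) + '\n')
```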
In terms of which media type to use, there is some discussion here: spring-projects/spring-framework#21283. That thread mentions RFC 7464 (JSON text sequences), in which each JSON text is preceded by a record separator (RS, 0x1E) character and followed by a line feed. However, I'm not confident in users' ability to split on record separators (a character most have never heard of, and which isn't rendered or supported in most browsers, text editors, etc.), and I think there is less software support than for JSONL.
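For illustration only (not a recommendation), a rough sketch of reading such a sequence in Python, assuming a hypothetical file and that it fits in memory:

```python
import json

RS = '\x1e'  # ASCII record separator (0x1E), per RFC 7464

with open('releases.json-seq', encoding='utf-8') as f:  # hypothetical filename
    content = f.read()

# Each JSON text is prefixed with RS, so the first split element is empty.
releases = [json.loads(part) for part in content.split(RS) if part.strip()]
```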
Agreed about YAML, but I find it a shame there is no well-supported safe subset. I still think we should support both JSON (with a top-level array) and JSONL, with a debate about which should be the canonical format. The data model is the same either way, and we should give guidance about when it would be best to use each (say, over 1,000 releases for JSONL). JSON is still, in my eyes, clearly the most ergonomic and useful format for any case except data that does not fit in memory. My main concerns with using JSONL as the "only" format are that:
So for most use cases, JSON is the more usable format, except for the very last stage of publication (where the data will be too large for memory) and for large-data consumption, where JSONL is. So I see the trade-off as ...
I personally feel 1 is the better trade-off, but I can understand why somebody might see 2 as more optimal.
One option is to not deprecate the package schemas, but to do one or both of:
If both are done, then a publisher can publish only JSONL, if they choose. If we omit (1), then a publisher must at least publish data following the package schemas, and can optionally publish the bulk download format. Since some publishers only offer bulk downloads, I think it's better to do both.

I think a simple distinction is to describe each as the "API format" and the "bulk download format", respectively. Web API responses should be relatively small (I don't think we've seen any that are large), in which case the current package schemas are fine. Bulk downloads tend to be large, and therefore should use JSONL. So, the canonical format would be context-dependent. I think this is better than choosing one format for both contexts, since the package schemas cause major headaches for bulk data, and JSONL has disadvantages for API responses.

With respect to the package schemas, I think it's best to leave them as-is rather than change them to JSON arrays. That way, we have 2 package formats (the current embedded releases/records, and JSONL) rather than 3 (since the deprecated format still counts). A top-level JSON array is a little simpler than an embedded JSON array, but not by much. We can also make all package-level metadata optional, so that …
@jpmckinney, that suggestion sounds good to me, and so does not breaking backwards compatibility with the current package schemas. I would consider deprecating/removing the package-level fields, though, so the model will be more consistent in future. It would also be good to remove any validation constraints (even types) on the package-level fields, especially … This means validation tools (like the DRT) have no need to load the whole package into memory in any circumstance, and can just stream data from the releases array.
I totally agree with this. Even in API responses, when the publisher uses a lot of extensions, the extension list can be really long. E.g. https://contrataciones.gov.py/datos/api/v3/doc/ocds/record/ocds-03ad3f-331547-2 and https://apiocds.colombiacompra.gov.co:8443/apiCCE2.0/rest/releases
Noting that the extension template should also be updated to only allow extending the release schema, and not packages. Only the pagination extension extends the package schema. That extension might be merged into OCDS in #928.
From open-contracting/lib-cove-ocds#68 (comment), the AusTender API extension also extends the package schema. Colombia is also mentioned in that issue, but I don't know if that's because they use the pagination extension.
That extension is nearly identical to the pagination extension, so its use case should be satisfied via #928.
As a publisher trying to offer bulk downloads of over 4 million records, we definitely welcome the possibility of using JSONL as a format for these files. Most of our tooling for ETL uses JSONL already, since it is much easier to connect many stages of a data pipeline by streaming records from one place to another, one line at a time. For quick analyses, it is also much easier to use command-line tools such as grep, to count lines using wc -l, and to use the previously mentioned examples of head and tail to gather a subset of the data. We agree with the proposal to offer JSONL as an alternative format for bulk downloads in 1.2 while keeping the current way of doing things for backward compatibility, but we think the latter should be deprecated in future releases for consistency and simplicity.
@jpmckinney, I agree with this approach and the way to word it. The problem is "how to publish MANY records/releases"; for a small number of objects, the current format is good. As a matter of fact, we observe that small arrays come from API responses or data samples, and big arrays are bulk data files.
Thank you, @fmatzdorf, for confirming those use cases!
It seems like we have agreement on this issue. Is any further discussion or consultation required before preparing a PR?
Since this issue affects tools, we will need to review whether we have the capacity to modify tools to accept a bulk data format. We will be doing a review of open issues in the new year to decide which to include in 1.2.0 and which to postpone. This might be one of them.
In #1327 (comment), I suggested: "we can move the release/record package schema pages under [a page for APIs], to keep the Reference section somewhat organized." #1327 didn't end up adding a new page for APIs (it updated a guidance page instead), but I'm noting it here in case there's some other way to reduce the length of the Reference menu.
Moving to 1.3.0/2.0.0 as we don't have the capacity to assist this transition with tooling, etc. I split off a workable subset into #1621. |
There are issues with packages as discussed in open-contracting/infrastructure#89, #605, and CRM-4282 (all relevant comments reflected here), and the current packaging formats confer very few benefits.
Benefits
The benefits of the current packaging format are:
- publisher
- version and extensions
- license and publicationPolicy

A package also sets uri and publishedDate, but this is metadata about the package itself, not about the releases/records it contains.

Discussion
Metadata
- publisher should be moved to the release-level (see Add publisher field (release schema) #325)
- version and extensions can be handled using describedby at the release-level (see Add describedby field for the extended release schema #426)
- Regarding license and publicationPolicy, paraphrasing open-contracting/infrastructure#89: …

See similar comments in #325 (comment)
As such, all metadata provided by the package can be omitted or moved to the release-level, without major issue.
Format
We still want a standardized way to publish multiple releases/records as a single file. A minimal package in the current format with all metadata removed would be:
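For a release package, for example, it might look something like this (illustrative values):

```json
{
  "releases": [
    {
      "ocid": "ocds-213czf-000-00001",
      "id": "1",
      "date": "2023-01-01T00:00:00Z",
      "tag": ["planning"],
      "initiationType": "tender"
    }
  ]
}
```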
The problem with this format is that naive applications will load the entire file into memory. Because bulk download OCDS files can be very large (GBs), doing so exhausts memory on much consumer hardware. Iterative JSON parsers like ijson can be used to index to the releases array and yield one release at a time (as is done in OCDS Kit, for example; see the sketch at the end of this section); however, relatively few users are aware of such libraries, and many common data analysis tools don't use them (Pandas, for example). Indeed, no OCDS software written by ODS uses iterative parsing, leading to memory being exhausted in critical tools like the Data Review Tool on medium-to-large datasets; retrofitting these tools to parse iteratively is not trivial.

Any JSON format that puts releases/records in JSON arrays will suffer the same issue. The only reasonable options are:
There are other JSON streaming options besides line-delimited JSON, but:
An advantage of a ZIP file is that it can contain additional information, e.g. a LICENSE.txt or publicationPolicy.pdf. However, OCDS datasets can contain millions of releases/records. Unless the publisher organizes them into directories somehow, the ZIP file will expand into millions of files, which is a barrier to use for many users. A single (large) line-delimited JSON file is comparatively easier to work with.
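As referenced above, a minimal sketch of iterative parsing with ijson, assuming a hypothetical release package file named package.json:

```python
import ijson

with open('package.json', 'rb') as f:
    # Stream one release at a time from the releases array,
    # without loading the whole package into memory.
    for release in ijson.items(f, 'releases.item'):
        print(release['ocid'])
```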
Proposal
Deprecate packages, and recommend publication of OCDS releases/records as line-delimited JSON.
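For illustration (reusing the hypothetical release values from above), a line-delimited file would simply contain one compact JSON release or record per line:

```
{"ocid": "ocds-213czf-000-00001", "id": "1", "date": "2023-01-01T00:00:00Z", "tag": ["planning"], "initiationType": "tender"}
{"ocid": "ocds-213czf-000-00002", "id": "1", "date": "2023-01-02T00:00:00Z", "tag": ["tender"], "initiationType": "tender"}
```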