Add or enhance a function to extract JSON records from a JSON API #382
Comments
I guess something like a
|
Instead of using the entire JSON as a single record, provide a JSON path to query the JSON for the records to process, e.g. `$.data` to process every entry in a `data` array as a record.
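As a rough illustration of the splitting behaviour described above, here is a minimal Python sketch. It is not the Metafacture implementation: `extract_records` and `record_path` are hypothetical names, the sample response (including its `links` field) is made up, and only the simplest path form (`$` followed by dot-separated keys) is handled rather than full JSON path syntax.

```python
import json


def extract_records(json_text, record_path=None):
    """Yield records from a JSON document.

    record_path is a simplified, hypothetical stand-in for a JSON path:
    only "$" followed by dot-separated keys (e.g. "$.data") is handled.
    """
    document = json.loads(json_text)
    if record_path is None:
        # No path given: the whole document is a single record.
        yield document
        return
    if not record_path.startswith("$"):
        raise ValueError("path must start with '$'")
    node = document
    for key in filter(None, record_path[1:].split(".")):
        node = node[key]
    # An array at the path yields one record per entry;
    # any other value is passed on as a single record.
    if isinstance(node, list):
        yield from node
    else:
        yield node


# Made-up API response with the records nested in a "data" array.
response = '{"links": {"next": null}, "data": [{"id": "1"}, {"id": "2"}]}'
records = list(extract_records(response, "$.data"))
print(records)
```

Without a path, the same call yields the entire response as one record, which is the behaviour this issue started from.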
Implemented with a JSON path as suggested by @blackwinter, but as an option in |
Hi, it works fine for that. +1 Cool that this worked out so fast. In my opinion, the function code needs some documentation about the options/attributes that can be selected. |
Updated the flux-commands.md (pending PR https://github.com/metafacture/metafacture-documentation/pull/14/files). @TobiasNx: it may also be a good idea to document that option, especially the "splitting" capability, in other places of the documentation, maybe in the cookbook. |
Aside from the unfortunate fact that the proposed implementation leads to parsing the JSON document twice, it also means loading all extracted records into memory simultaneously. This is a potentially serious limitation. I assume we'd have to implement the filtering mechanism ourselves (in terms of our incremental parsing) if we wanted to avoid those downsides. In which case JSON Pointer might be the simpler specification to implement while still satisfying the current use case. Finally, I'd still prefer a generic stream filter rather than extending each individual format decoder ad hoc whenever the need arises... |
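To give a sense of why JSON Pointer might be the simpler specification, here is a minimal Python sketch of RFC 6901 resolution. The function name `resolve_pointer` and the sample document are illustrative assumptions. Note that it operates on an already parsed document, so it only illustrates how small the syntax is; it does not address the memory concern, which would need the filtering to happen during incremental parsing.

```python
import json


def resolve_pointer(document, pointer):
    """Resolve an RFC 6901 JSON Pointer such as "/data" or "/data/0/id"."""
    if pointer == "":
        # The empty pointer refers to the whole document.
        return document
    if not pointer.startswith("/"):
        raise ValueError("a non-empty JSON Pointer must start with '/'")
    node = document
    for token in pointer[1:].split("/"):
        # Unescape in this order: "~1" -> "/", then "~0" -> "~".
        token = token.replace("~1", "/").replace("~0", "~")
        if isinstance(node, list):
            node = node[int(token)]  # array elements are addressed by index
        else:
            node = node[token]       # object members are addressed by key
    return node


# Made-up document for illustration.
doc = json.loads('{"data": [{"id": "1"}, {"id": "2"}], "a/b": 3}')
print(resolve_pointer(doc, "/data"))
print(resolve_pointer(doc, "/data/1/id"))
```

For the use case in this issue, the pointer `/data` selects the same array as the JSON path `$.data`.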
Hm, does it? If the
Right, but if I'm not mistaken, all that content has already been loaded into memory as a string when passed to
I think there is great benefit in providing (optional) full JSON path support in our JSON decoder. It provides a very flexible mechanism to query any JSON API for records. And the performance cost seems reasonable to me.
However, I also like the idea of a generic stream filter, since it would unify different current approaches (the mentioned XML splitting, maybe also extract-element from the metafacture-html module, and this use case here). What I don't understand, though, is how we would use a JSON path (or pointer, which would work for our use case here as well) at that point, where we have events, not JSON. Am I missing something here? Are you thinking of basically implementing a JSON pointer syntax for our event stream, without actual JSON involved in the process? But wouldn't it make more sense to use our own flattened event name syntax (like
Since we need this functionality in OERSI, I'll merge the approved PR #384 for now. We should reconsider if we want to stick with this for our next actual release or if we have a better solution for this use case by then. So feel free to reopen this or open a new issue at any time. |
Yes, that's what I meant.
Right, I didn't consider this. There are additional data structures/objects with your approach so both memory consumption and GC pressure increase, but not in the way I initially assumed.
Exactly, implementing some path/filter syntax in terms of our stream events (similar, though more involved, to what
Indeed, I thought of that after posting my comment. We're already using (something like) this with |
FTR, that would be |
Yes, that's what I was thinking of. |
Created a new issue to follow up on the generic approach: #385. |
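The generic approach discussed above, filtering on stream events rather than on JSON, could look roughly like the following Python sketch. Everything here is hypothetical: the tuple-based event model, the function name `split_on_entity`, and the assumption that each array entry arrives as an entity named `data` are illustration only and do not reflect the actual Metafacture event names or the solution in #385.

```python
def split_on_entity(events, entity_name):
    """Re-emit each outermost entity named entity_name as its own record.

    Events are hypothetical tuples: ("start_record", id), ("end_record",),
    ("start_entity", name), ("end_entity",), ("literal", name, value).
    Everything outside a matching entity is dropped.
    """
    depth = 0             # entity nesting depth inside the incoming record
    capture_depth = None  # depth at which the matching entity was opened
    count = 0             # running number, used as the new record id
    for event in events:
        kind = event[0]
        if kind == "start_entity":
            depth += 1
            if capture_depth is None and event[1] == entity_name:
                capture_depth = depth
                count += 1
                yield ("start_record", str(count))
                continue
        elif kind == "end_entity":
            if capture_depth == depth:
                # The matching entity closes: end the new record.
                capture_depth = None
                depth -= 1
                yield ("end_record",)
                continue
            depth -= 1
        if capture_depth is not None and kind in ("start_entity", "end_entity", "literal"):
            # Inside a matching entity: pass the event through unchanged.
            yield event


# Made-up event stream for one incoming record with two "data" entries.
events = [
    ("start_record", "1"),
    ("literal", "status", "ok"),
    ("start_entity", "data"),
    ("literal", "id", "a"),
    ("end_entity",),
    ("start_entity", "data"),
    ("literal", "id", "b"),
    ("start_entity", "creator"),
    ("literal", "name", "x"),
    ("end_entity",),
    ("end_entity",),
    ("end_record",),
]
result = list(split_on_entity(events, "data"))
for event in result:
    print(event)
```

Because the generator handles one event at a time, it never needs to hold all extracted records in memory at once, which is the property the generic stream filter was proposed for.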
Should we revert this now that #385 is resolved? (Assuming it actually satisfies the use case.) JSONPath is more powerful, though, so it might still be preferable when decoding JSON. If we decide to keep it, I'd like to get rid of the |
So @TobiasNx tried it for our OERSI use case and it seems like #385 works here as well. At the same time, I'm using the JsonPath support for an experimental workflow to process data coming from an API returning a JSON array (which plain JsonDecoder currently does not support). I added a test case for that in 7b47a1c. While it might make sense to add array support to JsonDecoder, I think this shows how versatile JsonPath support is here. So I vote for keeping this.
I pushed b4e056b to avoid wrapping the JSON in a list when not using JsonPath support. Is that what you meant? I'll open a PR for both these changes, so we can discuss any details there. |
Another aspect is, as you pointed out in a discussion today, that the |
It might be worthwhile to include the same test without the
OK.
Almost ;) Why does |
I don't think adding a dependency to metafacture-core is a problem per se, in particular since it has been modularized, and we only add the dependency to metafacture-json. What I meant was that if we had no use case at all, adding a feature that also introduces a dependency would be no good. But with the two different use cases we saw for using a JsonPath here, I think it's a useful addition, and worth adding a dependency. |
Addressed comments by @blackwinter in 88d941f and 7b978b6. |
Gradle would produce the following error on Windows (while Linux is not affected): "Cannot access input property 'classpath' of task ':metafix-runner:startScripts'. Accessing unreadable inputs or outputs is not supported. Declare the task as untracked by using Task.doNotTrackState(). For more information, please refer to https://docs.gradle.org/8.10.2/userguide/incremental_build.html#sec:disable-state-tracking in the Gradle documentation."
…hub.com:metafacture/metafacture-fix
While we are able to extract JSON records that are arrayed at the top level of a JSON file, we are not able to extract JSON records from a JSON API that has the records in an array in a (sub-)field. At the moment we can't extract or split the records. The JSON file received via the JSON API is extracted as one record:
Example for file:
https://imoox.at/mooc/local/moochubs/classes/webservice.php
In the field `"data"` there are the JSON records as objects. Each of these objects should be retrieved as a single record.
Functional Review: @TobiasNx
Code Review: @dr0i