-
Notifications
You must be signed in to change notification settings - Fork 914
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a paginated (JSON) APIDataSet #1525
Comments
Thanks for the thorough explanation @afuetterer ! We'd be very happy for you to create a PR for this 🙂 |
Thanks for your reply @MerelTheisenQB. I have formulated some questions that I am not sure about. Should I first make an implementation proposal in a PR and discussion will be moved to the PR then? |
Hi @afuetterer, To answer some of your questions:
Yes, I'd add this to the
I understand what you mean here, but I don't see it as a very urgent problem. Other dataset types have less parameters, because they can be grouped like
If possible, I'd suggest doing this in pure python. It's okay to use third party libraries, but it would add extra requirements for this dataset (e.g. https://github.com/kedro-org/kedro/blob/main/setup.py#L54) so it's better to avoid it if possible.
Looking at the |
I'd just clarify 'pure python' + |
Also JSMEPath is already part of Kedro can do similar things... |
Thank you @datajoely for pointing this out. I didn't realize this was already a requirement of kedro. It is doing the same operations on nested JSON paths. I will use that library instead and remove dpath. |
No thank you for your contribution - awesome work and a great feature for the Kedro community 🙏 |
I really like this 👍 But I've also never used
If possible I think the implementation should be as generic as possible so that other people can easily use it across a variety of cases. In practice, this means:
The above isn't to discourage you at all! Just to give some food for thought, and unfortunately I don't really know enough about web APIs to offer too many suggestions here. A few more comments:
|
Thank you all for your valuable feedback. I will start working on a PR this weekend. I will remove dpath, add jmespath instead, and yield requests.Response objects instead of "results" or "items" from inside the returned JSON. This will be more similar to the behavior of the original APIDataSet. |
Following on from @AntonyMilneQB's callout from before, it seems like @afuetterer is proposing something like offset pagination, which is one of the most common types. The other would be cursor pagination. Slack's engineering blog has a nice writeup of both. I don't know the exact context of how this pagination via Kedro would be used in practice, so all I'll say here is that out of the two most-common pagination strategies, this one is often a suitable choice. It's not without drawbacks though. One of those being if items are being often being written to the dataset then page results can be irregular, and another being it doesn't scale so well for very large datasets. |
Closing this issue as our datasets now live in the kedro-plugins repository. Feel free to re-open this issue / migrate the related PR there 😄 . |
Description
The APIDataSet is used to communicate with APIs via the requests library. The _load() method returns a requests.Response object to be handled in a connected node. A lot of APIs return paginated responses, with (standardized) links to traverse the list of results. The paginated responses could contain JSON, XML or other formats.
A PaginatedAPIDataSet or PaginatedJSONAPIDataSet class could be a solution for this. Such a class could be configured with the keys to the "next page" links and then be able to traverse these links when loading.
Context
This dataset is needed, when a API response exceeds the allowed batch size of a response. If you want to access all items/results from a given API call, you need to traverse the links. This dataset class could be useful for other users of the library as well.
I implemented a few dummy solutions of the desired behavior, but I wanted to discuss this here first, before submitting a PR. Should such a dataset handle the traversal of the pagination and return response objects or go even further and return the items/results in a API response? In my use case the API response contains a list of "items" that I am interested in, I don't really need the response object itself. But this might be useful for other use cases. Should the dataset return or yield the paginated responses?
Is anybody else in need of a paginated (JSON) APIDataSet?
Possible Implementation
This is a working example, of course it needs some work. I override the _execute_request() method, but only change the way to access request_args, which is probably not a good idea.
My use case is to get the items in JSON API responses directly. I already posted a similar implementation to Stack Overflow.
Thoughts
api
package?Any feedback is greatly appreciated.
The text was updated successfully, but these errors were encountered: