Add a paginated (JSON) APIDataSet #1587
Conversation
Thanks so much for opening this PR! Some initial thoughts from me on the questions you asked:
In theory I like the idea of having a base class.
Seems reasonable to me 👍
I'm not sure I understand your question here. Could you elaborate a bit more?
I think it's fine to just ignore that error like you did.
Oops, I didn't mean to close this at all!
Thanks for your input. I will continue working on this PR this week.
This was just a suggestion. There are also other ways of doing the pagination itself. I added functionality to follow a "next page" link in a JSON response. But the pagination might be done in the URLs, like example.api.com?q=query&p1 -> p2 -> p3, ... without an explicit "next page" link. Then the
I will add a better explanation with an example soon. This is actually kind of related to the other PR #1445.
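To illustrate the second scheme described above, here is a minimal sketch of pagination driven by a page number in the query string rather than by an explicit "next page" link. This is not code from this PR; the endpoint, the `q`/`page` parameter names and the `results` payload key are all made up for the example.

```python
from typing import Iterator

import requests


def iterate_pages(url: str, query: str, session: requests.Session) -> Iterator[requests.Response]:
    """Request page 1, 2, 3, ... until the API returns an empty result list."""
    page = 1
    while True:
        # "q" and "page" are hypothetical parameter names for illustration only.
        response = session.get(url, params={"q": query, "page": page})
        response.raise_for_status()
        if not response.json().get("results"):  # hypothetical payload key
            break
        yield response
        page += 1
```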
In response to @AntonyMilneQB in #1525: you asked about PR #1445. I will add an update to my class this week. So far, I would like to add the following.

These arguments are needed for the initial request only:

Every subsequent request should be a GET request, taking the next page URL from the previous response (if present). In addition to the discussion in #1445, I think it might be useful to have some of these arguments configurable.
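A rough sketch of that flow (an assumption about how it could look, not the code in this PR): the initial request carries all configured arguments, and every later request is a plain GET against whatever next-page URL the previous response advertised. The `next` key is a placeholder for whatever the target API actually returns.

```python
from typing import Any, Dict, Iterator, Optional

import requests


def follow_pages(url: str, method: str, request_args: Dict[str, Any]) -> Iterator[requests.Response]:
    with requests.Session() as session:
        # Only the first request uses the full set of configured arguments.
        response = session.request(method, url, **request_args)
        response.raise_for_status()
        yield response
        # Each later request is a GET against the "next" link from the previous page.
        next_url: Optional[str] = response.json().get("next")  # hypothetical key
        while next_url:
            response = session.get(next_url)
            response.raise_for_status()
            yield response
            next_url = response.json().get("next")
```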
Some more questions. In the base class, …

Also, there might be a rate limit set up for an API: if the pagination exceeds the allowed rate, the API will respond with HTTP 429 and the request will fail. I stumbled across this library, which might be useful for handling this: https://github.com/JWCook/requests-ratelimiter. The limit could then be configured when defining the dataset.
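A sketch of how that library could slot in, based on the usage shown in its README (the exact constructor arguments should be double-checked, and the URL below is a placeholder): `LimiterSession` is meant as a drop-in replacement for `requests.Session`, so the pagination loop itself would not need to change.

```python
from requests_ratelimiter import LimiterSession

# Allow at most 60 requests per minute (rate chosen arbitrarily for the example).
session = LimiterSession(per_minute=60)

for page in range(1, 6):
    # The session sleeps as needed so the configured limit is never exceeded.
    response = session.get("https://api.example.com/records", params={"page": page})
    response.raise_for_status()
```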
I think it's a good suggestion. Let's keep it like this and add a clarification in the docstring.
Personally, I think it's fine to just go with one way of pagination, and users who find they need another way of paginating can add that functionality themselves. But of course, I'm open to hearing suggestions from others who have used the dataset.
I'm definitely in favour of adding some of those arguments.
I think it's fine to have the requests identify as
That's a good question. The library looks okay, but it doesn't have a lot of activity: only 10 stars and just 1 active contributor. Ideally, we would find an alternative way to make this work. Can we handle it with try/catch logic? How many requests would be sent out when using this dataset? Is it realistic that it would actually go over the rate limit, or only if a very strict limit has been set?
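One possible try/except shape, purely as a sketch of the idea rather than an endorsement of either approach: catch the HTTP 429, wait for the period the server suggests in its `Retry-After` header (assumed here to be given in seconds), and retry a bounded number of times. The function name and defaults are hypothetical.

```python
import time
from typing import Any

import requests


def get_with_retry(session: requests.Session, url: str, max_retries: int = 3, **kwargs: Any) -> requests.Response:
    """GET a URL, backing off and retrying when the server answers 429 Too Many Requests."""
    for attempt in range(max_retries + 1):
        response = session.get(url, **kwargs)
        if response.status_code != 429:
            response.raise_for_status()
            return response
        if attempt == max_retries:
            response.raise_for_status()  # give up: re-raise the 429 as an HTTPError
        # Retry-After is optional; fall back to a fixed pause if it is missing.
        time.sleep(int(response.headers.get("Retry-After", 5)))
    raise RuntimeError("unreachable")  # keeps type checkers happy
```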
Thanks for the input. I will adapt the PR soon.

Regarding the user-agent: I guess the user-agent is not important for a kedro user or the logs generated by kedro, but for a service provider it might be interesting to know who is doing "all the requests" to their service.

Regarding rate-limiting: it depends on which metric/quality measurement you value most. Is it GitHub stars, maturity, number of adopters, recent activity, test coverage?

Regarding the number of requests: it really depends on the thing you want to "harvest". I am, for example, interested in metadata or research datasets accessible from a REST API. Depending on the query it can be a few requests up to tens or hundreds. For example, if you are interested in metadata of Zenodo records, they allow 60 requests per minute: https://developers.zenodo.org/#rate-limiting
Thank you for all your work on this @afuetterer! I'm going to have a proper look through and comment on it later this week 🙂

FYI, a very relevant question on the rate limiting came up in a related case; clearly it's an issue there too. Just thought I'd mention it here in case looking at a related case gives another perspective on this.
Thanks @AntonyMilneQB. I am waiting for PR #1633 to be accepted and will then adapt my subclass here. So any feedback will be appreciated, but this PR is still very much a work in progress. I will not be able to work on this for the next two weeks, but I will pick it up after the summer break.

Regarding the rate limiting: it is an interesting problem that was not on my mind. So far I was only thinking of a single dataset making requests.
Hi @afuetterer, do you still want to complete this PR? We'd like to get all PRs related to datasets merged soon, now that we're moving our datasets code to a different package (see our Medium blog post for more details). Otherwise we'll close this PR and ask you to re-open it on the new repo when it's ready.
Hi @afuetterer, we are releasing a new version of Kedro this week, which will include a release of our new datasets package.
Hi @merelcht, thank you. I will do that. Yay for the new release.
Description
This PR is my approach to implementing a paginated (JSON) APIDataSet, as described in #1525. It is my first PR to this project.
Development notes
I added a `PaginatedAPIDataSet` and a `PaginatedJSONAPIDataSet`, which implements a `_get_next_page()` method. `make test` and `make lint` are failing locally due to files I did not touch; I am not sure what happened there. I committed with `--no-verify` to be able to push to GitHub. I left a few TODOs in my code, as this is still work in progress.

I am very glad to get some feedback on these questions:

- Having a base `PaginatedAPIDataSet` and a `PaginatedJSONAPIDataSet` subclass allows other PaginatedAPIDataSets to be added, e.g. XML or other formats, or other ways to traverse pagination. What do you think? Is this overkill?
- `PaginatedAPIDataSet` needs to make multiple requests to an API, so I went for a `requests.Session` object and needed to change `_execute_request()`.
- `_request_args` is changed during `load()`; I am not sure if it needs to be the same after calling `load()` once.
- `PaginatedAPIDataSet` yields `requests.Response` objects. This is caught by mypy, because the return type differs (`requests.Response` vs. `Iterator[requests.Response]`) in the overwritten method. How to deal with this?

I will of course add more tests to bring test coverage up to 100%.
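For readers of this description, a condensed sketch of the structure it outlines. This is illustrative only, not the code in the PR: it leaves out the kedro `AbstractDataSet` plumbing (`_describe`, `_save`, credential handling), and the `next` key is a placeholder for whatever the target API actually returns.

```python
from typing import Any, Dict, Iterator, Optional

import requests


class PaginatedAPIDataSet:
    """Base class: one configured request, then follow pages via ``_get_next_page``."""

    def __init__(self, url: str, method: str = "GET", **request_args: Any) -> None:
        self._url = url
        self._method = method
        self._request_args: Dict[str, Any] = request_args
        self._session = requests.Session()  # shared across all page requests

    def _get_next_page(self, response: requests.Response) -> Optional[str]:
        """Return the next page's URL, or None once pagination is exhausted."""
        raise NotImplementedError

    # Returning an iterator here is what makes mypy complain in the PR, where the
    # overridden method's declared return type is a single requests.Response.
    def _load(self) -> Iterator[requests.Response]:
        response = self._session.request(self._method, self._url, **self._request_args)
        response.raise_for_status()
        yield response
        next_url = self._get_next_page(response)
        while next_url:
            response = self._session.get(next_url)
            response.raise_for_status()
            yield response
            next_url = self._get_next_page(response)


class PaginatedJSONAPIDataSet(PaginatedAPIDataSet):
    """JSON flavour: read the next-page link from a key in the JSON body."""

    def _get_next_page(self, response: requests.Response) -> Optional[str]:
        return response.json().get("next")  # placeholder key
```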
Checklist
- Added a description of this change in the RELEASE.md file