
Add a paginated (JSON) APIDataSet #1587

Closed
wants to merge 4 commits into from

Conversation

afuetterer
Copy link

@afuetterer afuetterer commented Jun 2, 2022

Description

This PR is my approach to implement a paginated (JSON) APIDataSet, as described in #1525. This is my first PR to this project.

Development notes

I added a PaginatedAPIDataSet and a PaginatedJSONAPIDataSet, which implement a _get_next_page() method.
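As a rough illustration of the intended split, a minimal sketch (simplified stand-ins, not the actual PR code; the real classes extend kedro's APIDataSet):

```python
from typing import Optional


class PaginatedAPIDataSetSketch:
    """Stand-in for the proposed base class (hypothetical, simplified).

    Subclasses decide how to locate the next page in a response.
    """

    def _get_next_page(self, response_json: dict) -> Optional[str]:
        raise NotImplementedError("subclasses implement pagination traversal")


class PaginatedJSONAPIDataSetSketch(PaginatedAPIDataSetSketch):
    """Follows a 'next page' link embedded in a JSON response body."""

    def __init__(self, next_key: str = "next") -> None:
        self.next_key = next_key

    def _get_next_page(self, response_json: dict) -> Optional[str]:
        # Returns None when there is no further page to fetch.
        return response_json.get(self.next_key)
```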

make test and make lint are failing locally due to files I did not touch. I am not sure what happened there. I committed with --no-verify to be able to push to GitHub.

I left a few TODOs in my code, as this is still a work in progress.

I would be very glad to get some feedback on these questions:

  • Adding PaginatedAPIDataSet and PaginatedJSONAPIDataSet allows other PaginatedAPIDataSets to be added, e.g. XML or other formats, or other ways to traverse pagination. What do you think? Is this overkill?
  • The PaginatedAPIDataSet needs to make multiple requests to an API, so I went for a requests.Session object and needed to change _execute_request().
  • The PaginatedAPIDataSet needs access to _request_args, which is changed during load(). I am not sure if this needs to be the same after calling load() once.
  • As you suggested, the PaginatedAPIDataSet yields requests.Response objects. This is caught by mypy, because the return type differs (requests.Response vs. Iterator[requests.Response]) in the overridden method. How should we deal with this?
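A minimal sketch of that mypy situation and a targeted ignore (placeholder classes to keep it self-contained, not the PR code):

```python
from typing import Iterator


class Response:
    """Placeholder for requests.Response, to keep the sketch self-contained."""


class Base:
    """Stand-in for APIDataSet."""

    def _execute_request(self) -> Response:
        return Response()


class Paginated(Base):
    # mypy reports: Return type "Iterator[Response]" of "_execute_request"
    # incompatible with return type "Response" in supertype "Base".
    # A narrow ignore silences exactly this error code:
    def _execute_request(self) -> Iterator[Response]:  # type: ignore[override]
        yield Response()
```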

I will of course add more tests to bring test coverage up to 100%.

Checklist

@merelcht
Copy link
Member

merelcht commented Jun 7, 2022

Thanks so much for opening this PR! Some initial thoughts from me on the questions you asked:

Adding PaginatedAPIDataSet and PaginatedJSONAPIDataSet allows other PaginatedAPIDataSets to be added, e.g. XML or other formats, or other ways to traverse pagination. What do you think? Is this overkill?

In theory I like the idea of having a base PaginatedAPIDataSet class that can be extended, but I do wonder how much this is actually needed by users. In your experience, do you see other formats being used as often as JSON?
In any case, I think if we decide to have both classes, it should be very clear that PaginatedAPIDataSet is just a base class and it needs to be extended before it can be used to load data.

The PaginatedAPIDataSet needs to make multiple requests to an API, so I went for a requests.Session object and needed to change _execute_request().

Seems reasonable to me 👍

The PaginatedAPIDataSet needs access to _request_args, which is changed during load(). I am not sure if this needs to be the same after calling load() once.

I'm not sure I understand your question here. Could you elaborate a bit more?

As you suggested, the PaginatedAPIDataSet yields requests.Response objects. This is caught by mypy, because the return type differs (requests.Response vs. Iterator[requests.Response]) in the overridden method. How should we deal with this?

I think it's fine to just ignore that error like you did with # type: ignore

@merelcht merelcht closed this Jun 7, 2022
@merelcht
Copy link
Member

merelcht commented Jun 7, 2022

Oops I didn't mean to close this at all!

@merelcht merelcht reopened this Jun 7, 2022
@merelcht merelcht requested review from merelcht and antonymilne June 7, 2022 12:10
@afuetterer
Copy link
Author

Thanks for your input. I will continue working on this PR this week.

Thanks so much for opening this PR! Some initial thoughts from me on the questions you asked:

Adding PaginatedAPIDataSet and PaginatedJSONAPIDataSet allows other PaginatedAPIDataSets to be added, e.g. XML or other formats, or other ways to traverse pagination. What do you think? Is this overkill?

In theory I like the idea of having a base PaginatedAPIDataSet class that can be extended, but I do wonder how much this is actually needed by users. In your experience, do you see other formats being used as often as JSON? In any case, I think if we decide to have both classes, it should be very clear that PaginatedAPIDataSet is just a base class and it needs to be extended before it can be used to load data.

This was just a suggestion. PaginatedJSONAPIDataSet inherits from APIDataSet and adds pagination traversal methods. I thought that there are other ways to have paginated responses, and maybe a middle-layer class could be useful.
There might be cases where you need to traverse e.g. XML responses or maybe just plain HTML pages with a "next page" button.

There are also other ways of doing the pagination itself. I added functionality to follow a "next page" link in a JSON response. But the pagination might be done in the URLs, like example.api.com?q=query&p1 -> p2 -> p3, ..., without an explicit "next page" link. Then the _get_next_page() method would need to deal with that instead of searching for a key in a JSON response.
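A sketch of what the URL-based variant of _get_next_page() might do (hypothetical helper name and page parameter, illustrative only):

```python
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse


def next_page_from_params(url: str, page_param: str = "p") -> str:
    """Increment a page number carried in the query string instead of
    following an explicit 'next page' link from the response body."""
    parts = urlparse(url)
    query = parse_qs(parts.query)
    current = int(query.get(page_param, ["1"])[0])
    query[page_param] = [str(current + 1)]
    # parse_qs returns lists of values; flatten for urlencode.
    flat = {key: values[0] for key, values in query.items()}
    return urlunparse(parts._replace(query=urlencode(flat)))
```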

The PaginatedAPIDataSet needs access to _request_args, which is changed during load(). I am not sure if this needs to be the same after calling load() once.

I'm not sure I understand your question here. Could you elaborate a bit more?

I will add a better explanation with an example soon. This is actually kind of related to the other PR #1445.

@afuetterer
Copy link
Author

In response to @AntonyMilneQB in #1525: You asked about the PR #1445.

In my opinion the APIDataSet's _request_args attribute is useful if you have just a single request. In the paginated API dataset some of the arguments can be reused for multiple requests, hence the usage of a requests.Session object.

I will add an update to my class this week. So far, I would like to add that, from _request_args, I think these arguments are needed in the session, i.e. for every request:

  • timeout
  • auth

These arguments are needed for the initial request only:

  • url
  • method
  • data
  • params
  • json

Every subsequent request should be a GET request, using the next page URL taken from the first response (if present).
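The split described above could look roughly like this (hypothetical helper names; timeout has no session-level setting in requests, so it is kept alongside the session and passed per request):

```python
import requests


def make_session(auth=None, timeout=30.0):
    """Session-level arguments are set once and reused for every request."""
    session = requests.Session()
    session.auth = auth
    return session, timeout


def paginate(session, timeout, url, method="GET", next_key="next", **request_args):
    """The first request uses the full argument set (method, data, params,
    json, ...); follow-up requests are plain GETs to the next page URL."""
    response = session.request(method, url, timeout=timeout, **request_args)
    yield response
    next_url = response.json().get(next_key)
    while next_url:
        response = session.get(next_url, timeout=timeout)
        yield response
        next_url = response.json().get(next_key)
```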

In addition to the discussion in #1445, I think it might be useful to have some of these arguments as load args. What do you think? This would be a breaking change, so I do not want to suggest it lightly.

Signed-off-by: Heinz-Alexander Fuetterer <fuetterh@posteo.de>
@afuetterer
Copy link
Author

Some more questions:

In the base APIDataSet, if there are no headers passed as arguments, the headers will be set to None.
Should the PaginatedAPIDataSet identify itself as part of kedro? It will possibly create a lot of requests.
Also, the headers should include "connection: keep-alive". I handled this by updating the default requests headers with the provided headers. It will identify as python-requests then. Is this alright?
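The header handling described above might be sketched like this (assuming requests' default headers, which already include keep-alive and the python-requests User-Agent):

```python
import requests


def build_headers(user_headers=None):
    """Start from requests' defaults ("Connection: keep-alive",
    "User-Agent: python-requests/...") and let user headers override them."""
    headers = requests.utils.default_headers()
    headers.update(user_headers or {})
    return headers
```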

Also, there might be a rate limit set up for an API. If the pagination exceeds the allowed rate, the API will respond with an HTTP 429 error.

I stumbled across this library, which might be useful for handling this: https://github.com/JWCook/requests-ratelimiter

The limit could then be configured in the PaginatedAPIDataSet. What do you think?

@merelcht
Copy link
Member

This was just a suggestion. PaginatedJSONAPIDataSet inherits from APIDataSet and adds pagination traversal methods. I thought that there are other ways to have paginated responses, and maybe a middle-layer class could be useful. There might be cases where you need to traverse e.g. XML responses or maybe just plain HTML pages with a "next page" button.

I think it's a good suggestion. Let's keep it like this and add a clarification in the docstring of PaginatedAPIDataSet to make it clear that the class must be extended.

There are also other ways of doing the pagination itself. I added functionality to follow a "next page" link in a JSON response. But the pagination might be done in the URLs, like example.api.com?q=query&p1 -> p2 -> p3, ..., without an explicit "next page" link. Then the _get_next_page() method would need to deal with that instead of searching for a key in a JSON response.

Personally, I think it's fine to just go with one way of pagination and users who find they need another way of paginating can add in that functionality. But of course, open to hear suggestions from others who have used the APIDataSet more in practice.

In addition to the discussion in #1445 I think it might be useful to have some of these arguments as load args. What do you think? This would be a breaking change and I do not want to suggest it easily.

I'm definitely in favour of adding some of those arguments as load_args, but at the same time it would be great to get most of this work in as a non-breaking change. We are soon going to be extracting all the datasets in a separate package so that we can do breaking changes to the datasets more easily.

@merelcht
Copy link
Member

Some more questions:

In the base APIDataSet, if there are no headers passed as arguments, the headers will be set to None. Should the PaginatedAPIDataSet identify itself as part of kedro? It will possibly create a lot of requests. Also, the headers should include "connection: keep-alive". I handled this by updating the default requests headers with the provided headers. It will identify as python-requests then. Is this alright?

I think it's fine to have the requests identify as python-requests, but I guess my counter question to you is, as a user would you find it easier to debug/read logs with these requests marked as python-requests or something like kedro-requests? Are there any other potential benefits to marking the requests as being part of Kedro?

Also, there might be a rate limit set up for an API. If the pagination exceeds the allowed rate, the API will respond with an HTTP 429 error.

I stumbled across this library, which might be useful handling this: https://github.com/JWCook/requests-ratelimiter

The limit could then be configured in the PaginatedAPIDataSet. What do you think?

That's a good question. The library looks okay, but it doesn't have a lot of activity: only 10 stars and just 1 active contributor. Ideally, we would find an alternative way to make this work. Can we handle it with try/catch logic? How many requests would be sent out when using this dataset? Is it realistic that it would actually go over the rate limit, or only if a very strict limit has been set?

@afuetterer
Copy link
Author

Thanks for the input. I will adapt the PR soon.

Regarding the user-agent: I guess the user-agent is not important for a kedro user or the logs generated by kedro. But for a service provider it might be interesting to know who is doing "all the requests" to their service.

Regarding rate limiting: It depends on which metric/quality measurement you value most. Is it GitHub stars, maturity, number of adopters, recent activity, test coverage? requests-ratelimiter is a wrapper around pyratelimiter. I am not sure I can (or should) try to do something similar with try/except clauses. A few candidates offer decorator solutions, with varying numbers of stars and recent activity:
https://www.githubcompare.com/tomasbasham/ratelimit+serpentai/requests-respectful+razerm/ratelimiter+vutran1710/pyratelimiter+jwcook/requests-ratelimiter

Regarding the number of requests: It really depends on the thing you want to "harvest". I am, for example, interested in metadata or research datasets accessible from a REST API. Depending on the query it can be a few requests, up to tens or hundreds. For example: if you are interested in metadata of Zenodo records, they allow 60 requests per minute. https://developers.zenodo.org/#rate-limiting
If your query's number of results exceeds 60 response pages, the API dataset should go to sleep for a while and then resume sending requests.
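A library-free sketch of that sleep-and-resume behaviour (hypothetical helper; the retry count and sleep interval are assumptions, and `send` stands in for whatever function actually issues the request):

```python
import time


def request_with_backoff(send, max_retries=3, sleep_seconds=60.0, sleep=time.sleep):
    """Call send() (a zero-argument function returning a response object);
    on HTTP 429, sleep and retry instead of failing immediately."""
    for attempt in range(max_retries + 1):
        response = send()
        if response.status_code != 429:
            return response
        if attempt < max_retries:
            sleep(sleep_seconds)
    raise RuntimeError("rate limit still exceeded after retries")
```

The `sleep` parameter is injectable so the behaviour can be tested without actually waiting.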

@antonymilne
Copy link
Contributor

Thank you for all your work on this @afuetterer! I'm going to have a proper look through and comment on it later this week 🙂

FYI very relevant question on the rate limiting - clearly it's an issue just using APIDataSet repeatedly itself. Maybe the easiest solution here is just to put a sleep into APIDataSet when it happens. It's maybe a bit trickier to deal with when the rate limit is hit by multiple different datasets loading in quick succession? https://stackoverflow.com/questions/72691571/rate-limiting-kedro-api-requests

Just thought I'd mention it here in case looking at a related case gives another perspective on this anyway.

@afuetterer
Copy link
Author

Thanks @AntonyMilneQB. I am waiting for PR #1633 to be accepted and will then adapt my subclass here. So any feedback will be appreciated, but this PR is still very much a work in progress.

I will not be able to work on this for the next two weeks. But I will pick it up after summer break.

Regarding the rate limiting when using an APIDataSet multiple times against the same API: I don't know how to tackle that elegantly. The decorator-based approaches in the libraries I suggested in my previous comment are applied to a function/method that sends requests. In our case it could be load() or _execute_request(). But the example on SO would not be handled by this, I guess. You would need to keep track of rate limiting one level higher, in the pipeline itself? Could this be stored in memory, as per "async rate limiting" approaches?

It is an interesting problem that was not on my mind. I was only thinking of a single PaginatedAPIDataSet doing the requests.

@merelcht merelcht mentioned this pull request Nov 7, 2022
@merelcht
Copy link
Member

merelcht commented Nov 7, 2022

Hi @afuetterer, do you still want to complete this PR? We'd like to get all PRs related to datasets merged soon, now that we're moving our datasets code to a different package (see our medium blog post for more details).

Otherwise we'll close this PR and ask you to re-open it on the new repo when it's ready.

@merelcht
Copy link
Member

Hi @afuetterer, we are releasing a new version of Kedro this week which will include a release of our new kedro-datasets repo. From then on all dataset changes need to happen on the new repository. I will close this PR here and when you're ready to continue, you can re-open it on our kedro-datasets repo. Thank you!

@merelcht merelcht closed this Nov 21, 2022
@afuetterer
Copy link
Author

Hi @merelcht, thank you. I will do that. Yay for the new release.

@afuetterer afuetterer changed the title [WIP] Add a paginated (JSON) APIDataSet Add a paginated (JSON) APIDataSet Nov 22, 2022