
Add a paginated (JSON) APIDataSet #1587

Closed
wants to merge 4 commits into from

Conversation

afuetterer
Copy link

@afuetterer afuetterer commented Jun 2, 2022

Description

This PR is my approach to implement a paginated (JSON) APIDataSet, as described in #1525. This is my first PR to this project.

Development notes

I added a PaginatedAPIDataSet and a PaginatedJSONAPIDataSet, which implement a _get_next_page() method.
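As a rough illustration of the intended split, a minimal sketch (simplified stand-ins, not the actual PR code; the real classes extend kedro's APIDataSet):

```python
from typing import Optional


class PaginatedAPIDataSetSketch:
    """Stand-in for the proposed base class (hypothetical, simplified).

    Subclasses decide how to locate the next page in a response.
    """

    def _get_next_page(self, response_json: dict) -> Optional[str]:
        raise NotImplementedError("subclasses implement pagination traversal")


class PaginatedJSONAPIDataSetSketch(PaginatedAPIDataSetSketch):
    """Follows a 'next page' link embedded in a JSON response body."""

    def __init__(self, next_key: str = "next") -> None:
        self.next_key = next_key

    def _get_next_page(self, response_json: dict) -> Optional[str]:
        # Returns None when there is no further page to fetch.
        return response_json.get(self.next_key)
```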

make test and make lint are failing locally due to files I did not touch. I am not sure what happened there. I committed with --no-verify to be able to push to GitHub.

I left a few TODOs in my code, as this is still a work in progress.

I would be very glad to get some feedback on these questions:

  • Adding PaginatedAPIDataSet and PaginatedJSONAPIDataSet allows other PaginatedAPIDataSets to be added, e.g. XML or other formats, or other ways to traverse pagination. What do you think? Is this overkill?
  • The PaginatedAPIDataSet needs to make multiple requests to an API, so I went for a requests.Session object and needed to change _execute_request().
  • The PaginatedAPIDataSet needs access to _request_args, which is changed during load(). I am not sure if this needs to be the same after calling load() once.
  • As you suggested, the PaginatedAPIDataSet yields requests.Response objects. This is caught by mypy, because the return type differs (requests.Response vs. Iterator[requests.Response]) in the overridden method. How should we deal with this?
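A minimal sketch of that mypy situation and a targeted ignore (placeholder classes to keep it self-contained, not the PR code):

```python
from typing import Iterator


class Response:
    """Placeholder for requests.Response, to keep the sketch self-contained."""


class Base:
    """Stand-in for APIDataSet."""

    def _execute_request(self) -> Response:
        return Response()


class Paginated(Base):
    # mypy reports: Return type "Iterator[Response]" of "_execute_request"
    # incompatible with return type "Response" in supertype "Base".
    # A narrow ignore silences exactly this error code:
    def _execute_request(self) -> Iterator[Response]:  # type: ignore[override]
        yield Response()
```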

I will of course add more tests to bring test coverage up to 100%.

Checklist

@merelcht
Copy link
Member

merelcht commented Jun 7, 2022

Thanks so much for opening this PR! Some initial thoughts from me on the questions you asked:

Adding PaginatedAPIDataSet and PaginatedJSONAPIDataSet allows other PaginatedAPIDataSets to be added, e.g. XML or other formats, or other ways to traverse pagination. What do you think? Is this overkill?

In theory I like the idea of having a base PaginatedAPIDataSet class that can be extended, but I do wonder how much this is actually needed by users. In your experience, do you see other formats being used as often as JSON?
In any case, I think if we decide to have both classes, it should be very clear that PaginatedAPIDataSet is just a base class and it needs to be extended before it can be used to load data.

The PaginatedAPIDataSet needs to make multiple requests to an API, so I went for a requests.Session object and needed to change _execute_request().

Seems reasonable to me 👍

The PaginatedAPIDataSet needs access to _request_args, which is changed during load(). I am not sure if this needs to be the same after calling load() once.

I'm not sure I understand your question here. Could you elaborate a bit more?

As you suggested, the PaginatedAPIDataSet yields requests.Response objects. This is caught by mypy, because the return type differs (requests.Response vs. Iterator[requests.Response]) in the overridden method. How should we deal with this?

I think it's fine to just ignore that error like you did with # type: ignore

@merelcht merelcht closed this Jun 7, 2022
@merelcht
Copy link
Member

merelcht commented Jun 7, 2022

Oops I didn't mean to close this at all!

@merelcht merelcht reopened this Jun 7, 2022
@merelcht merelcht requested review from merelcht and antonymilne June 7, 2022 12:10
@afuetterer
Copy link
Author

Thanks for your input. I will continue working on this PR this week.

Thanks so much for opening this PR! Some initial thoughts from me on the questions you asked:

Adding PaginatedAPIDataSet and PaginatedJSONAPIDataSet allows other PaginatedAPIDataSets to be added, e.g. XML or other formats, or other ways to traverse pagination. What do you think? Is this overkill?

In theory I like the idea of having a base PaginatedAPIDataSet class that can be extended, but I do wonder how much this is actually needed by users. In your experience, do you see other formats being used as often as JSON? In any case, I think if we decide to have both classes, it should be very clear that PaginatedAPIDataSet is just a base class and it needs to be extended before it can be used to load data.

This was just a suggestion. PaginatedJSONAPIDataSet inherits from APIDataSet and adds pagination traversal methods. I thought that there are other ways to have paginated responses, and maybe a middle-layer class could be useful.
There might be cases where you need to traverse e.g. XML responses or maybe just plain HTML pages with a "next page" button.

There are also other ways of doing the pagination itself. I added functionality to follow a "next page" link in a JSON response. But the pagination might be done in the URLs, like example.api.com?q=query&p1 -> p2 -> p3, ..., without an explicit "next page" link. Then the _get_next_page() method would need to deal with that instead of searching for a key in a JSON response.
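A sketch of what the URL-based variant of _get_next_page() might do (hypothetical helper name and page parameter, illustrative only):

```python
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse


def next_page_from_params(url: str, page_param: str = "p") -> str:
    """Increment a page number carried in the query string instead of
    following an explicit 'next page' link from the response body."""
    parts = urlparse(url)
    query = parse_qs(parts.query)
    current = int(query.get(page_param, ["1"])[0])
    query[page_param] = [str(current + 1)]
    # parse_qs returns lists of values; flatten for urlencode.
    flat = {key: values[0] for key, values in query.items()}
    return urlunparse(parts._replace(query=urlencode(flat)))
```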

The PaginatedAPIDataSet needs access to _request_args, which is changed during load(). I am not sure if this needs to be the same after calling load() once.

I'm not sure I understand your question here. Could you elaborate a bit more?

I will add a better explanation with an example soon. This is actually kind of related to the other PR #1445.

@afuetterer
Copy link
Author

In response to @AntonyMilneQB in #1525: You asked about the PR #1445.

In my opinion the APIDataSet's _request_args attribute is useful if you have just a single request. In the paginated API dataset some of the arguments can be reused for multiple requests, hence the usage of a requests.Session object.

I will add an update to my class this week. So far, I would like to add that, from _request_args, I think these arguments are needed in the session, i.e. for every request:

  • timeout
  • auth

These arguments are needed for the initial request only:

  • url
  • method
  • data
  • params
  • json

Every subsequent request should be a GET request, using the next page URL taken from the first response (if present).
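The split described above could look roughly like this (hypothetical helper names; timeout has no session-level setting in requests, so it is kept alongside the session and passed per request):

```python
import requests


def make_session(auth=None, timeout=30.0):
    """Session-level arguments are set once and reused for every request."""
    session = requests.Session()
    session.auth = auth
    return session, timeout


def paginate(session, timeout, url, method="GET", next_key="next", **request_args):
    """The first request uses the full argument set (method, data, params,
    json, ...); follow-up requests are plain GETs to the next page URL."""
    response = session.request(method, url, timeout=timeout, **request_args)
    yield response
    next_url = response.json().get(next_key)
    while next_url:
        response = session.get(next_url, timeout=timeout)
        yield response
        next_url = response.json().get(next_key)
```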

In addition to the discussion in #1445, I think it might be useful to have some of these arguments as load args. What do you think? This would be a breaking change, so I do not want to suggest it lightly.

Signed-off-by: Heinz-Alexander Fuetterer <fuetterh@posteo.de>
@afuetterer
Copy link
Author

Some more questions:

In the base APIDataSet, if there are no headers passed as arguments, the headers will be set to None.
Should the PaginatedAPIDataSet identify itself as part of kedro? It will possibly create a lot of requests.
Also, the headers should include "connection: keep-alive". I handled this by updating the default requests headers with the provided headers. It will identify as python-requests then. Is this alright?
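The header handling described above might be sketched like this (assuming requests' default headers, which already include keep-alive and the python-requests User-Agent):

```python
import requests


def build_headers(user_headers=None):
    """Start from requests' defaults ("Connection: keep-alive",
    "User-Agent: python-requests/...") and let user headers override them."""
    headers = requests.utils.default_headers()
    headers.update(user_headers or {})
    return headers
```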

Also, there might be a rate limit set up for an API. If the pagination exceeds the allowed rate, the API will respond with an HTTP 429 error.

I stumbled across this library, which might be useful for handling this: https://github.com/JWCook/requests-ratelimiter

The limit could then be configured in the PaginatedAPIDataSet. What do you think?

@merelcht
Copy link
Member

This was just a suggestion. PaginatedJSONAPIDataSet inherits from APIDataSet and adds pagination traversal methods. I thought that there are other ways to have paginated responses, and maybe a middle-layer class could be useful. There might be cases where you need to traverse e.g. XML responses or maybe just plain HTML pages with a "next page" button.

I think it's a good suggestion. Let's keep it like this and add a clarification in the docstring of PaginatedAPIDataSet to make it clear that the class must be extended.

There are also other ways of doing the pagination itself. I added functionality to follow a "next page" link in a JSON response. But the pagination might be done in the URLs, like example.api.com?q=query&p1 -> p2 -> p3, ..., without an explicit "next page" link. Then the _get_next_page() method would need to deal with that instead of searching for a key in a JSON response.

Personally, I think it's fine to just go with one way of pagination and users who find they need another way of paginating can add in that functionality. But of course, open to hear suggestions from others who have used the APIDataSet more in practice.

In addition to the discussion in #1445 I think it might be useful to have some of these arguments as load args. What do you think? This would be a breaking change and I do not want to suggest it easily.

I'm definitely in favour of adding some of those arguments as load_args, but at the same time it would be great to get most of this work in as a non-breaking change. We are soon going to be extracting all the datasets in a separate package so that we can do breaking changes to the datasets more easily.

@merelcht
Copy link
Member

Some more questions:

In the base APIDataSet, if there are no headers passed as arguments, the headers will be set to None. Should the PaginatedAPIDataSet identify itself as part of kedro? It will possibly create a lot of requests. Also, the headers should include "connection: keep-alive". I handled this by updating the default requests headers with the provided headers. It will identify as python-requests then. Is this alright?

I think it's fine to have the requests identify as python-requests, but I guess my counter question to you is, as a user would you find it easier to debug/read logs with these requests marked as python-requests or something like kedro-requests? Are there any other potential benefits to marking the requests as being part of Kedro?

Also, there might be a rate limit set up for an API. If the pagination exceeds the allowed rate, the API will respond with an HTTP 429 error.

I stumbled across this library, which might be useful handling this: https://github.com/JWCook/requests-ratelimiter

The limit could then be configured in the PaginatedAPIDataSet. What do you think?

That's a good question. The library looks okay, but it doesn't have a lot of activity: only 10 stars and just 1 active contributor. Ideally, we would find an alternative way to make this work. Can we handle it with try/catch logic? How many requests would be sent out when using this dataset? Is it realistic that it would actually go over the rate limit, or only if a very strict limit has been set?

@afuetterer
Copy link
Author

Thanks for the input. I will adapt the PR soon.

Regarding the user-agent: I guess the user-agent is not important for a kedro user or the logs generated by kedro. But for a service provider it might be interesting to know who is doing "all the requests" to their service.

Regarding rate limiting: It depends on which metric/quality measurement you value most. Is it GitHub stars, maturity, number of adopters, recent activity, test coverage? requests-ratelimiter is a wrapper around pyratelimiter. I am not sure I can (or should) try to do something similar with try/except clauses. A few candidates offer decorator solutions, with varying numbers of stars and recent activity:
https://www.githubcompare.com/tomasbasham/ratelimit+serpentai/requests-respectful+razerm/ratelimiter+vutran1710/pyratelimiter+jwcook/requests-ratelimiter

Regarding the number of requests: It really depends on the thing you want to "harvest". I am, for example, interested in metadata or research datasets accessible from a REST API. Depending on the query it can be a few requests, up to tens or hundreds. For example: if you are interested in metadata of Zenodo records, they allow 60 requests per minute. https://developers.zenodo.org/#rate-limiting
If your query's number of results exceeds 60 response pages, the API dataset should go to sleep for a while and then resume sending requests.
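A library-free sketch of that sleep-and-resume behaviour (hypothetical helper; the retry count and sleep interval are assumptions, and `send` stands in for whatever function actually issues the request):

```python
import time


def request_with_backoff(send, max_retries=3, sleep_seconds=60.0, sleep=time.sleep):
    """Call send() (a zero-argument function returning a response object);
    on HTTP 429, sleep and retry instead of failing immediately."""
    for attempt in range(max_retries + 1):
        response = send()
        if response.status_code != 429:
            return response
        if attempt < max_retries:
            sleep(sleep_seconds)
    raise RuntimeError("rate limit still exceeded after retries")
```

The `sleep` parameter is injectable so the behaviour can be tested without actually waiting.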

@antonymilne
Copy link
Contributor

Thank you for all your work on this @afuetterer! I'm going to have a proper look through and comment on it later this week 🙂

FYI very relevant question on the rate limiting - clearly it's an issue just using APIDataSet repeatedly itself. Maybe the easiest solution here is just to put a sleep into APIDataSet when it happens. It's maybe a bit trickier to deal with when the rate limit is hit by multiple different datasets loading in quick succession? https://stackoverflow.com/questions/72691571/rate-limiting-kedro-api-requests

Just thought I'd mention it here in case looking at a related case gives another perspective on this anyway.

@afuetterer
Copy link
Author

Thanks @AntonyMilneQB. I am waiting for PR #1633 to be accepted and will then adapt my subclass here. So any feedback will be appreciated, but this PR is still very much a work in progress.

I will not be able to work on this for the next two weeks. But I will pick it up after summer break.

Regarding the rate limiting when using an APIDataSet multiple times against the same API: I don't know how to tackle that elegantly. The decorator-based approaches in the libraries I suggested in my previous comment are applied to a function/method that sends requests. In our case it could be load() or _execute_request(). But the example on SO would not be handled by this, I guess. You would need to keep track of rate limiting one level higher, in the pipeline itself? Could this be stored in memory, as per "async rate limiting" approaches?

It is an interesting problem that was not on my mind. I was only thinking of a single PaginatedAPIDataSet doing the requests.

@merelcht merelcht mentioned this pull request Nov 7, 2022
@merelcht
Copy link
Member

merelcht commented Nov 7, 2022

Hi @afuetterer, do you still want to complete this PR? We'd like to get all PRs related to datasets merged soon, now that we're moving our datasets code to a different package (see our medium blog post for more details).

Otherwise we'll close this PR and ask you to re-open it on the new repo when it's ready.

@merelcht
Copy link
Member

Hi @afuetterer, we are releasing a new version of Kedro this week which will include a release of our new kedro-datasets repo. From then on all dataset changes need to happen on the new repository. I will close this PR here and when you're ready to continue, you can re-open it on our kedro-datasets repo. Thank you!

@merelcht merelcht closed this Nov 21, 2022
@afuetterer
Copy link
Author

Hi @merelcht, thank you. I will do that. Yay for the new release.

@afuetterer afuetterer changed the title [WIP] Add a paginated (JSON) APIDataSet Add a paginated (JSON) APIDataSet Nov 22, 2022