
Add a paginated (JSON) APIDataSet #1525

Closed
afuetterer opened this issue May 11, 2022 · 11 comments
Labels
Community (Issue/PR opened by the open-source community), Issue: Feature Request (New feature or improvement to existing feature)

Comments

@afuetterer

Description

The APIDataSet is used to communicate with APIs via the requests library. The _load() method returns a requests.Response object to be handled in a connected node. Many APIs return paginated responses, with (standardized) links for traversing the full list of results. The paginated responses could contain JSON, XML or other formats.
A PaginatedAPIDataSet or PaginatedJSONAPIDataSet class could be a solution for this. Such a class could be configured with the keys pointing to the "next page" links and would then traverse these links when loading.
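
For illustration, a paginated JSON response often has roughly the following shape (a made-up example; the key names "next" and "results" vary between APIs, which is why they would need to be configurable on the dataset):

# Hypothetical paginated JSON response; field names differ between APIs.
page = {
    "count": 1126,
    "next": "https://example.org/api/items?offset=500&limit=500",
    "previous": None,
    "results": [{"name": "item-1"}, {"name": "item-2"}],
}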

Context

This dataset is needed when an API response exceeds the allowed batch size of a single response: if you want to access all items/results from a given API call, you need to traverse the pagination links. This dataset class could be useful for other users of the library as well.

I implemented a few dummy solutions of the desired behavior, but I wanted to discuss this here first before submitting a PR. Should such a dataset handle the traversal of the pagination and return response objects, or go even further and return the items/results contained in an API response? In my use case the API response contains a list of "items" that I am interested in; I don't really need the response object itself. But the response object might be useful for other use cases. Should the dataset return or yield the paginated responses?

Is anybody else in need of a paginated (JSON) APIDataSet?

Possible Implementation

This is a working example, although it of course still needs some work. I override the _execute_request() method, but only change the way request_args is accessed, which is probably not a good idea.
My use case is to get the items in JSON API responses directly. I already posted a similar implementation to Stack Overflow.

import copy
import logging
import time
from typing import Any, Dict, Iterable, List, Union

import dpath.util
import requests
from kedro.extras.datasets.api import APIDataSet
from kedro.io.core import DataSetError
from requests.auth import AuthBase

log = logging.getLogger(__name__)


class PaginatedJSONAPIDataSet(APIDataSet):
    def __init__(
        self,
        url: str,
        method: str = "GET",
        data: Any = None,
        params: Dict[str, Any] = None,
        headers: Dict[str, Any] = None,
        auth: Union[Iterable[str], AuthBase] = None,
        json: Union[List, Dict[str, Any]] = None,
        timeout: int = 60,
        credentials: Union[Iterable[str], AuthBase] = None,
        # mandatory?
        path_to_next_page: str = None,
        # multiple keys are possible to access the next link in nested JSON;
        # separate them with "/", like "key1/key2"
        path_to_items: str = None,
    ):
        super().__init__(
            url, method, data, params, headers, auth, json, timeout, credentials
        )
        self.path_to_next_page = path_to_next_page
        self.path_to_items = path_to_items

    def _get_next_page(self, response):  # returns link or none
        next_page = None
        if dpath.util.search(response.json(), self.path_to_next_page):
            next_page = dpath.util.get(response.json(), self.path_to_next_page)
        return next_page

    def _find_items(self, response):
        return dpath.util.get(response.json(), self.path_to_items)

    def _execute_request(self, request_args) -> requests.Response:
        try:
            response = requests.request(**request_args)
            response.raise_for_status()
        except requests.exceptions.HTTPError as exc:
            raise DataSetError("Failed to fetch data", exc) from exc
        except OSError as exc:
            raise DataSetError("Failed to connect to the remote server") from exc
        return response

    def _load(self):
        return self._get_items()

    def _get_items(self):
        request_args = copy.deepcopy(self._request_args)

        # as long as we find a next link, keep on going
        while request_args["url"]:
            # log.info(request_args)
            response = self._execute_request(request_args)

            for item in self._find_items(response):
                yield item

            request_args["url"] = self._get_next_page(response)
            # TODO, remove args after first hit, args are part of url then
            # otherwise it creates a longer and longer url
            # first call: url&a1=1&a2=2
            # second call: url&a1=1&a2=2&a1=1&a2=2
            request_args.pop("params", None)
            time.sleep(1)


# toy example with a paginated API, to demonstrate pagination traversal
dataset = PaginatedJSONAPIDataSet(
    url="https://pokeapi.co/api/v2/pokemon",
    path_to_next_page="next",
    path_to_items="results",
    params={
        "limit": 500
    }
)
data = dataset.load()
print(type(data)) # <class 'generator'>
items = list(data)
print(len(items)) # 1126
print(items[0]) # {'name': 'bulbasaur', 'url': 'https://pokeapi.co/api/v2/pokemon/1/'}

# equivalent data catalog entry (catalog.yml)
toy_example:
  type: <location>PaginatedJSONAPIDataSet
  url: https://pokeapi.co/api/v2/pokemon
  path_to_next_page: next
  path_to_items: results
  params:
    limit: 500
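
For completeness, a node consuming this dataset would simply receive the generator; a minimal sketch, assuming the catalog entry above is used (the node function and output names are made up):

from kedro.pipeline import Pipeline, node


def collect_items(paginated_items):
    # paginated_items is the generator yielded by PaginatedJSONAPIDataSet._load()
    return list(paginated_items)


pipeline = Pipeline([
    node(collect_items, inputs="toy_example", outputs="all_items"),
])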

Thoughts

  • There needs to be a way to handle rate limiting as well (a rough sketch follows after this list).
  • Where would such a class be located, in the api package?
  • The __init__() method gets more and more parameters; how could this be tackled?
  • I used a third-party library to access a (possibly nested) key in a dictionary; this can of course be done in pure Python as well.
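
A rough sketch of what rate-limit handling could look like, assuming the server signals limits with HTTP 429 and an optional Retry-After header (the helper name, retry count and default wait below are made up):

import time

import requests


def request_with_backoff(request_args, max_retries=3, default_wait=5):
    # Hypothetical helper: retry a request when the server answers 429,
    # honouring Retry-After when present, otherwise sleeping a default interval.
    for _ in range(max_retries):
        response = requests.request(**request_args)
        if response.status_code != 429:
            response.raise_for_status()
            return response
        wait = int(response.headers.get("Retry-After", default_wait))
        time.sleep(wait)
    response.raise_for_status()
    return response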

Any feedback is greatly appreciated.

@afuetterer afuetterer added the Issue: Feature Request New feature or improvement to existing feature label May 11, 2022
@merelcht merelcht added the Community Issue/PR opened by the open-source community label May 16, 2022
@merelcht (Member)

Thanks for the thorough explanation @afuetterer ! We'd be very happy for you to create a PR for this 🙂

@afuetterer (Author)

Thanks for your reply @MerelTheisenQB. I have formulated some questions above that I am not sure about. Should I first make an implementation proposal in a PR, and will the discussion then move to the PR?

@merelcht (Member)

Hi @afuetterer,

To answer some of your questions:

Where would such a class be located, in the api package?

Yes, I'd add this to the kedro.extras.datasets.api package.

the init () method gets more and more parameters, how could this be tackled?

I understand what you mean here, but I don't see it as a very urgent problem. Other dataset types have fewer parameters because they can be grouped into load_args and save_args, but it doesn't look like that would be a very useful way of doing things with this dataset, which uses the requests library. I would just leave it like this. If in future we suddenly get tons more APIDataSet-like datasets with unmaintainable numbers of parameters, we can have a look at it.

I used a third party library to access a (possible nested) key in a dictionary, this can of course be done in pure Python as well

If possible, I'd suggest doing this in pure Python. It's okay to use third-party libraries, but it would add extra requirements for this dataset (e.g. https://github.com/kedro-org/kedro/blob/main/setup.py#L54), so it's better to avoid it if possible.
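
For illustration, a nested lookup in the "key1/key2" style can be written without extra dependencies; a minimal sketch (the helper name get_nested is made up):

def get_nested(data, path, sep="/"):
    # Pure-Python stand-in for dpath.util.get: walk a nested dict
    # following the keys in `path`, e.g. "key1/key2"; return None if a key is missing.
    for key in path.split(sep):
        if not isinstance(data, dict) or key not in data:
            return None
        data = data[key]
    return data


# get_nested({"key1": {"key2": "value"}}, "key1/key2")  ->  "value"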

Should such a dataset handle the traversal of the pagination and return response objects, or go even further and return the items/results contained in an API response?

Looking at the APIDataSet we currently have, it returns the response object itself. I'd suggest keeping it consistent and returning a response object here as well. But I'd like to hear what others think, especially people who have used Kedro in practice more (cc: @noklam @datajoely @AntonyMilneQB).

@datajoely (Contributor)

I'd just clarify 'pure Python' + requests, since requests is already expected in the existing API dataset.

@datajoely (Contributor)

Also, JMESPath is already part of Kedro and can do similar things...

@afuetterer (Author)

Also, JMESPath is already part of Kedro and can do similar things...

Thank you @datajoely for pointing this out. I didn't realize this was already a requirement of Kedro. It performs the same operations on nested JSON paths. I will use that library instead and remove dpath.
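
For reference, the equivalent lookups with jmespath use dotted expressions; a small sketch on made-up response data:

import jmespath

# Made-up response body; jmespath expressions use "." instead of "/".
page = {"links": {"next": "https://example.org/api/items?page=2"}, "results": [1, 2, 3]}

next_page = jmespath.search("links.next", page)  # "https://example.org/api/items?page=2"
items = jmespath.search("results", page)         # [1, 2, 3]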

@datajoely (Contributor)

No thank you for your contribution - awesome work and a great feature for the Kedro community 🙏

@antonymilne (Contributor)

antonymilne commented May 23, 2022

I really like this 👍 But I've also never used APIDataSet in practice, so this would definitely benefit from input from someone who knows it, and the typical ways web APIs work, better. Maybe @limdauto @noklam @tynandebold? Lim made this useful comment on the PR where APIDataSet was first added, anticipating exactly the sort of pagination you're doing here:

IMHO, a generic API dataset seems too generic to be useful out of the box to many users without further modification. As an example, in my experience, most APIs are paginated and they are paginated differently. Some also follow a hypermedia format where you have to follow links indefinitely. So users will most certainly need to add further modification to this dataset if they want to use it. I'm just wondering whether it's worth adding a generic, base dataset in this case.

If possible I think the implementation should be as generic as possible so that other people can easily use it across a variety of cases. In practice, this means:

  • as per @MerelTheisenQB, return the response object rather than the JSON (also for consistency with APIDataSet)
  • accommodate whichever common schemes there are for pagination (unfortunately I don't know what these are at all - @limdauto? @tynandebold? @studioswong). E.g. your path_to_next_page seems like a good argument since it doesn't assume the field is called "next" (would that be a sensible default value? I don't know). But I wonder if there are other common pagination schemes that might not be handled by your dataset
  • not introducing too much "custom syntax", e.g. "key1/key2" for path_to_items, unless it's already well-established or very obvious

The above isn't to discourage you at all! Just to give some food for thought, and unfortunately I don't really know enough about web APIs to offer too many suggestions here.

@afuetterer (Author)

Thank you all for your valuable feedback. I will start working on a PR this weekend.

I will remove dpath, add jmespath instead, and yield requests.Response objects instead of "results" or "items" from inside the returned JSON. This will be more similar to the behavior of the original APIDataSet.

@tynandebold (Member)

Following on from @AntonyMilneQB's callout from before, it seems like @afuetterer is proposing something like offset pagination, which is one of the most common types. The other would be cursor pagination. Slack's engineering blog has a nice writeup of both.

I don't know the exact context of how this pagination via Kedro would be used in practice, so all I'll say here is that out of the two most common pagination strategies, this one is often a suitable choice. It's not without drawbacks, though: if items are frequently being written to the dataset, page results can be irregular, and it doesn't scale so well for very large datasets.
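
To make the two strategies concrete, a small sketch of what the requests typically look like (purely illustrative URLs and parameter names; real APIs differ):

import requests

# Offset pagination: the client asks for a window by position.
offset_page = requests.get(
    "https://example.org/api/items", params={"offset": 500, "limit": 500}
)

# Cursor pagination: the server returns an opaque token pointing at the next page.
first_page = requests.get("https://example.org/api/items", params={"limit": 500})
cursor = first_page.json().get("next_cursor")  # hypothetical field name
second_page = requests.get(
    "https://example.org/api/items", params={"cursor": cursor, "limit": 500}
)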

@AhdraMeraliQB (Contributor)

Closing this issue as our datasets now live in the kedro-plugins repository. Feel free to re-open this issue / migrate the related PR there 😄 .
