
Add a paginated (JSON) APIDataSet #1525

Closed
afuetterer opened this issue May 11, 2022 · 11 comments
Labels
Community (Issue/PR opened by the open-source community), Issue: Feature Request (New feature or improvement to existing feature)

Comments

@afuetterer

Description

The APIDataSet is used to communicate with APIs via the requests library. The _load() method returns a requests.Response object to be handled in a connected node. Many APIs return paginated responses, with (standardized) links for traversing the full list of results. The paginated responses could contain JSON, XML or other formats.
A PaginatedAPIDataSet or PaginatedJSONAPIDataSet class could be a solution for this. Such a class could be configured with the keys pointing to the "next page" links and would then traverse these links when loading.
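
For illustration, a paginated JSON response often has roughly the following shape (a made-up example; the key names "next" and "results" vary between APIs, which is why they would need to be configurable on the dataset):

# Hypothetical paginated JSON response; field names differ between APIs.
page = {
    "count": 1126,
    "next": "https://example.org/api/items?offset=500&limit=500",
    "previous": None,
    "results": [{"name": "item-1"}, {"name": "item-2"}],
}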

Context

This dataset is needed when an API response exceeds the allowed batch size of a single response: if you want to access all items/results from a given API call, you need to traverse the pagination links. This dataset class could be useful for other users of the library as well.

I implemented a few dummy solutions of the desired behavior, but I wanted to discuss this here first before submitting a PR. Should such a dataset handle the traversal of the pagination and return response objects, or go even further and return the items/results contained in an API response? In my use case the API response contains a list of "items" that I am interested in; I don't really need the response object itself. But the response object might be useful for other use cases. Should the dataset return or yield the paginated responses?

Is anybody else in need of a paginated (JSON) APIDataSet?

Possible Implementation

This is a working example, although it of course still needs some work. I override the _execute_request() method, but only change the way request_args is accessed, which is probably not a good idea.
My use case is to get the items in JSON API responses directly. I already posted a similar implementation to Stack Overflow.

import copy
import logging
import time
from typing import Any, Dict, Iterable, List, Union

import dpath.util
import requests
from kedro.extras.datasets.api import APIDataSet
from kedro.io.core import DataSetError
from requests.auth import AuthBase

log = logging.getLogger(__name__)


class PaginatedJSONAPIDataSet(APIDataSet):
    def __init__(
        self,
        url: str,
        method: str = "GET",
        data: Any = None,
        params: Dict[str, Any] = None,
        headers: Dict[str, Any] = None,
        auth: Union[Iterable[str], AuthBase] = None,
        json: Union[List, Dict[str, Any]] = None,
        timeout: int = 60,
        credentials: Union[Iterable[str], AuthBase] = None,
        # mandatory?
        path_to_next_page: str = None,
        # multiple keys are possible to access the next link in nested JSON;
        # separate them with "/", like "key1/key2"
        path_to_items: str = None,
    ):
        super().__init__(
            url, method, data, params, headers, auth, json, timeout, credentials
        )
        self.path_to_next_page = path_to_next_page
        self.path_to_items = path_to_items

    def _get_next_page(self, response):  # returns link or none
        next_page = None
        if dpath.util.search(response.json(), self.path_to_next_page):
            next_page = dpath.util.get(response.json(), self.path_to_next_page)
        return next_page

    def _find_items(self, response):
        return dpath.util.get(response.json(), self.path_to_items)

    def _execute_request(self, request_args) -> requests.Response:
        try:
            response = requests.request(**request_args)
            response.raise_for_status()
        except requests.exceptions.HTTPError as exc:
            raise DataSetError("Failed to fetch data", exc) from exc
        except OSError as exc:
            raise DataSetError("Failed to connect to the remote server") from exc
        return response

    def _load(self):
        return self._get_items()

    def _get_items(self):
        request_args = copy.deepcopy(self._request_args)

        # as long as we find a next link, keep on going
        while request_args["url"]:
            # log.info(request_args)
            response = self._execute_request(request_args)

            for item in self._find_items(response):
                yield item

            request_args["url"] = self._get_next_page(response)
            # TODO, remove args after first hit, args are part of url then
            # otherwise it creates a longer and longer url
            # first call: url&a1=1&a2=2
            # second call: url&a1=1&a2=2&a1=1&a2=2
            request_args.pop("params", None)
            time.sleep(1)


# toy example with a paginated API, to demonstrate pagination traversal
dataset = PaginatedJSONAPIDataSet(
    url="https://pokeapi.co/api/v2/pokemon",
    path_to_next_page="next",
    path_to_items="results",
    params={
        "limit": 500
    }
)
data = dataset.load()
print(type(data)) # <class 'generator'>
items = list(data)
print(len(items)) # 1126
print(items[0]) # {'name': 'bulbasaur', 'url': 'https://pokeapi.co/api/v2/pokemon/1/'}

# equivalent data catalog entry (catalog.yml)
toy_example:
  type: <location>PaginatedJSONAPIDataSet
  url: https://pokeapi.co/api/v2/pokemon
  path_to_next_page: next
  path_to_items: results
  params:
    limit: 500
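
For completeness, a node consuming this dataset would simply receive the generator; a minimal sketch, assuming the catalog entry above is used (the node function and output names are made up):

from kedro.pipeline import Pipeline, node


def collect_items(paginated_items):
    # paginated_items is the generator yielded by PaginatedJSONAPIDataSet._load()
    return list(paginated_items)


pipeline = Pipeline([
    node(collect_items, inputs="toy_example", outputs="all_items"),
])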

Thoughts

  • There needs to be a way to handle rate limiting as well (a rough sketch follows after this list).
  • Where would such a class be located, in the api package?
  • The __init__() method gets more and more parameters; how could this be tackled?
  • I used a third-party library to access a (possibly nested) key in a dictionary; this can of course be done in pure Python as well.
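
A rough sketch of what rate-limit handling could look like, assuming the server signals limits with HTTP 429 and an optional Retry-After header (the helper name, retry count and default wait below are made up):

import time

import requests


def request_with_backoff(request_args, max_retries=3, default_wait=5):
    # Hypothetical helper: retry a request when the server answers 429,
    # honouring Retry-After when present, otherwise sleeping a default interval.
    for _ in range(max_retries):
        response = requests.request(**request_args)
        if response.status_code != 429:
            response.raise_for_status()
            return response
        wait = int(response.headers.get("Retry-After", default_wait))
        time.sleep(wait)
    response.raise_for_status()
    return response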

Any feedback is greatly appreciated.

@afuetterer afuetterer added the Issue: Feature Request New feature or improvement to existing feature label May 11, 2022
@merelcht merelcht added the Community Issue/PR opened by the open-source community label May 16, 2022
@merelcht (Member)

Thanks for the thorough explanation @afuetterer ! We'd be very happy for you to create a PR for this 🙂

@afuetterer (Author)

Thanks for your reply @MerelTheisenQB. I have formulated some questions above that I am not sure about. Should I first make an implementation proposal in a PR, and will the discussion then move to the PR?

@merelcht (Member)

Hi @afuetterer,

To answer some of your questions:

Where would such a class be located, in the api package?

Yes, I'd add this to the kedro.extras.datasets.api package.

the init () method gets more and more parameters, how could this be tackled?

I understand what you mean here, but I don't see it as a very urgent problem. Other dataset types have fewer parameters because they can be grouped into load_args and save_args, but it doesn't look like that would be a very useful way of doing things with this dataset, which uses the requests library. I would just leave it like this. If in future we suddenly get tons more APIDataSet-like datasets with unmaintainable numbers of parameters, we can have a look at it.

I used a third party library to access a (possible nested) key in a dictionary, this can of course be done in pure Python as well

If possible, I'd suggest doing this in pure Python. It's okay to use third-party libraries, but it would add extra requirements for this dataset (e.g. https://github.com/kedro-org/kedro/blob/main/setup.py#L54), so it's better to avoid it if possible.
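
For illustration, a nested lookup in the "key1/key2" style can be written without extra dependencies; a minimal sketch (the helper name get_nested is made up):

def get_nested(data, path, sep="/"):
    # Pure-Python stand-in for dpath.util.get: walk a nested dict
    # following the keys in `path`, e.g. "key1/key2"; return None if a key is missing.
    for key in path.split(sep):
        if not isinstance(data, dict) or key not in data:
            return None
        data = data[key]
    return data


# get_nested({"key1": {"key2": "value"}}, "key1/key2")  ->  "value"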

Should such a dataset handle the traversal of the pagination and return response objects, or go even further and return the items/results contained in an API response?

Looking at the APIDataSet we currently have, it returns the response object itself. I'd suggest keeping it consistent and returning a response object here as well. But I'd like to hear what others think, especially people who have used Kedro in practice more (cc: @noklam @datajoely @AntonyMilneQB).

@datajoely (Contributor)

I'd just clarify 'pure Python' + requests, since requests is already expected in the existing API dataset.

@datajoely (Contributor)

Also, JMESPath is already part of Kedro and can do similar things...

@afuetterer (Author)

Also, JMESPath is already part of Kedro and can do similar things...

Thank you @datajoely for pointing this out. I didn't realize this was already a requirement of Kedro. It performs the same operations on nested JSON paths. I will use that library instead and remove dpath.
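
For reference, the equivalent lookups with jmespath use dotted expressions; a small sketch on made-up response data:

import jmespath

# Made-up response body; jmespath expressions use "." instead of "/".
page = {"links": {"next": "https://example.org/api/items?page=2"}, "results": [1, 2, 3]}

next_page = jmespath.search("links.next", page)  # "https://example.org/api/items?page=2"
items = jmespath.search("results", page)         # [1, 2, 3]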

@datajoely (Contributor)

No thank you for your contribution - awesome work and a great feature for the Kedro community 🙏

@antonymilne (Contributor)

antonymilne commented May 23, 2022

I really like this 👍 But I've also never used APIDataSet in practice, so this would definitely benefit from input from someone who knows it, and the typical ways web APIs work, better. Maybe @limdauto @noklam @tynandebold? Lim made this useful comment on the PR where APIDataSet was first added, anticipating exactly the sort of pagination you're doing here:

IMHO, a generic API dataset seems too generic to be useful out of the box to many users without further modification. As an example, in my experience, most APIs are paginated and they are paginated differently. Some also follow a hypermedia format where you have to follow links indefinitely. So users will most certainly need to add further modification to this dataset if they want to use it. I'm just wondering whether it's worth adding a generic, base dataset in this case.

If possible I think the implementation should be as generic as possible so that other people can easily use it across a variety of cases. In practice, this means:

  • as per @MerelTheisenQB, return the response object rather than the JSON (also for consistency with APIDataSet)
  • accommodate whichever common schemes there are for pagination (unfortunately I don't know what these are at all - @limdauto? @tynandebold? @studioswong). E.g. your path_to_next_page seems like a good argument since it doesn't assume the field is called "next" (would that be a sensible default value? I don't know). But I wonder if there are other common pagination schemes that might not be handled by your dataset
  • not introducing too much "custom syntax", e.g. "key1/key2" for path_to_items, unless it's already well-established or very obvious

The above isn't to discourage you at all! Just to give some food for thought, and unfortunately I don't really know enough about web APIs to offer too many suggestions here.

@afuetterer (Author)

Thank you all for your valuable feedback. I will start working on a PR this weekend.

I will remove dpath, add jmespath instead, and yield requests.Response objects instead of "results" or "items" from inside the returned JSON. This will be more similar to the behavior of the original APIDataSet.

@tynandebold (Member)

Following on from @AntonyMilneQB's callout from before, it seems like @afuetterer is proposing something like offset pagination, which is one of the most common types. The other would be cursor pagination. Slack's engineering blog has a nice writeup of both.

I don't know the exact context of how this pagination via Kedro would be used in practice, so all I'll say here is that out of the two most common pagination strategies, this one is often a suitable choice. It's not without drawbacks, though: if items are frequently being written to the dataset, page results can be irregular, and it doesn't scale so well for very large datasets.
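
To make the two strategies concrete, a small sketch of what the requests typically look like (purely illustrative URLs and parameter names; real APIs differ):

import requests

# Offset pagination: the client asks for a window by position.
offset_page = requests.get(
    "https://example.org/api/items", params={"offset": 500, "limit": 500}
)

# Cursor pagination: the server returns an opaque token pointing at the next page.
first_page = requests.get("https://example.org/api/items", params={"limit": 500})
cursor = first_page.json().get("next_cursor")  # hypothetical field name
second_page = requests.get(
    "https://example.org/api/items", params={"cursor": cursor, "limit": 500}
)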

@AhdraMeraliQB (Contributor)

Closing this issue as our datasets now live in the kedro-plugins repository. Feel free to re-open this issue / migrate the related PR there 😄 .
