Make APIDataset.save store the response received #748

Open
npfp opened this issue Jul 3, 2024 · 13 comments

@npfp

npfp commented Jul 3, 2024

Description

When running a POST or PUT request with the APIDataset, the response is currently lost, even though it would be useful to store it.

Context

We rely heavily on the APIDataset not only to fetch data but also to save data to external APIs. Keeping track of the response is therefore really important to us.

Possible Implementation

We built a custom APIDataset that takes a filepath argument. If this argument is not None, a TextDataset(filepath=filepath) is created and its save method is called in _execute_save_request:

self.local_dataset.save(response.text)
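
A minimal sketch of what such a custom dataset could look like, assuming the kedro_datasets APIDataset and TextDataset APIs; the class name and the *args/**kwargs forwarding are placeholders, since the exact _execute_save_request signature may differ between kedro-datasets versions:

```python
from __future__ import annotations

from kedro_datasets.api import APIDataset
from kedro_datasets.text import TextDataset


class ResponseSavingAPIDataset(APIDataset):
    """Hypothetical subclass: persist the body of POST/PUT responses."""

    def __init__(self, *, filepath: str | None = None, **api_kwargs):
        super().__init__(**api_kwargs)
        # Only create the nested TextDataset when a filepath is given.
        self.local_dataset = TextDataset(filepath=filepath) if filepath else None

    def _execute_save_request(self, *args, **kwargs):
        # Forward the arguments untouched; their exact shape is not assumed here.
        response = super()._execute_save_request(*args, **kwargs)
        if self.local_dataset is not None:
            self.local_dataset.save(response.text)
        return response
```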

Possible Alternatives

None that I have found.

I would then like to make a PR with this proposed change, but before making the actual PR I wanted to double-check with you that this feature would be of interest to the community.

@datajoely
Contributor

This is a great point. I think the other, slightly more robust, way to do this would be to add a logging.info(response.text) call so that this sort of thing can be picked up within an observability stack.
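
A rough sketch of that logging alternative; the subclass name and the *args/**kwargs forwarding are assumptions for illustration, the suggestion itself is just the single logging call inside the save path:

```python
import logging

from kedro_datasets.api import APIDataset

logger = logging.getLogger(__name__)


class LoggingAPIDataset(APIDataset):
    """Hypothetical subclass: surface the save response through standard logging."""

    def _execute_save_request(self, *args, **kwargs):
        response = super()._execute_save_request(*args, **kwargs)
        # Emitted via logging so an observability stack can pick it up.
        logger.info("APIDataset save response (%s): %s", response.status_code, response.text)
        return response
```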

@npfp
Author

npfp commented Jul 3, 2024

Ah right, I hadn't thought of it that way.

In our case, some of the external endpoints send back an id and some pieces of information that we use as the starting point of another pipeline in a subsequent run, hence the idea of storing the response.

@datajoely
Contributor

That makes sense, I think the ambition is right: we should store this. I guess this was built under the assumption that we only cared about 200 responses, but POST functionality was added by the community later and this is a key point.

@npfp
Author

npfp commented Jul 3, 2024

Great, I will make a PR then.

@datajoely
Contributor

Before you do any work, I'd like to get some other contributors' opinions! @noklam @merelcht any thoughts here?

@astrojuanlu transferred this issue from kedro-org/kedro on Jul 3, 2024
@merelcht
Member

merelcht commented Jul 3, 2024

It makes total sense to me to save the response. I wouldn't save it as another type of dataset though (e.g. the TextDataset mentioned in the description), but rather save it directly to a file format that makes sense. The main reason is that it feels odd to me to have one type of dataset be the return type of another dataset.

@npfp
Author

npfp commented Jul 4, 2024

@merelcht I see. Regarding the use of TextDataset, it was really to reduce the maintenance/test burden by relying on a maintained dataset while keeping the interface simple.

When you say

directly save it to a file format that makes sense

do you mean using the open context manager/write operation directly?

@MinuraPunchihewa
Contributor

@merelcht I am happy to work on this unless @npfp is already in the process.

@merelcht
Member

Sure, go for it @MinuraPunchihewa. I see I never got back to the question:

do you mean using the open context manager/write operation directly?

Yes! My point was mainly that I wouldn't import another Kedro dataset like TextDataset and return that, but instead use the open context manager and write operation directly.
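
For illustration, a minimal sketch of that approach, where the dataset writes the response body itself instead of delegating to another dataset (the helper name and arguments are hypothetical):

```python
from pathlib import Path

import requests


def write_save_response(response: requests.Response, filepath: str) -> None:
    """Hypothetical helper: persist the save response with a plain open/write."""
    with Path(filepath).open("w", encoding="utf-8") as f:
        f.write(response.text)
```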

@npfp
Author

npfp commented Oct 23, 2024

For your information, I implemented a custom APIDataset with a nested dataset to store the response. I was inspired by the partitioned dataset.

I didn't go for the open context manager because I wanted flexibility in where I would store the responses: locally or in S3, as JSON or raw text, versioned or not, etc.

I also decided to implement the load method when using POST/PUT mode: it simply loads the response that has been saved, so it can be processed downstream in the pipeline.
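
A rough sketch of how such a nested-dataset design could look; the class name, the response_dataset argument, and the use of AbstractDataset.from_config are assumptions for illustration, not the actual implementation, and hook names may differ across Kedro versions:

```python
from __future__ import annotations

from typing import Any

from kedro.io import AbstractDataset
from kedro_datasets.api import APIDataset


class APIDatasetWithResponse(APIDataset):
    """Hypothetical dataset that nests a configurable dataset for the response."""

    def __init__(self, *, response_dataset: dict[str, Any] | None = None, **api_kwargs):
        super().__init__(**api_kwargs)
        # Build the nested dataset from its config, PartitionedDataset-style, so the
        # response can go to local text, JSON on S3, a versioned dataset, etc.
        self._response_dataset = (
            AbstractDataset.from_config("response", response_dataset)
            if response_dataset
            else None
        )

    def _execute_save_request(self, *args, **kwargs):
        response = super()._execute_save_request(*args, **kwargs)
        if self._response_dataset is not None:
            self._response_dataset.save(response.text)
        return response

    def _load(self):
        # In POST/PUT mode, return the saved response so a downstream node can use it.
        if self._response_dataset is not None:
            return self._response_dataset.load()
        return super()._load()
```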

I'm aware this design is arguable.

What I could do is open a PR, as the code is pretty much ready, and we can discuss it there. I could open the PR by the beginning of next week. If it doesn't fit the Kedro spirit, we can just close it.

@merelcht @MinuraPunchihewa What do you think?

@MinuraPunchihewa
Contributor

@npfp Please go ahead and open your PR.

@datajoely
Contributor

Thanks - @npfp please do! It will be a great place to co-design the best solution possible 💪

@npfp
Author

npfp commented Oct 23, 2024

This is pretty much the design we used for our custom dataset:

#905
