Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add async I/O methods [WIP] #749

Closed

Conversation

duckontheweb
Copy link
Contributor

@duckontheweb duckontheweb commented Feb 14, 2022

Related Issue(s)

Description

Implements asynchronous I/O operations by adding async equivalents to the following StacIO methods:

  • read_text / read_text_async (abstract methods)
  • write_text / write_text_async (abstract methods)
  • read_json / read_json_async
  • save_json / save_json_async
  • read_stac_object / read_stac_object_async

This PR also creates a new DefaultStacIOAsync class with default implementations of the async methods using aiofiles (for local file I/O) and httpx (for reading via HTTP). In order to take advantage of these async methods, users must install PySTAC using the aiofiles and httpx extras or create their own implementations of the async abstract methods using their libraries of choice (e.g. aiohttp).

I added a Catalog.save_async method that takes advantage of these new async I/O methods as an example of how me might implement async I/O throughout the code base. Feedback would be appreciated on the approach of having sync methods call their equivalent async method, and on the naming conventions used.

There is still quite a bit of work to do on this before we can consider the linked issues to be closed, but I would like feedback on the changes to the Catalog and StacIO APIs before I do too much more work on the rest of the code. I've added some additional TODOs in the PR checklist.

This PR represents a breaking change, as it requires developers to implement the abstract read_text_async and write_text_async methods in any custom StacIO implementations. However, users who are using the default implementation should not need to make any changes.

PR Checklist

  • Code is formatted (run pre-commit run --all-files)
  • Tests pass (run scripts/test)
  • Documentation has been updated to reflect changes, if applicable
  • This PR maintains or improves overall codebase code coverage.
  • Changes are added to the CHANGELOG. See the docs for information about adding to the changelog.
  • Test performance of synchronous StacIO methods calling async methods vs. using separately maintained synchronous code.
  • Implement asynchronous versions of remaining methods that involve I/O operations (e.g. Catalog.walk)
  • Update tutorials and other docs to use async methods
  • Merge changes from Add performance benchmarks #748 and add benchmarks to test I/O performance improvements/degradations

@duckontheweb duckontheweb added enhancement dependencies Pull requests that update a dependency file breaking Represents a potentially breaking change that may warrant a major version release labels Feb 14, 2022
@duckontheweb duckontheweb added this to the 2.0.0 milestone Feb 14, 2022
@duckontheweb duckontheweb changed the base branch from main to 2.0 February 15, 2022 14:00
@duckontheweb
Copy link
Contributor Author

Changed the base branch to 2.0 since this introduces breaking changes.

Copy link
Member

@gadomski gadomski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tl;dr: +0

I don't have any strong objections to the way things are currently laid out. It feels a little awkward to have all of the RuntimeError checks inside the save method, but if that's how it has to be, so it goes.

I'll mention a couple of possible other implementations that popped to mind; my understanding is pretty low for the Python async ecosystem so these are probably Very Wrong™:

  • Go "async-first" and make save (and friends) async, and use "blocking" to describe non-async methods (e.g. read_blocking). This is probably a bad idea because PySTAC is so foundational and we don't want people to have to go async to just use it.
  • Use two different classes for Async and Sync objects, e.g. AsyncItem. I think this probably breaks a lot of the type inference stuff used to detect item types?
  • Is there any way to "store" asyncio information on StacIO to avoid all those try/excepts? Probably not...just something that I thought of while reading tokio's tutorial on bridging w/ sync code: https://tokio.rs/tokio/topics/bridging.

@schwehr
Copy link
Collaborator

schwehr commented Feb 16, 2022

I've never used asyncio, so I don't have a lot to offer in review.

@duckontheweb
Copy link
Contributor Author

Thanks @gadomski!

It feels a little awkward to have all of the RuntimeError checks inside the save method, but if that's how it has to be, so it goes.

I agree. I was going to abstract this into a separate _run_coroutine method, but I didn't want to add that complexity until there was consensus on the general approach.

  • Go "async-first" and make save (and friends) async, and use "blocking" to describe non-async methods (e.g. read_blocking). This is probably a bad idea because PySTAC is so foundational and we don't want people to have to go async to just use it.

I thought about this, too, and was wary for the same reasons. I suspect that the majority of our users are not as familiar with async in Python and might be at least intimidated by seeing the methods they are used to using switched to async.

  • Use two different classes for Async and Sync objects, e.g. AsyncItem. I think this probably breaks a lot of the type inference stuff used to detect item types?

I had toyed with this idea a bit, too. I think the general recommendation for async programming is to put sync and async I/O operations in different classes, but as you mentioned that might cause problems with some of our STAC object type sniffing. It also would make it a little trickier to take advantage of async operations within synchronous methods (like here) for parallelizing operations.

That said, I'm open to other ways of structuring this. One thought would be to have an AsyncItem class as you suggested but only have it contain the I/O operations and make it available within the normal Item class (maybe in an Item.async_io property). That way our normal STAC object detection methods could still return the normal classes, and we could isolate the async operations to a separate class. If that seems worth pursuing I can put in an alternate PR with that structure.

  • Is there any way to "store" asyncio information on StacIO to avoid all those try/excepts? Probably not...just something that I thought of while reading tokio's tutorial on bridging w/ sync code: https://tokio.rs/tokio/topics/bridging.

I had originally structured the StacIO changes by having a separate AsyncStacIO class that was available as StacIO.async_io in the normal StacIO instances and could be used by StacIO to do async operations. I ran into a some problems with making the sync and async methods available within the methods on STACObject, Catalog, etc., but those might be alleviated if we separate out the async logic as suggested above. This makes me think it might be worth a separate PR to explore structuring the the async code that way so we have a good point of comparison.

@duckontheweb duckontheweb removed the request for review from schwehr February 28, 2022 14:16
@duckontheweb
Copy link
Contributor Author

This is an interesting idea. It looks like their approach is to wrap the async functions to create blocking versions, which is different from the httpx approach of having completely separate async and blocking code. It might be worth pursuing this approach to see if it simplifies the code at all.

@sgillies
Copy link
Contributor

sgillies commented Feb 28, 2022

When I was updating Planet's stactools plugin I noticed that stactools already depends on aiohttp. Maybe it would be worth considering using that instead of httpx? So that a user's stactools install doesn't contain more web clients than it needs.

@gadomski gadomski mentioned this pull request Jan 9, 2023
10 tasks
@gadomski gadomski deleted the branch stac-utils:2.0 March 1, 2023 20:00
@gadomski gadomski closed this Mar 1, 2023
@gadomski gadomski removed this from the v2.0 milestone Sep 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking Represents a potentially breaking change that may warrant a major version release dependencies Pull requests that update a dependency file enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make pystac handle item creation asynchronously Large catalog async optimizations
4 participants