Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add retries to file download to avoid throwing an exception in the case of normal retry errors #1002

Merged
merged 11 commits into from
Jul 29, 2023

Conversation

jreiberkyle
Copy link
Contributor

@jreiberkyle jreiberkyle commented Jul 21, 2023

Related Issue(s):

Closes #974

Proposed Changes:

For inclusion in changelog (if applicable):

  1. Orders and data downloads now incorporate retries and worker limits. Retries make them more robust in the case of common http errors and including them in the worker limits reduces the chance of overloading the server and triggering the errors in the first place. For this to happen, the write functionality has been moved from models.StreamingBody (which has been removed) to http.Session

Side note: Adding functionality while deleting a net of 59 lines seems like a win =)

Things I didn't do but believe would be a good idea:

  1. include ServerError as a retryable http error (this makes many subscription failure tests very long because they are using ServerError to initiate a failure)
  2. Move tests/unit/test_Session.py into tests/unit/test_http.py so all Session tests are together
  3. Move the rest of models.py into http.py, it's getting pretty small without StreamingBody

Not intended for changelog:

  1. models.Response now has new filename and length properties which are None unless the response is from a GET request for a downloadable resource.

Diff of User Interface

Old behavior:

Given a list of activated order ids stored in oids.txt, the following script would result in some downloads failing due to http errors:

import asyncio
from collections import Counter

import planet

import logging; logging.basicConfig(filename='log.txt', level=logging.DEBUG, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')


async def download(count=1, directory_path='downloads'):
    async with planet.Session() as s:
        client = s.client('orders')
                
        with open('oids.txt', 'r') as f:
            order_ids = f.readlines()

        oids = [order_id.strip() for order_id in order_ids[:count]]

        res = await asyncio.gather(*[
            _download_order(client, oid, directory_path)
                    for oid in oids], return_exceptions=True)
        
        errors =[r for r in res if issubclass(type(r), Exception)]

        print(Counter([type(r) for r in errors]))
        if errors:
            raise(errors[0])


async def _download_order(client, order_id, directory):
    with planet.reporting.StateBar(state='polling') as reporter:
        await client.wait(order_id, callback=reporter.update_state, max_attempts=0, delay=7)
        reporter.update_state('downloading')
        await client.download_order(order_id, directory=directory, progress_bar=False, overwrite=True)
        reporter.update_state('downloaded')


asyncio.run(download(count=100))

New behavior:

All downloads are successful

PR Checklist:

  • This PR is as small and focused as possible
  • If this PR includes proposed changes for inclusion in the changelog, the title of this PR summarizes those changes and is ready for inclusion in the Changelog.
  • [] I have updated docstrings for function changes and docs in the 'docs' folder for user interface / behavior changes
  • [] This PR does not break any examples or I have updated them

(Optional) @mentions for Notifications:

@jreiberkyle jreiberkyle changed the base branch from main to maint-2.1 July 21, 2023 22:16
@jreiberkyle jreiberkyle requested a review from sgillies July 21, 2023 22:26
Copy link
Contributor

@sgillies sgillies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreiberkyle Excellent! I'm requesting that two client features (directory assurance and progress bar) be moved up a layer.

planet/http.py Outdated
new = response.num_bytes_downloaded - previous
progress.update(new - previous)
previous = new
progress.update()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreiberkyle do you mean to put this statement within the chunk iteration? Unless you do, I don't see how progress is shown.

planet/http.py Outdated
Comment on lines 405 to 407
directory: Path = Path('.'),
overwrite: bool = False,
progress_bar: bool = False) -> Path:
Copy link
Contributor

@sgillies sgillies Jul 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreiberkyle this is not the right level for these features. Let's allow the Session class to work in a simple space with no progress bars and no directories. Just requests and responses and bytes going in and out. If we do, I think we'll find it easier to refactor and maintain over time. Principles of least concern and strong boundaries are what I'm invoking.

Concretely, let's give the download methods the responsibility of ensuring that directories exist and that we have a valid destination filename. Also, those methods should create progress bars and pass them to Session.write, which should just do progress.update() (dependency injection).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed on separating concerns. I'm getting stuck on initiating the progress bar, which requires a total which can only be read by the response. I will look deeper into how to make this work with the client download functions handling creating of the progress bar.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this time, retry is managed by Session. For Session to manage retry, it has to submit the original request for the download (so it can retry that request). When it submits the request, the response is not available to the client download function. So, the download function cannot access the total value in the response headers. There are two ways I can think of to allow download function to create the progress bar:

  1. pass two callbacks to Session.write, one for update and one for setting the total
  2. have the download function submit the original request to get the total value, then have Session.write submit the original request again as a part of the write process

I tend to think more than one callback is complex and so prefer option 2 even though it will be less efficient communication-wise, but I am not sure my aversion to multiple callbacks is warrented. @sgillies what do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't look like tqdm really supports setting the total after the progress bar has been initialized, so I am going to go with option 2: have the download function submit the original request to get the total, then have session.write submit it again to perform the download.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreiberkyle option 2 is better, I agree. Can the double requests be avoided by making a decorator form of the retrier and re-trying not the HTTP requests, but download_asset()? Like this

    @retry
    async def download_asset(...):
        ...
        async with self._session._client.stream, method='GET', url=location) as resp:
            content_length = resp.headers['Content-Length']
            dl_path = Path(directory, filename or parsed_from_content_disposition_header)
            dl_path.parent.mkdir(exist_ok=True, parents=True)
            return await self._session._write_response(
                resp,
                dl_path,
                overwrite=overwrite,
                progress_bar=tqdm(total=content_length, ...)
             )

Maybe for the next iteration? I think double requests are tolerable for now. Would it be best to make sure the first one gets closed/finalized before the second one is made? That's something to look into.

It also occurs to me that file sizes are in the order manifest.json file, so there's another way to get them. Also making another request, but only one extra per order.

Copy link
Contributor

@sgillies sgillies Jul 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Decorating the client method has an example here: https://github.com/hynek/stamina#production-grade-retries-made-easy. I bet you could turn the SDK's retrier into such a decorator.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed that we can generalize the retry function and make it work for retrying download_asset(). In that case, we may want to change Session.request so that it does not also retry (thereby squaring the effective MAX_RETRIES), and then change all the functions calling Session.request so they DO retry. At any rate, it would require some more thought. So, I went with option 2 - two calls.

@jreiberkyle jreiberkyle self-assigned this Jul 29, 2023
@jreiberkyle
Copy link
Contributor Author

@sgillies merging to prepare a release. You will get a chance to review before the release as you will be the release manager =)

@jreiberkyle jreiberkyle merged commit 082033b into maint-2.1 Jul 29, 2023
@jreiberkyle jreiberkyle deleted the download-failure-974 branch July 29, 2023 01:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Download failure under heavy load
2 participants