Conversation

cklin
Contributor

@cklin cklin commented Sep 18, 2025

This PR changes overlay-base database download to pass the segmentTimeoutInMs option to restoreCache(), so that restoreCache() itself can properly abort slow downloads.

The waitForResultWithTimeLimit() wrapper around restoreCache() remains as a second line of defense, but with a higher 10-minute time limit, to guard against cache restore hangs outside segment downloads.
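The two layers described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `RestoreFn` is a hypothetical stand-in for the shape of `restoreCache()` (injected so the sketch runs without the Actions cache service), `withTimeLimit` is an illustrative replacement for `waitForResultWithTimeLimit()`, and the argument order of the real call is an assumption.

```typescript
// Hypothetical stand-in for the shape of restoreCache(); the real function
// comes from @actions/cache and is injected here so the sketch runs offline.
type RestoreFn = (
  paths: string[],
  primaryKey: string,
  restoreKeys?: string[],
  options?: { segmentTimeoutInMs?: number },
) => Promise<string | undefined>;

// Illustrative outer time limit (not the project's implementation):
// resolves to undefined and calls onTimeout if the promise is too slow.
async function withTimeLimit<T>(
  limitMs: number,
  promise: Promise<T>,
  onTimeout: () => void,
): Promise<T | undefined> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<undefined>((resolve) => {
    timer = setTimeout(() => {
      onTimeout();
      resolve(undefined);
    }, limitMs);
  });
  try {
    return await Promise.race([promise, timeout]);
  } finally {
    // Stop the timer so it cannot fire after the race has settled.
    if (timer !== undefined) clearTimeout(timer);
  }
}

const SEGMENT_TIMEOUT_MS = 3_000; // per-segment limit discussed in this PR
const OVERALL_LIMIT_MS = 10 * 60_000; // 10-minute second line of defense

// First layer: the restore function aborts slow segments itself.
// Second layer: the wrapper guards against hangs outside segment downloads.
async function restoreOverlayBase(
  restore: RestoreFn,
  dbLocation: string,
  keyPrefix: string,
  overallLimitMs: number = OVERALL_LIMIT_MS,
): Promise<string | undefined> {
  return withTimeLimit(
    overallLimitMs,
    restore([dbLocation], keyPrefix, undefined, {
      segmentTimeoutInMs: SEGMENT_TIMEOUT_MS,
    }),
    () => console.log("Timed out downloading overlay-base database from cache"),
  );
}
```

Passing the restore function as a parameter is only a convenience for the sketch; it also makes it straightforward to substitute a mock in tests.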

See linked issue for rationale and alternatives considered.

Risk assessment

For internal use only. Please select the risk level of this change:

  • Low risk: Changes are fully under feature flags, or have been fully tested and validated in pre-production environments and are highly observable, or are documentation or test only.

Merge / deployment checklist

  • Confirm this change is backwards compatible with existing workflows.
  • Consider adding a changelog entry for this change.
  • Confirm the readme and docs have been updated if necessary.

@cklin cklin marked this pull request as ready for review September 18, 2025 21:58
@cklin cklin requested a review from a team as a code owner September 18, 2025 21:58
@Copilot Copilot AI review requested due to automatic review settings September 18, 2025 21:58
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR refactors the overlay-base database download mechanism to use the native timeout capabilities of the restoreCache() function instead of a custom timeout wrapper. The change replaces waitForResultWithTimeLimit() with direct usage of actionsCache.restoreCache() and its segmentTimeoutInMs option to provide more effective timeout control for download operations.

Key changes:

  • Remove custom timeout wrapper in favor of native cache API timeout option
  • Configure segment-level timeout for more granular download control
  • Clean up unused timeout constant while preserving it for other cache operations

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File | Description
src/overlay-database-utils.ts | Replace waitForResultWithTimeLimit() wrapper with direct restoreCache() call using segmentTimeoutInMs option
lib/init-action.js | Generated JavaScript code reflecting the TypeScript changes, including variable renaming to avoid conflicts

esbena
esbena previously approved these changes Sep 19, 2025
Contributor

@esbena esbena left a comment

LGTM with one minor concern about download speeds outside the cloud.

{
// Azure SDK download (which is the default) uses 128MB segments.
// Setting segmentTimeoutInMs to 3000 translates to segment download
// speed of about 40 MB/s, which should be achievable unless the
Contributor

Isn't 40 MB/s on the high end for most residential connections?
What if this is somehow run from outside Actions, for instance in a local test?
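The 40 MB/s figure follows from dividing the segment size by the segment timeout. A small sketch of that arithmetic, assuming the "128MB" in the code comment means 128 MiB (the function name is made up for illustration):

```typescript
const MIB = 1024 * 1024;

// Minimum sustained throughput (in MiB/s) needed to finish one segment
// before the per-segment timeout fires.
function minThroughputMiBPerSec(
  segmentBytes: number,
  segmentTimeoutMs: number,
): number {
  return segmentBytes / MIB / (segmentTimeoutMs / 1000);
}

// 128 MiB segments with a 3000 ms timeout, as in the comment under review.
const required = minThroughputMiBPerSec(128 * MIB, 3000);
console.log(required.toFixed(1)); // prints "42.7"
```

A connection that sustains less than this would, under these assumptions, hit the segment timeout on any full-size segment.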

Member

@mbg mbg Sep 19, 2025

I don't think you would generally run this locally. For tests, we'd mock the behaviour. Beyond that, I haven't checked, but I'd imagine that the API endpoints used by the Actions cache are also only accessible from hosted runners, or at least require some suitable auth token. Setting that up locally is unlikely to ever be worth it. And even if we wanted to, it would be easy to adjust this value for local testing if it's too high.

Contributor Author

40MB/s is indeed very high for residential connections. I don't think I could reach this download speed from my connection at home.

That said, I am still inclined to keep this setting, at least for now.

  • I agree with Michael that we are not very likely to run this code locally.
  • Using overlay analysis is a performance optimization, and there is a point where the download would take up so much time that the entire workflow ends up being slower end-to-end. That break-even point will probably differ from one repository to another, but 40 MB/s seems like a good figure to start with.

As we gain more experience, I am definitely open to adjusting this limit as appropriate.

mbg
mbg previously approved these changes Sep 19, 2025
Member

@mbg mbg left a comment

This seems reasonable to me, if you're happy with it.

In terms of the motivating issue, I suppose this is a change in objective? Rather than an overall timeout (which didn't work), there's now a timeout for individual segments (which does work, but there's no overall timeout)? So you might spend more time than MAX_CACHE_OPERATION_MS downloading the overlay cache if it's large, but no more than 3000ms per segment.
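To make the trade-off concrete, here is a rough upper bound on how long a successful download could take when only the per-segment limit applies. It assumes segments are fetched one at a time, with no retries, and that every segment finishes just inside its limit; concurrency in the real downloader would lower the figure. The function name is hypothetical.

```typescript
const MIB = 1024 * 1024;

// Upper bound on total download time when each segment may take up to
// segmentTimeoutMs and segments are downloaded sequentially (an assumption).
function worstCaseDownloadMs(
  totalBytes: number,
  segmentBytes: number,
  segmentTimeoutMs: number,
): number {
  const segments = Math.ceil(totalBytes / segmentBytes);
  return segments * segmentTimeoutMs;
}

// A 1 GiB cache entry split into 128 MiB segments gives 8 segments.
console.log(worstCaseDownloadMs(1024 * MIB, 128 * MIB, 3000)); // prints 24000
```

The bound grows linearly with the size of the cache entry, which is why a per-segment limit alone does not cap the overall duration.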

actionsCache.restoreCache([dbLocation], cacheRestoreKeyPrefix),
() => {
logger.info("Timed out downloading overlay-base database from cache");
const foundKey = await actionsCache.restoreCache(
Member

Is there any harm in keeping this wrapped in waitForResultWithTimeLimit with a suitable time limit?

Contributor Author

No, there is no harm, and there would be benefits (such as protecting against hangs outside segment downloads).

I put the waitForResultWithTimeLimit() wrapper back, with a (longer) 10-minute time limit.

Thanks for the suggestion!

This commit changes overlay-base database download to pass the
segmentTimeoutInMs option to restoreCache(), so that restoreCache()
itself can properly abort slow downloads.

The waitForResultWithTimeLimit() wrapper around restoreCache() remains
as a second line of defense, but with a higher 10-minute time limit, to
guard against cache restore hangs outside segment downloads.
@cklin cklin dismissed stale reviews from mbg and esbena via 80273e2 September 19, 2025 16:40
@cklin cklin force-pushed the cklin/overlay-restore-timeout branch from 489021a to 80273e2 Compare September 19, 2025 16:40
@cklin cklin merged commit c22ae04 into main Sep 19, 2025
299 checks passed
@cklin cklin deleted the cklin/overlay-restore-timeout branch September 19, 2025 17:25
@github-actions github-actions bot mentioned this pull request Sep 25, 2025
8 tasks
4 participants