Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test run stuck retrying with eager_fetch=False #16096

Open
stuhood opened this issue Jul 8, 2022 · 11 comments
Open

test run stuck retrying with eager_fetch=False #16096

stuhood opened this issue Jul 8, 2022 · 11 comments

Comments

@stuhood
Copy link
Member

stuhood commented Jul 8, 2022

https://github.com/pantsbuild/pants/runs/7244788636?check_suite_focus=true

@stuhood stuhood self-assigned this Jul 8, 2022
@stuhood stuhood added this to the 2.13.x milestone Jul 12, 2022
@stuhood
Copy link
Member Author

stuhood commented Jul 20, 2022

The case in the description appears to be kicked off by a missing test output (for Run Pytest for build-support/migration-support/migrate_to_toml_config_test.py:tests) that would have been consumed by coverage report generation (which currently occurs after writing the test summary to stderr). We backtrack to actually run the test, but the test goal never successfully consumes the retried test (as evidenced by the test goal repeatedly rendering the original test status of "...succeeded in 0.66s (cached remotely)"). As to the infinite retry: that's likely due to #16075.

So: why doesn't the test consume the newly produced test result after retry? If it did, the infinite retry would not be an issue afaict, because the new test output would be consumed successfully.

@stuhood
Copy link
Member Author

stuhood commented Jul 22, 2022

as evidenced by the test goal repeatedly rendering the original test status of "...succeeded in 0.66s (cached remotely)"

I have an (unsatisfying) answer to this portion: ProcessResultMetadata is excluded from ProcessResult equality, so if an otherwise-identical ProcessResult was being consumed, then it could still be possible to see the "cached remotely" suffix. As mentioned, that's unsatisfying: in 90% of cases we would want that portion to be ignored, but in this case, even if the result was completely identical, we'd want to use the new value rather than the old.

But that still doesn't actually answer how we're ending with a completely identical output for the second run:

  1. Because it was actually identical? But if so, then backtracking should have caused the missing digest to actually be produced on the second run.
  2. Because of a bug causing it to consume a stale value? I haven't thought of anything here yet.

I haven't been able to repro locally, so I think that my next step is to rule out one of the above by adding some additional debug output.

@stuhood
Copy link
Member Author

stuhood commented Jul 22, 2022

I've posted #16277 to add debug output to differentiate the two cases above. If/when the issue occurs again, folks should post a link to the repro to this issue, and then feel free to set remote_cache_eager_fetch = true again.

@jsirois
Copy link
Contributor

jsirois commented Jul 22, 2022

I've been out - is there any reason to be leaning on "if" and the possibility of red main again? When I last left this was a pretty reliable main CI red.

@stuhood
Copy link
Member Author

stuhood commented Jul 22, 2022

I've been out - is there any reason to be leaning on "if" and the possibility of red main again? When I last left this was a pretty reliable main CI red.

This is a new issue: #16075 fixed #15995, and the case described in this issue is seemingly rarer.

stuhood added a commit that referenced this issue Jul 23, 2022
This is debug output (and labels itself as such) to assist with differentiating the cases of #16096.

[ci skip-build-wheels]
@stuhood
Copy link
Member Author

stuhood commented Jul 27, 2022

No repros of this case yet in pantsbuild/pants CI... so I'll go ahead and cherry-pick #16277 so that users trying out lazy fetch who encounter this issue can help us gather data.

stuhood added a commit to stuhood/pants that referenced this issue Jul 27, 2022
This is debug output (and labels itself as such) to assist with differentiating the cases of pantsbuild#16096.

[ci skip-build-wheels]
@stuhood
Copy link
Member Author

stuhood commented Jul 27, 2022

Ironically (?) but helpfully, I actually got repros in my cherry-pick PR:

The debug output from #16277 doesn't actually trigger, which itself is useful data. Will resume this tomorrow.

jyggen pushed a commit to jyggen/pants that referenced this issue Jul 27, 2022
This is debug output (and labels itself as such) to assist with differentiating the cases of pantsbuild#16096.

[ci skip-build-wheels]
@stuhood
Copy link
Member Author

stuhood commented Jul 28, 2022

So the relevant aspect of #16277 not triggering in the repro on #16327 is that it means that the test goal is not retrying the invalidated process. That might be because:

  1. a different process is running the second time
    • this seems unlikely, because the re-runs of the test goal are observing the "same ProcessResult" (with an identical output Digest)
  2. the relevant (transitive) dependency test process of the test goal is not being dirtied, such that when the test goal re-runs, it thinks that the test process (transitively) does not need to re-run
    • this seems more likely, because AFAICT, it's only processes which run after the test goal has side-effected (rendered to the console and written a Digest to the workspace), which experience the issue.
  3. ...?

I'm still not able to reproduce this case in an integration test, which makes me suspect that it is related to the stack of @rules that is running under the test goal in particular.

Case 2 is relatively easy to confirm (...assuming a repro), so I've pushed #16342 to try and get data on "what" is being invalidated in the relevant cases. As with #16277, if it doesn't repro while landing the PR, I'll likely try cherry-picking it.

stuhood added a commit to stuhood/pants that referenced this issue Aug 2, 2022
This is debug output (and labels itself as such) to assist with differentiating the cases of pantsbuild#16096.

[ci skip-build-wheels]
stuhood added a commit to stuhood/pants that referenced this issue Aug 2, 2022
This is debug output (and labels itself as such) to assist with differentiating the cases of pantsbuild#16096.

[ci skip-rust]
[ci skip-build-wheels]
stuhood added a commit that referenced this issue Aug 3, 2022
…#16398)

As reported in #16096, backtracking still has some kinks to work out, and is unlikely to be fully stable before `2.13.0` ships.

To gain "most" of the network-usage benefit of deferring fetching cache content while significantly narrowing the window in which backtracking is necessary, this change replaces the `remote_cache_eager_fetch` flag with a `cache_content_behavior` ternary option. As described in the `help` for the new option, the modes are:
* `fetch`: Eagerly fetches cache content: the default. Equivalent to `eager_fetch=True`, the previous default.
* `validate`: Validates that cache content exists either locally or remotely, without actually fetching files. This cannot be the default (without changing behavior), because it can still require backtracking if cache content expires during the lifetime of `pantsd` or a single run.
* `defer`: Does not validate or fetch content: equivalent to `eager_fetch=False`. Hopefully can eventually be made the default once all issues like #16096 are resolved.

Because `validate` narrows the window in which we backtrack to cases where content expires during a run, it should be relatively safe for use even within the `2.13.x` release.

[ci skip-build-wheels]
stuhood added a commit to stuhood/pants that referenced this issue Aug 3, 2022
…pantsbuild#16398)

As reported in pantsbuild#16096, backtracking still has some kinks to work out, and is unlikely to be fully stable before `2.13.0` ships.

To gain "most" of the network-usage benefit of deferring fetching cache content while significantly narrowing the window in which backtracking is necessary, this change replaces the `remote_cache_eager_fetch` flag with a `cache_content_behavior` ternary option. As described in the `help` for the new option, the modes are:
* `fetch`: Eagerly fetches cache content: the default. Equivalent to `eager_fetch=True`, the previous default.
* `validate`: Validates that cache content exists either locally or remotely, without actually fetching files. This cannot be the default (without changing behavior), because it can still require backtracking if cache content expires during the lifetime of `pantsd` or a single run.
* `defer`: Does not validate or fetch content: equivalent to `eager_fetch=False`. Hopefully can eventually be made the default once all issues like pantsbuild#16096 are resolved.

Because `validate` narrows the window in which we backtrack to cases where content expires during a run, it should be relatively safe for use even within the `2.13.x` release.

[ci skip-build-wheels]
@stuhood
Copy link
Member Author

stuhood commented Aug 4, 2022

stuhood added a commit that referenced this issue Aug 4, 2022
… (Cherry-pick of #16398) (#16409)

As reported in #16096, backtracking still has some kinks to work out, and is unlikely to be fully stable before `2.13.0` ships.

To gain "most" of the network-usage benefit of deferring fetching cache content while significantly narrowing the window in which backtracking is necessary, this change replaces the `remote_cache_eager_fetch` flag with a `cache_content_behavior` ternary option. As described in the `help` for the new option, the modes are:
* `fetch`: Eagerly fetches cache content: the default. Equivalent to `eager_fetch=True`, the previous default.
* `validate`: Validates that cache content exists either locally or remotely, without actually fetching files. This cannot be the default (without changing behavior), because it can still require backtracking if cache content expires during the lifetime of `pantsd` or a single run.
* `defer`: Does not validate or fetch content: equivalent to `eager_fetch=False`. Hopefully can eventually be made the default once all issues like #16096 are resolved.

Because `validate` narrows the window in which we backtrack to cases where content expires during a run, it should be relatively safe for use even within the `2.13.x` release.

[ci skip-build-wheels]
@stuhood stuhood removed this from the 2.13.x milestone Aug 17, 2022
@stuhood stuhood added the remote label Aug 26, 2022
cczona pushed a commit to cczona/pants that referenced this issue Sep 1, 2022
…pantsbuild#16398)

As reported in pantsbuild#16096, backtracking still has some kinks to work out, and is unlikely to be fully stable before `2.13.0` ships.

To gain "most" of the network-usage benefit of deferring fetching cache content while significantly narrowing the window in which backtracking is necessary, this change replaces the `remote_cache_eager_fetch` flag with a `cache_content_behavior` ternary option. As described in the `help` for the new option, the modes are:
* `fetch`: Eagerly fetches cache content: the default. Equivalent to `eager_fetch=True`, the previous default.
* `validate`: Validates that cache content exists either locally or remotely, without actually fetching files. This cannot be the default (without changing behavior), because it can still require backtracking if cache content expires during the lifetime of `pantsd` or a single run.
* `defer`: Does not validate or fetch content: equivalent to `eager_fetch=False`. Hopefully can eventually be made the default once all issues like pantsbuild#16096 are resolved.

Because `validate` narrows the window in which we backtrack to cases where content expires during a run, it should be relatively safe for use even within the `2.13.x` release.

[ci skip-build-wheels]
@stuhood
Copy link
Member Author

stuhood commented Sep 22, 2022

Note that this issue can still occur with cache_content_behavior=validate if an artifact disappears after validate, but before it is actually fetched.

@stuhood
Copy link
Member Author

stuhood commented Dec 7, 2022

Also very related is: #16667.

@stuhood stuhood removed their assignment Sep 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants