Remote execution is caching failed process #17142

Closed · Eric-Arellano opened this issue Oct 7, 2022 · 5 comments · Fixed by #17198


The remote execution service I'm using is failing when running this process:

env_process_result = await Get(
    ProcessResult,
    Process(
        ["env", "-0"],
        description=f"Extract environment variables from {description_of_env_source}",
        level=LogLevel.DEBUG,
    ),
)

I was confused that rerunning Pants did not result in anything being submitted to the RE servers. Turns out, -ldebug showed this:

11:42:19.00 [DEBUG] Running Extract environment variables from the remote execution environment under semaphore with concurrency id: 1, and concurrency: 1
11:42:19.24 [DEBUG] Remote execution: Extract environment variables from the remote execution environment
11:42:19.24 [DEBUG] built REv2 request (...)
11:42:19.24 [DEBUG] Completed: Remote cache lookup for: Extract environment variables from the remote execution environment
11:42:19.24 [DEBUG] remote cache hit for: "Extract environment variables from the remote execution environment" digest=Digest { hash: Fingerprint<df716cefa1e1d42f1483193e95580accf9d0ea2b443aab1dd73e88628842a10b>, size_bytes: 139 } response=FallibleProcessResultWithPlatform { stdout_digest: Digest { hash: Fingerprint<e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855>, size_bytes: 0 }, stderr_digest: Digest { hash: Fingerprint<8d67d941a3599044bcba432b2378540c7ed3abafcbd79238b89126517dba6f8a>, size_bytes: 49 }, exit_code: 1, output_directory: DirectoryDigest { digest: Digest { hash: Fingerprint<e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855>, size_bytes: 0 }, tree: "Some(..)" }, platform: Linux_x86_64, metadata: ProcessResultMetadata { total_elapsed: Some(Duration { secs: 0, nanos: 471865000 }), source: HitRemotely, source_run_id: RunId(2) } }

This was the case regardless of --no-pantsd and --no-remote-cache-read.

I would expect that a) a failed process would not be cached, and b) --remote-cache-read should toggle the cache generally.

I worked around this by setting ProcessCacheScope.PER_SESSION, which guarantees a cache miss on every run.
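For reference, the workaround amounts to setting cache_scope on the Process from the snippet above (a sketch; Process, ProcessCacheScope, and ProcessResult live in pants.engine.process, and the Get runs inside an async @rule):

from pants.engine.process import Process, ProcessCacheScope, ProcessResult
from pants.engine.rules import Get
from pants.util.logging import LogLevel

# PER_SESSION scopes the cache entry to the current Pants session, so every
# fresh run misses the cache and re-executes the process.
env_process_result = await Get(
    ProcessResult,
    Process(
        ["env", "-0"],
        description=f"Extract environment variables from {description_of_env_source}",
        level=LogLevel.DEBUG,
        cache_scope=ProcessCacheScope.PER_SESSION,
    ),
)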

Eric-Arellano self-assigned this Oct 7, 2022
stuhood (Member) commented Oct 7, 2022

So, we have a remedy for this, I think... but it's a little bit fiddly.

We now explicitly wrap the remote_cache::CommandRunner around the remote::CommandRunner, and so we will read/write (or not) to the cache using our Process semantics in that runner. But the remote ExecutionService server can also choose to do a cache read/write when we execute a process, and it can choose its own semantics for the write.

The fiddly solution for this would be to set do_not_cache only on the Action which we use in remote::CommandRunner, and not in the Action that we use in remote_cache::CommandRunner.

Whether that fiddly solution is worth it, though, is unclear. cc @tdyas, @illicitonion
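To make the split concrete, here is a rough sketch in terms of the REv2 protos, using the remote-apis Python bindings purely for illustration (the engine does this in Rust, and the digest values here are placeholders):

from build.bazel.remote.execution.v2.remote_execution_pb2 import Action, Digest

command_digest = Digest(hash="<command hash>", size_bytes=142)        # placeholder
input_root_digest = Digest(hash="<input root hash>", size_bytes=80)  # placeholder

# Action used by remote_cache::CommandRunner: cache reads/writes happen
# explicitly there, following our own ProcessCacheScope semantics.
cache_action = Action(
    command_digest=command_digest,
    input_root_digest=input_root_digest,
)

# Action sent to the ExecutionService by remote::CommandRunner:
# do_not_cache tells the server not to store its result in the Action Cache.
execution_action = Action(
    command_digest=command_digest,
    input_root_digest=input_root_digest,
    do_not_cache=True,
)

Since do_not_cache is part of the Action message itself, the two actions hash to different digests, which is presumably part of what makes this fiddly.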

illicitonion (Contributor) commented Oct 7, 2022

There's a separate field which I think is geared towards this use case, skip_cache_lookup; it feels like --no-remote-cache-read should probably set that?
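For context, skip_cache_lookup is a field on the REv2 ExecuteRequest rather than on the Action, so it suppresses only the server-side cache read; it does not stop the server from writing the result back. A sketch with the same Python bindings (the instance name and digest are placeholders):

from build.bazel.remote.execution.v2.remote_execution_pb2 import Digest, ExecuteRequest

action_digest = Digest(hash="<action hash>", size_bytes=147)  # placeholder

request = ExecuteRequest(
    instance_name="main",    # placeholder instance name
    action_digest=action_digest,
    skip_cache_lookup=True,  # execute even if the server already has a cached result
)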

stuhood (Member) commented Oct 7, 2022

Ah, perfect!

Eric-Arellano (Author) commented

> b) --remote-cache-read should toggle the cache generally.

Fixed by #17188

> a) a failed process would not be cached

I'm super confused why there was a cache hit in the first place. We are only supposed to write to the action cache if the exit code was 0, and the above logs clearly show the code was 1. We've never set ProcessCacheScope.ALWAYS on this process, and it was running against a recently created BuildGrid instance on my dev cluster, so I don't think the entry is left over from long ago.

The only possibility I can think of is that BuildGrid is writing directly to the ActionCache regardless of what we set ProcessCacheScope to?

// remote_cache::CommandRunner only writes to the Action Cache when the
// process succeeded (or when caching failures was explicitly requested):
if !hit_cache
    && (result.exit_code == 0 || write_failures_to_cache)
    && self.cache_write
    && use_remote_cache
{
    let command_runner = self.clone();
    let result = result.clone();
    let write_fut = in_workunit!("remote_cache_write", Level::Trace, |workunit| async move {
        // ... (remainder of the cache write elided in this excerpt)

tdyas (Contributor) commented Oct 11, 2022

> The only possibility I can think of is that BuildGrid is writing directly to the ActionCache regardless of what we set ProcessCacheScope to?

Remote execution systems will generally always write to the remote Action Cache. That is an expected part of how the Remote Execution API works.

Eric-Arellano added a commit that referenced this issue Oct 11, 2022
Eric-Arellano added a commit that referenced this issue Oct 14, 2022
Closes #17142.

We rely on the RemoteCache command runner for caching with remote execution. We always disable remote servers from doing caching themselves not only to avoid wasted work, but more importantly because they do not have our same caching semantics, e.g. `ProcessCacheScope.SUCCESSFUL` vs `ProcessCacheScope.ALWAYS`.
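A compact way to state the Pants-side semantics that a remote server cannot be expected to replicate (a hypothetical helper only; the real check is the Rust guard quoted earlier, and the enum members come from pants.engine.process):

from pants.engine.process import ProcessCacheScope

def should_cache(exit_code: int, cache_scope: ProcessCacheScope) -> bool:
    # Mirrors the engine's guard: a failed result is cached only when the
    # scope explicitly opts into caching failures.
    return exit_code == 0 or cache_scope in (
        ProcessCacheScope.ALWAYS,
        ProcessCacheScope.PER_RESTART_ALWAYS,
    )

# The `env -0` process above uses the default SUCCESSFUL scope, so its
# exit_code=1 result should never have been written to the cache.
assert not should_cache(1, ProcessCacheScope.SUCCESSFUL)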