Remote execution is caching failed process #17142

Closed · Eric-Arellano opened this issue Oct 7, 2022 · 5 comments · Fixed by #17198


The remote execution service I'm using is failing when running this process:

env_process_result = await Get(
    ProcessResult,
    Process(
        ["env", "-0"],
        description=f"Extract environment variables from {description_of_env_source}",
        level=LogLevel.DEBUG,
    ),
)

I was confused that rerunning Pants did not result in anything being submitted to the RE servers. Turns out, -ldebug showed this:

11:42:19.00 [DEBUG] Running Extract environment variables from the remote execution environment under semaphore with concurrency id: 1, and concurrency: 1
11:42:19.24 [DEBUG] Remote execution: Extract environment variables from the remote execution environment
11:42:19.24 [DEBUG] built REv2 request (...)
11:42:19.24 [DEBUG] Completed: Remote cache lookup for: Extract environment variables from the remote execution environment
11:42:19.24 [DEBUG] remote cache hit for: "Extract environment variables from the remote execution environment" digest=Digest { hash: Fingerprint<df716cefa1e1d42f1483193e95580accf9d0ea2b443aab1dd73e88628842a10b>, size_bytes: 139 } response=FallibleProcessResultWithPlatform { stdout_digest: Digest { hash: Fingerprint<e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855>, size_bytes: 0 }, stderr_digest: Digest { hash: Fingerprint<8d67d941a3599044bcba432b2378540c7ed3abafcbd79238b89126517dba6f8a>, size_bytes: 49 }, exit_code: 1, output_directory: DirectoryDigest { digest: Digest { hash: Fingerprint<e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855>, size_bytes: 0 }, tree: "Some(..)" }, platform: Linux_x86_64, metadata: ProcessResultMetadata { total_elapsed: Some(Duration { secs: 0, nanos: 471865000 }), source: HitRemotely, source_run_id: RunId(2) } }

This was the case regardless of --no-pantsd and --no-remote-cache-read.

I would expect that a) a failed process would not be cached, and b) --remote-cache-read should toggle the cache generally.

I worked around this by setting ProcessCacheScope.PER_SESSION, which guarantees a cache miss on every run.
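For reference, the workaround amounts to setting cache_scope on the Process from the snippet above (a sketch; Process, ProcessCacheScope, and ProcessResult live in pants.engine.process, and the Get runs inside an async @rule):

from pants.engine.process import Process, ProcessCacheScope, ProcessResult
from pants.engine.rules import Get
from pants.util.logging import LogLevel

# PER_SESSION scopes the cache entry to the current Pants session, so every
# fresh run misses the cache and re-executes the process.
env_process_result = await Get(
    ProcessResult,
    Process(
        ["env", "-0"],
        description=f"Extract environment variables from {description_of_env_source}",
        level=LogLevel.DEBUG,
        cache_scope=ProcessCacheScope.PER_SESSION,
    ),
)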

Eric-Arellano self-assigned this Oct 7, 2022
stuhood (Member) commented Oct 7, 2022

So, we have a remedy for this, I think... but it's a little bit fiddly.

We now explicitly wrap the remote_cache::CommandRunner around the remote::CommandRunner, and so we will read/write (or not) to the cache using our Process semantics in that runner. But the remote ExecutionService server can also choose to do a cache read/write when we execute a process, and it can choose its own semantics for the write.

The fiddly solution for this would be to set do_not_cache only on the Action which we use in remote::CommandRunner, and not in the Action that we use in remote_cache::CommandRunner.

Whether that fiddly solution is worth it, though, is unclear. cc @tdyas, @illicitonion
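To make the split concrete, here is a rough sketch in terms of the REv2 protos, using the remote-apis Python bindings purely for illustration (the engine does this in Rust, and the digest values here are placeholders):

from build.bazel.remote.execution.v2.remote_execution_pb2 import Action, Digest

command_digest = Digest(hash="<command hash>", size_bytes=142)        # placeholder
input_root_digest = Digest(hash="<input root hash>", size_bytes=80)  # placeholder

# Action used by remote_cache::CommandRunner: cache reads/writes happen
# explicitly there, following our own ProcessCacheScope semantics.
cache_action = Action(
    command_digest=command_digest,
    input_root_digest=input_root_digest,
)

# Action sent to the ExecutionService by remote::CommandRunner:
# do_not_cache tells the server not to store its result in the Action Cache.
execution_action = Action(
    command_digest=command_digest,
    input_root_digest=input_root_digest,
    do_not_cache=True,
)

Since do_not_cache is part of the Action message itself, the two actions hash to different digests, which is presumably part of what makes this fiddly.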

illicitonion (Contributor) commented Oct 7, 2022

There's a separate field which I think is geared towards this use case, skip_cache_lookup; it feels like --no-remote-cache-read should probably set that?
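For context, skip_cache_lookup is a field on the REv2 ExecuteRequest rather than on the Action, so it suppresses only the server-side cache read; it does not stop the server from writing the result back. A sketch with the same Python bindings (the instance name and digest are placeholders):

from build.bazel.remote.execution.v2.remote_execution_pb2 import Digest, ExecuteRequest

action_digest = Digest(hash="<action hash>", size_bytes=147)  # placeholder

request = ExecuteRequest(
    instance_name="main",    # placeholder instance name
    action_digest=action_digest,
    skip_cache_lookup=True,  # execute even if the server already has a cached result
)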

stuhood (Member) commented Oct 7, 2022

Ah, perfect!

Eric-Arellano (Author) commented

> b) --remote-cache-read should toggle the cache generally.

Fixed by #17188

> a) a failed process would not be cached

I'm super confused why there was a cache hit in the first place. We are only supposed to write to the action cache if the exit code was 0, and the above logs clearly show the code was 1. We've never set ProcessCacheScope.ALWAYS on this process, and it was running against a recently created BuildGrid instance on my dev cluster, so I don't think the entry is left over from long ago.

The only possibility I can think of is that BuildGrid is writing directly to the ActionCache regardless of what we set ProcessCacheScope to?

// remote_cache::CommandRunner only writes to the Action Cache when the
// process succeeded (or when caching failures was explicitly requested):
if !hit_cache
    && (result.exit_code == 0 || write_failures_to_cache)
    && self.cache_write
    && use_remote_cache
{
    let command_runner = self.clone();
    let result = result.clone();
    let write_fut = in_workunit!("remote_cache_write", Level::Trace, |workunit| async move {
        // ... (remainder of the cache write elided in this excerpt)

tdyas (Contributor) commented Oct 11, 2022

> The only possibility I can think of is that BuildGrid is writing directly to the ActionCache regardless of what we set ProcessCacheScope to?

Remote execution systems will generally always write to the remote Action Cache. That is an expected part of how the Remote Execution API works.

Eric-Arellano added a commit that referenced this issue Oct 11, 2022
Eric-Arellano added a commit that referenced this issue Oct 14, 2022
Closes #17142.

We rely on the RemoteCache command runner for caching with remote execution. We always disable remote servers from doing caching themselves not only to avoid wasted work, but more importantly because they do not have our same caching semantics, e.g. `ProcessCacheScope.SUCCESSFUL` vs `ProcessCacheScope.ALWAYS`.
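A compact way to state the Pants-side semantics that a remote server cannot be expected to replicate (a hypothetical helper only; the real check is the Rust guard quoted earlier, and the enum members come from pants.engine.process):

from pants.engine.process import ProcessCacheScope

def should_cache(exit_code: int, cache_scope: ProcessCacheScope) -> bool:
    # Mirrors the engine's guard: a failed result is cached only when the
    # scope explicitly opts into caching failures.
    return exit_code == 0 or cache_scope in (
        ProcessCacheScope.ALWAYS,
        ProcessCacheScope.PER_RESTART_ALWAYS,
    )

# The `env -0` process above uses the default SUCCESSFUL scope, so its
# exit_code=1 result should never have been written to the cache.
assert not should_cache(1, ProcessCacheScope.SUCCESSFUL)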