Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NullPointerException with bazel@HEAD in CI #488

Closed
sgowroji opened this issue Feb 26, 2024 · 11 comments
Closed

NullPointerException with bazel@HEAD in CI #488

sgowroji opened this issue Feb 26, 2024 · 11 comments

Comments

@sgowroji
Copy link
Member

sgowroji commented Feb 26, 2024

CI: https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/3681#018de327-837e-4bb4-8986-ec35f627c4d9

Platform: Ubuntu

Logs:

FATAL: bazel crashed due to an internal error. Printing stack trace:
java.lang.RuntimeException: Unrecoverable error while evaluating node 'UnshareableActionLookupData{actionLookupKey=ConfiguredTargetKey{label=//tests:subpackages_all_w_exclude, config=BuildConfigurationKey[86b97794d575f038822a36ec1392fdddcdcaebfcd711a3ac8088056fc5bb38f5]}, actionIndex=5}' (requested by nodes 'TestCompletionKey{configuredTargetKey=ConfiguredTargetKey{label=//tests:subpackages_all_w_exclude, config=BuildConfigurationKey[86b97794d575f038822a36ec1392fdddcdcaebfcd711a3ac8088056fc5bb38f5]}, topLevelArtifactContext=com.google.devtools.build.lib.analysis.TopLevelArtifactContext@90904c3b, exclusiveTesting=false}')
	at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:551)
	at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:414)
	at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinPool.scan(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source)
Caused by: java.lang.NullPointerException
	at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:903)
	at com.google.devtools.build.lib.vfs.Path.getRelative(Path.java:121)
	at com.google.devtools.build.lib.sandbox.SandboxStash.takeStashedSandboxInternal(SandboxStash.java:99)
	at com.google.devtools.build.lib.sandbox.SandboxStash.takeStashedSandbox(SandboxStash.java:73)
	at com.google.devtools.build.lib.sandbox.SymlinkedSandboxedSpawn.filterInputsAndDirsToCreate(SymlinkedSandboxedSpawn.java:78)
	at com.google.devtools.build.lib.sandbox.AbstractContainerizingSandboxedSpawn.createFileSystem(AbstractContainerizingSandboxedSpawn.java:135)
	at com.google.devtools.build.lib.sandbox.AbstractSandboxSpawnRunner.runSpawn(AbstractSandboxSpawnRunner.java:146)
	at com.google.devtools.build.lib.sandbox.AbstractSandboxSpawnRunner.exec(AbstractSandboxSpawnRunner.java:113)
	at com.google.devtools.build.lib.sandbox.SandboxModule$SandboxFallbackSpawnRunner.exec(SandboxModule.java:456)
	at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:160)
	at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:119)
	at com.google.devtools.build.lib.exec.SpawnStrategyResolver.exec(SpawnStrategyResolver.java:45)
	at com.google.devtools.build.lib.exec.StandaloneTestStrategy.runTestAttempt(StandaloneTestStrategy.java:652)
	at com.google.devtools.build.lib.exec.StandaloneTestStrategy.beginTestAttempt(StandaloneTestStrategy.java:315)
	at com.google.devtools.build.lib.exec.StandaloneTestStrategy$StandaloneTestRunnerSpawn.execute(StandaloneTestStrategy.java:577)
	at com.google.devtools.build.lib.analysis.test.TestRunnerAction.executeAllAttempts(TestRunnerAction.java:1162)
	at com.google.devtools.build.lib.analysis.test.TestRunnerAction.execute(TestRunnerAction.java:974)
	at com.google.devtools.build.lib.analysis.test.TestRunnerAction.execute(TestRunnerAction.java:951)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.executeAction(SkyframeActionExecutor.java:1140)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.run(SkyframeActionExecutor.java:1053)
	at com.google.devtools.build.lib.skyframe.ActionExecutionState.runStateMachine(ActionExecutionState.java:166)
	at com.google.devtools.build.lib.skyframe.ActionExecutionState.getResultOrDependOnFuture(ActionExecutionState.java:95)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:547)
	at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:903)
	at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.computeInternal(ActionExecutionFunction.java:349)
	at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:191)
	at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:462)
	... 7 more

Steps:

git clone https://github.com/bazelbuild/bazel-skylib
git reset e60cf0001fdc1500cf5ba2a6bf4ca0b46a1f725f  --hard
export USE_BAZEL_VERSION=9d34f8ab0f1ffb18900feaeb23cb16c93f4e0139
bazel test  //... @bazel_skylib_gazelle_plugin//... 

CC Greenteam @meteorcloudy

@fmeum
Copy link
Contributor

fmeum commented Feb 26, 2024

@oquenchil
Copy link
Contributor

Taking a look.

@oquenchil
Copy link
Contributor

I'm a bit puzzled about why this happens. I can't reproduce locally with 100000 runs per test, I don't see a race condition and even if the variables that I take from the environment TEST_WORKSPACE and TEST_SRCDIR were null, the string placed inside the map wouldn't be null.

The only thing I can think of right now is adding more logging to detect if we put/get null from the concurrent hash map, then log all the environment variables in the environment and all the entries in the map.

We'd check that in and if we see the exception again use the new info to find out what's going on.

@fmeum wdyt?

@fmeum
Copy link
Contributor

fmeum commented Feb 26, 2024

The renameTo call isn't guaranteed to be atomic in all filesystem setups, so couldn't (at least in theory) two test actions both enter the isTestAction branch in takeStashedSandboxInternal for the same stashed sandbox? That would explain the issue.

But yes, more logging would certainly help.

@oquenchil
Copy link
Contributor

I thought it was always atomic within the same filesystem. In which filesystem types isn't it atomic?

@fmeum
Copy link
Contributor

fmeum commented Feb 26, 2024

I thought it was always atomic within the same filesystem. In which filesystem types isn't it atomic?

That does seem to be true for the UnixFileSystem, but the JavaIoFileSystem may not be as File#renameTo isn't guaranteed to be atomic according to its javadocs and is used here: https://cs.opensource.google/bazel/bazel/+/master:src/main/java/com/google/devtools/build/lib/vfs/JavaIoFileSystem.java;l=318;drc=9d34f8ab0f1ffb18900feaeb23cb16c93f4e0139

It shouldn't be used on Ubuntu unless there is some other bug, so you are right that this isn't likely to be the cause here.

@fmeum
Copy link
Contributor

fmeum commented Feb 26, 2024

I think that there is a race, I'm just not sure whether it can arise in practice: If spawn #1 blocks on the ConcurrentHashMap#put call in https://cs.opensource.google/bazel/bazel/+/f05c9d0b8d32d29847d5b16af1e5f8c20d11f66d:src/main/java/com/google/devtools/build/lib/sandbox/SandboxStash.java;l=160, then it will have already moved its sandbox dir into the stash. Spawn #2 could discover and move this directory before the put returns and would then run into an NPE when querying the map.

@oquenchil
Copy link
Contributor

Ah yes, but would we have seen this error? From the error I'd say there was an entry in the map but it was actually null somehow.

Apart from logging the environment variables and the contents of the map, can you think of anything else that would be useful to log?

@fmeum
Copy link
Contributor

fmeum commented Feb 26, 2024

Ah yes, but would we have seen this error? From the error I'd say there was an entry in the map but it was actually null somehow.

Map#get also returns null if the key isn't in the map. I think that's more likely than the value actually being null.

Apart from logging the environment variables and the contents of the map, can you think of anything else that would be useful to log?

The result of the readdir of the stashes directory would also be interesting.

@oquenchil
Copy link
Contributor

Wonderful, thanks!

@sgowroji
Copy link
Member Author

sgowroji commented Mar 1, 2024

Downstream CI of bazel-skylib is Green now https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/3689#018df801-66e8-42eb-8fcf-b047fad7a8cd. We can close this issue.

@sgowroji sgowroji closed this as completed Mar 4, 2024
copybara-service bot pushed a commit to bazelbuild/bazel that referenced this issue Mar 14, 2024
Attempts to address NPE reported in: bazelbuild/bazel-skylib#488 (comment) and #21665 (comment)

The `put()` call to the runfiles dir map is placed before the call that stashes the corresponding directory to address the race condition described here: bazelbuild/bazel-skylib#488 (comment).

The exception will now log:
- entries in the runfiles dir map
- environment variables
- stashes on disk

Closes #21668.

PiperOrigin-RevId: 615739651
Change-Id: Ida90e334d1d1f890cf204d272134726bb1f70eb9
bazel-io pushed a commit to bazel-io/bazel that referenced this issue Mar 14, 2024
Attempts to address NPE reported in: bazelbuild/bazel-skylib#488 (comment) and bazelbuild#21665 (comment)

The `put()` call to the runfiles dir map is placed before the call that stashes the corresponding directory to address the race condition described here: bazelbuild/bazel-skylib#488 (comment).

The exception will now log:
- entries in the runfiles dir map
- environment variables
- stashes on disk

Closes bazelbuild#21668.

PiperOrigin-RevId: 615739651
Change-Id: Ida90e334d1d1f890cf204d272134726bb1f70eb9
github-merge-queue bot pushed a commit to bazelbuild/bazel that referenced this issue Mar 14, 2024
…message (#21692)

Attempts to address NPE reported in:
bazelbuild/bazel-skylib#488 (comment)
and
#21665 (comment)

The `put()` call to the runfiles dir map is placed before the call that
stashes the corresponding directory to address the race condition
described here:
bazelbuild/bazel-skylib#488 (comment).

The exception will now log:
- entries in the runfiles dir map
- environment variables
- stashes on disk

Closes #21668.

Commit
59dbf7a

PiperOrigin-RevId: 615739651
Change-Id: Ida90e334d1d1f890cf204d272134726bb1f70eb9

Co-authored-by: Pedro <plf@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants