Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Build graph invalidation with custom flags from external repositories #17853

Open
layus opened this issue Mar 22, 2023 · 6 comments
Open

Build graph invalidation with custom flags from external repositories #17853

layus opened this issue Mar 22, 2023 · 6 comments
Labels
team-Core Skyframe, bazel query, BEP, options parsing, bazelrc type: bug untriaged

Comments

@layus
Copy link
Contributor

layus commented Mar 22, 2023

Description of the bug:

To fix issues with rules_docker transitions leaking into our own transitions and multiplicating the number of different configurations, we have added build --@io_bazel_rules_docker//transitions:enable=no to our .bazelrc, as per bazelbuild/rules_docker#2068

That works, but introduced a subtle and hard to track issue where running some bazel commands like cquery, dump or config would in some cases invalidate large parts of the daemon's build graph. I started investigating several months ago, and saw a breakthrough only recently.

It seems that these less frequently used commands have trouble taking that value into account. There is also some history with workspace label prefixes in custom build flags. See in general #11128 and #9177 for historical, somewhat related issues.

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

This works for me, be I still have to provide a minimal repo where it works:

echo "build --@io_bazel_rules_docker//transitions:enable=no" >> .bazelrc
# Get a proper build graph in cache
bazel build //...
# List current configurations
bazel config
# Invalidate the cache with a command that does no understand/use the above flag
bazel info output_path
# Observe that the set of cached configurations has changed
bazel config

Which operating system are you running Bazel on?

Ubuntu Linux 20.04 (+nix)

What is the output of bazel info release?

release 6.0.0- (@non-git)

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

Its bazel 6.0.0 from nixpkgs.

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Have you found anything relevant by searching the web?

See description above

Any other information, logs, or outputs that you want to share?

No response

@layus layus changed the title Evaluation invalidation with custom flags from external repositories Build graph invalidation with custom flags from external repositories Mar 22, 2023
@ShreeM01 ShreeM01 added type: bug team-Configurability platforms, toolchains, cquery, select(), config transitions untriaged labels Mar 22, 2023
@layus
Copy link
Contributor Author

layus commented Mar 22, 2023

Took time to make a proper reproduction procedure: https://gist.github.com/layus/57fd0aeba78a9a67e8e684379f829e4d.
I discovered several thing doing so:

  1. Bazel 5 had a warning when running bazel info: WARNING: info command does not support starlark options. Ignoring options: [--@io_bazel_rules_docker//transitions:enable=no]. That warning is gone in bazel 6, but that specific option is still ignored. My guess is because it references an external repo, but that remains to be proven.

  2. On that small example, the graph is not immediately pruned. I suspect this happens on our monorepo due to it's size (600k actions on a full build). Is there some gc of the graph happening above a memory threshold ?

  3. This makes the script simpler, as previous configs are not garbage collected. I am even able to compare them with bazel config $bad $good. We see that the only difference is said flag which is incorrectly omitted when calling bazel info : https://gist.github.com/layus/57fd0aeba78a9a67e8e684379f829e4d#file-output-log-L72-L76

    Displaying diff between configs 35b359 and fbc5f0
    FragmentOptions user-defined {
      @io_bazel_rules_docker//transitions:enable: null, false
    }
    

@layus
Copy link
Contributor Author

layus commented Mar 22, 2023

This in turns hints at more related issues:

#13473, where this comment #13473 (comment) happens to be incorrect:

[...] The info and clean commands no longer warn, and starlark flags have no effect on these commands.
We discover that while starlark flags have no effects on the output of these commands, running these commands has an impact on the daemon state. The incomplete configuration that these commands generate for their own use bump the internal graph version, and allow correct and needed nodes to be garbage collected.

This happens to be an issue because it discards useful state from the daemon. Regenerating that data completely takes more than one minute. bazel info (and bazel {c,}query to some extent) should not have that effect. Outside of the inefficiency, it is extremely surprising to have a sequence like bazel cquery; bazel info; bazel config <some config just printed> fail because the config is not present anymore in the cache.

Data is not retained for as long as naively expected.

@layus
Copy link
Contributor Author

layus commented Mar 23, 2023

I have a rough fix that "works for me" in #17875. Mostly POC and WIP and untested in the parsing error branch, but usable and functional.

@aranguyen aranguyen added team-Core Skyframe, bazel query, BEP, options parsing, bazelrc and removed team-Configurability platforms, toolchains, cquery, select(), config transitions labels Mar 31, 2023
@aranguyen aranguyen removed their assignment Mar 31, 2023
@aranguyen
Copy link
Contributor

@layus as you have discovered already in #17875, commands like info and clean do not take starlark options into consideration. The reasoning behind this current restriction is because The info and clean commands have builds=true set in their annotation but don't actually do any building and starlark option loading might not work in these cases. The configuratbility team no longer owns this part of the code base so I'm routing this to the appropriate team for their input.

@gregestren
Copy link
Contributor

Regenerating that data completely takes more than one minute

Which data specifically do you mean?

I see the effect of bazel config producing different results. But I'm not sure the actual build graph itself gets invalidated because of this.

One other debugging tool to look at this is bazel dump --skyrame=count: this literaly dumps the state of the whole build graph on any call.

Also, I'd expect cquery to be completely consistent with build. Are you seeing otherwise?

@layus
Copy link
Contributor Author

layus commented Aug 9, 2023

I am not able to dig that issue right now, probably in the coming days or weeks. AFAIR the issue is that BASELINE_CONFIGURATION is changed, and that "invalidates" the whole graph. Restoring it to the normal, original value on the next build then needs to check the validity of all nodes. This basically means a traversal of all the graph nodes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
team-Core Skyframe, bazel query, BEP, options parsing, bazelrc type: bug untriaged
Projects
None yet
Development

No branches or pull requests

4 participants