Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

build: rework node resolution #1966

Merged
merged 1 commit into from
Nov 14, 2023

Conversation

jedevc
Copy link
Collaborator

@jedevc jedevc commented Jul 24, 2023

This patch reworks and updates the node resolution logic for selecting a node from a builder.

The new implementation reworks the logic to make use of containerd's platforms.MatchComparer interface instead of manually associated strings, and additionally provides a few behavioural changes over the original implementation:

  • Platforms can be matched with non-strict semantics. e.g. i386 builds can be scheduled on an amd64 node, arm/v7 builds can be scheduled on an arm/v8 node.

We also add a new collection of tests for tracking regressions and making the intended behaviour clearer.

build/resolver.go Outdated Show resolved Hide resolved
build/resolver_test.go Outdated Show resolved Hide resolved
Copy link
Member

@crazy-max crazy-max left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit out of scope but it would be nice to check if supported platforms reported by BuildKit are emulated or not and if so, native ones should take priority.

build/resolver_test.go Outdated Show resolved Hide resolved
build/resolver_test.go Outdated Show resolved Hide resolved
build/resolver_test.go Outdated Show resolved Hide resolved
Copy link
Member

@crazy-max crazy-max left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made some tests on my side and matching is way smarter than before 🎉

As said in #1966 (review) we could check in follow-up if supported platforms reported by BuildKit are emulated or not and if so, native one should take priority.

I was thinking we would need BuildKit changes to return the native platform as a new worker opt but it doesn't seem necessary as it's always the first item of the list of supported platforms https://github.com/moby/buildkit/blob/687091bb6c8aaa0185cdc570c4db3db533f329d0/util/archutil/detect.go#L23.

Copy link
Member

@tonistiigi tonistiigi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One problem with this is that (imho) the platforms.Only implementation for arm64 is wrong in containerd. It assumes that all arm64 mean compatibility with arm but that is not the case in practice, eg. no Apple chip is 32bit compatible.

We should at least make sure that if there is both arm node and arm64 node then arm builds will not match with arm64.

build/resolver.go Outdated Show resolved Hide resolved
build/resolver.go Outdated Show resolved Hide resolved
build/resolver.go Outdated Show resolved Hide resolved
build/build.go Outdated Show resolved Hide resolved
build/resolver.go Outdated Show resolved Hide resolved
build/resolver.go Outdated Show resolved Hide resolved
build/resolver.go Outdated Show resolved Hide resolved
build/resolver.go Outdated Show resolved Hide resolved
}
}

func (m *archMatcher) Match(p specs.Platform) bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would building on same arch for a different OS be faster than any other node?

Eg. if you are building for darwin/arm64 and have nodes linux/amd64, linux/arm64 then it should not go to linux/arm64 but to the first node.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The theory of this is not about faster speed, but about output artifacts.

OS of a system can be a bit of a lie sometimes, e.g. WSL, or Docker Desktop, etc. However, in those cases, while not perfect, it would be better to match to the architecture, instead of defaulting to just the first.

In your scenario, this could happen with a remote builder connected to Docker Desktop - if I build and --load without a --platform flag, then we'll just select the first builder, which might be the non-native architecture linux/amd64. We should prefer to load the linux/arm64 image to the container store instead.

It's not perfect by any measure, but I think it's a bit more reasonable than just picking the first. This way, we don't require that the user reconfigures the order of their builders (I'm not sure what the UX of that would even be like) to get the desired default behavior.

Copy link
Member

@tonistiigi tonistiigi Aug 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(🤔 I thought I already left a comment for this, either I forgot to submit or it was somewhere else)

The feature you are describing here is determining the implicit platform (if none is given by user) based on DOCKER_CONTEXT/-o context= value if --load is used. If we want this then it should be specific to the --load behavior and look at the context value. Without --load it should be the native platform of the first node.

which might be the non-native architecture linux/amd64.

This is only the case when there is mismatch of configurations between builder and load target. Otherwise the first is always native. I don't understand how ignoring the OS allows determining the load target.

OS of a system can be a bit of a lie sometimes, e.g. WSL, or Docker Desktop, etc

These are server-side platforms. I don't understand where the difference could come from.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed this for now, I'll open a follow-up PR to do this at some point.

@jedevc jedevc marked this pull request as draft August 3, 2023 13:17
This patch reworks and updates the node resolution logic for selecting a
node from a builder.

The new implementation reworks the logic to make use of containerd's
platforms.Matcher interface instead of manually associated strings, and
additionally provides a few behavioural changes over the original
implementation, namely platforms can be matched with non-strict
semantics. e.g. i386 builds can be scheduled on an amd64 node, arm/v6
builds can be scheduled on an arm/v7 node.

We also add a new collection of tests for tracking regressions and
making the intended behaviour clearer.

Signed-off-by: Justin Chadwell <me@jedevc.com>
@jedevc jedevc marked this pull request as ready for review August 4, 2023 09:31
@jedevc
Copy link
Collaborator Author

jedevc commented Aug 22, 2023

@crazy-max @tonistiigi does this make sense to take for v0.12?

@crazy-max
Copy link
Member

@crazy-max @tonistiigi does this make sense to take for v0.12?

I would like this one in 0.12 as well. WDYT @tonistiigi?

@jedevc jedevc added this to the v0.12.0 milestone Nov 8, 2023
@tonistiigi tonistiigi merged commit 629b555 into docker:master Nov 14, 2023
59 checks passed
@tonistiigi
Copy link
Member

😬 , I forgot to rerun CI before merging (all was green and no conflicts), and this broke everything so I had to revert. Please reopen.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/driver kind/enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants