Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

scheduler: annotate tasksUpdated with reason and purge DeepEquals #16421

Merged
merged 7 commits into from
Mar 14, 2023

Conversation

shoenig
Copy link
Member

@shoenig shoenig commented Mar 10, 2023

reflect.DeepEqual is not suitable for comparing Job structs.

By design, it

  • treats nil vs empty slices/maps as different

Note that a non-nil empty slice and a nil slice (for example, []byte{} and []byte(nil)) are not deeply equal.

  • compares private struct fields

Struct values are deeply equal if their corresponding fields, both exported and unexported, are deeply equal.

The use of reflect.DeepEqual has been the root of bugs a few times in the past.

#14917
#10896
#9330

This PR introduces the OpaqueMapEquals helper function as a replacement for DeepEqual for the specific cases of comparing map[comparable]interface{}, which is necessary in cases like opaque plugin config, and envoy configs. Unlike DE, OME exhibits the opposite behavior in the two scenarios above. For other (known) types, implement .Equal method on structs and compare values directly, or defer to other sensible helpers.

The tasksUpdated helper has also been modified to return a struct containing the first mis-match of fields triggering a destructive update (as opposed to a simple boolean). Folks tracking down misbehavior now need only to add a log line to understand why tasksUpdated is (or isn't) triggering updates.

Benchmark

before

nomad/scheduler on main
➜ go test -count=3 -bench=BenchmarkTasksUpdated -run BenchmarkTasksUpdated
BenchmarkTasksUpdated-24          105920             14791 ns/op
BenchmarkTasksUpdated-24           82866             14376 ns/op
BenchmarkTasksUpdated-24           89450             14674 ns/op
PASS
ok      github.com/hashicorp/nomad/scheduler    4.485s

after

nomad/scheduler on tasks-updated-annotate
➜ go test -count=3 -bench=BenchmarkTasksUpdated -run BenchmarkTasksUpdated
BenchmarkTasksUpdated-24           54816             24469 ns/op
BenchmarkTasksUpdated-24           47431             24392 ns/op
BenchmarkTasksUpdated-24           48696             24565 ns/op
PASS
ok      github.com/hashicorp/nomad/scheduler    4.436s

The slowdown is primarily caused by OpaqueMapEquals - replacing its implementation with DeepEqual gets the numbers back down. But these numbers are still negligible and the tradeoff for correctness seems worthwhile.

Closes #16290

Copy link
Member

@schmichael schmichael left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not going to count my Consul Config comparison concerns as blockers. I bet it's ok to make these changes. The breakages we've had with Consul's API in the past have always been a lot more obvious than nil vs empty.

I think any time any of us writes an Equal method we should remind one another to someday use a code generator. 😅

@@ -0,0 +1,48 @@
package lang
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally forgot this package existed. The only other file in here defines Pair which is never referenced directly but only in other types. I think helper is the package most of us look in for common comparison funcs, so my preference would be to use that and kill this, but it really doesn't matter much.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, moved this into helper. At some point it might be nice to split up helper into Nomad-specific helpers (like CheckHCLKeys) vs. language constructs (like this), but we can bikeshed that on a rainy day 🙂

Comment on lines 9188 to 9192
func (a *Affinity) Hash() string {
if a == nil {
return "(nil)"
}
return a.String()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this and Spread.Hash for? Having a Hash method return non-hash-derived values seems strange as String() formatting changes for testing/debugging will break Hash compatibility even when the actual struct's contents haven't changed. However these methods don't appear to be called directly anywhere, so I'm not even sure what they're for.

Our use of Hash() methods is already really inconsistent already both in usage pattern (does it generate a hash on every call or does it require calling SetHash first?!) and return type (ints, strings, bytes 🤷). I don't want to make that situation worse without good reason. We've kind of made a mess of any common meaning for Hash.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Affinity.Hash and Spread.Hash are being used

here: https://github.com/hashicorp/nomad/pull/16421/files#diff-25a0c4de309c3511db80cb2707981cda3326bfd49f29148222be44ba9a529fe2R488

and here: https://github.com/hashicorp/nomad/pull/16421/files#diff-25a0c4de309c3511db80cb2707981cda3326bfd49f29148222be44ba9a529fe2R517

Previously we created a slice of all the Affinity/Spread on the job from each level and used reflect.DeepEqual on the slice. Implementing Hash() and deferring to HashSet was just a lazy way doing something like creating the slice and using slices.EqualsFunc - but I can make that change.

@@ -1409,7 +1401,7 @@ func (p *ConsulProxy) Equal(o *ConsulProxy) bool {
return false
}

if !opaqueMapsEqual(p.Config, o.Config) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this safe? Can we be sure that Consul doesn't treat a nested empty map differently from a nil map? If the map is a nested value of Config (eg Config["foo"] = map[string]string{}), I could see nil meaning unset and empty meaning set but use defaults. Even it it's safe today can we be sure it will be safe for all future versions of Consul?

Maybe to put it another way: do we gain anything by not using DeepEqual in these cases? It seems like making any assumptions about what's meaningful in the keys and values of opaque maps is ... well: not opaque!

Copy link
Member Author

@shoenig shoenig Mar 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the flip side - what if Consul (or Envoy!) normalizes these fields one way or the other, and the original job uses the other form?

These opaque fields are a landmine no matter what, and I think it's more likely we run into the normalization problem.

Copy link
Member Author

@shoenig shoenig Mar 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After a nice nap I realized this is nonsense - when Consul/Envoy normalize these fields they come back via our requesting of the Consul API, and that comparison is done in consul/service_client, which itself uses DeepEqual for the opaque maps.

https://github.com/hashicorp/nomad/blob/main/command/agent/consul/service_client.go#L343

I'll switch the comparison for envoy fields here back to DeepEqual to match

Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Can we add a note to the jobspec checklist about when to implement an Equal method? Do we want to lean on this method for nomad/structs/diff.go as well? (Probably in a separate PR)

nomad/structs/structs.go Show resolved Hide resolved
Copy link
Contributor

@lgfa29 lgfa29 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we talked about it in the past, but it would nice to have some automated, randomized, fuzzed tests for this kind of thing.


// tasksUpdated creates a comparison between task groups to see if the tasks, their
// drivers, environment variables or config have been modified.
func tasksUpdated(jobA, jobB *structs.Job, taskGroup string) comparison {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably a very loaded question, and it may be a premature micro-optimization, but do we have a sense of which fields are most likely to be updated?

Since we return early on first change detected I think it would make sense to defer comparison for less used fields (like ephemeral disks) to the bottom and bring fields that more likely to change (like task config or env) closer to the top.

Would it be helpful to also quickly hash the structs with something like https://github.com/mitchellh/hashstructure to exit early if the hashes match, meaning that no update happened?

Copy link
Member Author

@shoenig shoenig Mar 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we return early on first change detected I think it would make sense to defer comparison for less used fields

I thought about this, and introduced this Benchmark in case someone wants to improve things. For this PR I really only care about correctness.

Would it be helpful to also quickly hash the structs

Unfortunately no generic hashing or equals methodology will work here - there's a handful of custom comparisons we need to do (i.e. ignoring or normalizing fields) even outside of Canonicalize, which you can see in the helper methods called by tasksUpdated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if we could generate a hash, we'd probably want to cache that hash value as part of the struct for it to benefit performance. Otherwise we'd have to rehash on each pass through the scheduler, which would mean chasing all the same set of pointers we need to do to call the helper methods anyways. That's why ex. the node class hash is precalculated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this, and introduced this Benchmark in case someone wants to improve things. For this PR I really only care about correctness.

Makes sense to focus on correctness. I think the problem with benchmarking something like this is that it's not an empirical performance change, it's more like a heuristic based on what we think is most often modified.

Just for a quick check I modified the benchmark so it updates a task env and moved the task loop to the top and got some improvement:

goos: darwin
goarch: arm64
pkg: github.com/hashicorp/nomad/scheduler
BenchmarkTasksUpdated-10          125265              9232 ns/op
BenchmarkTasksUpdated-10          138152              8697 ns/op
BenchmarkTasksUpdated-10          128814              8714 ns/op
BenchmarkTasksUpdated-10          136887              8711 ns/op
BenchmarkTasksUpdated-10          138432              8690 ns/op
BenchmarkTasksUpdated-10          137956              8736 ns/op
BenchmarkTasksUpdated-10          136660              8736 ns/op
BenchmarkTasksUpdated-10          138224              8728 ns/op
BenchmarkTasksUpdated-10          137574              8729 ns/op
BenchmarkTasksUpdated-10          133634              8733 ns/op
PASS
ok      github.com/hashicorp/nomad/scheduler    13.029s
goos: darwin
goarch: arm64
pkg: github.com/hashicorp/nomad/scheduler
BenchmarkTasksUpdated-10          238004              4909 ns/op
BenchmarkTasksUpdated-10          237406              4527 ns/op
BenchmarkTasksUpdated-10          327616              3654 ns/op
BenchmarkTasksUpdated-10          324906              3646 ns/op
BenchmarkTasksUpdated-10          325503              3808 ns/op
BenchmarkTasksUpdated-10          325220              3822 ns/op
BenchmarkTasksUpdated-10          320337              3730 ns/op
BenchmarkTasksUpdated-10          279225              3694 ns/op
BenchmarkTasksUpdated-10          321788              3704 ns/op
BenchmarkTasksUpdated-10          317722              3680 ns/op
PASS
ok      github.com/hashicorp/nomad/scheduler    12.379s

@shoenig
Copy link
Member Author

shoenig commented Mar 14, 2023

Can we add a note

I added a bit of clarification to the jopspec guide. As for diffing, I think that mostly boils down to the fieldDiffs helper ... and a lot of other custom code. TBH I don't know where to begin cleaning that up, other than to say we shouldn't be writing custom parsing/diff code.

@shoenig
Copy link
Member Author

shoenig commented Mar 14, 2023

With the swap back to DE for Envoy configs,

BenchmarkTasksUpdated-24           96236             13000 ns/op
BenchmarkTasksUpdated-24           93657             13166 ns/op
BenchmarkTasksUpdated-24           90940             13063 ns/op

... go-cmp is really slow 🙁

@shoenig shoenig merged commit 1a01e87 into main Mar 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/1.5.x backport to 1.5.x release line
Projects
None yet
Development

Successfully merging this pull request may close these issues.

scheduler: overhaul util.go/tasksUpdated
4 participants