Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The distributed coordinate subsystem can be infected by floating point NaN values. #3023

Closed
rboyer opened this issue May 9, 2017 · 6 comments
Assignees
Labels
theme/internal-cleanup Used to identify tech debt, testing improvements, code refactoring, and non-impactful optimization type/bug Feature does not function as expected

Comments

@rboyer
Copy link
Member

rboyer commented May 9, 2017

consul version for Server

Server: 0.7.1

consul info for Server

Server:

agent:
	check_monitors = 20
	check_ttls = 1
	checks = 21
	services = 26
build:
	prerelease = 
	revision = 'a9455cd
	version = 0.7.1
consul:
	bootstrap = false
	known_datacenters = 2
	leader = false
	leader_addr = 10.1.1.12:8300
	server = true
raft:
	applied_index = 729213386
	commit_index = 729213386
	fsm_pending = 0
	last_contact = 15.263551ms
	last_log_index = 729213386
	last_log_term = 150913
	last_snapshot_index = 729206402
	last_snapshot_term = 150913
	latest_configuration = [{Suffrage:Voter ID:10.1.1.10:8300 Address:10.1.1.10:8300} {Suffrage:Voter ID:10.1.1.11:8300 Address:10.1.1.11:8300} {Suffrage:Voter ID:10.1.1.12:8300 Address:10.1.1.12:8300}]
	latest_configuration_index = 1
	num_peers = 2
	protocol_version = 1
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Follower
	term = 150913
runtime:
	arch = amd64
	cpu_count = 4
	goroutines = 460
	max_procs = 4
	os = linux
	version = go1.7.3
serf_lan:
	encrypted = true
	event_queue = 0
	event_time = 78420
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 27565
	members = 66
	query_queue = 0
	query_time = 12
serf_wan:
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 332
	members = 6
	query_queue = 0
	query_time = 8

Operating system and Environment details

$ cat /etc/redhat-release
CentOS release 6.8 (Final)

$ uname -a
Linux host1 2.6.32-642.13.1.el6.x86_64 #1 SMP Wed Jan 11 20:56:24 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Description of the Issue (and unexpected/desired result)

Somehow NaN values worked their way through the distributed coordinate subsystem in one of our consul clusters. It first was detected by a failure to JSON serialize some *coordinate.Coordinate structs for the following endpoints:

  • /v1/agent/self
  • /v1/coordinate/nodes

In an effort to see which fields in the server's representation of the coordinates were NaN I recompiled consul with the following patch and then hit the /v1/coordinate/nodes endpoint to get the consul server to dump the coordinate database to stdout:

diff --git a/command/agent/coordinate_endpoint.go b/command/agent/coordinate_endpoint.go
index 9245322..8ae1c5f 100644
--- a/command/agent/coordinate_endpoint.go
+++ b/command/agent/coordinate_endpoint.go
@@ -1,9 +1,11 @@
 package agent
 
 import (
	"github.com/hashicorp/consul/consul/structs"
+	"fmt"
 	"net/http"
 	"sort"
 )
 
 // coordinateDisabled handles all the endpoints when coordinates are not enabled,
@@ -79,5 +81,11 @@ func (s *HTTPServer) CoordinateNodes(resp http.ResponseWriter, req *http.Request
 	if out.Coordinates == nil {
 		out.Coordinates = make(structs.Coordinates, 0)
 	}
+
+	fmt.Println("MAGIC DEBUGGING START >>>")
+	defer fmt.Println("MAGIC DEBUGGING STOP >>>")
+	for i, c := range out.Coordinates {
+		fmt.Printf("\t %d, %s, %+v\n", i, c.Node, c.Coord)
+	}
 	return out.Coordinates, nil
 }

This yielded the amusing results below:

MAGIC DEBUGGING START >>>
         0, host1, &{Vec:[NaN NaN NaN NaN NaN NaN NaN NaN] Error:NaN Adjustment:NaN Height:1e-05}
         1, host2, &{Vec:[NaN NaN NaN NaN NaN NaN NaN NaN] Error:NaN Adjustment:NaN Height:1e-05}
...snip... repeated 62 more times for agents and servers...
MAGIC DEBUGGING STOP >>>

It's everywhere! Our current working theory is that the NaN is infectious when any of the vector math, scaling, and the rest of the vivaldi math process even one NaN value. This is pretty awful.

Amusingly you can serialize a NaN in msgpack but not json.

Initial ideas for how to prevent this from happening in the future:

  • any place in the server or agent code (or down in the serf layer) where coordinates are exchanged, inspect them for NaN (and possibly Inf) and reject the coordinate before it gets internalized
  • periodically in the background on agents and servers, scan the internal coordinate database for NaN values and reset the coordinates to shake out bad floating point numbers
  • (bonus) add some sort of mechanism to trigger the entire distributed coordinate subsystem to reset itself. Possibly by adding an epoch number to the coordinates and treat them like Raft Terms: once you see a newer term you upgrade yourself, reset your coordinates, and ignore coordinates from others from lower terms.

Reproduction steps

It's unclear where the original NaN value came from given how many obvious places in the code already safeguard against this.

@slackpad slackpad added the type/bug Feature does not function as expected label May 9, 2017
@slackpad
Copy link
Contributor

slackpad commented May 9, 2017

Thanks for the report @rboyer. We will take a look through the math again and try to mitigate any other NaN-producing spots, and likely do the first two suggestions. Will think about the third.

@rboyer
Copy link
Member Author

rboyer commented May 9, 2017

As we do not currently use coordinates for anything explicitly, my immediate plans are to reconfigure all of our clusters to set disable_coordinates=true in the consul configuration file. This will prevent problems going forward, but is non ideal.

I poked around but without something akin to the 3rd point above, I can't figure out a good way to purge all of the NaN values from the distributed system without completely nuking it. Initially I thought the following hack to just the voting server nodes would reset it, but I suspect I actually have to deploy a more involved change to all of the agents involved as well to get the change to really stick:

diff --git a/consul/coordinate_endpoint.go b/consul/coordinate_endpoint.go
index 9e0df58..6fd8fa5 100644
--- a/consul/coordinate_endpoint.go
+++ b/consul/coordinate_endpoint.go
@@ -172,6 +181,22 @@ func (c *Coordinate) ListNodes(args *structs.DCSpecificRequest, reply *structs.I
                                return err
                        }
 
+                       for _, coord := range coords {
+                               node := coord.Node
+                               sc := coord.Coord
+                               if !sc.IsValid() { // helper function that checks for NaN and returns true
+                                       c.srv.logger.Printf("[WARN] consul.coordinate: Found NaN in coordinate for node %s", node)
+
+                                       c2 := sc.Clone()
+                                       c2.Fix() // helper function that converts NaN into zeroes
+
+                                       // Add the coordinate to the map of pending updates.
+                                       c.updatesLock.Lock()
+                                       c.updates[node] = c2
+                                       c.updatesLock.Unlock()
+                               }
+                       }

@rboyer
Copy link
Member Author

rboyer commented May 9, 2017

More debugging breadcrumbs:

We scraped any coordinate:... line from /var/lib/consul/serf/local.snapshot across all of the servers and agents. If you load those up and execute coordinate.Coordinate.DistanceTo on all pairs of coordinates they all return a time.Duration of -2562047h47m16.854775808s aka an int64 value of -9223372036854775808 aka math.MinInt64.

@slackpad
Copy link
Contributor

slackpad commented May 9, 2017

Sorry for the hassle on this one. If you stopped your cluster and deleted all those lines from the /var/lib/consul/serf/local.snapshot files I think it would fix this, but that's not ideal either.

@slackpad
Copy link
Contributor

@rboyer if you could share any of your /var/lib/consul/serf/local.snapshot files (even grepped for just coordinates) that might help shed some light on how things went awry. If it's possible to post a gist or email me at james hashicorp.com I'd appreciate it!

@deadok22
Copy link

FYI /ui/#/dc/services/ uses coordinates and breaks completely on disable_coordinates=true

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
theme/internal-cleanup Used to identify tech debt, testing improvements, code refactoring, and non-impactful optimization type/bug Feature does not function as expected
Projects
None yet
Development

No branches or pull requests

3 participants