Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race with table init channel #53

Merged
merged 1 commit into from
Sep 6, 2024

Conversation

joamaki
Copy link
Contributor

@joamaki joamaki commented Sep 6, 2024

The function to mark a table initialized wrongly closed the channel immediately. This could cause an observer to be woken up too early (before Commit() has finished) and get an uninitialized snapshot.

Fix the timing issue by moving the closing into Commit(), next to the normal watch channel closing. This way the initialized table has been committed before the init watch channel closes.

This issue was detected by pkg/k8s/statedb_test.go in cilium/cilium:

     statedb_test.go:263:
         	Error Trace:	/home/runner/work/cilium/cilium/pkg/k8s/statedb_test.go:263
         	            	/home/runner/work/cilium/cilium/pkg/k8s/statedb_test.go:120
         	Error:      	Not equal:
         	            	expected: "obj1"
         	            	actual  : "garbage"

         	            	Diff:
         	            	--- Expected
         	            	+++ Actual
         	            	@@ -1 +1 @@
         	            	-obj1
         	            	+garbage
         	Test:       	TestStateDBReflector/default

The function to mark a table initialized wrongly closed the
channel immediately. This could cause an observer to be woken
up too early (before Commit() has finished) and get an uninitialized
snapshot.

Fix the timing issue by moving the closing into Commit(), next to
the normal watch channel closing. This way the initialized table has
been committed before the init watch channel closes.

This issue was detected by `pkg/k8s/statedb_test.go` in cilium/cilium:

     statedb_test.go:263:
         	Error Trace:	/home/runner/work/cilium/cilium/pkg/k8s/statedb_test.go:263
         	            	/home/runner/work/cilium/cilium/pkg/k8s/statedb_test.go:120
         	Error:      	Not equal:
         	            	expected: "obj1"
         	            	actual  : "garbage"

         	            	Diff:
         	            	--- Expected
         	            	+++ Actual
         	            	@@ -1 +1 @@
         	            	-obj1
         	            	+garbage
         	Test:       	TestStateDBReflector/default

Signed-off-by: Jussi Maki <jussi.maki@isovalent.com>
@joamaki joamaki requested a review from a team as a code owner September 6, 2024 12:43
@joamaki joamaki requested review from pippolo84 and removed request for a team September 6, 2024 12:43
@joamaki
Copy link
Contributor Author

joamaki commented Sep 6, 2024

Verified fix for the flake in cilium/cilium:

cilium/pkg/k8s$ stress -timeout 10s  ./k8s.test -test.run TestStateDBReflector
5s: 1109 runs so far, 0 failures
10s: 2222 runs so far, 0 failures

Copy link

github-actions bot commented Sep 6, 2024

$ make test
go: downloading github.com/mitchellh/mapstructure v1.5.0
go: downloading golang.org/x/term v0.16.0
go: downloading github.com/spf13/cast v1.6.0
go: downloading github.com/fsnotify/fsnotify v1.7.0
go: downloading github.com/sagikazarmark/slog-shim v0.1.0
go: downloading github.com/spf13/afero v1.11.0
go: downloading golang.org/x/sys v0.17.0
go: downloading github.com/subosito/gotenv v1.6.0
go: downloading github.com/hashicorp/hcl v1.0.0
go: downloading gopkg.in/ini.v1 v1.67.0
go: downloading github.com/magiconair/properties v1.8.7
go: downloading github.com/pelletier/go-toml/v2 v2.1.0
go: downloading golang.org/x/text v0.14.0
	github.com/cilium/statedb/reconciler/benchmark		coverage: 0.0% of statements
	github.com/cilium/statedb/reconciler/example		coverage: 0.0% of statements
ok  	github.com/cilium/statedb	5.811s	coverage: 88.4% of statements
ok  	github.com/cilium/statedb/index	0.005s	coverage: 44.6% of statements
ok  	github.com/cilium/statedb/internal	0.013s	coverage: 93.3% of statements
ok  	github.com/cilium/statedb/part	2.765s	coverage: 82.8% of statements
ok  	github.com/cilium/statedb/reconciler	2.508s	coverage: 92.2% of statements
-----
$ make bench
go test ./... -bench . -benchmem -test.run xxx
goos: linux
goarch: amd64
pkg: github.com/cilium/statedb
cpu: AMD EPYC 7763 64-Core Processor                
BenchmarkDB_WriteTxn_1-4                    	  445275	      2654 ns/op	    376814 objects/sec	    2800 B/op	      32 allocs/op
BenchmarkDB_WriteTxn_10-4                   	 1205655	       995.9 ns/op	   1004168 objects/sec	     742 B/op	      10 allocs/op
BenchmarkDB_WriteTxn_100-4                  	 1402903	       789.0 ns/op	   1267409 objects/sec	     595 B/op	       7 allocs/op
BenchmarkDB_WriteTxn_1000-4                 	 1460799	       818.8 ns/op	   1221258 objects/sec	     549 B/op	       7 allocs/op
BenchmarkDB_WriteTxn_100_SecondaryIndex-4   	  515272	      2289 ns/op	    436855 objects/sec	    1489 B/op	      37 allocs/op
BenchmarkDB_Modify-4                        	    1294	    975426 ns/op	   1025195 objects/sec	  768655 B/op	    8456 allocs/op
BenchmarkDB_GetInsert-4                     	    1189	   1021149 ns/op	    979290 objects/sec	  755677 B/op	    8459 allocs/op
BenchmarkDB_RandomInsert-4                  	    2380	    497357 ns/op	   2010630 objects/sec	  402275 B/op	    7098 allocs/op
BenchmarkDB_RandomReplace-4                 	     318	   3680600 ns/op	    271695 objects/sec	 2347611 B/op	   48570 allocs/op
BenchmarkDB_SequentialInsert-4              	    1504	    823834 ns/op	   1213836 objects/sec	  551320 B/op	    7287 allocs/op
BenchmarkDB_Changes_Baseline-4              	    1201	    974656 ns/op	   1026004 objects/sec	  553101 B/op	   10253 allocs/op
BenchmarkDB_Changes-4                       	     690	   1743496 ns/op	    573560 objects/sec	  993765 B/op	   14494 allocs/op
BenchmarkDB_RandomLookup-4                  	   22141	     54175 ns/op	  18458630 objects/sec	     144 B/op	       1 allocs/op
BenchmarkDB_SequentialLookup-4              	   27142	     45325 ns/op	  22062746 objects/sec	       0 B/op	       0 allocs/op
BenchmarkDB_FullIteration_All-4             	     717	   1675172 ns/op	  59695465 objects/sec	     464 B/op	      12 allocs/op
BenchmarkDB_FullIteration_Get-4             	     210	   5710871 ns/op	  17510498 objects/sec	       0 B/op	       0 allocs/op
BenchmarkDB_PropagationDelay-4              	  479490	      2353 ns/op	        20.00 50th_µs	        24.00 90th_µs	        96.00 99th_µs	    1585 B/op	      24 allocs/op
PASS
ok  	github.com/cilium/statedb	26.776s
PASS
ok  	github.com/cilium/statedb/index	0.004s
PASS
ok  	github.com/cilium/statedb/internal	0.003s
goos: linux
goarch: amd64
pkg: github.com/cilium/statedb/part
cpu: AMD EPYC 7763 64-Core Processor                
Benchmark_Insert_RootOnlyWatch-4    	    7780	    134643 ns/op	   7427072 objects/sec	  104164 B/op	    2041 allocs/op
Benchmark_Insert-4                  	    6019	    187891 ns/op	   5322245 objects/sec	  219052 B/op	    3064 allocs/op
Benchmark_Modify-4                  	    8253	    145408 ns/op	   6877190 objects/sec	  212648 B/op	    1205 allocs/op
Benchmark_GetInsert-4               	    6771	    180407 ns/op	   5543032 objects/sec	  212534 B/op	    1204 allocs/op
Benchmark_Replace-4                 	27196497	        44.48 ns/op	  22479893 objects/sec	       0 B/op	       0 allocs/op
Benchmark_Replace_RootOnlyWatch-4   	27057266	        44.06 ns/op	  22695454 objects/sec	       0 B/op	       0 allocs/op
Benchmark_txn_1-4                   	 2732666	       404.5 ns/op	   2472340 objects/sec	     448 B/op	       7 allocs/op
Benchmark_txn_10-4                  	 7103510	       166.6 ns/op	   6001523 objects/sec	     154 B/op	       2 allocs/op
Benchmark_txn_100-4                 	 7805619	       155.8 ns/op	   6419035 objects/sec	     224 B/op	       2 allocs/op
Benchmark_txn_1000-4                	 6948334	       172.6 ns/op	   5792320 objects/sec	     216 B/op	       2 allocs/op
Benchmark_txn_delete_1-4            	 3115306	       385.1 ns/op	   2596928 objects/sec	     856 B/op	       6 allocs/op
Benchmark_txn_delete_10-4           	 8368215	       141.1 ns/op	   7086982 objects/sec	     132 B/op	       1 allocs/op
Benchmark_txn_delete_100-4          	10282312	       113.4 ns/op	   8815759 objects/sec	      60 B/op	       1 allocs/op
Benchmark_txn_delete_1000-4         	11109649	       107.3 ns/op	   9316159 objects/sec	      26 B/op	       1 allocs/op
Benchmark_Get-4                     	   38984	     31311 ns/op	  31937713 objects/sec	       0 B/op	       0 allocs/op
Benchmark_Iterate-4                 	  177468	      6956 ns/op	 143769986 objects/sec	      80 B/op	       3 allocs/op
Benchmark_Hashmap_Insert-4          	   15304	     76995 ns/op	  12987921 objects/sec	   86550 B/op	      64 allocs/op
Benchmark_Hashmap_Get_Uint64-4      	  163414	      7354 ns/op	 135987515 objects/sec	       0 B/op	       0 allocs/op
Benchmark_Hashmap_Get_Bytes-4       	  154545	      7664 ns/op	 130483084 objects/sec	       0 B/op	       0 allocs/op
Benchmark_Uint64Map_Random-4        	    1382	    863033 ns/op	   1158705 items/sec	 2703321 B/op	    9030 allocs/op
Benchmark_Uint64Map_Sequential-4    	    1479	    800315 ns/op	   1249509 items/sec	 2492406 B/op	    9749 allocs/op
PASS
ok  	github.com/cilium/statedb/part	28.264s
PASS
ok  	github.com/cilium/statedb/reconciler	0.004s
?   	github.com/cilium/statedb/reconciler/benchmark	[no test files]
?   	github.com/cilium/statedb/reconciler/example	[no test files]
go run ./reconciler/benchmark -quiet
1000000 objects reconciled in 2.96 seconds (batch size 1000)
Throughput 337815.89 objects per second
Allocated 6011264 objects, 424771kB bytes, 513432kB bytes still in use

joamaki added a commit to joamaki/cilium that referenced this pull request Sep 6, 2024
A timing bug in closing of the table init channel
is sometimes causing the test to fail:

  statedb_test.go:263:
     Error Trace:    /.../pkg/k8s/statedb_test.go:263
                     /.../pkg/k8s/statedb_test.go:133
     Error:          Not equal:
                     expected: "obj1"
                     actual  : "garbage"

                     Diff:
                     --- Expected
                     +++ Actual
                     @@ -1 +1 @@
                     -obj1
                     +garbage
     Test:           TestStateDBReflector/TransformMany-QueryAll
  FAIL

The root cause for this is a bug in StateDB that causes
the init watch channel to be closed too early. This
is fixed in cilium/statedb#53.

As it takes a bit of time until we've bumped the StateDB
version, work around this issue by repeatedly checking
if the table is initialized, so folks aren't exposed
to the flake.

Signed-off-by: Jussi Maki <jussi@isovalent.com>
Copy link
Member

@pippolo84 pippolo84 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

@joamaki joamaki merged commit f96349e into main Sep 6, 2024
1 check passed
@joamaki joamaki deleted the pr/joamaki/fix-ordering-init-watch branch September 6, 2024 13:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants