Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make testserver multi-node more robust. #148

Merged
merged 1 commit into from
Oct 3, 2022

Conversation

RichardJCai
Copy link
Contributor

  • Fix race condition in pollListeningURLFile.
  • Add Opts for init node timeout and poll listening url.
    • InitTimeoutOpt, PollListenURLTimeoutOpt
  • Parallelize restart test

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@RichardJCai RichardJCai force-pushed the make_multinode_testserver_robust branch 3 times, most recently from a98fbfc to fc0e44c Compare September 21, 2022 21:05
- Fix race condition in pollListeningURLFile.
- Add Opts for init node timeout and poll listening url.
  - InitTimeoutOpt, PollListenURLTimeoutOpt
- Parallelize restart test
@RichardJCai RichardJCai force-pushed the make_multinode_testserver_robust branch from fc0e44c to dd2bfb9 Compare September 21, 2022 21:13
Copy link
Contributor

@pawalt pawalt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any way we can test the added arguments? Maybe make a testserver configuration we know will timeout on init and time it to make sure the option is propogated properly?

Reviewed all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rafiss, @RichardJCai, and @ZhouXing19)


testserver/testserver.go line 814 at r1 (raw file):

		ts.mu.RLock()

		nodeDir := fmt.Sprintf("%s%d", ts.baseDir, i)

Why do we need to remove the individual directories if we're going to removeAll after anyway? Is there some case in which we'd fail to remove a single node's directory but not the others?


testserver/testserver_test.go line 463 at r1 (raw file):

			for j := 0; j < 3; j++ {
				for {
					port, err := getFreePort()

This is a very edge case, but this could race if another process is attempting to get a port from:0 at the same time. If that process does a TCP listen on :0 between when these ports are allocated and the testserver is started, it could get assigned one of the ports you've allocated here.

I think if you just feed the testserver :0 as the port to listen on, it'll start up guaranteeing port uniqueness, and you won't have to do any of this allocation.

@RichardJCai
Copy link
Contributor Author

Is there any way we can test the added arguments? Maybe make a testserver configuration we know will timeout on init and time it to make sure the option is propogated properly?

Reviewed all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rafiss, @RichardJCai, and @ZhouXing19)

testserver/testserver.go line 814 at r1 (raw file):

		ts.mu.RLock()

		nodeDir := fmt.Sprintf("%s%d", ts.baseDir, i)

Why do we need to remove the individual directories if we're going to removeAll after anyway? Is there some case in which we'd fail to remove a single node's directory but not the others?

testserver/testserver_test.go line 463 at r1 (raw file):

			for j := 0; j < 3; j++ {
				for {
					port, err := getFreePort()

This is a very edge case, but this could race if another process is attempting to get a port from:0 at the same time. If that process does a TCP listen on :0 between when these ports are allocated and the testserver is started, it could get assigned one of the ports you've allocated here.

I think if you just feed the testserver :0 as the port to listen on, it'll start up guaranteeing port uniqueness, and you won't have to do any of this allocation.

@RichardJCai
Copy link
Contributor Author

Is there any way we can test the added arguments? Maybe make a testserver configuration we know will timeout on init and time it to make sure the option is propogated properly?

Reviewed all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rafiss, @RichardJCai, and @ZhouXing19)

testserver/testserver.go line 814 at r1 (raw file):

		ts.mu.RLock()

		nodeDir := fmt.Sprintf("%s%d", ts.baseDir, i)

Why do we need to remove the individual directories if we're going to removeAll after anyway? Is there some case in which we'd fail to remove a single node's directory but not the others?

testserver/testserver_test.go line 463 at r1 (raw file):

			for j := 0; j < 3; j++ {
				for {
					port, err := getFreePort()

This is a very edge case, but this could race if another process is attempting to get a port from:0 at the same time. If that process does a TCP listen on :0 between when these ports are allocated and the testserver is started, it could get assigned one of the ports you've allocated here.

I think if you just feed the testserver :0 as the port to listen on, it'll start up guaranteeing port uniqueness, and you won't have to do any of this allocation.

@RichardJCai RichardJCai reopened this Sep 22, 2022
Copy link
Contributor Author

@RichardJCai RichardJCai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @pawalt, @rafiss, and @ZhouXing19)


testserver/testserver.go line 814 at r1 (raw file):

Previously, pawalt (Peyton Walters) wrote…

Why do we need to remove the individual directories if we're going to removeAll after anyway? Is there some case in which we'd fail to remove a single node's directory but not the others?

They're separate directories, not nested. Ie the base one looks like cockroach-testserver123 and the node directories are cockroach-testserver1230, cockroach-testserver1231...


testserver/testserver_test.go line 463 at r1 (raw file):

This is a very edge case, but this could race if another process is attempting to get a port from:0 at the same time. If that process does a TCP listen on :0 between when these ports are allocated and the testserver is started, it could get assigned one of the ports you've allocated here.

Yeah I thought about using :0 but we need to know what address to feed into the --join flag which includes the port. Pretty annoying problem here.

Copy link
Contributor

@pawalt pawalt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Your call on testing the new arguments. Functionality looks pretty obvious, but never hurts to have some tests.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rafiss, @RichardJCai, and @ZhouXing19)


testserver/testserver_test.go line 463 at r1 (raw file):

Previously, RichardJCai (Richard Cai) wrote…

This is a very edge case, but this could race if another process is attempting to get a port from:0 at the same time. If that process does a TCP listen on :0 between when these ports are allocated and the testserver is started, it could get assigned one of the ports you've allocated here.

Yeah I thought about using :0 but we need to know what address to feed into the --join flag which includes the port. Pretty annoying problem here.

Hmmm that is annoying. Thanks for the clarification.

@RichardJCai
Copy link
Contributor Author

LGTM. Your call on testing the new arguments. Functionality looks pretty obvious, but never hurts to have some tests.

I'm allergic to adding tests for timeouts. Jk I'll add a simple one.

@RichardJCai
Copy link
Contributor Author

Actually I'm gonna merge this as is, adding a testhook for timeout is somewhat annoying and I'm gonna try to get the testserver into CRDB before I leave.

@RichardJCai RichardJCai merged commit f3c635a into master Oct 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants