Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NewServerConfiguration() called twice after config reload with -watch causing e.g. double health probes/jobs #692

Closed
johakoch opened this issue Jan 31, 2023 · 6 comments · Fixed by #730 or #736
Assignees
Labels
bug Something isn't working
Milestone

Comments

@johakoch
Copy link
Collaborator

With -watch, after a config reload,

  • the probe is shut down later than the server, and
  • I get two health probes.

Example:

server {
  hosts = ["*:8080"]
  endpoint "/" {
    proxy {
      backend = "be"
    }
  }
}

definitions {
  backend "be" {
    origin = "https://blackhole.webpagetest.org"
    beta_health {
      interval = "10s"
      timeout = "3s"  
    }
  }
}

Start Couper

INFO[0000] watching configuration file(s)                build=dev type=couper_daemon version=0 watch="map[max-retries:5 retry-delay:500ms]"
INFO[0000] ulimit: max open files: 1048576 (hard limit: 1048576)  build=dev type=couper_daemon version=0
INFO[0000] couper is serving: 0.0.0.0:8080               build=dev type=couper_daemon version=0

Normal health check:

ERRO[0000] backend error: context deadline exceeded      backend=be build=dev custom="map[]" error_type=backend method=GET request="map[headers:map[user-agent:Couper / 0 health-check] host:blackhole.webpagetest.org method:GET name:health-check origin:blackhole.webpagetest.org path: proto:https]" status=0 timings="map[dns:20.997 total:3000.255]" type=couper_backend uid=cfcjk5beg9scfeq3hfig url="https://blackhole.webpagetest.org" version=0
WARN[0003] new health state: failing                     backend=be build=dev type=couper_daemon uid=cfcjk5beg9scfeq3hfig url="https://blackhole.webpagetest.org" version=0

Reload config:

INFO[0008] reloading couper configuration                build=dev type=couper_daemon version=0
INFO[0008] ulimit: max open files: 1048576 (hard limit: 1048576)  build=dev type=couper_daemon version=0
ERRO[0008] retry 1/5 due to listen error: listen tcp4 :8080: bind: address already in use  build=dev type=couper_daemon version=0
WARN[0008] shutting down                                 build=dev deadline=0s delay=0s type=couper_daemon version=0
INFO[0008] http: Server closed: 0.0.0.0:8080             build=dev type=couper_daemon version=0
INFO[0008] ulimit: max open files: 1048576 (hard limit: 1048576)  build=dev type=couper_daemon version=0
INFO[0008] couper is serving: 0.0.0.0:8080               build=dev type=couper_daemon version=0

Double health check log:

ERRO[0008] backend error: context deadline exceeded      backend=be build=dev custom="map[]" error_type=backend method=GET request="map[headers:map[user-agent:Couper / 0 health-check] host:blackhole.webpagetest.org method:GET name:health-check origin:blackhole.webpagetest.org path: proto:https]" status=0 timings="map[dns:24.815 total:3000.822]" type=couper_backend uid=cfcjk7beg9scfeq3hfj0 url="https://blackhole.webpagetest.org" version=0
WARN[0011] new health state: failing                     backend=be build=dev type=couper_daemon uid=cfcjk7beg9scfeq3hfj0 url="https://blackhole.webpagetest.org" version=0
ERRO[0008] backend error: context deadline exceeded      backend=be build=dev custom="map[]" error_type=backend method=GET request="map[headers:map[user-agent:Couper / 0 health-check] host:blackhole.webpagetest.org method:GET name:health-check origin:blackhole.webpagetest.org path: proto:https]" status=0 timings="map[dns:20.787 total:3000.621]" type=couper_backend uid=cfcjk7beg9scfeq3hfjg url="https://blackhole.webpagetest.org" version=0
WARN[0011] new health state: failing                     backend=be build=dev type=couper_daemon uid=cfcjk7beg9scfeq3hfjg url="https://blackhole.webpagetest.org" version=0

Health probe shutdown log:

WARN[0013] shutdown health probe                         backend=be build=dev type=couper_daemon url="https://blackhole.webpagetest.org" version=0

Still double health check logs:

ERRO[0021] backend error: context deadline exceeded      backend=be build=dev custom="map[]" error_type=backend method=GET request="map[headers:map[user-agent:Couper / 0 health-check] host:blackhole.webpagetest.org method:GET name:health-check origin:blackhole.webpagetest.org path: proto:https]" status=0 timings="map[dns:20.353 total:3000.917]" type=couper_backend uid=cfcjkajeg9scfeq3hfk0 url="https://blackhole.webpagetest.org" version=0
ERRO[0024] backend error: backend error: connecting to be 'blackhole.webpagetest.org' failed: i/o timeout: new health state: unhealthy  backend=be build=dev error_type=backend_unhealthy type=couper_daemon uid=cfcjkajeg9scfeq3hfk0 url="https://blackhole.webpagetest.org" version=0
ERRO[0021] backend error: context deadline exceeded      backend=be build=dev custom="map[]" error_type=backend method=GET request="map[headers:map[user-agent:Couper / 0 health-check] host:blackhole.webpagetest.org method:GET name:health-check origin:blackhole.webpagetest.org path: proto:https]" status=0 timings="map[dns:23.634 total:3000.681]" type=couper_backend uid=cfcjkajeg9scfeq3hfkg url="https://blackhole.webpagetest.org" version=0
ERRO[0024] backend error: backend error: connecting to be 'blackhole.webpagetest.org' failed: i/o timeout: new health state: unhealthy  backend=be build=dev error_type=backend_unhealthy type=couper_daemon uid=cfcjkajeg9scfeq3hfkg url="https://blackhole.webpagetest.org" version=0
ERRO[0034] backend error: context deadline exceeded      backend=be build=dev custom="map[]" error_type=backend method=GET request="map[headers:map[user-agent:Couper / 0 health-check] host:blackhole.webpagetest.org method:GET name:health-check origin:blackhole.webpagetest.org path: proto:https]" status=0 timings="map[dns:27.049 total:3000.152]" type=couper_backend uid=cfcjkdreg9scfeq3hfl0 url="https://blackhole.webpagetest.org" version=0
ERRO[0034] backend error: context deadline exceeded      backend=be build=dev custom="map[]" error_type=backend method=GET request="map[headers:map[user-agent:Couper / 0 health-check] host:blackhole.webpagetest.org method:GET name:health-check origin:blackhole.webpagetest.org path: proto:https]" status=0 timings="map[dns:21.148 total:3000.795]" type=couper_backend uid=cfcjkdreg9scfeq3hflg url="https://blackhole.webpagetest.org" version=0
ERRO[0047] backend error: context deadline exceeded      backend=be build=dev custom="map[]" error_type=backend method=GET request="map[headers:map[user-agent:Couper / 0 health-check] host:blackhole.webpagetest.org method:GET name:health-check origin:blackhole.webpagetest.org path: proto:https]" status=0 timings="map[dns:20.394 total:3000.424]" type=couper_backend uid=cfcjkh3eg9scfeq3hfm0 url="https://blackhole.webpagetest.org" version=0
ERRO[0047] backend error: context deadline exceeded      backend=be build=dev custom="map[]" error_type=backend method=GET request="map[headers:map[user-agent:Couper / 0 health-check] host:blackhole.webpagetest.org method:GET name:health-check origin:blackhole.webpagetest.org path: proto:https]" status=0 timings="map[dns:21.6 total:3000.629]" type=couper_backend uid=cfcjkh3eg9scfeq3hfmg url="https://blackhole.webpagetest.org" version=0
@johakoch johakoch added the bug Something isn't working label Jan 31, 2023
@johakoch
Copy link
Collaborator Author

johakoch commented Feb 6, 2023

  • go p.probe(opts.Context) is called in transport.NewProbe()
  • transport.NewProbe() is called in transport.NewBackend()
  • transport.NewBackend() is called in runtime.newBackend()
  • runtime.newBackend() is called in runtime.NewBackend()
  • runtime.NewBackend() is called for conf.Definitions.Backend in runtime.NewServerConfiguration()
  • runtime.NewServerConfiguration() is called in both Run.Execute() and for a reload in realmain()
  • Run.Execute() is also called for a reload in realmain()

With

func realmain(ctx context.Context, arguments []string) int {
...
		case _, more := <-reloadCh:
...
			_, reloadErr = runtime.NewServerConfiguration(cf, logger.WithFields(fields), tmpMemStore)
...
			execCmd, restartSignal = newRestartableCommand(ctx, cmd) // replace previous pair
			go func() {
				// logger settings update gets ignored at this point
				// have to be locked for an update, skip this feature for now
				errCh <- execCmd.Execute(args, confFile, logger)
			}()

and

func (r *Run) Execute(args Args, config *config.Couper, logEntry *logrus.Entry) error {
...
	srvConf, err := runtime.NewServerConfiguration(config, logEntry, memStore)

@johakoch johakoch changed the title Double health probes after config reload with -watch NewServerConfiguration() called twice after config reload with -watch causing e.g. double health probes Feb 6, 2023
@johakoch
Copy link
Collaborator Author

johakoch commented Feb 6, 2023

Jobs are started twice after reload, too.

@johakoch johakoch changed the title NewServerConfiguration() called twice after config reload with -watch causing e.g. double health probes NewServerConfiguration() called twice after config reload with -watch causing e.g. double health probes/jobs Feb 6, 2023
@malud
Copy link
Collaborator

malud commented Feb 16, 2023

We have a context key called request.ConfigDryRun which should prevent go-routines gets fired on object creation.

We should add this condition to all related objects like jobs or maybe jwks_url configuration_url (oidc) too.
Same pattern as in probe.go:100

@malud malud added this to the 1.12.1 milestone Feb 16, 2023
@johakoch johakoch linked a pull request Feb 24, 2023 that will close this issue
@alex-schneider alex-schneider linked a pull request Feb 24, 2023 that will close this issue
@johakoch
Copy link
Collaborator Author

johakoch commented Feb 24, 2023

The ConfigDryRun seems to be working already. The NewServerConfiguration() called directly from realmain() has ConfigDryRun set in its context preventing the probe from running.

But there are 2 calls to execCmd.Execute() (calling NewServerConfiguration() without setting ConfigDryRun) following a reload.
The first is the already mentioned:

	reloadCh := watchConfigFiles(confFile.Files, logger, flags.FileWatchRetries, flags.FileWatchRetryDelay)
	for {
		select {
...
		case _, more := <-reloadCh:
...
			restartSignal <- struct{}{}                              // shutdown running couper
			<-errCh                                                  // drain current error due to cancel and ensure closed ports
			execCmd, restartSignal = newRestartableCommand(ctx, cmd) // replace previous pair
			go func() {
				// logger settings update gets ignored at this point
				// have to be locked for an update, skip this feature for now
				errCh <- execCmd.Execute(args, confFile, logger)
				         ^^^^^^^^^^^^^^^
			}()

The other is caused in response to an error.
(see the error in the log)

ERRO[0008] retry 1/5 due to listen error: listen tcp4 :8080: bind: address already in use  build=dev type=couper_daemon version=0
		case err = <-errCh:
			if err != nil {
				if netErr, ok := err.(*net.OpError); ok {
					if netErr.Op == "listen" && errRetries < flags.FileWatchRetries {
						errRetries++
						logger.Errorf("retry %d/%d due to listen error: %v", errRetries, flags.FileWatchRetries, netErr)

						// configuration load succeeded at this point, just restart the command
						execCmd, restartSignal = newRestartableCommand(ctx, cmd) // replace previous pair
						time.Sleep(flags.FileWatchRetryDelay)

						go func() {
							errCh <- execCmd.Execute(args, confFile, logger)
							         ^^^^^^^^^^^^^^^
						}()
						continue
...

@johakoch
Copy link
Collaborator Author

I guess, the first of the two mentioned calls to execCmd.Execute() comes too early. If I add

				time.Sleep(flags.FileWatchRetryDelay)

(or similar) ahead of the call, there's no retry error and so only one probe running after a reload.

@johakoch
Copy link
Collaborator Author

However, even if the retry is triggered differently, there should only be one probe.

This was referenced Mar 3, 2023
@johakoch johakoch closed this as completed Mar 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants