Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(docker-driver): Propagate promtail's client.Stop properly #2898

Merged
merged 5 commits into from
Dec 1, 2020

Conversation

kavirajk
Copy link
Contributor

@kavirajk kavirajk commented Nov 8, 2020

promtail client stop is not propaged properly to the its batchSend method
in the client's run method.

though run method multiplex on c.quit channel, the other branch can block
in batchSend without even listening on the c.quit channel on the run method.
This fixes it by multiplex on c.quit inside the batchSend as well.

Other suggestion: expose client.run to promtail client and they control it via
context. But that requires API changes like accepting ctx in client.Run method.

fixes: #2361

Before

-bash5.0$ time docker kill 0eb109b4e588
0eb109b4e588

real	12m56.192s
user	0m0.206s
sys	0m0.119s

After

-bash5.0$ time docker kill e67c1a43000a
e67c1a43000a

real	0m0.298s
user	0m0.108s
sys	0m0.058s

Following plugin has the fix to experiment.

-bash5.0$ docker plugin install kavirajk/loki-docker-driver:latest --alias loki-fix

@kavirajk kavirajk changed the title fix(docker-driver): Multiplex the client.batchSend fix(docker-driver): Propagate promtail's client.Stop properly Nov 8, 2020
@kavirajk kavirajk force-pushed the fix/docker-driver branch 2 times, most recently from 6d67178 to 2416feb Compare November 9, 2020 08:53
@pull-request-size pull-request-size bot added size/L and removed size/M labels Nov 9, 2020
@kavirajk kavirajk requested review from slim-bean and a team November 9, 2020 11:26
@cyriltovena
Copy link
Contributor

cyriltovena commented Nov 9, 2020

@slim-bean are you ok for stopping retrying early ? I think you had opinions ?

I think you could remove the quit channel and use a context.WithCancel instead and call the cancel in the close. This way we have only one synchronization and you don't have to change anything in the retrying logic. WDYT ?

@kavirajk
Copy link
Contributor Author

kavirajk commented Nov 9, 2020

@cyriltovena, even I don't completely like the idea of having to listen on quit channel multiple places!

But I don't understand what you mean by calling cancel() in close? you mean on the client.Stop() method?.

the core problem is the ctx we pass into batchSend need to be cancelled when client.Stop() is called. So you want to store the context and cancel func on the client struct and use same ctx in all the batchSend and call cancel inside the client.Stop()?

@cyriltovena
Copy link
Contributor

My idea was:

type client struct {
...
ctx context.Context
cancel context.CancelFunc
...
}

func NewClient(ctx,....) {
  ctx , cancel := context.WithCancel(ctx)
  return & client{
    ctx : ctx,
    cancel: cancel,
  }
}

func(c *client) Stop() {
c.cancel()
...
}

Then you can use the client context in the backoff and you don't need to change all of the code anymore.

@kavirajk
Copy link
Contributor Author

@cyriltovena, yea! right! actually thought about that approach as well.

I just felt like having ctx as the part of struct not the best approach, but here its ok I think, since the calling batchSend is private to client.

but one other thing is, in that case we can totally remove the quit channel. And replace it with <-ctx.Done(). Having both ctx and quit becomes unnecessary then! wdyt?

@cyriltovena
Copy link
Contributor

Yes I want quit to go. But more importantly, I think @slim-bean is really opinionated about this whole PR. While this helps docker, it causes some possibility of losing data, for other use case. I'm wondering if this behaviour could be optional and use just for the docker driver.

@slim-bean
Copy link
Collaborator

I think what i'd like is for SIGINT to initiate shutdown but it will still respect the retries and timeouts, and then SIGKILL exits no matter what.

I think it also might be nice to trap a second SIGINT to initiate shutdown regardless of retries (we should log a message here after the first SIGINT that a second SIGINT or a SIGKILL will force quit and abandon retries)

@kavirajk
Copy link
Contributor Author

@slim-bean I see what you mean. Can be done for SIGINT the way you wanted, by watching for the signal on driver's main! and graceful shutdown on the driver's StopLogging method (have to check what can be done for retry though)!.

But nothing can be done on SIGKILL, as it cannot be catched!

will update the PR accordingly, also for the canceling the batchSend context, will go with storing the context and cancel func on client struct!. Won't delete quit for now!

@slim-bean
Copy link
Collaborator

But nothing can be done on SIGKILL, as it cannot be catched!

yeah sorry, SIGTERM is probably the correct signal instead of SIGKILL

@slim-bean
Copy link
Collaborator

Interesting it looks like docker only sends a SIGTERM, i think the general idea still applies though.

We handle the SIGINT/SIGTERM and initiate a shutdown, but will wait for any active sends to complete

A second SIGINT/SIGTERM would interrupt any active sends and immediately shut down the process.

I realize this may be met with some controversy as to what the correct behavior is here, trying to strike a balance between data durability and intuitiveness here, but I am interested in other opinions.

@kavirajk
Copy link
Contributor Author

Hmm! SIGTERM is also for graceful shutdown! So we should handle it same way as SIGINT IMO!

@kavirajk
Copy link
Contributor Author

Ah! Saw your latest comment after posting mine!

So yea! Agree with

  1. Handling both SIGINT and SIGTERM with same behaviour
  2. Graceful shutdown on first and immediate shutdown on second!

@codecov-io
Copy link

Codecov Report

Merging #2898 (984246b) into master (fd451d9) will decrease coverage by 0.07%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2898      +/-   ##
==========================================
- Coverage   61.63%   61.56%   -0.08%     
==========================================
  Files         181      181              
  Lines       14712    14715       +3     
==========================================
- Hits         9068     9059       -9     
- Misses       4812     4828      +16     
+ Partials      832      828       -4     
Impacted Files Coverage Δ
pkg/promtail/client/client.go 77.85% <100.00%> (+0.48%) ⬆️
pkg/promtail/positions/positions.go 46.80% <0.00%> (-11.71%) ⬇️
pkg/promtail/targets/file/filetarget.go 63.38% <0.00%> (-0.71%) ⬇️

@kavirajk
Copy link
Contributor Author

@cyriltovena @slim-bean Turns out docker doesn't send any signals (neither SIGINT or SIGTERM) to the driver when docker stop or docker kill is executed. Docker just calls driver's StopLogging method during these actions (docker stop or docker kill)

I just verified by adding signal handlers to the driver's main. But no signals is received (even though in the docker daemon log says SIGNAL is sent to the container) Something like

	// gracefull shutdown
	sig := make(chan os.Signal)
	signal.Notify(sig, syscall.SIGINT, syscall.SIGTERM)

	go func() {
		s := <-sig
		level.Info(util.Logger).Log("msg", "signal received", "signal", s)
	}()

So the best we could do currently is just make sure client's stop is canceling the batchSend's context. which we did in this PR!

Only pending is either we remove c.quit in this PR or just deprecate and remove it later!. @cyriltovena @slim-bean WDYT?

@cyriltovena
Copy link
Contributor

cyriltovena commented Nov 19, 2020

I think Ed is reluctant to the idea of stopping sending batches. Which currently now with this PR we do.

My suggestion would be to have a two method:

  • Stop() : would wait for the end of sending batches and retries
  • StopNow(): would also wait for the last batch to be sent but not apply any retries in case of failure.

Then we could call StopNow() when the driver ask for termination instead of Stop(). This way this special behaviour is only for docker.

In Addition I would make the timeout of the default client to 5s for docker, and not 10s. This way we have guarantee that container will stop at the worst case after 5s, and that it actually tried to send the last batch.

@kavirajk
Copy link
Contributor Author

@cyriltovena thanks :) nice idea 👍 . actually thought about that same before, but didn't do it because adding StopNow() breaks the client interface. So was not sure that time as that may break any other client implementations.

Looking into it. I can definitely add it if its minor changes!

@kavirajk kavirajk force-pushed the fix/docker-driver branch 2 times, most recently from 718de29 to 54472cb Compare November 19, 2020 10:52
@codecov-io
Copy link

codecov-io commented Nov 19, 2020

Codecov Report

Merging #2898 (820ec24) into master (14a5e22) will decrease coverage by 0.10%.
The diff coverage is 50.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2898      +/-   ##
==========================================
- Coverage   61.86%   61.76%   -0.11%     
==========================================
  Files         182      182              
  Lines       14870    14880      +10     
==========================================
- Hits         9200     9190      -10     
- Misses       4822     4843      +21     
+ Partials      848      847       -1     
Impacted Files Coverage Δ
cmd/docker-driver/config.go 9.34% <ø> (ø)
cmd/docker-driver/loki.go 0.00% <0.00%> (ø)
cmd/docker-driver/main.go 0.00% <ø> (ø)
cmd/fluent-bit/dque.go 0.00% <0.00%> (ø)
pkg/promtail/client/logger.go 75.00% <0.00%> (-2.78%) ⬇️
pkg/promtail/client/multi.go 80.00% <0.00%> (-5.72%) ⬇️
pkg/promtail/client/client.go 76.92% <77.77%> (-0.45%) ⬇️
pkg/promtail/positions/positions.go 46.80% <0.00%> (-11.71%) ⬇️
pkg/promtail/targets/file/filetarget.go 64.08% <0.00%> (-2.12%) ⬇️

@kavirajk kavirajk force-pushed the fix/docker-driver branch 2 times, most recently from 0f3a235 to 002c23d Compare November 19, 2020 12:28
@kavirajk
Copy link
Contributor Author

@slim-bean can you take a look when you get a chance? thanks :)

stores `context` and its `cancel` in client and uses it to
make upstream calls that need `ctx`.

Also properly cancel's all the upstream calls that using `ctx` when
client's Stop is called.
var status int
for backoff.Ongoing() {
start := time.Now()
status, err = c.send(ctx, tenantID, buf)
status, err = c.send(c.ctx, tenantID, buf)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're also cancelling the current batch and the last batch with this.

May be you should add a test to verify that StopNow() always try to send.

Use `StopNow()` from the docker-driver
@pull-request-size pull-request-size bot added size/L and removed size/M labels Nov 24, 2020
Timeout: client.Timeout,
// Avoid blocking the docker-driver on the worst case
// https://github.com/grafana/loki/pull/2898#issuecomment-730218963
Timeout: 5 * time.Second,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to avoid this, I'd like for all clients to have the same timeout and backoff behavior

Copy link
Contributor

@cyriltovena cyriltovena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@cyriltovena cyriltovena merged commit 410b9d9 into master Dec 1, 2020
@cyriltovena cyriltovena deleted the fix/docker-driver branch December 1, 2020 10:18
cyriltovena pushed a commit to cyriltovena/loki that referenced this pull request Jun 11, 2021
* Support Snappy compression for gRPC
fix: grafana#2898

Signed-off-by: Wing924 <weihe924stephen@gmail.com>

* changelog
Signed-off-by: Wing924 <weihe924stephen@gmail.com>

* fix

Signed-off-by: Wing924 <weihe924stephen@gmail.com>

* fix

Signed-off-by: Wing924 <weihe924stephen@gmail.com>

* fix

Signed-off-by: Wing924 <weihe924stephen@gmail.com>

* fix

Signed-off-by: Wing924 <weihe924stephen@gmail.com>

* fix

Signed-off-by: Wing924 <weihe924stephen@gmail.com>

* fix

Signed-off-by: Wing924 <weihe924stephen@gmail.com>

* Fix

Signed-off-by: Wing924 <weihe924stephen@gmail.com>

* fix

Signed-off-by: Wing924 <weihe924stephen@gmail.com>
@codecov-commenter
Copy link

Codecov Report

Attention: Patch coverage is 64.28571% with 5 lines in your changes missing coverage. Please review.

Project coverage is 61.76%. Comparing base (14a5e22) to head (9355fa1).

Files with missing lines Patch % Lines
cmd/fluent-bit/dque.go 0.00% 2 Missing ⚠️
cmd/docker-driver/loki.go 0.00% 1 Missing ⚠️
pkg/promtail/client/logger.go 0.00% 1 Missing ⚠️
pkg/promtail/client/multi.go 0.00% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2898      +/-   ##
==========================================
- Coverage   61.86%   61.76%   -0.11%     
==========================================
  Files         182      182              
  Lines       14870    14880      +10     
==========================================
- Hits         9200     9190      -10     
- Misses       4822     4842      +20     
  Partials      848      848              
Files with missing lines Coverage Δ
cmd/docker-driver/main.go 0.00% <ø> (ø)
pkg/promtail/client/client.go 78.32% <100.00%> (+0.94%) ⬆️
cmd/docker-driver/loki.go 0.00% <0.00%> (ø)
pkg/promtail/client/logger.go 75.00% <0.00%> (-2.78%) ⬇️
pkg/promtail/client/multi.go 80.00% <0.00%> (-5.72%) ⬇️
cmd/fluent-bit/dque.go 0.00% <0.00%> (ø)

... and 3 files with indirect coverage changes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
5 participants