fix(docker-driver): Propagate promtail's client.Stop properly #2898
@slim-bean are you OK with stopping retries early? I think you had opinions. I think you could remove the quit channel and use a context.WithCancel instead, calling the cancel in the close. This way we have only one synchronization mechanism and you don't have to change anything in the retry logic. WDYT? |
@cyriltovena, even I don't completely like the idea of having to listen on quit channel multiple places! But I don't understand what you mean by calling the core problem is the |
My idea was:

type client struct {
	...
	ctx    context.Context
	cancel context.CancelFunc
	...
}

func NewClient(ctx context.Context, ...) *client {
	ctx, cancel := context.WithCancel(ctx)
	return &client{
		ctx:    ctx,
		cancel: cancel,
	}
}

func (c *client) Stop() {
	c.cancel()
	...
}

Then you can use the client context in the backoff, and you don't need to change the rest of the code anymore. |
@cyriltovena, yea! right! actually thought about that approach as well. I just felt like having but one other thing is, in that case we can totally remove the |
Yes, I want quit to go. But more importantly, I think @slim-bean is really opinionated about this whole PR. While this helps docker, it creates some possibility of losing data for other use cases. I'm wondering if this behaviour could be optional and used just for the docker driver. |
I think what I'd like is for SIGINT to initiate shutdown while still respecting the retries and timeouts, and for SIGKILL to exit no matter what. I think it might also be nice to trap a second SIGINT to initiate shutdown regardless of retries (after the first SIGINT we should log a message saying that a second SIGINT or a SIGKILL will force quit and abandon retries). |
@slim-bean I see what you mean. It can be done for SIGINT the way you wanted, by watching for the signal in the driver's main and shutting down gracefully in the driver's StopLogging method (I have to check what can be done for retries, though). But nothing can be done about SIGKILL, as it cannot be caught! I will update the PR accordingly, also for the canceling the |
yeah sorry, SIGTERM is probably the correct signal instead of SIGKILL |
Interesting, it looks like docker only sends a SIGTERM. I think the general idea still applies, though: we handle the SIGINT/SIGTERM and initiate a shutdown, but wait for any active sends to complete. A second SIGINT/SIGTERM would interrupt any active sends and immediately shut down the process. I realize this may be met with some controversy as to what the correct behavior is; I'm trying to strike a balance between data durability and intuitiveness, and I am interested in other opinions. |
Hmm! SIGTERM is also for graceful shutdown! So we should handle it the same way as SIGINT IMO! |
Ah! Saw your latest comment after posting mine! So yea! Agree with
|
Codecov Report
@@ Coverage Diff @@
## master #2898 +/- ##
==========================================
- Coverage 61.63% 61.56% -0.08%
==========================================
Files 181 181
Lines 14712 14715 +3
==========================================
- Hits 9068 9059 -9
- Misses 4812 4828 +16
+ Partials 832 828 -4
|
@cyriltovena @slim-bean Turns out docker doesn't send any signals (neither SIGINT nor SIGTERM) to the driver. I just verified this by adding signal handlers to the driver's main, but no signal is received (even though the docker daemon log says a signal is sent to the container). Something like:

// graceful shutdown
// signal.Notify does not block when delivering, so the channel must be buffered.
sig := make(chan os.Signal, 1)
signal.Notify(sig, syscall.SIGINT, syscall.SIGTERM)
go func() {
	s := <-sig
	level.Info(util.Logger).Log("msg", "signal received", "signal", s)
}()

So the best we can do currently is make sure the client's Stop cancels batchSend's context, which we did in this PR! The only thing pending is either we remove |
I think Ed is reluctant about the idea of stopping sending batches, which this PR currently does. My suggestion would be to have two methods:
Then we could call StopNow() when the driver asks for termination, instead of Stop(). This way this special behaviour applies only to docker. In addition, I would set the timeout of the default client to 5s for docker, not 10s. This way we have a guarantee that the container will stop after 5s in the worst case, and that it actually tried to send the last batch. |
@cyriltovena thanks :) nice idea 👍 . actually thought about that same before, but didn't do it because adding Looking into it. I can definitely add it if its minor changes! |
Codecov Report
@@ Coverage Diff @@
## master #2898 +/- ##
==========================================
- Coverage 61.86% 61.76% -0.11%
==========================================
Files 182 182
Lines 14870 14880 +10
==========================================
- Hits 9200 9190 -10
- Misses 4822 4843 +21
+ Partials 848 847 -1
|
@slim-bean can you take a look when you get a chance? thanks :) |
Stores `context` and its `cancel` in the client and uses it to make upstream calls that need `ctx`. Also properly cancels all the upstream calls that use `ctx` when the client's Stop is called.
pkg/promtail/client/client.go
Outdated
  var status int
  for backoff.Ongoing() {
  	start := time.Now()
- 	status, err = c.send(ctx, tenantID, buf)
+ 	status, err = c.send(c.ctx, tenantID, buf)
I think you're also cancelling the current batch and the last batch with this.
Maybe you should add a test to verify that StopNow() always tries to send.
Use `StopNow()` from the docker-driver
cmd/docker-driver/config.go
Outdated
- Timeout: client.Timeout,
+ // Avoid blocking the docker-driver on the worst case
+ // https://github.com/grafana/loki/pull/2898#issuecomment-730218963
+ Timeout: 5 * time.Second,
I want to avoid this; I'd like all clients to have the same timeout and backoff behavior.
LGTM
* Support Snappy compression for gRPC fix: grafana#2898 Signed-off-by: Wing924 <weihe924stephen@gmail.com>
* changelog Signed-off-by: Wing924 <weihe924stephen@gmail.com>
* fix Signed-off-by: Wing924 <weihe924stephen@gmail.com>
* fix Signed-off-by: Wing924 <weihe924stephen@gmail.com>
* fix Signed-off-by: Wing924 <weihe924stephen@gmail.com>
* fix Signed-off-by: Wing924 <weihe924stephen@gmail.com>
* fix Signed-off-by: Wing924 <weihe924stephen@gmail.com>
* fix Signed-off-by: Wing924 <weihe924stephen@gmail.com>
* Fix Signed-off-by: Wing924 <weihe924stephen@gmail.com>
* fix Signed-off-by: Wing924 <weihe924stephen@gmail.com>
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #2898 +/- ##
==========================================
- Coverage 61.86% 61.76% -0.11%
==========================================
Files 182 182
Lines 14870 14880 +10
==========================================
- Hits 9200 9190 -10
- Misses 4822 4842 +20
Partials 848 848
|
The promtail `client` stop is not propagated properly to its `batchSend` method in the client's `run` method. Though the `run` method multiplexes on the `c.quit` channel, the other branch can block in `batchSend` without even listening on the `c.quit` channel in the `run` method.

This fixes it by multiplexing on `c.quit` inside `batchSend` as well.

Other suggestion: expose `client.run` to promtail client users and let them control it via context. But that requires API changes like accepting `ctx` in the `client.Run` method.

fixes: #2361
Before
After
Following plugin has the fix to experiment.