[Heartbeat] Add browser monitor timeout #32434
Conversation
Pinging @elastic/uptime (Team:Uptime)
```diff
@@ -16,6 +17,7 @@ func DefaultConfig() *Config {
 	return &Config{
 		Sandbox:     false,
 		Screenshots: "on",
+		Timeout:     14 * time.Minute,
```
@andrewvc should I set the default to 15 min already? Or do we want to change that once it's actually able to run for 15 min?
14 min sounds weird to me; IMO we can keep it as a whole number, 15m, and increase the service timeout. WDYT?
@vigneshshanmugam we discussed the same idea in tech sync. I noticed Andrew has already created the issue to update the k8s timeout, so I'll just update this one to 15 min.
Maybe it's just me, but IMO having the timeout on the monitor config does not feel ideal.

- The timeouts around the lightweight monitors are basically for the connection and for the whole check itself, which kind of aligns with what we have for browser monitors. But browser monitors are doing more than a connection, and this feels like a process timeout rather than a monitor timeout, just one level up.
- This feature would become super confusing once we allow users to set a timeout on the Synthetics tests themselves (Configurable timeouts (UJ and Step Level) synthetics#133). As a user, I would expect to set the timeout for individual journeys, which deals with how long a test should run as a whole, including the step timeouts, rather than how long it took HB to run something.

I am not sure if I am being pedantic, but I feel like this should go inside the task timeout and not the monitor-level timeout:

```yaml
heartbeat.jobs:
  timeout: 15m
```
@vigneshshanmugam I think you make some good points, but it's more of a "yes, and" situation IMHO. I see your point about a process timeout vs. a monitor timeout, but the main purpose of this feature is as a safeguard against the node process hanging and consuming resources forever. A secondary goal is providing better error messages, etc. We should have timeouts in node too, but node can hang as well; ideally we should have timeouts at each subprocess boundary. A global timeout setting doesn't make sense in the context of everything we're doing with Fleet and Agent, where we really set parameters per job. We're moving toward Fleet and the service, where a global heartbeat config really doesn't exist; the closest thing we have is defaults applied to individual jobs.
```diff
@@ -127,14 +127,14 @@ func (p *Project) jobs() []jobs.Job {
 	isScript := p.projectCfg.Source.Inline != nil
 	if isScript {
 		src := p.projectCfg.Source.Inline.Script
-		j = synthexec.InlineJourneyJob(context.TODO(), src, p.Params(), p.StdFields(), p.extraArgs()...)
+		j = synthexec.InlineJourneyJob(context.TODO(), src, p.Params(), p.StdFields(), p.projectCfg.Timeout, p.extraArgs()...)
```
I think it'd be cleaner and more idiomatic to use `context.WithTimeout` than pass the extra parameter. We've passed this `context.TODO()` for ages in anticipation of having a real timeout, so now may be the time.
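For illustration, a minimal runnable sketch of that pattern, where `runJourney` is a hypothetical stand-in for `synthexec.InlineJourneyJob` (not the real signature): the timeout travels inside the context rather than as an extra parameter.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runJourney is a stand-in for synthexec.InlineJourneyJob; it honors
// whatever deadline the caller attached to ctx.
func runJourney(ctx context.Context) error {
	select {
	case <-time.After(2 * time.Second): // simulated journey work
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	// Derive a context carrying the monitor timeout; no separate
	// time.Duration parameter is needed downstream.
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	if err := runJourney(ctx); err != nil {
		fmt.Println("journey failed:", err) // context deadline exceeded
	}
}
```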
Good suggestion, changing that now.
```go
go func() {
	<-ctx.Done()
	toTimer := time.NewTimer(timeout)
```
If we just pass the context through, we can just wait on `<-ctx.Done()`.
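A tiny runnable sketch of the point (my own illustration, not code from the PR): once the deadline lives on the context, the cleanup goroutine needs no timer of its own.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	done := make(chan struct{})
	go func() {
		<-ctx.Done() // fires on timeout or explicit cancellation; no time.NewTimer needed
		fmt.Println("cleaning up:", ctx.Err())
		close(done)
	}()
	<-done
}
```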
```diff
@@ -149,11 +149,25 @@ func TestRunBadExitCodeCmd(t *testing.T) {
 	})
 }
 
+func TestRunTimeoutExitCodeCmd(t *testing.T) {
+	cmd := exec.Command("go", "run", "./main.go")
 func runAndCollect(t *testing.T, cmd *exec.Cmd, stdinStr string) []*SynthEvent {
```
I like this test, but are we sure there isn't a race here? I wonder if on fast systems it might execute fast enough to be flaky. It might be safer to add a sleep to that Go program.
Added a small timeout to the executable just in case
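For reference, the helper the test shells out to could be as simple as this sleeping program (a hypothetical reconstruction, not the PR's actual `main.go`): the delay guarantees the child process is still alive when the timeout fires, even on a fast machine.

```go
// main.go: stand-in child process for the timeout test.
package main

import "time"

func main() {
	// Sleep longer than the test's timeout so the parent reliably
	// has to kill us, avoiding a flaky race on fast systems.
	time.Sleep(5 * time.Second)
}
```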
Looking really good, we just need to clean up the spurious `could not kill synthetics process: os: process already finished` warnings this now throws.
```go
func NewSynthexecCtx(timeout time.Duration) (context.Context, context.CancelFunc) {
	cmdTimeout := timeout + 30*time.Second

	synthexecCtx := context.WithValue(context.TODO(), SynthexecTimeout, cmdTimeout)
```
Suggested change:

```diff
-	synthexecCtx := context.WithValue(context.TODO(), SynthexecTimeout, cmdTimeout)
+	synthexecCtx := context.WithValue(context.Background(), SynthexecTimeout, cmdTimeout)
```
This is functionally the same as `context.TODO()`, but it says, essentially, that this is not something we think needs to be eventually replaced.
PS, nice use of `WithValue`.
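A self-contained sketch of the pattern under review, with the names and the 30-second padding taken from the snippet above; the key type and the final `WithTimeout` call are my assumptions, not necessarily how the PR wires it up:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// An unexported key type prevents collisions with other
// packages' context values.
type ctxKey int

const synthexecTimeout ctxKey = iota

// newSynthexecCtx pads the monitor timeout with a grace period for the
// subprocess and stores it on the context for downstream readers.
func newSynthexecCtx(timeout time.Duration) (context.Context, context.CancelFunc) {
	cmdTimeout := timeout + 30*time.Second
	ctx := context.WithValue(context.Background(), synthexecTimeout, cmdTimeout)
	return context.WithTimeout(ctx, cmdTimeout)
}

func main() {
	ctx, cancel := newSynthexecCtx(15 * time.Minute)
	defer cancel()

	// Downstream code recovers the padded timeout from the context.
	if d, ok := ctx.Value(synthexecTimeout).(time.Duration); ok {
		fmt.Println("command timeout:", d) // 15m30s
	}
}
```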
Right, I'll change that so it's clearer
```go
go func() {
	<-ctx.Done()
```
As we discussed, the kill/warning is a problem since this will happen even during runs that are not broken. I think the simplest way to fix that would be to define an atomic bool that flips when `cmd.Wait` returns, and check that here.
Rather than use a switch, which introduces a (small) state dependency between the two routines, I'd rather check the exit status with `if !cmd.ProcessState.Exited() { // kill and log error }`. WDYT?
I didn't realize that existed, that's perfect
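A runnable sketch of the idea in this thread (an atomic flag that flips when `cmd.Wait` returns, guarding the kill), showing how the spurious warning goes away on healthy runs. It assumes Go 1.19+ for `atomic.Bool`, uses `sleep` as a stand-in child process, and note the PR ultimately went with the `cmd.ProcessState.Exited()` check instead:

```go
package main

import (
	"context"
	"log"
	"os/exec"
	"sync/atomic"
	"time"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	cmd := exec.Command("sleep", "10") // stand-in for the synthetics node process
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}

	var finished atomic.Bool // flips once Wait returns

	go func() {
		<-ctx.Done()
		// Only kill when the process hasn't already exited, so healthy runs
		// no longer log "could not kill synthetics process: os: process
		// already finished". (A tiny check-then-kill window remains; the
		// ProcessState check in the PR has the same shape.)
		if !finished.Load() {
			if err := cmd.Process.Kill(); err != nil {
				log.Printf("could not kill synthetics process: %s", err)
			}
		}
	}()

	err := cmd.Wait()
	finished.Store(true)
	log.Printf("wait returned: %v", err)
}
```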
This pull request is now in conflicts. Could you fix it? 🙏
Force-pushed from 2434711 to 3c1a77c.
LGTM
* Add browser monitor timeout
* Add synthexec unit test for timeout
* Add changelog
What does this PR do?

Fixes #32388.

Added logic on the heartbeat side to kill a browser monitor's `node` process if it exceeds a configurable timeout without completing. If the timeout is triggered, the run is marked as failed and an error message is appended.

Why is it important?

This mitigates scenarios we've seen where `node` would go unresponsive and continue running, allocating more resources until OOM.

Checklist

- I have added an entry in `CHANGELOG.next.asciidoc` or `CHANGELOG-developer.next.asciidoc`.
.Author's Checklist
How to test this PR locally
Related issues