[Heartbeat] Add browser monitor timeout #32434
Conversation
Pinging @elastic/uptime (Team:Uptime)
```diff
@@ -16,6 +17,7 @@ func DefaultConfig() *Config {
 	return &Config{
 		Sandbox:     false,
 		Screenshots: "on",
+		Timeout:     14 * time.Minute,
```
@andrewvc should I set the default to 15 min already? Or do we want to change that once it's actually able to run for 15 min?
14 min sounds weird to me; IMO we can keep it as a whole number, 15m, and increase the service timeout. WDYT?
@vigneshshanmugam we discussed the same idea in tech sync. I noticed Andrew has already created the issue to update the k8s timeout, so I'll just update this one to 15 min.
Maybe it's just me, but IMO having the timeout on the monitor config does not feel ideal.

- The timeouts around the lightweight monitors are basically for the connection and for the whole check itself, which kind of aligns with what we have for browser monitors. But browser monitors are doing more than a connection, and this feels like a process timeout rather than a monitor timeout, just one level up.
- This feature would become super confusing once we allow users to set a timeout on the Synthetics tests themselves (Configurable timeouts (UJ and Step Level) synthetics#133). As a user, I would expect to set the timeout for individual journeys, which deals with how long a test should run as a whole, including the step timeouts, rather than how long it took HB to run something.

I am not sure if I am being pedantic, but I feel like this should go inside the task timeout and not the monitor-level timeout:

```yaml
heartbeat.jobs:
  timeout: 15m
```
@vigneshshanmugam I think you make some good points, but it's more of a "yes, and" situation IMHO. I see your point about a process timeout vs. a monitor timeout, but the main purpose of this feature is as a safeguard against the node process hanging and consuming resources forever. A secondary goal is providing better error messages, etc. We should have timeouts in node too, but node can hang as well; ideally we should have timeouts at each subprocess boundary. A global timeout setting doesn't make sense in the context of everything we're doing with Fleet and Agent, where we really set parameters per job. We're moving toward Fleet and the service, where a global heartbeat config really doesn't exist; the closest thing we have is defaults applied to individual jobs.
```diff
@@ -127,14 +127,14 @@ func (p *Project) jobs() []jobs.Job {
 	isScript := p.projectCfg.Source.Inline != nil
 	if isScript {
 		src := p.projectCfg.Source.Inline.Script
-		j = synthexec.InlineJourneyJob(context.TODO(), src, p.Params(), p.StdFields(), p.extraArgs()...)
+		j = synthexec.InlineJourneyJob(context.TODO(), src, p.Params(), p.StdFields(), p.projectCfg.Timeout, p.extraArgs()...)
```
I think it'd be cleaner and more idiomatic to use `context.WithTimeout` than pass the extra parameter. We've passed this `context.TODO()` for ages in anticipation of having a real timeout, so now may be the time.
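For illustration, a minimal runnable sketch of that pattern, where `runJourney` is a hypothetical stand-in for `synthexec.InlineJourneyJob` (not the real signature): the timeout travels inside the context rather than as an extra parameter.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runJourney is a stand-in for synthexec.InlineJourneyJob; it honors
// whatever deadline the caller attached to ctx.
func runJourney(ctx context.Context) error {
	select {
	case <-time.After(2 * time.Second): // simulated journey work
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	// Derive a context carrying the monitor timeout; no separate
	// time.Duration parameter is needed downstream.
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	if err := runJourney(ctx); err != nil {
		fmt.Println("journey failed:", err) // context deadline exceeded
	}
}
```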
Good suggestion, changing that now.
```go
go func() {
	<-ctx.Done()
	toTimer := time.NewTimer(timeout)
```
If we just pass the context through, we can just wait on `<-ctx.Done()`.
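A tiny runnable sketch of the point (my own illustration, not code from the PR): once the deadline lives on the context, the cleanup goroutine needs no timer of its own.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	done := make(chan struct{})
	go func() {
		<-ctx.Done() // fires on timeout or explicit cancellation; no time.NewTimer needed
		fmt.Println("cleaning up:", ctx.Err())
		close(done)
	}()
	<-done
}
```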
```diff
@@ -149,11 +149,25 @@ func TestRunBadExitCodeCmd(t *testing.T) {
 	})
 }
 
+func TestRunTimeoutExitCodeCmd(t *testing.T) {
+	cmd := exec.Command("go", "run", "./main.go")
 func runAndCollect(t *testing.T, cmd *exec.Cmd, stdinStr string) []*SynthEvent {
```
I like this test, but are we sure there isn't a race here? I wonder if on fast systems it might execute fast enough to be flaky. It might be safer to add a sleep to that Go program.
Added a small timeout to the executable just in case
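For reference, the helper the test shells out to could be as simple as this sleeping program (a hypothetical reconstruction, not the PR's actual `main.go`): the delay guarantees the child process is still alive when the timeout fires, even on a fast machine.

```go
// main.go: stand-in child process for the timeout test.
package main

import "time"

func main() {
	// Sleep longer than the test's timeout so the parent reliably
	// has to kill us, avoiding a flaky race on fast systems.
	time.Sleep(5 * time.Second)
}
```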
Looking really good, we just need to clean up the spurious `could not kill synthetics process: os: process already finished` warnings this now throws.
```go
func NewSynthexecCtx(timeout time.Duration) (context.Context, context.CancelFunc) {
	cmdTimeout := timeout + 30*time.Second

	synthexecCtx := context.WithValue(context.TODO(), SynthexecTimeout, cmdTimeout)
```
Suggested change:

```diff
-	synthexecCtx := context.WithValue(context.TODO(), SynthexecTimeout, cmdTimeout)
+	synthexecCtx := context.WithValue(context.Background(), SynthexecTimeout, cmdTimeout)
```
This is functionally the same as `context.TODO()`, but it says, essentially, that this is not something we think needs to be eventually replaced.
PS, nice use of `WithValue`.
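A self-contained sketch of the pattern under review, with the names and the 30-second padding taken from the snippet above; the key type and the final `WithTimeout` call are my assumptions, not necessarily how the PR wires it up:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// An unexported key type prevents collisions with other
// packages' context values.
type ctxKey int

const synthexecTimeout ctxKey = iota

// newSynthexecCtx pads the monitor timeout with a grace period for the
// subprocess and stores it on the context for downstream readers.
func newSynthexecCtx(timeout time.Duration) (context.Context, context.CancelFunc) {
	cmdTimeout := timeout + 30*time.Second
	ctx := context.WithValue(context.Background(), synthexecTimeout, cmdTimeout)
	return context.WithTimeout(ctx, cmdTimeout)
}

func main() {
	ctx, cancel := newSynthexecCtx(15 * time.Minute)
	defer cancel()

	// Downstream code recovers the padded timeout from the context.
	if d, ok := ctx.Value(synthexecTimeout).(time.Duration); ok {
		fmt.Println("command timeout:", d) // 15m30s
	}
}
```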
Right, I'll change that so it's clearer
```go
go func() {
	<-ctx.Done()
```
As we discussed, the kill/warning is a problem since this will happen even during runs that are not broken. I think the simplest way to fix that would be to define an atomic bool that flips when `cmd.Wait` returns, and check that here.
Rather than use a switch, which introduces a (small) state dependency between the two routines, I'd rather check the exit status with `if !cmd.ProcessState.Exited() { // kill and log error }`. WDYT?
I didn't realize that existed, that's perfect
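A runnable sketch of the idea in this thread (an atomic flag that flips when `cmd.Wait` returns, guarding the kill), showing how the spurious warning goes away on healthy runs. It assumes Go 1.19+ for `atomic.Bool`, uses `sleep` as a stand-in child process, and note the PR ultimately went with the `cmd.ProcessState.Exited()` check instead:

```go
package main

import (
	"context"
	"log"
	"os/exec"
	"sync/atomic"
	"time"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	cmd := exec.Command("sleep", "10") // stand-in for the synthetics node process
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}

	var finished atomic.Bool // flips once Wait returns

	go func() {
		<-ctx.Done()
		// Only kill when the process hasn't already exited, so healthy runs
		// no longer log "could not kill synthetics process: os: process
		// already finished". (A tiny check-then-kill window remains; the
		// ProcessState check in the PR has the same shape.)
		if !finished.Load() {
			if err := cmd.Process.Kill(); err != nil {
				log.Printf("could not kill synthetics process: %s", err)
			}
		}
	}()

	err := cmd.Wait()
	finished.Store(true)
	log.Printf("wait returned: %v", err)
}
```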
This pull request is now in conflicts. Could you fix it? 🙏
Force-pushed from 2434711 to 3c1a77c.
LGTM
* Add browser monitor timeout
* Add synthexec unit test for timeout
* Add changelog
What does this PR do?

Fixes #32388.

Added logic on the heartbeat side to kill a browser monitor's `node` process if it exceeds a configurable timeout without completing. If the timeout is triggered, the run is marked as failed and an error message is appended.

Why is it important?

This mitigates scenarios we've seen where `node` would go unresponsive and continue running, allocating more resources until OOM.

Checklist

- I have added an entry in `CHANGELOG.next.asciidoc` or `CHANGELOG-developer.next.asciidoc`.
.Author's Checklist
How to test this PR locally
Related issues