[Heartbeat] Fix scheduler job type limit algorithm #27559

andrewvc · 2021-08-23T20:21:26Z

What does this PR do?

Previously, Heartbeat would break when running with mode: all, since that creates multiple terminal jobs. Each of them would attempt to release the limit semaphore, when only the last one should.

This PR also refactors recursive job execution into an object-oriented structure to make it more readable.

Why is it important?

Without this fix, Heartbeat panics:

panic: semaphore: released more than held

goroutine 100 [running]:
golang.org/x/sync/semaphore.(*Weighted).Release(0xc00011c0f0, 0x1)
	/Users/dominiqueclarke/go/pkg/mod/golang.org/x/sync@v0.0.0-20200317015054-43a5402ce75a/semaphore/semaphore.go:103 +0xed
github.com/elastic/beats/v7/heartbeat/scheduler.(*Scheduler).runRecursiveTask(0xc000720000, 0x1101964a0, 0xc00073c440, 0xc00000ec20, 0xc000058a20, 0xc00011c0f0, 0xc04114cb9b52d7b0, 0x346f4dc, 0x1110a1b60)
	/Users/dominiqueclarke/dev/beats/heartbeat/scheduler/scheduler.go:303 +0x2e5
created by github.com/elastic/beats/v7/heartbeat/scheduler.(*Scheduler).runRecursiveTask
	/Users/dominiqueclarke/dev/beats/heartbeat/scheduler/scheduler.go:300 +0x246

The panic occurs with this config:

output.elasticsearch:
  hosts: ["localhost:9200"]
  username: "elastic"
  password: "[redacted]"
heartbeat.jobs.http.limit: 1
heartbeat.monitors:
- type: http
  id: httpcheck
  name: HTTP_CHECK
  urls: 'https://news.google.com'
  schedule: '@every 1m'
  mode: all
- type: http
  id: localhost
  name: localhost
  urls: 'http://localhost:8080'
  schedule: '@every 1m'
  mode: all
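The failure mode described above can be reproduced in miniature. The following is a minimal sketch, not the Heartbeat code: `countingSem` is a hypothetical stand-in for `golang.org/x/sync/semaphore.Weighted` (it does not block or enforce a limit, it only tracks what is held), and `runBuggy` is an illustrative name. It assumes one acquire per job and one release per terminal task, which is the mismatch the description refers to.

```go
package main

import (
	"fmt"
	"sync"
)

// countingSem is a hypothetical stand-in for semaphore.Weighted.
// It only tracks how much is held and panics on over-release.
type countingSem struct {
	mu   sync.Mutex
	held int
}

func (s *countingSem) Acquire() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.held++
}

func (s *countingSem) Release() {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.held == 0 {
		panic("semaphore: released more than held")
	}
	s.held--
}

// runBuggy acquires once for the job, then releases once per terminal task.
// It returns the recovered panic message, or "" if nothing panicked.
func runBuggy(terminalTasks int) (msg string) {
	defer func() {
		if r := recover(); r != nil {
			msg = fmt.Sprint(r)
		}
	}()
	sem := &countingSem{}
	sem.Acquire() // one acquire per job
	for i := 0; i < terminalTasks; i++ {
		sem.Release() // bug: one release per terminal task
	}
	return ""
}

func main() {
	fmt.Printf("1 terminal task:  %q\n", runBuggy(1))
	fmt.Printf("2 terminal tasks: %q\n", runBuggy(2))
}
```

With a single terminal task the acquire and release balance out; with two, the second release has nothing left to give back, which matches the panic message in the stack trace.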

Checklist

- My code follows the style guidelines of this project
- I have commented my code, particularly in hard-to-understand areas
- ~~I have made corresponding changes to the documentation~~
- ~~I have made corresponding changes to the default configuration files~~
- I have added tests that prove my fix is effective or that my feature works
- ~~I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.~~

How to test this PR locally

Run Heartbeat with the config above and check that it no longer panics.

elasticmachine · 2021-08-23T20:21:28Z

Pinging @elastic/uptime (Team:Uptime)

andrewvc · 2021-08-23T20:30:47Z

heartbeat/scheduler/scheduler.go

-			jobSem.Release(1)
+		// There is always at least 1 task (the current one), if that's all, then we know
+		// there are no other jobs active or pending, and we can release the jobLimitSem
+		if sj.jobLimitSem != nil && sj.activeTasks.Load() == 1 {


This is the core of the fix: being able to count the actual number of active tasks.
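The counting idea can be sketched in isolation. This is an illustration under stated assumptions, not the code in scheduler.go: the `task` and `job` types, `runJob`, and the `releases` counter are hypothetical, and the counter stands in for the call to `jobLimitSem.Release(1)`. The sketch also uses the result of the atomic decrement to detect the last task, rather than the `Load() == 1` check shown in the diff.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// task is a hypothetical unit of work that may return continuation tasks.
type task func() []task

// job tracks how many of its tasks are running or pending.
type job struct {
	activeTasks int64
	releases    int64 // stands in for jobLimitSem.Release(1); illustration only
	wg          sync.WaitGroup
}

// run registers a task as active and starts it in its own goroutine.
func (j *job) run(t task) {
	atomic.AddInt64(&j.activeTasks, 1)
	j.wg.Add(1)
	go j.runTask(t)
}

func (j *job) runTask(t task) {
	defer j.wg.Done()
	for _, cont := range t() {
		// Register continuations before this task counts as finished,
		// so the counter cannot reach zero while work is still pending.
		j.run(cont)
	}
	if atomic.AddInt64(&j.activeTasks, -1) == 0 {
		// Only the last task to finish releases the limit.
		atomic.AddInt64(&j.releases, 1)
	}
}

// runJob runs root and all its continuations, and returns how many
// times the limit would have been released.
func runJob(root task) int64 {
	j := &job{}
	j.run(root)
	j.wg.Wait()
	return atomic.LoadInt64(&j.releases)
}

func main() {
	leaf := func() []task { return nil }
	// A root task that fans out into three terminal tasks.
	root := func() []task { return []task{leaf, leaf, leaf} }
	fmt.Println("releases:", runJob(root)) // prints "releases: 1"
}
```

However the terminal tasks interleave, exactly one of them observes the counter reaching zero, so the limit is released once per job rather than once per terminal task.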

elasticmachine · 2021-08-23T20:33:54Z

💚 Build Succeeded


Build stats

Start Time: 2021-08-24T13:54:47.104+0000
Duration: 66 min 56 sec
Commit: e89c771

Test stats 🧪

Test	Results
Failed	0
Passed	3399
Skipped	80
Total	3479


💚 Flaky test report

Tests succeeded.


vigneshshanmugam

LGTM and looks much cleaner than before.

heartbeat/scheduler/scheduler_test.go

x-pack/heartbeat/monitors.d/plaintodos.yml

x-pack/heartbeat/monitors.d/todos.yml

vigneshshanmugam

I tested this locally and it did not work as expected: jobs are not being limited even after setting the limit to 1.

andrewvc · 2021-08-24T00:14:37Z

@vigneshshanmugam I believe I've addressed all PR feedback now

andrewvc · 2021-08-24T00:15:02Z

I take that back; I didn't see your final comment. I will try a local test.

andrewvc · 2021-08-24T00:37:21Z

@vigneshshanmugam I see it working with the config below. I've pushed a commit with WARN logging that shows the locking. You can see that, with a limit of 1, only one monitor holds the lock at a time.

2021-08-23T19:34:32.996-0500    WARN    scheduler/scheduler.go:209      Run my-monitor | &{1 0 {0 0} {{0 0 0 <nil>} 0}} limit
2021-08-23T19:34:32.996-0500    WARN    scheduler/schedjob.go:46        TRY-BLOCKING-ACQUIRE ? my-monitor
2021-08-23T19:34:32.996-0500    WARN    scheduler/schedjob.go:48        ACQUIRED + my-monitor
2021-08-23T19:34:32.996-0500    WARN    scheduler/scheduler.go:209      Run alt-monitor | &{1 1 {0 0} {{0 0 0 <nil>} 0}} limit
2021-08-23T19:34:32.996-0500    WARN    scheduler/schedjob.go:46        TRY-BLOCKING-ACQUIRE ? alt-monitor
2021-08-23T19:34:32.996-0500    INFO    cfgfile/reload.go:164   Config reloader started
2021-08-23T19:34:32.997-0500    INFO    cfgfile/reload.go:224   Loading of config files completed.
2021-08-23T19:34:33.164-0500    WARN    scheduler/schedjob.go:105       RELEASED + my-monitor
2021-08-23T19:34:33.164-0500    WARN    scheduler/scheduler.go:213      End my-monitor
2021-08-23T19:34:33.164-0500    WARN    scheduler/schedjob.go:48        ACQUIRED + alt-monitor
2021-08-23T19:34:33.189-0500    WARN    scheduler/schedjob.go:105       RELEASED + alt-monitor
2021-08-23T19:34:33.189-0500    WARN    scheduler/scheduler.go:213      End alt-monitor
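The serialization visible in the log above can be sketched with a small stand-alone program. This is an assumption-laden illustration, not Heartbeat's scheduler: a buffered channel plays the role of the limit semaphore, and `peakHolders` is a hypothetical helper that records the largest number of goroutines holding the limit at the same time.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// peakHolders starts `jobs` goroutines that each acquire a shared limit of
// size `limit`, and returns the largest number that held it simultaneously.
func peakHolders(limit, jobs int) int64 {
	sem := make(chan struct{}, limit) // buffered channel used as a counting semaphore
	var holding, peak int64
	var wg sync.WaitGroup
	for i := 0; i < jobs; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sem <- struct{}{} // blocking acquire
			n := atomic.AddInt64(&holding, 1)
			// Record a new peak if this holder count is the highest seen so far.
			for {
				p := atomic.LoadInt64(&peak)
				if n <= p || atomic.CompareAndSwapInt64(&peak, p, n) {
					break
				}
			}
			// The monitor's check would run here.
			atomic.AddInt64(&holding, -1)
			<-sem // release
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&peak)
}

func main() {
	fmt.Println("peak holders with limit 1:", peakHolders(1, 2)) // prints 1
}
```

With a limit of 1 the second job blocks in its acquire until the first releases, which is the same ordering as the TRY-BLOCKING-ACQUIRE, ACQUIRED, and RELEASED lines for my-monitor and alt-monitor.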

The sample config used is:

heartbeat.config.monitors:
  path: ${path.config}/monitors.d/*.yml
  reload.enabled: false
  reload.period: 5s
heartbeat.jobs.http.limit: 1
heartbeat.monitors:
- type: http
  id: my-monitor
  name: My Monitor
  urls: ["http://www.google.com"]
  mode: all
  schedule: '@every 10s'
- type: http
  id: alt-monitor
  mode: all
  name: Alt Monitor
  urls: ["http://www.elastic.co"]
  schedule: '@every 10s'

setup.template.settings:
  index.number_of_shards: 1
  index.codec: best_compression
setup.kibana:
output.console: ~
processors:
  - add_observer_metadata:

vigneshshanmugam · 2021-08-24T00:47:13Z

@andrewvc Sorry about that, my mistake. I had put my acquire print statements in the wrong place, before the acquire, which made me think it was acquiring too soon. I have now tested it properly and can see it working fine.

vigneshshanmugam

LGTM. This file needs to be removed, though: x-pack/heartbeat/out

andrewvc · 2021-08-24T16:35:01Z

@Mergifyio backport 7.x

andrewvc · 2021-08-24T16:35:08Z

@Mergifyio backport 7.15

Previously heartbeat would break when running with mode: all since that would create multiple terminal jobs. These would all attempt to release from the limit semaphore, when only the last one should. This also refactors recursive job running into an OO type structure to make things more readable. (cherry picked from commit d561a55)

mergify · 2021-08-24T16:35:31Z

Command backport 7.x: success

Backports have been created

#27573 [Heartbeat] Fix scheduler job type limit algorithm (backport #27559) has been created for branch 7.x


mergify · 2021-08-24T16:36:05Z

Command backport 7.15: success

Backports have been created

#27574 [Heartbeat] Fix scheduler job type limit algorithm (backport #27559) has been created for branch 7.15

Previously heartbeat would break when running with mode: all since that would create multiple terminal jobs. These would all attempt to release from the limit semaphore, when only the last one should. This also refactors recursive job running into an OO type structure to make things more readable. (cherry picked from commit d561a55) Co-authored-by: Andrew Cholakian <andrew@andrewvc.com>

* master:
  Skip Flaky Tests (elastic#27590)
  Remove fargate from aws module config (elastic#27575)
  [Heartbeat] Fix scheduler job type limit algorithm (elastic#27559)


Fixed limit algorithm

9c3a183

andrewvc added the bug, Heartbeat, Team:obs-ds-hosted-services, and v7.15.0 labels Aug 23, 2021

andrewvc requested review from vigneshshanmugam and justinkambic August 23, 2021 20:21

andrewvc requested a review from a team as a code owner August 23, 2021 20:21

andrewvc self-assigned this Aug 23, 2021

botelastic bot added and then removed the needs_team label Aug 23, 2021

andrewvc added 2 commits August 23, 2021 15:27

Actually implement the fix

7669669

Improve comments

8d19d48

andrewvc commented Aug 23, 2021


Cleanup var naming

bc93409

vigneshshanmugam approved these changes Aug 23, 2021


vigneshshanmugam requested changes Aug 23, 2021


Incorporate PR feedback

49be62a

Spread code ouut nicer, add debugging print

2aac7df

vigneshshanmugam approved these changes Aug 24, 2021


vigneshshanmugam mentioned this pull request Aug 24, 2021

Heartbeat: handle panic when job spanws multiple tasks #27558

Closed

andrewvc added 3 commits August 24, 2021 08:49

Incorporate PR feedback, update reference yml

f8ef339

Remove debug logs

003e74d

Fix id invocation

e89c771

andrewvc merged commit d561a55 into elastic:master Aug 24, 2021

andrewvc deleted the fix-lim-release branch August 24, 2021 16:34

mergify bot mentioned this pull request Aug 24, 2021

[Heartbeat] Fix scheduler job type limit algorithm (backport #27559) #27573

Merged

mergify bot mentioned this pull request Aug 24, 2021

[Heartbeat] Fix scheduler job type limit algorithm (backport #27559) #27574

Merged
