[Heartbeat] Fix scheduler job type limit algorithm #27559

Merged: 9 commits, Aug 24, 2021

@andrewvc andrewvc commented Aug 23, 2021

What does this PR do?

Previously, Heartbeat would panic when a monitor ran with mode: all, because that mode creates multiple terminal jobs. Each of them attempted to release the job-type limit semaphore, when only the last one to finish should.

This PR also refactors the recursive job-running code into an object-oriented structure to make it more readable.

Why is it important?

The bug causes a panic:

panic: semaphore: released more than held

goroutine 100 [running]:
golang.org/x/sync/semaphore.(*Weighted).Release(0xc00011c0f0, 0x1)
	/Users/dominiqueclarke/go/pkg/mod/golang.org/x/sync@v0.0.0-20200317015054-43a5402ce75a/semaphore/semaphore.go:103 +0xed
github.com/elastic/beats/v7/heartbeat/scheduler.(*Scheduler).runRecursiveTask(0xc000720000, 0x1101964a0, 0xc00073c440, 0xc00000ec20, 0xc000058a20, 0xc00011c0f0, 0xc04114cb9b52d7b0, 0x346f4dc, 0x1110a1b60)
	/Users/dominiqueclarke/dev/beats/heartbeat/scheduler/scheduler.go:303 +0x2e5
created by github.com/elastic/beats/v7/heartbeat/scheduler.(*Scheduler).runRecursiveTask
	/Users/dominiqueclarke/dev/beats/heartbeat/scheduler/scheduler.go:300 +0x246

The panic occurs with this config:

output.elasticsearch:
  hosts: ["localhost:9200"]
  username: "elastic"
  password: "[redacted]"
heartbeat.jobs.http.limit: 1
heartbeat.monitors:
- type: http
  id: httpcheck
  name: HTTP_CHECK
  urls: 'https://news.google.com'
  schedule: '@every 1m'
  mode: all
- type: http
  id: localhost
  name: localhost
  urls: 'http://localhost:8080'
  schedule: '@every 1m'
  mode: all
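
The failure mode described above (one acquire for the job, one release per terminal task) can be sketched in a few lines. This is an illustration only, not the Heartbeat code: `countingSem` is a hypothetical stand-in for `semaphore.Weighted` that returns an error where the real type panics, and `runBuggy` is an invented name.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errOverRelease mirrors the message seen in the panic above.
var errOverRelease = errors.New("semaphore: released more than held")

// countingSem is a hypothetical stand-in for semaphore.Weighted. It only
// tracks how many permits are held and reports over-release as an error
// instead of panicking.
type countingSem struct {
	mu   sync.Mutex
	held int
}

func (s *countingSem) Acquire() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.held++
}

func (s *countingSem) Release() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.held == 0 {
		return errOverRelease
	}
	s.held--
	return nil
}

// runBuggy acquires a single permit for the whole job, then fans out into
// `leaves` terminal tasks that each release it. It returns how many of
// those releases failed.
func runBuggy(leaves int) int {
	sem := &countingSem{}
	sem.Acquire()

	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		failures int
	)
	for i := 0; i < leaves; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if err := sem.Release(); err != nil {
				mu.Lock()
				failures++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return failures
}

func main() {
	fmt.Println("failed releases, 1 terminal task:", runBuggy(1))
	fmt.Println("failed releases, 3 terminal tasks:", runBuggy(3))
}
```

With one terminal task every release is matched by an acquire; with three, two of the releases have nothing left to release, which is the condition the real semaphore reports by panicking.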

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
    - [ ] I have made corresponding changes to the documentation
    - [ ] I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
    - [ ] I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

Run Heartbeat with the config above.

@andrewvc andrewvc added bug Heartbeat Team:obs-ds-hosted-services Label for the Observability Hosted Services team v7.15.0 labels Aug 23, 2021
@andrewvc andrewvc requested a review from a team as a code owner August 23, 2021 20:21
@andrewvc andrewvc self-assigned this Aug 23, 2021
@elasticmachine

Pinging @elastic/uptime (Team:Uptime)

@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Aug 23, 2021
jobSem.Release(1)
// There is always at least 1 task (the current one), if that's all, then we know
// there are no other jobs active or pending, and we can release the jobLimitSem
if sj.jobLimitSem != nil && sj.activeTasks.Load() == 1 {

andrewvc (Contributor, Author) commented on the lines above:

This is the core of the fix: counting the actual number of active tasks.

@elasticmachine

elasticmachine commented Aug 23, 2021

💚 Build Succeeded

Build stats

  • Start Time: 2021-08-24T13:54:47.104+0000

  • Duration: 66 min 56 sec

  • Commit: e89c771

Test stats 🧪

Test Results
Failed 0
Passed 3399
Skipped 80
Total 3479

💚 Flaky test report

Tests succeeded.

@vigneshshanmugam vigneshshanmugam left a comment

LGTM and looks much cleaner than before.

@vigneshshanmugam vigneshshanmugam left a comment

I tested this locally, and it is not working as expected: jobs are not being limited even with the limit set to 1.

@andrewvc

@vigneshshanmugam I believe I've addressed all PR feedback now

@andrewvc

I take that back; I didn't see your final comment. I'll try a local test.

@andrewvc

@vigneshshanmugam I see it working with the config below. I've pushed a commit with WARN logging that shows the locking. You can see that, with a limit of one, only one monitor holds the lock at a time:

2021-08-23T19:34:32.996-0500    WARN    scheduler/scheduler.go:209      Run my-monitor | &{1 0 {0 0} {{0 0 0 <nil>} 0}} limit
2021-08-23T19:34:32.996-0500    WARN    scheduler/schedjob.go:46        TRY-BLOCKING-ACQUIRE ? my-monitor
2021-08-23T19:34:32.996-0500    WARN    scheduler/schedjob.go:48        ACQUIRED + my-monitor
2021-08-23T19:34:32.996-0500    WARN    scheduler/scheduler.go:209      Run alt-monitor | &{1 1 {0 0} {{0 0 0 <nil>} 0}} limit
2021-08-23T19:34:32.996-0500    WARN    scheduler/schedjob.go:46        TRY-BLOCKING-ACQUIRE ? alt-monitor
2021-08-23T19:34:32.996-0500    INFO    cfgfile/reload.go:164   Config reloader started
2021-08-23T19:34:32.997-0500    INFO    cfgfile/reload.go:224   Loading of config files completed.
2021-08-23T19:34:33.164-0500    WARN    scheduler/schedjob.go:105       RELEASED + my-monitor
2021-08-23T19:34:33.164-0500    WARN    scheduler/scheduler.go:213      End my-monitor
2021-08-23T19:34:33.164-0500    WARN    scheduler/schedjob.go:48        ACQUIRED + alt-monitor
2021-08-23T19:34:33.189-0500    WARN    scheduler/schedjob.go:105       RELEASED + alt-monitor
2021-08-23T19:34:33.189-0500    WARN    scheduler/scheduler.go:213      End alt-monitor

The sample config used is:

heartbeat.config.monitors:
  path: ${path.config}/monitors.d/*.yml
  reload.enabled: false
  reload.period: 5s
heartbeat.jobs.http.limit: 1
heartbeat.monitors:
- type: http
  id: my-monitor
  name: My Monitor
  urls: ["http://www.google.com"]
  mode: all
  schedule: '@every 10s'
- type: http
  id: alt-monitor
  mode: all
  name: Alt Monitor
  urls: ["http://www.elastic.co"]
  schedule: '@every 10s'

setup.template.settings:
  index.number_of_shards: 1
  index.codec: best_compression
setup.kibana:
output.console: ~
processors:
  - add_observer_metadata:
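
The serialization visible in the log above (alt-monitor blocks on acquire until my-monitor releases) can be reproduced with a small sketch. It assumes a buffered channel as the limit and a short sleep as stand-in work; `maxConcurrent` is a hypothetical helper, not part of Heartbeat.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// maxConcurrent runs `jobs` goroutines behind a limit of `limit` permits
// and returns the peak number of goroutines that held a permit at once.
func maxConcurrent(limit, jobs int) int64 {
	sem := make(chan struct{}, limit)
	var running, peak int64
	var wg sync.WaitGroup

	for i := 0; i < jobs; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sem <- struct{}{} // blocking acquire
			n := atomic.AddInt64(&running, 1)

			// Record a new peak if this goroutine raised it.
			for {
				p := atomic.LoadInt64(&peak)
				if n <= p || atomic.CompareAndSwapInt64(&peak, p, n) {
					break
				}
			}

			time.Sleep(5 * time.Millisecond) // stand-in for the monitor's work
			atomic.AddInt64(&running, -1)
			<-sem // release
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&peak)
}

func main() {
	fmt.Println("peak holders with limit 1:", maxConcurrent(1, 4))
}
```

With a limit of 1 the peak is always 1, which corresponds to the ACQUIRED line for alt-monitor appearing only after the RELEASED line for my-monitor.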

@vigneshshanmugam

@andrewvc Sorry about that, my bad. I had put my acquire print statements in the wrong place, before the acquire, which made me think it was acquiring too soon. I have now tested it properly and can see it working fine.

@vigneshshanmugam vigneshshanmugam left a comment

LGTM, though this file needs to be removed: x-pack/heartbeat/out

@andrewvc andrewvc merged commit d561a55 into elastic:master Aug 24, 2021
@andrewvc andrewvc deleted the fix-lim-release branch August 24, 2021 16:34
@andrewvc

@Mergifyio backport 7.x

@andrewvc

@Mergifyio backport 7.15

mergify bot pushed a commit that referenced this pull request Aug 24, 2021
(cherry picked from commit d561a55)
@mergify

mergify bot commented Aug 24, 2021

Command backport 7.x: success

Backports have been created

mergify bot pushed a commit that referenced this pull request Aug 24, 2021
(cherry picked from commit d561a55)
@mergify

mergify bot commented Aug 24, 2021

Command backport 7.15: success

Backports have been created

andrewvc added a commit that referenced this pull request Aug 24, 2021
(cherry picked from commit d561a55)

Co-authored-by: Andrew Cholakian <andrew@andrewvc.com>
andrewvc added a commit that referenced this pull request Aug 24, 2021
(cherry picked from commit d561a55)

Co-authored-by: Andrew Cholakian <andrew@andrewvc.com>
mdelapenya added a commit to mdelapenya/beats that referenced this pull request Aug 25, 2021
* master:
  Skip Flaky Tests  (elastic#27590)
  Remove fargate from aws module config (elastic#27575)
  [Heartbeat] Fix scheduler job type limit algorithm (elastic#27559)
Icedroid pushed a commit to Icedroid/beats that referenced this pull request Nov 1, 2021