Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pkg/rule: Fix bugs where rules were out of sync. #1439

Closed
wants to merge 3 commits into from

Conversation

johncming
Copy link
Contributor

@johncming johncming commented Aug 20, 2019

Thanos rule loading alerts about out-of-sync rule files

Specifically, the following scenario:

Rule file rule. Yaml, which contains some rules, through the reload command, thanos rule load rules.

If you clear the contents of rule. Yaml, but leave the file in, the old rules in thanos rule will not be cleared.

This bug leads to a common situation (all rules in one file)

If thanos rule has 100 rules, it can only clear 99 rules at most.

Closes #1739

@johncming johncming changed the title pkg/rule: Fixed bugs where rules were out of sync. pkg/rule: Fix bugs where rules were out of sync. Aug 20, 2019
@bwplotka bwplotka self-requested a review August 22, 2019 09:00
Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I think that makes sense but have one suggestion

Thanks for helping 👍

On top of the suggestion below I think we really need more unit tests to:

  • Detect those bug early
  • Do not repeat same bug anymore
  • Confirm fix works (:

Do you think we can add some tests?

pkg/rule/rule.go Outdated
@@ -113,7 +113,10 @@ func (r RuleGroup) MarshalYAML() (interface{}, error) {
func (m *Managers) Update(dataDir string, evalInterval time.Duration, files []string) error {
var (
errs = tsdberrors.MultiError{}
filesMap = map[storepb.PartialResponseStrategy][]string{}
filesMap = map[storepb.PartialResponseStrategy][]string{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm I think that will work, but to make it verbose, clearer and bit more efficient I would suggest add one more loop at the end:

	for s, updater := range *m {
		if _, ok := filesMap[s]; ok {
			// We already updated those above.
			continue
		}
		// Rules removed update updater if needed.
		if len(updater.RuleGroups()) == 0 {
			continue
		}

		if err := updater.Update(evalInterval, []string{}, nil); err != nil {
			errs = append(errs, err)
		}
	}

What do you think? (:

Copy link
Contributor Author

@johncming johncming Aug 28, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it 's more clear. how about merge the last loop with your loop?

	for s, updater := range *m {
		fs := filesMap[s]
		if err := updater.Update(evalInterval, fs, nil); err != nil {
			errs = append(errs, err)
			continue
		}
	}

if ok, i will update the pr.

I will add a testcase for this.

@johncming
Copy link
Contributor Author

Nice! I think that makes sense but have one suggestion

Thanks for helping 👍

On top of the suggestion below I think we really need more unit tests to:

  • Detect those bug early
  • Do not repeat same bug anymore
  • Confirm fix works (:

Do you think we can add some tests?

Yes, it is necessary to add test cases.

@GiedriusS
Copy link
Member

@johncming do you have any free time to finish up this PR? :p

@simonpasquier
Copy link
Contributor

@johncming do you have time to rebase?

@johncming
Copy link
Contributor Author

johncming commented Dec 20, 2019

@johncming do you have time to rebase?

fixed.

Copy link
Contributor

@simonpasquier simonpasquier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

err = m.Update(1*time.Second, []string{
filepath.Join(dir, "no_strategy.yaml"),
})

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit) remove blank line

dir, err := ioutil.TempDir("", "test_rule_rule_groups")
testutil.Ok(t, err)
defer func() { testutil.Ok(t, os.RemoveAll(dir)) }()
err = os.MkdirAll(filepath.Join(dir, "subdir"), 0775)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this line and the next one aren't needed AFAICT.

@simonpasquier
Copy link
Contributor

I think you'll need to update the CHANGELOG and reference #1739 too.

Signed-off-by: johncming <johncming@yahoo.com>
@johncming
Copy link
Contributor Author

I think you'll need to update the CHANGELOG and reference #1739 too.

Updated.

When I read the CHANGELOG, there was no place to put this reference #1739. Where should I put this reference #1739? Or I reference it in the pr conversations?

Thanks.

@simonpasquier
Copy link
Contributor

When I read the CHANGELOG, there was no place to put this reference #1739. Where should I put this reference #1739? Or I reference it in the pr conversations?

Yes I meant the latter (adding to Closes #1739 to the PR description).

@johncming
Copy link
Contributor Author

When I read the CHANGELOG, there was no place to put this reference #1739. Where should I put this reference #1739? Or I reference it in the pr conversations?

Yes, I meant the latter (adding to Closes #1739 to the PR description).

Yes, it is updated.

@daixiang0
Copy link
Member

daixiang0 commented Jan 13, 2020

@johncming can you rebase?

@johncming
Copy link
Contributor Author

@johncming can you rebase?

rebased.

@@ -44,6 +44,7 @@ Compactor now properly handles partial block uploads for all operation like rete
- [#1882](https://github.com/thanos-io/thanos/pull/1882) Receive: upload to object storage as 'receive' rather than 'sidecar'.
- [#1907](https://github.com/thanos-io/thanos/pull/1907) Store: Fixed the duration unit for the metric `thanos_bucket_store_series_gate_duration_seconds`.
- [#1931](https://github.com/thanos-io/thanos/pull/1931) Compact: Fixed the compactor successfully exiting when actually an error occurred while compacting a blocks group.
- [#1439](https://github.com/thanos-io/thanos/pull/1439) Rule: Fix bugs where rules were out of sync.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should go in the Unreleased section.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -213,6 +213,21 @@ func (m *Manager) Update(evalInterval time.Duration, files []string) error {
continue
}
}

for s, mgr := range m.mgrs {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it might be worth adding a comment explaining that this section removes the rules from a manager when a strategy has no more rule.

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Sorry for lag in review. Let's fix @simonpasquier comments and merge 👍

LGTM after fixes.

@GiedriusS
Copy link
Member

@johncming thanks for your persistence! Care to fix the code according to the last comments so that we could merge it and fix it?

@stale
Copy link

stale bot commented Mar 1, 2020

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

@stale stale bot added the stale label Mar 1, 2020
@stale stale bot closed this Mar 9, 2020
@lilic
Copy link
Contributor

lilic commented Apr 2, 2020

Any reason why this PR was never merged, was this issue fixed by another PR? Thanks! @johncming @bwplotka

@simonpasquier
Copy link
Contributor

@lilic it has been closed by the bot because it has seen no activity in a while. @johncming would you be able to reopen it and address the comments?

@lilic
Copy link
Contributor

lilic commented Apr 2, 2020

@johncming if you don't have time to address the comments for any reason, I can also takeover and pick up this PR from where you leftover with your commits of course? Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

rule: Alerts will not be deleted after curl -X POST localhost:10902/-/reload.
6 participants