Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

returns error messages when trigger reload with http #1848

Merged
merged 19 commits into from
Mar 4, 2020
Merged
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
96 changes: 60 additions & 36 deletions cmd/thanos/rule.go
Original file line number Diff line number Diff line change
Expand Up @@ -472,49 +472,26 @@ func runRule(
}

// Handle reload and termination interrupts.
reload := make(chan struct{}, 1)
{
cancel := make(chan struct{})
reload <- struct{}{} // Initial reload.

g.Add(func() error {
//initialize rules.
if err := reloadRules(logger, ruleFiles, ruleMgr, evalInterval, configSuccess, configSuccessTime, rulesLoaded); err != nil {
level.Error(logger).Log("msg", "initialize rules failed", "err", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the initial load, we can actually fail and return the combined multierror.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kakkoyun The maintainer doesn't agree the to stop the everything.

please check the comment here. #1848 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have strong opinions about it, both works for me. In any case, I've pinged @bwplotka on the thread for the final decision.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, initial can fail indeed (:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bwplotka @kakkoyun Let's focus on the reload process first in this PR. Since the failure in iniliazation is breaking change.
After the reload related code is merged, I will start a new PR for the initialization failure only.
Any suggestion?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool, happy with that.

//returns when initialize with invalid pattern error.
if _, ok := err.(*errInvalidPattern); ok {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using recently introduced errors API. https://blog.golang.org/go1.13-errors

// Similar to:
//   if e, ok := err.(*QueryError); ok { … }
var e *QueryError
if errors.As(err, &e) {
    // err is a *QueryError, and e is set to the error's value
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Will look into it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kakkoyun According to the coverstaion, I should not change the behavior the the rule reloading. This code will be removed.

return err
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes behaviour compared to the previous implementation. Previously, it was logging and proceeding to other rule files.

if err != nil {	
	// The only error can be a bad pattern.	
	level.Error(logger).Log("msg", "retrieving rule files failed. Ignoring file.", "pattern", pat, "err", err)	
	continue	
}

If this is what we want, you should update the CHANGELOG accordingly.
You should update CHANGELOG anyway because you are changing the behaviour of the HTTP API.

@bwplotka

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kakkoyun This comes from orignal logic. I wrote it to keep same logic with orignal one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No actually. That's what I'm trying to tell, it doesn't keep the same logic if I'm not mistaken. If you check your reloadRules function, in a for-loop when execution encounters an error, it returns it so it breaks out from with the first encounter of an error.

var files []string
for _, pat := range ruleFiles {
    fs, err := filepath.Glob(pat)
    if err != nil {
        // Check errInvalidPattern when initialize.
        return &errInvalidPattern{err, pat}
    }

    files = append(files, fs...)
}

The old code, just prints an error and continue with the execution and checks other files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kakkoyun Sorry for my misunderstanding. Cannot remeber why change like this, since it's long time.
Maybe the change is reasonable. What do you thank?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change is definitely reasonable, all I'm saying is that we should document the behaviour change. And of course, if also the maintainers agree.

Copy link
Member

@bwplotka bwplotka Feb 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my side, I think @kakkoyun is right.

You should update CHANGELOG anyway because you are changing the behavior of the HTTP API.

Yes, if we are changing behavior... I think we should not.

The old code, just prints an error and continue with the execution and checks other files.

My opinion is that we should fail and proceed to another rule. No point in failing everything and stop as it does not crash (and should not!) ruler. Also we should be consistent to Prometheus behavior. (: Can we change the logic here to what was previously? @00arthur00

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bwplotka Thanks for your reply. I will revert the logic. I.e, keep log and no termination.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kakkoyun committed. Changes include logic revert and changelog modification. Please help to review. Thanks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bwplotka I think we should error and exit out if rule file includes faulty rules when we are starting component for the first time. And then for any subsequent update, it should just log the error and continue running without any crash. What do you think?

}
}
for {
select {
case <-cancel:
return errors.New("canceled")
case <-reload:
case <-reloadSignal:
}

level.Debug(logger).Log("msg", "configured rule files", "files", strings.Join(ruleFiles, ","))
var files []string
for _, pat := range ruleFiles {
fs, err := filepath.Glob(pat)
if err != nil {
// The only error can be a bad pattern.
level.Error(logger).Log("msg", "retrieving rule files failed. Ignoring file.", "pattern", pat, "err", err)
continue
if err := reloadRules(logger, ruleFiles, ruleMgr, evalInterval, configSuccess, configSuccessTime, rulesLoaded); err != nil {
level.Error(logger).Log("msg", "reload rules by sighup failed", "err", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not only sighup TBH, can be HTTP /-/reload right? So let's fix message here (:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bwplotka Yes. These code hanlde sighup only.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to treat signal and HTTP reload exactly the same way (as it was before). Why not? We could reuse reloadSignal channel.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to treat signal and HTTP reload exactly the same way (as it was before). Why not? We could reuse reloadSignal channel.

@bwplotka If so, we need another channel to receive the error message for webhandler. So a new struct should wrap reloadSignal and errMsg. Maybe it is redundant for the current implementation.
BTW, If you persist, I will follow your comment, and any suggestion on the "a new struct" solution?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a new struct to handler webhandler. Since we have a reloadSignal as an input paramter, we need select reloadSignal always.
Please help review. Thanks.
@bwplotka

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

}

files = append(files, fs...)
}

level.Info(logger).Log("msg", "reload rule files", "numFiles", len(files))

if err := ruleMgr.Update(evalInterval, files); err != nil {
configSuccess.Set(0)
level.Error(logger).Log("msg", "reloading rules failed", "err", err)
continue
}

configSuccess.Set(1)
configSuccessTime.SetToCurrentTime()

rulesLoaded.Reset()
for _, group := range ruleMgr.RuleGroups() {
rulesLoaded.WithLabelValues(group.PartialResponseStrategy.String(), group.File(), group.Name()).Set(float64(len(group.Rules())))
case <-cancel:
return errors.New("canceled")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it's not yours, but can we move to context.WithCancel(..) as we do everywhere else?

}

}
}, func(error) {
close(cancel)
Expand Down Expand Up @@ -564,7 +541,10 @@ func runRule(
}

router.WithPrefix(webRoutePrefix).Post("/-/reload", func(w http.ResponseWriter, r *http.Request) {
reload <- struct{}{}
if err := reloadRules(logger, ruleFiles, ruleMgr, evalInterval, configSuccess, configSuccessTime, rulesLoaded); err != nil {
level.Error(logger).Log("msg", "reload rules by webhandler failed", "err", err)
http.Error(w, err.Error(), http.StatusInternalServerError)
}
})

flagsMap := map[string]string{
Expand Down Expand Up @@ -758,3 +738,47 @@ func addDiscoveryGroups(g *run.Group, c *http_util.Client, interval time.Duratio
cancel()
})
}

type errInvalidPattern struct {
err error
pat string
}

func (e *errInvalidPattern) Error() string {
return errors.Wrapf(e.err, "retrieving rule files failed. Ignoring file. pattern %s", e.pat).Error()
}
00arthur00 marked this conversation as resolved.
Show resolved Hide resolved
func reloadRules(logger log.Logger,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function has too many parameters, it makes it harder to read. And most of the parameters are metrics, consider using a struct to collect metrics as it had done in

type DownsampleMetrics struct {
downsamples *prometheus.CounterVec
downsampleFailures *prometheus.CounterVec
}

and pass it around.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kakkoyun Committed. Please help to review. Thanks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good now 👍

ruleFiles []string,
ruleMgr *thanosrule.Manager,
evalInterval time.Duration,
configSuccess prometheus.Gauge,
configSuccessTime prometheus.Gauge,
rulesLoaded *prometheus.GaugeVec) error {
level.Debug(logger).Log("msg", "configured rule files", "files", strings.Join(ruleFiles, ","))
var files []string
for _, pat := range ruleFiles {
fs, err := filepath.Glob(pat)
if err != nil {
//check errInvalidPattern when initialize.
return &errInvalidPattern{err, pat}
}

files = append(files, fs...)
}

level.Info(logger).Log("msg", "reload rule files", "numFiles", len(files))

if err := ruleMgr.Update(evalInterval, files); err != nil {
configSuccess.Set(0)
return errors.Wrap(err, "reloading rules failed")
}

configSuccess.Set(1)
configSuccessTime.Set(float64(time.Now().UnixNano()) / 1e9)

rulesLoaded.Reset()
for _, group := range ruleMgr.RuleGroups() {
rulesLoaded.WithLabelValues(group.PartialResponseStrategy.String(), group.File(), group.Name()).Set(float64(len(group.Rules())))
}
return nil
}