Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

*: support authentication and TLS for Alertmanager #1838

Merged
merged 10 commits into from
Dec 18, 2019

Conversation

simonpasquier
Copy link
Contributor

@simonpasquier simonpasquier commented Dec 4, 2019

Closes #606

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

This change adds support for authentication with basic auth, client
certificates and bearer tokens. It also enables to configure TLS
settings for the Alertmanager endpoints.

Most of the work leverages the existing Prometheus configuration format
and code. In particular TLS certificate files are automatically reloaded
whenever they change.

Verification

End-to-end tests added to cover various HTTP client configurations (TLS, authentication) and file SD integration.

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, very nice, it would be nice to add some e2e tests TBH, but it looks good from static review (:

cmd/thanos/rule.go Outdated Show resolved Hide resolved
docs/components/rule.md Show resolved Hide resolved
scripts/cfggen/main.go Show resolved Hide resolved
@simonpasquier simonpasquier force-pushed the tls-and-auth-for-alertmanager branch 2 times, most recently from c76a305 to 358fdcf Compare December 11, 2019 08:16
@simonpasquier simonpasquier marked this pull request as ready for review December 11, 2019 09:07
@simonpasquier
Copy link
Contributor Author

@bwplotka it's ready now.

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! It looks great, but I have some suggestions. I think we improved single alertmanagers.urls flag on the way which is nice. (: Thanks!

I think all of those are minor suggestions and this is generally good!

return err
}
var (
alertingcfg alert.AlertingConfig
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we stick to camelCase here, but not a big deal, looks readable (:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

"msg", "sending alerts failed",
"alertmanager", u.Host,
"numAlerts", len(alerts),
"err", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

) should be in next line I think in terms of formatting (:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -248,12 +238,15 @@ func (q *Queue) Push(alerts []*Alert) {
}
}

type AlertmanagerDoer interface {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everywhere else we refer as Client - should we rename here as well? (:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -308,7 +294,7 @@ func NewSender(
return s
}

// Send an alert batch to all given Alertmanager URLs.
// Send an alert batch to all given Alertmanager client.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Send an alert batch to all given Alertmanager client.
// Send an alert batch to all given Alertmanager clients.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


### Alertmanager

The configuration format supported by the `--alertmanagers.config` and `--alertmanagers.config-file` flags is the following:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we mention something like:

The configuration allows specifying multiple Alertmanagers. Those entries are treated as a single HA group. This means that alert send failure is claimed only if Ruler fails to send to all instances. 

I think we might be missing this as users could use it in a different way (sharding alerts, fanout etc)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

r := rule(a.New(), a.New(), rulesDir, amCfg, []address{qAddr}, nil)
q := querier(qAddr, a.New(), []address{r.GRPC}, nil)

ctx, cancel := context.WithTimeout(context.Background(), 1*time.Minute)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1*time.Minute This might be not enough for our sometimes slow CI, let's make it 3m

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still 1m?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right! it should be ok now.

}))

// Update the Alertmanager file service discovery configuration.
writeAlertmanagerFileSD(t, filepath.Join(amDir, "targets.yaml"), am.HTTP.HostPort())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
writeAlertmanagerFileSD(t, filepath.Join(amDir, "targets.yaml"), am.HTTP.HostPort())
writeRulerAlertmanagerFileSD(t, filepath.Join(amDir, "targets.yaml"), am.HTTP.HostPort())

return nil
}))

// Update the Alertmanager file service discovery configuration.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds like we are updating some Alertmanager file SD not Ruler\s file SD for alertmanager (: Can we clarify a bit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed writeAlertmanagerFileSD which wasn't really needed since it was only called once. Hopefully it's clearer now.

<-exit
}()

// Wait for a couple of evaluations.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we comment on what we wait?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

func TestRuleAlertmanagerFileSD(t *testing.T) {
a := newLocalAddresser()

am := alertManager(a.New())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about this and using alertmanager Mock? I like e2e compatibility check against Alertmanager. I guess it would be too hard to use proper alertmanager in TestRuleAlertmanagerHTTPClient as well? (:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with a "fake" Alertmanager for TestRuleAlertmanagerHTTPClient because Alertmanager doesn't support TLS and authentication natively so we would have to deploy something else in front of it. Since the other tests still exercise the "real" Alertmanager API, I felt that it was worth the trade off.

Side-note: with the Alertmanager v2 API and its Open API specification, it's even less needed to run a "real" Alertmanager server as you can generate the server code and probably hook into it from the e2e tests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense totally, worth to comment maybe? (:

Side-note: with the Alertmanager v2 API and its Open API specification, it's even less needed to run a "real" Alertmanager server as you can generate the server code and probably hook into it from the e2e tests.

From API perspective yes (methods, required parameters etc), but it's always nice to have e2e tests against actual implementation. This useful to check against hidden invariants etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. It would be worth revisiting this part when we add support for Alertmanager API v2. I tried quickly to hack something with httputil.ReverseProxy but I failed short.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment added.

@simonpasquier simonpasquier force-pushed the tls-and-auth-for-alertmanager branch 2 times, most recently from 8150c9f to 69083ff Compare December 16, 2019 16:42
Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! Looks like provider and some docs and test timeout are the only things to address (:

LGTM otherwise.

// TODO(simonpasquier): add support for API version (v1 or v2).
type AlertmanagerConfig struct {
// HTTP client configuration.
HTTPClientConfig HTTPClientConfig `yaml:"http_config"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair, it's just bit more work, but happy with this.

pkg/alert/client_test.go Show resolved Hide resolved
pkg/alert/client_test.go Show resolved Hide resolved
func TestRuleAlertmanagerFileSD(t *testing.T) {
a := newLocalAddresser()

am := alertManager(a.New())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense totally, worth to comment maybe? (:

Side-note: with the Alertmanager v2 API and its Open API specification, it's even less needed to run a "real" Alertmanager server as you can generate the server code and probably hook into it from the e2e tests.

From API perspective yes (methods, required parameters etc), but it's always nice to have e2e tests against actual implementation. This useful to check against hidden invariants etc.

r := rule(a.New(), a.New(), rulesDir, amCfg, []address{qAddr}, nil)
q := querier(qAddr, a.New(), []address{r.GRPC}, nil)

ctx, cancel := context.WithTimeout(context.Background(), 1*time.Minute)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still 1m?

cmd/thanos/rule.go Show resolved Hide resolved
This change adds support for authentication with basic auth, client
certificates and bearer tokens. It also enables to configure TLS
settings for the Alertmanager endpoints.

Most of the work leverages the existing Prometheus configuration format
and code. In particular TLS certificate files are automatically reloaded
whenever they change.

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
…re both defined

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
@simonpasquier simonpasquier force-pushed the tls-and-auth-for-alertmanager branch from 69083ff to f467acf Compare December 18, 2019 07:41
@bwplotka
Copy link
Member

Rdy for review? (:

@simonpasquier
Copy link
Contributor Author

yep

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚄 Let's go! LGTM, thanks for this good work @simonpasquier

@bwplotka bwplotka merged commit 56abeab into thanos-io:master Dec 18, 2019
@simonpasquier
Copy link
Contributor Author

Thanks a lot for the speedy reviews!

var userAgent = fmt.Sprintf("Thanos/%s", version.Version)

type AlertingConfig struct {
Alertmanagers []AlertmanagerConfig `yaml:"alertmanagers"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just thinking.. It might be nice to also use the alert_relabel_configs instead of the --alert.label-drop ?
That would provide higher variability to user.
Also adding external labels to the config instead of the --label flag?
Not sure how fare we want to take this configuration @bwplotka WDYT?
(Sorry for late comments, I had no time lately)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature] Rule: Add authentication for alertmanagers
3 participants