Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/ruler (take 2) #2458

Merged
merged 42 commits into from
Aug 25, 2020
Merged

Feature/ruler (take 2) #2458

merged 42 commits into from
Aug 25, 2020

Conversation

owen-d
Copy link
Member

@owen-d owen-d commented Jul 31, 2020

Second take on the ruler after getting some upstream changes merged into Prometheus. This initial version hacks an in-memory series store which evaluates alerts upon request and temporarily caches the result locally instead of using a Prometheus metrics store for the ALERTS_FOR_STATE metric.

Followups include:

  1. docs
  2. support for Cortex as a backend metrics store
  3. jsonnet configs
  4. dashboards

@codecov-commenter
Copy link

codecov-commenter commented Jul 31, 2020

Codecov Report

Merging #2458 into master will decrease coverage by 0.40%.
The diff coverage is 43.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2458      +/-   ##
==========================================
- Coverage   63.43%   63.03%   -0.41%     
==========================================
  Files         166      168       +2     
  Lines       14584    14879     +295     
==========================================
+ Hits         9252     9379     +127     
- Misses       4595     4759     +164     
- Partials      737      741       +4     
Impacted Files Coverage Δ
pkg/logql/ast.go 89.77% <0.00%> (+1.56%) ⬆️
pkg/loki/loki.go 0.00% <0.00%> (ø)
pkg/loki/modules.go 4.16% <0.00%> (-0.73%) ⬇️
pkg/ruler/manager/compat.go 0.00% <0.00%> (ø)
pkg/ruler/manager/memstore.go 68.25% <68.25%> (ø)
pkg/querier/queryrange/downstreamer.go 95.87% <0.00%> (-2.07%) ⬇️

@@ -28,6 +28,11 @@ type QueryParams interface {
GetShards() []string
}

// implicit holds default implementations
type implicit struct{}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: these changes are unrelated, but I snuck them in 😈

// Calling Start will set the RuleIter, unblock the MemStore, and start the run() function in a separate goroutine.
func (m *MemStore) Start(iter RuleIter) {
m.mgr = iter
close(m.initiated)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm torn if a panic is ok if someone calls Start twice? or would it be better to track if running and just do nothing if called twice.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure. It's definitely a usage error if this is called twice and the caller ultimately comes from cortex which isn't expecting an error return type:

type ManagerFactory = func(
	ctx context.Context,
	userID string,
	notifier *notifier.Manager,
	logger log.Logger,
	reg prometheus.Registerer,
) *rules.Manager

// This could be more efficient (logarithmic instead of linear)
for ts, tsMap := range c.data {
if ts < ns {
delete(c.data, ts)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is dangerous? deleting from a slice while iterating over it? I think the pattern I usually see is to make a list of items to remove and then iterate that list to remove them?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be fine (and is included in effective go)

cache.Set(m.ts, forStateVec)

// Finally return the series if it exists
smpl, ok = cache.Get(m.ts, ls)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems redundant, didn't we just put this timestamp in the cache 2 lines ago?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We previously stored the entire vector resulting from evaluating the rule. When we request it from the cache we're looking for one possible element from that vector. We could iterate through the vector and label-match without using the cache, but it'd result in a lot of duplicate logic that's better contained in the cache IMO.

Copy link
Collaborator

@slim-bean slim-bean left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🚀

@owen-d owen-d merged commit aea6c3a into grafana:master Aug 25, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants