Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ruler: per-tenant WAL #4344

Merged
merged 41 commits into from
Sep 22, 2021
Merged

Conversation

dannykopping
Copy link
Contributor

@dannykopping dannykopping commented Sep 17, 2021

What this PR does / why we need it:
Adds durability and isolation to the samples generated by Loki recording rules.

Each tenant will have their own Write-Ahead Log (WAL). Samples appended to the WAL will be sent to the configured remote-write endpoint separately for each tenant to prevent noisy neighbours.

Almost all remote-write configurations are configurable on a per-tenant basis, with each tenant defaulting to the global remote-write config. Remote-write can be reconfigured at runtime, too.

Which issue(s) this PR fixes:
Fixes #4241

Special notes for your reviewer:
We reuse a lot of Prometheus' WAL and remote-write functionality, as well as some code from the Grafana Agent. The Agent's WAL handling does away with the TSDB head appender; by skipping the head appender, we only need to keep a WAL per tenant, and not an additional tsdb.

The code from the Agent is heavily modified for our purposes.
Most of the changes I made in pkg/ruler/storage (copied from grafana/agent in #4354) was around making metrics injectable, removing features we didn't need and minor refactoring.

Checklist

  • Documentation added
  • Tests updated

@dannykopping dannykopping force-pushed the dannykopping/ruler-wal branch 3 times, most recently from e161d53 to 104b810 Compare September 19, 2021 09:12
@@ -154,7 +173,6 @@ func (l *Limits) RegisterFlags(f *flag.FlagSet) {

f.IntVar(&l.RulerMaxRulesPerRuleGroup, "ruler.max-rules-per-rule-group", 0, "Maximum number of rules per rule group per-tenant. 0 to disable.")
f.IntVar(&l.RulerMaxRuleGroupsPerTenant, "ruler.max-rule-groups-per-tenant", 0, "Maximum number of rule groups per-tenant. 0 to disable.")
f.IntVar(&l.RulerRemoteWriteQueueCapacity, "ruler.remote-write.queue-capacity", 10000, "Capacity of remote-write queues; if a queue exceeds its capacity it will evict oldest samples.")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to self: add notice in changelog about this flag being removed

@dannykopping dannykopping marked this pull request as ready for review September 20, 2021 06:45
@dannykopping dannykopping requested a review from a team as a code owner September 20, 2021 06:45
@dannykopping dannykopping changed the title Ruler: per-tenant WAL (WIP) Ruler: per-tenant WAL Sep 20, 2021
Copy link
Contributor

@cyriltovena cyriltovena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Danny Kopping added 18 commits September 20, 2021 15:12
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
Setting WAL dir name as tenant ID

Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
Adding appender_ready metric

Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
…rule evaluation until WAL is ready

Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
Danny Kopping added 17 commits September 20, 2021 15:29
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
…p was removed in Prometheus

Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
No reason to keep a per-tenant WAL without remote-write

Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
We need to prevent a package import cycle

Also add Clone functions for deep copy

Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
Clean up WAL dir after test

Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
…t be manipulated

Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
@dannykopping dannykopping marked this pull request as draft September 20, 2021 13:48
@dannykopping dannykopping marked this pull request as ready for review September 20, 2021 14:45
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
It will only clean abandoned WALs on startup, for now

Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
Copy link
Member

@owen-d owen-d left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few nits, but LGTM


return ruler.ManagerFactory(func(
var registry storageRegistry
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love having this globally defined, perhaps we can alter it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I completely agree, but I don't see an easy way to do so.
I'll try address this in a follow-up PR.


mgr := rules.NewManager(&rules.ManagerOptions{
Appendable: newAppendable(cfg, overrides, logger, userID, rwMetrics),
Appendable: registry,
Queryable: memStore,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're still using the memStore here as queryable, which is fine, but we are now able to read from Prometheus to determine this with less hacks and remove the memstore implementation altogether. This can be in a later PR though. I opened #4362

@dannykopping dannykopping merged commit 8a546b3 into grafana:main Sep 22, 2021
@dannykopping dannykopping deleted the dannykopping/ruler-wal branch September 22, 2021 09:52
dannykopping pushed a commit that referenced this pull request Sep 27, 2021
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement durable storage for recording rules samples
3 participants