
Line deduplication filter #3110

Closed

Conversation

@dannykopping (Contributor) commented Jan 3, 2021

What this PR does / why we need it:

Introduces a dedup filter to reduce a log stream by grouping on unique label dimensions.

Which issue(s) this PR fixes:

None. I implemented this quickly on a whim so have not discussed whether we actually want this; feel free to reject on this basis.

Special notes for your reviewer:

I wrote this feature to scratch an itch I had when using Loki as a datasource for annotations in Grafana; LogQL only offers line/label filtering for reducing the returned lines in a log stream, which doesn't quite fit my use-case.

In my use-case, I have multiple log lines that are "duplicated", with only a unique cluster label in each; I didn't want to force the user of my dashboard to have to choose a cluster to filter on since this data is generally consistent across clusters (and choosing an arbitrary cluster to filter on to achieve the same result doesn't express my intent clearly); figured this feature might be useful to other folks too.
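
To make the intent concrete, here's a rough standalone sketch of the intended behaviour (not the code in this PR; the names are made up for illustration): entries that differ only in an excluded label such as cluster collapse into a single returned line.

// Hypothetical, self-contained sketch of deduplicating a stream by a key
// built from the log line plus its labels, ignoring the labels listed in
// "without" (e.g. cluster). Not the PR's actual types.
package main

import (
	"fmt"
	"sort"
	"strings"
)

type entry struct {
	line   string
	labels map[string]string
}

// dedupKey builds the grouping key from the line and every label that is
// not excluded, so entries differing only in excluded labels collapse.
func dedupKey(e entry, without map[string]struct{}) string {
	keys := make([]string, 0, len(e.labels))
	for k := range e.labels {
		if _, skip := without[k]; !skip {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)
	var b strings.Builder
	b.WriteString(e.line)
	for _, k := range keys {
		b.WriteString("\xff" + k + "=" + e.labels[k])
	}
	return b.String()
}

func main() {
	without := map[string]struct{}{"cluster": {}}
	seen := map[string]struct{}{}
	stream := []entry{
		{"error: timeout", map[string]string{"app": "api", "cluster": "eu1"}},
		{"error: timeout", map[string]string{"app": "api", "cluster": "us1"}},
	}
	for _, e := range stream {
		k := dedupKey(e, without)
		if _, dup := seen[k]; dup {
			continue // duplicate, drop it
		}
		seen[k] = struct{}{}
		fmt.Println(e.line, e.labels) // only the first of the two is printed
	}
}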

I tried to mirror the nomenclature used in the Grafana Explore view, although this may cause confusion:
[screenshot]
Open to suggestions on this.

If this PR is approved, I'm happy to implement the deduplication by exact/number/signature logic in LogQL.

Checklist

  • Documentation added
  • Tests updated

@CLAassistant commented Jan 3, 2021

CLA assistant check
All committers have signed the CLA.

@dannykopping changed the title from "Dannykopping/line dedup filter" to "Line deduplication filter" on Jan 3, 2021
@dannykopping marked this pull request as ready for review January 4, 2021 08:29
@cyriltovena self-assigned this Jan 5, 2021
@owen-d self-assigned this Jan 5, 2021
@owen-d (Member) left a comment

I've left a few comments/suggestions here and there but overall it's well done. Great job figuring out how the sausage is made and extending this.

Todo (Owen/Cyril):

  1. How will this affect sharding? Presumably dedup would need to be applied twice (at the edge and at the aggregation layer).
  2. The new node types need to be added to the new Walk functions, I think.


func NewLineDedupFilter(labelFilters []string, inverted bool) *LineDedupFilter {
	// create a map of labelFilters for O(1) lookups instead of O(n)
	var filterMap = make(map[string]interface{})
Member

nit: can use map[string]struct{} here.

@dannykopping (Contributor Author) Jan 5, 2021

Ah yes, I missed this in my refactor - thanks 👍

pkg/logql/log/dedup_filter.go (outdated review comment, resolved)
pkg/logql/log/dedup_filter_test.go (outdated review comment, resolved)
pkg/logql/log/dedup_filter_test.go (outdated review comment, resolved)
pkg/logql/parser_test.go (outdated review comment, resolved)
pkg/logql/parser_test.go (outdated review comment, resolved)
Danny Kopping added 8 commits January 5, 2021 22:05 (each signed off by Danny Kopping <danny.kopping@grafana.com>)
@dannykopping (Contributor Author)

@owen-d thought I'd also just drop in the benchmarking results to see if this is any cause for concern:

goos: linux
goarch: amd64
pkg: github.com/grafana/loki/pkg/logql/log
Benchmark_Deduplication
Benchmark_Deduplication     	10071492	       102 ns/op	      32 B/op	       1 allocs/op
Benchmark_Deduplication-2   	13297992	        93.6 ns/op	      32 B/op	       1 allocs/op
Benchmark_Deduplication-4   	11089706	       101 ns/op	      32 B/op	       1 allocs/op
Benchmark_Deduplication-8   	11677495	        97.4 ns/op	      32 B/op	       1 allocs/op
PASS


func NewLineDedupFilter(labelFilters []string, inverted bool) *LineDedupFilter {
	// create a map of labelFilters for O(1) lookups instead of O(n)
	var filterMap = make(map[string]struct{})
Contributor

Suggested change:
- var filterMap = make(map[string]struct{})
+ var filterMap = make(map[string]struct{}, len(labelFilters))
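
For context, a small self-contained illustration of why the suggestion helps (hypothetical code, not from this PR): struct{} values occupy no space, and the len(labelFilters) capacity hint lets the map allocate once instead of growing while the set is built.

// Hypothetical illustration of the suggested idiom: a set of label names
// using zero-byte struct{} values, pre-sized to avoid rehashing as it fills.
package main

import "fmt"

func buildFilterSet(labelFilters []string) map[string]struct{} {
	set := make(map[string]struct{}, len(labelFilters)) // capacity hint
	for _, name := range labelFilters {
		set[name] = struct{}{}
	}
	return set
}

func main() {
	set := buildFilterSet([]string{"cluster", "namespace"})
	_, ok := set["cluster"] // O(1) membership check
	fmt.Println(ok)         // true
}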

@cyriltovena (Contributor)

I think this is great; I wanted to implement something similar.

I'm wondering though, wouldn't this be more useful if it somehow included the number of duplicates reduced for each line?

I think this would help the client emphasize a specific message, but then there's some UI work.

I'm also wondering how it will look in the Explore histogram; we could get this deployed in dev and give it a try.

In my use-case, I have multiple log lines that are "duplicated", with only a unique cluster label in each; I didn't want to force the user of my dashboard to have to choose a cluster to filter on since this data is generally consistent across clusters (and choosing an arbitrary cluster to filter on to achieve the same result doesn't express my intent clearly); figured this feature might be useful to other folks too.

Can you share some examples of that use-case? I'm curious to compare with my use-cases, filtering duplicate errors like the ones below using dedupe by (msg):

[screenshot]

@dannykopping (Contributor Author)

I'm wondering though, wouldn't this be more useful if it somehow included the number of duplicates reduced for each line?

I think this would help the client emphasize a specific message, but then there's some UI work.

I'm also wondering how it will look in the Explore histogram; we could get this deployed in dev and give it a try.

Funnily enough, I implemented that initially and then disregarded it.

How do you see this working though? I'm thinking of a meta-label like __dedup_count__ (similar in nature to __error__).

Can you share some examples of that use-case? I'm curious to compare with my use-cases, filtering duplicate errors like the ones below using dedupe by (msg):

That's almost exactly my use-case too; in mine, I want to reduce the stream of logs by using dedup without (cluster), as it's the only unique label in otherwise identical log entries (reducing cardinality would be the overarching goal, you could say).
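
As a rough sketch of what counting duplicates could look like on top of the same idea (hypothetical code, not part of this PR; how the count would be surfaced, e.g. as a __dedup_count__ meta-label, is exactly the open question here):

// Hypothetical sketch of tracking how many duplicates each emitted line
// absorbed; attaching that number to the already-returned line is the part
// that still needs a design.
package main

import "fmt"

func main() {
	counts := map[string]int{} // dedup key -> number of occurrences
	lines := []string{"error: timeout", "error: timeout", "error: timeout", "ready"}

	for _, line := range lines {
		counts[line]++
		if counts[line] > 1 {
			continue // duplicate: suppressed, but still counted
		}
		fmt.Println("emit:", line)
	}

	for line, n := range counts {
		// In the PR context this value would need to reach the line that was
		// already returned, which is where the difficulty arises.
		fmt.Printf("%q absorbed %d duplicate(s)\n", line, n-1)
	}
}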

@dannykopping (Contributor Author) commented Jan 6, 2021

I tried to quickly hack together a dedup counter, with that __dedup_count__ meta-label attached to the returned line, but I've run into an issue I don't understand yet.

dannykopping/loki@dannykopping/line_dedup_filter...dannykopping:dannykopping/line_dedup_filter_count

Even if I keep a reference to the LabelsBuilder instance associated with the line that is eventually returned, I cannot update it later on (I always end up with __dedup_count__=1).

Update:

I see now that p.builder.Reset() is being called in func (p *streamPipeline) Process(line []byte), and the labels are processed to a result once the line is returned, so that explains it:

return line, p.builder.LabelsResult(), true

What would you suggest here? I can't see a way right now to attach a meta-label to the 1 returned line using my existing mechanism of deduplication.
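
To show the problem in a stripped-down form (hypothetical types, not Loki's actual LabelsBuilder or pipeline): the shared builder is reset for every line, and the labels are materialized into an independent result when the line is returned, so mutating the builder later never reaches the labels that were already handed back.

// Hypothetical illustration of why updating the builder after the line has
// been returned has no effect on the returned labels.
package main

import "fmt"

type builder struct{ labels map[string]string }

func (b *builder) Reset()          { b.labels = map[string]string{} }
func (b *builder) Set(k, v string) { b.labels[k] = v }

// Result materializes the current labels into an independent copy, much like
// producing a labels result when the line is returned.
func (b *builder) Result() map[string]string {
	out := make(map[string]string, len(b.labels))
	for k, v := range b.labels {
		out[k] = v
	}
	return out
}

func main() {
	b := &builder{}

	// The line is returned: count is 1 and the labels are materialized.
	b.Reset()
	b.Set("__dedup_count__", "1")
	returned := b.Result()

	// A later duplicate arrives: the shared builder has been reset for it,
	// so bumping the count here cannot reach the labels already returned.
	b.Reset()
	b.Set("__dedup_count__", "2")

	fmt.Println(returned["__dedup_count__"]) // still "1"
}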

@cyriltovena (Contributor)

I tried to quickly hack together a dedup counter, with that __dedup_count__ meta-label attached to the returned line, but I've run into an issue I don't understand yet.

dannykopping/loki@dannykopping/line_dedup_filter...dannykopping:dannykopping/line_dedup_filter_count

Even if I keep a reference to the LabelsBuilder instance associated with the line that is eventually returned, I cannot update it later on (I always end up with __dedup_count__=1).

Update:

I see now that p.builder.Reset() is being called in func (p *streamPipeline) Process(line []byte), and the labels are processed to a result once the line is returned, so that explains it:

return line, p.builder.LabelsResult(), true

What would you suggest here? I can't see a way right now to attach a meta-label to the 1 returned line using my existing mechanism of deduplication.

Not sure yet, but that's not the only problem you'll face. Right now, those pipelines are run on ingesters too, so you will be deduping in multiple places; you'll need to merge results a bit differently, and this is trickier than you would expect.

There's also dedupe from replication that needs to happen first, otherwise replicated lines will count toward duplicates when they should not; currently it's not a big problem because we only add labels, but I want to solve this separately.

I'm fine with the meta-label solution, but we should check with everyone on the team; you should join our weekly or community meeting.

@cyriltovena (Contributor)

2. The new node types need to be added to the new Walk functions, I think.

children(expr SampleExpr) []SampleExpr is only for sample expressions for now.

@stale (bot) commented Feb 9, 2021

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

The stale bot added the stale label (a stale issue or PR that will automatically be closed) on Feb 9, 2021
The stale bot closed this on Feb 21, 2021
Labels: size/XXL, stale
Projects: none yet
4 participants