Add an experimental flag to block samples with timestamp too far in the future #6195

Merged: 12 commits, Apr 10, 2023

Contributor

@jnyi jnyi commented Mar 8, 2023

Hi Team,

TL;DR: when we run Thanos Receive, a single sample with a bad timestamp too far in the future (for example, 10 years ahead) can pollute the TSDB head and block other valid samples, even with tsdb.out-of-order.time-window enabled. The Prometheus implementation doesn't have a good way to deal with this issue either: https://github.com/prometheus/prometheus/blob/main/tsdb/head_append.go#L401

The new tsdb.too-far-in-future.time-window flag prevents samples with timestamp > current time + configured window from being appended to the TSDB.

This PR addresses: #6167.

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

Added an experimental flag to thanos receive called --tsdb.too-far-in-future.time-window:

  1. By default it is disabled with "0s"
  2. If a non-zero value is provided, samples with timestamp t are rejected when t > now + window
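
For illustration, the rule above can be sketched in a few lines of Go. This is a minimal sketch and not the code from this PR: the function name `tooFarInFuture` and its signature are assumptions made for the example, and timestamps are taken as Unix milliseconds, as in remote-write samples.

```go
package main

import (
	"fmt"
	"time"
)

// tooFarInFuture reports whether a sample timestamp (milliseconds since the
// Unix epoch) lies beyond now + window. A zero window disables the check,
// mirroring the flag's "0s" default. Hypothetical helper, for illustration only.
func tooFarInFuture(nowMs int64, window time.Duration, sampleMs int64) bool {
	if window == 0 {
		return false
	}
	return sampleMs > nowMs+window.Milliseconds()
}

func main() {
	now := time.Now().UnixMilli()
	window := time.Hour

	inTenYears := now + (10 * 365 * 24 * time.Hour).Milliseconds()
	fmt.Println(tooFarInFuture(now, window, inTenYears)) // sample roughly 10 years ahead
	fmt.Println(tooFarInFuture(now, window, now+30_000)) // sample 30 seconds ahead
	fmt.Println(tooFarInFuture(now, 0, inTenYears))      // check disabled
}
```

A sample exactly at now + window is still accepted here, since the description uses a strict `>` comparison.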

Verification

Added two unit tests to verify it is working as expected

--- PASS: TestWriter (10.25s)
    --- PASS: TestWriter/should_error_out_when_sample_timestamp_is_too_far_in_the_future (1.02s)
    --- PASS: TestWriter/should_error_out_on_valid_series_with_out_of_order_exemplars (1.03s)
    --- PASS: TestWriter/should_error_out_when_exemplar_label_length_exceeds_the_limit (1.02s)
    --- PASS: TestWriter/should_error_out_and_skip_series_with_out-of-order_labels;_accept_series_with_valid_labels (1.02s)
    --- PASS: TestWriter/should_succeed_when_sample_timestamp_is_NOT_too_far_in_the_future (1.02s)
    --- PASS: TestWriter/should_error_out_and_skip_series_with_out-of-order_labels (1.02s)
    --- PASS: TestWriter/should_error_out_and_skip_series_with_duplicate_labels (1.02s)
    --- PASS: TestWriter/should_succeed_on_valid_series_with_exemplars (1.02s)
    --- PASS: TestWriter/should_error_out_on_series_with_no_labels (1.02s)
    --- PASS: TestWriter/should_succeed_on_series_with_valid_labels (1.03s)
PASS
ok      github.com/thanos-io/thanos/pkg/receive 10.899s

…he future

Signed-off-by: Yi Jin <yi.jin@databricks.com>
lset = labelpb.ZLabelsToPromLabels(t.Labels)
}

// Append as many valid samples as possible, but keep track of the errors.
for _, s := range t.Samples {
ref, err = app.Append(ref, lset, s.Timestamp, s.Value)
if tooFarInFuture != 0 && tooFarInFuture.Before(model.Time(s.Timestamp)) {

Should check r.opts.TooFarInFutureTimeWindow != 0?

+1

Contributor Author

Good catch

@@ -866,6 +870,11 @@ func (rc *receiveConfig) registerFlag(cmd extkingpin.FlagClause) {

rc.tsdbMaxBlockDuration = extkingpin.ModelDuration(cmd.Flag("tsdb.max-block-duration", "Max duration for local TSDB blocks").Default("2h").Hidden())

rc.tsdbTooFarInFutureTimeWindow = extkingpin.ModelDuration(cmd.Flag("tsdb.too-far-in-future.time-window",

Maybe print an error and exit if the flag < 0.

Contributor Author

@jnyi jnyi Mar 9, 2023

The value of the flag should be a duration string like "0s", "5m", "1h", etc
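
To make the "error if the flag < 0" suggestion concrete, here is a minimal sketch of such a validation. It is an assumption-laden illustration: `parseWindow` is a hypothetical helper, and it uses the standard library's `time.ParseDuration` as a stand-in for the duration type the flag actually uses, which may accept a different set of units.

```go
package main

import (
	"fmt"
	"time"
)

// parseWindow is a hypothetical helper: it parses a duration string and
// rejects negative values. The standard library parser stands in for the
// duration type the real flag uses.
func parseWindow(s string) (time.Duration, error) {
	d, err := time.ParseDuration(s)
	if err != nil {
		return 0, fmt.Errorf("invalid duration %q: %w", s, err)
	}
	if d < 0 {
		return 0, fmt.Errorf("window must not be negative, got %s", d)
	}
	return d, nil
}

func main() {
	for _, in := range []string{"0s", "5m", "1h", "-5m", "soon"} {
		d, err := parseWindow(in)
		fmt.Printf("%q -> %v, err=%v\n", in, d, err)
	}
}
```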

ref, err = app.Append(ref, lset, s.Timestamp, s.Value)
if tooFarInFuture != 0 && tooFarInFuture.Before(model.Time(s.Timestamp)) {
// now + tooFarInFutureTimeWindow < sample timestamp
err = storage.ErrOutOfBounds

Add a debug log for how far out of bounds it is?

Contributor

Could we even make it Warning level? That is something that should fail a bit louder, I think, so errors can be investigated better.
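
A rough sketch of what such a log line could carry: how far the sample lies beyond the allowed limit. The helper name `exceededBy` and the message text are made up for this example; a real implementation would go through the component's own logger rather than `fmt`.

```go
package main

import (
	"fmt"
	"time"
)

// exceededBy returns how far a sample timestamp (Unix milliseconds) lies
// beyond the allowed limit now + window, or 0 when the sample is within bounds.
// Hypothetical helper, for illustration only.
func exceededBy(nowMs int64, window time.Duration, sampleMs int64) time.Duration {
	limitMs := nowMs + window.Milliseconds()
	if sampleMs <= limitMs {
		return 0
	}
	return time.Duration(sampleMs-limitMs) * time.Millisecond
}

func main() {
	now := time.Now().UnixMilli()
	sample := now + (2 * time.Hour).Milliseconds()
	if over := exceededBy(now, time.Hour, sample); over > 0 {
		// Stand-in for a warn-level log call; the message format is illustrative.
		fmt.Printf("level=warn msg=\"sample timestamp too far in the future\" exceeded_by=%s\n", over)
	}
}
```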

lset = labelpb.ZLabelsToPromLabels(t.Labels)
}

// Append as many valid samples as possible, but keep track of the errors.
for _, s := range t.Samples {
ref, err = app.Append(ref, lset, s.Timestamp, s.Value)
if tooFarInFuture != 0 && tooFarInFuture.Before(model.Time(s.Timestamp)) {

+1

rc.tsdbTooFarInFutureTimeWindow = extkingpin.ModelDuration(cmd.Flag("tsdb.too-far-in-future.time-window",
"[EXPERIMENTAL] Configures the allowed time window for ingesting samples too far in the future. Disabled (0s) by default"+
"Please note enable this flag will reject samples in the future of receive local NTP time + configured duration.",
).Default("0s").Hidden())

Is there a limit on the range of values we should allow?

Contributor Author

ditto.

Contributor

What's the reason for making this flag hidden?

Contributor Author

No particular reason; I thought it was a convention for experimental flags. Removed.

Signed-off-by: Yi Jin <yi.jin@databricks.com>
@jnyi jnyi requested review from hczhu and davidyuanfs and removed request for hczhu and davidyuanfs March 9, 2023 21:33
Member

@GiedriusS GiedriusS left a comment

I am a bit wary about adding yet another flag. Perhaps there could be some default value like 1h or 30m? What sense does it make to accept data in the future by default? 🤔 I wonder what @fpetkovski thinks about this PR.

@fpetkovski
Contributor

Adding a default of 1h sounds good to me. We can think about a flag or a more dynamic approach if the default proves to be not good enough 👍

@jnyi
Contributor Author

jnyi commented Mar 14, 2023

Sounds fair, let me make it default to 1h for now instead of a flag.

@matej-g
Collaborator

matej-g commented Mar 15, 2023

I had the same concern as @fpetkovski and @GiedriusS; for such a specific feature we should be fine with one value. I'd even go as far as asking whether it makes sense to accept anything with future timestamps; ideally users should not have any clock skew.

@defreng

defreng commented Mar 15, 2023

I kind of agree with what you're saying... I also don't see the point of allowing any future samples except for some minor clock skew which should be less than a few seconds typically (?)

If going with a static value, 1h seems far too big to me, and I would rather propose something like 30s or even less?

@ahurtaud
Contributor

ahurtaud commented Mar 16, 2023

We are currently looking at something to push points to the future. In order to build "seasonal" traffic, we are exploring a way to remote_write "current" throughput to a +1 week timestamp. This way, we should be able to display / alert on the "1 week ago" traffic using the "live" timestamp and not offset 1w. We expect this to be way more efficient, as we have performance issues with offset 1w. I am not sure it will be better with this yet. Will let you know...

EDIT: this would also allow us to display expected traffic like what we have in our in-house monitoring solution right now:
[Image: Screenshot 2023-03-16 at 11 45 44]

But we would like a flag to allow pushing to the future :/

@matej-g
Collaborator

matej-g commented Mar 17, 2023

@ahurtaud interesting! I was partly wondering whether there are special use cases where people want to push into the future, and it seems like you found one. I think it's reasonable then to add a simple flag to enable / disable ingestion of future timestamps.

So how does it sound if:

  1. By default we don't allow timestamps that are > 30 seconds in the future
  2. This protection can be turned off with a flag

@ahurtaud
Contributor

perfect @matej-g ! thank you for considering it!

@jnyi
Contributor Author

jnyi commented Mar 17, 2023

That's good to know too @ahurtaud, I was about to revert the flag :( So in this case, @matej-g, I would argue this flag should be turned off by default for backward compatibility; people can turn it on intentionally going forward by adding the flag --tsdb.too-far-in-future.time-window=<your preferred threshold>. Our team has been running with the flag internally for a week or so and it has worked out well.

@jnyi jnyi requested review from GiedriusS and removed request for hczhu March 17, 2023 22:56
@matej-g
Collaborator

matej-g commented Mar 22, 2023

@jnyi I don't have a strong opinion on the default, for me it would make sense to have the protection on by default, as the use case is so narrow and special, I believe 99% of users would be unaffected even if they did not know about this change.

@jnyi
Contributor Author

jnyi commented Mar 22, 2023

> @jnyi I don't have a strong opinion on the default, for me it would make sense to have the protection on by default, as the use case is so narrow and special, I believe 99% of users would be unaffected even if they did not know about this change.

Agreed. I also think most people haven't run into this issue so far, so the default doesn't matter to them either. Disabling this by default keeps backward compatibility, so people like @ahurtaud won't be affected when they upgrade Thanos, and people like us will be able to fix it as new adopters :)

Contributor

@fpetkovski fpetkovski left a comment

Thanks for the PR, this is looking quite good. I have mostly minor comments.

Signed-off-by: Yi Jin <yi.jin@databricks.com>
@pull-request-size pull-request-size bot added size/L and removed size/M labels Apr 7, 2023
Contributor

@fpetkovski fpetkovski left a comment

Thanks, just one last comment from my side.

@@ -28,6 +29,26 @@ type TenantStorage interface {
TenantAppendable(string) (Appendable, error)
}

// Wraps storage.Appender to add validation and logging.
type ReceiveAppender struct {
Contributor

Sorry if my comment wasn't clear, I meant that this appender should always reject samples that are too far in the future.

If the flag is set to a value other than zero, we would wrap the default appender with this one. Otherwise we would just use the storage appender as before.

Contributor Author

@jnyi jnyi Apr 7, 2023

I see, we can do that, but I feel this pattern is more extensible for wrapping prometheus' default appender implementation going forward:

app := multiTSDB.appender(tenantID)
if tooFarInFuture != 0 {
  app = ReceiveAppender{tooFarInFuture, app}
} else if <another condition> {
  app = AnotherAppenderWrapper{flag, app}
} else if <third condition> {
  app = ThirdAppenderWrapper...
}

vs

app := ReceiveAppender{
  tooFarInFuture: tooFarInFutureFlag,
  // add other flags
  app: multiTSDB.appender(tenantID),
}

// add behavior inside `ReceiveAppender`

Contributor

The appender wrapper would only be responsible for rejecting samples in the future and other appender wrappers will apply different limits. So it would rather be:

app := multiTSDB.appender(tenantID)
if tooFarInFuture != 0 {
  app = AppenderWithFutureLimit{tooFarInFuture, app}
}
if <another condition> {
  app = AnotherAppender{flag, app}
}
if <third condition> {
  app = ThirdAppender{flag, app}
}

In any case, this is definitely not a blocker but more of a suggestion. We would have one appender for now anyway, and we can break it up into multiple if the logic gets more complicated in the future.
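
The conditional-wrapping pattern discussed in this thread can be sketched as a small runnable Go program. Everything here is an assumption made for illustration: the `Appender` interface is a deliberately tiny stand-in for the real storage appender, and `futureLimitAppender`, `newAppender`, `memAppender`, `fixedClock` and `errOutOfBounds` are hypothetical names, not the types merged in this PR.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Appender is a tiny, hypothetical stand-in for the storage appender
// interface; the real one has more methods and a different signature.
type Appender interface {
	Append(tsMs int64, value float64) error
}

// errOutOfBounds is a local placeholder for the out-of-bounds storage error.
var errOutOfBounds = errors.New("out of bounds")

// memAppender records appended timestamps; it plays the role of the TSDB appender.
type memAppender struct{ seen []int64 }

func (m *memAppender) Append(tsMs int64, _ float64) error {
	m.seen = append(m.seen, tsMs)
	return nil
}

// futureLimitAppender rejects samples later than now() + window and passes
// everything else through to the wrapped appender unchanged.
type futureLimitAppender struct {
	next   Appender
	window time.Duration
	now    func() time.Time // injectable clock, handy for tests
}

func (a futureLimitAppender) Append(tsMs int64, v float64) error {
	if tsMs > a.now().Add(a.window).UnixMilli() {
		return errOutOfBounds
	}
	return a.next.Append(tsMs, v)
}

// newAppender wraps base only when the limit is enabled (window != 0);
// otherwise the base appender is used as before.
func newAppender(base Appender, window time.Duration, now func() time.Time) Appender {
	if window == 0 {
		return base
	}
	return futureLimitAppender{next: base, window: window, now: now}
}

// fixedClock returns a clock frozen at the given Unix-millisecond instant.
func fixedClock(ms int64) func() time.Time {
	return func() time.Time { return time.UnixMilli(ms) }
}

func main() {
	base := &memAppender{}
	app := newAppender(base, time.Hour, time.Now)
	now := time.Now()
	fmt.Println(app.Append(now.UnixMilli(), 1))                   // within the window
	fmt.Println(app.Append(now.Add(48*time.Hour).UnixMilli(), 1)) // two days ahead
	fmt.Println(len(base.seen))                                   // samples that reached the base appender
}
```

Because each wrapper only knows about its own limit, further limits could be layered on in the same way without touching this one.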

jnyi added 2 commits April 7, 2023 12:02
Signed-off-by: Yi Jin <yi.jin@databricks.com>
Signed-off-by: Yi Jin <yi.jin@databricks.com>
Signed-off-by: Yi Jin <yi.jin@databricks.com>
@jnyi jnyi requested a review from fpetkovski April 7, 2023 20:00
fpetkovski previously approved these changes Apr 8, 2023
Contributor

@fpetkovski fpetkovski left a comment

Thanks, this looks good to me 👍

@JayChanggithub

@jnyi @fpetkovski
I'm not sure, so I'd like to clarify: I assume this PR is also meant to cover my issue (#6158). If so, which parameters exactly should I add to the receiver to solve it? We found that query data from our
receiver instances is not stable, and the Prometheus logs show msg="non-recoverable error" count=500 exemplarCount=0 err="server returned HTTP status 409 Conflict: 5 errors: forwarding request to endpoint thanos-receive-0.thanos-receive.thanos.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-0.thanos-receive.thanos.svc.cluster.local:10901: add 96 samples: too old sample; forwarding request to endpoint thanos-receive-1.thanos-receive.thanos.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-1.thanos-receive.thanos.svc.cluster.local:10901: add 103 samples: too old sample;
Thanks.
[Image: Xnip2023-04-08_14-15-28]

Signed-off-by: Yi Jin <yi.jin@databricks.com>
Signed-off-by: Yi Jin <yi.jin@databricks.com>
@jnyi jnyi requested a review from fpetkovski April 8, 2023 20:52
@jnyi
Contributor Author

jnyi commented Apr 8, 2023

> Thanks, this looks good to me 👍

Thanks @fpetkovski, the e2e tests seem flaky. I tried to trigger a rerun, which dismissed your approval; would you take another look and stamp it again?

Contributor

@yeya24 yeya24 left a comment

Thanks!

@yeya24 yeya24 merged commit 5d5d39a into thanos-io:main Apr 10, 2023
@jnyi
Contributor Author

jnyi commented Apr 12, 2023

Hi @JayChanggithub,

The issue you mentioned seems to be a different one, caused by samples that are too old. You could consider this flag instead: --tsdb.out-of-order.time-window=1h

@JayChanggithub

Hi @jnyi
Thanks for your reply. We have already adopted the flag --tsdb.out-of-order.time-window=1h. Unfortunately, we still see the error in the logs of each Prometheus (msg="non-recoverable error" count=500 exemplarCount=0 err="server returned HTTP status 409 Conflict). It impacts our receiver component's ingestion of data from each Prometheus, and the count of Prometheus instances is not stable. Is there anything else we have misconfigured?

$ k get po -n thanos -oyaml thanos-receive-4  | kubectl-neat 

apiVersion: v1
kind: Pod
metadata:
  annotations:
    cni.projectcalico.org/containerID: 1924efd6089e38f2bc07c62a16f9b29a4fbd972caf3d10ba50922ddb974ceac4
    cni.projectcalico.org/podIP: 100.96.0.234/32
    cni.projectcalico.org/podIPs: 100.96.0.234/32
    kubernetes.io/psp: gardener.privileged
  labels:
    controller-revision-hash: thanos-receive-6484b9b744
    kubernetes.io/name: thanos-receive
    statefulset.kubernetes.io/pod-name: thanos-receive-4
    thanos-store-api: "true"
  name: thanos-receive-4
  namespace: thanos
spec:
  containers:
  - args:
    - receive
    - --grpc-address=0.0.0.0:10901
    - --http-address=0.0.0.0:10902
    - --remote-write.address=0.0.0.0:19291
    - --receive.replication-factor=1
    - --receive.hashrings-file-refresh-interval=1m
    - --receive.hashrings-algorithm=ketama
    - --objstore.config-file=/etc/thanos/objectstorage.yaml
    - --tsdb.path=/var/thanos/receive
    - --tsdb.retention=12h
    - --tsdb.out-of-order.time-window=1h
    - --receive.tenant-header="THANOS-TENANT"
    - --receive-forward-timeout=120s
    - --label=receive_replica="$(NAME)"
    - --label=receive="true"
    - --tsdb.no-lockfile
    - --receive.hashrings-file=/etc/thanos/thanos-receive-hashrings.json
    - --receive.local-endpoint=$(NAME).thanos-receive.thanos.svc.cluster.local:10901
  ......
  ..... 

@ranryl

ranryl commented Apr 13, 2023

+1

@jnyi
Contributor Author

jnyi commented Apr 13, 2023

Hi @JayChanggithub @ranryl,

Your issue seems unrelated to what this PR is for, and I don't think this is a good place to continue the discussion. Instead, I would suggest you either open another GitHub issue or reach out in #thanos on the CNCF Slack for help.

rabenhorst added a commit to rabenhorst/thanos that referenced this pull request May 4, 2023
* mixins: Add code/grpc-code dimension to error widgets

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Update changelog

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Fix messed up merge conflict resolution

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Readd empty line at the end of changelog

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Rerun CI

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* mixin(Rule): Add rule evaluation failures to the Rule dashboard (thanos-io#6244)

* Improve Thanos Rule dashboard legends

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Add evaluations failed to Rule dashboard

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Refactor rule dashboard

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Add changelog entry

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Rerun CI

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

---------

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* added thanos logo in react app (thanos-io#6264)

Signed-off-by: hackeramitkumar <amit9116260192@gmail.com>

* Add an experimental flag to block samples with timestamp too far in the future (thanos-io#6195)

* Add an experimental flag to block samples with timestamp too far in the future

Signed-off-by: Yi Jin <yi.jin@databricks.com>

* fix bug

Signed-off-by: Yi Jin <yi.jin@databricks.com>

* address comments

Signed-off-by: Yi Jin <yi.jin@databricks.com>

* fix docs CI errors

Signed-off-by: Yi Jin <yi.jin@databricks.com>

* resolve merge conflicts

Signed-off-by: Yi Jin <yi.jin@databricks.com>

* resolve merge conflicts

Signed-off-by: Yi Jin <yi.jin@databricks.com>

* retrigger checks

Signed-off-by: Yi Jin <yi.jin@databricks.com>

---------

Signed-off-by: Yi Jin <yi.jin@databricks.com>

* store/bucket: snappy-encoded postings reading improvements (thanos-io#6245)

* store: pool input to snappy.Decode

Pool input to snappy.Decode to avoid allocations.

Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>

* store: use s2 for decoding snappy

It's faster hence use it.

Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>

* store: small code style adjustment

Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>

* store: call closefns before returning err

Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>

* store/postings_codec: return both if possible

Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>

* store/bucket: always call close fns

Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>

---------

Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>

* truncateExtLabels support Unicode cut (thanos-io#6267)

* truncateExtLabels support Unicode cut

Signed-off-by: mickeyzzc <mickey_zzc@163.com>

* update TestTruncateExtLabels and pass test

Signed-off-by: mickeyzzc <mickey_zzc@163.com>

---------

Signed-off-by: mickeyzzc <mickey_zzc@163.com>

* Update mentorship links

Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>

* Fix segfault in LabelValues during head compaction (thanos-io#6271)

* Fix segfault in LabelValues during head compaction

Head compaction causes blocks outside the retention period to get deleted.
If there is an in-flight LabelValues request at the same time, deleting
the block can cause the store proxy to panic since it loses access to
the data.

This commit fixes the issue by copying label values from TSDB stores
before returning them to the store proxy. I thought about exposing
a Close method on the TSDB store which the Proxy can call, but this will
not eliminate cases where gRPC defers sending data over a channel using its
queueing mechanism.

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

* Add changelog entry

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

* Assert no error when querying labels

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

---------

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

* Mixin: Allow specifying an instance name filter (thanos-io#6273)

This commit allows specifying the instance name filter, in order to
filter the datasources shown on the dashboards.

For example, when generating the dashboards one can do the following
(i.e in config.libsonnet)

```
  dashboard+:: {
    prefix: 'Thanos / ',
    ...
    instance_name_filter: '/EU.*/'
```

Signed-off-by: Jacob Baungard Hansen <jacobbaungard@redhat.com>

* Adds Deno to adopters.yml (thanos-io#6275)

Signed-off-by: Will (Newby) Atlas <will@deno.com>

* Bump `make test` timeout (thanos-io#6276)

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* fix 0.31 changelog (thanos-io#6278)

Signed-off-by: junot <junotxiang@kubesphere.io>

* Query: Switch Multiple Engines (thanos-io#6234)

* Query: Switch engines using `engine` param

Thanos query has two engines, prometheus (default) and thanos.
Only a single engine runs through the thanos query command at a time,
and the command has to be rerun to switch between them.

This commit adds functionality to run multiple engines at once
and switch between them using the `engine` query param in the query API.

To avoid duplicate metrics registration, the thanos engine is
provided with a different registerer having the prefix `tpe_` (not
been finalized yet).

The promql-engine command line flag, which specified the query
engine to run, has been removed.

Currently this functionality is not implemented on GRPCAPI.

Signed-off-by: Pradyumna Krishna <git@onpy.in>

* Add multiple engine support to GRPCAPI

Fix build fail for thanos, adds support for multiple engine in
GRPCAPI.

Signed-off-by: Pradyumna Krishna <git@onpy.in>

* Create QueryEngineFactory to create engines

QueryEngineFactory makes a collection for all promql engines used
by thanos and returns it. Any engine can be created and returned
using `GetXEngine` method.

It is currently limited to 2 engines prometheus and thanos engines
that get created on the first call.

Signed-off-by: Pradyumna Krishna <git@onpy.in>

* Use QueryEngineFactory in query API

thanos query commands pass `QueryEngineFactory` to query apis
that will use engine based on query params. It will provide more
flexibility to create multiple engines in thanos.

Adds `defaultEngine` CLI flag, A default engine to use if not
specified with query params.

Signed-off-by: Pradyumna Krishna <git@onpy.in>

* Update Query API tests

Fixes breaking tests

Signed-off-by: Pradyumna Krishna <git@onpy.in>

* Minor changes and Docs fixes

* Move defaultEngine argument to reduce diff.
* Generated Docs.

Signed-off-by: Pradyumna Krishna <git@onpy.in>

* Add Engine Selector/ Dropdown to Query UI

Engine Selector is a dropdown that sets an engine to be used to
run the query. Currently two engines `thanos` and `prometheus`.

This dropdown sends a query param `engine` to query api, which
runs the api using the engine provided. Provided to run query
using multiple query engines from Query UI.

Signed-off-by: Pradyumna Krishna <git@onpy.in>

* Move Engine Selector to Panel

Removes Dropdown component, and renders Engine Selector directly.
Receives defaultEngine from `flags` API.
Updates parseOptions to parse engine query param and updates test
for Panel and utils.

Signed-off-by: Pradyumna Krishna <git@onpy.in>

* Upgrade promql-engine dependency

Updates promql-engine, which brings functionality to provide a
fallback engine using engine Opts.

Signed-off-by: Pradyumna Krishna <git@onpy.in>

* Add MinT to remote client

The MinT method was missing from Client due to the updated promql-engine.
This commit adds MinT to the remote client.

Signed-off-by: Pradyumna Krishna <git@onpy.in>

* Use prometheus fallback engine in thanos engine

Thanos engine creates a fallback prometheus engine that conflicts
with another prometheus engine created by thanos, while
registering metrics. To fix this, provided created thanos engine
as fallback engine to thanos engine in engine Opts.

Signed-off-by: Pradyumna Krishna <git@onpy.in>

* Use enum for EngineType in GRPC

GRPC is used for communication between thanos components and
defaultEngine was a string before. Enum makes more sense, and
hence the request.Enigne type has been changed to
querypb.EngineType.
Default case is handled with another default value provided in
the enum.

Signed-off-by: Pradyumna Krishna <git@onpy.in>

* Update query UI bindata.go

Compile react app using `make assets`.

Signed-off-by: Pradyumna Krishna <git@onpy.in>

---------

Signed-off-by: Pradyumna Krishna <git@onpy.in>

* docs: mismatch in changelog

Signed-off-by: Etienne Martel <etienne.martel.7@gmail.com>

* Updates busybox SHA (thanos-io#6283)

Signed-off-by: GitHub <noreply@github.com>
Co-authored-by: fpetkovski <fpetkovski@users.noreply.github.com>

* Upgrade prometheus to 7309ac272195cb856b879306d6a27af7641d3346 (thanos-io#6287)

* Upgrade prometheus to 7309ac272195cb856b879306d6a27af7641d3346

Signed-off-by: Alex Le <leqiyue@amazon.com>

* Reverted test code

Signed-off-by: Alex Le <leqiyue@amazon.com>

* Updated comment

Signed-off-by: Alex Le <leqiyue@amazon.com>

* docs: mismatch in changelog

Signed-off-by: Etienne Martel <etienne.martel.7@gmail.com>
Signed-off-by: Alex Le <leqiyue@amazon.com>

* Updates busybox SHA (thanos-io#6283)

Signed-off-by: GitHub <noreply@github.com>
Co-authored-by: fpetkovski <fpetkovski@users.noreply.github.com>
Signed-off-by: Alex Le <leqiyue@amazon.com>

* trigger workflow

Signed-off-by: Alex Le <leqiyue@amazon.com>

* trigger workflow

Signed-off-by: Alex Le <leqiyue@amazon.com>

---------

Signed-off-by: Alex Le <leqiyue@amazon.com>
Signed-off-by: Etienne Martel <etienne.martel.7@gmail.com>
Signed-off-by: GitHub <noreply@github.com>
Co-authored-by: Etienne Martel <etienne.martel.7@gmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: fpetkovski <fpetkovski@users.noreply.github.com>

* Add CarTrade Tech as new adopter

Signed-off-by: naveadkazi <navead@carwale.com>

* tests: Remove custom Between test matcher (thanos-io#6310)

* Remove custom Between test matcher

The upstream PR to efficientgo/e2e has been merged, so we can use it from there.

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Run go mod tidy

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

---------

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* query frontend, query UI: Native histogram support (thanos-io#6071)

* Implemented native histogram support for qfe and query UI

Signed-off-by: Sebastian Rabenhorst <sebastian.rabenhorst@shopify.com>

Fixed marshalling for histograms in qfe

Started working on native histogram query ui

Copied histogram implementation for graph

Added query range support for native histograms in qfe

Use prom model (un-)marshal for native histograms in qfe

Use prom model (un-)marshal for native histograms in qfe

Fixed sample and sample stream marshal fn

Extended qfe native histogram e2e tests

Added copyright to qfe queryrange compat

Added query range test for histograms and try to fix ui tests

Fixed DataTable test

Review feedback

Fixed native histogram e2e test

Signed-off-by: Sebastian Rabenhorst <sebastian.rabenhorst@shopify.com>

Add histogram support for ApplyCounterResetsSeriesIterator

Signed-off-by: Sebastian Rabenhorst <sebastian.rabenhorst@shopify.com>

Made assets

Signed-off-by: Sebastian Rabenhorst <sebastian.rabenhorst@shopify.com>

Add changelog

Signed-off-by: Sebastian Rabenhorst <sebastian.rabenhorst@shopify.com>

Fixed changelog

Signed-off-by: Sebastian Rabenhorst <sebastian.rabenhorst@shopify.com>

Fixed qfe

Signed-off-by: Sebastian Rabenhorst <sebastian.rabenhorst@shopify.com>

Fixed PrometheusResponse minTime for histograms in qfe

Signed-off-by: Sebastian Rabenhorst <sebastian.rabenhorst@shopify.com>

Updated prometheus common to v0.40.0 and queryrange.Sample fixes

Signed-off-by: Sebastian Rabenhorst <sebastian.rabenhorst@shopify.com>

Updated Readme

Signed-off-by: Sebastian Rabenhorst <sebastian.rabenhorst@shopify.com>

Addressed PR comments

Signed-off-by: Sebastian Rabenhorst <sebastian.rabenhorst@shopify.com>

trigger tests

Signed-off-by: Sebastian Rabenhorst <sebastian.rabenhorst@shopify.com>

Made assets

Signed-off-by: Sebastian Rabenhorst <sebastian.rabenhorst@shopify.com>

* Made assets

Signed-off-by: Sebastian Rabenhorst <sebastian.rabenhorst@shopify.com>

* fixed tsdbutil references

Signed-off-by: Sebastian Rabenhorst <sebastian.rabenhorst@shopify.com>

* fixed imports

Signed-off-by: Sebastian Rabenhorst <sebastian.rabenhorst@shopify.com>

* Enabled pushdown for query native hist test and removed ToDo

Signed-off-by: Sebastian Rabenhorst <sebastian.rabenhorst@shopify.com>

* Refactored native histogram query UI

Signed-off-by: Sebastian Rabenhorst <sebastian.rabenhorst@shopify.com>

---------

Signed-off-by: Sebastian Rabenhorst <sebastian.rabenhorst@shopify.com>

* store: add streamed snappy encoding for postings list (thanos-io#6303)

* store: add streamed snappy encoding for postings list

We've noticed that decoding Snappy-compressed postings lists
takes a lot of RAM:

```
(pprof) top
Showing nodes accounting for 1427.30GB, 67.55% of 2112.82GB total
Dropped 1069 nodes (cum <= 10.56GB)
Showing top 10 nodes out of 82
      flat  flat%   sum%        cum   cum%
         0     0%     0%  1905.67GB 90.20%  golang.org/x/sync/errgroup.(*Group).Go.func1
    2.08GB 0.098% 0.098%  1456.94GB 68.96%  github.com/thanos-io/thanos/pkg/store.(*blockSeriesClient).ExpandPostings
    1.64GB 0.078%  0.18%  1454.87GB 68.86%  github.com/thanos-io/thanos/pkg/store.(*bucketIndexReader).ExpandedPostings
    2.31GB  0.11%  0.29%  1258.15GB 59.55%  github.com/thanos-io/thanos/pkg/store.(*bucketIndexReader).fetchPostings
    1.48GB  0.07%  0.36%  1219.67GB 57.73%  github.com/thanos-io/thanos/pkg/store.diffVarintSnappyDecode
 1215.21GB 57.52% 57.87%  1215.21GB 57.52%  github.com/klauspost/compress/s2.Decode
```

This is because we are creating a new []byte slice for the decoded data
each time. To avoid this RAM usage problem, let's stream the decoding
from a given buffer. Since Snappy block format doesn't support streamed
decoding, let's switch to Snappy stream format which is made for exactly
that.

Note that our current `index.Postings` implementation does not
support going backwards through Seek(), even though one could
theoretically want that. Fortunately, finding the intersection of
postings only requires moving forward.
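
A minimal sketch of the forward-only, streamed decoding idea described above, not the code from this change. It assumes postings are stored as varint-encoded deltas and uses only the Go standard library, so the Snappy layer is left out; in the real change the byte stream would come from a Snappy stream-format reader. All names here (`encodeDiffVarint`, `streamedPostings`, `newStreamedPostingsFromBytes`) are hypothetical.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// encodeDiffVarint writes sorted postings as varint-encoded deltas.
// Hypothetical helper for illustration only.
func encodeDiffVarint(postings []uint64) []byte {
	buf := make([]byte, 0, len(postings))
	tmp := make([]byte, binary.MaxVarintLen64)
	prev := uint64(0)
	for _, p := range postings {
		n := binary.PutUvarint(tmp, p-prev)
		buf = append(buf, tmp[:n]...)
		prev = p
	}
	return buf
}

// streamedPostings decodes one posting at a time from a byte stream,
// so the whole decoded list is never materialized in memory.
type streamedPostings struct {
	r       io.ByteReader
	cur     uint64
	started bool
	done    bool
	err     error
}

func newStreamedPostings(r io.Reader) *streamedPostings {
	return &streamedPostings{r: bufio.NewReader(r)}
}

// newStreamedPostingsFromBytes is a convenience wrapper for in-memory data.
func newStreamedPostingsFromBytes(b []byte) *streamedPostings {
	return newStreamedPostings(bytes.NewReader(b))
}

// Next advances to the next posting; it returns false at the end of
// the stream or on a decoding error.
func (s *streamedPostings) Next() bool {
	if s.done {
		return false
	}
	delta, err := binary.ReadUvarint(s.r)
	if err != nil {
		s.done = true
		if err != io.EOF {
			s.err = err
		}
		return false
	}
	s.cur += delta
	s.started = true
	return true
}

// Seek advances to the first posting >= v. It can only move forward:
// values that were already consumed cannot be revisited.
func (s *streamedPostings) Seek(v uint64) bool {
	if s.done {
		return false
	}
	if s.started && s.cur >= v {
		return true
	}
	for s.Next() {
		if s.cur >= v {
			return true
		}
	}
	return false
}

func (s *streamedPostings) At() uint64 { return s.cur }
func (s *streamedPostings) Err() error { return s.err }

func main() {
	enc := encodeDiffVarint([]uint64{3, 7, 100, 4096})
	p := newStreamedPostingsFromBytes(enc)
	if p.Seek(50) {
		fmt.Println("first posting >= 50:", p.At())
	}
	for p.Next() {
		fmt.Println("next:", p.At())
	}
	if err := p.Err(); err != nil {
		fmt.Println("decode error:", err)
	}
}
```

Because the iterator keeps only the current value and a small read buffer, memory use does not grow with the number of postings. The trade-off is that the iterator cannot go backwards.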

Benchmark data:

```
name                                                          time/op
PostingsEncodingDecoding/10000/raw/encode-16                  71.6µs ± 3%
PostingsEncodingDecoding/10000/raw/decode-16                  76.3ns ± 4%
PostingsEncodingDecoding/10000#01/snappy/encode-16            73.3µs ± 1%
PostingsEncodingDecoding/10000#01/snappy/decode-16            1.63µs ± 6%
PostingsEncodingDecoding/10000#02/snappyStreamed/encode-16     111µs ± 2%
PostingsEncodingDecoding/10000#02/snappyStreamed/decode-16    14.5µs ± 7%
PostingsEncodingDecoding/100000/snappyStreamed/encode-16      1.09ms ± 2%
PostingsEncodingDecoding/100000/snappyStreamed/decode-16      14.4µs ± 4%
PostingsEncodingDecoding/100000#01/raw/encode-16               710µs ± 1%
PostingsEncodingDecoding/100000#01/raw/decode-16              79.3ns ±13%
PostingsEncodingDecoding/100000#02/snappy/encode-16            719µs ± 1%
PostingsEncodingDecoding/100000#02/snappy/decode-16           13.5µs ± 4%
PostingsEncodingDecoding/1000000/raw/encode-16                7.14ms ± 1%
PostingsEncodingDecoding/1000000/raw/decode-16                81.7ns ± 9%
PostingsEncodingDecoding/1000000#01/snappy/encode-16          7.52ms ± 3%
PostingsEncodingDecoding/1000000#01/snappy/decode-16           139µs ± 4%
PostingsEncodingDecoding/1000000#02/snappyStreamed/encode-16  11.4ms ± 4%
PostingsEncodingDecoding/1000000#02/snappyStreamed/decode-16  15.5µs ± 4%

name                                                          alloc/op
PostingsEncodingDecoding/10000/raw/encode-16                  13.6kB ± 0%
PostingsEncodingDecoding/10000/raw/decode-16                   96.0B ± 0%
PostingsEncodingDecoding/10000#01/snappy/encode-16            25.9kB ± 0%
PostingsEncodingDecoding/10000#01/snappy/decode-16            11.0kB ± 0%
PostingsEncodingDecoding/10000#02/snappyStreamed/encode-16    16.6kB ± 0%
PostingsEncodingDecoding/10000#02/snappyStreamed/decode-16     148kB ± 0%
PostingsEncodingDecoding/100000/snappyStreamed/encode-16       148kB ± 0%
PostingsEncodingDecoding/100000/snappyStreamed/decode-16       148kB ± 0%
PostingsEncodingDecoding/100000#01/raw/encode-16               131kB ± 0%
PostingsEncodingDecoding/100000#01/raw/decode-16               96.0B ± 0%
PostingsEncodingDecoding/100000#02/snappy/encode-16            254kB ± 0%
PostingsEncodingDecoding/100000#02/snappy/decode-16            107kB ± 0%
PostingsEncodingDecoding/1000000/raw/encode-16                1.25MB ± 0%
PostingsEncodingDecoding/1000000/raw/decode-16                 96.0B ± 0%
PostingsEncodingDecoding/1000000#01/snappy/encode-16          2.48MB ± 0%
PostingsEncodingDecoding/1000000#01/snappy/decode-16          1.05MB ± 0%
PostingsEncodingDecoding/1000000#02/snappyStreamed/encode-16  1.47MB ± 0%
PostingsEncodingDecoding/1000000#02/snappyStreamed/decode-16   148kB ± 0%

name                                                          allocs/op
PostingsEncodingDecoding/10000/raw/encode-16                    2.00 ± 0%
PostingsEncodingDecoding/10000/raw/decode-16                    2.00 ± 0%
PostingsEncodingDecoding/10000#01/snappy/encode-16              3.00 ± 0%
PostingsEncodingDecoding/10000#01/snappy/decode-16              4.00 ± 0%
PostingsEncodingDecoding/10000#02/snappyStreamed/encode-16      4.00 ± 0%
PostingsEncodingDecoding/10000#02/snappyStreamed/decode-16      5.00 ± 0%
PostingsEncodingDecoding/100000/snappyStreamed/encode-16        4.00 ± 0%
PostingsEncodingDecoding/100000/snappyStreamed/decode-16        5.00 ± 0%
PostingsEncodingDecoding/100000#01/raw/encode-16                2.00 ± 0%
PostingsEncodingDecoding/100000#01/raw/decode-16                2.00 ± 0%
PostingsEncodingDecoding/100000#02/snappy/encode-16             3.00 ± 0%
PostingsEncodingDecoding/100000#02/snappy/decode-16             4.00 ± 0%
PostingsEncodingDecoding/1000000/raw/encode-16                  2.00 ± 0%
PostingsEncodingDecoding/1000000/raw/decode-16                  2.00 ± 0%
PostingsEncodingDecoding/1000000#01/snappy/encode-16            3.00 ± 0%
PostingsEncodingDecoding/1000000#01/snappy/decode-16            4.00 ± 0%
PostingsEncodingDecoding/1000000#02/snappyStreamed/encode-16    4.00 ± 0%
PostingsEncodingDecoding/1000000#02/snappyStreamed/decode-16    5.00 ± 0%
```

Compression ratios remain the same as before:

```
$ /bin/go test -v -timeout 10m -run ^TestDiffVarintCodec$ github.com/thanos-io/thanos/pkg/store
[snip]
=== RUN   TestDiffVarintCodec/snappy/i!~"2.*"
    postings_codec_test.go:73: postings entries: 944450
    postings_codec_test.go:74: original size (4*entries): 3777800 bytes
    postings_codec_test.go:80: encoded size 44498 bytes
    postings_codec_test.go:81: ratio: 0.012
=== RUN   TestDiffVarintCodec/snappyStreamed/i!~"2.*"
    postings_codec_test.go:73: postings entries: 944450
    postings_codec_test.go:74: original size (4*entries): 3777800 bytes
    postings_codec_test.go:80: encoded size 44670 bytes
    postings_codec_test.go:81: ratio: 0.012
```

Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>

* store: clean up postings code

Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>

* store: fix estimation

Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>

* store: use buffer.Bytes()

Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>

* store/postings_codec: reuse extgrpc compressors/decompressors

Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>

* CHANGELOG: add item

Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>

* CHANGELOG: clean up whitespace

Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>

---------

Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>

* compact: atomically replace no compact marked map (thanos-io#6319)

With lots of blocks it can take some time to fill the no-compact
marked map, so replace it atomically. I believe the old behavior leads
to problems in the compaction planner, where it picks up no-compact
marked blocks because the meta syncer runs synchronizations concurrently.

Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>
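
The atomic replacement described in the compact commit above can be sketched roughly as follows. This is an illustration under assumptions, not the actual Thanos code: `noCompactFilter`, `Sync`, and `IsMarked` are hypothetical names, and block IDs are plain strings here. The set is built in a local map and published in a single step, so concurrent readers never observe a partially filled map.

```go
package main

import (
	"fmt"
	"sync"
)

// noCompactFilter is a hypothetical stand-in for a filter that tracks
// which blocks carry a no-compact marker.
type noCompactFilter struct {
	mtx    sync.RWMutex
	marked map[string]struct{}
}

// Sync rebuilds the marked set in a local map and swaps it in at the end.
// Filling the map can be slow with many blocks, but readers keep seeing
// the previous complete map until the swap happens.
func (f *noCompactFilter) Sync(blockIDs []string, isMarked func(id string) bool) {
	fresh := make(map[string]struct{}, len(blockIDs))
	for _, id := range blockIDs {
		if isMarked(id) {
			fresh[id] = struct{}{}
		}
	}

	f.mtx.Lock()
	f.marked = fresh
	f.mtx.Unlock()
}

// IsMarked reports whether the block was marked during the last completed Sync.
func (f *noCompactFilter) IsMarked(id string) bool {
	f.mtx.RLock()
	defer f.mtx.RUnlock()
	_, ok := f.marked[id]
	return ok
}

func main() {
	f := &noCompactFilter{}
	f.Sync([]string{"a", "b", "c"}, func(id string) bool { return id != "b" })
	fmt.Println(f.IsMarked("a"), f.IsMarked("b"), f.IsMarked("c"))
}
```

The alternative of clearing the shared map and refilling it in place would expose an empty or half-filled set to any reader that runs during the refill, which matches the symptom the commit message describes.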

* Fixed modules, logicalplan flag and more

* Made assets

* Removed unused test function

* Sorted labels

---------

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
Signed-off-by: hackeramitkumar <amit9116260192@gmail.com>
Signed-off-by: Yi Jin <yi.jin@databricks.com>
Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>
Signed-off-by: mickeyzzc <mickey_zzc@163.com>
Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>
Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
Signed-off-by: Jacob Baungard Hansen <jacobbaungard@redhat.com>
Signed-off-by: Will (Newby) Atlas <will@deno.com>
Signed-off-by: junot <junotxiang@kubesphere.io>
Signed-off-by: Pradyumna Krishna <git@onpy.in>
Signed-off-by: Etienne Martel <etienne.martel.7@gmail.com>
Signed-off-by: GitHub <noreply@github.com>
Signed-off-by: Alex Le <leqiyue@amazon.com>
Signed-off-by: naveadkazi <navead@carwale.com>
Signed-off-by: Sebastian Rabenhorst <sebastian.rabenhorst@shopify.com>
Co-authored-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
Co-authored-by: Filip Petkovski <filip.petkovsky@gmail.com>
Co-authored-by: Amit kumar <amit9116260192@gmail.com>
Co-authored-by: Yi Jin <96499497+jnyi@users.noreply.github.com>
Co-authored-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>
Co-authored-by: MickeyZZC <mickeyzzc@gmail.com>
Co-authored-by: Saswata Mukherjee <saswataminsta@yahoo.com>
Co-authored-by: Jacob Baungård Hansen <jacobbaungard@redhat.com>
Co-authored-by: Will (Newby) Atlas <willnewby@gmail.com>
Co-authored-by: junot <49136171+junotx@users.noreply.github.com>
Co-authored-by: Pradyumna Krishna <git@onpy.in>
Co-authored-by: Etienne Martel <etienne.martel.7@gmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: fpetkovski <fpetkovski@users.noreply.github.com>
Co-authored-by: Alex Le <emoc1989@gmail.com>
Co-authored-by: naveadkazi <navead@carwale.com>
hczhu pushed a commit to databricks/thanos that referenced this pull request Jun 27, 2023
…he future (thanos-io#6195)

* Add an experimental flag to block samples with timestamp too far in the future

Signed-off-by: Yi Jin <yi.jin@databricks.com>

* fix bug

Signed-off-by: Yi Jin <yi.jin@databricks.com>

* address comments

Signed-off-by: Yi Jin <yi.jin@databricks.com>

* fix docs CI errors

Signed-off-by: Yi Jin <yi.jin@databricks.com>

* resolve merge conflicts

Signed-off-by: Yi Jin <yi.jin@databricks.com>

* resolve merge conflicts

Signed-off-by: Yi Jin <yi.jin@databricks.com>

* retrigger checks

Signed-off-by: Yi Jin <yi.jin@databricks.com>

---------

Signed-off-by: Yi Jin <yi.jin@databricks.com>
@Avigdorrr

Can we make this flag stable?
The flag was merged over a year ago, and I can't find any open (or even closed) issues related to problems or bugs with it.

@GiedriusS
Member

Yes, help wanted. I think we can definitely mark it as stable.
