Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

receive: Add block shipping #1011

Merged
merged 1 commit into from
Apr 18, 2019
Merged

Conversation

brancz
Copy link
Member

@brancz brancz commented Apr 5, 2019

Changes

This adds block shipping to the receive component.

Verification

Ran the quickstart script and watched blocks being uploaded, and stopped the receive component and could still query the uploaded data. Ideally this would have an e2e test, but I'm not sure how to realize that, and I couldn't find e2e tests for this for the sidecar component either.

@domgreen @bwplotka @FUSAKLA @GiedriusS @povilasv

cc @metalmatze @squat

@brancz brancz force-pushed the receive-shipper branch 2 times, most recently from 1b2d23b to d398997 Compare April 5, 2019 12:53
Copy link
Member

@povilasv povilasv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor nits, but LGTM

cmd/thanos/receive.go Outdated Show resolved Hide resolved

err := addShipper(logger, reg, g, nil, metadata.ReceiveSource, &promMetadata{labels: externalLabels}, dataDir, objStoreConfig, false)
if err != nil {
return err
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe wrap the error here?

@@ -229,6 +229,16 @@ func runSidecar(
})
}

err := addShipper(logger, reg, g, peer, metadata.SidecarSource, m, dataDir, objStoreConfig, uploadCompacted)
if err != nil {
return err
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we wrap the error here


mint, maxt := m.Timestamps()
if peer != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't need this as we have noop peer in the gossip, and this will be removed in #1008 so less merge issues for me :D

@povilasv
Copy link
Member

@brancz re e2e tests, we definetely need to work on this.

@brancz
Copy link
Member Author

brancz commented Apr 11, 2019

Addressed comments. I'd say let's do upload tests for sidecar and receive as a follow up @povilasv ?

@brancz
Copy link
Member Author

brancz commented Apr 12, 2019

Looks like something was odd with Circle CI, re-triggering with force push.

cmd/thanos/receive.go Outdated Show resolved Hide resolved
Copy link
Member

@povilasv povilasv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a minor nit.

@brancz brancz force-pushed the receive-shipper branch 3 times, most recently from d3a9d45 to 28c15f6 Compare April 17, 2019 13:22
Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Genrally LGTM, just style & flag inconsistency.

@@ -36,6 +39,10 @@ func registerReceive(m map[string]setupFunc, app *kingpin.Application, name stri
dataDir := cmd.Flag("tsdb.path", "Data directory of TSDB.").
Default("./data").String()

externalLabels := cmd.Flag("external-labels", "External labels to announce. Comma separated list of key=value pairs. This flag will be removed in the future when handling multiple tsdb instances is added.").PlaceHolder("key1=value1,key2=value2").String()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we have just --labels naming in ruler. We should be consistent proably and rename this or ruler ones?

This flag will be removed in the future when handling multiple tsdb instances is added.

Any view how it will look like then?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'll be a whitelist of label-names, and set on the fly based on incoming data, as receivers have to handle multiple. As the proposal mentioned, we'll need to add the ability to publish multiple sets of labels that a receiver can respond to.

Agreed regarding consistency, will adapt.

@@ -36,6 +39,10 @@ func registerReceive(m map[string]setupFunc, app *kingpin.Application, name stri
dataDir := cmd.Flag("tsdb.path", "Data directory of TSDB.").
Default("./data").String()

externalLabels := cmd.Flag("external-labels", "External labels to announce. Comma separated list of key=value pairs. This flag will be removed in the future when handling multiple tsdb instances is added.").PlaceHolder("key1=value1,key2=value2").String()

objStoreConfig := regCommonObjStoreFlags(cmd, "", false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this required? Basing of fact that retention is not configurable?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a blocker,

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could see people just caring about a single instance of this with just local storage. I can see scenarios where it would not have to be set.

Copy link
Member Author

@brancz brancz Apr 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a side note, retention should be configurable. People should be able to make a choice how they want local storage to be treated when running receive I think. I'll make retention configurable in a follow up PR. Don't want to mix concerns here.

) error {
logger = log.With(logger, "component", "receive")
level.Warn(logger).Log("msg", "setting up receive; the Thanos receive component is EXPERIMENTAL, it may break significantly without notice")

externalLabels := labels.Labels{}
labelStrings := strings.Split(externalLabelsString, ",")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is parseLabels function that gives you that and is used in more than one place I believe, can we reuse it? https://github.com/improbable-eng/thanos/blob/59634d99e4b7aa74d53814ed9b2401f6068e17cf/cmd/thanos/rule.go#L718

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wonderful 👍

@@ -225,6 +250,12 @@ func runReceive(
},
)
}

err := addShipper(logger, reg, g, metadata.ReceiveSource, &promMetadata{labels: externalLabels}, dataDir, objStoreConfig, false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We used to do if err := ... ; err != nil { paradigm in Thanos (:


err := addShipper(logger, reg, g, metadata.ReceiveSource, &promMetadata{labels: externalLabels}, dataDir, objStoreConfig, false)
if err != nil {
return errors.Wrap(err, "failed to add block shipper to run group")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit, when you wrap error for Thanos, let's avoid failed to prefixes. It's already known it failed. Wrap is only to figure out where it fail and why NOT to check IF it failed (:

Suggested change
return errors.Wrap(err, "failed to add block shipper to run group")
return errors.Wrap(err, "add block shipper to run group")

@@ -229,6 +229,16 @@ func runSidecar(
})
}

err := addShipper(logger, reg, g, metadata.SidecarSource, m, dataDir, objStoreConfig, uploadCompacted)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto with if err := ... paradigm

@@ -229,6 +229,16 @@ func runSidecar(
})
}

err := addShipper(logger, reg, g, metadata.SidecarSource, m, dataDir, objStoreConfig, uploadCompacted)
if err != nil {
return errors.Wrap(err, "failed to add block shipper to run group")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto with error wrap

return nil
}

func addShipper(logger log.Logger, reg prometheus.Registerer, g *run.Group, sourceType metadata.SourceType, m *promMetadata, dataDir string, objStoreConfig *pathOrContent, uploadCompacted bool) error {
confContentYaml, err := objStoreConfig.Content()
Copy link
Member

@bwplotka bwplotka Apr 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not a fan of this. You save a line but addShipper means to me your are actually adding shipper NOT adding only IF something happens, can we rearrange? Sorry for being picky ): I don't like methods that hides and suprises.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed.

return nil
}

func addShipper(logger log.Logger, reg prometheus.Registerer, g *run.Group, sourceType metadata.SourceType, m *promMetadata, dataDir string, objStoreConfig *pathOrContent, uploadCompacted bool) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about rule.go shipper?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure exactly what you mean here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I need to look at that, I'll take care of it tomorrow!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at it a bit, and I feel like the best decision around this for now is to not generalize this in the addShipper function, but once gossip removal lands, this is nicely generalizable for receive/sidecar/rule.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually reducing this to what's necessary in receive, I think I might even prefer the slight duplication, as the code necessary is a lot smaller than the addShipper function.

@brancz brancz force-pushed the receive-shipper branch 3 times, most recently from 98912c5 to 8cf31fa Compare April 17, 2019 16:23
Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome 💓 LGTM, thanks!

Leaving you to merge.

@brancz brancz merged commit 3b13aeb into thanos-io:master Apr 18, 2019
@brancz brancz deleted the receive-shipper branch April 18, 2019 08:49
smalldirector pushed a commit to smalldirector/thanos that referenced this pull request Jun 20, 2019
* query: cleanup store statuses as they come and go (thanos-io#910)

Signed-off-by: Adrien Fillon <adrien.fillon@cdiscount.com>

* [docs] Example of using official prometheus helm chart to deploy server with sidecar (thanos-io#1003)

* update documentation with an example of using official prometheus helm chart

Signed-off-by: Ivan Kiselev <ivan@messagebird.com>

* a little formatting to values

Signed-off-by: Ivan Kiselev <ivan@messagebird.com>

* satisfy PR comments

Signed-off-by: Ivan Kiselev <ivan@messagebird.com>

* Compact: group concurrency  (thanos-io#1010)

* compact: add concurrency to group compact

* add flag to controll the number of goroutines to use when compacting group

* update compact.md for group-compact-concurrency

* fixed: miss wg.Add()

* address CR

* regenerate docs

* use err group

* fix typo in flag description

* handle context

* set up workers in main loop

* move var initialisation

* remove debug log

* validate concurrency

* move comment

* warn -> error

* remove extra newline

* fix typo

* dns: Added miekgdns resolver as a hidden option to query and ruler. (thanos-io#1016)

Fixes: thanos-io#1015

Signed-off-by: Bartek Plotka <bwplotka@gmail.com>

* query: set default evaluation interval (thanos-io#1028)

Subqueries allows request with no [specified resolution](https://prometheus.io/blog/2019/01/28/subquery-support/).
Set it up and allow to configure default evaluation interval.

* store+compactor: pre-compute index cache during compaction (thanos-io#986)

Fixes first part of thanos-io#942

This changes allow to safe some startup & sync time in store gateway as it is no longer is needed to compute index-cache from block index on its own. For compatibility store GW still can do it, but it first checks bucket if there is index-cached uploaded already.

In the same time, compactor precomputes the index cache file on every compaction. To allow quicker addition of index cache files we added `--index.generate-missing-cache-file` flag, that if enabled precompute missing files on compactor startup. Note that it will take time and it's only one-off step per bucket.

Signed-off-by: Aleksei Semiglazov <xjewer@gmail.com>

* Added website for Thanos' docs using Hugo. (thanos-io#807)

Hosted in github pages.

Signed-off-by: adrien-f <adrien.fillon@gmail.com>
Signed-off-by: Bartek Plotka <bwplotka@gmail.com>

* gcs: Fixed scopes for inline ServiceAccount option. (thanos-io#1033)

Without this that option was unusable.

Signed-off-by: Bartek Plotka <bwplotka@gmail.com>

* Fixed root docs and liche is now checking root dir as well. (thanos-io#1040)

Signed-off-by: Bartek Plotka <bwplotka@gmail.com>

* storage docs: add detail about GCS policies and testing (thanos-io#1037)

* add more details about GCS policies and testing

* remove fixed names from exec command

* Prometheus library updated to v2.8.1 (thanos-io#1009)

* compact:  group concurrency improvements (thanos-io#1029)

* group concurrency improvements

* remove unnecessary error check

* add to wg in main goroutine

* receive: Add block shipping (thanos-io#1011)

* receive: Add retention flag for local tsdb storage (thanos-io#1046)

* querier: Add /api/v1/labels support (thanos-io#905)

* Feature: add /api/v1/labels support

Signed-off-by: jojohappy <sarahdj0917@gmail.com>

* Disabled gossip by default, marked all flags as deprecated. (thanos-io#1055)

+ changed small label.

Signed-off-by: Bartek Plotka <bwplotka@gmail.com>

* ruler: Fixed Chunk going out or Max Uint16. (thanos-io#1041)

Fixes thanos-io#1038

Signed-off-by: Bartek Plotka <bwplotka@gmail.com>

* store: azure: allow passing an endpoint parameter for specific regions (thanos-io#980)

Fix thanos-io#968

Signed-off-by: Adrien Fillon <adrien.fillon@cdiscount.com>

* feature: support POST method for series endpoint (thanos-io#1021)

Signed-off-by: Joseph Lee <joseph.t.lee@outlook.com>

* bucket verify: repair out of order labels (thanos-io#964)

* bucket verify: repair out of order labels

* verify repair: correctly order series in the index on rewrite

When we have label sets that are not in the correct order, fixing that
changes the order of the series in the index.  So the index must be
rewritten in that new order.  This makes this repair tool take up a
bunch more memory, but produces blocks that verify correctly.

* Fix the TSDB block safe-delete function

The directory name must be the block ID name exactly to verify.  A temp
directory or random name will not work here.

* verify repair: fix duplicate chunk detection

Pointer/reference logic error was eliminating all chunks for a series in
a given TSDB block that wasn't the first chunk.  Chunks are now
referenced correctly via pointers.

* PR feedback: use errors.Errorf() instead of fmt.Errorf()

* Use errors.New()

Some linters catch errors.Errorf() as its not really part of the errors
package.

* Liberally comment this for loop

We're comparing items by pointers, using Go's range variables is
misleading here and we need not fall into the same trap.

* Take advantage of sort.Interface

This prevents us from having to re-implement label sorting.

* PR Feedback: Comments are full sentences.

* Cut release 0.4.0-rc.0 (thanos-io#1017)

* Cut release 0.4.0-rc.0 🎉 🎉

NOTE: This is last release that has gossip.

Signed-off-by: Bartek Plotka <bwplotka@gmail.com>

Co-Authored-By: povilasv <p.versockas@gmail.com>

* Fixed crossbuild.

Signed-off-by: Bartek Plotka <bwplotka@gmail.com>

* ci: Env fixes. (thanos-io#1058)

Signed-off-by: Bartek Plotka <bwplotka@gmail.com>

* Removed bzr requirement for make crossbuild.

Signed-off-by: Bartek Plotka <bwplotka@gmail.com>

* Bump github.com/hashicorp/golang-lru from 0.5.0 to 0.5.1 (thanos-io#1051)

Bumps [github.com/hashicorp/golang-lru](https://github.com/hashicorp/golang-lru) from 0.5.0 to 0.5.1.
- [Release notes](https://github.com/hashicorp/golang-lru/releases)
- [Commits](hashicorp/golang-lru@v0.5.0...v0.5.1)

Signed-off-by: dependabot[bot] <support@dependabot.com>

* Initialze and correctly register all index cache metrics. (thanos-io#1069)

* store/cache: add more tests (thanos-io#1071)

*  Fixed Downsampling process; Fixed `runutil.CloseAndCaptureErr` (thanos-io#1070)

* runutil. Simplified CloseWithErrCapture.

Signed-off-by: Bartek Plotka <bwplotka@gmail.com>

* Fixed Downsampling process; Fixed runutil.CloseAndCaptureErr

Fixes thanos-io#1065

Root cause:
* runutil defered capture error function was not passing error properly so unit tests were passing, event though there was bug
* streamed block write index cache requires index file which was not closed (saved) properly yet. Closers need to be closed to perform this.

Signed-off-by: Bartek Plotka <bwplotka@gmail.com>

* objstore: Expose S3 region attribute (thanos-io#1060)

Minio is able to autodetect the region for cloud providers like AWS but the logic fails with Scaleway Object Storage solution.

Related issue on Minio: minio/mc#2570

* Fixed fetching go-bindata failed (thanos-io#1074)

* Fixed bug:
- fetching go-bindata failed.
- change the repo of go-bindata to github.com/go-bindata/go-bindata,
because old repo has been archived.
- pin the go-bindata as v3.3.1.

Signed-off-by: jojohappy <sarahdj0917@gmail.com>

* Add CHANGELOG

Signed-off-by: jojohappy <sarahdj0917@gmail.com>

* Remove CHANGELOG

Signed-off-by: jojohappy <sarahdj0917@gmail.com>

* add compare flags func to compare flags between prometheus and sidecar (thanos-io#838)

Original message:

* update documentation for a max/min block duration

add compare flags func to compare flags between prom and sidecar

* fix some nits


Functional change: now we check the configured flags (if possible) and error out if MinTime != MaxTime. We need to check this always since if that is not true then we will get overlapping blocks. Additionally, an error message is printed out if it is not equal to 2h (the recommended value).

* Ensured index cache is best effort, refactored tests, validated edge cases. (thanos-io#1073)

Fixes thanos-io#651

Current size also includes slice header.

Signed-off-by: Bartek Plotka <bwplotka@gmail.com>

* website: Moved to netlify. (thanos-io#1078)

Signed-off-by: Bartek Plotka <bwplotka@gmail.com>

* website: Fixing netlify. (thanos-io#1080)

Signed-off-by: Bartek Plotka <bwplotka@gmail.com>

* website: Added "founded by" footer. (thanos-io#1081)

Signed-off-by: Bartek Plotka <bwplotka@gmail.com>

* store/proxy: properly check if context has ended (thanos-io#1082)

How the code was before it could happen that we might receive some
series from the stream however by the time we'd send them back to the
reader, it would not read it anymore since the deadline would have been
exceeded.

Properly use a `select` here to get out of the goroutine if the deadline
has been exceeded.

Might potentially fix a problem where we see one goroutine hanging
constantly (and thus blocking from work being done):

```
goroutine profile: total 126
25 @ 0x42f62f 0x40502b 0x405001 0x404de5 0xe7435b 0x45cc41
	0xe7435a	github.com/improbable-eng/thanos/pkg/store.startStreamSeriesSet.func1+0x18a	/go/src/github.com/improbable-eng/thanos/pkg/store/proxy.go:318
```

* Cut release v0.4.0-rc.1 (thanos-io#1088)

Signed-off-by: Bartek Plotka <bwplotka@gmail.com>

* website: Removed ghpages handling; fixed docs; and status badge. (thanos-io#1084)

Signed-off-by: Bartek Plotka <bwplotka@gmail.com>

* Fix readme (thanos-io#1090)

* store: Compose indexCache properly allowing injection for testing purposes. (thanos-io#1098)

Signed-off-by: Bartek Plotka <bwplotka@gmail.com>

* website: add sponsor section on homepage (thanos-io#1062)

* website: Adjusted logos sizing and responsiveness. (thanos-io#1105)

Signed-off-by: Bartek Plotka <bwplotka@gmail.com>

* Add Monzo to "Used by" section 🎉 (thanos-io#1106)

* Compactor: remove malformed blocks after delay (thanos-io#1053)

* compactor removes malformed blocks after delay

* compactor removes malformed blocks after delay

* include missing file

* reuse existing freshness check

* fix comment

* remove unused var

* fix comment

* syncDelay -> consistencyDelay

* fix comment

* update flag description

* address cr

* fix dupliacte error handling

* minimum value for --consistency-delay

* update

* docs

* add test case

* move test to inmem bucket

* Add Utility Warehouse to "used by" section (thanos-io#1108)

* Add Utility Warehouse logo

* Make logo smaller

* website: add Adform as users (thanos-io#1109)

We use Thanos extensively as well so I have added Adform.

* Cut release v0.4.0 (thanos-io#1107)

Signed-off-by: Bartek Plotka <bwplotka@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants