Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

e2e: Fixed WaitSumMetrics to fail on non existing metric #2256

Merged
merged 3 commits into from
Mar 12, 2020

Conversation

bwplotka
Copy link
Contributor

Hello Busy Cortexianians 👋

This is kind of tricky as logically sum of non-existing is 0, but also it's super easy to make a
mistake and put the wrong metric and sneakily introduce bug... so I think being strict makes sense here? WDYT?

Alternative is to extend isExpected func(sums ...float64) bool to something like isExpected func(exists bool, sums ...float64) bool
but I think the proposed simplification makes sense here. (:

Also, I think the e2e base unit test should be run all the time not only on integration build,
remove the tag from those in the main e2e core package.

cc @pstibrany @pracucci

Signed-off-by: Bartlomiej Plotka bwplotka@gmail.com

@bwplotka
Copy link
Contributor Author

Some context thanks to @kakkoyun: thanos-io/thanos#2256

@pracucci
Copy link
Contributor

This is kind of tricky as logically sum of non-existing is 0, but also it's super easy to make a
mistake and put the wrong metric and sneakily introduce bug... so I think being strict makes sense here? WDYT?

Thanks @bwplotka! I see the use case but I'm dubious about the fix. As you can see some integration tests fail, and the problem is that some metrics are not immediately exported right after registration. Think about metrics with labels: the first series is exported once a value is tracked for the first time, which may occur after the first time /metrics is called.

@bwplotka
Copy link
Contributor Author

the first series is exported once a value is tracked for the first time, which may occur after the first time /metrics is called.

This is not true...

it's tracked for the first time when you do metrics.WithLabelValues(....) (it will be present as 0) or metrics.WithLabelValues(....).Set(X) (as X).

This means:

  • You should always initialize labels if you can
  • Maybe we can assert on metric{expectedlabel=expectedvalue
  • If you cannot init because the label is so dynamic (I doubt it's really dynamic in e2e tests), maybe you have different metrics that states the state you expect, or we need WaitCountMetric method?

Additionally, think about this:

  • If you will put the wrong metric your test will pass even though there might be a bug, so assert on 0 without this is really like asserting on nothing. In fact in a case like this when you expect metric to NOT be there, you can even point to totally wrong service and our assertion will work... this does not really make any sense ;p
  • You should initialize your metrics anyway (it's a good pattern), see: https://www.youtube.com/watch?v=LU6D5cNeHks&t=778s

Anyway, in my free time, I can look on some failing cases to see how I would fix them with my "own tips"... Let me know if that makes sense.

@pracucci pracucci force-pushed the wait-sum-metric-fix branch from d339d57 to 57b0f7d Compare March 12, 2020 10:57
Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed some changes to rework metrics used in integration tests. We have a bunch of global metrics in Cortex which I'm progressively get rid of. Will work on the table manager metrics too, but in a separate PR.

The metrics used in integration tests are the following:

Fixed:

  • cortex_alertmanager_configs
  • cortex_dynamo_sync_tables_seconds > introduced cortex_table_manager_sync_success_timestamp_seconds

The label is the tenant which can't be predicted, but used a trick in integration tests to check another related metric first:

  • cortex_ingester_memory_series_created_total
  • cortex_ingester_memory_series_removed_total

Already OK:

  • cortex_ring_tokens_total
  • cortex_ingester_shipper_uploads_total
  • cortex_ingester_memory_series
  • cortex_querier_bucket_store_blocks_loaded
  • memberlist_client_cluster_members_count
  • cortex_querier_blocks_index_cache_items
  • cortex_querier_blocks_index_cache_items_added_total
  • cortex_querier_blocks_index_cache_hits_total

@@ -1,5 +1,3 @@
// +build integration
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The meaning we gave to this tag was "a test requiring Docker". Does it cause you any trouble keeping it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, my point is that it's not an integration test so technically this tag is wrong.

What if change this tag to requires_docker? (:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did that (:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't care too much how it's called, but feel like this is change just for the change. What if we have integration tests that don't use docker in the future? README.md is not updated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also update integration/README.md accordingly?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Idea behind using a tag here was to avoid running tests in integration dir when doing go test ./... from main Cortex package. I simply used tag name that matched the directory name.

Copy link
Contributor Author

@bwplotka bwplotka Mar 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but feel like this is change just for the change

Just seen this and... what? What do you mean? Why I would put on myself more work to improve something in Cortex for no reason? @pstibrany Sorry, I don't get it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just don't feel that renaming the tag was necessary, but ... whatever works. I don't care too much about the name as long as there is a tag in the first place.

Copy link
Contributor

@gouthamve gouthamve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM pending on Marco's comment about the integration tag.

@bwplotka
Copy link
Contributor Author

Fixed:

cortex_alertmanager_configs
cortex_dynamo_sync_tables_seconds > introduced cortex_table_manager_sync_success_timestamp_seconds
The label is the tenant which can't be predicted, but used a trick in integration tests to check another related metric first:

cortex_ingester_memory_series_created_total
cortex_ingester_memory_series_removed_total
Already OK:

cortex_ring_tokens_total
cortex_ingester_shipper_uploads_total
cortex_ingester_memory_series
cortex_querier_bucket_store_blocks_loaded
memberlist_client_cluster_members_count
cortex_querier_blocks_index_cache_items
cortex_querier_blocks_index_cache_items_added_total
cortex_querier_blocks_index_cache_hits_total

Thank you @pracucci ! And sorry for more work for you. But I think it actually will help Cortex as well to avoid init metric problem ❤️

cc @kakkoyun as we saved the world again with out talk 💪 😄

@kakkoyun
Copy link

@bwplotka We're fulfiling our mission is to make the world better place one PR at a time 😁

Makefile Outdated
@@ -135,7 +135,7 @@ shell:
bash

configs-integration-test:
/bin/bash -c "go test -v -tags 'netgo integration' -timeout 30s ./pkg/configs/... ./pkg/ruler/..."
/bin/bash -c "go test -v -tags 'netgo requires_docker' -timeout 30s ./pkg/configs/... ./pkg/ruler/..."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a different set of integration tests. You should actually rollback this and replace the -tags=integration in .circleci/config.yml.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ups

@@ -1,5 +1,3 @@
// +build integration
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also update integration/README.md accordingly?

@pracucci
Copy link
Contributor

Also, we've just merged another integration test. Could you rebase master and change cortex_dynamo_sync_tables_seconds to cortex_table_manager_sync_success_timestamp_seconds in integration/chunks_storage_backends_test.go, please?

@bwplotka bwplotka force-pushed the wait-sum-metric-fix branch from dc2d6bd to f570616 Compare March 12, 2020 12:00
bwplotka and others added 2 commits March 12, 2020 12:16
This is kind of tricky as logically sum of non-existing is 0, but also it's super easy to make a
mistake and put wrong metric and sneakly introduce bug... so I think being strict makes sense here? WDYT?

Alternative is to extend `isExpected func(sums ...float64) bool` to something like `isExpected func(exists bool, sums ...float64) bool`
but I think the proposed simplification makes sense here. (:

Also I think e2e base unit test should be run all the time not only on integration build,
remove the tag from those in main e2e core package.

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
@bwplotka bwplotka force-pushed the wait-sum-metric-fix branch from f570616 to 55f7276 Compare March 12, 2020 12:16
@bwplotka
Copy link
Contributor Author

PTAL (:

@bwplotka bwplotka force-pushed the wait-sum-metric-fix branch from 55f7276 to 5e2b851 Compare March 12, 2020 13:58
@bwplotka
Copy link
Contributor Author

Done, sorry for missed comment.

@bwplotka
Copy link
Contributor Author

--- FAIL: TestChunksStorageAllIndexBackends (12.15s)
    chunks_storage_backends_test.go:58: 
        	Error Trace:	chunks_storage_backends_test.go:58
        	Error:      	Received unexpected error:
        	            	metric cortex_dynamo_sync_tables_seconds not found in table-manager metric page
        	            	github.com/cortexproject/cortex/integration/e2e.(*HTTPService).WaitSumMetrics
        	            		/home/circleci/.go_workspace/src/github.com/cortexproject/cortex/integration/e2e/service.go:549
        	            	github.com/cortexproject/cortex/integration.TestChunksStorageAllIndexBackends
        	            		/home/circleci/.go_workspace/src/github.com/cortexproject/cortex/integration/chunks_storage_backends_test.go:58
        	            	testing.tRunner
        	            		/usr/local/go/src/testing/testing.go:909
        	            	runtime.goexit
        	            		/usr/local/go/src/runtime/asm_amd64.s:1357
        	Test:       	TestChunksStorageAllIndexBackends
=== RUN   TestGettingStartedSingleProcessConfigWithChunksStorage

Flakiness?

@pracucci
Copy link
Contributor

Flakiness?

No, missed comment :)
#2256 (comment)

@bwplotka
Copy link
Contributor Author

damn I did not read the comment properly like 3rd time 🤦‍♂️

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
@bwplotka bwplotka force-pushed the wait-sum-metric-fix branch from 5e2b851 to 16a9a44 Compare March 12, 2020 14:18
Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Once CI pass we can merge it.

@gouthamve gouthamve merged commit 9f4d65b into cortexproject:master Mar 12, 2020
@pstibrany
Copy link
Contributor

Thanks Bartek, this is useful.

@bwplotka
Copy link
Contributor Author

Thanks ❤️

@bwplotka bwplotka deleted the wait-sum-metric-fix branch March 12, 2020 15:14
simonswine pushed a commit to grafana/e2e that referenced this pull request Jan 13, 2022
…ct/cortex#2256)

* e2e: Fixed WaitSumMetrics to fail on non existing metric

This is kind of tricky as logically sum of non-existing is 0, but also it's super easy to make a
mistake and put wrong metric and sneakly introduce bug... so I think being strict makes sense here? WDYT?

Alternative is to extend `isExpected func(sums ...float64) bool` to something like `isExpected func(exists bool, sums ...float64) bool`
but I think the proposed simplification makes sense here. (:

Also I think e2e base unit test should be run all the time not only on integration build,
remove the tag from those in main e2e core package.

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>

* Fixed init of the metrics used in integration tests

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Moved build tag to `required_docker` to be explicit.

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>

Co-authored-by: Marco Pracucci <marco@pracucci.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants