
Add parallel walking support #550

Merged: jacksontj merged 1 commit into master from parallel_walk on Mar 9, 2023

Conversation

@jacksontj (Owner) commented Mar 8, 2023

This PR is a potential fix for parallel walking (which seems to be a lot simpler on the new prom fork). Before merging I'll need to update the prometheus fork (instead of hacking the change directly into the vendor dir), but this is a good way to test it out, if anyone is interested in giving it a spin.

Fixes #461
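For context, here is a rough sketch of what "parallel walking" means in this PR. This is an illustration only, not the code in the PR or in the prom fork: sibling subtrees of the PromQL AST are visited in separate goroutines, so a node's ancestors are still processed before it, but the order across siblings is no longer deterministic.

```go
package walkexample

import (
	"sync"

	"github.com/prometheus/prometheus/promql/parser"
)

// parallelWalk visits node, then walks each child subtree in its own
// goroutine. A parent is always visited before its descendants, but the
// relative order between sibling subtrees is unspecified.
func parallelWalk(node parser.Node, visit func(parser.Node)) {
	visit(node)

	var wg sync.WaitGroup
	for _, child := range parser.Children(node) {
		wg.Add(1)
		go func(c parser.Node) {
			defer wg.Done()
			parallelWalk(c, visit)
		}(child)
	}
	wg.Wait()
}
```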

@jacksontj (Owner, Author) commented:

When running the tests locally with -race I get an error:

--- FAIL: TestEvaluations (75.49s)
    --- FAIL: TestEvaluations/0testdata/selectors.test (6.28s)
        promql_test.go:249: error running test testdata/selectors.test: error in eval rate(http_requests[40s] @ 18000.000) - rate(http_requests[1m] @ 18000.000 offset 2h46m40s) (line 70) range mode: expected 5 for {az="a", group="canary", instance="0", job="api-server"} but got 5.

This is consistently reproducible whenever -race is on, so there is still some odd bug here to dig into.

@jacksontj (Owner, Author) commented:

I have found the cause of the issue; an example of it can be found here. Basically, the current implementation assumes an in-order tree traversal, so it uses a single variable for evalRange which is cleared after use. In the parallel walking case this breaks down because the execution order is no longer guaranteed to be the same: it moves from a guaranteed sequential order to only guaranteeing that the tree above a particular node has been processed.
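To make the failure mode concrete, here is a minimal, self-contained sketch of the pattern described above (not the promxy/Prometheus source, just an illustration): a single shared evalRange is set at each MatrixSelector and consumed (then cleared) at the VectorSelector nested inside it, which is only correct when the walk is strictly sequential.

```go
package main

import (
	"fmt"
	"time"

	"github.com/prometheus/prometheus/promql/parser"
)

func main() {
	// Two matrix selectors in one expression, similar to the failing test case.
	expr, err := parser.ParseExpr(`rate(http_requests[40s]) - rate(http_requests[1m] offset 2h)`)
	if err != nil {
		panic(err)
	}

	// Single shared variable: set at the MatrixSelector, consumed and
	// cleared at the VectorSelector nested inside it. Correct only when
	// the walk is depth-first and sequential; if subtrees are walked
	// concurrently, the two selectors race on this one variable.
	var evalRange time.Duration

	parser.Inspect(expr, func(node parser.Node, _ []parser.Node) error {
		switch n := node.(type) {
		case *parser.MatrixSelector:
			evalRange = n.Range // remember the range for the nested selector
		case *parser.VectorSelector:
			fmt.Printf("selector %q evaluated with range %s\n", n.Name, evalRange)
			evalRange = 0 // clear it; the next selector may be an instant one
		}
		return nil
	})
}
```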

I did some hacking locally, and this can be solved by moving these kinds of variables into some prefix-tree structure, such that the traversal stores the found value keyed by its position in the tree and the reader then checks for the longest match within that prefix. I hacked this up locally by just ranging over larger slices, but I'll clean it up with a trie of some kind for better performance before merging.
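Below is a minimal sketch of the "store by tree position, read by longest prefix" idea. The names (rangeStore, rangeEntry, longestMatch, collectRanges) are hypothetical, not the PR's actual identifiers, and it uses a plain slice scan rather than a trie, matching the decision in the next comment.

```go
package walkexample

import (
	"sync"
	"time"

	"github.com/prometheus/prometheus/promql/parser"
)

// rangeEntry records an evalRange keyed by the tree position (ancestor
// path plus the node itself) of the MatrixSelector that produced it.
type rangeEntry struct {
	path      []parser.Node
	evalRange time.Duration
}

// rangeStore is safe for concurrent writers and readers during a parallel walk.
type rangeStore struct {
	mu      sync.Mutex
	entries []rangeEntry
}

func (s *rangeStore) set(path []parser.Node, r time.Duration) {
	s.mu.Lock()
	defer s.mu.Unlock()
	// Copy the path: the walker may reuse the backing slice.
	s.entries = append(s.entries, rangeEntry{
		path:      append([]parser.Node(nil), path...),
		evalRange: r,
	})
}

// longestMatch returns the evalRange whose stored path is the longest
// prefix of the reader's path, i.e. the closest enclosing MatrixSelector.
func (s *rangeStore) longestMatch(path []parser.Node) (time.Duration, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	best, bestLen, ok := time.Duration(0), -1, false
	for _, e := range s.entries {
		if len(e.path) <= len(path) && len(e.path) > bestLen && isPrefix(e.path, path) {
			best, bestLen, ok = e.evalRange, len(e.path), true
		}
	}
	return best, ok
}

func isPrefix(prefix, path []parser.Node) bool {
	for i := range prefix {
		if path[i] != prefix[i] {
			return false
		}
	}
	return true
}

// collectRanges shows how the store plugs into a walk: the writer keys by
// the MatrixSelector's own position, and the nested VectorSelector (whose
// ancestor path contains that MatrixSelector) finds it via longest match.
func collectRanges(expr parser.Expr, store *rangeStore) {
	parser.Inspect(expr, func(node parser.Node, path []parser.Node) error {
		switch n := node.(type) {
		case *parser.MatrixSelector:
			store.set(append(append([]parser.Node(nil), path...), n), n.Range)
		case *parser.VectorSelector:
			if r, found := store.longestMatch(path); found {
				_ = r // evaluate this selector over range r
			}
		}
		return nil
	})
}
```

Because a parallel walk still guarantees that a node's ancestors are processed before it, the MatrixSelector's entry is always present by the time the nested VectorSelector looks it up, regardless of the order in which sibling subtrees run.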

@jacksontj (Owner, Author) commented:

After doing some benchmarks I decided against bothering with a trie: the performance benefit only matters when a query has thousands of matrix selectors, at which point performance will be abysmal anyway (that would be a massive query), so there is no need for the additional complexity.

jacksontj marked this pull request as ready for review on March 9, 2023 at 22:41
jacksontj merged commit 6225f43 into master on March 9, 2023
jacksontj deleted the parallel_walk branch on March 9, 2023 at 22:41
@jacksontj (Owner, Author) commented:

Fixes #433

Successfully merging this pull request may close these issues.

Parallel ast.Walk has errors with BinaryExpr