Load generator tests on the last cache #25127
Comments
I'm guessing that the amount of data you write into the table before running the query test is going to have a significant impact on the performance of the SQL queries. It would be interesting to run it with some different amounts of historical data. |
Yes, definitely. I would also like to compare performance/memory usage between the two when using higher cardinality key columns. |
Oh, one detail omitted:
|
Will be interesting to see what it looks like if you have 100 values you're pulling back like |
I filed #25174 to add support for more filter
💯 not looking forward to composing that. |
Update: I figured out the general query structure to select the N most recent values from multiple time series. Note that there is an open issue in DataFusion to optimize such queries (see apache/datafusion#6899), so we should re-run this analysis once that optimization is implemented. The issue description was updated with the relevant details. |
Test: Last 5 Values - 1M Cardinality - Grouping by tag/key
Setup
In this test I compared the following two queries under the same write load. In either case, the query test was run for 60 minutes, with the same write load running in parallel. Each query test uses a single querier that fires queries sequentially, one after the other, as fast as it can.
1. No Cache (Yellow)
WITH ranked AS (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY t3 ORDER BY time DESC) AS rn
FROM data
WHERE t2 = $t2_id
AND time > now() - INTERVAL '5 minutes'
)
SELECT * FROM ranked
WHERE rn <= 5
2. With Cache (Blue)
SELECT * FROM last_cache('data') WHERE t2 = $t2_id
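The single sequential querier described in the setup can be sketched roughly as follows. This is a minimal illustration and not the actual load generator code: `run_sequential_querier`, its parameters, and the placeholder query are all hypothetical names introduced here.

```python
import time
from typing import Callable, List


def run_sequential_querier(
    run_query: Callable[[], object],
    duration_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> List[float]:
    """Fire queries back to back until duration_s elapses.

    Returns one latency (in clock units, seconds by default) per query.
    The clock is injectable so the loop can be exercised deterministically.
    """
    latencies: List[float] = []
    deadline = clock() + duration_s
    while clock() < deadline:
        start = clock()
        run_query()  # hypothetical: would issue one SQL or last_cache query
        latencies.append(clock() - start)
    return latencies


# Placeholder "query" standing in for a real request to the server.
sample_latencies = run_sequential_querier(lambda: sum(range(1000)), duration_s=0.01)
```

Because the next query only starts when the previous one returns, the query rate in such a loop is bounded by the query latency, which is worth keeping in mind when comparing the two scenarios.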
Write Load / Data Spec
Results
Discussion
Because of the data layout, there should be ~5k The result above isn't awesome, but I don't want to over-analyze this yet; here are some of the next tests I plan to try:
|
Test: Last Value - 1M Cardinality - Grouping by tag/key
Setup
This test setup is almost the same as the previous one (#25127 (comment)), except that the SQL query was changed to return a single value, and the cache was configured to store a count of only one value.
1. No Cache (Yellow)
WITH ranked AS (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY t3 ORDER BY time DESC) AS rn
FROM data
WHERE t2 = $t2_id
AND time > now() - INTERVAL '5 minutes'
)
SELECT * FROM ranked
WHERE rn <= 1 -- <- changed to 1
2. With Cache (Blue)
SELECT * FROM last_cache('data') WHERE t2 = $t2_id
Write Load / Data Spec
Results
This is a zoomed-in view of the (2.) With Cache scenario to show the range of query latencies a bit more clearly:
Discussion
It is still difficult to see, but the query latency when using the cache is almost a square wave, toggling between ~5ms and ~100ms. It would be worth profiling to see what might be causing the slowness during the 100ms periods. Although it's gradual, the (1.) No Cache query latency is degrading over time, and did not saturate during the test, while the (2.) With Cache queries look stable and also reduce the load on the CPU by a factor of 5-8x. |
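Before profiling, one way to characterize the square-wave pattern would be to collapse the recorded latencies into alternating fast/slow phases and look at how long each phase lasts. The sketch below assumes latencies are available as a list of millisecond values; the `latency_phases` helper, the 50ms threshold, and the sample values are all made up for illustration and are not measurements from the test.

```python
from itertools import groupby
from typing import List, Tuple


def latency_phases(latencies_ms: List[float], threshold_ms: float) -> List[Tuple[str, int]]:
    """Collapse a latency series into consecutive (phase, run_length) pairs."""
    labels = ("slow" if x >= threshold_ms else "fast" for x in latencies_ms)
    return [(label, len(list(run))) for label, run in groupby(labels)]


# Synthetic samples, NOT real results: a few ~5ms queries, then ~100ms ones.
samples = [5, 6, 5, 98, 102, 101, 99, 5, 4]
phases = latency_phases(samples, threshold_ms=50)
```

If the slow phases turn out to have a regular length, that period could be compared against periodic work on the server side as a starting point for the profiling.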
The SQL query you're executing is only looking back 5 minutes, which is almost never what users do when they're looking for last values. That only works if they've actually written a value in the last 5 minutes. Also, your SQL query isn't grouping by the actual tag, so I don't believe you're actually pulling in the last value for every unique t3 in the given t2 group. |
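The concern about the look-back window can be shown with a small, self-contained sketch. It uses Python's built-in `sqlite3` module purely as a stand-in SQL engine (not InfluxDB or DataFusion), integer seconds instead of timestamps, and a made-up table with two series; the `last_values` helper and all data are hypothetical.

```python
import sqlite3

# Same ranked-query shape as in the tests above, with the look-back as a parameter.
QUERY = """
WITH ranked AS (
    SELECT t3, time, value,
           ROW_NUMBER() OVER (PARTITION BY t3 ORDER BY time DESC) AS rn
    FROM data
    WHERE t2 = :t2 AND time > :now - :lookback_s
)
SELECT t3, time, value FROM ranked WHERE rn <= 1 ORDER BY t3
"""


def last_values(conn, t2, now, lookback_s):
    """Return the last value per t3 within the look-back window."""
    params = {"t2": t2, "now": now, "lookback_s": lookback_s}
    return conn.execute(QUERY, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (t2 TEXT, t3 TEXT, time INTEGER, value REAL)")
now = 100_000  # arbitrary "current time" in seconds
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?, ?)",
    [
        ("g1", "a", now - 120, 1.0),
        ("g1", "a", now - 60, 2.0),    # series a wrote one minute ago
        ("g1", "b", now - 3600, 3.0),  # series b last wrote an hour ago
    ],
)

narrow = last_values(conn, "g1", now, lookback_s=5 * 60)       # 5 minutes
wide = last_values(conn, "g1", now, lookback_s=4 * 60 * 60)    # 4 hours
```

With the 5 minute window only series `a` comes back, while the 4 hour window also returns the last value for series `b`, so the narrow window only matches a last-value lookup when every series has written recently.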
Oh, maybe the |
Yeah, the 5 minute look back might be a bit optimistic in the general sense, I figured it was acceptable given the write load should have values written in that time. I could try with a more conservative value - I originally was going to use the same as the cache TTL, which is 4 hours, but wanted to give the SQL a fighting chance 😅.
Yeah, it is a bit odd, but the I am currently running a similar test to the above, but using an |
Definitely worth a look on the profiling. I think the new WAL and write buffer refactor might have a big impact here because the write locking behavior is going to change quite a bit. |
Once the last cache has been implemented we will want to run a series of load generator tests to see how it performs compared to SQL queries that would be used in its absence.
The only setup required in the load generator should be to create the specs that exercise the queries below. We could have the load generator create the last cache, but it will probably be easy enough to write some data in and create the cache using the CLI before running the load gen tests.
Scenarios
Here are some scenarios we want to test.
1. Basic last value queries
In this case, the cache does not need to be keyed on any columns.
Using SQL
Using Last Cache
2. Multi-level tag/key hierarchy
In this case, the data has a hierarchical tag set, e.g., region/host/cpu. The last cache is keyed using the hierarchy region -> host -> cpu, and we want to compare query performance when using different combinations of predicates.
Using SQL
In general, when attempting to pull the N most recent values for a set of time series, we can use a combination of a ranking function, e.g., ROW_NUMBER(), and PARTITION BY like so:
Here, predicate can be, e.g.,
host IN ('_', '_', ...)
region IN ('_', '_', ...)
host = '_' AND cpu = '_'
host = '_'
region = '_'
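As a rough illustration of how these predicates slot into a ranked query, here is a sketch using Python's built-in `sqlite3` module as a stand-in engine. The table layout, the `usage` field, the `last_n` helper, and the choice to partition by the full `region, host, cpu` key are assumptions made for this example, not taken from the issue.

```python
import sqlite3


def last_n(conn, predicate, params=(), n=1):
    """Return the n most recent rows per (region, host, cpu) matching predicate.

    `predicate` is interpolated into the SQL text, so it must come from trusted
    code; the values it compares against are passed separately via `params`.
    """
    query = f"""
        WITH ranked AS (
            SELECT region, host, cpu, time, usage,
                   ROW_NUMBER() OVER (
                       PARTITION BY region, host, cpu ORDER BY time DESC
                   ) AS rn
            FROM data
            WHERE {predicate}
        )
        SELECT region, host, cpu, time, usage FROM ranked
        WHERE rn <= ?
        ORDER BY region, host, cpu, time DESC
    """
    return conn.execute(query, (*params, n)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (region TEXT, host TEXT, cpu TEXT, time INTEGER, usage REAL)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?, ?, ?)",
    [
        ("us", "h1", "c0", 1, 10.0),
        ("us", "h1", "c0", 2, 11.0),
        ("us", "h1", "c1", 1, 20.0),
        ("us", "h2", "c0", 3, 30.0),
        ("eu", "h3", "c0", 1, 40.0),
    ],
)

by_host = last_n(conn, "host = ?", ("h1",))
by_regions = last_n(conn, "region IN (?, ?)", ("us", "eu"))
by_host_cpu = last_n(conn, "host = ? AND cpu = ?", ("h1", "c0"), n=2)
```

Keeping the predicate as the only varying piece makes it straightforward to generate one load generator spec per predicate shape while holding the rest of the query fixed.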
Using Last Cache
Here, the last cache is doing the work for us, so we really just need to provide the above predicates like so: