Hybrid storage format #77
Comments
I ran a benchmark in my local environment, and this hybrid format performs better than the old one. The tables below summarize the read cost of each format (each is read ten times).
Hybrid
Old
How it was tested
First, my test env is
Data is generated using tsbs, with config below:
    data-source:
      simulator:
        debug: 0
        initial-scale: "0"
        log-interval: 10s
        max-data-points: "0"
        max-metric-count: "1"
        scale: "50000"
        seed: 100
        timestamp-start: "2022-07-02T00:00:00Z"
        timestamp-end: "2022-07-02T01:00:00Z"
        use-case: devops-generic
      type: SIMULATOR
This means the generated data source is
Data sample
Next step
Rebase onto upstream master and apply this hybrid format to string columns (currently only fixed-length columns are tested).
Checklist
There are some more things that need to be done for good performance. I'm leaving them here as a note for myself, and I hope others who are interested can get involved.
Write
Read
Misc
This checklist is outdated.
Description
For now, data is by default ordered by the timestamp column within one SST file (currently in Parquet format), with each tag/field being a column. This design is good for OLAP queries, since they only scan the relevant columns, and CeresDB can take advantage of this ordering to filter out unnecessary files, further reducing IO.
But for time-series use cases like IoT or DevOps, this may not be the best format. Those queries typically group their results by series ID (or device ID) first, then by timestamp. This ordering doesn't match the ordering of the SST, so many random IOs are incurred.
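To make the mismatch concrete, here is a minimal sketch in Python (not CeresDB code). It treats an SST as a flat list of rows and counts how many separate contiguous runs must be read to fetch one series. The device names and the `count_runs` helper are made up for illustration, and a real Parquet file is columnar with row groups, so this is only a simplified model of the access pattern.

```python
from itertools import product


def count_runs(rows, series_id):
    """Count contiguous runs of rows that belong to series_id.

    Each run can be fetched with one sequential read, so more runs
    roughly means more seeks (random IO) in this simplified model.
    """
    runs = 0
    prev_match = False
    for sid, _ts in rows:
        match = sid == series_id
        if match and not prev_match:
            runs += 1
        prev_match = match
    return runs


# Three hypothetical devices with four timestamps each.
rows = list(product(["dev-a", "dev-b", "dev-c"], range(4)))

# Ordering used by the SST today: timestamp first.
by_timestamp = sorted(rows, key=lambda r: (r[1], r[0]))
# Ordering a per-device query would prefer: series first.
by_series = sorted(rows, key=lambda r: (r[0], r[1]))

runs_ts = count_runs(by_timestamp, "dev-a")
runs_series = count_runs(by_series, "dev-a")
print(runs_ts, runs_series)  # prints "4 1"
```

With timestamp-first ordering, the rows of `dev-a` are interleaved with every other device, so the number of runs grows with the number of timestamps. With series-first ordering, the same rows form a single run.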
A general approach is to store the data twice: one copy ordered by timestamp first, and the other ordered by series ID first.
Obviously this isn't very cost-effective, and it requires some replication algorithm to keep the two copies in sync, which is very error-prone. It would be best if we could solve this ordering issue in one format.
Proposal
This issue proposes one potential hybrid format (OLAP and time-series):
In the above schema, instead of storing timestamps row by row, we put all timestamps of one device ID in a single array, and the corresponding values are also stored as arrays, so we can easily map between them. The table is ordered by device ID.
In this way, we can avoid random IO when querying one specific device, since its data is stored together. This format is also beneficial for OLAP queries, since we can use min/maxTime to help the reader filter out unnecessary chunks.
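The idea can be sketched in a few lines of Python. This is only an illustration under assumptions, not the actual CeresDB implementation: the field names (`device_id`, `min_time`, `max_time`, `timestamps`, `values`) and the helpers `to_hybrid` and `scan` are hypothetical, and the min/max time range is kept per device row here rather than per chunk to keep the example short.

```python
from collections import defaultdict


def to_hybrid(rows):
    """Collapse (device_id, timestamp, value) rows into one row per device.

    Timestamps and values are stored as parallel arrays, so the i-th value
    belongs to the i-th timestamp. The output is ordered by device ID.
    """
    grouped = defaultdict(list)
    for device_id, ts, value in rows:
        grouped[device_id].append((ts, value))

    hybrid = []
    for device_id in sorted(grouped):
        points = sorted(grouped[device_id])  # order by timestamp within a device
        timestamps = [ts for ts, _ in points]
        values = [v for _, v in points]
        hybrid.append({
            "device_id": device_id,
            "min_time": timestamps[0],
            "max_time": timestamps[-1],
            "timestamps": timestamps,
            "values": values,
        })
    return hybrid


def scan(hybrid, start, end):
    """Yield (device_id, ts, value) with start <= ts < end.

    Rows whose [min_time, max_time] range cannot overlap the query
    range are skipped without touching their arrays.
    """
    for row in hybrid:
        if row["max_time"] < start or row["min_time"] >= end:
            continue
        for ts, value in zip(row["timestamps"], row["values"]):
            if start <= ts < end:
                yield row["device_id"], ts, value


rows = [("dev-b", 10, 0.5), ("dev-a", 20, 1.5), ("dev-a", 10, 1.0), ("dev-c", 100, 9.0)]
hybrid = to_hybrid(rows)
result = list(scan(hybrid, 0, 50))
```

Here `dev-c` is skipped entirely by the time-range check, while the points of `dev-a` are read from one contiguous row.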
Additional context
Some references