Hybrid storage format #77

Closed
jiacai2050 opened this issue Jun 30, 2022 · 4 comments
Labels: A-analytic-engine (Area: Analytic Engine), feature (New feature or request)

Comments

@jiacai2050
Contributor

jiacai2050 commented Jun 30, 2022

Description

For now, data within one SST file (currently in Parquet format) is ordered by the timestamp column by default, with each tag/field stored as a column.

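| Timestamp | Device ID | Status Code | Tag 1 | Tag 2 |
|-----------|-----------|-------------|-------|-------|
| 12:01     | A         | 0           | v1    | v1    |
| 12:01     | B         | 0           | v2    | v2    |
| 12:02     | A         | 0           | v1    | v1    |
| 12:02     | B         | 1           | v2    | v2    |
| 12:03     | A         | 0           | v1    | v1    |
| 12:03     | B         | 0           | v2    | v2    |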

This design is good for OLAP queries, since it only scans the relevant columns, and CeresDB can take advantage of this ordering to skip unnecessary files, reducing IO further.

But for time-series use cases like IoT or DevOps, this may not be the best format. Those queries typically group their results first by series ID (or device ID), then by timestamp. This ordering doesn't match the ordering in the SST, so many random IOs are incurred.
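To make the mismatch concrete, here is a minimal Python sketch that counts how many separate contiguous runs one device's rows occupy under each ordering, using the six sample rows from the table above. The helper name `count_runs` is hypothetical, and the run count is only a rough proxy for random IO, not a measurement of CeresDB.

```python
# Sample rows from the table above: (timestamp, device_id).
rows = [
    ("12:01", "A"), ("12:01", "B"),
    ("12:02", "A"), ("12:02", "B"),
    ("12:03", "A"), ("12:03", "B"),
]


def count_runs(ordered_rows, device_id):
    """Count contiguous runs of rows that belong to device_id (hypothetical helper)."""
    runs = 0
    previous_matches = False
    for _, device in ordered_rows:
        matches = device == device_id
        if matches and not previous_matches:
            runs += 1  # a new run starts here, i.e. one more seek
        previous_matches = matches
    return runs


# Current layout: ordered by timestamp first.
by_timestamp = sorted(rows, key=lambda r: (r[0], r[1]))
# Series-first layout: ordered by device first.
by_device = sorted(rows, key=lambda r: (r[1], r[0]))

runs_timestamp_order = count_runs(by_timestamp, "A")
runs_device_order = count_runs(by_device, "A")
print(runs_timestamp_order, runs_device_order)
```

With timestamp-first ordering, device A's rows are interleaved with device B's, so reading one device touches one run per timestamp; with device-first ordering they form a single run.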

A common approach is to store the data twice: one copy ordered by timestamp first, and the other ordered by series ID first.

This isn't very cost-effective, and it requires some replication algorithm to keep the two copies in sync, which is very error-prone. It would be best to solve this ordering issue within one format.

Proposal

This issue proposes one potential hybrid format (OLAP and time-series):

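| Device ID | Timestamp           | Status Code | Tag 1 | Tag 2 | minTime | maxTime |
|-----------|---------------------|-------------|-------|-------|---------|---------|
| A         | [12:01,12:02,12:03] | [0,0,0]     | v1    | v1    | 12:01   | 12:03   |
| B         | [12:01,12:02,12:03] | [0,1,0]     | v2    | v2    | 12:01   | 12:03   |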

In the above schema, instead of storing timestamps row by row, we put all timestamps of a device ID into one array, and the corresponding values are also stored as arrays, so we can easily map between them by position. The table is ordered by device ID.
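The transformation described here can be sketched in plain Python. This is only an illustration of the idea, not CeresDB code: the function name `to_hybrid` and the dictionary keys are hypothetical, and it assumes tag values are constant per device, as they are in the sample.

```python
from collections import defaultdict

# Row-oriented sample: (timestamp, device_id, status_code, tag1, tag2).
rows = [
    ("12:01", "A", 0, "v1", "v1"),
    ("12:01", "B", 0, "v2", "v2"),
    ("12:02", "A", 0, "v1", "v1"),
    ("12:02", "B", 1, "v2", "v2"),
    ("12:03", "A", 0, "v1", "v1"),
    ("12:03", "B", 0, "v2", "v2"),
]


def to_hybrid(rows):
    """Collapse rows into one record per device (illustrative only)."""
    grouped = defaultdict(list)
    for ts, device, status, tag1, tag2 in rows:
        grouped[device].append((ts, status, tag1, tag2))

    hybrid = []
    for device in sorted(grouped):        # the table is ordered by device ID
        points = sorted(grouped[device])  # order each device's points by time
        timestamps = [p[0] for p in points]
        hybrid.append({
            "device_id": device,
            "timestamps": timestamps,
            "status_codes": [p[1] for p in points],  # same positions as timestamps
            "tag1": points[0][2],  # assumption: tags are constant per device
            "tag2": points[0][3],
            "min_time": timestamps[0],
            "max_time": timestamps[-1],
        })
    return hybrid


hybrid_rows = to_hybrid(rows)
for record in hybrid_rows:
    print(record)
```

Because the timestamp and value arrays share positions, the i-th status code of a device always belongs to its i-th timestamp.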

In this way, we can avoid random IO when querying one specific device, since its data is stored together. This format is also beneficial for OLAP queries, since we can use minTime/maxTime to help the reader filter out unnecessary chunks.
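A rough sketch of how a reader might use minTime/maxTime is shown below. It is a simplification under stated assumptions: records are plain dictionaries rather than Parquet row groups, and the names `overlaps` and `scan` are hypothetical.

```python
# Hybrid records mirroring the proposal table (tag columns omitted for brevity).
hybrid_rows = [
    {"device_id": "A", "timestamps": ["12:01", "12:02", "12:03"],
     "status_codes": [0, 0, 0], "min_time": "12:01", "max_time": "12:03"},
    {"device_id": "B", "timestamps": ["12:01", "12:02", "12:03"],
     "status_codes": [0, 1, 0], "min_time": "12:01", "max_time": "12:03"},
]


def overlaps(record, start, end):
    """True if the record's [min_time, max_time] intersects [start, end]."""
    return record["min_time"] <= end and record["max_time"] >= start


def scan(records, start, end, device_id=None):
    """Return (device, timestamp, status) points that fall in [start, end]."""
    out = []
    for rec in records:
        if device_id is not None and rec["device_id"] != device_id:
            continue
        if not overlaps(rec, start, end):
            continue  # whole record skipped without reading its arrays
        for ts, status in zip(rec["timestamps"], rec["status_codes"]):
            if start <= ts <= end:
                out.append((rec["device_id"], ts, status))
    return out


device_b_points = scan(hybrid_rows, "12:02", "12:03", device_id="B")
late_points = scan(hybrid_rows, "13:00", "14:00")
print(device_b_points)
print(late_points)
```

The second query falls entirely outside every record's time range, so no array is opened at all; that is the pruning the min/max columns are meant to enable.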

Additional context

Some references

jiacai2050 added the feature (New feature or request) label Jun 30, 2022
@jiacai2050
Contributor Author

jiacai2050 commented Jul 18, 2022

I ran a benchmark in my local environment. The hybrid format reads roughly twice as fast as the old one.

The tables below summarize the read cost for each format (each is read ten times).

Hybrid

| cost  | row nums |
|-------|----------|
| 615ms | 10367743 |
| 576ms | 10367743 |
| 585ms | 10367743 |
| 511ms | 10367743 |
| 558ms | 10367743 |
| 569ms | 10367743 |
| 568ms | 10367743 |
| 555ms | 10367743 |
| 557ms | 10367743 |
| 584ms | 10367743 |

Old

| cost   | row nums |
|--------|----------|
| 1304ms | 10367743 |
| 1283ms | 10367743 |
| 1276ms | 10367743 |
| 1286ms | 10367743 |
| 1275ms | 10367743 |
| 1272ms | 10367743 |
| 1273ms | 10367743 |
| 1275ms | 10367743 |
| 1275ms | 10367743 |
| 1270ms | 10367743 |
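Averaging the ten runs listed above gives a single-number comparison. The snippet below only summarizes the timings already reported in this comment; the variable names are mine.

```python
from statistics import mean

# Read costs in milliseconds, copied from the two tables above.
hybrid_ms = [615, 576, 585, 511, 558, 569, 568, 555, 557, 584]
old_ms = [1304, 1283, 1276, 1286, 1275, 1272, 1273, 1275, 1275, 1270]

hybrid_mean = mean(hybrid_ms)
old_mean = mean(old_ms)
speedup = old_mean / hybrid_mean

print(f"hybrid: {hybrid_mean:.1f} ms, old: {old_mean:.1f} ms, speedup: {speedup:.2f}x")
# prints "hybrid: 567.8 ms, old: 1278.9 ms, speedup: 2.25x"
```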

How it was tested

Firstly, my test env is

Data is generated using tsbs, with the config below

data-source:
  simulator:
    debug: 0
    initial-scale: "0"
    log-interval: 10s
    max-data-points: "0"
    max-metric-count: "1"
    scale: "50000"
    seed: 100
    timestamp-start: "2022-07-02T00:00:00Z"
    timestamp-end: "2022-07-02T01:00:00Z"
    use-case: devops-generic
  type: SIMULATOR

This means the generated data source has:

  • one metric over one hour, with a 10s point interval and 50k series in total.

Data sample

{
  "arch": "x86",
  "region": "ap-southeast-1",
  "service_environment": "test",
  "team": "SF",
  "value": 473.0,
  "service_version": "0",
  "datacenter": "ap-southeast-1b",
  "timestamp": 1656720000000,
  "os": "Ubuntu16.04LTS",
  "hostname": "host_3349",
  "rack": "80",
  "service": "6",
  "tsid": 1123006250071095
}

Next step

Rebase onto upstream master and apply this hybrid format to string columns (currently only fixed-length columns are tested).

@jiacai2050
Contributor Author

jiacai2050 commented Aug 17, 2022

This was referenced Aug 18, 2022
@jiacai2050
Contributor Author

jiacai2050 commented Aug 24, 2022

A few more things need to be done for good performance. I'm leaving them here as a note for myself, and I hope others who are interested can get involved.

Write

  • Support variable-length type for ListArray
  • Support table without tsid, only a row id is required

Read

  • Support basic read (without any filter pushdown), WIP
  • Support timestamp column filter; some extra columns may be needed
  • Support variable-length type for ListArray
  • Enable a total ordering, to support query with pagination
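As one possible shape for the timestamp column filter, the sketch below slices a single row's arrays by time range with binary search. It assumes the timestamps inside a row are sorted, which this issue does not state explicitly; `slice_by_time` is a hypothetical helper, and only the first timestamp and value come from the data sample above (the rest are made up for illustration).

```python
from bisect import bisect_left, bisect_right


def slice_by_time(timestamps, values, start, end):
    """Keep only the points whose timestamp lies in [start, end].

    Assumes `timestamps` is sorted ascending and `values` is aligned with it.
    """
    lo = bisect_left(timestamps, start)
    hi = bisect_right(timestamps, end)
    return timestamps[lo:hi], values[lo:hi]


# One collapsed row: millisecond timestamps at a 10s interval.
ts = [1656720000000, 1656720010000, 1656720020000, 1656720030000]
vals = [473.0, 480.5, 468.2, 471.9]  # only 473.0 is from the sample; others are made up

selected_ts, selected_vals = slice_by_time(ts, vals, 1656720010000, 1656720020000)
print(selected_ts, selected_vals)
```

The same slice bounds apply to every value array of the row, which is what keeps timestamps and values aligned after filtering.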

Misc

  • Ensure the row group size is large enough, in case the list length within the same row_id is too small
  • Use dictionary array type to represent non-collapsible columns to reduce memory usage.
  • Benchmark the two formats
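The dictionary idea from the Misc list can be illustrated in plain Python: each distinct value is stored once, and the column keeps only small integer indices. This shows the general technique, not Arrow's dictionary array API; the function names are hypothetical and the "us-west-2" value is made up for the example.

```python
def dictionary_encode(values):
    """Return (dictionary, indices), where values[i] == dictionary[indices[i]]."""
    dictionary = []
    index_of = {}
    indices = []
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)  # first time this value is seen
            dictionary.append(v)
        indices.append(index_of[v])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Rebuild the original column from the dictionary and indices."""
    return [dictionary[i] for i in indices]


# A column with many repeated strings, e.g. a region tag.
column = ["ap-southeast-1", "ap-southeast-1", "us-west-2", "ap-southeast-1"]

dictionary, indices = dictionary_encode(column)
print(dictionary, indices)
```

The saving grows with the number of repeats: long strings are stored once, while each row only costs one index.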

jiacai2050 changed the title from "[PoC] Hybrid storage format" to "Hybrid storage format" Aug 24, 2022
@chunshao90
Contributor

Checklist

This checklist is outdated.
