Catalog API #1354
-
Are the versions of a table a singly linked list, or could they form a DAG?
-
I think it's a single chain, like git.
-
How does DDL work?
-
You're right. None of these cases needs a DAG history; I was overthinking it.
-
"v0 <- v1 <- v2 " is enought for time travel.
The above figure may describe:
Transaction operation log will contain an item/entry to record the |
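A minimal Rust sketch of such a singly linked version chain, where time travel walks backwards from the head; all type and field names here are hypothetical, not Datafuse's actual API:

```rust
// Hypothetical sketch of a linear table-version chain ("v0 <- v1 <- v2").
// Names (TableSnapshot, VersionChain, ...) are illustrative only.

type VersionId = u64;

#[derive(Debug, Clone)]
struct TableSnapshot {
    version: VersionId,
    // The previous snapshot in the chain; None for v0.
    prev: Option<VersionId>,
    // Immutable metadata captured at this version (schema digest, file list, ...).
    schema_digest: String,
}

struct VersionChain {
    snapshots: std::collections::HashMap<VersionId, TableSnapshot>,
    // The latest version is the head of the chain.
    head: VersionId,
}

impl VersionChain {
    /// Time travel: walk backwards from the head until `target` is found.
    fn snapshot_at(&self, target: VersionId) -> Option<&TableSnapshot> {
        let mut cur = Some(self.head);
        while let Some(v) = cur {
            let snap = self.snapshots.get(&v)?;
            if snap.version == target {
                return Some(snap);
            }
            cur = snap.prev;
        }
        None
    }
}
```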
-
Roughly speaking, …
Any specific concerns or suggestions (sincerely welcome)?
-
I've just realized something, and I would appreciate it if you could give some feedback:
-
It's a pleasure to discuss with you (and all the kind people who care about Datafuse).

Each version of a given table is immutable, including its schema, partitions, and other metadata. When the computation layer accesses a table without specifying a version, the latest version of the table will be returned. There might be some exceptions, like re-orging a table (background merging); these parts are not fully settled, and we need some deeper thought here. I agree with you that immutability will benefit the cache policy and statistics (at an affordable cost, if the statistics index we are trying to maintain is not "heavy").

I think we should have them eventually, at least the ANSI … As always, suggestions are sincerely welcome.
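A minimal sketch of the "latest version unless specified" rule described above, assuming a simple append-only version list (type names hypothetical):

```rust
// Hypothetical sketch: resolving a table read to an immutable version.
// If the caller does not specify a version, the latest one is returned.

#[derive(Debug, Clone)]
struct TableVersion {
    version: u64,
    // (column name, type) pairs; frozen once the version is created.
    schema: Vec<(String, String)>,
}

struct Catalog {
    // Ordered oldest -> newest; versions are only ever appended.
    versions: Vec<TableVersion>,
}

impl Catalog {
    fn resolve(&self, requested: Option<u64>) -> Option<&TableVersion> {
        match requested {
            // Explicit version: exact lookup (time travel).
            Some(v) => self.versions.iter().find(|t| t.version == v),
            // No version given: hand back the latest snapshot.
            None => self.versions.last(),
        }
    }
}
```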
-
What file format and table format do we want to support?
-
Thanks for the reply.
Correct me if I'm wrong: for each write operation, like inserting a single row with … this scheme would bring serious overhead; that is, users would have to batch the records and ingest them with bulk loads. How do you handle this case?
-
Besides, I have the same question. Maybe we can have our own unified columnar format? As https://www.firebolt.io/performance-at-scale says, they transform external data (e.g. JSON, Parquet, Avro, CSV) into their own data format, F3, through an ETL pipeline.
-
@leiysky I agree, and I have thought about this too. Maybe it is a good idea to borrow some ideas from ClickHouse to define our own file format. My impression is that ClickHouse is very good at filtering data so that queries can be fast. A Parquet file has some stats or summaries for the data it stores; we may want to extend a Parquet file's stats or summaries with the ClickHouse tricks.
-
Yes, I think it's mainly because of the sparse index, which can help skip some data units (depending on the data distribution, the effect can be surprising) using min/max values, Bloom filters, or other structures. It's convenient for us to build different indexes on a data file, even on demand, on the fly. And AFAIK, ORC does have a sparse index in its data format, but I haven't done any benchmarks on it, so just FYI. Anyway, we can choose a major data format for now to bootstrap fuse-store, but eventually there should be a unified data format designed by ourselves.
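A minimal sketch of that kind of data skipping, assuming one min/max entry per data block in the sparse index (names hypothetical; a Bloom filter would play the same role for point lookups):

```rust
// Hypothetical sketch of min/max pruning with a sparse index:
// one entry per data block; blocks whose value range cannot contain
// the predicate value are skipped without being read.

struct BlockIndexEntry {
    block_id: u64,
    min: i64,
    max: i64,
}

/// Return the ids of blocks that may contain `value` for an equality
/// predicate `col = value`; all other blocks are pruned.
fn prune_blocks(index: &[BlockIndexEntry], value: i64) -> Vec<u64> {
    index
        .iter()
        .filter(|e| e.min <= value && value <= e.max)
        .map(|e| e.block_id)
        .collect()
}

fn main() {
    let index = vec![
        BlockIndexEntry { block_id: 0, min: 1, max: 100 },
        BlockIndexEntry { block_id: 1, min: 101, max: 200 },
        BlockIndexEntry { block_id: 2, min: 201, max: 300 },
    ];
    // Only block 1 can contain 150; blocks 0 and 2 are skipped.
    assert_eq!(prune_blocks(&index, 150), vec![1]);
}
```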
-
ClickHouse's MergeTree engine, and the way it organizes its index, is definitely a valuable reference, but IMHO I am afraid it is not "cloud-native" friendly; @zhang2014 @sundy-li may have more insightful comments on this.

We do consider using something like Iceberg's table format, which is more batch-write friendly, as @leiysky mentioned: if small parts keep being ingested without batching, something like ClickHouse's "Too many parts" error may happen (and the metadata grows with each small ingestion, which burdens the meta layer, as Iceberg/Uber mentioned).

Totally agreed with you that the statistics of Parquet and ORC are a kind of sparse index. But I am afraid they cannot be used directly, since the data files are supposed to be stored in object storage, which is not that efficient to access. The sparse index should be maintained by the meta service, and ideally it could be accessed by the computation layers as a normal data source (relation), so that we can do computations with the sparse index in parallel (or even distributed, but that might be too much). Also, we are considering utilizing some fancy data structures to accelerate queries, like BloomRF etc.

May I transfer this issue to a discussion? DataFuse is young, and constructive discussions like this are preferred.
-
Personally, I think we should choose a simple & workable solution to implement first. The sparse index is required here; it can be extracted from Parquet files, or even built directly by ourselves, and written to our metadata service.
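A sketch of that flow, assuming we compute min/max entries ourselves while writing a data block and then register them with the metadata service; the MetaService trait and all field names are illustrative, not an existing API:

```rust
// Hypothetical sketch: build a sparse-index entry while writing a data
// block, then hand it to the metadata service.

struct SparseIndexEntry {
    table: String,
    block_id: u64,
    column: String,
    min: i64,
    max: i64,
    row_count: usize,
}

trait MetaService {
    fn put_index_entry(&mut self, entry: SparseIndexEntry);
}

/// Compute min/max over the column values of a freshly written block
/// and register the resulting entry with the meta service.
fn index_block(
    meta: &mut dyn MetaService,
    table: &str,
    column: &str,
    block_id: u64,
    values: &[i64],
) {
    if let (Some(&min), Some(&max)) = (values.iter().min(), values.iter().max()) {
        meta.put_index_entry(SparseIndexEntry {
            table: table.to_string(),
            block_id,
            column: column.to_string(),
            min,
            max,
            row_count: values.len(),
        });
    }
}
```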
-
@dantengsky Shall we have any catalog object provider like …
-
Summary
Currently the `DataSource` API provides us the essential functionality for manipulating metadata, but as we move toward v0.5, some extra requirements should be fulfilled, including but not limited to (see the sketch below):
- periodically syncing metadata with (from) the meta-store
- per-statement, implicit transactions

Note: in v0.5 we are not going to support distributed/parallel insertion/deletion.
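A hedged sketch of what such an extension could look like; the trait and its methods are hypothetical, not the actual `DataSource` API:

```rust
// Hypothetical sketch of the extra requirements above; none of these
// names belong to the actual DataSource API.

use std::time::Duration;

type TxnId = u64;

trait CatalogExt {
    /// Periodically pull metadata from the meta-store so the local
    /// catalog view does not go stale.
    fn sync_metadata(&mut self) -> Result<(), String>;

    /// Begin an implicit, per-statement transaction.
    fn begin_statement(&mut self) -> Result<TxnId, String>;

    /// Commit (ok == true) or roll back the implicit transaction when
    /// the statement finishes.
    fn finish_statement(&mut self, txn: TxnId, ok: bool) -> Result<(), String>;

    /// How often the background metadata sync should run.
    fn sync_interval(&self) -> Duration {
        Duration::from_secs(10)
    }
}
```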