RFC: Semistructured Columns #54864
I will start working on Variant data type. |
I'd like to point out one of the use-cases we had in #46190, too. Specifically, it would be great to be able to perform INSERT INTO tbl VALUES ('b', '[{"x":{"y":"z"}}]'). Additionally, as the JSON value can represent both a & if we already have |
Please also consider supporting a binary representation of JSON (i.e. a column type and related functions for manipulation), like PostgreSQL's jsonb, YTsaurus's YSON, or MongoDB's BSON. This is slightly off-topic, but somewhat relevant and could probably improve efficiency for unstructured data. |
There is already prior work for that: TLDR: implementation of a binary JSON format is out of scope of this task. |
I have a proposal with a slightly different implementation of the JSON type. In my opinion, it will simplify the implementation and will be quite natural:
This new type will represent a single column with dynamic type. It means that we can insert a column with any type into
Inside
For this new type of column we won't need to know all the inner
And this new
In JSON type we can just use
I don't see any big disadvantages in this approach, and at the same time I think it will simplify the implementation. There will be one difference: if we request a JSON path that doesn't exist in any part, we will not throw an exception that this path doesn't exist, but return a column filled with NULLs. But I think when we deal with dynamic data it can be even more natural. @alexey-milovidov WDYT? |
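A minimal sketch of the behaviour described in this proposal, assuming the proposed JSON type and subcolumn syntax land roughly as discussed (none of the syntax below is final):

-- Sketch only: assumes the proposed JSON type with dynamically typed subcolumns.
create table t (json JSON) engine=MergeTree order by tuple();
insert into t format JSONEachRow {"json" : {"a" : 42}}

select json.a from t;    -- 42, the type is resolved dynamically
select json.b.c from t;  -- the path exists in no part: a column of NULLs, not an exception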
It's quite possible (and common) that JSON can have a lot of unique paths (for example, when a unique user id is used as a key). |
With such an implementation I see two problems for now:
Also from your description it's unclear how to store and merge data types in |
Can you explain it in more detail? I don't fully understand what you mean by selecting the whole JSON column and selecting the intermediate object.
I guess you mean Arrays of nested JSON objects. But actually, I think we can just not store an Array of JSON objects as a set of nested paths as arrays; we can store the path to this array as a Dynamic column, and it will internally have type
In this case it will be more natural to have path

UPD: even if we want to store arrays of objects as paths with arrays that share offsets, what is the difference between using Variant and Dynamic types? In the first case we will have an Array of Variants, in the second an Array of Dynamic columns; I don't see the problem here. I guess if we want to support inserting an array of JSON objects into a JSON column, we should do it this way with shared offsets, but I don't know yet how we are going to implement it, because as we don't use Tuples, we will need to track shared offsets inside the JSON column somehow during serialization/deserialization.
The type will be the same across a single part, so we can just serialize its name in

So, I would say the only difference with the approach of just using the Variant type for paths is that we won't need to know the Variant type and all the paths in advance to be able to read subcolumns, and that we won't need to deal with different Variant types inside the JSON implementation |
As the Variant type is implemented and can be merged soon (I hope), let's discuss implementation details of the JSON type.

Working with a large number of paths. One of the problems we want to solve with the new JSON implementation is to be able to store JSON objects with a large number of different paths. As the number of paths can be arbitrarily large, we want to store only some of them as subcolumns and the others inside a separate String column as a JSON string. So here we need to think about 2 topics:

What paths we will store and how we will serialize/deserialize them in parts. First important note: we cannot store all unique paths in JSON, because this number can be even larger than the number of rows (for example, if some unique ids are used as keys in each row). So, we can only store the set of paths that are stored separately as subcolumns (because we can control the number of these paths and limit it).
The better option is to have a shared set of paths per part and modify the code of merges a bit by implementing special choosing of the resulting empty column, as described in 1.2.

How we will choose what paths will be stored separately and what paths will be stored in the String column. We should choose what paths will be stored as subcolumns in 2 cases:
Here we can have 2 options:
I would prefer the first option, as the second is error-prone and its disadvantages exceed the advantages in my opinion. If there are other options/ideas - feel free to add.
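As a rough illustration of the layout discussed above (a few hot paths stored as real typed subcolumns, everything else re-serialized into a String fallback column), here is a manual analogue using features that exist today; the table and column names are made up for the example:

-- Manual analogue of the proposed layout, not the actual JSON type.
create table json_manual
(
    `a.b` Int64,     -- a frequently used path stored as its own subcolumn
    `a.c` String,    -- another frequently used path
    rest String      -- all remaining paths, serialized back into a JSON string
)
engine=MergeTree order by tuple();

-- Hot paths are read directly; rarely used paths require scanning `rest`:
select `a.b`, JSONExtractString(rest, 'user', 'name') from json_manual;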
During merge we will have multiple JSON columns from different parts with possibly different sets of paths that are stored as subcolumns, and the total number of paths can exceed the limit. In this case we should decide what paths should remain as subcolumns and what paths should be serialized into the String column. This logic depends on how we are going to store and serialize paths. Let's say we use option 1.1, where we have a shared set of paths per part and modify the code of merges a bit by implementing special choosing of the resulting empty column. How we can choose, from N columns with N different sets of paths, what the resulting set of paths will be:
The third option would be the best, and it seems not difficult to implement.

Querying subcolumns of JSON. Originally we wanted to use Variant for JSON subcolumns and obtain the type of a subcolumn dynamically during query analysis. But the problem is that we are not able to store the set of all unique paths we have in the JSON column (as it can be larger than the total number of rows), so we will store only the subset of paths that are separated into subcolumns, and the rest of the paths will be packed into a single String column and stored as a JSON string. And to dynamically obtain the type for any subcolumn, in the worst case we would have to do a full scan on this String column to find whether we have this subcolumn or not. To solve this problem I proposed a new type called

New Dynamic type. This type will represent a single column with dynamic type, so it can store values of different types inside and extend its type dynamically. We can think of it as a dynamic

Dynamic column will support subcolumns as type names, similar to

Using Dynamic type for JSON subcolumns. We can say that any subcolumn requested from a JSON column will have type
If a user tries to use subcolumn

It may happen that the user requested a JSON path that doesn't exist in the table; in this case we will just return

But here we have a problem: as we use

Possible solutions:
I would prefer the third variant, as this problem can happen only with JSON, so we should solve it only for JSON. Important question: if we implement the Dynamic type, should we allow users to use it as a separate type, or will we use it only internally for the JSON implementation?

Handling arrays of JSON objects. In JSON data some paths can contain arrays of JSON objects, and we should be able to handle them.
There are 2 ways we can handle arrays of JSON objects and access their nested paths:
So in this case we will have 2 paths

Advantages: no need to add extra logic for extracting paths from objects inside arrays, ability to work with arrays of objects and not with separate paths.
In this case we have 2 suboptions:
So we will have paths:

Advantages: no need to use nested JSON columns, ability to separately access paths in objects from arrays at different levels, we can handle an array of objects at the top level.
So we will have a single path

Advantages: no need to use nested JSON columns, we will have a single path, there should be no problem with handling its type, we can handle an array of objects at the top level.

I am actually not sure which option is the best here. I like the first one because it should be easier to implement and it looks more natural to me, but we need to discuss it. @alexey-milovidov @CurtizJ @KochetovNicolai please add comments/questions/ideas. |
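For what the first option means in terms of functions that exist today, here is a small runnable illustration: the array of objects stays a single value and nested fields are resolved per element (JSONExtractArrayRaw and arrayMap are existing functions; the sample data is made up):

-- Option 1 in terms of existing functions: keep the array of objects as one value
-- and extract nested fields per element instead of flattening them into paths.
select arrayMap(x -> JSONExtractInt(x, 'b'),
                JSONExtractArrayRaw('{"a" : [{"b" : 1, "c" : "str"}, {"b" : 2}]}', 'a')) as b_values;
-- [1, 2]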
@Avogar hi, what is the current progress of the Dynamic data type and the new JSON? Is there an ETA? |
I am working on the Dynamic type right now. ETA - end of February/start of March. After Dynamic is done, I will start working on JSON, but I can't give an ETA for that yet. UPD: due to other tasks, the ETA for Dynamic is moving to the end of March |
@Avogar any updates here? |
The Dynamic implementation is almost ready, but the PR is delayed, because during March we put most of our priority and effort into fixing bugs/fuzzing issues/flaky tests, so I didn't have time to finish it. But don't worry, the PR with Dynamic will come soon and I will start working on the JSON type after that. For now I can show a demo of the Dynamic type: select d, dynamicType(d) from format(JSONEachRow, 'd Dynamic', $$
{"d" : [1, 2, 3]}
{"d" : "Hello"}
{"d" : 42}
{"d" : "2020-01-01"}
{"d" : {"a" : 1, "b" : 2}}
{"d" : null}
$$)
create table test (d Dynamic) engine=MergeTree order by tuple();
insert into test select * from format(JSONEachRow, 'd Dynamic', $$
{"d" : [1, 2, 3]}
{"d" : "Hello"}
{"d" : 42}
{"d" : "2020-01-01"}
{"d" : {"a" : 1, "b" : 2}}
{"d" : null}
$$);
select d, d.Int64, d.`Tuple(a Int64, b Int64)`.a from test;
create table test (map Map(String, Array(Dynamic))) engine=MergeTree order by tuple();
insert into test select map('key1', ['str'::Dynamic, 42::Int64::Dynamic], 'key2', ['2020-01-01'::Date::Dynamic, [1,2,3]::Array(UInt64)::Dynamic]);
insert into test select map('key3', [43::Int64::Dynamic], 'key4', [tuple(1, 2)::Tuple(a Int64, b Int64)::Dynamic, 'str_2'::Dynamic]);
select map.values.`Int64`, map.values.`String`, map.values.`Tuple(a Int64, b Int64)`.a from test;
|
As the Dynamic type is ready and almost merged, let's discuss some implementation details of the JSON type.

Handling arrays of JSON objects. We still didn't discuss how we want to handle arrays of JSON objects. See possible implementations in one of the comments above.

Schema inference. During parsing of a JSON object we will need to infer the data type for each path, and we need to determine what data types we want to support. Here I see 2 possible ways:
I would do 1. by default and all other types from 2. under settings that will be enabled by default (very similar to schema inference from input formats, maybe we can even reuse the settings). It's also good to discuss how we want to handle arrays of values with different data types: should we use an unnamed tuple for them, as it's done in JSON data formats, or do we want to handle them differently (for example, as an array of

How to store paths and values in
|
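A small runnable illustration of the existing input-format schema inference mentioned in the previous comment, which the JSON type could reuse per path (DESC format and the input_format_try_infer_* settings already exist today):

-- Schema inference from input formats, which the JSON type could reuse for each path.
set input_format_try_infer_integers = 1, input_format_try_infer_dates = 1;
desc format(JSONEachRow, $$
{"a" : 42, "b" : "2020-01-01", "c" : [1, 2, 3]}
$$);
-- typically: a Nullable(Int64), b Nullable(Date), c Array(Nullable(Int64))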
Why not consider the approach used for the Map type: store paths and values in two separate array columns? Probably we don't need to read the values column in many cases. |
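A minimal sketch of that Map-like idea with existing types, assuming both columns are kept aligned per row (the table, column names and the use of String values are all illustrative):

-- Paths and values as two parallel arrays; the values column can stay unread
-- when a query only needs to check path existence.
create table json_as_arrays
(
    paths Array(String),   -- sorted list of paths present in the row
    vals  Array(String)    -- serialized values, aligned with `paths`
)
engine=MergeTree order by tuple();

insert into json_as_arrays values (['a.b', 'a.c'], ['42', '"str"']);

select has(paths, 'a.b') from json_as_arrays;            -- existence check, values not read
select vals[indexOf(paths, 'a.b')] from json_as_arrays;  -- value lookup when the path exists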
If we want to allow querying of parent paths, hashing may not work. In your example, consider a query for path

An alternative could be to store the paths in sorted order, which would still allow for binary searching. If we want to use hashes, we could also hash each part of the path instead. For example |
I think we won't allow querying of parent paths as subcolumns, because it would lead to a full scan on each subcolumn request. We don't know all the paths in the part in advance, only the paths that were separated into subcolumns (and this number will be limited), and it means that we can't say whether a requested path is a parent path or not, so we would have to check all paths with such a prefix both in the set of separated paths and in all other paths stored together in some data structure. So, when a user requests

For reading parent paths we can implement a special function. Or we can think of introducing special syntax for reading sub-paths as JSON columns. Like |
We also think that we will implement 2 data structures for paths stored together - one for the in-memory representation, one for serialization in granules in MergeTree. The second one can be more complex and collect data for the whole granule; maybe for each granule we will write something like a tree with all paths in this granule. We are discussing it. My simple structure with hashes may be used for the in-memory representation, or we will come up with something better. |
Thanks for sharing this RFC! It sounds like you're working towards the right use cases. For us, we want to use this feature for querying semistructured JSON logs. As a result, the implementation detail that's most important to us is to be able to efficiently filter rows at query time, so storing the type with the type-specific binary row format is highly useful to our use case. I can't speak for others, but querying parents of subcolumns is not a high priority for us. The way we plan to use this feature is to have both a JSON column that infers the data types and can be used for fast filtering, while still storing the raw data in a separate column as a string, so if we want to query a substructure instead of a leaf of the JSON object, we can fallback to that raw JSON string column. Thanks again for sharing the direction and thoughts on this! Very excited for this feature! 🎉 |
It would be nice to not have special syntax to select objects. For example if I inserted the following data:
It would feel weird to me if
I would only know I am missing data if I instead query for |
The syntax is less important, what I am trying to say is that I think the case of querying for a parent and also retrieving the children should not be ignored when considering the data structure. |
It may feel weird, yes, but it will work like this. The whole point of the new JSON implementation is to be able to query separate paths fast enough without reading most of the data. Reading all children paths for a requested path will require a full scan of all the data in the worst case, even if the path doesn't have any children (because we won't know all paths stored in the part in advance, only the paths that are stored as separate subcolumns in separate streams). In your example:
We will store it as 3 subcolumns (some of them may be stored as separate subcolumns with an individual stream for fast reading, some may be stored together with other paths in a single stream in some data structure).
And in this approach, reading |
For my use case this will be less helpful but I can understand that it is a development trade off you are making and this is all experimental anyways :) |
If you store the paths sorted, you only need to binary search them, which of course is worst case O(log n). |
@Avogar Thanks for the detailed explanation, is it possible to share some ETA around the new JSON type? thanks! |
Right now the ETA is the first half of July, so most likely the new JSON type will be available in version 24.7. |
As I understood, we need to be able to quickly check if the path exists. As a value, we can use String with the serialized content. |
Yes, I already understood that my suggestion is overcomplicated, and a Map with sorted paths and binary serialized values will be enough. But with one addition: for serialization in wide parts, before each granule I will write the sorted list of all paths in this granule and replace the column with paths by a column of indexes into this sorted list (so we will be able to skip the granule by looking at the list of paths and won't iterate over each row to find the requested path; it will be efficient especially when we read more than 1 path and can use the substreams cache). And I am already working on it. |
@Avogar I saw your timeline for JSON Object is first half of July. Do we have an updated timeline? |
The timeline didn't change. I plan to create a PR in around 2 weeks |
There is currently an alias JSON for Object('JSON'). Given the new data type name will also be JSON, we might need to put it behind a compatibility setting. cc: @Avogar |
The feature is postponed to 24.8. The review process takes longer than usual because we are making a lot of efforts to stabilize the CI and fix existing bugs right now. |
(This is an alternative to the experimental JSON data type, aiming to replace it and address its drawbacks).
Implementation proposal
1. Dynamic columns.
A table (IStorage) can tell that it supports dynamic columns (bool canHaveDynamicColumns()). If this is the case, a query can reference columns or subcolumns not necessarily contained in the table's definition.
The table should provide a method (e.g., getColumnType) to dynamically obtain the type of a column or subcolumn (or expression?). As a bonus and a demo, we can allow a variant of Merge table that will dynamically extend if new tables appear. This Merge table can even have an empty table structure (no fixed columns).

2. Variant data type and column.
A new column type works as a discriminated union of nested columns. For example, Variant(Int8, Array(String)) has every value either Int8 or Array(String). It is serialized to multiple streams: a stream for every option and a stream with option numbers (discriminator).

This column can be accessed in the following ways:
c::Int8: we will read every subcolumn and convert if needed.
c.Int8: we will read only the requested subcolumn as Nullable (this is unnecessary).
c: if the column is requested as is, we will find the least common type of the variant types (and throw an exception if it does not exist) and convert it.
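A brief sketch of how this could look in practice, based on the Variant type as implemented in the experimental PR; the setting name is taken from that implementation and may change:

set allow_experimental_variant_type = 1;

create table variants (c Variant(Int8, Array(String))) engine=MergeTree order by tuple();
insert into variants values (42), (['a', 'b']), (NULL);

-- Per-type subcolumns are read as Nullable; rows holding another type come back as NULL.
select c, c.Int8, c.`Array(String)`, variantType(c) from variants;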
3. JSON data type.
A column with JSON data type works as an opaque placeholder for dynamic subcolumns.
It is serialized into multiple streams: the first one is an index, enumerating every unique path in JSON, along with their types, and streams for every variant type for every path.
A user can insert JSON as String - then we convert it to JSON by deriving all data types of all paths as variants.
Different types inside variants are not converted on merges. Merges can only extend the variant data types by collecting more types inside them. The types are only converted on INSERT and SELECT.
If there is a JSON data type in a table, the table should tell that it supports dynamic columns. Querying any column not present in the static schema will trigger reading the "index" stream of every JSON type on query analysis. This is done lazily (on demand), but tables can subsequently cache this information inside in-memory information about data parts.
Distributed and Merge tables, as well as Views can trigger additional calls when asked about dynamic columns.
Column-level RBAC does not apply to subcolumns.
4. Explicit conversions.
When a user references a subcolumn from JSON with an explicit type conversion, e.g. SELECT json.CounterID::UInt32, the table will tell that the column json.CounterID::UInt32 exists and has the UInt32 type without looking inside the data. If a user referenced it on table creation, e.g., CREATE TABLE ... ORDER BY json.CounterID::UInt32, the expression should be successfully resolved.

If the column json.CounterID::UInt32 is requested, it should be propagated to reading, and the table will return the data already converted to UInt32, or throw an exception.
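A short sketch of this section using the exact expressions quoted above; it assumes the final syntax matches the RFC, which is not guaranteed:

-- Sketch only: assumes the JSON type and the ::UInt32 subcolumn conversion land as described.
create table hits (json JSON) engine=MergeTree order by json.CounterID::UInt32;

-- The table reports json.CounterID::UInt32 as an existing UInt32 column,
-- and reading returns data already converted (or throws on bad values).
select json.CounterID::UInt32 from hits where json.CounterID::UInt32 = 42;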
5. Limitations on the number of subcolumns.
We will not introduce additional part formats and will use the wide or compact format as usual.
But if the number of paths in JSON is larger than the limit specified in the table setting (e.g., 1000), the remaining path prefixes will be recollected into JSON and written as String. The first paths are selected in first-come order.

We should implement an optimization for vertical merge to support it for subcolumns, and most likely, it will be a special case in the code.
6. Hints.
The JSON data type can have parameters, specifying hints for data types of particular subpaths, or the fact that a particular subpath should not be written, e.g., JSON(http.version.major UInt8, SKIP body.raw, SKIP tls.handshake.random). We could also implement support for ALTERs.
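For concreteness, the hint syntax quoted above could be used like this in a table definition (a sketch only; the parameter syntax is taken verbatim from the RFC and may change):

-- Sketch: type hints and skipped subpaths as JSON type parameters.
create table logs
(
    json JSON(http.version.major UInt8, SKIP body.raw, SKIP tls.handshake.random)
)
engine=MergeTree order by tuple();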
Motivation
If we implement only №1 and №2 and fail to implement the JSON data type, we will still get a lot.
The current implementation of the JSON data type is experimental, and it is not too late to remove it. It has the following drawbacks:

See also
Amazon ION data format and PartiQL query language.
Sneller query engine and its zION data format.