RFC: Semistructured Columns #54864
I will start working on Variant data type. |
I'd like to point out one of the use-cases we had in #46190, too. Specifically, it would be great to be able to perform INSERT INTO tbl VALUES ('b', '[{"x":{"y":"z"}}]'). Additionally, as the JSON value can represent both a & if we already have |
Please also consider supporting a binary representation of JSON (i.e. a column type and related functions for manipulation), like PostgreSQL's jsonb, YTsaurus's YSON, or MongoDB's BSON. This is slightly off-topic, but somewhat relevant and could probably improve efficiency for unstructured data. |
There is already prior work for that: TLDR: implementation of a binary JSON format is out of scope of this task. |
I have a proposal with a slightly different implementation of the JSON type. In my opinion, it will simplify the implementation and will be quite natural:
This new type will represent a single column with dynamic type. It means that we can insert a column with any type into
Inside
For this new type of column we won't need to know all the inner
And this new
In JSON type we can just use
I don't see any big disadvantages in this approach, and at the same time I think it will simplify the implementation. There will be one difference: if we request a JSON path that doesn't exist in any part, we will not throw an exception that this path doesn't exist, but return a column filled with NULLs. But I think when we deal with dynamic data it can be even more natural. @alexey-milovidov WDYT? |
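A minimal sketch of the behaviour described in this proposal, assuming the proposed JSON type and subcolumn syntax land roughly as discussed (none of the syntax below is final):

-- Sketch only: assumes the proposed JSON type with dynamically typed subcolumns.
create table t (json JSON) engine=MergeTree order by tuple();
insert into t format JSONEachRow {"json" : {"a" : 42}}

select json.a from t;    -- 42, the type is resolved dynamically
select json.b.c from t;  -- the path exists in no part: a column of NULLs, not an exception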
It's quite possible (and common) that JSON can have a lot of unique paths (for example, when a unique user id is used as a key). |
With such an implementation I see two problems for now:
Also from your description it's unclear how to store and merge data types in |
Can you explain it in more detail? I don't fully understand what you mean by selecting the whole JSON column and selecting the intermediate object.
I guess you mean Arrays of nested JSON objects. But actually, I think we can just not store an Array of JSON objects as a set of nested paths as arrays; we can store the path to this array as a Dynamic column, and it will internally have type
In this case it will be more natural to have path

UPD: even if we want to store arrays of objects as paths with arrays that share offsets, what is the difference between using Variant and Dynamic types? In the first case we will have an Array of Variants, in the second an Array of Dynamic columns; I don't see the problem here. I guess if we want to support inserting an array of JSON objects into a JSON column, we should do it this way with shared offsets, but I don't know yet how we are going to implement it, because as we don't use Tuples, we will need to track shared offsets inside the JSON column somehow during serialization/deserialization.
The type will be the same across a single part, so we can just serialize its name in

So, I would say the only difference with the approach of just using the Variant type for paths is that we won't need to know the Variant type and all the paths in advance to be able to read subcolumns, and that we won't need to deal with different Variant types inside the JSON implementation |
As the Variant type is implemented and can be merged soon (I hope), let's discuss implementation details of the JSON type.

Working with a large number of paths. One of the problems we want to solve with the new JSON implementation is to be able to store JSON objects with a large number of different paths. As the number of paths can be arbitrarily large, we want to store only some of them as subcolumns and the others inside a separate String column as a JSON string. So here we need to think about 2 topics:

What paths we will store and how we will serialize/deserialize them in parts. First important note: we cannot store all unique paths in JSON, because this number can be even larger than the number of rows (for example, if some unique ids are used as keys in each row). So, we can only store the set of paths that are stored separately as subcolumns (because we can control the number of these paths and limit it).
The better option is to have a shared set of paths per part and modify the code of merges a bit by implementing special choosing of the resulting empty column, as described in 1.2.

How we will choose what paths will be stored separately and what paths will be stored in the String column. We should choose what paths will be stored as subcolumns in 2 cases:
Here we can have 2 options:
I would prefer the first option, as the second is error-prone and its disadvantages exceed the advantages in my opinion. If there are other options/ideas - feel free to add.
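As a rough illustration of the layout discussed above (a few hot paths stored as real typed subcolumns, everything else re-serialized into a String fallback column), here is a manual analogue using features that exist today; the table and column names are made up for the example:

-- Manual analogue of the proposed layout, not the actual JSON type.
create table json_manual
(
    `a.b` Int64,     -- a frequently used path stored as its own subcolumn
    `a.c` String,    -- another frequently used path
    rest String      -- all remaining paths, serialized back into a JSON string
)
engine=MergeTree order by tuple();

-- Hot paths are read directly; rarely used paths require scanning `rest`:
select `a.b`, JSONExtractString(rest, 'user', 'name') from json_manual;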
During merge we will have multiple JSON columns from different parts with possibly different sets of paths that are stored as subcolumns, and the total number of paths can exceed the limit. In this case we should decide what paths should remain as subcolumns and what paths should be serialized into the String column. This logic depends on how we are going to store and serialize paths. Let's say we use option 1.1, where we have a shared set of paths per part and modify the code of merges a bit by implementing special choosing of the resulting empty column. How we can choose, from N columns with N different sets of paths, what the resulting set of paths will be:
The third option would be the best, and it seems not difficult to implement.

Querying subcolumns of JSON. Originally we wanted to use Variant for JSON subcolumns and obtain the type of a subcolumn dynamically during query analysis. But the problem is that we are not able to store the set of all unique paths we have in the JSON column (as it can be larger than the total number of rows), so we will store only the subset of paths that are separated into subcolumns, and the rest of the paths will be packed into a single String column and stored as a JSON string. And to dynamically obtain the type for any subcolumn, in the worst case we would have to do a full scan on this String column to find whether we have this subcolumn or not. To solve this problem I proposed a new type called

New Dynamic type. This type will represent a single column with dynamic type, so it can store values of different types inside and extend its type dynamically. We can think of it as a dynamic

Dynamic column will support subcolumns as type names, similar to

Using Dynamic type for JSON subcolumns. We can say that any subcolumn requested from a JSON column will have type
If a user tries to use subcolumn

It may happen that the user requested a JSON path that doesn't exist in the table; in this case we will just return

But here we have a problem: as we use

Possible solutions:
I would prefer the third variant, as this problem can happen only with JSON, so we should solve it only for JSON. Important question: if we implement the Dynamic type, should we allow users to use it as a separate type, or will we use it only internally for the JSON implementation?

Handling arrays of JSON objects. In JSON data some paths can contain arrays of JSON objects, and we should be able to handle them.
There are 2 ways we can handle arrays of JSON objects and access their nested paths:
So in this case we will have 2 paths

Advantages: no need to add extra logic for extracting paths from objects inside arrays, ability to work with arrays of objects and not with separate paths.
In this case we have 2 suboptions:
So we will have paths:

Advantages: no need to use nested JSON columns, ability to separately access paths in objects from arrays at different levels, we can handle an array of objects at the top level.
So we will have a single path

Advantages: no need to use nested JSON columns, we will have a single path, there should be no problem with handling its type, we can handle an array of objects at the top level.

I am actually not sure which option is the best here. I like the first one because it should be easier to implement and it looks more natural to me, but we need to discuss it. @alexey-milovidov @CurtizJ @KochetovNicolai please add comments/questions/ideas. |
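For what the first option means in terms of functions that exist today, here is a small runnable illustration: the array of objects stays a single value and nested fields are resolved per element (JSONExtractArrayRaw and arrayMap are existing functions; the sample data is made up):

-- Option 1 in terms of existing functions: keep the array of objects as one value
-- and extract nested fields per element instead of flattening them into paths.
select arrayMap(x -> JSONExtractInt(x, 'b'),
                JSONExtractArrayRaw('{"a" : [{"b" : 1, "c" : "str"}, {"b" : 2}]}', 'a')) as b_values;
-- [1, 2]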
@Avogar hi, what is the current progress of the Dynamic data type and the new JSON? Is there an ETA? |
I am working on the Dynamic type right now. ETA - end of February/start of March. After Dynamic is done, I will start working on JSON, but I can't give an ETA for that yet. UPD: due to other tasks, the ETA for Dynamic is moving to the end of March |
@Avogar any updates here? |
The Dynamic implementation is almost ready, but the PR is delayed, because during March we put most of our priority and effort into fixing bugs/fuzzing issues/flaky tests, so I didn't have time to finish it. But don't worry, the PR with Dynamic will come soon and I will start working on the JSON type after that. For now I can show a demo of the Dynamic type: select d, dynamicType(d) from format(JSONEachRow, 'd Dynamic', $$
{"d" : [1, 2, 3]}
{"d" : "Hello"}
{"d" : 42}
{"d" : "2020-01-01"}
{"d" : {"a" : 1, "b" : 2}}
{"d" : null}
$$)
create table test (d Dynamic) engine=MergeTree order by tuple();
insert into test select * from format(JSONEachRow, 'd Dynamic', $$
{"d" : [1, 2, 3]}
{"d" : "Hello"}
{"d" : 42}
{"d" : "2020-01-01"}
{"d" : {"a" : 1, "b" : 2}}
{"d" : null}
$$);
select d, d.Int64, d.`Tuple(a Int64, b Int64)`.a from test;
create table test (map Map(String, Array(Dynamic))) engine=MergeTree order by tuple();
insert into test select map('key1', ['str'::Dynamic, 42::Int64::Dynamic], 'key2', ['2020-01-01'::Date::Dynamic, [1,2,3]::Array(UInt64)::Dynamic]);
insert into test select map('key3', [43::Int64::Dynamic], 'key4', [tuple(1, 2)::Tuple(a Int64, b Int64)::Dynamic, 'str_2'::Dynamic]);
select map.values.`Int64`, map.values.`String`, map.values.`Tuple(a Int64, b Int64)`.a from test;
|
As the Dynamic type is ready and almost merged, let's discuss some implementation details of the JSON type.

Handling arrays of JSON objects. We still didn't discuss how we want to handle arrays of JSON objects. See possible implementations in one of the comments above.

Schema inference. During parsing of a JSON object we will need to infer the data type for each path, and we need to determine what data types we want to support. Here I see 2 possible ways:
I would do 1. by default and all other types from 2. under settings that will be enabled by default (very similar to schema inference from input formats, maybe we can even reuse the settings). It's also good to discuss how we want to handle arrays of values with different data types: should we use an unnamed tuple for them, as it's done in JSON data formats, or do we want to handle them differently (for example, as an array of

How to store paths and values in
|
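A small runnable illustration of the existing input-format schema inference mentioned in the previous comment, which the JSON type could reuse per path (DESC format and the input_format_try_infer_* settings already exist today):

-- Schema inference from input formats, which the JSON type could reuse for each path.
set input_format_try_infer_integers = 1, input_format_try_infer_dates = 1;
desc format(JSONEachRow, $$
{"a" : 42, "b" : "2020-01-01", "c" : [1, 2, 3]}
$$);
-- typically: a Nullable(Int64), b Nullable(Date), c Array(Nullable(Int64))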
Why not consider the approach used for the Map type: store paths and values in two separate array columns? Probably we don't need to read the values column in many cases. |
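A minimal sketch of that Map-like idea with existing types, assuming both columns are kept aligned per row (the table, column names and the use of String values are all illustrative):

-- Paths and values as two parallel arrays; the values column can stay unread
-- when a query only needs to check path existence.
create table json_as_arrays
(
    paths Array(String),   -- sorted list of paths present in the row
    vals  Array(String)    -- serialized values, aligned with `paths`
)
engine=MergeTree order by tuple();

insert into json_as_arrays values (['a.b', 'a.c'], ['42', '"str"']);

select has(paths, 'a.b') from json_as_arrays;            -- existence check, values not read
select vals[indexOf(paths, 'a.b')] from json_as_arrays;  -- value lookup when the path exists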
If we want to allow querying of parent paths, hashing may not work. In your example, consider a query for path

An alternative could be to store the paths in sorted order, which would still allow for binary searching. If we want to use hashes, we could also hash each part of the path instead. For example |
I think we won't allow querying of parent paths as subcolumns, because it would lead to a full scan on each subcolumn request. We don't know all the paths in the part in advance, only the paths that were separated into subcolumns (and this number will be limited), and it means that we can't say whether a requested path is a parent path or not, so we would have to check all paths with such a prefix both in the set of separated paths and in all other paths stored together in some data structure. So, when a user requests

For reading parent paths we can implement a special function. Or we can think of introducing special syntax for reading sub-paths as JSON columns. Like |
We also think that we will implement 2 data structures for paths stored together - one for the in-memory representation, one for serialization in granules in MergeTree. The second one can be more complex and collect data for the whole granule; maybe for each granule we will write something like a tree with all paths in this granule. We are discussing it. My simple structure with hashes may be used for the in-memory representation, or we will come up with something better. |
Thanks for sharing this RFC! It sounds like you're working towards the right use cases. For us, we want to use this feature for querying semistructured JSON logs. As a result, the implementation detail that's most important to us is to be able to efficiently filter rows at query time, so storing the type with the type-specific binary row format is highly useful to our use case. I can't speak for others, but querying parents of subcolumns is not a high priority for us. The way we plan to use this feature is to have both a JSON column that infers the data types and can be used for fast filtering, while still storing the raw data in a separate column as a string, so if we want to query a substructure instead of a leaf of the JSON object, we can fallback to that raw JSON string column. Thanks again for sharing the direction and thoughts on this! Very excited for this feature! 🎉 |
It would be nice to not have special syntax to select objects. For example if I inserted the following data:
It would feel weird to me if
I would only know I am missing data if I instead query for |
The syntax is less important, what I am trying to say is that I think the case of querying for a parent and also retrieving the children should not be ignored when considering the data structure. |
It may feel weird, yes, but it will work like this. The whole point of the new JSON implementation is to be able to query separate paths fast enough without reading most of the data. Reading all children paths for a requested path will require a full scan of all the data in the worst case, even if the path doesn't have any children (because we won't know all paths stored in the part in advance, only the paths that are stored as separate subcolumns in separate streams). In your example:
We will store it as 3 subcolumns (some of them may be stored as separate subcolumns with an individual stream for fast reading, some may be stored together with other paths in a single stream in some data structure).
And in this approach, reading |
For my use case this will be less helpful but I can understand that it is a development trade off you are making and this is all experimental anyways :) |
If you store the paths sorted, you only need to binary search them, which of course is worst case O(log n). |
@Avogar Thanks for the detailed explanation, is it possible to share some ETA around the new JSON type? thanks! |
Right now the ETA is the first half of July, so most likely the new JSON type will be available in version 24.7. |
As I understood, we need to be able to quickly check if the path exists. As a value, we can use String with the serialized content. |
Yes, I already understood that my suggestion is overcomplicated, and a Map with sorted paths and binary serialized values will be enough. But with one addition: for serialization in wide parts, before each granule I will write the sorted list of all paths in this granule and replace the column with paths by a column of indexes into this sorted list (so we will be able to skip the granule by looking at the list of paths and won't iterate over each row to find the requested path; it will be efficient especially when we read more than 1 path and can use the substreams cache). And I am already working on it. |
@Avogar I saw your timeline for JSON Object is first half of July. Do we have an updated timeline? |
The timeline didn't change. I plan to create a PR in around 2 weeks |
There is currently an alias JSON for Object('JSON'). Given the new data type name will also be JSON, we might need to put it behind a compatibility setting. cc: @Avogar |
The feature is postponed to 24.8. The review process takes longer than usual because we are making a lot of efforts to stabilize the CI and fix existing bugs right now. |
(This is an alternative to the experimental JSON data type, aiming to replace it and address its drawbacks).
Implementation proposal
1. Dynamic columns.
A table (IStorage) can tell that it supports dynamic columns (bool canHaveDynamicColumns()). If this is the case, a query can reference columns or subcolumns not necessarily contained in the table's definition.
The table should provide a method (e.g., getColumnType) to dynamically obtain the type of a column or subcolumn (or expression?). As a bonus and a demo, we can allow a variant of Merge table that will dynamically extend if new tables appear. This Merge table can even have an empty table structure (no fixed columns).

2. Variant data type and column.
A new column type works as a discriminated union of nested columns. For example, Variant(Int8, Array(String)) has every value either Int8 or Array(String). It is serialized to multiple streams: a stream for every option and a stream with option numbers (discriminator).

This column can be accessed in the following ways:
c::Int8: we will read every subcolumn and convert if needed.
c.Int8: we will read only the requested subcolumn as Nullable (this is unnecessary).
c: if the column is requested as is, we will find the least common type of the variant types (and throw an exception if it does not exist) and convert it.
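A brief sketch of how this could look in practice, based on the Variant type as implemented in the experimental PR; the setting name is taken from that implementation and may change:

set allow_experimental_variant_type = 1;

create table variants (c Variant(Int8, Array(String))) engine=MergeTree order by tuple();
insert into variants values (42), (['a', 'b']), (NULL);

-- Per-type subcolumns are read as Nullable; rows holding another type come back as NULL.
select c, c.Int8, c.`Array(String)`, variantType(c) from variants;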
3. JSON data type.
A column with JSON data type works as an opaque placeholder for dynamic subcolumns.
It is serialized into multiple streams: the first one is an index, enumerating every unique path in JSON, along with their types, and streams for every variant type for every path.
A user can insert JSON as String - then we convert it to JSON by deriving all data types of all paths as variants.
Different types inside variants are not converted on merges. Merges can only extend the variant data types by collecting more types inside them. The types are only converted on INSERT and SELECT.
If there is a JSON data type in a table, the table should tell that it supports dynamic columns. Querying any column not present in the static schema will trigger reading the "index" stream of every JSON type on query analysis. This is done lazily (on demand), but tables can subsequently cache this information inside in-memory information about data parts.
Distributed and Merge tables, as well as Views can trigger additional calls when asked about dynamic columns.
Column-level RBAC does not apply to subcolumns.
4. Explicit conversions.
When a user references a subcolumn from JSON with an explicit type conversion, e.g. SELECT json.CounterID::UInt32, the table will tell that the column json.CounterID::UInt32 exists and has the UInt32 type without looking inside the data. If a user referenced it on table creation, e.g., CREATE TABLE ... ORDER BY json.CounterID::UInt32, the expression should be successfully resolved.

If the column json.CounterID::UInt32 is requested, it should be propagated to reading, and the table will return the data already converted to UInt32, or throw an exception.
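A short sketch of this section using the exact expressions quoted above; it assumes the final syntax matches the RFC, which is not guaranteed:

-- Sketch only: assumes the JSON type and the ::UInt32 subcolumn conversion land as described.
create table hits (json JSON) engine=MergeTree order by json.CounterID::UInt32;

-- The table reports json.CounterID::UInt32 as an existing UInt32 column,
-- and reading returns data already converted (or throws on bad values).
select json.CounterID::UInt32 from hits where json.CounterID::UInt32 = 42;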
5. Limitations on the number of subcolumns.
We will not introduce additional part formats and will use the wide or compact format as usual.
But if the number of paths in JSON is larger than the limit specified in the table setting (e.g., 1000), the remaining path prefixes will be recollected into JSON and written as String. The first paths are selected in first-come order.

We should implement an optimization for vertical merge to support it for subcolumns, and most likely, it will be a special case in the code.
6. Hints.
The JSON data type can have parameters, specifying hints for data types of particular subpaths, or the fact that a particular subpath should not be written, e.g., JSON(http.version.major UInt8, SKIP body.raw, SKIP tls.handshake.random). We could also implement support for ALTERs.
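For concreteness, the hint syntax quoted above could be used like this in a table definition (a sketch only; the parameter syntax is taken verbatim from the RFC and may change):

-- Sketch: type hints and skipped subpaths as JSON type parameters.
create table logs
(
    json JSON(http.version.major UInt8, SKIP body.raw, SKIP tls.handshake.random)
)
engine=MergeTree order by tuple();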
Motivation
If we implement only №1 and №2 and fail to implement the JSON data type, we will still get a lot.
The current implementation of the JSON data type is experimental, and it is not too late to remove it. It has the following drawbacks:

See also
Amazon ION data format and PartiQL query language.
Sneller query engine and its zION data format.