-
Notifications
You must be signed in to change notification settings - Fork 3.7k
[feature](hive)support hive catalog read json table. #43469
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
|
clang-tidy review says "All clean, LGTM! 👍" |
| Arena& mem_pool, int row_num) const; | ||
| virtual Status read_one_cell_from_json(IColumn& column, const rapidjson::Value& result) const; | ||
|
|
||
| virtual DataTypeSerDeSPtrs get_nested_serdes() const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we can support
get_key_serde() get_value_serde() for map
get_data_serde() for array nullcolumn
get_serde(int) for struct
not get_nested_serdes() for any complex type
|
TeamCity be ut coverage result: |
|
run buildall |
|
clang-tidy review says "All clean, LGTM! 👍" |
|
TeamCity be ut coverage result: |
| for (auto* slot_desc : slot_descs) { | ||
| if (_is_hive_table) { | ||
| //don't like _fuzzy_parse,each line read in must modify name_map once. | ||
| for (auto* v : slot_descs) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it will seriously impact the performance.
What if we set a session variable to control it? And by default, in most cases, there is no duplicate column name in json value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we not only need to deal with the same column names, but also the order of the table column names and the json data. I think I can simplify this double for loop by using a layer of for loop plus map.
| if (slot_desc->is_nullable()) { | ||
| vectorized::IColumn* data_column_ptr = column_ptr; | ||
| DataTypeSerDeSPtr data_serde = serde; | ||
| vectorized::DataTypeSerDe::FormatOptions _options; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this _options can be created only once at reuse it in each call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah , i will fix it.
| *value, "Json value is null, but the column `{}` is not nullable.", | ||
| slot_desc->col_name(), valid)); | ||
| return Status::OK(); | ||
| } else if (type_desc.type == TYPE_STRUCT) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use switch ... case for following logic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i need check _is_load and type_desc , It seems that there is no way to use switch.
|
|
||
| } else { | ||
| return Status::InternalError<false>("It should not here."); | ||
| // Maybe we can `switch (value->GetType()) case: kNumberType`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not using switch (value->GetType())?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if using switch (value->GetType()) to get the string behavior would improve performance, but I might give it a try.
|
run buildall |
|
clang-tidy review says "All clean, LGTM! 👍" |
c10850b to
b478158
Compare
|
run buildall |
|
clang-tidy review says "All clean, LGTM! 👍" |
morningman
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
| // simdjson, replace none simdjson function if it is ready | ||
| Status NewJsonReader::_simdjson_init_reader() { | ||
| Status NewJsonReader::_simdjson_init_reader(bool is_load) { | ||
| _is_load = is_load; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
already set in Status NewJsonReader::init_reader.
And not need to pass it from function parameter
|
|
||
| int32_t skip_bitmap_col_idx {-1}; | ||
|
|
||
| bool _is_load = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add comment in code to describe this field
eldenmoon
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
run buildall |
|
clang-tidy review says "All clean, LGTM! 👍" |
1 similar comment
|
clang-tidy review says "All clean, LGTM! 👍" |
|
PR approved by at least one committer and no changes requested. |
Problem Summary:
Support reading json format hive table like:
```mysql
mysql> show create table basic_json_table;
CREATE TABLE `basic_json_table`(
`id` int,
`name` string,
`age` tinyint,
`salary` float,
`is_active` boolean,
`join_date` date,
`last_login` timestamp,
`height` double,
`profile` binary,
`rating` decimal(10,2))
ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
```
Behavior changed:
To implement this feature, this pr modifies `new_json_reader`.
Previously, `new_json_reader` could only insert data into columnString.
In order to support inserting data into columns of other types,
`DataTypeSerDe` is introduced to insert data into columns. To maintain
compatibility with previous versions, changes to this pr are triggered
only when reading hive json tables.
Limitation of Use:
1. Currently, only query is supported, and writing is not supported.
2. Currently, only the `ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe';` scenario is supported. For
some properties specified in `with serdeproperties`, Doris does not take
effect.
3. Since Hive does not allow columns with the same name but different
case when creating a table in Json format (including inside a Struct),
we convert the field names in the Json data to lowercase when reading
the Json data file, and then match according to the lowercase field
names. For field names that are duplicated after being converted to
lowercase in the data, the value of the last field is used (consistent
with Hive behavior).
example:
```
create table json_table(
column int
)ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
a.json:
{"column":1,"COLumn",2,"COLUMN":3}
{"column":10,"COLumn",20}
{"column":100}
in Hive : load a.json to table json_table
in Doris query:
---
3
20
100
---
```
Todo(in next pr):
Merge `serde` and `json_reader` ,because they have logical conflicts.
Hive catalog support read json format table.
Problem Summary:
Support reading json format hive table like:
```mysql
mysql> show create table basic_json_table;
CREATE TABLE `basic_json_table`(
`id` int,
`name` string,
`age` tinyint,
`salary` float,
`is_active` boolean,
`join_date` date,
`last_login` timestamp,
`height` double,
`profile` binary,
`rating` decimal(10,2))
ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
```
Behavior changed:
To implement this feature, this pr modifies `new_json_reader`.
Previously, `new_json_reader` could only insert data into columnString.
In order to support inserting data into columns of other types,
`DataTypeSerDe` is introduced to insert data into columns. To maintain
compatibility with previous versions, changes to this pr are triggered
only when reading hive json tables.
Limitation of Use:
1. Currently, only query is supported, and writing is not supported.
2. Currently, only the `ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe';` scenario is supported. For
some properties specified in `with serdeproperties`, Doris does not take
effect.
3. Since Hive does not allow columns with the same name but different
case when creating a table in Json format (including inside a Struct),
we convert the field names in the Json data to lowercase when reading
the Json data file, and then match according to the lowercase field
names. For field names that are duplicated after being converted to
lowercase in the data, the value of the last field is used (consistent
with Hive behavior).
example:
```
create table json_table(
column int
)ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
a.json:
{"column":1,"COLumn",2,"COLUMN":3}
{"column":10,"COLumn",20}
{"column":100}
in Hive : load a.json to table json_table
in Doris query:
---
3
20
100
---
```
Todo(in next pr):
Merge `serde` and `json_reader` ,because they have logical conflicts.
Hive catalog support read json format table.
fix coredump with new_json_reader if we read invalid data will core which source from #43469
fix coredump with new_json_reader if we read invalid data will core which source from #43469
…JsonSerDe (#49209) ### What problem does this PR solve? Problem Summary: Initial support for Hive `org.openx.data.jsonserde.JsonSerDe`(https://github.com/rcongiu/Hive-JSON-Serde). The specific behavior of read is similar to pr #43469. By referring to the description in the link, here are some explanations: Support: 1. Querying Complex Fields 2. Importing Malformed Data (serde prop: ignore.malformed.json) Not supported, this parameter will not affect the query results 1. dots.in.keys 2. Case Sensitivity in mappings 3. Mapping Hive Keywords Not supported, but will report an error: 1. Using Arrays 2. Promoting a Scalar to an Array error : [DATA_QUALITY_ERROR]JSON data is array-object, `strip_outer_array` must be TRUE. In order to allow some json strings that do not support parsing to be processed by users, a session variable is introduced: `read_hive_json_in_one_column` (default is false). When this variable is true, a whole line of json is read into the first column, and users can choose to process a whole line of json, such as JSON_PARSE. The data type of the first column of the table needs to be string. Currently only valid for org.openx.data.jsonserde.JsonSerDe.
…JsonSerDe (apache#49209) ### What problem does this PR solve? Problem Summary: Initial support for Hive `org.openx.data.jsonserde.JsonSerDe`(https://github.com/rcongiu/Hive-JSON-Serde). The specific behavior of read is similar to pr apache#43469. By referring to the description in the link, here are some explanations: Support: 1. Querying Complex Fields 2. Importing Malformed Data (serde prop: ignore.malformed.json) Not supported, this parameter will not affect the query results 1. dots.in.keys 2. Case Sensitivity in mappings 3. Mapping Hive Keywords Not supported, but will report an error: 1. Using Arrays 2. Promoting a Scalar to an Array error : [DATA_QUALITY_ERROR]JSON data is array-object, `strip_outer_array` must be TRUE. In order to allow some json strings that do not support parsing to be processed by users, a session variable is introduced: `read_hive_json_in_one_column` (default is false). When this variable is true, a whole line of json is read into the first column, and users can choose to process a whole line of json, such as JSON_PARSE. The data type of the first column of the table needs to be string. Currently only valid for org.openx.data.jsonserde.JsonSerDe.
### What problem does this PR solve? Related PR: #43469 Problem Summary: PR #43469 accidentally removed the logic for reading boolean "true"/"false" values in the simd join reader. Before PR #43469, "true"/"false" were treated as "1"/"0", allowing a bool column in a JSON file to be imported into an Doris int column. This PR restores this logic.
### What problem does this PR solve? Related PR: #43469 Problem Summary: PR #43469 accidentally removed the logic for reading boolean "true"/"false" values in the simd join reader. Before PR #43469, "true"/"false" were treated as "1"/"0", allowing a bool column in a JSON file to be imported into an Doris int column. This PR restores this logic.
### What problem does this PR solve? Related PR: #43469 Problem Summary: PR #43469 accidentally removed the logic for reading boolean "true"/"false" values in the simd join reader. Before PR #43469, "true"/"false" were treated as "1"/"0", allowing a bool column in a JSON file to be imported into an Doris int column. This PR restores this logic.
What problem does this PR solve?
Problem Summary:
Support reading json format hive table like:
Behavior changed:
To implement this feature, this pr modifies
new_json_reader. Previously,new_json_readercould only insert data into columnString. In order to support inserting data into columns of other types,DataTypeSerDeis introduced to insert data into columns. To maintain compatibility with previous versions, changes to this pr are triggered only when reading hive json tables.Limitation of Use:
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';scenario is supported. For some properties specified inwith serdeproperties, Doris does not take effect.example:
Todo(in next pr):
Merge
serdeandjson_reader,because they have logical conflicts.Release note
Hive catalog support read json format table.
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)