[feature](hive)support hive catalog read json table. #43469

hubgeter · 2024-11-07T15:41:52Z

What problem does this PR solve?

Problem Summary:
Support reading json format hive table like:

mysql> show create table basic_json_table;
CREATE TABLE `basic_json_table`(
  `id` int,
  `name` string,
  `age` tinyint,
  `salary` float,
  `is_active` boolean,
  `join_date` date,
  `last_login` timestamp,
  `height` double,
  `profile` binary,
  `rating` decimal(10,2))
ROW FORMAT SERDE
  'org.apache.hive.hcatalog.data.JsonSerDe'

Behavior changed:
To implement this feature, this pr modifies new_json_reader. Previously, new_json_reader could only insert data into columnString. In order to support inserting data into columns of other types, DataTypeSerDe is introduced to insert data into columns. To maintain compatibility with previous versions, changes to this pr are triggered only when reading hive json tables.

Limitation of Use:

Currently, only query is supported, and writing is not supported.
Currently, only the ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'; scenario is supported. For some properties specified in with serdeproperties, Doris does not take effect.
Since Hive does not allow columns with the same name but different case when creating a table in Json format (including inside a Struct), we convert the field names in the Json data to lowercase when reading the Json data file, and then match according to the lowercase field names. For field names that are duplicated after being converted to lowercase in the data, the value of the last field is used (consistent with Hive behavior).
example:

create table json_table(
    column int 
)ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';

a.json:
{"column":1,"COLumn",2,"COLUMN":3}
{"column":10,"COLumn",20}
{"column":100}
in Hive : load a.json to table json_table

in Doris query:
---
3
20
100
---

Todo(in next pr):
Merge serde and json_reader ,because they have logical conflicts.

Release note

Hive catalog support read json format table.

Check List (For Author)

Test
- Regression test
- Unit Test
- Manual test (add detailed scripts or steps below)
- No need to test or manual test. Explain why:
  - This is a refactor/code format and no logic has been changed.
  - Previous test can cover this change.
  - No code files have been changed.
  - Other reason
Behavior changed:
- No.
- Yes.
Does this need documentation?
- No.
- Yes.

Check List (For Reviewer who merge this PR)

Confirm the release note
Confirm test cases
Confirm document
Add branch pick label

doris-robot · 2024-11-07T15:41:57Z

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

What problem was fixed (it's best to include specific error reporting information). How it was fixed.
Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
What features were added. Why was this function added?
Which code was refactored and why was this part of the code refactored?
Which functions were optimized and what is the difference before and after the optimization?

hubgeter · 2024-11-07T15:42:30Z

run buildall

github-actions · 2024-11-07T15:47:12Z

clang-tidy review says "All clean, LGTM! 👍"

hubgeter · 2024-11-07T15:47:42Z

be/src/vec/data_types/serde/data_type_serde.h

                                          Arena& mem_pool, int row_num) const;
    virtual Status read_one_cell_from_json(IColumn& column, const rapidjson::Value& result) const;

+    virtual DataTypeSerDeSPtrs get_nested_serdes() const {


maybe we can support
get_key_serde() get_value_serde() for map
get_data_serde() for array nullcolumn
get_serde(int) for struct
not get_nested_serdes() for any complex type

doris-robot · 2024-11-07T17:27:55Z

TeamCity be ut coverage result:
Function Coverage: 37.88% (9862/26034)
Line Coverage: 29.03% (82003/282472)
Region Coverage: 28.26% (42240/149458)
Branch Coverage: 24.83% (21419/86256)
Coverage Report: http://coverage.selectdb-in.cc/coverage/4120ac989a900c6d6af8401251c242a6690de1df_4120ac989a900c6d6af8401251c242a6690de1df/report/index.html

hubgeter · 2024-11-10T10:19:10Z

run buildall

github-actions · 2024-11-10T10:23:53Z

clang-tidy review says "All clean, LGTM! 👍"

doris-robot · 2024-11-10T12:09:43Z

TeamCity be ut coverage result:
Function Coverage: 37.91% (9868/26033)
Line Coverage: 29.07% (82263/282975)
Region Coverage: 28.23% (42347/150029)
Branch Coverage: 24.78% (21447/86550)
Coverage Report: http://coverage.selectdb-in.cc/coverage/14b5e53e374c7b4d59ae9dc746ff46a37d551ace_14b5e53e374c7b4d59ae9dc746ff46a37d551ace/report/index.html

morningman · 2024-11-10T14:19:09Z

be/src/vec/exec/format/json/new_json_reader.cpp

-    for (auto* slot_desc : slot_descs) {
+    if (_is_hive_table) {
+        //don't like _fuzzy_parse,each line read in must modify name_map once.
+        for (auto* v : slot_descs) {


I think it will seriously impact the performance.
What if we set a session variable to control it? And by default, in most cases, there is no duplicate column name in json value.

Here we not only need to deal with the same column names, but also the order of the table column names and the json data. I think I can simplify this double for loop by using a layer of for loop plus map.

morningman · 2024-11-10T14:21:01Z

be/src/vec/exec/format/json/new_json_reader.cpp

-    if (slot_desc->is_nullable()) {
+    vectorized::IColumn* data_column_ptr = column_ptr;
+    DataTypeSerDeSPtr data_serde = serde;
+    vectorized::DataTypeSerDe::FormatOptions _options;


Looks like this _options can be created only once at reuse it in each call.

yeah , i will fix it.

morningman · 2024-11-10T14:22:53Z

be/src/vec/exec/format/json/new_json_reader.cpp

-                    *value, "Json value is null, but the column `{}` is not nullable.",
-                    slot_desc->col_name(), valid));
-            return Status::OK();
+    } else if (type_desc.type == TYPE_STRUCT) {


Can we use switch ... case for following logic?

i need check _is_load and type_desc , It seems that there is no way to use switch.

morningman · 2024-11-10T14:24:25Z

be/src/vec/exec/format/json/new_json_reader.cpp

+
        } else {
-            return Status::InternalError<false>("It should not here.");
+            // Maybe we can `switch (value->GetType()) case: kNumberType`.


Why not using switch (value->GetType())?

I'm not sure if using switch (value->GetType()) to get the string behavior would improve performance, but I might give it a try.

hubgeter · 2024-11-11T11:32:56Z

run buildall

github-actions · 2024-11-11T11:37:47Z

clang-tidy review says "All clean, LGTM! 👍"

hubgeter · 2024-11-12T15:43:25Z

run buildall

github-actions · 2024-11-12T15:48:40Z

clang-tidy review says "All clean, LGTM! 👍"

morningman

LGTM

github-actions · 2024-11-13T07:02:48Z

PR approved by at least one committer and no changes requested.

github-actions · 2024-11-13T07:02:50Z

PR approved by anyone and no changes requested.

morningman · 2024-11-13T07:10:11Z

be/src/vec/exec/format/json/new_json_reader.cpp

 // simdjson, replace none simdjson function if it is ready
-Status NewJsonReader::_simdjson_init_reader() {
+Status NewJsonReader::_simdjson_init_reader(bool is_load) {
+    _is_load = is_load;


already set in Status NewJsonReader::init_reader.
And not need to pass it from function parameter

morningman · 2024-11-13T07:28:07Z

be/src/vec/exec/format/json/new_json_reader.h


    int32_t skip_bitmap_col_idx {-1};
+
+    bool _is_load = true;


Add comment in code to describe this field

eldenmoon

LGTM

hubgeter · 2024-11-13T18:04:28Z

run buildall

github-actions · 2024-11-13T18:09:03Z

clang-tidy review says "All clean, LGTM! 👍"

github-actions · 2024-11-14T06:49:07Z

clang-tidy review says "All clean, LGTM! 👍"

github-actions · 2024-11-28T14:01:30Z

PR approved by at least one committer and no changes requested.

Problem Summary: Support reading json format hive table like: ```mysql mysql> show create table basic_json_table; CREATE TABLE `basic_json_table`( `id` int, `name` string, `age` tinyint, `salary` float, `is_active` boolean, `join_date` date, `last_login` timestamp, `height` double, `profile` binary, `rating` decimal(10,2)) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' ``` Behavior changed: To implement this feature, this pr modifies `new_json_reader`. Previously, `new_json_reader` could only insert data into columnString. In order to support inserting data into columns of other types, `DataTypeSerDe` is introduced to insert data into columns. To maintain compatibility with previous versions, changes to this pr are triggered only when reading hive json tables. Limitation of Use: 1. Currently, only query is supported, and writing is not supported. 2. Currently, only the `ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';` scenario is supported. For some properties specified in `with serdeproperties`, Doris does not take effect. 3. Since Hive does not allow columns with the same name but different case when creating a table in Json format (including inside a Struct), we convert the field names in the Json data to lowercase when reading the Json data file, and then match according to the lowercase field names. For field names that are duplicated after being converted to lowercase in the data, the value of the last field is used (consistent with Hive behavior). example: ``` create table json_table( column int )ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'; a.json: {"column":1,"COLumn",2,"COLUMN":3} {"column":10,"COLumn",20} {"column":100} in Hive : load a.json to table json_table in Doris query: --- 3 20 100 --- ``` Todo(in next pr): Merge `serde` and `json_reader` ,because they have logical conflicts. Hive catalog support read json format table.

bp #43469

fix coredump with new_json_reader if we read invalid data will core which source from #43469

…JsonSerDe (#49209) ### What problem does this PR solve? Problem Summary: Initial support for Hive `org.openx.data.jsonserde.JsonSerDe`(https://github.com/rcongiu/Hive-JSON-Serde). The specific behavior of read is similar to pr #43469. By referring to the description in the link, here are some explanations: Support: 1. Querying Complex Fields 2. Importing Malformed Data (serde prop: ignore.malformed.json) Not supported, this parameter will not affect the query results 1. dots.in.keys 2. Case Sensitivity in mappings 3. Mapping Hive Keywords Not supported, but will report an error: 1. Using Arrays 2. Promoting a Scalar to an Array error : [DATA_QUALITY_ERROR]JSON data is array-object, `strip_outer_array` must be TRUE. In order to allow some json strings that do not support parsing to be processed by users, a session variable is introduced: `read_hive_json_in_one_column` (default is false). When this variable is true, a whole line of json is read into the first column, and users can choose to process a whole line of json, such as JSON_PARSE. The data type of the first column of the table needs to be string. Currently only valid for org.openx.data.jsonserde.JsonSerDe.

…JsonSerDe (apache#49209) ### What problem does this PR solve? Problem Summary: Initial support for Hive `org.openx.data.jsonserde.JsonSerDe`(https://github.com/rcongiu/Hive-JSON-Serde). The specific behavior of read is similar to pr apache#43469. By referring to the description in the link, here are some explanations: Support: 1. Querying Complex Fields 2. Importing Malformed Data (serde prop: ignore.malformed.json) Not supported, this parameter will not affect the query results 1. dots.in.keys 2. Case Sensitivity in mappings 3. Mapping Hive Keywords Not supported, but will report an error: 1. Using Arrays 2. Promoting a Scalar to an Array error : [DATA_QUALITY_ERROR]JSON data is array-object, `strip_outer_array` must be TRUE. In order to allow some json strings that do not support parsing to be processed by users, a session variable is introduced: `read_hive_json_in_one_column` (default is false). When this variable is true, a whole line of json is read into the first column, and users can choose to process a whole line of json, such as JSON_PARSE. The data type of the first column of the table needs to be string. Currently only valid for org.openx.data.jsonserde.JsonSerDe.

### What problem does this PR solve? Related PR: #43469 Problem Summary: PR #43469 accidentally removed the logic for reading boolean "true"/"false" values in the simd join reader. Before PR #43469, "true"/"false" were treated as "1"/"0", allowing a bool column in a JSON file to be imported into an Doris int column. This PR restores this logic.

hubgeter commented Nov 7, 2024

View reviewed changes

hubgeter marked this pull request as ready for review November 10, 2024 10:19

morningman reviewed Nov 10, 2024

View reviewed changes

morningman added dev/2.1.x dev/3.0.x labels Nov 12, 2024

hubgeter force-pushed the feature_hive_json_serde branch from c10850b to b478158 Compare November 12, 2024 15:43

morningman previously approved these changes Nov 13, 2024

View reviewed changes

github-actions bot added the approved Indicates a PR has been approved by one committer. label Nov 13, 2024

github-actions bot added the reviewed label Nov 13, 2024

morningman reviewed Nov 13, 2024

View reviewed changes

eldenmoon previously approved these changes Nov 13, 2024

View reviewed changes

hubgeter dismissed stale reviews from eldenmoon and morningman via 7da0fde November 13, 2024 18:04

github-actions bot removed the approved Indicates a PR has been approved by one committer. label Nov 13, 2024

morningman approved these changes Nov 28, 2024

View reviewed changes

CalvinKirs approved these changes Nov 28, 2024

View reviewed changes

morningman merged commit 358ed8c into apache:master Nov 28, 2024
17 of 18 checks passed

github-actions bot added dev/3.0.x-conflict dev/2.1.x-conflict labels Nov 28, 2024

morningman removed dev/2.1.x dev/2.1.x-conflict labels Nov 29, 2024

hubgeter mentioned this pull request Dec 2, 2024

[feature](hive)support hive catalog read json table. (#43469) #44848

Merged

16 tasks

morningman pushed a commit that referenced this pull request Dec 4, 2024

[feature](hive)support hive catalog read json table. (#43469) (#44848)

8efb98e

bp #43469

morningman added dev/3.0.4-merged and removed dev/3.0.x dev/3.0.x-conflict labels Dec 4, 2024

morningman added the dev/2.1.x-experimental label Dec 23, 2024

eldenmoon mentioned this pull request Dec 24, 2024

[fix](nestedtype)fix nested type with timezone datetime #45885

Closed

16 tasks

amorynan mentioned this pull request Dec 25, 2024

[fix](new_json_reader)fix core for new_json_reader #45905

Merged

16 tasks

eldenmoon pushed a commit that referenced this pull request Jan 2, 2025

[fix](new_json_reader)fix core for new_json_reader (#45905)

8cb4830

fix coredump with new_json_reader if we read invalid data will core which source from #43469

github-actions bot pushed a commit that referenced this pull request Jan 2, 2025

[fix](new_json_reader)fix core for new_json_reader (#45905)

ffd3ae7

fix coredump with new_json_reader if we read invalid data will core which source from #43469

morningman added the dev/2.1.x-lh label Feb 17, 2025

morningman pushed a commit that referenced this pull request Feb 17, 2025

[feature](hive)support hive catalog read json table. (#43469)

d888c93

gavinchou mentioned this pull request Feb 18, 2025

Release Notes 3.0.4 #48013

Closed

hubgeter mentioned this pull request Mar 18, 2025

[enhancement](hive)Initial support for Hive org.openx.data.jsonserde.JsonSerDe #49209

Merged

16 tasks

hubgeter mentioned this pull request Aug 6, 2025

[fix](load)fix load json format boolean type to int column. #54397

Merged

16 tasks

[feature](hive)support hive catalog read json table. #43469

[feature](hive)support hive catalog read json table. #43469

Uh oh!

Conversation

hubgeter commented Nov 7, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What problem does this PR solve?

Release note

Check List (For Author)

Check List (For Reviewer who merge this PR)

Uh oh!

doris-robot commented Nov 7, 2024

Uh oh!

hubgeter commented Nov 7, 2024

Uh oh!

github-actions bot commented Nov 7, 2024

Uh oh!

Choose a reason for hiding this comment

Uh oh!

doris-robot commented Nov 7, 2024

Uh oh!

hubgeter commented Nov 10, 2024

Uh oh!

github-actions bot commented Nov 10, 2024

Uh oh!

doris-robot commented Nov 10, 2024

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hubgeter commented Nov 11, 2024

Uh oh!

github-actions bot commented Nov 11, 2024

Uh oh!

hubgeter commented Nov 12, 2024

Uh oh!

github-actions bot commented Nov 12, 2024

Uh oh!

morningman left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Nov 13, 2024

Uh oh!

github-actions bot commented Nov 13, 2024

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

eldenmoon left a comment

Choose a reason for hiding this comment

Uh oh!

hubgeter commented Nov 13, 2024

Uh oh!

github-actions bot commented Nov 13, 2024

Uh oh!

github-actions bot commented Nov 14, 2024

Uh oh!

github-actions bot commented Nov 28, 2024

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

hubgeter commented Nov 7, 2024 •

edited

Loading