Skip to content

Conversation

@hubgeter
Copy link
Contributor

@hubgeter hubgeter commented Nov 7, 2024

What problem does this PR solve?

Problem Summary:
Support reading json format hive table like:

mysql> show create table basic_json_table;
CREATE TABLE `basic_json_table`(
  `id` int,
  `name` string,
  `age` tinyint,
  `salary` float,
  `is_active` boolean,
  `join_date` date,
  `last_login` timestamp,
  `height` double,
  `profile` binary,
  `rating` decimal(10,2))
ROW FORMAT SERDE
  'org.apache.hive.hcatalog.data.JsonSerDe'

Behavior changed:
To implement this feature, this pr modifies new_json_reader. Previously, new_json_reader could only insert data into columnString. In order to support inserting data into columns of other types, DataTypeSerDe is introduced to insert data into columns. To maintain compatibility with previous versions, changes to this pr are triggered only when reading hive json tables.

Limitation of Use:

  1. Currently, only query is supported, and writing is not supported.
  2. Currently, only the ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'; scenario is supported. For some properties specified in with serdeproperties, Doris does not take effect.
  3. Since Hive does not allow columns with the same name but different case when creating a table in Json format (including inside a Struct), we convert the field names in the Json data to lowercase when reading the Json data file, and then match according to the lowercase field names. For field names that are duplicated after being converted to lowercase in the data, the value of the last field is used (consistent with Hive behavior).
    example:
create table json_table(
    column int 
)ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';

a.json:
{"column":1,"COLumn",2,"COLUMN":3}
{"column":10,"COLumn",20}
{"column":100}
in Hive : load a.json to table json_table

in Doris query:
---
3
20
100
---

Todo(in next pr):
Merge serde and json_reader ,because they have logical conflicts.

Release note

Hive catalog support read json format table.

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@doris-robot
Copy link

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@hubgeter
Copy link
Contributor Author

hubgeter commented Nov 7, 2024

run buildall

@github-actions
Copy link
Contributor

github-actions bot commented Nov 7, 2024

clang-tidy review says "All clean, LGTM! 👍"

Arena& mem_pool, int row_num) const;
virtual Status read_one_cell_from_json(IColumn& column, const rapidjson::Value& result) const;

virtual DataTypeSerDeSPtrs get_nested_serdes() const {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can support
get_key_serde() get_value_serde() for map
get_data_serde() for array nullcolumn
get_serde(int) for struct
not get_nested_serdes() for any complex type

@doris-robot
Copy link

TeamCity be ut coverage result:
Function Coverage: 37.88% (9862/26034)
Line Coverage: 29.03% (82003/282472)
Region Coverage: 28.26% (42240/149458)
Branch Coverage: 24.83% (21419/86256)
Coverage Report: http://coverage.selectdb-in.cc/coverage/4120ac989a900c6d6af8401251c242a6690de1df_4120ac989a900c6d6af8401251c242a6690de1df/report/index.html

@hubgeter
Copy link
Contributor Author

run buildall

@hubgeter hubgeter marked this pull request as ready for review November 10, 2024 10:19
@github-actions
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@doris-robot
Copy link

TeamCity be ut coverage result:
Function Coverage: 37.91% (9868/26033)
Line Coverage: 29.07% (82263/282975)
Region Coverage: 28.23% (42347/150029)
Branch Coverage: 24.78% (21447/86550)
Coverage Report: http://coverage.selectdb-in.cc/coverage/14b5e53e374c7b4d59ae9dc746ff46a37d551ace_14b5e53e374c7b4d59ae9dc746ff46a37d551ace/report/index.html

for (auto* slot_desc : slot_descs) {
if (_is_hive_table) {
//don't like _fuzzy_parse,each line read in must modify name_map once.
for (auto* v : slot_descs) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it will seriously impact the performance.
What if we set a session variable to control it? And by default, in most cases, there is no duplicate column name in json value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we not only need to deal with the same column names, but also the order of the table column names and the json data. I think I can simplify this double for loop by using a layer of for loop plus map.

if (slot_desc->is_nullable()) {
vectorized::IColumn* data_column_ptr = column_ptr;
DataTypeSerDeSPtr data_serde = serde;
vectorized::DataTypeSerDe::FormatOptions _options;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this _options can be created only once at reuse it in each call.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah , i will fix it.

*value, "Json value is null, but the column `{}` is not nullable.",
slot_desc->col_name(), valid));
return Status::OK();
} else if (type_desc.type == TYPE_STRUCT) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use switch ... case for following logic?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i need check _is_load and type_desc , It seems that there is no way to use switch.


} else {
return Status::InternalError<false>("It should not here.");
// Maybe we can `switch (value->GetType()) case: kNumberType`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not using switch (value->GetType())?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if using switch (value->GetType()) to get the string behavior would improve performance, but I might give it a try.

@hubgeter
Copy link
Contributor Author

run buildall

@github-actions
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@hubgeter
Copy link
Contributor Author

run buildall

@github-actions
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

morningman
morningman previously approved these changes Nov 13, 2024
Copy link
Contributor

@morningman morningman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Nov 13, 2024
@github-actions
Copy link
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Copy link
Contributor

PR approved by anyone and no changes requested.

// simdjson, replace none simdjson function if it is ready
Status NewJsonReader::_simdjson_init_reader() {
Status NewJsonReader::_simdjson_init_reader(bool is_load) {
_is_load = is_load;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

already set in Status NewJsonReader::init_reader.
And not need to pass it from function parameter


int32_t skip_bitmap_col_idx {-1};

bool _is_load = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add comment in code to describe this field

eldenmoon
eldenmoon previously approved these changes Nov 13, 2024
Copy link
Member

@eldenmoon eldenmoon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@hubgeter hubgeter dismissed stale reviews from eldenmoon and morningman via 7da0fde November 13, 2024 18:04
@hubgeter
Copy link
Contributor Author

run buildall

@github-actions github-actions bot removed the approved Indicates a PR has been approved by one committer. label Nov 13, 2024
@github-actions
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

1 similar comment
@github-actions
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@github-actions
Copy link
Contributor

PR approved by at least one committer and no changes requested.

@morningman morningman merged commit 358ed8c into apache:master Nov 28, 2024
17 of 18 checks passed
hubgeter added a commit to hubgeter/doris that referenced this pull request Dec 2, 2024
Problem Summary:
Support reading json format hive table like:
```mysql
mysql> show create table basic_json_table;
CREATE TABLE `basic_json_table`(
  `id` int,
  `name` string,
  `age` tinyint,
  `salary` float,
  `is_active` boolean,
  `join_date` date,
  `last_login` timestamp,
  `height` double,
  `profile` binary,
  `rating` decimal(10,2))
ROW FORMAT SERDE
  'org.apache.hive.hcatalog.data.JsonSerDe'
```

Behavior changed:
To implement this feature, this pr modifies `new_json_reader`.
Previously, `new_json_reader` could only insert data into columnString.
In order to support inserting data into columns of other types,
`DataTypeSerDe` is introduced to insert data into columns. To maintain
compatibility with previous versions, changes to this pr are triggered
only when reading hive json tables.

Limitation of Use:
1. Currently, only query is supported, and writing is not supported.
2. Currently, only the `ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe';` scenario is supported. For
some properties specified in `with serdeproperties`, Doris does not take
effect.
3. Since Hive does not allow columns with the same name but different
case when creating a table in Json format (including inside a Struct),
we convert the field names in the Json data to lowercase when reading
the Json data file, and then match according to the lowercase field
names. For field names that are duplicated after being converted to
lowercase in the data, the value of the last field is used (consistent
with Hive behavior).
example:
```
create table json_table(
    column int
)ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';

a.json:
{"column":1,"COLumn",2,"COLUMN":3}
{"column":10,"COLumn",20}
{"column":100}
in Hive : load a.json to table json_table

in Doris query:
---
3
20
100
---
```

Todo(in next pr):
Merge `serde` and `json_reader` ,because they have logical conflicts.

Hive catalog support read json format table.
hubgeter added a commit to hubgeter/doris that referenced this pull request Dec 4, 2024
Problem Summary:
Support reading json format hive table like:
```mysql
mysql> show create table basic_json_table;
CREATE TABLE `basic_json_table`(
  `id` int,
  `name` string,
  `age` tinyint,
  `salary` float,
  `is_active` boolean,
  `join_date` date,
  `last_login` timestamp,
  `height` double,
  `profile` binary,
  `rating` decimal(10,2))
ROW FORMAT SERDE
  'org.apache.hive.hcatalog.data.JsonSerDe'
```

Behavior changed:
To implement this feature, this pr modifies `new_json_reader`.
Previously, `new_json_reader` could only insert data into columnString.
In order to support inserting data into columns of other types,
`DataTypeSerDe` is introduced to insert data into columns. To maintain
compatibility with previous versions, changes to this pr are triggered
only when reading hive json tables.

Limitation of Use:
1. Currently, only query is supported, and writing is not supported.
2. Currently, only the `ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe';` scenario is supported. For
some properties specified in `with serdeproperties`, Doris does not take
effect.
3. Since Hive does not allow columns with the same name but different
case when creating a table in Json format (including inside a Struct),
we convert the field names in the Json data to lowercase when reading
the Json data file, and then match according to the lowercase field
names. For field names that are duplicated after being converted to
lowercase in the data, the value of the last field is used (consistent
with Hive behavior).
example:
```
create table json_table(
    column int
)ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';

a.json:
{"column":1,"COLumn",2,"COLUMN":3}
{"column":10,"COLumn",20}
{"column":100}
in Hive : load a.json to table json_table

in Doris query:
---
3
20
100
---
```

Todo(in next pr):
Merge `serde` and `json_reader` ,because they have logical conflicts.

Hive catalog support read json format table.
eldenmoon pushed a commit that referenced this pull request Jan 2, 2025
fix coredump with new_json_reader if we read invalid data will core
which source from #43469
github-actions bot pushed a commit that referenced this pull request Jan 2, 2025
fix coredump with new_json_reader if we read invalid data will core
which source from #43469
@gavinchou gavinchou mentioned this pull request Feb 18, 2025
morningman pushed a commit that referenced this pull request Apr 9, 2025
…JsonSerDe (#49209)

### What problem does this PR solve?

Problem Summary:
Initial support for Hive
`org.openx.data.jsonserde.JsonSerDe`(https://github.com/rcongiu/Hive-JSON-Serde).
The specific behavior of read  is similar to pr #43469.

By referring to the description in the link, here are some explanations:
Support:
1. Querying Complex Fields
2. Importing Malformed Data (serde prop: ignore.malformed.json)

Not supported, this parameter will not affect the query results
1. dots.in.keys
2. Case Sensitivity in mappings
3. Mapping Hive Keywords

Not supported, but will report an error:
1. Using Arrays
2. Promoting a Scalar to an Array
error : [DATA_QUALITY_ERROR]JSON data is array-object,
`strip_outer_array` must be TRUE.

In order to allow some json strings that do not support parsing to be
processed by users, a session variable is introduced:
`read_hive_json_in_one_column` (default is false). When this variable is
true, a whole line of json is read into the first column, and users can
choose to process a whole line of json, such as JSON_PARSE. The data
type of the first column of the table needs to be string. Currently only
valid for org.openx.data.jsonserde.JsonSerDe.
koarz pushed a commit to koarz/doris that referenced this pull request Jun 4, 2025
…JsonSerDe (apache#49209)

### What problem does this PR solve?

Problem Summary:
Initial support for Hive
`org.openx.data.jsonserde.JsonSerDe`(https://github.com/rcongiu/Hive-JSON-Serde).
The specific behavior of read  is similar to pr apache#43469.

By referring to the description in the link, here are some explanations:
Support:
1. Querying Complex Fields
2. Importing Malformed Data (serde prop: ignore.malformed.json)

Not supported, this parameter will not affect the query results
1. dots.in.keys
2. Case Sensitivity in mappings
3. Mapping Hive Keywords

Not supported, but will report an error:
1. Using Arrays
2. Promoting a Scalar to an Array
error : [DATA_QUALITY_ERROR]JSON data is array-object,
`strip_outer_array` must be TRUE.

In order to allow some json strings that do not support parsing to be
processed by users, a session variable is introduced:
`read_hive_json_in_one_column` (default is false). When this variable is
true, a whole line of json is read into the first column, and users can
choose to process a whole line of json, such as JSON_PARSE. The data
type of the first column of the table needs to be string. Currently only
valid for org.openx.data.jsonserde.JsonSerDe.
morningman pushed a commit that referenced this pull request Aug 12, 2025
### What problem does this PR solve?

Related PR: #43469
Problem Summary:
PR #43469 accidentally removed the logic for reading boolean
"true"/"false" values in the simd join reader. Before PR #43469,
"true"/"false" were treated as "1"/"0", allowing a bool column in a JSON
file to be imported into an Doris int column. This PR restores this
logic.
github-actions bot pushed a commit that referenced this pull request Aug 12, 2025
### What problem does this PR solve?

Related PR: #43469
Problem Summary:
PR #43469 accidentally removed the logic for reading boolean
"true"/"false" values in the simd join reader. Before PR #43469,
"true"/"false" were treated as "1"/"0", allowing a bool column in a JSON
file to be imported into an Doris int column. This PR restores this
logic.
github-actions bot pushed a commit that referenced this pull request Aug 12, 2025
### What problem does this PR solve?

Related PR: #43469
Problem Summary:
PR #43469 accidentally removed the logic for reading boolean
"true"/"false" values in the simd join reader. Before PR #43469,
"true"/"false" were treated as "1"/"0", allowing a bool column in a JSON
file to be imported into an Doris int column. This PR restores this
logic.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants