[enhancement](hive)Initial support for Hive org.openx.data.jsonserde.JsonSerDe #49209
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
|
TeamCity cloud ut coverage result: |
TPC-H: Total hot run time: 32970 ms |
TPC-DS: Total hot run time: 185231 ms |
ClickBench: Total hot run time: 31.05 s |
BE UT Coverage Report: Increment line coverage (Increment coverage report)
|
// in the data to lowercase,and use the last one as the insertion value

bool _openx_json_ignore_malformed = false;
// hive : org.openx.data.jsonserde.JsonSerDe, `ignore.malformed.json` prop.
Move the comment so that it comes before the field.
|
run buildall |
|
TeamCity cloud ut coverage result: |
TPC-H: Total hot run time: 34703 ms |
TPC-DS: Total hot run time: 194406 ms |
ClickBench: Total hot run time: 31.92 s |
}

public boolean canReadHiveJsonInOneColumn() {
    return ConnectContext.get().getSessionVariable().isReadHiveJsonInOneColumn()
There is a sessionVariable instance in HiveScanNode. Use it instead of ConnectContext.get().getSessionVariable()
        || serDeLib.equals(HiveMetaStoreClientHelper.LEGACY_HIVE_JSON_SERDE)) {
    type = TFileFormatType.FORMAT_JSON;
} else if (serDeLib.equals(HiveMetaStoreClientHelper.OPENX_JSON_SERDE)) {
    if (hmsTable.canReadHiveJsonInOneColumn()) {
I think we should return an error if READ_HIVE_JSON_IN_ONE_COLUMN is true but the first column is not a string.
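The check suggested in this comment could look roughly like the sketch below. It is only an illustration under stated assumptions: `OneColumnCheck`, `validate`, and the plain list of column type names are hypothetical stand-ins rather than Doris classes, and the error message text is invented for the example.

```java
import java.util.List;

// Hypothetical sketch; none of these names come from the Doris code base.
final class OneColumnCheck {
    private OneColumnCheck() {}

    /**
     * Rejects the scan when one-column mode is on but the first column
     * cannot hold the raw JSON line.
     */
    static void validate(boolean readHiveJsonInOneColumn, List<String> columnTypes) {
        if (!readHiveJsonInOneColumn) {
            return; // one-column mode is off, nothing to check here
        }
        if (columnTypes.isEmpty() || !"string".equalsIgnoreCase(columnTypes.get(0))) {
            throw new IllegalArgumentException(
                    "read_hive_json_in_one_column is true, but the first column is not of type string");
        }
    }

    public static void main(String[] args) {
        validate(true, List.of("string", "int"));   // accepted: first column is a string
        validate(false, List.of("int", "string"));  // accepted: one-column mode is off
        try {
            validate(true, List.of("int", "string"));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Failing early like this would surface the misconfiguration at planning time instead of leaving it to the reader to cope with a non-string first column.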
morningman
left a comment
LGTM
|
run buildall |
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
|
TeamCity cloud ut coverage result: |
BE UT Coverage Report: Increment line coverage (Increment coverage report)
|
morningman
left a comment
LGTM
…JsonSerDe (apache#49209)

### What problem does this PR solve?

Problem Summary:
Initial support for Hive `org.openx.data.jsonserde.JsonSerDe` (https://github.com/rcongiu/Hive-JSON-Serde).
The specific behavior of read is similar to pr apache#43469.
By referring to the description in the link, here are some explanations:

Support:
1. Querying Complex Fields
2. Importing Malformed Data (serde prop: ignore.malformed.json)

Not supported, this parameter will not affect the query results:
1. dots.in.keys
2. Case Sensitivity in mappings
3. Mapping Hive Keywords

Not supported, but will report an error:
1. Using Arrays
2. Promoting a Scalar to an Array

error : [DATA_QUALITY_ERROR]JSON data is array-object, `strip_outer_array` must be TRUE.

In order to allow some json strings that do not support parsing to be processed by users, a session variable is introduced: `read_hive_json_in_one_column` (default is false). When this variable is true, a whole line of json is read into the first column, and users can choose to process a whole line of json, such as JSON_PARSE. The data type of the first column of the table needs to be string. Currently only valid for org.openx.data.jsonserde.JsonSerDe.
…onserde.JsonSerDe" (apache#49928) Reverts apache#49209
What problem does this PR solve?
Problem Summary:
Initial support for Hive org.openx.data.jsonserde.JsonSerDe (https://github.com/rcongiu/Hive-JSON-Serde).
The read behavior is similar to PR #43469.
Based on the description in the link above, here is how each feature is handled:
Supported:
1. Querying Complex Fields
2. Importing Malformed Data (serde prop: ignore.malformed.json)
Not supported (the parameter is ignored and does not affect query results):
1. dots.in.keys
2. Case Sensitivity in mappings
3. Mapping Hive Keywords
Not supported, and an error is reported:
1. Using Arrays
2. Promoting a Scalar to an Array
error : [DATA_QUALITY_ERROR]JSON data is array-object, `strip_outer_array` must be TRUE.
To let users process JSON strings that cannot be parsed, a session variable is introduced: read_hive_json_in_one_column (default is false). When this variable is true, each whole line of JSON is read into the first column, and users can then process the line themselves, for example with JSON_PARSE. The first column of the table must be of type string. Currently this is only valid for org.openx.data.jsonserde.JsonSerDe.
Release note
None
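To make the read_hive_json_in_one_column behavior from the problem summary concrete, here is a minimal sketch of the idea: the raw line is placed in the first column without being parsed. The class and method names are hypothetical, and leaving the remaining columns null is an assumption made for the illustration; the description only states that the whole line goes into the first column.

```java
import java.util.Arrays;

// Hypothetical model of one-column mode; not the Doris reader implementation.
final class OneColumnReadSketch {
    private OneColumnReadSketch() {}

    /**
     * Builds a row whose first column holds the unparsed JSON line.
     * Assumption: every other column is left as null.
     */
    static String[] readLineIntoFirstColumn(String rawJsonLine, int columnCount) {
        if (columnCount < 1) {
            throw new IllegalArgumentException("table needs at least one column");
        }
        String[] row = new String[columnCount]; // all slots start as null
        row[0] = rawJsonLine;                   // no JSON parsing happens here
        return row;
    }

    public static void main(String[] args) {
        // A line whose top-level value is an array, which the description lists
        // as unsupported in the normal parsing path.
        String line = "[{\"id\": 1}, {\"id\": 2}]";
        System.out.println(Arrays.toString(readLineIntoFirstColumn(line, 2)));
    }
}
```

With the line preserved as a string, the user can decide how to interpret it in the query, for example with JSON_PARSE as the description suggests.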
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)