-
Notifications
You must be signed in to change notification settings - Fork 3.7k
[feat](refactor-param) add file format configuration #50225
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[feat](refactor-param) add file format configuration #50225
Conversation
|
run buildall |
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
TPC-H: Total hot run time: 34091 ms |
TPC-DS: Total hot run time: 185889 ms |
ClickBench: Total hot run time: 29.62 s |
|
run buildall |
TPC-H: Total hot run time: 34061 ms |
TPC-DS: Total hot run time: 192693 ms |
ClickBench: Total hot run time: 30.12 s |
4a4e8f4 to
00c292e
Compare
|
run buildall |
TPC-H: Total hot run time: 33719 ms |
TPC-DS: Total hot run time: 184971 ms |
ClickBench: Total hot run time: 29.59 s |
|
run buildall |
TPC-H: Total hot run time: 34175 ms |
TPC-DS: Total hot run time: 193610 ms |
ClickBench: Total hot run time: 29.8 s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds file format configuration support by introducing new property classes and comprehensive unit tests for multiple file formats (Parquet, ORC, JSON, CSV, Avro, and WAL).
- New property classes for each file format type have been implemented.
- Extensive unit tests have been added to validate the behavior for valid and invalid property inputs.
Reviewed Changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| ParquetFileFormatPropertiesTest.java | Adds tests for default Parquet property values. |
| OrcFileFormatPropertiesTest.java | Adds tests for ORC file format property analysis. |
| JsonFileFormatPropertiesTest.java | Covers property validations for JSON input configurations. |
| CsvFileFormatPropertiesTest.java | Introduces validations and error handling for CSV settings. |
| AvroFileFormatPropertiesTest.java | Provides a basic test for Avro file format properties. |
| WalFileFormatProperties.java | Implements a skeleton for WAL file format properties. |
| FileFormatProperties and constants | Implements property creation and common defaults. |
Comments suppressed due to low confidence (2)
fe/fe-core/src/main/java/org/apache/doris/datasource/property/fileformat/WalFileFormatProperties.java:53
- The analyzeFileFormatProperties method in WalFileFormatProperties is empty, which could lead to silent failures when unexpected properties are provided. Consider adding an explicit exception or implementation to handle unsupported properties.
public void analyzeFileFormatProperties(Map<String, String> formatProperties, boolean isRemoveOriginProperty) throws AnalysisException {
fe/fe-core/src/main/java/org/apache/doris/datasource/property/fileformat/JsonFileFormatProperties.java:91
- [nitpick] The method name 'setJsonpaths' should follow consistent camelCase naming conventions (e.g., setJsonPaths) to improve readability and maintain consistency.
fileAttributes.setJsonpaths(jsonPaths);
...core/src/main/java/org/apache/doris/datasource/property/fileformat/FileFormatProperties.java
Outdated
Show resolved
Hide resolved
|
|
||
| package org.apache.doris.datasource.property.constants; | ||
|
|
||
| public class JsonProperties { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can merge these class under constants package into classes under fileformat package.
For example, JsonProperties only defines some constants string, which can be placed in JsonFileFormatProperties?
.../src/main/java/org/apache/doris/datasource/property/fileformat/AvroFileFormatProperties.java
Outdated
Show resolved
Hide resolved
...core/src/main/java/org/apache/doris/datasource/property/fileformat/FileFormatProperties.java
Show resolved
Hide resolved
...e/src/main/java/org/apache/doris/datasource/property/fileformat/OrcFileFormatProperties.java
Outdated
Show resolved
Hide resolved
...e/src/main/java/org/apache/doris/datasource/property/fileformat/CsvFileFormatProperties.java
Outdated
Show resolved
Hide resolved
...e/src/main/java/org/apache/doris/datasource/property/fileformat/CsvFileFormatProperties.java
Show resolved
Hide resolved
| @Override | ||
| public void analyzeFileFormatProperties(Map<String, String> formatProperties, boolean isRemoveOriginProperty) | ||
| throws AnalysisException { | ||
| // 这几个json应该移到json checker中 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
English
.../src/main/java/org/apache/doris/datasource/property/fileformat/JsonFileFormatProperties.java
Outdated
Show resolved
Hide resolved
| } | ||
| } | ||
|
|
||
| public static FileFormatProperties createFileFormatProperties(Map<String, String> formatProperties) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sometimes we may use the suffix of a file to guess its format.
For example, the file path is /path/to/1.parquet, so we know that this is a parquet format, even if user does not specify the "file_format" property.
We should handle this.
- create a tool method to infer the file format from path.
- change the signature of this
createFileFormatPropertiestocreateFileFormatProperties(TFileFormatType fileFormat, properties)
And I think without the suffix and user specified "file_format" property, we should provide a "default" format file, eg, csv. But it depends because I am not sure what the previous logic is
|
run buildall |
TPC-H: Total hot run time: 33755 ms |
TPC-DS: Total hot run time: 192764 ms |
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
Issue Number: #50238 Problem Summary: Previously, we refactored the code of the `fileFormat` attribute (#50225). However, we only added the relevant code without modifying the business code. This pull request (PR) modifies the code of the table-valued function (tvf) that is related to the `fileformat` format.
### What problem does this PR solve? Issue Number: apache#50238
…0463) Issue Number: apache#50238 Problem Summary: Previously, we refactored the code of the `fileFormat` attribute (apache#50225). However, we only added the relevant code without modifying the business code. This pull request (PR) modifies the code of the table-valued function (tvf) that is related to the `fileformat` format.
…he#50471) Issue Number: apache#50238 Problem Summary: Previously, we refactored the code of the fileFormat attribute (apache#50225). However, we only added the relevant code without modifying the business code. This pull request modifies the code of the `SELECT INTO OUTFILE` feature that is related to the fileformat.
…pache#50552) Issue Number: apache#50238 Problem Summary: Previously, we refactored the code of the fileFormat attribute (apache#50225). However, we only added the relevant code without modifying the business code. This pull request modifies the code of the `RoutineLoad` feature that is related to the fileformat.
…pache#50882) Issue Number:apache#50238 Problem Summary: Previously, we refactored the code of the fileFormat attribute (apache#50225). However, we only added the relevant code without modifying the business code. This pull request modifies the code of the BrokerLoad feature that is related to the fileformat.
### What problem does this PR solve? Issue Number: apache#50238
…0463) Issue Number: apache#50238 Problem Summary: Previously, we refactored the code of the `fileFormat` attribute (apache#50225). However, we only added the relevant code without modifying the business code. This pull request (PR) modifies the code of the table-valued function (tvf) that is related to the `fileformat` format.
…he#50471) Issue Number: apache#50238 Problem Summary: Previously, we refactored the code of the fileFormat attribute (apache#50225). However, we only added the relevant code without modifying the business code. This pull request modifies the code of the `SELECT INTO OUTFILE` feature that is related to the fileformat.
…ileformat (apache#50552) Issue Number: apache#50238 Problem Summary: Previously, we refactored the code of the fileFormat attribute (apache#50225). However, we only added the relevant code without modifying the business code. This pull request modifies the code of the `RoutineLoad` feature that is related to the fileformat. (cherry picked from commit b3abfab)
…pache#50882) Issue Number:apache#50238 Problem Summary: Previously, we refactored the code of the fileFormat attribute (apache#50225). However, we only added the relevant code without modifying the business code. This pull request modifies the code of the BrokerLoad feature that is related to the fileformat.
What problem does this PR solve?
Issue Number: #50238
Related PR: #xxx
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)