ORC-1740: Avoid the dump tool repeatedly parsing ColumnStatistics
### What changes were proposed in this pull request?
This PR aims to avoid the dump tool repeatedly parsing ColumnStatistics.

### Why are the changes needed?
`org.apache.orc.StripeStatistics#getColumnStatistics` parses the statistics for every column on each call. The dump tools called it in the loop condition and again in the loop body, so the parsing cost grows quickly when a file has many columns.

https://github.com/apache/orc/blob/c38e20d862ce19395558e092dd42033a000fe22d/java/core/src/java/org/apache/orc/StripeStatistics.java#L57-L66
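
To illustrate the cost, here is a minimal, self-contained sketch. It does not use the ORC API: `FakeStripeStats`, `dumpRepeated`, and `dumpHoisted` are hypothetical names invented for this example, and the fake getter only assumes that the real one rebuilds the whole array on every call.

```java
public class HoistDemo {
  /** Hypothetical stand-in for StripeStatistics; counts how often the array is rebuilt. */
  static final class FakeStripeStats {
    private final int columns;
    int parseCalls = 0;

    FakeStripeStats(int columns) {
      this.columns = columns;
    }

    /** Rebuilds the full array on every call (assumed to mirror the real getter's cost). */
    String[] getColumnStatistics() {
      parseCalls++;
      String[] result = new String[columns];
      for (int c = 0; c < columns; c++) {
        result[c] = "stats-" + c;
      }
      return result;
    }
  }

  /** Old pattern: the getter runs in the loop condition and again in the body. */
  static int dumpRepeated(FakeStripeStats ss, StringBuilder out) {
    for (int i = 0; i < ss.getColumnStatistics().length; ++i) {
      out.append(ss.getColumnStatistics()[i]).append('\n');
    }
    return ss.parseCalls;
  }

  /** New pattern: call the getter once and reuse the array. */
  static int dumpHoisted(FakeStripeStats ss, StringBuilder out) {
    String[] columnStatistics = ss.getColumnStatistics();
    for (int i = 0; i < columnStatistics.length; ++i) {
      out.append(columnStatistics[i]).append('\n');
    }
    return ss.parseCalls;
  }

  public static void main(String[] args) {
    int columns = 1000;
    int repeated = dumpRepeated(new FakeStripeStats(columns), new StringBuilder());
    int hoisted = dumpHoisted(new FakeStripeStats(columns), new StringBuilder());
    System.out.println("getter calls, repeated: " + repeated);
    System.out.println("getter calls, hoisted:  " + hoisted);
  }
}
```

With 1,000 columns, the repeated form calls the getter 2,001 times (1,001 condition checks plus 1,000 body reads), while the hoisted form calls it once. Both produce the same output.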

### How was this patch tested?
Local testing and existing unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #1972 from cxzl25/ORC-1740.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: William Hyun <william@apache.org>
cxzl25 authored and dongjoon-hyun committed Jul 11, 2024
1 parent a7e1068 commit 6d478cb
Showing 2 changed files with 6 additions and 4 deletions.
5 changes: 3 additions & 2 deletions java/tools/src/java/org/apache/orc/tools/FileDump.java
@@ -357,9 +357,10 @@ private static void printMetaDataImpl(final String filename,
 for (int n = 0; n < stripeStats.size(); n++) {
   System.out.println(" Stripe " + (n + 1) + ":");
   StripeStatistics ss = stripeStats.get(n);
-  for (int i = 0; i < ss.getColumnStatistics().length; ++i) {
+  ColumnStatistics[] columnStatistics = ss.getColumnStatistics();
+  for (int i = 0; i < columnStatistics.length; ++i) {
     System.out.println(" Column " + i + ": " +
-        ss.getColumnStatistics()[i].toString());
+        columnStatistics[i].toString());
   }
 }
 ColumnStatistics[] stats = reader.getStatistics();
5 changes: 3 additions & 2 deletions java/tools/src/java/org/apache/orc/tools/JsonFileDump.java
@@ -112,10 +112,11 @@ public static void printJsonMetaData(List<String> files,
 writer.name("stripeNumber").value(n + 1);
 StripeStatistics ss = stripeStatistics.get(n);
 writer.name("columnStatistics").beginArray();
-for (int i = 0; i < ss.getColumnStatistics().length; i++) {
+ColumnStatistics[] columnStatistics = ss.getColumnStatistics();
+for (int i = 0; i < columnStatistics.length; i++) {
   writer.beginObject();
   writer.name("columnId").value(i);
-  writeColumnStatistics(writer, ss.getColumnStatistics()[i]);
+  writeColumnStatistics(writer, columnStatistics[i]);
   writer.endObject();
 }
 writer.endArray();