Skip to content

Conversation

@tmater
Copy link
Contributor

@tmater tmater commented Nov 18, 2025

Fix NameMapping loss in ParquetUtil.footerMetrics

Summary

Fixed a bug where ParquetUtil.footerMetrics was losing field IDs when using NameMapping, resulting in empty metrics for Parquet files without embedded field IDs.

Background

When footerMetrics is called with a NameMapping, it applies the mapping to get field IDs via getParquetTypeWithIds(), but then passed the original MessageType to ParquetMetrics.metrics. Later in the metrics() call, field IDs are extracted from the MessageType via type.getColumnDescription().getPrimitiveType().getId(), which returns null for the original MessageType without IDs, causing all metrics to be skipped.

Changes

  • Pass parquetTypeWithIds to ParquetMetrics.metrics to preserve field IDs from NameMapping
  • Removed unused messageType variable

Testing

  • Added testFooterMetricsWithNameMappingForFileWithoutIds that verifies metrics are keyed by field IDs from NameMapping

When footerMetrics was called with a NameMapping, it correctly applied
the mapping to get field IDs but then passed the wrong MessageType to
ParquetMetrics.metrics, causing the IDs to be lost. This resulted in
empty metrics for Parquet files without embedded field IDs.

Fixed by passing parquetTypeWithIds instead of the original messageType.
.optionalString("data")
.endRecord();

ParquetWriter<GenericData.Record> writer =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: you can use try-with-resource here too.

.endRecord();

ParquetWriter<GenericData.Record> writer =
AvroParquetWriter.<GenericData.Record>builder(new org.apache.hadoop.fs.Path(file.toURI()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

builder(Path file) is deprecated. Can we builder(OutputFile file) instead? You can refer to #14620

reader.getFooter(), Stream.empty(), MetricsConfig.getDefault(), nameMapping);

// The key assertion: column sizes should be keyed by field IDs from NameMapping
assertThat(metrics.columnSizes()).containsKeys(1, 2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assertion passes even when the metrics.columnSizes() contains additional keys. Could you use containsOnlyKeys instead?

@tmater tmater requested review from ebyhr and nandorKollar November 19, 2025 10:31
Copy link
Contributor

@huaxingao huaxingao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@huaxingao huaxingao merged commit a3c538f into apache:main Nov 25, 2025
44 checks passed
@huaxingao
Copy link
Contributor

Thanks @tmater for the PR. Thanks @ebyhr @nandorKollar for the review!

@tmater
Copy link
Contributor Author

tmater commented Nov 25, 2025

Thanks @huaxingao , @nandorKollar, @ebyhr for the review!

@tmater tmater deleted the footermetrics_namemapping branch November 25, 2025 05:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants