Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: DH-18174: Delay reading from parquet file when creating table and column location #6606

Merged

Conversation

malhotrashivam
Copy link
Contributor

@devinrsmith devinrsmith requested a review from rcaudy January 30, 2025 22:14
devinrsmith
devinrsmith previously approved these changes Jan 30, 2025
Copy link
Member

@rcaudy rcaudy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please double check that we're not going to NPE because of a missed initialization.

Copy link
Member

@rcaudy rcaudy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.

@rcaudy rcaudy modified the milestones: 0.38.0, 0.37.5 Feb 3, 2025
@rcaudy
Copy link
Member

rcaudy commented Feb 3, 2025

I'm concerned about a potential side effect of this change, but I think we're OK. @malhotrashivam please confirm you've tested coalescing a table with many Parquet files that doesn't use _metadata and _common_metadata files, and that there is no performance regression there.

SourceTable has the following:

    private void maybeAddLocations(@NotNull final Collection<LiveSupplier<ImmutableTableLocationKey>> locationKeys) {
        if (locationKeys.isEmpty()) {
            return;
        }
        filterLocationKeys(locationKeys)
                .parallelStream()
                .forEach(lk -> columnSourceManager.addLocation(locationProvider.getTableLocation(lk.get())));
    }

Note the parallelStream() call.

Now, I expect that we're reading Parquet file metadata under RegionedColumnSourceManager.initialize(), which delegates to ``RegionedColumnSourceManager.update(true), which uses a parallel stream when checking empty entries for existence (see the true` last argument to `StreamSupport.stream`):

        final Collection<EmptyTableLocationEntry> entriesToInclude = StreamSupport.stream(Spliterators.spliterator(
                                emptyTableLocations.iterator(),
                                emptyTableLocations.size(),
                                Spliterator.IMMUTABLE | Spliterator.DISTINCT | Spliterator.NONNULL),
                        true) //@formatter:on
                .peek(EmptyTableLocationEntry::refresh)
                .filter((final EmptyTableLocationEntry emptyEntry) -> {

The first call to EmptyTableLocationEntry.refresh is going to cause the actual TableLocation.refresh/TableLocation.subscribe call, which will force initialization.

@malhotrashivam
Copy link
Contributor Author

From what I can see in my test, EmptyTableLocationEntry.refresh calls ParquetTableLocation.refresh which is an empty method. The actual reading of parquet metadata happens in

final RowSet locationRowSet = emptyEntry.location.getRowSet();

which calls AbstractTableLocation::getRowSet and then ParquetTableLocation::initialize

@rcaudy
Copy link
Member

rcaudy commented Feb 4, 2025

From what I can see in my test, EmptyTableLocationEntry.refresh calls ParquetTableLocation.refresh which is an empty method. The actual reading of parquet metadata happens in

final RowSet locationRowSet = emptyEntry.location.getRowSet();

which calls AbstractTableLocation::getRowSet and then ParquetTableLocation::initialize

And the getRowSet() is also on a parallel stream, so we should be good here.
Since you said you're testing agreed, let's declare victory.

@malhotrashivam malhotrashivam merged commit 8edb2b1 into deephaven:main Feb 4, 2025
17 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Feb 4, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
feature request New feature or request NoDocumentationNeeded ReleaseNotesNeeded Release notes are needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants