Adds is_refreshing support for parquet read #3600

devinrsmith · 2023-03-24T22:03:01Z

Additionally, applies black formatting to parquet.py, and plumbs target_page_size from #2555.

Fixes #3596

rcaudy

I think we need additional changes:

io.deephaven.parquet.table.layout.KeyValuePartitionLayout is inferring partitioning column types from the set of values encountered. This will create issues if we change our mind about types on a subsequent scan. We should consider whether we can look for this scenario and provide a better error message.
io.deephaven.parquet.table.layout.ParquetKeyValuePartitionedLayout and io.deephaven.parquet.table.layout.ParquetFlatPartitionedLayout should probably record the location keys they have created in a map by path, in order to avoid recreating them. This will reduce some of the costs associated with repeat scans.
We should check for trailing magic numbers and refuse to create ParquetTableLocationKey instances for incomplete files.
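
The second suggestion, recording location keys in a map by path so repeat scans do not recreate them, might look roughly like the sketch below. It is not the PR's implementation: `CachingKeyFinder` and `LocationKey` are hypothetical stand-ins for the layout and ParquetTableLocationKey classes, and directory walking is reduced to a flat listing of `*.parquet` files.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Hypothetical stand-in for a file-backed table location key.
final class LocationKey {
    final Path path;

    LocationKey(Path path) {
        this.path = path;
    }
}

// Hypothetical flat-directory key finder that creates each key at most once.
final class CachingKeyFinder {
    private final Path directory;
    private final Map<Path, LocationKey> cache = new ConcurrentHashMap<>();
    int created = 0; // counts constructions, for demonstration only

    CachingKeyFinder(Path directory) {
        this.directory = directory;
    }

    // Reports every *.parquet file currently present; keys from earlier scans are reused.
    void findKeys(Consumer<LocationKey> observer) {
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(directory, "*.parquet")) {
            for (Path file : stream) {
                observer.accept(cache.computeIfAbsent(file, p -> {
                    created++;
                    return new LocationKey(p);
                }));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A second call to `findKeys` hands back the same `LocationKey` instances for files it has already seen, so only newly arrived files pay the construction cost.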

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetTools.java

rcaudy · 2023-03-25T02:29:31Z

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetTools.java

@@ -513,8 +526,8 @@ public static Table readPartitionedTableInferSchema(
 getUnboxedTypeIfBoxed(partitionValue.getClass()), null, ColumnDefinition.ColumnType.Partitioning));
 }
 allColumns.addAll(schemaInfo.getFirst());
- return readPartitionedTable(recordingLocationKeyFinder, schemaInfo.getSecond(),
- TableDefinition.of(allColumns));
+ return readPartitionedTable(readInstructions.isRefreshing() ? locationKeyFinder : initialKeys,


This is a little unfortunate if it causes us to scan for keys an extra time initially. I suppose it depends on the key finder implementation from a performance standpoint.

I think some of these interfaces could be improved. For example, I think there is room for the user to explicitly provide the schema instead of inferring it from the first file.

This situation could also be improved by updating or changing TableLocationKeyFinder - we shouldn't need to read all of the table locations if we only want the first one (for inferring schema).

I think we already have an exposed public method that allows the user to specify. I also think it's not that easy for a user to do.

The key finders were built for that static use case. I agree that we could have a richer interface (findFirstKey, findNewKeys(Set)).
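
For illustration, a richer interface along those lines could be sketched as follows. Only the method names `findFirstKey` and `findNewKeys(Set)` come from the comment above; the signatures, the generic key type, and the `SnapshotKeyFinder` toy implementation are assumptions, not the existing TableLocationKeyFinder API.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical richer finder; only the method names come from the discussion.
interface RicherKeyFinder<K> {
    // Existing style: report every key found on this scan.
    void findKeys(Consumer<K> observer);

    // Enough for schema inference: return the first key, if any.
    Optional<K> findFirstKey();

    // Report only the keys that are not in the caller's known set.
    void findNewKeys(Set<K> known, Consumer<K> observer);
}

// Toy implementation over a supplier of the currently visible keys.
// A real layout would stop walking the file system early in findFirstKey.
final class SnapshotKeyFinder<K> implements RicherKeyFinder<K> {
    private final Supplier<List<K>> snapshot;

    SnapshotKeyFinder(Supplier<List<K>> snapshot) {
        this.snapshot = snapshot;
    }

    @Override
    public void findKeys(Consumer<K> observer) {
        snapshot.get().forEach(observer);
    }

    @Override
    public Optional<K> findFirstKey() {
        return snapshot.get().stream().findFirst();
    }

    @Override
    public void findNewKeys(Set<K> known, Consumer<K> observer) {
        for (K key : snapshot.get()) {
            if (!known.contains(key)) {
                observer.accept(key);
            }
        }
    }
}
```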

devinrsmith · 2023-03-25T03:37:37Z

io.deephaven.parquet.table.layout.KeyValuePartitionLayout is inferring partitioning column types from the set of values encountered. This will create issues if we change our mind about types on a subsequent scan. We should consider whether we can look for this scenario and provide a better error message.

I'm assuming that all TLKs found by an instance of tableLocationKeyFinder.findKeys(...) should have consistent TableLocationKey#getPartitionKeys?

io.deephaven.parquet.table.layout.ParquetKeyValuePartitionedLayout and io.deephaven.parquet.table.layout.ParquetFlatPartitionedLayout should probably record the location keys they have created in a map by path, in order to avoid recreating them. This will reduce some of the costs associated with repeat scans.

I wish there was a better way to do this instead of TableLocationKeyFinder - it seems like we could have a listener based interface or some such that automatically produces / notifies when a new TableLocationKey is created / found. (In the case of file-based TableLocationKey, could be based off of java.nio.file.WatchService... of course, you could still do a polling based impl w/ a map... regardless, the caller-based TableLocationKeyFinder#findKeys seems a bit funky to me.)

devinrsmith · 2023-03-27T18:17:34Z

I'm very surprised that we are washing through string CSV building for io.deephaven.parquet.table.layout.KeyValuePartitionLayout; as such, I think any "costs associated with repeat scans" will be better focused on cleaning that up first?

It's going to be a much larger change if we want to cache and validate the same way for ParquetKeyValuePartitionedLayout

devinrsmith · 2023-03-27T19:14:57Z

I'm not sure if it's "worth it" to cache ParquetFlatPartitionedLayout, seems like premature optimization, but I've gone ahead and implemented it that way.

As far as "should we check magic bytes in header/footer", that potentially adds a lot of extraneous overhead unless managed carefully. In the case of ParquetKeyValuePartitionedLayout, I think there would be a lot more groundwork we'd need to cover to get this working efficiently. I've gone ahead and added the check to ParquetFlatPartitionedLayout.
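
As a rough illustration of the magic-bytes idea (not the code added to ParquetFlatPartitionedLayout), the sketch below relies on the fact that an unencrypted Parquet file begins and ends with the four ASCII bytes `PAR1`. The `ParquetMagicCheck` class is hypothetical, and the sketch ignores the encrypted-footer variant of the format.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper; not part of the Deephaven code base.
final class ParquetMagicCheck {
    // Unencrypted Parquet files begin and end with these four bytes.
    private static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

    // Returns true when both the leading and the trailing magic are present.
    static boolean hasLeadingAndTrailingMagic(Path file) throws IOException {
        try (SeekableByteChannel channel = Files.newByteChannel(file)) {
            long size = channel.size();
            // Anything shorter than magic + 4-byte footer length + magic cannot be complete.
            if (size < 2L * MAGIC.length + 4) {
                return false;
            }
            return Arrays.equals(readAt(channel, 0), MAGIC)
                    && Arrays.equals(readAt(channel, size - MAGIC.length), MAGIC);
        }
    }

    // Reads MAGIC.length bytes starting at the given position.
    private static byte[] readAt(SeekableByteChannel channel, long position) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocate(MAGIC.length);
        channel.position(position);
        while (buffer.hasRemaining()) {
            if (channel.read(buffer) < 0) {
                break; // unexpected end of file; the zero-filled remainder will not match
            }
        }
        return buffer.array();
    }
}
```

The check costs two small reads per file, which is the overhead concern raised above; caching the result per path would keep repeat scans from paying it again.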

rcaudy · 2023-03-28T03:06:10Z

io.deephaven.parquet.table.layout.KeyValuePartitionLayout is inferring partitioning column types from the set of values encountered. This will create issues if we change our mind about types on a subsequent scan. We should consider whether we can look for this scenario and provide a better error message.

I'm assuming that all TLKs found by an instance of tableLocationKeyFinder.findKeys(...) should have consistent TableLocationKey#getPartitionKeys?

io.deephaven.parquet.table.layout.ParquetKeyValuePartitionedLayout and io.deephaven.parquet.table.layout.ParquetFlatPartitionedLayout should probably record the location keys they have created in a map by path, in order to avoid recreating them. This will reduce some of the costs associated with repeat scans.

I wish there was a better way to do this instead of TableLocationKeyFinder - it seems like we could have a listener based interface or some such that automatically produces / notifies when a new TableLocationKey is created / found. (In the case of file-based TableLocationKey, could be based off of java.nio.file.WatchService... of course, you could still do a polling based impl w/ a map... regardless, the caller-based TableLocationKeyFinder#findKeys seems a bit funky to me.)

Yes, we require identical partition layouts across TLKs for the same table. No way for the system to work properly without that, the partitions are columns of the data.

The deeper interfaces expect a TableLocationProvider (TLP) to support a listener model, and the typical listener then pushes into a buffer object that allows polling by the "real" engine constructs. The PollingTableLocationProvider implementation is much simpler than a lot of the other refreshing TLPs that we don't have in community right now. WatchService, in my opinion, is kind of a hot mess; it's lossy, with significant implementation differences across OSes. We can absolutely implement per-layout alternative TLPs that can discover parquet files, but it's likely a premature optimization. In one of my other replies I suggested something like findMoreKeys(Map) as a way to make the key finder interface better.

rcaudy · 2023-03-28T03:09:23Z

I'm very surprised that we are washing through string CSV building for io.deephaven.parquet.table.layout.KeyValuePartitionLayout; as such, I think any "costs associated with repeat scans" will be better focused on cleaning that up first?

No way. I'm very, very happy with using the CSV parser here. It's basically perfect for the purpose, since it supports columnar string->type inference and conversion out of the box. I would hate to have to build that from scratch, and I like standardizing on the CSV parser's decisions. The only reasonable alternative would be to show one column at a time to some lower layer of the parser to avoid constructing the CSV text.

I admit this does present an issue for repeated evaluation. We can't have subsequent find calls pick different types. I don't see that there's a good solution, besides insisting on the types from the first evaluation continuing and complaining if we can't fit new values.

We could certainly support a version where instead of inference, the user specifies partition key names, types, and conversions-from-string.
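
A version with user-specified partition key names, types, and conversions-from-string might be sketched like this. `PartitionColumnSpec` and `PartitionPathParser` are hypothetical names, and the `name=value` directory convention is assumed for the example; nothing here reflects an existing Deephaven API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical user-supplied description of one partitioning column.
final class PartitionColumnSpec<T> {
    final String name;
    final Class<T> type;
    final Function<String, T> fromString;

    PartitionColumnSpec(String name, Class<T> type, Function<String, T> fromString) {
        this.name = name;
        this.type = type;
        this.fromString = fromString;
    }
}

// Hypothetical parser that applies the declared conversions instead of inferring types.
final class PartitionPathParser {
    private final List<PartitionColumnSpec<?>> specs;

    PartitionPathParser(List<PartitionColumnSpec<?>> specs) {
        this.specs = specs;
    }

    // Converts directory names such as "Year=2023" into typed values, in declared order.
    Map<String, Object> parse(String... directoryNames) {
        if (directoryNames.length != specs.size()) {
            throw new IllegalArgumentException(
                    "Expected " + specs.size() + " partition levels, found " + directoryNames.length);
        }
        Map<String, Object> values = new LinkedHashMap<>();
        for (int i = 0; i < directoryNames.length; i++) {
            PartitionColumnSpec<?> spec = specs.get(i);
            String[] parts = directoryNames[i].split("=", 2);
            if (parts.length != 2 || !parts[0].equals(spec.name)) {
                throw new IllegalArgumentException(
                        "Expected " + spec.name + "=<value>, found " + directoryNames[i]);
            }
            // A failing conversion (for example NumberFormatException) propagates to the caller.
            values.put(spec.name, spec.fromString.apply(parts[1]));
        }
        return values;
    }
}
```

Because the types are fixed up front, a later scan cannot pick a different type; a value that does not fit fails at conversion time, which is the "complain if we can't fit new values" behavior described above.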

...main/java/io/deephaven/engine/table/impl/locations/util/ExecutorTableDataRefreshService.java

...able/src/main/java/io/deephaven/engine/table/impl/locations/impl/TableLocationKeyFinder.java

.../src/main/java/io/deephaven/engine/table/impl/locations/impl/TableLocationKeySafetyImpl.java

extensions/parquet/base/src/main/java/io/deephaven/parquet/base/ParquetFileReader.java

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/layout/ParquetFileHelper.java

...quet/table/src/main/java/io/deephaven/parquet/table/layout/ParquetFlatPartitionedLayout.java

rcaudy

Better, but some of the outstanding comments stand.

devinrsmith · 2023-03-30T22:45:02Z

The only new safety check manifests itself if all the partitions change values at once (otherwise, there is internal logic in io.deephaven.parquet.table.layout.KeyValuePartitionLayout that ensures no inconsistencies).

io.deephaven.engine.table.impl.locations.TableDataException: PollingTableLocationProvider[StandaloneTableKey] has produced an inconsistent TableLocationKey with unexpected partition keys. expected=[TimestampHour2] actual=[TimestampHour].
        at io.deephaven.engine.table.impl.locations.impl.AbstractTableLocationProvider.verifyPartitionKeys(AbstractTableLocationProvider.java:212)
        at io.deephaven.engine.table.impl.locations.impl.AbstractTableLocationProvider.handleTableLocationKey(AbstractTableLocationProvider.java:113)
        at io.deephaven.parquet.table.layout.KeyValuePartitionLayout.lambda$findKeys$0(KeyValuePartitionLayout.java:163)
        at io.deephaven.engine.rowset.RowSet.lambda$forAllRowKeys$0(RowSet.java:306)
        at io.deephaven.engine.rowset.impl.singlerange.SingleRange.ixForEachLong(SingleRange.java:105)
        at io.deephaven.engine.rowset.impl.WritableRowSetImpl.forEachRowKey(WritableRowSetImpl.java:339)
        at io.deephaven.engine.rowset.RowSet.forAllRowKeys(RowSet.java:305)
        at io.deephaven.parquet.table.layout.KeyValuePartitionLayout.findKeys(KeyValuePartitionLayout.java:159)
        at io.deephaven.engine.table.impl.locations.impl.PollingTableLocationProvider.refresh(PollingTableLocationProvider.java:51)
        at io.deephaven.engine.table.impl.locations.util.ExecutorTableDataRefreshService$ScheduledTableLocationProviderRefresh.refresh(ExecutorTableDataRefreshService.java:121)
        at io.deephaven.engine.table.impl.locations.util.ExecutorTableDataRefreshService$ScheduledSubscriptionTask.doRefresh(ExecutorTableDataRefreshService.java:87)

.../table/src/main/java/io/deephaven/parquet/table/layout/ParquetKeyValuePartitionedLayout.java

rcaudy · 2023-03-31T03:54:21Z

...c/main/java/io/deephaven/engine/table/impl/locations/impl/AbstractTableLocationProvider.java

+ private void verifyPartitionKeys(@NotNull TableLocationKey locationKey) {
+ if (partitionKeys == null) {
+ partitionKeys = new ArrayList<>(locationKey.getPartitionKeys());
+ } else if (!equals(partitionKeys, locationKey.getPartitionKeys())) {


This is an improvement. We should consider also verifying the types of the partition values. It's a little bit tricky, though, since we'd want to allow for assignable types based on the schema (either inferred or supplied).
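
One way the type check could work is sketched below, under the assumption that the schema supplies a declared type per partitioning column. `PartitionSchemaVerifier` is a hypothetical class, not the code in AbstractTableLocationProvider; it checks the partition names in order and then accepts any value whose class is assignable to the declared type.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical verifier; names and types are pinned by a supplied (or previously inferred) schema.
final class PartitionSchemaVerifier {
    // Column name -> declared type, in partition order.
    private final Map<String, Class<?>> schema;

    PartitionSchemaVerifier(LinkedHashMap<String, Class<?>> schema) {
        this.schema = new LinkedHashMap<>(schema);
    }

    // Throws if the key's partitions differ in name, order, or value type from the schema.
    void verify(Map<String, ?> partitions) {
        List<String> expected = new ArrayList<>(schema.keySet());
        List<String> actual = new ArrayList<>(partitions.keySet());
        if (!expected.equals(actual)) {
            throw new IllegalStateException(
                    "Unexpected partition keys. expected=" + expected + " actual=" + actual);
        }
        for (Map.Entry<String, ?> entry : partitions.entrySet()) {
            Class<?> declared = schema.get(entry.getKey());
            Object value = entry.getValue();
            // Assignability rather than equality, so an Integer fits a Number column.
            if (value != null && !declared.isAssignableFrom(value.getClass())) {
                throw new IllegalStateException("Partition " + entry.getKey() + " expected "
                        + declared.getName() + " but found " + value.getClass().getName());
            }
        }
    }
}
```

The name comparison is order-sensitive, so callers passing more than one partition should use an insertion-ordered map such as `LinkedHashMap`.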

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetTools.java

jmao-denver · 2023-03-31T14:59:44Z

py/server/deephaven/parquet.py

+ _JParquetTools.writeTable(
+ table.j_table, _JFile(path), write_instructions
+ )


I think it was formatting like that that turned @chipkent off when I first started using Black.

jmao-denver

I personally quite like Black's strongly opinionated style. But if we decide to use it, we should use it across the entire Python code base. It is best that we reach an agreement as a team, follow up with an initial reformatting of everything, and put the necessary enforcement in place for PRs.

devinrsmith · 2023-03-31T16:04:10Z

Created #3638

py/server/deephaven/parquet.py

jmao-denver

Some questions about parameter default values.

jmao-denver

The Python changes look good to me.

deephaven-internal · 2023-03-31T17:21:31Z

Labels indicate documentation is required. Issues for documentation have been opened:

How-to: https://github.com/deephaven/deephaven.io/issues/2397
Conceptual: https://github.com/deephaven/deephaven.io/issues/2396
Reference: https://github.com/deephaven/deephaven.io/issues/2395
Blog: https://github.com/deephaven/deephaven.io/issues/2394

Adds is_refreshing support for parquet read

bc1a2aa

Additionaly, applies black formatting to parquet.py. Fixes deephaven#3596

devinrsmith added parquet Related to the Parquet integration DocumentationNeeded ReleaseNotesNeeded Release notes are needed labels Mar 24, 2023

devinrsmith added this to the Mar 2023 milestone Mar 24, 2023

devinrsmith requested a review from chipkent as a code owner March 24, 2023 22:03

devinrsmith self-assigned this Mar 24, 2023

devinrsmith requested review from jmao-denver and rcaudy as code owners March 24, 2023 22:03

Plumbed additional target_page_size option

692a2ef

rcaudy reviewed Mar 25, 2023

Add io.deephaven.engine.table.impl.locations.impl.TableLocationKeyFin…

f57d1b6

…der#safetyCheck

devinrsmith requested a review from rcaudy March 25, 2023 04:28

devinrsmith added 2 commits March 27, 2023 11:05

Improve flat partitioned walking

5ca63e8

Add parquet flat partitioned cache

fc26901

devinrsmith added 2 commits March 27, 2023 11:22

Make fileNameMatches accessible

fe18e6f

Add metadata checking for ParquetFlatPartitionedLayout

6ca0b1e

It's going to be a much larger change if we want to cache and validate the same way for ParquetKeyValuePartitionedLayout

rcaudy reviewed Mar 28, 2023

Review responses

fa46cbf

rcaudy reviewed Mar 30, 2023

devinrsmith added 2 commits March 29, 2023 22:13

Easy stuff

321af7a

Create a PTLK path that creates a ParquetFileReader and throws IOExce…

6803aea

…ption instead of TableDataException

devinrsmith requested a review from rcaudy March 30, 2023 05:28

Add ParquetTableLocationKey#verifyFileReader, move safety code into A…

d789939

…bstractTableLocationProvider

rcaudy reviewed Mar 31, 2023

Review cleanup

e9184eb

jmao-denver reviewed Mar 31, 2023

rcaudy previously approved these changes Mar 31, 2023

Undo opinionated black changes

b2de176

devinrsmith dismissed rcaudy’s stale review via b2de176 March 31, 2023 15:50

devinrsmith requested a review from jmao-denver March 31, 2023 15:50

jmao-denver reviewed Mar 31, 2023

jmao-denver reviewed Mar 31, 2023

jmao-denver requested changes Mar 31, 2023

Review responses

ab085fd

devinrsmith requested a review from jmao-denver March 31, 2023 16:32

jmao-denver approved these changes Mar 31, 2023

devinrsmith merged commit a94110e into deephaven:main Mar 31, 2023

devinrsmith deleted the refreshing-parquet branch March 31, 2023 17:21

github-actions bot locked and limited conversation to collaborators Mar 31, 2023
