Adds is_refreshing support for parquet read #3600
Additionally, applies black formatting to parquet.py. Fixes deephaven#3596
I think we need additional changes:
- io.deephaven.parquet.table.layout.KeyValuePartitionLayout is inferring partitioning column types from the set of values encountered. This will create issues if we change our mind about types on a subsequent scan. We should consider whether we can look for this scenario and provide a better error message.
- io.deephaven.parquet.table.layout.ParquetKeyValuePartitionedLayout and io.deephaven.parquet.table.layout.ParquetFlatPartitionedLayout should probably record the location keys they have created in a map by path, in order to avoid recreating them. This will reduce some of the costs associated with repeat scans.
- We should check for trailing magic numbers and refuse to create ParquetTableLocationKey instances for incomplete files.
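As an illustration of the last point, here is a minimal Python sketch of a completeness check, not the Java code in this PR. It relies on the Parquet format placing the 4-byte magic `PAR1` at both the start and the end of an unencrypted file; the function name `looks_complete` is hypothetical, and files with an encrypted footer (which use a different magic) are not handled.

```python
import os

# 4-byte marker at the start and end of an unencrypted Parquet file.
PARQUET_MAGIC = b"PAR1"


def looks_complete(path: str) -> bool:
    """Return True if the file both starts and ends with the Parquet magic bytes.

    Hypothetical helper for illustration only; a file still being written
    will normally be missing the trailing magic.
    """
    size = os.path.getsize(path)
    # Lower bound: leading magic + 4-byte footer length + trailing magic.
    if size < 12:
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

A scanner could skip any path for which this returns False and pick the file up on a later scan once the writer has finished.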
extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetTools.java
@@ -513,8 +526,8 @@ public static Table readPartitionedTableInferSchema(
 getUnboxedTypeIfBoxed(partitionValue.getClass()), null, ColumnDefinition.ColumnType.Partitioning));
 }
 allColumns.addAll(schemaInfo.getFirst());
-return readPartitionedTable(recordingLocationKeyFinder, schemaInfo.getSecond(),
-TableDefinition.of(allColumns));
+return readPartitionedTable(readInstructions.isRefreshing() ? locationKeyFinder : initialKeys,
This is a little unfortunate if it causes us to scan for keys an extra time initially. I suppose it depends on the key finder implementation from a performance standpoint.
I think some of these interfaces could be improved. For example, I think there is room for the user to explicitly provide the schema instead of inferring it from the first file.
This situation could also be improved by updating or changing TableLocationKeyFinder - we shouldn't need to read all of the table locations if we only want the first one (for inferring schema).
I think we already have an exposed public method that allows the user to specify the schema. I also think it's not that easy for a user to do.
The key finders were built for that static use case. I agree that we could have a richer interface (findFirstKey, findNewKeys(Set)).
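To make the idea of a richer finder interface concrete, here is a rough Python sketch, assuming a directory of `*.parquet` files. The names `LocationKey`, `CachingKeyFinder`, `find_first_key`, and `find_new_keys` are hypothetical stand-ins modeled on the method names mentioned above, not Deephaven's Java interfaces. It also keeps one key object per path, in the spirit of the earlier suggestion to record created keys in a map.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, List, Optional, Set


@dataclass(frozen=True)
class LocationKey:
    """Hypothetical stand-in for a table location key, identified by file path."""
    path: Path


class CachingKeyFinder:
    """Hypothetical finder that reuses one key object per discovered path."""

    def __init__(self, root: str) -> None:
        self._root = Path(root)
        self._keys_by_path: Dict[Path, LocationKey] = {}

    def _key_for(self, path: Path) -> LocationKey:
        # Create a key only the first time a path is seen.
        key = self._keys_by_path.get(path)
        if key is None:
            key = LocationKey(path)
            self._keys_by_path[path] = key
        return key

    def find_first_key(self) -> Optional[LocationKey]:
        # Stop at the first match instead of listing every location,
        # which is all that schema inference needs.
        for path in self._root.rglob("*.parquet"):
            return self._key_for(path)
        return None

    def find_new_keys(self, known: Set[LocationKey]) -> List[LocationKey]:
        # Return only keys the caller has not already seen.
        found = (self._key_for(p) for p in self._root.rglob("*.parquet"))
        return [key for key in found if key not in known]
```

With this shape, a static read can call `find_new_keys(set())` once, while a refreshing read can pass the keys it already holds on each rescan.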
I'm assuming that all TLKs found by an instance of
I wish there was a better way to do this instead of
I'm very surprised that we are washing through string CSV building for
It's going to be a much larger change if we want to cache and validate the same way for ParquetKeyValuePartitionedLayout
I'm not sure if it's "worth it" to cache
As far as "should we check magic bytes in header/footer", that potentially adds a lot of extraneous overhead unless managed carefully. In the case of
Yes, we require identical partition layouts across TLKs for the same table. No way for the system to work properly without that, the partitions are columns of the data. The deeper interfaces expect a
No way. I'm very, very happy with using the CSV parser here. It's basically perfect for the purpose, since it supports columnar string->type inference and conversion out of the box. I would hate to have to build that from scratch, and I like standardizing on the CSV parser's decisions. The only reasonable alternative would be to show one column at a time to some lower layer of the parser to avoid constructing the CSV text.
I admit this does present an issue for repeated evaluation. We can't have subsequent
We could certainly support a version where instead of inference, the user specifies partition key names, types, and conversions-from-string.
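The repeated-evaluation concern can be shown with a toy Python sketch. The inference rule here (try `int`, then `float`, else `str`) is a deliberate simplification and an assumption; it is not the CSV parser's actual inference logic, and both function names are hypothetical. The point is only that a type settled on the first scan can be contradicted by values found later, and that this case can be reported clearly.

```python
from typing import Iterable


def infer_type(values: Iterable[str]) -> type:
    """Pick the narrowest of int, float, str that parses every value (toy rule)."""
    values = list(values)
    for candidate in (int, float):
        try:
            for v in values:
                candidate(v)
            return candidate
        except ValueError:
            continue
    return str


def check_rescan(column: str, established: type, new_values: Iterable[str]) -> None:
    """Raise a descriptive error if a later scan finds values that do not
    parse as the type inferred on the first scan."""
    if established is str:
        return  # every string is acceptable for a str column
    for v in new_values:
        try:
            established(v)
        except ValueError:
            raise ValueError(
                f"Partition column {column!r} was inferred as "
                f"{established.__name__}, but a later scan found value {v!r}"
            ) from None
```

For example, a column first seen with values `"2022"` and `"2023"` is inferred as `int`; a later directory named with a non-numeric value then fails `check_rescan` with a message naming the column and the offending value.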
...main/java/io/deephaven/engine/table/impl/locations/util/ExecutorTableDataRefreshService.java
...able/src/main/java/io/deephaven/engine/table/impl/locations/impl/TableLocationKeyFinder.java
.../src/main/java/io/deephaven/engine/table/impl/locations/impl/TableLocationKeySafetyImpl.java
extensions/parquet/base/src/main/java/io/deephaven/parquet/base/ParquetFileReader.java
extensions/parquet/table/src/main/java/io/deephaven/parquet/table/layout/ParquetFileHelper.java
...quet/table/src/main/java/io/deephaven/parquet/table/layout/ParquetFlatPartitionedLayout.java
Better, but some of the outstanding comments stand.
…ption instead of TableDataException
…bstractTableLocationProvider
The only new safety check manifests itself if all the partitions change values at once (otherwise, there is internal logic in
.../table/src/main/java/io/deephaven/parquet/table/layout/ParquetKeyValuePartitionedLayout.java
private void verifyPartitionKeys(@NotNull TableLocationKey locationKey) {
    if (partitionKeys == null) {
        partitionKeys = new ArrayList<>(locationKey.getPartitionKeys());
    } else if (!equals(partitionKeys, locationKey.getPartitionKeys())) {
This is an improvement. We should consider also verifying the types of the partition values. It's a little bit tricky, though, since we'd want to allow for assignable types based on the schema (either inferred or supplied).
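A rough Python sketch of what combined key and type verification might look like follows. The class `PartitionVerifier` is hypothetical, keys are compared in order (an assumption about what the Java `equals` helper does), and `isinstance` stands in for Java assignability against the inferred or supplied schema.

```python
from typing import Mapping, Optional, Tuple


class PartitionVerifier:
    """Hypothetical check that every location has the same partition keys
    and values assignable to the schema's column types."""

    def __init__(self, schema_types: Optional[Mapping[str, type]] = None) -> None:
        self._keys: Optional[Tuple[str, ...]] = None
        # Column name -> expected type; columns not listed are not type-checked.
        self._schema_types = dict(schema_types or {})

    def verify(self, partitions: Mapping[str, object]) -> None:
        keys = tuple(partitions)
        if self._keys is None:
            # The first location establishes the expected keys.
            self._keys = keys
        elif keys != self._keys:
            raise ValueError(
                f"Partition keys {keys} do not match established keys {self._keys}"
            )
        for name, value in partitions.items():
            expected = self._schema_types.get(name)
            # isinstance accepts subclasses, a loose analogue of assignability.
            if expected is not None and not isinstance(value, expected):
                raise TypeError(
                    f"Partition {name!r} has value {value!r} of type "
                    f"{type(value).__name__}, expected {expected.__name__}"
                )
```

Keeping the type check driven by the schema rather than by the first location's values avoids rejecting a later value merely because it is a different, but still assignable, type.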
py/server/deephaven/parquet.py
_JParquetTools.writeTable(
    table.j_table, _JFile(path), write_instructions
)
I think it was formatting like that that turned @chipkent off when I first started using Black.
I personally quite like Black's strong opinionated style. But if we decide to use it, we should use it across all the Python code base. It is best that we reach an agreement as a team, followed up by an initial reformatting of everything, and have the necessary enforcement in place for PRs.
Created #3638
Some questions about parameter default values.
The Python changes look good to me.
Labels indicate documentation is required. Issues for documentation have been opened: How-to: https://github.com/deephaven/deephaven.io/issues/2397
Additionally, applies black formatting to parquet.py, and plumbs target_page_size from #2555.
Fixes #3596