Spark: Support writing shredded variant in Iceberg-Spark #14297
base: main
Changes from all commits: d903c5b, c570ed8, eeeb35c, 5c0533e, d7e15a7
```diff
@@ -51,7 +51,6 @@ class ParquetWriter<T> implements FileAppender<T>, Closeable {
   private final Map<String, String> metadata;
   private final ParquetProperties props;
   private final CodecFactory.BytesCompressor compressor;
-  private final MessageType parquetSchema;
   private final ParquetValueWriter<T> model;
   private final MetricsConfig metricsConfig;
   private final int columnIndexTruncateLength;
@@ -60,6 +59,7 @@ class ParquetWriter<T> implements FileAppender<T>, Closeable {
   private final Configuration conf;
   private final InternalFileEncryptor fileEncryptor;

+  private MessageType parquetSchema;
   private ColumnChunkPageWriteStore pageStore = null;
   private ColumnWriteStore writeStore;
   private long recordCount = 0;
@@ -134,6 +134,32 @@ private void ensureWriterInitialized() {

   @Override
   public void add(T value) {
+    if (model instanceof WriterLazyInitializable lazy) {
```
Member: This kind of feels to me like we should be making a subclass of ParquetWriter that only takes lazy models, rather than doing a special-case init here based on type matching.

Member: Another possibility: can we create the ParquetWriter after buffering? The current timeline seems a little odd: make the writer, start buffering data, then init the writer. I haven't walked through this, but I would have expected something like: buffer the data, then make the writer.

Contributor (Author): Let me try out such a refactoring.
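
As a rough illustration of the reviewer's suggested ordering, here is a minimal sketch; every name in it is hypothetical and none of this code is from the PR:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of the "buffer data, then make the writer" timeline the
// reviewer describes: the sample is collected before any writer exists, so the
// writer can be constructed once, directly with the inferred schema.
final class BufferThenBuild {
  private BufferThenBuild() {}

  static <T, W> W bufferThenBuild(Iterable<T> sampleRows, Function<List<T>, W> writerFactory) {
    List<T> buffer = new ArrayList<>();
    for (T row : sampleRows) {
      buffer.add(row); // 1. buffer before any Parquet structures exist
    }
    // 2. the factory infers the schema from the sample and builds the writer
    //    exactly once; no stores need to be thrown away and rebuilt
    return writerFactory.apply(buffer);
  }
}
```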
```diff
+      if (lazy.needsInitialization()) {
+        model.write(0, value);
+        recordCount += 1;
+
+        if (!lazy.needsInitialization()) {
+          WriterLazyInitializable.InitializationResult result =
+              lazy.initialize(
+                  props, compressor, rowGroupOrdinal, columnIndexTruncateLength, fileEncryptor);
+          this.parquetSchema = result.getSchema();
+          this.pageStore.close();
+          this.pageStore = result.getPageStore();
+          this.writeStore.close();
+          this.writeStore = result.getWriteStore();
+
+          // Re-initialize the file writer with the new schema
+          ensureWriterInitialized();
+
+          // Buffered rows were already written with endRecord() calls
+          // in the lazy writer's initialization, so we don't call endRecord() here
+          checkSize();
+        }
+        return;
+      }
+    }
+
     recordCount += 1;
     model.write(0, value);
     writeStore.endRecord();
```
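
One subtlety in the lazy path above: needsInitialization() is checked a second time right after the write, because buffering the current value may be exactly what completes the sample. A condensed, hypothetical distillation of that flow follows; the class, threshold, and fields are illustrative stand-ins, not code from the PR:

```java
// Hypothetical distillation of the lazy add() path; SAMPLE_SIZE and the
// fields stand in for whatever policy the lazy model actually implements.
final class LazyAddFlow {
  private static final int SAMPLE_SIZE = 1_000; // assumed buffering threshold
  private int buffered = 0;
  private boolean initialized = false;

  boolean needsInitialization() {
    // true while still sampling; flips to false once the sample is complete
    return !initialized && buffered < SAMPLE_SIZE;
  }

  void add(Object value) {
    if (needsInitialization()) {
      buffered += 1; // model.write(0, value) buffers the row internally
      if (!needsInitialization()) {
        // this write completed the sample: infer the schema, rebuild the
        // stores, and replay the buffer (replay also ends each record)
        initialized = true;
      }
      return; // buffered rows are not individually ended on this path
    }
    // initialized: write through and end the record as usual
  }
}
```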
```diff
@@ -255,6 +281,24 @@ private void startRowGroup() {
   public void close() throws IOException {
     if (!closed) {
       this.closed = true;

+      if (model instanceof WriterLazyInitializable lazy) {
```
Member: This is a little confusing to me. Is this for the case in which our model has buffered data but not yet put the information into a row group? I.e., we are still deciding how to shred, but close has been called?

Contributor (Author): That's right. This handles the special case where only a few rows were written, which is not enough to trigger the schema inference logic before the file is closed, so we need to check here whether initialization has been done.

Contributor (Author): I updated the comment a little bit. Hope it makes sense.
```diff
+        // If initialization was not triggered because too little data was written,
+        // the lazy writer still needs to initialize and process the remaining buffered data
+        if (lazy.needsInitialization()) {
+          WriterLazyInitializable.InitializationResult result =
+              lazy.initialize(
+                  props, compressor, rowGroupOrdinal, columnIndexTruncateLength, fileEncryptor);
+          this.parquetSchema = result.getSchema();
+          this.pageStore.close();
+          this.pageStore = result.getPageStore();
+          this.writeStore.close();
+          this.writeStore = result.getWriteStore();
+
+          ensureWriterInitialized();
+        }
+      }
+
       flushRowGroup(true);
       writeStore.close();
       if (writer != null) {
```
New file: WriterLazyInitializable.java (93 lines)
```java
/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied. See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */
package org.apache.iceberg.parquet;

import org.apache.parquet.column.ColumnWriteStore;
import org.apache.parquet.column.ParquetProperties;
import org.apache.parquet.compression.CompressionCodecFactory;
import org.apache.parquet.crypto.InternalFileEncryptor;
import org.apache.parquet.hadoop.ColumnChunkPageWriteStore;
import org.apache.parquet.schema.MessageType;

/**
 * Interface for ParquetValueWriters that need to defer initialization until they can analyze the
 * data. This is useful for scenarios like variant shredding, where the schema needs to be inferred
 * from the actual data before creating the writer structures.
 *
 * <p>Writers implementing this interface can buffer initial rows and perform schema inference
 * before committing to a final Parquet schema.
 */
public interface WriterLazyInitializable {
  /**
   * Result returned by lazy initialization of a ParquetValueWriter, as required by ParquetWriter.
   * Contains the finalized schema and write stores after schema inference or other initialization
   * logic.
   */
  class InitializationResult {
    private final MessageType schema;
    private final ColumnChunkPageWriteStore pageStore;
    private final ColumnWriteStore writeStore;

    public InitializationResult(
        MessageType schema, ColumnChunkPageWriteStore pageStore, ColumnWriteStore writeStore) {
      this.schema = schema;
      this.pageStore = pageStore;
      this.writeStore = writeStore;
    }

    public MessageType getSchema() {
      return schema;
    }

    public ColumnChunkPageWriteStore getPageStore() {
      return pageStore;
    }

    public ColumnWriteStore getWriteStore() {
      return writeStore;
    }
  }

  /**
   * Checks if this writer still needs initialization. This will return true until the writer has
   * buffered enough data to perform initialization (e.g., schema inference).
   *
   * @return true if initialization is still needed, false if already initialized
   */
  boolean needsInitialization();

  /**
   * Performs initialization and returns the result containing the updated schema and write
   * stores. This method is called once: either as soon as {@link #needsInitialization()} starts
   * returning false because enough rows have been buffered, or from close() if too few rows
   * arrived to complete the sample.
   *
   * @param props Parquet properties needed for creating write stores
   * @param compressor Bytes compressor for compression
   * @param rowGroupOrdinal The ordinal number of the current row group
   * @param columnIndexTruncateLength The column index truncate length from the ParquetWriter
   *     config
   * @param fileEncryptor The file encryptor from the ParquetWriter, may be null if encryption is
   *     disabled
   * @return InitializationResult containing the finalized schema and write stores
   */
  InitializationResult initialize(
      ParquetProperties props,
      CompressionCodecFactory.BytesInputCompressor compressor,
      int rowGroupOrdinal,
      int columnIndexTruncateLength,
      InternalFileEncryptor fileEncryptor);
}
```
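
To make the contract concrete, here is a minimal, hypothetical skeleton of an implementing writer. The class name, the SAMPLE_SIZE threshold, and the elided helper methods are illustrative assumptions, not code from this PR:

```java
package org.apache.iceberg.parquet;

import org.apache.parquet.column.ColumnWriteStore;
import org.apache.parquet.column.ParquetProperties;
import org.apache.parquet.compression.CompressionCodecFactory;
import org.apache.parquet.crypto.InternalFileEncryptor;
import org.apache.parquet.hadoop.ColumnChunkPageWriteStore;
import org.apache.parquet.schema.MessageType;

// Hypothetical sketch, not part of the PR: shows how an implementation could
// satisfy the needsInitialization()/initialize() contract used by ParquetWriter.
class BufferingVariantWriterSketch implements WriterLazyInitializable {
  private static final int SAMPLE_SIZE = 1_000; // assumed buffering threshold

  private int bufferedRows = 0; // incremented as the model buffers values
  private boolean initialized = false;

  @Override
  public boolean needsInitialization() {
    // True while still sampling; flips to false once enough rows are buffered,
    // which is what makes ParquetWriter.add() call initialize().
    return !initialized && bufferedRows < SAMPLE_SIZE;
  }

  @Override
  public InitializationResult initialize(
      ParquetProperties props,
      CompressionCodecFactory.BytesInputCompressor compressor,
      int rowGroupOrdinal,
      int columnIndexTruncateLength,
      InternalFileEncryptor fileEncryptor) {
    // 1. Infer the final (possibly shredded) Parquet schema from the sample.
    MessageType schema = inferSchemaFromBuffer();
    // 2. Build fresh stores for the inferred schema, replacing the ones the
    //    ParquetWriter created up front.
    ColumnChunkPageWriteStore pageStore =
        buildPageStore(schema, props, compressor, rowGroupOrdinal, columnIndexTruncateLength, fileEncryptor);
    ColumnWriteStore writeStore = buildWriteStore(pageStore, props);
    // 3. Replay the buffered rows into the new stores, ending each record, so
    //    ParquetWriter.add() does not call endRecord() for them again.
    replayBuffer(writeStore);
    this.initialized = true;
    return new InitializationResult(schema, pageStore, writeStore);
  }

  // The helpers below are deliberately elided; a real implementation would
  // mirror the store construction in ParquetWriter and re-drive its column
  // writers against the new stores.
  private MessageType inferSchemaFromBuffer() {
    throw new UnsupportedOperationException("sketch only");
  }

  private ColumnChunkPageWriteStore buildPageStore(
      MessageType schema,
      ParquetProperties props,
      CompressionCodecFactory.BytesInputCompressor compressor,
      int rowGroupOrdinal,
      int columnIndexTruncateLength,
      InternalFileEncryptor fileEncryptor) {
    throw new UnsupportedOperationException("sketch only");
  }

  private ColumnWriteStore buildWriteStore(
      ColumnChunkPageWriteStore pageStore, ParquetProperties props) {
    throw new UnsupportedOperationException("sketch only");
  }

  private void replayBuffer(ColumnWriteStore writeStore) {
    throw new UnsupportedOperationException("sketch only");
  }
}
```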
On another part of the diff:

Member: Any way we can avoid having to try/fail here? Just wondering, if it's only for decimals, can we do some decimal-specific check, or is this approach just much simpler?

Contributor (Author): This is simpler. We could probably introduce a DecimalWriter to handle that by checking whether there is a scale mismatch and falling back to writing to the value field. Do you prefer that approach?
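
For reference, a minimal sketch of the kind of decimal-specific check being discussed; the class, method, and parameter names are hypothetical, and in a real implementation the fallback decision would live inside the shredding writer:

```java
import java.math.BigDecimal;

// Hypothetical sketch of a scale-mismatch check: a decimal can only be
// shredded into the typed_value column if it fits the inferred decimal type;
// otherwise the writer would fall back to encoding it in the value field.
final class DecimalShreddingCheck {
  private DecimalShreddingCheck() {}

  static boolean fitsShreddedType(BigDecimal candidate, int inferredPrecision, int inferredScale) {
    // an exact scale match with enough precision is lossless;
    // anything else must fall back to the generic value field
    return candidate.scale() == inferredScale
        && candidate.precision() <= inferredPrecision;
  }
}
```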