🎉 Destination S3: parquet output #3908
Conversation
Force-pushed from bd97fe8 to 176db13
.put("HKD", 10) | ||
.put("NZD", 700) | ||
.put("HKD", 10.0) | ||
.put("NZD", 700.0) |
HKD and NZD are typed as number in the catalog. All other entries have decimals for these two fields. So I'd like to change these values to decimals as well so that the type is consistent.
In Parquet and probably other formats, we need to have strict type mappings, and number is mapped to double. If these two fields have flipping types, I need to do arbitrary conversions to pass the acceptance test, which seems unnecessary.
  } catch (Exception e) {
    return Optional.empty();
  }
}
This is the arbitrary type conversion I was talking about in the comment about changing the integer HKD field to decimals. When the testing data has consistent typing, this conversion can be removed.
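For illustration only, here is the rough shape such a conversion helper could take; the class and method names are hypothetical, not the PR's code.

```java
import java.util.Optional;
import com.fasterxml.jackson.databind.JsonNode;

public class NumericCoercions {

  // Hypothetical helper: try to read any numeric JSON value as a double so that an
  // integer-typed HKD/NZD entry still fits a schema that maps "number" to double.
  // Returns empty when the value is missing or not numeric.
  static Optional<Double> coerceToDouble(JsonNode value) {
    if (value == null || !value.isNumber()) {
      return Optional.empty();
    }
    return Optional.of(value.asDouble());
  }

}
```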
.queueCapacity(S3DestinationConstants.DEFAULT_QUEUE_CAPACITY)
.numUploadThreads(S3DestinationConstants.DEFAULT_UPLOAD_THREADS)
.partSize(S3DestinationConstants.DEFAULT_PART_SIZE_MD);
.numStreams(S3CsvConstants.DEFAULT_NUM_STREAMS)
Why don't we use the same Hadoop S3 uploader here?
Which uploader do you mean by "Hadoop S3 uploader"?
If I am reading the PR correctly, it seems we are using two different ways to push data to S3:
- ParquetWriter
- StreamTransferManager
I am just curious if it is possible to use a similar one from the Hadoop package to push the CSV one.
Got it. The two writers output data in different data structures, and we do need them for different formats. The Parquet writer organizes data in Parquet row groups, while the stream transfer manager writes data line by line.
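To make the contrast concrete, here is a rough sketch of the CSV path, not the PR's exact code; the placeholder values stand in for the constants shown in the diff above. The StreamTransferManager exposes a multipart output stream that the CSV printer writes into, so rows are uploaded to S3 while they are still being produced.

```java
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

import alex.mojaki.s3upload.MultiPartOutputStream;
import alex.mojaki.s3upload.StreamTransferManager;
import com.amazonaws.services.s3.AmazonS3;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVPrinter;

public class CsvUploadSketch {

  static CSVPrinter openCsvUpload(AmazonS3 s3Client, String bucket, String objectKey) throws Exception {
    // Multipart upload that streams parts to S3 while rows are still being written.
    StreamTransferManager uploadManager = new StreamTransferManager(bucket, objectKey, s3Client)
        .numStreams(1)        // placeholder; the PR uses S3CsvConstants.DEFAULT_NUM_STREAMS
        .queueCapacity(1)     // placeholder; S3DestinationConstants.DEFAULT_QUEUE_CAPACITY
        .numUploadThreads(1)  // placeholder; S3DestinationConstants.DEFAULT_UPLOAD_THREADS
        .partSize(5);         // placeholder (MB); S3DestinationConstants.DEFAULT_PART_SIZE_MD

    MultiPartOutputStream outputStream = uploadManager.getMultiPartOutputStreams().get(0);
    // The CSV printer writes records line by line straight into the multipart stream.
    return new CSVPrinter(new PrintWriter(outputStream, true, StandardCharsets.UTF_8), CSVFormat.DEFAULT);
  }

}
```

The Parquet path, which goes through Hadoop instead, is sketched further down next to the AvroParquetWriter discussion.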
Force-pushed from a9790dd to b369dd7
@@ -15,10 +15,18 @@ dependencies {
implementation project(':airbyte-integrations:connectors:destination-jdbc')
implementation files(project(':airbyte-integrations:bases:base-java').airbyteDocker.outputs)

// csv
I appreciate the comments here making it clear what each dependency is for!
Nice work! Appreciate the comments + the extensive tests.
My comments are:
- Minor readability changes.
- Some better commenting to help future OSS contributors; I know there are some who want to contribute other formats.
- Possibility of using an OSS tool to do the Json -> Avro conversion (see the sketch below). Not a blocker; I just thought it would be nice not to write our own tool.
- Possibility of sharing the CsvWriter stream transfer manager construction with the CopyConsumer.
- Possibility of using the PrimitiveJsonSchema class as an enum instead of having a separate listing in the S3 directory.
The last two points are more me thinking out loud; I'm not entirely sure they are good ideas. These changes can be done in follow-up PRs since this one is getting big as is.
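As a concrete example of the OSS route mentioned above, this is roughly how a JSON record can be turned into an Avro record with Allegro's json2avro-converter, which the PR description mentions using for record conversion. The converter class and method are that library's API; the wrapper class and wiring here are illustrative only.

```java
import java.nio.charset.StandardCharsets;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import tech.allegro.schema.json2avro.converter.JsonAvroConverter;

public class JsonToAvroSketch {

  // Convert one JSON record to an Avro GenericData.Record using an already-derived Avro schema.
  static GenericData.Record toAvroRecord(String json, Schema avroSchema) {
    JsonAvroConverter converter = new JsonAvroConverter();
    return converter.convertToGenericDataRecord(json.getBytes(StandardCharsets.UTF_8), avroSchema);
  }

}
```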
if (hasFailed) {
  LOGGER.warn("Failure detected. Aborting upload of stream '{}'...", stream.getName());
  csvPrinter.close();
  outputStream.close();
  uploadManager.abort();
  LOGGER.warn("Upload of stream '{}' aborted.", stream.getName());
} else {
  LOGGER.info("Uploading remaining data for stream '{}'.", stream.getName());
  csvPrinter.close();
  outputStream.close();
  uploadManager.complete();
  LOGGER.info("Upload completed for stream '{}'.", stream.getName());
}
}
Suggested change:
csvPrinter.close();
outputStream.close();
if (hasFailed) {
  LOGGER.warn("Failure detected. Aborting upload of stream '{}'...", stream.getName());
  uploadManager.abort();
  LOGGER.warn("Upload of stream '{}' aborted.", stream.getName());
  return;
}
LOGGER.info("Uploading remaining data for stream '{}'.", stream.getName());
uploadManager.complete();
LOGGER.info("Upload completed for stream '{}'.", stream.getName());
}
just slightly easier to read.
I will move the two close statements before the if check.
Usually I am a fan of early returning. However, given the shortness of the if blocks, it is already pretty readable, and I think early returning would make it slightly more confusing.
import java.util.Map;

/**
 * This helper class tracks whether a Json has special field name that needs to be replaced with a
Nit: can we also add why this is required? As is, I'm not sure whether this is because Parquet expects it this way or because we want to standardise things to make them simpler (it looks like the latter).
It is the former. Parquet only allows these characters in record names: /a-zA-Z0-9_/. Otherwise I wouldn't go through all the trouble to do this. The necessity of this tracker actually complicates things a lot.
Will update the comment to reflect that.
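To make the constraint concrete, here is a minimal sketch of the kind of name standardization this implies; the class is hypothetical, not the PR's code. The tracker quoted above presumably exists so that record fields can later be matched back to these standardized schema names.

```java
import java.util.regex.Pattern;

public class AvroNameSanitizer {

  // Parquet/Avro record and field names may only contain letters, digits, and underscores,
  // so every other character is replaced with an underscore.
  private static final Pattern INVALID_CHARS = Pattern.compile("[^A-Za-z0-9_]");

  static String standardize(String name) {
    return INVALID_CHARS.matcher(name).replaceAll("_");
  }

}
```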
> Parquet only allows these characters in record names: /a-zA-Z0-9_/.

We currently have to deal with some naming conventions because destinations allow different subsets of characters in identifier names. These are dealt with in classes deriving from airbyte-integrations/bases/base-java/src/main/java/io/airbyte/integrations/destination/NamingConventionTransformer.java.
It seems that for S3 Parquet, a difference is that it needs to apply those conventions to field names too. Other destinations only care about conventions for stream names and namespaces (table & schema).
But would it make sense to regroup them in the same kind of class/hierarchy too?
> But would it make sense to regroup them in the same kind of class/hierarchy too?

@ChristopheDuong, can you elaborate on this? What do you mean by "regroup them in the same kind of class"?
> @ChristopheDuong, can you elaborate on this? What do you mean by "regroup them in the same kind of class"?

Should we move some of the string-transformation logic in this code to a class named S3NameTransformer that extends NamingConventionTransformer, like we do with SnowflakeSQLNameTransformer or RedshiftSQLNameTransformer, etc.?
I see.
The name conversion logic for Parquet is exactly the same as the one in StandardNameTransformer. So there is nothing to override. It seems unnecessary to create a new class.
S3ParquetFormatConfig formatConfig = (S3ParquetFormatConfig) config.getFormatConfig();
Configuration hadoopConfig = getHadoopConfig(config);
this.parquetWriter = AvroParquetWriter.<GenericData.Record>builder(HadoopOutputFile.fromPath(path, hadoopConfig))
I am sad we have to rely on Hadoop libraries to do this: https://issues.apache.org/jira/browse/PARQUET-1822
Gah.
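For readers following along, a rough sketch of this Parquet path under stated assumptions: the credential wiring is a placeholder (the PR builds the Configuration in getHadoopConfig from the destination config), while the S3A keys and the parquet-avro builder calls are the standard hadoop-aws / parquet-avro APIs. The Hadoop Configuration and HadoopOutputFile are exactly what forces the Hadoop dependency lamented above.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.hadoop.util.HadoopOutputFile;

public class ParquetUploadSketch {

  static ParquetWriter<GenericData.Record> openParquetWriter(
      String bucket, String objectKey, String accessKeyId, String secretKey, Schema avroSchema) throws Exception {
    // S3A filesystem configuration; the writer streams row groups to S3 through it.
    Configuration hadoopConfig = new Configuration();
    hadoopConfig.set("fs.s3a.access.key", accessKeyId);
    hadoopConfig.set("fs.s3a.secret.key", secretKey);

    Path path = new Path(String.format("s3a://%s/%s", bucket, objectKey));
    // Records are buffered into Parquet row groups and flushed through the S3A filesystem.
    return AvroParquetWriter.<GenericData.Record>builder(HadoopOutputFile.fromPath(path, hadoopConfig))
        .withSchema(avroSchema)
        .withCompressionCodec(CompressionCodecName.UNCOMPRESSED)
        .build();
  }

}
```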
@Override
public void close(boolean hasFailed) throws IOException {
  if (hasFailed) {
Same thought as for the CsvWriter close method: I prefer returning early instead of an else block.
@davinchia, thanks for the code review. I know it's a long PR.
This looks great. I just have one comment.
I would have loved it if somehow we could have reused the Hadoop library. Is there not a CSV version of HadoopOutputFile so that we could have reused a lot of code?
@@ -109,7 +109,8 @@ jobs:
ZENDESK_TALK_TEST_CREDS: ${{ secrets.ZENDESK_TALK_TEST_CREDS }}
ZOOM_INTEGRATION_TEST_CREDS: ${{ secrets.ZOOM_INTEGRATION_TEST_CREDS }}
PLAID_INTEGRATION_TEST_CREDS: ${{ secrets.PLAID_INTEGRATION_TEST_CREDS }}
DESTINATION_S3_INTEGRATION_TEST_CREDS: ${{ secrets.DESTINATION_S3_INTEGRATION_TEST_CREDS }}
DESTINATION_S3_CSV_INTEGRATION_TEST_CREDS: ${{ secrets.DESTINATION_S3_CSV_INTEGRATION_TEST_CREDS }}
I think we can have only one set of credentials and change the config method to read only the relevant information from the credentials.
For instance, this part can be hardcoded in the test class config, i.e. for the CSV test it can be
"format": {
  "format_type": "CSV",
  "flattening": "Root level flattening"
}
and for the Parquet test it can be
"format": { "format_type": "Parquet", "compression_codec": "GZIP" }
and the sensitive information can be populated from the credentials.
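As a sketch of that suggestion, using plain Jackson rather than whatever helpers the connector's tests actually use, and with an illustrative secrets path: load one shared credentials file and overwrite the "format" section per test class.

```java
import java.nio.file.Files;
import java.nio.file.Path;

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;

public class TestConfigSketch {

  // Build the CSV test config from one shared credentials file, hardcoding only the
  // format section. The path and field names are illustrative.
  static ObjectNode csvTestConfig() throws Exception {
    ObjectMapper mapper = new ObjectMapper();
    ObjectNode config = (ObjectNode) mapper.readTree(Files.readString(Path.of("secrets/config.json")));

    ObjectNode format = mapper.createObjectNode();
    format.put("format_type", "CSV");
    format.put("flattening", "Root level flattening");
    config.set("format", format);
    return config;
  }

}
```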
/test connector=connectors/destination-s3
/publish connector=connectors/destination-s3
/publish connector=connectors/destination-s3
Great stuff @tuliren!
Nothing else from me; feel free to merge whenever!
/publish connector=connectors/destination-s3
static List<JsonSchemaType> getTypes(String fieldName, JsonNode typeProperty) {
  if (typeProperty == null) {
    throw new IllegalStateException(String.format("Field %s has no type", fieldName));
BTW, some catalogs produce fields without types, so it's not so uncommon...
For example, source-facebook does this, I think. Does this exception cancel the sync of such catalogs? Should the fields be ignored, or defaulted to a string, for example, instead?
Got it. This is good to know.
In general, a schemaless source is not suitable for Parquet. There is another big problem regarding how we are going to handle additionalProperties whose types are known.
I will submit a follow-up PR to take care of it.
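Purely to illustrate the "default to string" option floated above, and not what the PR or the follow-up actually does: a hypothetical variant of the getTypes method shown earlier. JsonSchemaType.STRING and fromJsonSchemaType are assumed names, not confirmed API of the PR's JsonSchemaType class.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import com.fasterxml.jackson.databind.JsonNode;

// Hypothetical variant: fall back to string instead of failing the sync when a
// catalog field carries no type declaration.
static List<JsonSchemaType> getTypes(String fieldName, JsonNode typeProperty) {
  if (typeProperty == null) {
    return Collections.singletonList(JsonSchemaType.STRING); // assumed enum constant
  }
  List<JsonSchemaType> types = new ArrayList<>();
  if (typeProperty.isArray()) {
    // Union type declaration, e.g. "type": ["null", "number"].
    typeProperty.forEach(t -> types.add(JsonSchemaType.fromJsonSchemaType(t.asText()))); // assumed factory method
  } else {
    types.add(JsonSchemaType.fromJsonSchemaType(typeProperty.asText()));
  }
  return types;
}
```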
Created an issue: #4124
What
- Rename S3OutputFormatter to S3Writer, and add a writer package.
- … BaseS3Writer.
- … util package.
- … S3DestinationAcceptanceTest.
- Use AvroParquetWriter to output Parquet files on S3. hadoop-aws ensures that data is uploaded to S3 while it is generated on the fly.
- Add JsonSchemaConverter to convert JsonSchema to Avro schema (see the illustrative schema below).
- Use json2avro.converter to convert Json objects to Avro records based on the schema.
Recommended reading order
1. spec.json
2. S3ParquetWriter
3. JsonSchemaConverter
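To give a feel for what the JsonSchema-to-Avro conversion referenced above produces, here is an illustrative Avro schema built with Avro's SchemaBuilder for a stream with a string currency field and a number HKD field. The record and field names are examples, and the exact structure the converter emits may differ; number maps to double as discussed in the review.

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class AvroSchemaShapeExample {

  static Schema exampleSchema() {
    // Optional (nullable) fields, with JsonSchema "number" represented as Avro double.
    return SchemaBuilder.record("exchange_rate").fields()
        .optionalString("currency")
        .optionalDouble("HKD")
        .endRecord();
  }

}
```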
Pre-merge Checklist
Expand the checklist which is relevant for this PR.
Connector checklist
- … airbyte_secret in output spec.
- … ./gradlew :airbyte-integrations:connectors:<name>:integrationTest
- … /test connector=connectors/<name> command as documented here is passing.
- … docs/integrations/ directory.
- … /publish command described here