feat(kinesisfirehose): support DeliveryStream record format conversion for S3 Bucket Destination #35410
Conversation
✅ Updated pull request passes all PRLinter validations. Dismissing previous PRLinter review.
I'd consider this to be in a reviewable state now. A few things I am thinking of adding:
- replace IDeserializer and ISerializer with IInputFormat and IOutputFormat
- rename DataFormatConversionSchema to ConversionSchema
- Rename bind() to render(), add validations and remove placeholder types to ParquetOutputFormatProps
- add OrcOutputFormat validations
- create HiveJson and OpenXJson validation/wrapper classes
- add comments to Hive Json props and run linter [skip actions]
- add comments for OpenX JSON and rename ConversionSchema to Schema
- add some more comments
- separate record format conversion changes to different files
- some linting
- add some more validation when record data format conversion is enabled
- initial README changes
- Add comments for parquet props
- add more comments on Schema and Parquet OutputFormat
- added comments for ORC format
- update README with ORC sample props
- update fixture
- add exports
- fix jsii errors
- fix build errors
- fix rosetta errors
- rename core to cdk
- move some validation to s3 destination instantiation
- add unit tests
- add tests for schema iam policy creation
- fix default value being incorrect when data formatting enabled
- fix permission issue and remove invalid DEFAULT timestamp parser
- add integ test
- update permissions
kumvprat left a comment
Thanks for the contribution @Kasra-G
I have added a couple of inline comments.
Since this is for native conversion of input records from one format to another in Firehose, should this be limited to Glue tables as the source?
Maybe it can be applied to any source of Firehose; I don't have full context here, so feel free to share any documentation that you might have.
Yes, the record format conversion can run on any input source for Firehose (data streams, direct PUT, etc.). You can use a Lambda that runs on each record that enters the DeliveryStream, which then converts the input JSON to an output JSON. But this does not affect the file format that Firehose uses to write to S3. The record format conversion is specifically for writing Parquet/ORC files to S3 instead of JSON, and it runs after the Lambda (though the Lambda is optional). This is useful for analytics use cases, since Parquet/ORC files are much more efficient to query than JSONL, and for saving space, since the compression is typically better.
Curious: any thoughts on the Schema class implementation?

```ts
{
  tableName: tableName,
  databaseName: databaseName,
  region: region, // region of the table
  catalogId: catalogId, // this is just the account id of the table
  versionId: versionId, // this can be passed in or default to 'LATEST'
}
```

This is a CloudFormation requirement. If the AWS Glue L2 constructs were in a stable release, we could just do an implementation that follows the RFC:
However, the Glue constructs are not in a stable release, so I cannot do this. Additionally, I can see some abstractions we could provide for format conversion, such as just providing a list of columns and types so that the Glue table/database creation can be done behind the scenes. This is a level of abstraction that even the console doesn't provide, however.
I am thinking of this kind of usage:
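A rough sketch of what that columns-and-types abstraction might look like; Schema.fromColumns, the dataFormatConversion prop, and the format factory names here are hypothetical placeholders, not the API implemented in this PR:

```ts
import * as cdk from 'aws-cdk-lib';
import * as firehose from 'aws-cdk-lib/aws-kinesisfirehose';
import * as s3 from 'aws-cdk-lib/aws-s3';

declare const stack: cdk.Stack;
declare const bucket: s3.IBucket;

// Hypothetical: the schema is declared inline and the Glue database/table
// are created behind the scenes by the construct.
new firehose.DeliveryStream(stack, 'DeliveryStream', {
  destination: new firehose.S3Bucket(bucket, {
    dataFormatConversion: {
      schema: firehose.Schema.fromColumns([
        { name: 'ticker', type: 'string' },
        { name: 'price', type: 'double' },
        { name: 'timestamp', type: 'timestamp' },
      ]),
      inputFormat: firehose.InputFormat.openxJson(),
      outputFormat: firehose.OutputFormat.parquet(),
    },
  }),
});
```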
For the comment: #35410 (comment)
So basically any data source can use this, BUT the main requirement is to use the Glue catalog to store the schema and S3 as the destination; do correct me if I am wrong here.
I think SchemaProperties/SchemaConfiguration would be a better fit, since we are using metadata to identify the schema rather than the schema itself. Overall the class looks good, with the potential renaming.
This is a great idea, but maybe the right place for this would be the Glue package itself rather than Firehose.
I do see other PR builds running fine; I will try updating your branch with main.
Still getting the eslint OOM error; going to retry the build with a merge. I see other PR builds failing with the same error, so I wonder if the Actions worker needs more memory.
Added a comment on the helper class
Rest looks good
kumvprat left a comment
LGTM
Thank you for the work on this
Thank you for contributing! Your pull request will be updated from main and then merged automatically (do not update manually, and be sure to allow changes to be pushed to your fork). |
Issue #
Closes #15501.
Reason for this change
From the Kinesis Firehose RFC, the ability to define record format conversion is still missing:
https://github.com/aws/aws-cdk-rfcs/blob/main/text/0340-firehose-l2.md#record-format-conversion-using-aws-glue
Description of changes
See my comment in the issue thread #15501 (comment) for a few more details about the changes.
These changes add several classes and data models to support Firehose's record format conversion feature with the L2 DeliveryStream construct, as specified in the RFC. The main focus of the design is to allow configurability of the underlying settings while still providing sane defaults for the 99% of users who won't touch the advanced settings.
One note is that the RFC shows usage of the (as of now) alpha glue.Table construct. Since those constructs are not yet in a stable release, we should support glue.CfnTable for now and provide a way to later use the L2 construct whenever that is released, without any breaking changes.
With these changes merged, users can specify record format conversion like so:
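A hedged sketch of the intended usage, pieced together from the changelist above; the dataFormatConversion prop name comes from the permissions section below, while Schema.fromCfnTable, InputFormat.openxJson, and OutputFormat.parquet are assumed names rather than the confirmed API:

```ts
import * as cdk from 'aws-cdk-lib';
import * as firehose from 'aws-cdk-lib/aws-kinesisfirehose';
import * as glue from 'aws-cdk-lib/aws-glue';
import * as s3 from 'aws-cdk-lib/aws-s3';

declare const stack: cdk.Stack;
declare const bucket: s3.IBucket;
declare const glueTable: glue.CfnTable;

// Convert incoming JSON records to Parquet before writing to S3,
// using the schema stored in the Glue Data Catalog table.
new firehose.DeliveryStream(stack, 'DeliveryStream', {
  destination: new firehose.S3Bucket(bucket, {
    dataFormatConversion: {
      schema: firehose.Schema.fromCfnTable(glueTable),
      inputFormat: firehose.InputFormat.openxJson(),
      outputFormat: firehose.OutputFormat.parquet(),
    },
  }),
});
```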
If you want to customize the parameters of the OPENX_JSON input format, specify it like so:
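A hedged example; the option names mirror the CloudFormation OpenXJsonSerDe properties, and InputFormat.openxJson is an assumed factory name from the sketch above:

```ts
import * as firehose from 'aws-cdk-lib/aws-kinesisfirehose';

// Hypothetical OpenX JSON deserializer options (names taken from the CFN OpenXJsonSerDe).
const inputFormat = firehose.InputFormat.openxJson({
  caseInsensitive: true,                        // treat JSON keys as case-insensitive
  convertDotsInJsonKeysToUnderscores: true,     // "a.b" becomes column "a_b"
  columnToJsonKeyMappings: { ts: 'timestamp' }, // map column "ts" to the JSON key "timestamp"
});
```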
If you want to customize the parameters of the HIVE_JSON input format, specify it like so:
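Again a hedged example; timestampFormats is the option the CloudFormation HiveJsonSerDe exposes, and InputFormat.hiveJson is an assumed factory name:

```ts
import * as firehose from 'aws-cdk-lib/aws-kinesisfirehose';

// Hypothetical Hive JSON deserializer options (names taken from the CFN HiveJsonSerDe).
const inputFormat = firehose.InputFormat.hiveJson({
  timestampFormats: ["yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"], // Joda-Time patterns used to parse timestamps
});
```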
If you want to customize the parameters of the PARQUET output format, specify it like so:
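A hedged example; the options mirror the CloudFormation ParquetSerDe properties, and OutputFormat.parquet is an assumed factory name:

```ts
import * as firehose from 'aws-cdk-lib/aws-kinesisfirehose';

// Hypothetical Parquet serializer options (names taken from the CFN ParquetSerDe).
const outputFormat = firehose.OutputFormat.parquet({
  compression: 'SNAPPY',              // SNAPPY, GZIP, or UNCOMPRESSED
  blockSizeBytes: 256 * 1024 * 1024,  // Hadoop block size hint for copying to HDFS
  pageSizeBytes: 1024 * 1024,
  enableDictionaryCompression: true,
  writerVersion: 'V1',
});
```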
If you want to customize the parameters of the ORC output format, specify it like so:
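A hedged example; the options mirror the CloudFormation OrcSerDe properties, and OutputFormat.orc is an assumed factory name:

```ts
import * as firehose from 'aws-cdk-lib/aws-kinesisfirehose';

// Hypothetical ORC serializer options (names taken from the CFN OrcSerDe).
const outputFormat = firehose.OutputFormat.orc({
  compression: 'ZLIB',                        // NONE, ZLIB, or SNAPPY
  stripeSizeBytes: 64 * 1024 * 1024,
  rowIndexStride: 10000,                      // rows between index entries
  bloomFilterColumns: ['ticker'],             // columns to build Bloom filters for
  bloomFilterFalsePositiveProbability: 0.05,
  enablePadding: false,
});
```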
Changelist:
Describe any new or updated permissions being added
The following permissions are added to the S3 Destination role when dataFormatConversion is set:

```json
{
  "Effect": "Allow",
  "Action": [
    "glue:GetTable",
    "glue:GetTableVersion",
    "glue:GetTableVersions"
  ],
  "Resource": [
    "arn:aws:glue:region:account-id:catalog",
    "arn:aws:glue:region:account-id:database/databaseName",
    "arn:aws:glue:region:account-id:table/databaseName/tableName"
  ]
}
```

```json
{
  "Effect": "Allow",
  "Action": "glue:GetSchemaVersion",
  "Resource": "*"
}
```

The permissions are taken from the AWS docs https://docs.aws.amazon.com/firehose/latest/dev/controlling-access.html#using-iam-glue, though those docs are a bit misleading: they specify table-arn as the resource, but you need to grant permissions on the database and catalog as well. See https://docs.aws.amazon.com/glue/latest/dg/glue-specifying-resource-arns.html
Description of how you validated changes
Added unit test file and integration test.
Unit tests:
Integ test:
Checklist
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license