Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Redshift dialect with multi-row insert #39

Closed
wants to merge 3 commits into from

Conversation

ahmeroxa
Copy link

This PR/Spike introduces a new Redshift Dialect and support for multi-row inserts.

Ideally we can roll in Redshift specific optimizations into the dialect. In particular the default data mapping of Kafka Records string to Postgres TEXT results in fields being defined as VARCHAR(256) within Redshift. You can read about the Redshift data type mapping here:
https://docs.aws.amazon.com/redshift/latest/dg/r_Character_types.html#r_Character_types-storage-and-ranges

For the sake of testing I have arbitrarily mapped string to VARCHAR(5000) but ideally a more deliberate/intelligent mapping would be implemented.

The multi-row insert is aimed at improving performance of a Sink connector writing to Redshift. During testing multi-row inserts are magnitudes faster than the existing single inserts that the insert mode uses. You can read about AWS's recommendation of multi-row insert here:
https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-multi-row-inserts.html

@willyborankin
Copy link
Contributor

willyborankin commented Jun 19, 2020

@ahmeroxa Hi, thank you for PR. It looks great. After discussion with my colleges we would kindly ask you to add some changes:

  • Since Redshift is derived PostgreSQL database and data types are same except TEXT it is better to extend PostgreSqlDatabaseDialect, override the method getSqlType and use VARCHAR instead of TEXT.
  • In case of jdbc connect VARCHAR usually has a possible maximum size, so for Redshift it is better to set it not 5000 but 65535
  • Support of multi-insert is in the DatabaseDialect class, it should be implemented for other databases, as well. I think it would better to add implementation in the GenericDatabaseDialect class, so most of the dialects get it by default, except SAP Hana, Sybase and Derby they don't support multi-insert and Oracle has its own syntax (https://oracle-base.com/articles/9i/multitable-inserts#conditional-insert-all)
  • All of new features should be covered by tests

@ahmeroxa
Copy link
Author

@willyborankin Ok great, those suggestions make sense to me. I can work on those changes.

Regarding VARCHAR(65535) though, the general guidance is to use the smallest possible column size (for you particular usage) due to the fact that VARCHAR columns are left uncompressed in certain cases (e.g. complex queries).

You can find more information here: https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-smallest-column-size.html

Perhaps it would be best to make that configurable and provide a sane default (either some arbitrary large number or maybe set the default to MAX and allow users to override it to something smaller).

What do you think?

@ivanyu
Copy link
Contributor

ivanyu commented Jun 23, 2020

@ahmeroxa yeah, I see no better way apart from configuration.

BTW, I made a separate issue for supporting multi-row insert independently from Redshift: #40. Would you like to work on it?
If so, I think we can close this PR.

@ahmeroxa
Copy link
Author

@ivanyu Sure, I can take #40

I do think it's worth having a Redshift specific dialect at some point though (to handle some of the quirks outlined above). We can close this PR and tackle that separately.

@ivanyu
Copy link
Contributor

ivanyu commented Jun 24, 2020

@ahmeroxa

I do think it's worth having a Redshift specific dialect at some point though (to handle some of the quirks outlined above).

Of course! I mean, when we have the support for multi-row insert in general (in existing dialects), adding Redshift would be a small change.

@ivanyu ivanyu closed this Jun 24, 2020
@hariso hariso mentioned this pull request Dec 17, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants