
Destination Redshift: Introduces configurable value for file buffer count #20879

Merged
5 commits merged into master on Jan 3, 2023

Conversation

Contributor

@ryankfu ryankfu commented Dec 26, 2022

What

Follows the discussion in #13975 and partially addresses https://github.com/airbytehq/airbyte-internal-issues/issues/496, which asserts that interleaved messages (e.g. Change Data Capture, aka CDC) trigger frequent flush-all commands. This reduces performance, since buffers thrash under the constant need to switch buffers for different streams.

This change doesn't fully resolve those tickets, but it builds on work introduced by a community member: a user-configurable field for the number of concurrent buffers created when a MessageConsumer is created for destinations. It follows a discussion noting that each buffer requires a minimum of 31 MB, so an unbounded number of buffers is not feasible, since the memory footprint scales with the buffer count. Because Airbyte aims not to break existing customers' workflows, the limit is raised with that constraint in mind.

How

Allows a user to exceed the default buffer count, up to a hard cap. If the value exceeds the soft cap, the user is shown a warning explaining that the increase grants additional performance only as long as the value does not exceed the number of streams being synced.
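The floor, soft cap, and hard cap behavior can be sketched as follows. This is a minimal illustration under assumed values; DEFAULT_MAX, SOFT_CAP, and HARD_CAP are stand-ins, not the connector's actual constants.

```java
// Hypothetical sketch of the configurable file-buffer count described above.
// The constant values and names are assumptions for illustration only.
public class FileBufferClamp {

  static final int DEFAULT_MAX = 10; // default buffer count
  static final int SOFT_CAP = 15;    // values above this trigger a warning
  static final int HARD_CAP = 50;    // values are never allowed above this

  // Clamp the user-configured count between the default floor and the hard cap,
  // warning when the soft cap is exceeded.
  static int effectiveBufferCount(int requested) {
    if (requested > SOFT_CAP) {
      System.out.println("WARN: buffer count above soft cap; extra buffers only help "
          + "up to the number of streams being synced");
    }
    return Math.min(Math.max(requested, DEFAULT_MAX), HARD_CAP);
  }

  public static void main(String[] args) {
    System.out.println(effectiveBufferCount(5));   // floored up to 10
    System.out.println(effectiveBufferCount(100)); // capped down to 50
  }
}
```

Clamping low values up to the default keeps misconfigured connections working, while the hard cap bounds worst-case memory use.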

Recommended reading order

  1. RedshiftStagingS3Destination.java
  2. FileBuffer.java
  3. RedshiftStagingS3DestinationTest.java

🚨 User Impact 🚨

Are there any breaking changes? What is the end result perceived by the user? If yes, please merge this PR with the 🚨🚨 emoji so changelog authors can further highlight this if needed.

This should not introduce any breaking changes: the increased FileBuffer count stays within the amount of memory generally allocated across all users (1 GB), and the soft cap fits within that amount even with wiggle room.

Pre-merge Checklist

Updating a connector

Community member or Airbyter

  • Grant edit access to maintainers (instructions)
  • Secrets in the connector's spec are annotated with airbyte_secret
  • Unit & integration tests added and passing. Community members, please provide proof of success locally e.g: screenshot or copy-paste unit, integration, and acceptance test output. To run acceptance tests for a Python connector, follow instructions in the README. For java connectors run ./gradlew :airbyte-integrations:connectors:<name>:integrationTest.
  • Code reviews completed
  • Documentation updated
    • Connector's README.md
    • Connector's bootstrap.md. See description and examples
    • Changelog updated in docs/integrations/<source or destination>/<name>.md including changelog. See changelog example
  • PR name follows PR naming conventions

Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

  • Create a non-forked branch based on this PR and test the below items on it
  • Build is successful
  • If new credentials are required for use in CI, add them to GSM. Instructions.
  • /test connector=connectors/<name> command is passing
  • New Connector version released on Dockerhub and connector version bumped by running the /publish command described here

Tests

Unit

Put your unit tests output here.

Integration

Put your integration tests output here.

Acceptance

Put your acceptance tests output here.

Contributor

github-actions bot commented Dec 26, 2022

Affected Connector Report

NOTE ⚠️ Changes in this PR affect the following connectors. Make sure to do the following as needed:

  • Run integration tests
  • Bump connector or module version
  • Add changelog
  • Publish the new version

✅ Sources (31)

Connector Version Changelog Publish
source-alloydb 1.0.34
source-alloydb-strict-encrypt 1.0.34 🔵
(ignored)
🔵
(ignored)
source-bigquery 0.2.3
source-clickhouse 0.1.14
source-clickhouse-strict-encrypt 0.1.14 🔵
(ignored)
🔵
(ignored)
source-cockroachdb 0.1.18
source-cockroachdb-strict-encrypt 0.1.18 🔵
(ignored)
🔵
(ignored)
source-db2 0.1.16
source-db2-strict-encrypt 0.1.16 🔵
(ignored)
🔵
(ignored)
source-dynamodb 0.1.0
source-e2e-test 2.1.3
source-e2e-test-cloud 2.1.1 🔵
(ignored)
🔵
(ignored)
source-elasticsearch 0.1.1
source-jdbc 0.3.5 🔵
(ignored)
🔵
(ignored)
source-kafka 0.2.3
source-mongodb-strict-encrypt 0.1.19 🔵
(ignored)
🔵
(ignored)
source-mongodb-v2 0.1.19
source-mssql 0.4.26
source-mssql-strict-encrypt 0.4.26 🔵
(ignored)
🔵
(ignored)
source-mysql 1.0.18
source-mysql-strict-encrypt 1.0.18 🔵
(ignored)
🔵
(ignored)
source-oracle 0.3.21
source-oracle-strict-encrypt 0.3.21 🔵
(ignored)
🔵
(ignored)
source-postgres 1.0.34
source-postgres-strict-encrypt 1.0.34 🔵
(ignored)
🔵
(ignored)
source-redshift 0.3.15
source-relational-db 0.3.1 🔵
(ignored)
🔵
(ignored)
source-scaffold-java-jdbc 0.1.0 🔵
(ignored)
🔵
(ignored)
source-sftp 0.1.2
source-snowflake 0.1.27
source-tidb 0.2.1
  • See "Actionable Items" below for how to resolve warnings and errors.

❌ Destinations (46)

Connector | Version | Changelog | Publish
destination-azure-blob-storage | 0.1.6
destination-bigquery | 1.2.9
destination-bigquery-denormalized | 1.2.9
destination-cassandra | 0.1.4
destination-clickhouse | 0.2.1
destination-clickhouse-strict-encrypt | 0.2.1 | 🔵 (ignored) | 🔵 (ignored)
destination-csv | 0.2.10
destination-databricks | 0.3.1
destination-dev-null | 0.2.7 | 🔵 (ignored) | 🔵 (ignored)
destination-doris | 0.1.0
destination-dynamodb | 0.1.7
destination-e2e-test | 0.2.4
destination-elasticsearch | 0.1.6
destination-elasticsearch-strict-encrypt | 0.1.6 | 🔵 (ignored) | 🔵 (ignored)
destination-gcs | 0.2.12
destination-iceberg | 0.1.0
destination-jdbc | 0.3.14 | 🔵 (ignored) | 🔵 (ignored)
destination-kafka | 0.1.10
destination-keen | 0.2.4
destination-kinesis | 0.1.5
destination-local-json | 0.2.11
destination-mariadb-columnstore | 0.1.7
destination-mongodb | 0.1.9
destination-mongodb-strict-encrypt | 0.1.9 | 🔵 (ignored) | 🔵 (ignored)
destination-mqtt | 0.1.3
destination-mssql | 0.1.22
destination-mssql-strict-encrypt | 0.1.22 | 🔵 (ignored) | 🔵 (ignored)
destination-mysql | 0.1.20
destination-mysql-strict-encrypt | 0.1.21 (mismatch: 0.1.20) | 🔵 (ignored) | 🔵 (ignored)
destination-oracle | 0.1.19
destination-oracle-strict-encrypt | 0.1.19 | 🔵 (ignored) | 🔵 (ignored)
destination-postgres | 0.3.26
destination-postgres-strict-encrypt | 0.3.26 | 🔵 (ignored) | 🔵 (ignored)
destination-pubsub | 0.2.0
destination-pulsar | 0.1.3
destination-r2 | 0.1.0
destination-redis | 0.1.4
destination-redpanda | 0.1.0
destination-redshift | 0.3.52
destination-rockset | 0.1.4
destination-s3 | 0.3.18
destination-s3-glue | 0.1.1
destination-scylla | 0.1.3
destination-snowflake | 0.4.41
destination-tidb | 0.1.0
destination-yugabytedb | 0.1.0
  • See "Actionable Items" below for how to resolve warnings and errors.

👀 Other Modules (1)

  • base-normalization

Actionable Items


Category | Status | Actionable Item
Version | mismatch | The version of the connector is different from its normal variant. Please bump the version of the connector.
Version | doc not found | The connector does not seem to have a documentation file. This can be normal (e.g. a basic connector like source-jdbc is not published or documented). Please double-check to make sure that it is not a bug.
Changelog | doc not found | The connector does not seem to have a documentation file. This can be normal (e.g. a basic connector like source-jdbc is not published or documented). Please double-check to make sure that it is not a bug.
Changelog | changelog missing | There is no changelog for the current version of the connector. If you are the author of the current version, please add a changelog.
Publish | not in seed | The connector is not in the seed file (e.g. source_definitions.yaml), so its publication status cannot be checked. This can be normal (e.g. some connectors are cloud-specific and only listed in the cloud seed file). Please double-check to make sure that it is not a bug.
Publish | diff seed version | The connector exists in the seed file, but the latest version is not listed there. This usually means that the latest version is not published. Please use the /publish command to publish the latest version.

@ryankfu ryankfu force-pushed the ryan/increase-file-buffer-redshift branch from 864975e to 1495ff8 on December 26, 2022 05:59
Contributor Author

ryankfu commented Dec 26, 2022

/test connector=connectors/destination-redshift

🕑 connectors/destination-redshift https://github.com/airbytehq/airbyte/actions/runs/3779449845
❌ connectors/destination-redshift https://github.com/airbytehq/airbyte/actions/runs/3779449845
🐛 https://gradle.com/s/u4rrwq3phkn7o

Build Failed

Test summary info:

Could not find result summary

@ryankfu ryankfu force-pushed the ryan/increase-file-buffer-redshift branch from 1495ff8 to ffd7e72 on December 26, 2022 06:17
@ryankfu ryankfu force-pushed the ryan/increase-file-buffer-redshift branch from ffd7e72 to ac6d789 on December 27, 2022 17:24
Contributor Author

ryankfu commented Dec 27, 2022

/test connector=connectors/destination-redshift

🕑 connectors/destination-redshift https://github.com/airbytehq/airbyte/actions/runs/3788988942
✅ connectors/destination-redshift https://github.com/airbytehq/airbyte/actions/runs/3788988942
Python tests coverage:

Name                                                              Stmts   Miss  Cover
-------------------------------------------------------------------------------------
normalization/transform_config/__init__.py                            2      0   100%
normalization/transform_catalog/reserved_keywords.py                 14      0   100%
normalization/transform_catalog/__init__.py                           2      0   100%
normalization/destination_type.py                                    14      0   100%
normalization/__init__.py                                             4      0   100%
normalization/transform_catalog/destination_name_transformer.py     166      8    95%
normalization/transform_catalog/table_name_registry.py              174     34    80%
normalization/transform_config/transform.py                         189     48    75%
normalization/transform_catalog/utils.py                             51     14    73%
normalization/transform_catalog/dbt_macro.py                         22      7    68%
normalization/transform_catalog/catalog_processor.py                147     80    46%
normalization/transform_catalog/transform.py                         61     38    38%
normalization/transform_catalog/stream_processor.py                 595    400    33%
-------------------------------------------------------------------------------------
TOTAL                                                              1441    629    56%

Build Passed

Test summary info:

All Passed

@ryankfu ryankfu marked this pull request as ready for review December 27, 2022 18:33
@ryankfu ryankfu requested a review from a team as a code owner December 27, 2022 18:33
"file_buffer_size": {
"title": "File Buffer Count",
"type": "integer",
"minimum": 15,
Contributor

The new test checks that this value can be set as low as 10, but the spec says that the minimum is 15. Should the two minimums be the same?

Contributor Author


Good catch. Yes, the minimums should be the same. I'll add a test and a change that ensures there are at minimum 15 file buffers.

@ryankfu ryankfu force-pushed the ryan/increase-file-buffer-redshift branch from ac6d789 to be9aba3 on December 28, 2022 20:14
* before another stream's buffer can be created. Increasing the default max will reduce the likelihood
* of thrashing, but will not eliminate it entirely unless the number of buffers equals the number of streams to be synced.
*/
public static final int DEFAULT_MAX_CONCURRENT_STREAM_IN_BUFFER = 15;
Contributor


We are changing the default and minimum from 10 to 15. This raises two questions:

  • How will this affect existing connections?
  • How will this affect non-cloud users, who may be running this connector in containers with less than 1 GB of RAM?

Contributor Author

@ryankfu ryankfu Dec 28, 2022


  • The likelihood of this affecting existing connections is low, since the minimum memory needed for all of the default buffers is 465 MB (15 * 31 MB), which is strictly less than the 1 GB MAX_TOTAL_BUFFER_SIZE_BYTES
  • Furthermore, a customer with low memory availability would have run into issues earlier, since MAX_TOTAL_BUFFER_SIZE_BYTES was already 1 GB before a flush. To avoid issues before this change, users would have had to tune the original DEFAULT_MAX_CONCURRENT_STREAM_IN_BUFFER or lower MAX_TOTAL_BUFFER_SIZE_BYTES to the amount of memory available to the JVM

It is reasonable to mark this change in the changelog as potentially breaking for users with less than 665 MB of memory (465 MB + 200 MB from MAX_PER_STREAM_BUFFER_SIZE_BYTES)
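The arithmetic above can be verified with a small sketch. The constant names are borrowed from the discussion for readability; this is not the connector's actual code.

```java
// Quick check of the memory math above. Sizes come from the discussion:
// 31 MB per buffer, a 1 GB total buffer cap, and a 200 MB per-stream cap.
public class BufferMemoryMath {
  static final long MB = 1024L * 1024L;
  static final long PER_BUFFER_BYTES = 31 * MB;              // minimum size of one file buffer
  static final long MAX_TOTAL_BUFFER_SIZE_BYTES = 1024 * MB; // 1 GB
  static final long MAX_PER_STREAM_BUFFER_SIZE_BYTES = 200 * MB;

  // Memory needed by 15 concurrent default buffers.
  static long defaultBuffersBytes() {
    return 15 * PER_BUFFER_BYTES;
  }

  public static void main(String[] args) {
    System.out.println(defaultBuffersBytes() / MB);                          // 465
    System.out.println(defaultBuffersBytes() < MAX_TOTAL_BUFFER_SIZE_BYTES); // true
    // Potentially-breaking threshold cited above: 465 MB + 200 MB = 665 MB
    System.out.println((defaultBuffersBytes() + MAX_PER_STREAM_BUFFER_SIZE_BYTES) / MB); // 665
  }
}
```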

Contributor


AFAIK, we used to run containers with 500 MB by default and recently switched to 1 GB. Why not leave the minimum at 10 instead?

Contributor Author


The MAX_TOTAL_BUFFER_SIZE_BYTES has been 1 GB since March, so most of this decision is based on that value. I didn't realize you already had a comment about this here in the community PR, but it makes sense to reduce the chance of breakage. It raises the question of how we can better communicate these increases to the community; making the value configurable, with proper warnings, seems like one option.

Based on your comment, it seems better to keep the default at 10, with the option for the user to configure it.

@ryankfu ryankfu changed the title Increased default buffer count and introduces configurable value for destination redshift 🚨 Increased default buffer count and introduces configurable value for destination redshift 🚨 Dec 28, 2022
@ryankfu ryankfu changed the title 🚨 Increased default buffer count and introduces configurable value for destination redshift 🚨 Destination Redshift: Introduces configurable value for file buffer count Dec 28, 2022
@octavia-squidington-iv octavia-squidington-iv added the area/documentation Improvements or additions to documentation label Dec 30, 2022
Contributor Author

ryankfu commented Jan 2, 2023

/publish connector=connectors/destination-redshift run-tests=false

if you have connectors that successfully published but failed definition generation, follow step 4 here ▶️

EDIT: Failed test run here

Re-running publish command since it failed to checkout Airbyte repository

Contributor Author

ryankfu commented Jan 2, 2023

/publish connector=connectors/destination-redshift run-tests=false

🕑 Publishing the following connectors:
connectors/destination-redshift
https://github.com/airbytehq/airbyte/actions/runs/3824144279


Connector Did it publish? Were definitions generated?
connectors/destination-redshift

if you have connectors that successfully published but failed definition generation, follow step 4 here ▶️

@octavia-squidington-iii octavia-squidington-iii temporarily deployed to more-secrets January 2, 2023 17:35 — with GitHub Actions Inactive
@octavia-squidington-iii octavia-squidington-iii temporarily deployed to more-secrets January 2, 2023 17:36 — with GitHub Actions Inactive
@ryankfu ryankfu merged commit 64254a4 into master Jan 3, 2023
@ryankfu ryankfu deleted the ryan/increase-file-buffer-redshift branch January 3, 2023 03:16
Contributor

adam-bloom commented Jan 24, 2023

Hi @ryankfu - I found a typo in this that prevents it from working correctly - buffers are still capped at 10.

https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/destination-redshift/src/main/java/io/airbyte/integrations/destination/redshift/RedshiftStagingS3Destination.java#L178 - this max is taken against the default, which is still 10. It should be a max with FileBuffer.MAX_CONCURRENT_STREAM_IN_BUFFER.

Additionally, the default of 1 here doesn't seem correct; that should likely be FileBuffer.DEFAULT_MAX_CONCURRENT_STREAM_IN_BUFFER.

@adam-bloom
Contributor

Also, there's a discrepancy between the config keys used in the spec and the application, so I don't think it's even possible to configure this correctly.

Contributor Author

ryankfu commented Jan 25, 2023

@adam-bloom thank you for pointing out the mismatched spec and application values; I will spin up a PR to address this.

As for the buffers being capped at 10, that should not be the case since

return Math.max(numOfFileBuffers, FileBuffer.DEFAULT_MAX_CONCURRENT_STREAM_IN_BUFFER);

will return the larger of numOfFileBuffers or DEFAULT_MAX_CONCURRENT_STREAM_IN_BUFFER (10)

In fact there's a test that captures this:

public void testGetFileBufferMaxLimited() {
  final JsonNode defaultConfig = Jsons.clone(config);
  ((ObjectNode) defaultConfig).put(FileBuffer.FILE_BUFFER_COUNT_KEY, 100);
  final RedshiftStagingS3Destination destination = new RedshiftStagingS3Destination();
  assertEquals(destination.getNumberOfFileBuffers(defaultConfig), FileBuffer.MAX_CONCURRENT_STREAM_IN_BUFFER);
}

@adam-bloom
Contributor

I think it was really the config mismatch that I was seeing the impacts of. The 10 would have been coming from max(1, 10). Just needed a second pass!
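The failure mode conceded here can be reproduced in miniature: when the spec writes one key and the application reads another, the user's value is never seen, a fallback of 1 applies, and Math.max(1, 10) yields 10 regardless of what was configured. A hypothetical sketch follows; the application-side key name and fallback are illustrative, not the connector's actual code.

```java
import java.util.Map;

// Hypothetical reproduction of the config-key mismatch described above.
public class ConfigKeyMismatch {
  static final int DEFAULT_MAX = 10;

  // Reads the buffer count under `appKey`, falling back to 1 (the suspect
  // default), then floors the result at DEFAULT_MAX.
  static int bufferCount(Map<String, Integer> config, String appKey) {
    int configured = config.getOrDefault(appKey, 1);
    return Math.max(configured, DEFAULT_MAX);
  }

  public static void main(String[] args) {
    // The spec wrote the value under "file_buffer_size", but if the
    // application reads a different (illustrative) key, the user's 25
    // is ignored and the result collapses to max(1, 10) = 10.
    Map<String, Integer> config = Map.of("file_buffer_size", 25);
    System.out.println(bufferCount(config, "file_buffer_count")); // 10
    System.out.println(bufferCount(config, "file_buffer_size"));  // 25
  }
}
```

With matching keys the configured value flows through as expected, which is why fixing the key alone resolves the observed cap.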

Labels
area/connectors Connector related issues area/documentation Improvements or additions to documentation connectors/destination/redshift