Source Postgres : Emit estimate trace messages for non-CDC mode #20783
/test connector=connectors/source-postgres
/test connector=connectors/source-postgres-strict-encrypt
Affected Connector Report
Connector | Version | Changelog | Publish |
---|---|---|---|
source-alloydb | 1.0.34 | ✅ | ✅ |
source-alloydb-strict-encrypt | 1.0.34 | 🔵 (ignored) | 🔵 (ignored) |
source-bigquery | 0.2.3 | ✅ | ✅ |
source-clickhouse | 0.1.14 | ✅ | ✅ |
source-clickhouse-strict-encrypt | 0.1.14 | 🔵 (ignored) | 🔵 (ignored) |
source-cockroachdb | 0.1.18 | ✅ | ✅ |
source-cockroachdb-strict-encrypt | 0.1.18 | 🔵 (ignored) | 🔵 (ignored) |
source-db2 | 0.1.16 | ✅ | ✅ |
source-db2-strict-encrypt | 0.1.16 | 🔵 (ignored) | 🔵 (ignored) |
source-dynamodb | 0.1.0 | ✅ | ✅ |
source-e2e-test | 2.1.3 | ✅ | ✅ |
source-e2e-test-cloud | 2.1.1 | 🔵 (ignored) | 🔵 (ignored) |
source-elasticsearch | 0.1.1 | ✅ | ✅ |
source-jdbc | 0.3.5 | 🔵 (ignored) | 🔵 (ignored) |
source-kafka | 0.2.3 | ✅ | ✅ |
source-mongodb-strict-encrypt | 0.1.19 | 🔵 (ignored) | 🔵 (ignored) |
source-mongodb-v2 | 0.1.19 | ✅ | ✅ |
source-mssql | 0.4.26 | ✅ | ✅ |
source-mssql-strict-encrypt | 0.4.26 | 🔵 (ignored) | 🔵 (ignored) |
source-mysql | 1.0.18 | ✅ | ✅ |
source-mysql-strict-encrypt | 1.0.18 | 🔵 (ignored) | 🔵 (ignored) |
source-oracle | 0.3.21 | ✅ | ✅ |
source-oracle-strict-encrypt | 0.3.21 | 🔵 (ignored) | 🔵 (ignored) |
source-postgres | 1.0.37 | ✅ | ✅ |
source-postgres-strict-encrypt | 1.0.37 | 🔵 (ignored) | 🔵 (ignored) |
source-redshift | 0.3.15 | ✅ | ✅ |
source-relational-db | 0.3.1 | 🔵 (ignored) | 🔵 (ignored) |
source-scaffold-java-jdbc | 0.1.0 | 🔵 (ignored) | 🔵 (ignored) |
source-sftp | 0.1.2 | ✅ | ✅ |
source-snowflake | 0.1.28 | ✅ | ✅ |
source-tidb | 0.2.1 | ✅ | ✅ |
- See "Actionable Items" below for how to resolve warnings and errors.
❌ Destinations (47)
Connector | Version | Changelog | Publish |
---|---|---|---|
destination-azure-blob-storage | 0.1.6 | ✅ | ✅ |
destination-bigquery | 1.2.9 | ✅ | ✅ |
destination-bigquery-denormalized | 1.2.10 | ✅ | ✅ |
destination-cassandra | 0.1.4 | ✅ | ✅ |
destination-clickhouse | 0.2.1 | ✅ | ✅ |
destination-clickhouse-strict-encrypt | 0.2.1 | 🔵 (ignored) | 🔵 (ignored) |
destination-csv | 1.0.0 | ❌ (changelog missing) | ✅ |
destination-databricks | 0.3.1 | ✅ | ✅ |
destination-dev-null | 0.2.7 | 🔵 (ignored) | 🔵 (ignored) |
destination-doris | 0.1.0 | ✅ | ✅ |
destination-dynamodb | 0.1.7 | ✅ | ✅ |
destination-e2e-test | 0.2.4 | ✅ | ✅ |
destination-elasticsearch | 0.1.6 | ✅ | ✅ |
destination-elasticsearch-strict-encrypt | 0.1.6 | 🔵 (ignored) | 🔵 (ignored) |
destination-gcs | 0.2.12 | ✅ | ✅ |
destination-iceberg | 0.1.0 | ✅ | ✅ |
destination-jdbc | 0.3.14 | 🔵 (ignored) | 🔵 (ignored) |
destination-kafka | 0.1.10 | ✅ | ✅ |
destination-keen | 0.2.4 | ✅ | ✅ |
destination-kinesis | 0.1.5 | ✅ | ✅ |
destination-local-json | 0.2.11 | ✅ | ✅ |
destination-mariadb-columnstore | 0.1.7 | ✅ | ✅ |
destination-mongodb | 0.1.9 | ✅ | ✅ |
destination-mongodb-strict-encrypt | 0.1.9 | 🔵 (ignored) | 🔵 (ignored) |
destination-mqtt | 0.1.3 | ✅ | ✅ |
destination-mssql | 0.1.22 | ✅ | ✅ |
destination-mssql-strict-encrypt | 0.1.22 | 🔵 (ignored) | 🔵 (ignored) |
destination-mysql | 0.1.20 | ✅ | ✅ |
destination-mysql-strict-encrypt | ❌ 0.1.21 (mismatch: 0.1.20) | 🔵 (ignored) | 🔵 (ignored) |
destination-oracle | 0.1.19 | ✅ | ✅ |
destination-oracle-strict-encrypt | 0.1.19 | 🔵 (ignored) | 🔵 (ignored) |
destination-postgres | 0.3.26 | ✅ | ✅ |
destination-postgres-strict-encrypt | 0.3.26 | 🔵 (ignored) | 🔵 (ignored) |
destination-pubsub | 0.2.0 | ✅ | ✅ |
destination-pulsar | 0.1.3 | ✅ | ✅ |
destination-r2 | 0.1.0 | ✅ | ✅ |
destination-redis | 0.1.4 | ✅ | ✅ |
destination-redpanda | 0.1.0 | ✅ | ✅ |
destination-redshift | 0.3.53 | ✅ | ✅ |
destination-rockset | 0.1.4 | ✅ | ✅ |
destination-s3 | 0.3.18 | ✅ | ✅ |
destination-s3-glue | 0.1.1 | ✅ | ✅ |
destination-scylla | 0.1.3 | ✅ | ✅ |
destination-snowflake | 0.4.42 | ❌ (changelog missing) | ✅ |
destination-teradata | 0.1.0 | ✅ | ✅ |
destination-tidb | 0.1.0 | ✅ | ✅ |
destination-yugabytedb | 0.1.0 | ✅ | ✅ |
- See "Actionable Items" below for how to resolve warnings and errors.
✅ Other Modules (0)
Actionable Items
Category | Status | Actionable Item |
---|---|---|
Version | ❌ mismatch | The version of the connector is different from its normal variant. Please bump the version of the connector. |
Version | ⚠ doc not found | The connector does not seem to have a documentation file. This can be normal (e.g. a basic connector like source-jdbc is not published or documented). Please double-check to make sure that it is not a bug. |
Changelog | ⚠ doc not found | The connector does not seem to have a documentation file. This can be normal (e.g. a basic connector like source-jdbc is not published or documented). Please double-check to make sure that it is not a bug. |
Changelog | ❌ changelog missing | There is no changelog for the current version of the connector. If you are the author of the current version, please add a changelog. |
Publish | ⚠ not in seed | The connector is not in the seed file (e.g. source_definitions.yaml), so its publication status cannot be checked. This can be normal (e.g. some connectors are cloud-specific, and only listed in the cloud seed file). Please double-check to make sure that it is not a bug. |
Publish | ❌ diff seed version | The connector exists in the seed file, but the latest version is not listed there. This usually means that the latest version is not published. Please use the /publish command to publish the latest version. |
/test connector=connectors/source-postgres-strict-encrypt
/test connector=connectors/source-postgres
/test connector=connectors/source-postgres-strict-encrypt
Build Passed
Test summary info:
/test connector=connectors/source-postgres
Build Passed
Test summary info:
...rs/source-postgres/src/main/java/io/airbyte/integrations/source/postgres/PostgresSource.java
...ource-postgres/src/main/java/io/airbyte/integrations/source/postgres/PostgresQueryUtils.java
I believe something about this PR causes our Postgres sync to hang. For what it's worth, our Postgres database is on Heroku. Notice below in logs from 1.0.36 that the
In logs from 1.0.37, the last source message happens a minute or more before I canceled it:
I've only waited up to 6 minutes before canceling, as this sync normally takes 1 minute to run end-to-end.
SELECT (SELECT COUNT(*) FROM %s) AS %s,
pg_relation_size('%s') AS %s;
The PR description says the intention is to only estimate table sizes using pg_relation_size, but won't this query end up doing a full count also? I believe this is what's causing my sync to hang for minutes (and maybe more), as attempting a count on my 186M row table takes quite a while.
If the number of rows is needed, I'd recommend using an estimate via SELECT reltuples AS estimate FROM pg_class WHERE relname = 'table_name'. [1]
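A minimal Java sketch of how that estimate query might be built and its result interpreted. The class and method names (`RowCountEstimate`, `buildQuery`, `interpret`) are hypothetical and not taken from the connector. It assumes the value is read from `reltuples` as a double, and that a negative value (which PostgreSQL 14+ reports for a table that has never been vacuumed or analyzed) should be treated as "no estimate available".

```java
import java.util.OptionalLong;

// Hypothetical helper, not part of the connector code base.
final class RowCountEstimate {

  // Template mirroring the query suggested in the comment above.
  static final String ESTIMATE_QUERY =
      "SELECT reltuples AS estimate FROM pg_class WHERE relname = '%s'";

  // Fills in the table name, doubling single quotes so the literal stays valid SQL.
  static String buildQuery(String tableName) {
    return String.format(ESTIMATE_QUERY, tableName.replace("'", "''"));
  }

  // Converts the raw reltuples value into a row count.
  // A negative value is treated as "unknown" (assumption: table never analyzed).
  static OptionalLong interpret(double reltuples) {
    if (reltuples < 0) {
      return OptionalLong.empty();
    }
    return OptionalLong.of(Math.round(reltuples));
  }

  public static void main(String[] args) {
    System.out.println(buildQuery("users"));
    System.out.println(interpret(-1.0).isPresent()); // prints "false"
  }
}
```

Note that `relname` alone does not distinguish tables with the same name in different schemas, so a real implementation would likely need to qualify the lookup.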
aha, you know about this as part of #21499.
Thanks for pointing this out. I have done some analysis and testing above to determine how much latency is added by the select count(*): #20783 (comment). But it seems there are still some cases where the latency is a problem, especially if your sync is really fast without the estimation. I also opened an issue. The issue I had encountered with that query is that the estimates were really off for smaller tables. However, I don't think this is too much of a problem: for smaller tables we can skip emitting the trace message altogether. So I will start fixing this issue. In the meantime, I suggest using 1.0.36. FYI, I have #21683 open to address this issue.
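The idea of skipping the trace message for small tables could be sketched roughly as follows. This is only an illustration: the class name `EstimateGate` and the cutoff value are hypothetical, and the thread does not say what threshold (if any) the fix uses.

```java
// Hypothetical gate, not part of the connector code base.
final class EstimateGate {

  // Assumed cutoff for illustration only; the real threshold is not stated in this thread.
  static final long MIN_ROWS_FOR_ESTIMATE = 10_000L;

  // Returns true when the estimated row count is large enough for the
  // planner statistics to be worth reporting; small tables are skipped.
  static boolean shouldEmitEstimate(long estimatedRows) {
    return estimatedRows >= MIN_ROWS_FOR_ESTIMATE;
  }

  public static void main(String[] args) {
    System.out.println(shouldEmitEstimate(500L));         // prints "false"
    System.out.println(shouldEmitEstimate(186_000_000L)); // prints "true"
  }
}
```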
Thanks! Probably obvious, but works for me now!
What
Closes #19199
Postgres sources implement the Progress Bar protocol: AirbyteEstimateTraceMessage
An AirbyteEstimateTraceMessage is emitted at the beginning of each sync with an estimated row count and an estimated byte count.
How
The estimated byte count is calculated by asking Postgres for the fast table size, select pg_relation_size('table_name'), and then scaling it to the share of bytes that corresponds to the actual number of rows being transmitted (considering the incremental offset). To better represent the amount of data Airbyte is moving (due to serialization), we multiply this by a factor of 2. Each row is estimated to be the same size.
Estimates are sent once at the beginning of each sync. Progress bar for CDC is NOT supported.
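The arithmetic described above can be sketched in Java as follows. This is a rough illustration under stated assumptions, not the connector's actual implementation: the class name `SyncSizeEstimate`, the method `estimateBytes`, and its parameters are hypothetical. It assumes the table size and total row count are already known, and that every row is the same size.

```java
// Hypothetical helper, not part of the connector code base.
final class SyncSizeEstimate {

  // Factor of 2 from the description, accounting for serialization overhead.
  static final long SERIALIZATION_FACTOR = 2L;

  // tableSizeBytes: result of pg_relation_size for the table
  // totalRows:      number of rows in the table
  // rowsToSync:     rows that will actually be transmitted (after the incremental offset)
  static long estimateBytes(long tableSizeBytes, long totalRows, long rowsToSync) {
    if (totalRows <= 0 || rowsToSync <= 0) {
      return 0L; // nothing to sync, or no basis for a per-row size
    }
    long rows = Math.min(rowsToSync, totalRows);
    // Every row is assumed to be the same size.
    double bytesPerRow = (double) tableSizeBytes / totalRows;
    return Math.round(bytesPerRow * rows * SERIALIZATION_FACTOR);
  }

  public static void main(String[] args) {
    // 1,000,000-byte table with 1,000 rows, 250 of which are synced.
    System.out.println(estimateBytes(1_000_000L, 1_000L, 250L)); // prints "500000"
  }
}
```

For a full refresh, `rowsToSync` equals `totalRows`, so the estimate reduces to twice the table size.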
Recommended reading order
AbstractDbSource.java: define abstract methods for estimateFullRefreshSyncSize and estimateIncrementalSyncSize & call methods to estimate sync size while creating read iterators. Default behavior is a no-op for all non-Postgres connectors.
PostgresSource.java and PostgresQueryUtils.java: Logic to query Postgres to estimate row count + estimated bytes.
🚨 User Impact 🚨
Currently, there is no user impact. We are still waiting on platform changes that implement stats persistence during a sync before users can actually see the progress bar.