Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Source Postgres : Emit estimate trace messages for non-CDC mode #20783

Merged
merged 28 commits into from
Jan 18, 2023

Conversation

akashkulk
Copy link
Contributor

@akashkulk akashkulk commented Dec 21, 2022

What

Closes #19199

Postgres sources implement Progress Bar protocol - AirbyteEstimateTraceMessage
An AirbyteEstimateTraceMessage is emitted at the beginning of each sync with:

  • Estimated row count
  • Estimated byte count

How

Estimated byte count is calculated by asking postgres for the fast table silze select pg_relation_size('table_name') and then scaling by the amount of bytes that correspond to the actual number of rows being transmitted (considering the incremental offset). To better represent the amount of data Airbyte is moving (due to serialization), we are multiplying this by a factor of 2. Each row is estimated to be the same size.

Estimates are sent once at the beginning of each sync. Progress bar for CDC is NOT supported.

Recommended reading order

  1. AbstractDbSource.java : define abstract methods for estimateFullRefreshSyncSize and estimateIncrementalSyncSize & call methods to estimate sync size while creating read iterators. Default behavior is a no-op for all non-Postgres connectors.
  2. PostgresSource.java and PostgresQueryUtils.java : Logic to query Postgres to estimate row count + estimated bytes.
  3. Everything else

🚨 User Impact 🚨

Currently, there is no user impact. Still waiting on platform changes to implement stats persistence during a sync for user to actually see the progress bar

@akashkulk
Copy link
Contributor Author

/test connector=connectors/source-postgres

@akashkulk
Copy link
Contributor Author

/test connector=connectors/source-postgres-strict-encrypt

@github-actions
Copy link
Contributor

github-actions bot commented Dec 21, 2022

Affected Connector Report

NOTE ⚠️ Changes in this PR affect the following connectors. Make sure to do the following as needed:

  • Run integration tests
  • Bump connector or module version
  • Add changelog
  • Publish the new version

✅ Sources (31)

Connector Version Changelog Publish
source-alloydb 1.0.34
source-alloydb-strict-encrypt 1.0.34 🔵
(ignored)
🔵
(ignored)
source-bigquery 0.2.3
source-clickhouse 0.1.14
source-clickhouse-strict-encrypt 0.1.14 🔵
(ignored)
🔵
(ignored)
source-cockroachdb 0.1.18
source-cockroachdb-strict-encrypt 0.1.18 🔵
(ignored)
🔵
(ignored)
source-db2 0.1.16
source-db2-strict-encrypt 0.1.16 🔵
(ignored)
🔵
(ignored)
source-dynamodb 0.1.0
source-e2e-test 2.1.3
source-e2e-test-cloud 2.1.1 🔵
(ignored)
🔵
(ignored)
source-elasticsearch 0.1.1
source-jdbc 0.3.5 🔵
(ignored)
🔵
(ignored)
source-kafka 0.2.3
source-mongodb-strict-encrypt 0.1.19 🔵
(ignored)
🔵
(ignored)
source-mongodb-v2 0.1.19
source-mssql 0.4.26
source-mssql-strict-encrypt 0.4.26 🔵
(ignored)
🔵
(ignored)
source-mysql 1.0.18
source-mysql-strict-encrypt 1.0.18 🔵
(ignored)
🔵
(ignored)
source-oracle 0.3.21
source-oracle-strict-encrypt 0.3.21 🔵
(ignored)
🔵
(ignored)
source-postgres 1.0.37
source-postgres-strict-encrypt 1.0.37 🔵
(ignored)
🔵
(ignored)
source-redshift 0.3.15
source-relational-db 0.3.1 🔵
(ignored)
🔵
(ignored)
source-scaffold-java-jdbc 0.1.0 🔵
(ignored)
🔵
(ignored)
source-sftp 0.1.2
source-snowflake 0.1.28
source-tidb 0.2.1
  • See "Actionable Items" below for how to resolve warnings and errors.

❌ Destinations (47)

Connector Version Changelog Publish
destination-azure-blob-storage 0.1.6
destination-bigquery 1.2.9
destination-bigquery-denormalized 1.2.10
destination-cassandra 0.1.4
destination-clickhouse 0.2.1
destination-clickhouse-strict-encrypt 0.2.1 🔵
(ignored)
🔵
(ignored)
destination-csv 1.0.0
(changelog missing)
destination-databricks 0.3.1
destination-dev-null 0.2.7 🔵
(ignored)
🔵
(ignored)
destination-doris 0.1.0
destination-dynamodb 0.1.7
destination-e2e-test 0.2.4
destination-elasticsearch 0.1.6
destination-elasticsearch-strict-encrypt 0.1.6 🔵
(ignored)
🔵
(ignored)
destination-gcs 0.2.12
destination-iceberg 0.1.0
destination-jdbc 0.3.14 🔵
(ignored)
🔵
(ignored)
destination-kafka 0.1.10
destination-keen 0.2.4
destination-kinesis 0.1.5
destination-local-json 0.2.11
destination-mariadb-columnstore 0.1.7
destination-mongodb 0.1.9
destination-mongodb-strict-encrypt 0.1.9 🔵
(ignored)
🔵
(ignored)
destination-mqtt 0.1.3
destination-mssql 0.1.22
destination-mssql-strict-encrypt 0.1.22 🔵
(ignored)
🔵
(ignored)
destination-mysql 0.1.20
destination-mysql-strict-encrypt 0.1.21
(mismatch: 0.1.20)
🔵
(ignored)
🔵
(ignored)
destination-oracle 0.1.19
destination-oracle-strict-encrypt 0.1.19 🔵
(ignored)
🔵
(ignored)
destination-postgres 0.3.26
destination-postgres-strict-encrypt 0.3.26 🔵
(ignored)
🔵
(ignored)
destination-pubsub 0.2.0
destination-pulsar 0.1.3
destination-r2 0.1.0
destination-redis 0.1.4
destination-redpanda 0.1.0
destination-redshift 0.3.53
destination-rockset 0.1.4
destination-s3 0.3.18
destination-s3-glue 0.1.1
destination-scylla 0.1.3
destination-snowflake 0.4.42
(changelog missing)
destination-teradata 0.1.0
destination-tidb 0.1.0
destination-yugabytedb 0.1.0
  • See "Actionable Items" below for how to resolve warnings and errors.

✅ Other Modules (0)

Actionable Items

(click to expand)

Category Status Actionable Item
Version
mismatch
The version of the connector is different from its normal variant. Please bump the version of the connector.

doc not found
The connector does not seem to have a documentation file. This can be normal (e.g. basic connector like source-jdbc is not published or documented). Please double-check to make sure that it is not a bug.
Changelog
doc not found
The connector does not seem to have a documentation file. This can be normal (e.g. basic connector like source-jdbc is not published or documented). Please double-check to make sure that it is not a bug.

changelog missing
There is no chnagelog for the current version of the connector. If you are the author of the current version, please add a changelog.
Publish
not in seed
The connector is not in the seed file (e.g. source_definitions.yaml), so its publication status cannot be checked. This can be normal (e.g. some connectors are cloud-specific, and only listed in the cloud seed file). Please double-check to make sure that it is not a bug.

diff seed version
The connector exists in the seed file, but the latest version is not listed there. This usually means that the latest version is not published. Please use the /publish command to publish the latest version.

@akashkulk akashkulk temporarily deployed to more-secrets December 21, 2022 23:54 — with GitHub Actions Inactive
@akashkulk akashkulk temporarily deployed to more-secrets December 21, 2022 23:54 — with GitHub Actions Inactive
@akashkulk
Copy link
Contributor Author

/test connector=connectors/source-postgres-strict-encrypt

@akashkulk
Copy link
Contributor Author

/test connector=connectors/source-postgres

@akashkulk akashkulk temporarily deployed to more-secrets December 22, 2022 00:05 — with GitHub Actions Inactive
@akashkulk akashkulk temporarily deployed to more-secrets December 22, 2022 00:06 — with GitHub Actions Inactive
@akashkulk
Copy link
Contributor Author

akashkulk commented Dec 22, 2022

/test connector=connectors/source-postgres-strict-encrypt

🕑 connectors/source-postgres-strict-encrypt https://github.com/airbytehq/airbyte/actions/runs/3759907422
✅ connectors/source-postgres-strict-encrypt https://github.com/airbytehq/airbyte/actions/runs/3759907422
No Python unittests run

Build Passed

Test summary info:

All Passed

@akashkulk
Copy link
Contributor Author

akashkulk commented Dec 22, 2022

/test connector=connectors/source-postgres

🕑 connectors/source-postgres https://github.com/airbytehq/airbyte/actions/runs/3759962270
✅ connectors/source-postgres https://github.com/airbytehq/airbyte/actions/runs/3759962270
Python tests coverage:

	 Name                                                 Stmts   Miss  Cover   Missing
	 ----------------------------------------------------------------------------------
	 source_acceptance_test/base.py                          12      4    67%   16-19
	 source_acceptance_test/config.py                       140      5    96%   87, 93, 238, 242-243
	 source_acceptance_test/conftest.py                     208     92    56%   36, 42-44, 49, 54, 77, 83, 89-91, 110, 115-117, 123-125, 131-132, 137-138, 143, 149, 158-167, 173-178, 193, 217, 248, 254, 262-267, 275-280, 288-301, 306-312, 319-330, 337-353
	 source_acceptance_test/plugin.py                        69     25    64%   22-23, 31, 36, 120-140, 144-148
	 source_acceptance_test/tests/test_core.py              402    115    71%   53, 58, 93-104, 109-116, 120-121, 125-126, 308, 346-363, 376-387, 391-396, 402, 435-440, 478-485, 528-530, 533, 598-606, 618-621, 626, 682-683, 689, 692, 728-738, 751-776
	 source_acceptance_test/tests/test_incremental.py       158     14    91%   52-59, 64-77, 240
	 source_acceptance_test/utils/asserts.py                 39      2    95%   62-63
	 source_acceptance_test/utils/common.py                  94     10    89%   16-17, 32-38, 72, 75
	 source_acceptance_test/utils/compare.py                 62     23    63%   21-51, 68, 97-99
	 source_acceptance_test/utils/connector_runner.py       133     33    75%   24-27, 46-47, 50-54, 57-58, 73-75, 78-80, 83-85, 88-90, 93-95, 124-125, 159-161, 208
	 source_acceptance_test/utils/json_schema_helper.py     107     13    88%   30-31, 38, 41, 65-68, 96, 120, 192-194
	 ----------------------------------------------------------------------------------
	 TOTAL                                                 1603    336    79%

Build Passed

Test summary info:

=========================== short test summary info ============================
SKIPPED [1] ../usr/local/lib/python3.9/site-packages/source_acceptance_test/plugin.py:63: Skipping TestConnection.test_check: not found in the config.
SKIPPED [1] ../usr/local/lib/python3.9/site-packages/source_acceptance_test/plugin.py:63: Skipping TestDiscovery.test_discover: not found in the config.
SKIPPED [1] ../usr/local/lib/python3.9/site-packages/source_acceptance_test/plugin.py:63: Skipping TestBasicRead.test_read: not found in the config.
SKIPPED [1] ../usr/local/lib/python3.9/site-packages/source_acceptance_test/plugin.py:63: Skipping TestFullRefresh.test_sequential_reads: not found in the config.
SKIPPED [1] ../usr/local/lib/python3.9/site-packages/source_acceptance_test/plugin.py:63: Skipping TestIncremental.test_two_sequential_reads: not found in the config.
SKIPPED [1] ../usr/local/lib/python3.9/site-packages/source_acceptance_test/tests/test_core.py:94: The previous and actual specifications are identical.
================= 13 passed, 6 skipped, 21 warnings in 16.22s ==================

@akashkulk akashkulk marked this pull request as ready for review December 22, 2022 17:51
@akashkulk akashkulk requested a review from a team as a code owner December 22, 2022 17:51
@akashkulk akashkulk temporarily deployed to more-secrets December 22, 2022 17:52 — with GitHub Actions Inactive
@akashkulk akashkulk temporarily deployed to more-secrets December 22, 2022 17:53 — with GitHub Actions Inactive
@octavia-squidington-iii octavia-squidington-iii temporarily deployed to more-secrets January 17, 2023 23:13 — with GitHub Actions Inactive
@akashkulk akashkulk enabled auto-merge (squash) January 17, 2023 23:13
@akashkulk akashkulk temporarily deployed to more-secrets January 17, 2023 23:15 — with GitHub Actions Inactive
@akashkulk akashkulk temporarily deployed to more-secrets January 17, 2023 23:16 — with GitHub Actions Inactive
@akashkulk akashkulk temporarily deployed to more-secrets January 18, 2023 00:26 — with GitHub Actions Inactive
@akashkulk akashkulk temporarily deployed to more-secrets January 18, 2023 00:27 — with GitHub Actions Inactive
@akashkulk akashkulk disabled auto-merge January 18, 2023 01:06
@akashkulk akashkulk temporarily deployed to more-secrets January 18, 2023 01:08 — with GitHub Actions Inactive
@akashkulk akashkulk temporarily deployed to more-secrets January 18, 2023 01:08 — with GitHub Actions Inactive
@akashkulk akashkulk enabled auto-merge (squash) January 18, 2023 02:03
@akashkulk akashkulk temporarily deployed to more-secrets January 18, 2023 04:05 — with GitHub Actions Inactive
@akashkulk akashkulk temporarily deployed to more-secrets January 18, 2023 04:05 — with GitHub Actions Inactive
@akashkulk akashkulk temporarily deployed to more-secrets January 18, 2023 06:16 — with GitHub Actions Inactive
@akashkulk akashkulk temporarily deployed to more-secrets January 18, 2023 06:16 — with GitHub Actions Inactive
@akashkulk akashkulk disabled auto-merge January 18, 2023 13:18
@akashkulk akashkulk temporarily deployed to more-secrets January 18, 2023 13:19 — with GitHub Actions Inactive
@akashkulk akashkulk temporarily deployed to more-secrets January 18, 2023 13:20 — with GitHub Actions Inactive
@akashkulk akashkulk enabled auto-merge (squash) January 18, 2023 13:50
@akashkulk akashkulk merged commit 63e4482 into master Jan 18, 2023
@akashkulk akashkulk deleted the message_bar branch January 18, 2023 13:55
@visnup
Copy link
Contributor

visnup commented Jan 20, 2023

I believe something about this PR causes our Postgres sync to hang. For what it's worth, our Postgres database is on Heroku.

Notice below in logs from 1.0.36 that the Queueing query for table: events happens immediately:

2023-01-20 21:16:46 source > Found table: public.pg_stat_statements
2023-01-20 21:16:46 source > Found table: public.item_cache
2023-01-20 21:16:46 source > Found table: public.events_of_interest
2023-01-20 21:16:46 source > using CDC: false
2023-01-20 21:16:46 source > using CDC: false
2023-01-20 21:16:46 source > Queueing query for table: events
2023-01-20 21:16:46 source > Preparing query for table: events
2023-01-20 21:16:46 source > Table events cursor count: expected 1, actual 1

In logs from 1.0.37, the last source message happens a minute or more before I canceled it:

2023-01-20 21:34:08 source > Found table: public.events_of_interest
2023-01-20 21:34:08 source > using CDC: false
2023-01-20 21:34:08 source > using CDC: false
2023-01-20 21:34:09 destination > HikariPool-1 - Added connection net.snowflake.client.jdbc.SnowflakeConnectionV1@7cca01a8
2023-01-20 21:34:09 destination > HikariPool-1 - Start completed.
2023-01-20 21:34:09 destination > closing connection
2023-01-20 21:34:10 destination > Preparing staging area in destination completed for schema AIRBYTE stream analytics_events
2023-01-20 21:34:10 destination > Preparing tmp tables in destination completed.
2023-01-20 21:35:42 INFO i.a.w.t.TemporalAttemptExecution(lambda$getCancellationChecker$5):228 - Running sync worker cancellation...

I've only waited up to 6 minutes before canceling as this is on a sync that normally takes end-to-end 1 minute to run.

Comment on lines +29 to +30
SELECT (SELECT COUNT(*) FROM %s) AS %s,
pg_relation_size('%s') AS %s;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description says the intention is to only estimate table sizes using pg_relation_size, but won't this query end up doing a full count also? I believe this is what's causing my sync to hang for minutes (and maybe more) as attempting a count on my 186M row table takes quite a while.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the number of rows is needed, I'd recommend using an estimate via SELECT reltuples AS estimate FROM pg_class WHERE relname = 'table_name'. [1]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aha, you know about this as part of #21499.

@akashkulk
Copy link
Contributor Author

akashkulk commented Jan 20, 2023

I believe something about this PR causes our Postgres sync to hang. For what it's worth, our Postgres database is on Heroku.

Notice below in logs from 1.0.36 that the Queueing query for table: events happens immediately:

2023-01-20 21:16:46 source > Found table: public.pg_stat_statements
2023-01-20 21:16:46 source > Found table: public.item_cache
2023-01-20 21:16:46 source > Found table: public.events_of_interest
2023-01-20 21:16:46 source > using CDC: false
2023-01-20 21:16:46 source > using CDC: false
2023-01-20 21:16:46 source > Queueing query for table: events
2023-01-20 21:16:46 source > Preparing query for table: events
2023-01-20 21:16:46 source > Table events cursor count: expected 1, actual 1

In logs from 1.0.37, the last source message happens a minute or more before I canceled it:

2023-01-20 21:34:08 source > Found table: public.events_of_interest
2023-01-20 21:34:08 source > using CDC: false
2023-01-20 21:34:08 source > using CDC: false
2023-01-20 21:34:09 destination > HikariPool-1 - Added connection net.snowflake.client.jdbc.SnowflakeConnectionV1@7cca01a8
2023-01-20 21:34:09 destination > HikariPool-1 - Start completed.
2023-01-20 21:34:09 destination > closing connection
2023-01-20 21:34:10 destination > Preparing staging area in destination completed for schema AIRBYTE stream analytics_events
2023-01-20 21:34:10 destination > Preparing tmp tables in destination completed.
2023-01-20 21:35:42 INFO i.a.w.t.TemporalAttemptExecution(lambda$getCancellationChecker$5):228 - Running sync worker cancellation...

I've only waited up to 6 minutes before canceling as this is on a sync that normally takes end-to-end 1 minute to run.

Thanks for pointing this out. I have done some analysis and testing above to determine how much latency is added due to the select count(*) : #20783 (comment)

But it seems like there are still some cases where the latency is a problem if especially if your sync is really fast without the estimation.

I also opened issue
#21499 to track the optimization you mentioned above, although I was experimenting with a different query : select reltuples::int8 as count from pg_class c JOIN pg_catalog.pg_namespace n ON n.oid=c.relnamespace where nspname='schema' AND relname='table_name';

The issue I had encountered with that query is that the estimates were really off for smaller tables. However, I don't think this is too much of a problem : With smaller tables we can omit emitting the trace message itself. So, will start fixing this issue. In the meantime, I suggest using 1.0.36.

FYI I have #21683 open to address this issue

@visnup
Copy link
Contributor

visnup commented Feb 7, 2023

Thanks! Probably obvious, but works for me now!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Database Sources (Postgres) implements Progress Bar protocol - AirbyteEstimateTraceMessage
8 participants