Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🎉 8890 Source MySql: Fix large table issue by fetch size #17236

Merged
merged 10 commits into from
Oct 7, 2022

Conversation

suhomud
Copy link
Contributor

@suhomud suhomud commented Sep 27, 2022

What

The combination of useCursorFetch=true and preparedStatement.setFetchSize(10); push to use cursor on MySql DB instance side. This leads to consuming DB instance disk memory when we try to fetch more than aware table size. In the test we did table with 1 000 000 records and near 14GB size. The only way to fix this issue was to increase disk memory from 50GB to 65GB.

How

The other approach will be to use preparedStatement.setFetchSize(Integer.MIN_VALUE); It actually just disables client-side caching of the entire response and gives you responses as they arrive as a result it has no affect on the DB

@github-actions
Copy link
Contributor

NOTE ⚠️ Changes in this PR affect the following connectors. Make sure to run corresponding integration tests:

  • source-mysql-strict-encrypt
  • source-mysql

@suhomud suhomud changed the title 8890 Fix large table issue by fetch size 8890 Source MySql: Fix large table issue by fetch size Sep 27, 2022
@suhomud suhomud changed the title 8890 Source MySql: Fix large table issue by fetch size 🎉 8890 Source MySql: Fix large table issue by fetch size Sep 27, 2022
@suhomud
Copy link
Contributor Author

suhomud commented Sep 27, 2022

/test connector=connectors/source-mysql

🕑 connectors/source-mysql https://github.com/airbytehq/airbyte/actions/runs/3136630256
✅ connectors/source-mysql https://github.com/airbytehq/airbyte/actions/runs/3136630256
No Python unittests run

Build Passed

Test summary info:

All Passed

@suhomud
Copy link
Contributor Author

suhomud commented Sep 27, 2022

/test connector=connectors/source-mysql-strict-encrypt

🕑 connectors/source-mysql-strict-encrypt https://github.com/airbytehq/airbyte/actions/runs/3136636475
✅ connectors/source-mysql-strict-encrypt https://github.com/airbytehq/airbyte/actions/runs/3136636475
No Python unittests run

Build Passed

Test summary info:

All Passed

@github-actions
Copy link
Contributor

NOTE ⚠️ Changes in this PR affect the following connectors. Make sure to run corresponding integration tests:

  • source-mysql
  • source-mysql-strict-encrypt

@suhomud
Copy link
Contributor Author

suhomud commented Sep 27, 2022

/test connector=connectors/source-mysql

🕑 connectors/source-mysql https://github.com/airbytehq/airbyte/actions/runs/3138281129
✅ connectors/source-mysql https://github.com/airbytehq/airbyte/actions/runs/3138281129
No Python unittests run

Build Passed

Test summary info:

All Passed

@suhomud
Copy link
Contributor Author

suhomud commented Sep 27, 2022

/test connector=connectors/source-mysql-strict-encrypt

🕑 connectors/source-mysql-strict-encrypt https://github.com/airbytehq/airbyte/actions/runs/3138284467
✅ connectors/source-mysql-strict-encrypt https://github.com/airbytehq/airbyte/actions/runs/3138284467
No Python unittests run

Build Passed

Test summary info:

All Passed

@github-actions
Copy link
Contributor

NOTE ⚠️ Changes in this PR affect the following connectors. Make sure to run corresponding integration tests:

  • source-mysql
  • source-mysql-strict-encrypt

@@ -279,7 +279,7 @@ protected void initTests() {
.fullSourceDataType(fullSourceType)
.airbyteType(JsonSchemaType.STRING_TIME_WITHOUT_TIMEZONE)
// JDBC driver can process only "clock"(00:00:00-23:59:59) values.
.addInsertValues("'-22:59:59'", "'23:59:59'", "'00:00:00'")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused why it passed before with a negative date. I'm having:
SQLState: S1009, Message: The value '-22:59:59' is an invalid TIME value. JDBC Time objects represent a wall-clock time and not a duration as MySQL treats them. If you are treating this type as a duration, consider retrieving this value as a string and dealing with it according to your requirements.
-22:59:59 looks as invalid value for me or should mysql handle it? Could you explain?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test should pass, this might mean that the connection parameters that were being applied successfully before are not being applied with your changes which might highlight a bug

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I second what Subodh is saying here. The functional change in this PR does not seem related to this test change. However, if this test passes before the functional changes and does not pass after - something is wrong with the functional changes.

Copy link
Contributor Author

@suhomud suhomud Oct 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like a bug for me in mysql-connector-java. Checked 8.0.22 and 8.0.30 versions.

So if I use useCursorFetch=true mysql-connector-java will use MysqlBinaryValueDecoder for MysqlType.FIELD_TYPE_TIME which will remove negative value by:

        if (negative) {
            days *= -1;
        } 

But if useCursorFetch=false Decoder will be MysqlTextValueDecoder which also has decodeTime method but it did this transformation twice. First from negative to positive value and then back and in the end it checks if value is negative then throw an Error 🤔

So what options do we have?

  1. Return useCursorFetch=true in this case mysql will create some tmp tables for selects queries. I did not notice the performance change with useCursorFetch true or false and setFetchSize(Integer.MIN_VALUE). Also open new ticket about removing useCursorFetch

  2. Remove test with negative time as for now. But it can leads to issues if someone has a negative time value in db

  3. Getting string from result set and made it custom. So it looks something like:

  @Override
  protected void putTime(final ObjectNode node, final String columnName, final ResultSet resultSet, final int index) throws SQLException {
    node.put(columnName, convertToTime(resultSet.getString(index)));
  }

  private String convertToTime(String time) {
    time = time.startsWith("-") ? time.substring(1) : time;
    return LocalTime.parse(time).format(TIME_FORMATTER);
  }

and

@Override
 public JsonNode rowToJson(final ResultSet queryContext) throws SQLException {
   // the first call communicates with the database. after that the result is cached.
   final java.sql.ResultSetMetaData metadata = queryContext.getMetaData();
   final int columnCount = metadata.getColumnCount();
   final ObjectNode jsonNode = (ObjectNode) Jsons.jsonNode(Collections.emptyMap());

   for (int i = 1; i <= columnCount; i++) {
     final String columnType = metadata.getColumnTypeName(i);

     if (columnType.equalsIgnoreCase("TIME")) {
       // getObject will fail as it tries to parse time with negative value
       queryContext.getString(i);
     } else {
       queryContext.getObject(i);
     }
     if (queryContext.wasNull()) {
       continue;
     }

     // convert to java types that will convert into reasonable json.
     setJsonField(queryContext, i, jsonNode);
   }

   return jsonNode;
 }

In MySqlSourceOperations class

@grishick @subodh1810

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed changes. Please have a look
@grishick @subodh1810

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for digging into this. My preference is to Return useCursorFetch=true since you have not seen this affecting performance negatively while using setFetchSize(Integer.MIN_VALUE). Our test coverage is not exhaustive and so I am more concerned about what other behaviors may change because of switching to MysqlTextValueDecoder instead of MysqlBinaryValueDecoder.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@grishick returning useCursorFetch=true is also fine for me since I did not notice performance change or DB resource consuming. I added it back 8257491

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is an old thread but this is the most relevant place I could think of mentioning:
I'm running into an issue with a MySQL TIME type, with the following error:

SQLState: S1009, Message: The value '30:00:00' is an invalid TIME value. JDBC Time objects represent a wall-clock time and not a duration as MySQL treats them. If you are treating this type as a duration, consider retrieving this value as a string and dealing with it according to your requirements.

This is regardless of useCursorFetch=true or useCursorFetch=false, happens for both.
I'm currently debugging the issue, but any help or thoughts is most welcome.

@Override
public void initialize(final Connection connection, final Statement preparedStatement) throws SQLException {
connection.setAutoCommit(false);
preparedStatement.setFetchSize(Integer.MIN_VALUE);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain this?
Why is changing fetch size from 10 rows needed?
What is the meaning of putting a negative value (-2^31)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More in the ticket but in short:
We have an issue to fetch data when mysql db table has more than 1 000 000 records (approximately). Because we are using jdbcUrl.append("?useCursorFetch=true") and fetch size 10 it loads MySql resources significantly (disk storage and memory). So if instance has enough resources it will lead to a long waiting before the sync start like in this issue 4-5 hours for 23 355 867 records. In case if instance cannot handle request (in my case it was not enough disk storage 50 GB instance for 14 GB table) it will fail the sync. Also to release instance memory I have to restart the instance.

The setFetchSize(Integer.MIN_VALUE) disables client-side caching of the entire response and gives you responses as they arrive there is no round-trip per row required. With this approach the sync starts immediately with any table size and no memory load on db.

We tested sync between Mysql Source and S3 destination with 1 000 000 record table.
Fetch size 10
7 minutes to start the sync and 6 minutes to load 1 GB from table
Fetch size Integer.MIN_VALUE
Start sync immediately and 6 minutes to load 1GB from table

@suhomud
Copy link
Contributor Author

suhomud commented Oct 4, 2022

/test connector=connectors/source-mysql

🕑 connectors/source-mysql https://github.com/airbytehq/airbyte/actions/runs/3183640484
✅ connectors/source-mysql https://github.com/airbytehq/airbyte/actions/runs/3183640484
No Python unittests run

Build Passed

Test summary info:

All Passed

Copy link
Contributor

@alexandr-shegeda alexandr-shegeda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -279,7 +279,7 @@ protected void initTests() {
.fullSourceDataType(fullSourceType)
.airbyteType(JsonSchemaType.STRING_TIME_WITHOUT_TIMEZONE)
// JDBC driver can process only "clock"(00:00:00-23:59:59) values.
.addInsertValues("'-22:59:59'", "'23:59:59'", "'00:00:00'")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for digging into this. My preference is to Return useCursorFetch=true since you have not seen this affecting performance negatively while using setFetchSize(Integer.MIN_VALUE). Our test coverage is not exhaustive and so I am more concerned about what other behaviors may change because of switching to MysqlTextValueDecoder instead of MysqlBinaryValueDecoder.

@suhomud
Copy link
Contributor Author

suhomud commented Oct 6, 2022

/test connector=connectors/source-mysql

🕑 connectors/source-mysql https://github.com/airbytehq/airbyte/actions/runs/3194916543
❌ connectors/source-mysql https://github.com/airbytehq/airbyte/actions/runs/3194916543
🐛 https://gradle.com/s/mwso2ranefyk6

Build Failed

Test summary info:

Could not find result summary

@github-actions
Copy link
Contributor

github-actions bot commented Oct 6, 2022

NOTE ⚠️ Changes in this PR affect the following connectors. Make sure to run corresponding integration tests:

  • source-mysql
  • source-mysql-strict-encrypt

@suhomud
Copy link
Contributor Author

suhomud commented Oct 6, 2022

Seems like integration tests were broken by this PR #17394

@github-actions
Copy link
Contributor

github-actions bot commented Oct 6, 2022

NOTE ⚠️ Changes in this PR affect the following connectors. Make sure to run corresponding integration tests:

  • source-mysql-strict-encrypt
  • source-mysql

@suhomud
Copy link
Contributor Author

suhomud commented Oct 6, 2022

/test connector=connectors/source-mysql

🕑 connectors/source-mysql https://github.com/airbytehq/airbyte/actions/runs/3197344803
✅ connectors/source-mysql https://github.com/airbytehq/airbyte/actions/runs/3197344803
No Python unittests run

Build Passed

Test summary info:

All Passed

@suhomud suhomud requested a review from grishick October 6, 2022 12:56
@suhomud
Copy link
Contributor Author

suhomud commented Oct 6, 2022

Seems like integration tests were broken by this PR #17394

Fixed

@github-actions github-actions bot added the area/documentation Improvements or additions to documentation label Oct 7, 2022
@github-actions
Copy link
Contributor

github-actions bot commented Oct 7, 2022

NOTE ⚠️ Changes in this PR affect the following connectors. Make sure to run corresponding integration tests:

  • source-mysql
  • source-mysql-strict-encrypt

@suhomud
Copy link
Contributor Author

suhomud commented Oct 7, 2022

/publish connector=connectors/source-mysql

🕑 Publishing the following connectors:
connectors/source-mysql
https://github.com/airbytehq/airbyte/actions/runs/3203697389


Connector Did it publish? Were definitions generated?
connectors/source-mysql

if you have connectors that successfully published but failed definition generation, follow step 4 here ▶️

@suhomud
Copy link
Contributor Author

suhomud commented Oct 7, 2022

/publish connector=connectors/source-mysql

🕑 Publishing the following connectors:
connectors/source-mysql
https://github.com/airbytehq/airbyte/actions/runs/3204326284


Connector Did it publish? Were definitions generated?
connectors/source-mysql

if you have connectors that successfully published but failed definition generation, follow step 4 here ▶️

@suhomud suhomud merged commit 3501abe into master Oct 7, 2022
@suhomud suhomud deleted the suhomud/8890_mysql_fetch_size_update branch October 7, 2022 13:25
letiescanciano added a commit that referenced this pull request Oct 7, 2022
…vation

* master: (32 commits)
  fixed octavia position and z-index on onboarding page (#17708)
  Revert "Revert "Do not wait the end of a reset to return an update (#17591)" (#17640)" (#17669)
  source-google-analytics-v4: use hits metric for check (#17717)
  Source linkedin-ads: retry 429/5xx when refreshing access token (#17724)
  🐛 Source Mixpanel: solve cursor field none expected array (#17699)
  🎉 8890 Source MySql: Fix large table issue by fetch size (#17236)
  Test e2e testing tool commands (#17722)
  fixed escape character i18n error (#17706)
  Docs: adds missing " in transformations-with-airbyte.md (#17723)
  Change Osano token to new project (#17720)
  Source Github: improve 502 handling for `comments` stream (#17715)
  #17506 source snapchat marketing: retry failed request for refreshing access token (#17596)
  MongoDb Source: Increase performance of discover (#17614)
  Testing tool commands for run scenarios (#17550)
  Kustomize: Missing NORMALIZATION_JOB_* environment variables in stable-with-resource-limits overlays (#17713)
  Fix console errors (#17696)
  Revert: #17047 Airbyte CDK: Improve error for returning non-iterable from connectors parse_response (#17707)
  #17047 Airbyte CDK: Improve error for returning non-iterable from connectors parse_response (#17626)
  📝 Postgres source: document occasional full refresh under cdc mode (#17705)
  Bump Airbyte version from 0.40.12 to 0.40.13 (#17682)
  ...
@grishick
Copy link
Contributor

grishick commented Oct 7, 2022

/publish connector=connectors/source-mysql-strict-encrypt

if you have connectors that successfully published but failed definition generation, follow step 4 here ▶️

jhammarstedt pushed a commit to jhammarstedt/airbyte that referenced this pull request Oct 31, 2022
…7236)

* 8890 Fix large table issue by fetch size

* 8890 Bump version

* auto-bump connector version [ci skip]

Co-authored-by: Octavia Squidington III <octavia-squidington-iii@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/connectors Connector related issues area/documentation Improvements or additions to documentation connectors/source/mysql
Projects
None yet
8 participants