🎉 8890 Source MySql: Fix large table issue by fetch size #17236
Conversation
/test connector=connectors/source-mysql
Build Passed. Test summary info:

/test connector=connectors/source-mysql-strict-encrypt
Build Passed. Test summary info:
@@ -279,7 +279,7 @@ protected void initTests() {
        .fullSourceDataType(fullSourceType)
        .airbyteType(JsonSchemaType.STRING_TIME_WITHOUT_TIMEZONE)
        // JDBC driver can process only "clock" (00:00:00-23:59:59) values.
        .addInsertValues("'-22:59:59'", "'23:59:59'", "'00:00:00'")
Why this change?
I'm a bit confused why this passed before with a negative time value. I'm getting:
SQLState: S1009, Message: The value '-22:59:59' is an invalid TIME value. JDBC Time objects represent a wall-clock time and not a duration as MySQL treats them. If you are treating this type as a duration, consider retrieving this value as a string and dealing with it according to your requirements.
-22:59:59 looks like an invalid value to me, or should MySQL handle it? Could you explain?
The test should pass. This might mean that the connection parameters that were being applied successfully before are no longer being applied with your changes, which could indicate a bug.
I second what Subodh is saying here. The functional change in this PR does not seem related to this test change. However, if this test passed before the functional changes and does not pass after them, something is wrong with the functional changes.
This looks like a bug in mysql-connector-java to me; I checked versions 8.0.22 and 8.0.30.
With useCursorFetch=true, mysql-connector-java uses MysqlBinaryValueDecoder for MysqlType.FIELD_TYPE_TIME, which strips the negative sign via:
if (negative) {
    days *= -1;
}
With useCursorFetch=false, the decoder is MysqlTextValueDecoder instead. Its decodeTime method does the same transformation twice: first from a negative to a positive value and then back again, and at the end it throws an error if the value is negative 🤔
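Before getting to the options, here is a minimal reproduction sketch of the failure mode described above (table, column, and connection details are hypothetical, not connector code):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class NegativeTimeRepro {

  public static void main(String[] args) throws SQLException {
    // useCursorFetch is deliberately NOT set, so the text protocol decoder
    // (MysqlTextValueDecoder) handles TIME columns.
    try (Connection conn = DriverManager.getConnection(
            "jdbc:mysql://localhost:3306/test", "user", "password");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT time_col FROM time_test")) {
      while (rs.next()) {
        // rs.getObject(1) throws SQLState S1009 for a value like '-22:59:59', as reported above.
        // Reading the raw text sidesteps the driver's java.sql.Time conversion:
        String raw = rs.getString(1); // e.g. "-22:59:59"
        System.out.println(raw);
      }
    }
  }
}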
So what options do we have?
- Return useCursorFetch=true. In this case MySQL will create temporary tables for SELECT queries. I did not notice any performance difference between useCursorFetch true and false when combined with setFetchSize(Integer.MIN_VALUE). We could also open a new ticket about removing useCursorFetch.
- Remove the test with a negative time for now. But that can lead to issues if someone has a negative time value in their DB.
- Get the string from the result set and convert it ourselves. That would look something like:
@Override
protected void putTime(final ObjectNode node, final String columnName, final ResultSet resultSet, final int index) throws SQLException {
  // Read the TIME column as text to avoid the driver's java.sql.Time conversion.
  node.put(columnName, convertToTime(resultSet.getString(index)));
}

private String convertToTime(String time) {
  // Drop a leading minus sign before parsing, since LocalTime cannot represent negative values.
  time = time.startsWith("-") ? time.substring(1) : time;
  return LocalTime.parse(time).format(TIME_FORMATTER);
}
and
@Override
public JsonNode rowToJson(final ResultSet queryContext) throws SQLException {
  // the first call communicates with the database. after that the result is cached.
  final java.sql.ResultSetMetaData metadata = queryContext.getMetaData();
  final int columnCount = metadata.getColumnCount();
  final ObjectNode jsonNode = (ObjectNode) Jsons.jsonNode(Collections.emptyMap());
  for (int i = 1; i <= columnCount; i++) {
    final String columnType = metadata.getColumnTypeName(i);
    if (columnType.equalsIgnoreCase("TIME")) {
      // getObject will fail as it tries to parse a time with a negative value
      queryContext.getString(i);
    } else {
      queryContext.getObject(i);
    }
    if (queryContext.wasNull()) {
      continue;
    }
    // convert to java types that will convert into reasonable json.
    setJsonField(queryContext, i, jsonNode);
  }
  return jsonNode;
}
These would go in the MySqlSourceOperations class.
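As a quick sanity check of the convertToTime helper above (using java.time's ISO formatter as a stand-in for the connector's TIME_FORMATTER constant, which is defined elsewhere):

import java.time.LocalTime;
import java.time.format.DateTimeFormatter;

public class ConvertToTimeCheck {

  // Stand-in for the connector's TIME_FORMATTER constant (assumption for this sketch).
  private static final DateTimeFormatter TIME_FORMATTER = DateTimeFormatter.ISO_LOCAL_TIME;

  private static String convertToTime(String time) {
    // Strip a leading minus sign so LocalTime.parse accepts the value.
    time = time.startsWith("-") ? time.substring(1) : time;
    return LocalTime.parse(time).format(TIME_FORMATTER);
  }

  public static void main(String[] args) {
    System.out.println(convertToTime("-22:59:59")); // prints 22:59:59
    System.out.println(convertToTime("23:59:59"));  // prints 23:59:59
  }
}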
I pushed changes. Please have a look
@grishick @subodh1810
Thank you for digging into this. My preference is to return useCursorFetch=true, since you have not seen this affecting performance negatively while using setFetchSize(Integer.MIN_VALUE). Our test coverage is not exhaustive, so I am more concerned about what other behaviors may change because of switching to MysqlTextValueDecoder instead of MysqlBinaryValueDecoder.
I know this is an old thread, but this is the most relevant place I could think of to mention it:
I'm running into an issue with a MySQL TIME type, with the following error:
SQLState: S1009, Message: The value '30:00:00' is an invalid TIME value. JDBC Time objects represent a wall-clock time and not a duration as MySQL treats them. If you are treating this type as a duration, consider retrieving this value as a string and dealing with it according to your requirements.
This happens regardless of useCursorFetch=true or useCursorFetch=false.
I'm currently debugging the issue, but any help or thoughts are most welcome.
@Override
public void initialize(final Connection connection, final Statement preparedStatement) throws SQLException {
  connection.setAutoCommit(false);
  preparedStatement.setFetchSize(Integer.MIN_VALUE);
Can you explain this?
Why is changing the fetch size from 10 rows needed?
What is the meaning of passing a negative value (-2^31)?
More in the ticket, but in short:
We have an issue fetching data when a MySQL table has more than roughly 1,000,000 records. Because we are using jdbcUrl.append("?useCursorFetch=true") with a fetch size of 10, it puts significant load on the MySQL instance (disk storage and memory). If the instance has enough resources, this leads to a long wait before the sync starts, like the 4-5 hours for 23,355,867 records in the linked issue. If the instance cannot handle the request (in my case there was not enough disk storage: a 50 GB instance for a 14 GB table), the sync fails. I also had to restart the instance to release its memory.
setFetchSize(Integer.MIN_VALUE) disables client-side caching of the entire response and gives you rows as they arrive, with no extra round-trip per row. With this approach the sync starts immediately for any table size and puts no memory load on the DB.
We tested a sync between a MySQL source and an S3 destination with a 1,000,000-record table:
- Fetch size 10: 7 minutes to start the sync and 6 minutes to load 1 GB from the table.
- Fetch size Integer.MIN_VALUE: the sync starts immediately and takes 6 minutes to load 1 GB from the table.
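For reference, the streaming pattern described above looks roughly like this in plain JDBC (a minimal sketch with a hypothetical query and connection details; the connector wires this up through the initialize override shown in the diff above):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class StreamingFetchSketch {

  public static void main(String[] args) throws SQLException {
    // Hypothetical connection details.
    try (Connection connection = DriverManager.getConnection(
        "jdbc:mysql://localhost:3306/test", "user", "password")) {
      connection.setAutoCommit(false);

      // A forward-only, read-only statement with fetch size Integer.MIN_VALUE tells
      // Connector/J to stream rows instead of buffering the whole result set client-side.
      try (PreparedStatement statement = connection.prepareStatement(
              "SELECT * FROM large_table",
              ResultSet.TYPE_FORWARD_ONLY,
              ResultSet.CONCUR_READ_ONLY)) {
        statement.setFetchSize(Integer.MIN_VALUE);

        try (ResultSet resultSet = statement.executeQuery()) {
          while (resultSet.next()) {
            // Each row is consumed as it arrives, so memory use stays flat
            // regardless of the table size.
            process(resultSet);
          }
        }
      }
    }
  }

  private static void process(ResultSet row) {
    // Placeholder for whatever per-row handling the caller needs.
  }
}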
/test connector=connectors/source-mysql
Build Passed. Test summary info:
LGTM
/test connector=connectors/source-mysql
Build Failed. Test summary info:
Seems like the integration tests were broken by PR #17394.
…fetch_size_update
/test connector=connectors/source-mysql
Build Passed. Test summary info:
Fixed
/publish connector=connectors/source-mysql
If you have connectors that successfully published but failed definition generation, follow step 4 here.
/publish connector=connectors/source-mysql-strict-encrypt
If you have connectors that successfully published but failed definition generation, follow step 4 here.
…7236)
* 8890 Fix large table issue by fetch size
* 8890 Bump version
* auto-bump connector version [ci skip]
Co-authored-by: Octavia Squidington III <octavia-squidington-iii@users.noreply.github.com>
What

The combination of useCursorFetch=true and preparedStatement.setFetchSize(10) pushes MySQL to use a cursor on the DB instance side. This consumes the DB instance's disk and memory when we try to fetch more than the instance can accommodate. In our test with a table of 1,000,000 records (about 14 GB), the only way to make the sync work was to increase the instance's disk from 50 GB to 65 GB.

How

The other approach is to use preparedStatement.setFetchSize(Integer.MIN_VALUE). It simply disables client-side caching of the entire response and gives you rows as they arrive, so it has no extra effect on the DB.
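A compact, hedged sketch of how the setup described in "How" could be wired up (the helper name is illustrative, not the connector's actual API; the real change is the initialize override shown in the review diff above):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public final class FetchModeSketch {

  private FetchModeSketch() {}

  // Prepares a statement the way the "How" section describes: autocommit off and a fetch size of
  // Integer.MIN_VALUE, so Connector/J streams rows instead of buffering the whole result set
  // client-side (or forcing a server-side cursor, as fetch size 10 did with useCursorFetch=true).
  public static PreparedStatement prepareStreaming(final Connection connection, final String sql)
      throws SQLException {
    connection.setAutoCommit(false);
    final PreparedStatement statement = connection.prepareStatement(
        sql, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
    statement.setFetchSize(Integer.MIN_VALUE);
    return statement;
  }
}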