[FLINK-35167][cdc-connector] Introduce MaxCompute pipeline DataSink #3254
base: master
Conversation
...ute/src/main/java/org/apache/flink/cdc/connectors/maxcompute/sink/MaxComputeEventWriter.java
I am wondering how a commercial database sink like MaxCompute is supposed to do e2e tests?
I will soon be working on creating a Docker image for a
Hi @loserwang1024, I recently completed the development and release of a Docker image for the MaxCompute Emulator, and I have also added some related regression cases. Could you help me review the code in this pull request?
Thanks for your contribution, I've left some comments.
<td>optional</td>
<td style="word-wrap: break-word;">16</td>
<td>Integer</td>
<td>The number of buckets used when automatically creating a MaxCompute Transaction table. For usage, see <a href="ttps://help.aliyun.com/zh/maxcompute/user-guide/table-data-format">
Invalid link.
Fixed.
// "PostPartition", | ||
// new EventTypeInfo(), | ||
// new PostPartitionOperator(stream.getParallelism())) | ||
// .name("PartitionByBucket"); |
So PartitionOperator is actually unused?
I've found your Jira issue about this, and I think it's a more versatile and scalable solution, so we can wait for that Jira issue to be completed and build on top of it.
Sorry, I commented out this part of the code for debugging and forgot to uncomment it.
I think this code will stay here until the runtime optimization you mentioned is merged, and then I will re-implement this feature in that way.
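The partition-by-bucket idea deferred here can be illustrated with a minimal hash-based assignment. This is only a sketch, not the connector's actual `PartitionOperator`: the bucket count, the string key extraction, and the subtask mapping are all assumptions for illustration.

```java
// Minimal sketch of hash-based bucket partitioning, not the connector's
// real PartitionOperator. Assumes each record exposes a string key.
public class BucketPartitioner {
    private final int bucketCount;

    public BucketPartitioner(int bucketCount) {
        this.bucketCount = bucketCount;
    }

    /** Maps a record's key to a bucket in [0, bucketCount). */
    public int bucketFor(String key) {
        // Math.floorMod keeps the result non-negative even when
        // hashCode() is negative.
        return Math.floorMod(key.hashCode(), bucketCount);
    }

    /** Maps a bucket to a downstream subtask, as a partitioner would. */
    public int subtaskFor(int bucket, int parallelism) {
        return bucket % parallelism;
    }
}
```

Records that land in the same bucket are always routed to the same subtask, which is the property a per-bucket writer needs.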
import org.junit.jupiter.api.Test;

/** e2e test of SchemaEvolutionUtils. */
public class SchemaEvolutionUtilsTest {
This test class does not actually take effect; can we use the maxcompute/maxcompute-emulator:v0.0.3 image to test it?
Sure.
case CHAR:
case VARCHAR:
case TIME_WITHOUT_TIME_ZONE:
    return TypeInfoFactory.STRING;
DataType includes nullable/not-null information; do we lose this information here?
Yes, you are correct.
Additionally, I discovered that the MaxCompute SDK has an issue with creating tables based on primary keys: it ignores the user-specified primary key order during table creation. I plan to fix this next week.
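The nullability concern raised above can be sketched by carrying the original nullable flag alongside the converted type instead of dropping it in the switch. Everything here is illustrative: `TypeWithNullability` and the string-based type roots are hypothetical, not the connector's or the MaxCompute SDK's real API.

```java
// Sketch only: pairs a converted type name with the source column's
// nullability, since a plain switch on the type root discards it.
// The class and the string type roots are illustrative, not real APIs.
public class TypeWithNullability {
    public final String typeName;
    public final boolean nullable;

    public TypeWithNullability(String typeName, boolean nullable) {
        this.typeName = typeName;
        this.nullable = nullable;
    }

    public static TypeWithNullability convert(String typeRoot, boolean nullable) {
        switch (typeRoot) {
            case "CHAR":
            case "VARCHAR":
            case "TIME_WITHOUT_TIME_ZONE":
                // Preserve the source's nullability instead of losing it.
                return new TypeWithNullability("STRING", nullable);
            default:
                throw new IllegalArgumentException("Unsupported type: " + typeRoot);
        }
    }
}
```

The same pattern applies to every branch of the real mapping: convert the type root, but thread the nullable flag through to the target schema.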
# See the License for the specific language governing permissions and
# limitations under the License.

org.apache.flink.cdc.connectors.maxcompute.MaxComputeDataSinkFactory
It's better to add a log4j2-test.properties file under resources for debugging or testing purposes, like the other connectors.
Done.
@Override
public void write(Event element, Context context) throws IOException {
    LOG.info("Sink writer {} write {}.", this.context.getSubtaskId(), element);
It's unnecessary to create so many logs.
Sure, I forgot to remove this debug log.
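One common alternative to deleting such a log entirely is demoting per-record logging to debug level so production runs at INFO skip it. The sketch below uses `java.util.logging` purely so it is self-contained; the connector itself uses an SLF4J `LOG`, and the method shown is illustrative, not the connector's code.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Illustrative only: per-record logging guarded at FINE (debug) level,
// so a writer running at the default INFO level skips it entirely.
public class WriterLogging {
    private static final Logger LOG =
            Logger.getLogger(WriterLogging.class.getName());

    /** Returns true only if the record was actually logged. */
    public static boolean logRecord(int subtaskId, String element) {
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine("Sink writer " + subtaskId + " write " + element);
            return true;
        }
        return false;
    }
}
```

With SLF4J the equivalent is `LOG.debug(...)`, which is also lazy about formatting its arguments when debug is disabled.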
 * completion status of an executor, allowing the system to determine whether all relevant sessions
 * have been processed.
 */
public class SessionCommitCoordinator {
Is it better for us to change it to Manager or Helper, to distinguish it from Flink's OperatorCoordinator?
Sure, I couldn't come up with such a suitable name when I was naming it.
@Override
public void snapshotState(StateSnapshotContext context) throws Exception {
    super.snapshotState(context);
    sessionCache.clear();
So, do we need to request a new session ID after each checkpoint, which may have a performance impact? It is expected that the checkpoint interval will need to be larger, right?
Yes, we do.
Each session can no longer be used after being committed, so we must re-create a session and request a new sessionID.
And indeed, the checkpoint interval will need to be larger.
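The lifecycle described in this exchange can be sketched as a small cache that hands out session IDs lazily and discards them all at checkpoint time, since a committed session cannot be reused. This is a minimal illustration under assumptions: the epoch-based ID scheme is invented here and is not how the MaxCompute SDK assigns session IDs.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the per-checkpoint session lifecycle: sessions are created
// lazily per table, and all of them are dropped after a checkpoint so
// the next write requests fresh session IDs. ID format is illustrative.
public class SessionCache {
    private final Map<String, String> sessions = new HashMap<>();
    private long epoch = 0;

    /** Returns the current session for a table, creating one if absent. */
    public String sessionFor(String table) {
        return sessions.computeIfAbsent(table, t -> t + "-session-" + epoch);
    }

    /** Called from snapshotState: committed sessions are unusable afterwards. */
    public void onCheckpoint() {
        sessions.clear(); // force new session IDs after the commit
        epoch++;
    }
}
```

Because every checkpoint pays the cost of re-creating sessions, a longer checkpoint interval amortizes that overhead, which matches the trade-off discussed above.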
Thanks @dingxin-tech for the nice work. The CI wasn't triggered properly; we need to adjust the CI settings [1] whenever a new connector or module is added.
[1]https://github.com/apache/flink-cdc/blob/master/.github/workflows/flink_cdc.yml
Hi, I added tests for MaxCompute in the CI file. Additionally, I refactored the code to apply the newly released DataSink feature, which allows specifying a HashFunction, and upgraded the ODPS SDK. Could you please review this PR again?