Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core] Generate changelog by copying data when records are insert-only #3568

Merged

Conversation

yunfengzhou-hub
Copy link
Contributor

Purpose

Linked issue: #3567

  • Implements custom copy in local filesystem and OSS
  • Adds the optimization to generate changelog by copying data file
  • Triggers the optimization when the following conditions are met
    • The filesystem implements copy
    • The changelog producer is INPUT
    • all records received since last flush are insert-only

Tests

API and Format

This PR does not affect API or storage format.

Documentation

This PR does not add a new feature that requires document.

@yunfengzhou-hub
Copy link
Contributor Author

Hi @JingsongLi could you please take a look at this PR?

Copy link

@Sxnan Sxnan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yunfengzhou-hub Thanks for the PR. I left some comments below.

@yunfengzhou-hub yunfengzhou-hub force-pushed the insert-only-changelog-producer branch 2 times, most recently from 019d7ee to 70d1fab Compare June 22, 2024 08:03
@yunfengzhou-hub
Copy link
Contributor Author

Thanks for the comments @JingsongLi. I have updated the PR according to the comments.

@yunfengzhou-hub yunfengzhou-hub force-pushed the insert-only-changelog-producer branch 2 times, most recently from 742b9e2 to fbd68e8 Compare June 24, 2024 08:26
@JingsongLi JingsongLi closed this Jun 25, 2024
@JingsongLi JingsongLi reopened this Jun 25, 2024
// and remove the following code block if proved to be of no regression.
boolean isCopyMethodOverridden;
try {
isCopyMethodOverridden =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just introduce a new FileIO interface?

interface NativeCopyFileIO extends FileIO {
    void nativeCopy(Path src, Path dst);
}

Copy link
Contributor Author

@yunfengzhou-hub yunfengzhou-hub Jun 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to offline discussion, I benchmarked the throughput of the default implementation introduced in FileIO in Aliyun, and it seems that the default implementation has been enough to bring performance improvement. I did not verify against other filesystems, but theoretically speaking copying should also bring performance improvement to them. Given this result, I'll directly add the copy method to FileIO, instead of introducing the new NativeCopyFileIO interface.

Copy link
Contributor

@JingsongLi JingsongLi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@JingsongLi JingsongLi merged commit 96a2db7 into apache:master Jun 29, 2024
hzjhjjyy pushed a commit to hzjhjjyy/incubator-paimon that referenced this pull request Jul 1, 2024
@liming30
Copy link
Contributor

liming30 commented Aug 1, 2024

@yunfengzhou-hub hi, when will StoreSinkWrite#withInsertOnly be called? I only see manual calls in the test code, which does not seem to be automatically inferred by the engine.

https://github.com/apache/paimon/blob/master/paimon-flink/paimon-flink-common/src/main/java/org/apache/paimon/flink/sink/StoreSinkWrite.java#L49

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants