-
Notifications
You must be signed in to change notification settings - Fork 7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Consider lightweight deleted rows when selecting parts to merge #58223
Conversation
37f500c
to
41d2134
Compare
This is an automated comment for commit be4554b with description of existing statuses. It's updated for the latest CI running ❌ Click here to open a full report in a separate page
Successful checks
|
5a03270
to
271c8a7
Compare
271c8a7
to
bb28d2b
Compare
@yakov-olkhovskiy failed integration test looks unrelated to the PR |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we shouldn't calculate existing_rows_count
on mutate, but rather on delete itself and then update on merge along with raws_count
@yakov-olkhovskiy I don't understand the
also @davenger do you have any comment on this? |
yes, I meant calculate on |
@yakov-olkhovskiy lightweight deleted rows will be physically deleted during merge, so we only need to calculate existing count during mutation on |
6794ea2
to
a63c8cd
Compare
Please consider extending |
@tavplubix It is ok to do that for new parts, but I think we should not change the |
Yes, but we also cannot add another file to an existing part without running a mutation. And it both cases we should take compatibility into account |
@jewelzqiu |
Let me summarize current options (add drawbacks): For existing parts:
For new parts:
Personally I prefer calculating on loading for existing parts, because:
As for the new parts I can live with either option, maybe extending @yakov-olkhovskiy @tavplubix what are your thoughts? |
for existing parts there is another option - left as it is |
Yes, we can just leave old parts as they are. Eventually, they will be merged and the existing rows count will be added |
Just in case, let me clarify what is the most complex part about compatibility: Imagine you have a cluster of multiple replicas, and you do rolling upgrade, upgrading them one by one (the usual scenario). You have upgraded one replica. It gets new INSERT and DELETE queries and execute meres and mutations. It writes a part with a new existing_count.txt file or count.txt in the new format. Parts are being replicated to other replicas (with old version), but other replicas do not expect that. Also, it's impossible to downgrade if something went wrong during upgrading. Possible (the simplest) solution: add a setting to enable/disable the new format |
@tavplubix let's say we have new part with existing_count.txt replicated to old version - wouldn't it just be ignored? or we count it as an error if unknown file is present? |
I'm not sure, need to check |
If so, it doesn't help resolving the original issue: existing large parts with lightweight deleted rows cannot be merged As for ability to downgrade to older version, I think both adding To minimize the impact on existing architecture, we could implement this way:
No files will be add or changed |
hi @yakov-olkhovskiy @tavplubix , can we proceed with this design? it won't change or add any files in data parts. |
the idea was that we gradually get rid of old parts without
it's not only ability to downgrade though - the question is how old version would behave having new parts replicated from new version. Probably this question is the same with ability to downgrade - i.e. how old version behaves having new parts. Overall I think we can proceed with your plan (maybe (1) is not necessary), but we definitely should investigate all these concerns. |
c649f76
to
7d8243a
Compare
@yakov-olkhovskiy I have implemented such method with a setting for step 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how effective this solution will be without existing_rows_count
in metadata...
@jewelzqiu I'm leaving for vacation for a week - let's continue when I'm back. Overall I think the usefulness if this without metadata will be quite limited... |
@yakov-olkhovskiy well, it is useful if |
@jewelzqiu I'm not sure on perspectives of enabling this by default though - it will slow down startup. I will discuss it tomorrow with the team... the PR itself looks good - I'll go through it one more time if we will decide to merge it... |
@tavplubix we abandoned the idea to have it in metadata - now it only calculates on a fly - do you still have any suspicions this feature could not work with SharedMergeTree? |
94c5846
to
f1b37b6
Compare
Co-authored-by: Yakov Olkhovskiy <99031427+yakov-olkhovskiy@users.noreply.github.com>
59e9131
to
05969a3
Compare
@yakov-olkhovskiy Any update? I have rebased on to latest master several days ago |
e4577fe
to
7f97b11
Compare
@jewelzqiu I merged it but maybe you would like to make an additional PR documenting this feature so ppl can use it... |
resolves #56728
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Consider lightweight deleted rows when selecting parts to merge