Respect cutoff timestamp during flush #11599
Conversation
Force-pushed d6c9740 to a62b324.
@jowlyzhang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Did we consider using the timestamps in MemTableListVersion::memlist_history_? That should be populated since MyRocks already uses max_write_buffer_size_to_maintain > 0. It wouldn't be easy since lookups don't/can't return results directly from memtable history. But I wonder if we can use the existing history to add minimal support for ReadOptions::timestamp. That is, ensure the visible keys have timestamps below the query timestamp, but not necessarily tell the user the exact timestamp for those keys.
One possible implementation could involve adding an auxiliary ordered list of (seqno, timestamp) pairs to memtables. Then you could translate a ReadOptions::timestamp to a particular sequence number and use that for visibility checking. There'd need to be logic to prevent seqno zeroing in the SST files for seqnos above the seqno corresponding to full_history_ts_low.
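The translation step suggested above can be sketched as follows. This is a hedged illustration, not code from RocksDB: the function name, the plain std::vector of (seqno, timestamp) pairs, and the linear scan are all assumptions made for clarity.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical sketch: an ordered list of (seqno, timestamp) checkpoints
// recorded as writes land in the memtable; both columns are non-decreasing.
// Given a ReadOptions::timestamp, return the largest seqno whose recorded
// timestamp is <= the query timestamp; keys with larger seqnos would be
// treated as invisible to the read.
uint64_t TimestampToSeqnoUpperBound(
    const std::vector<std::pair<uint64_t, uint64_t>>& seqno_ts_list,
    uint64_t query_ts) {
  uint64_t visible_seqno = 0;  // nothing visible if every checkpoint is newer
  // A binary search would work too; a linear scan keeps the sketch obvious.
  for (const auto& [seqno, ts] : seqno_ts_list) {
    if (ts <= query_ts) {
      visible_seqno = seqno;
    } else {
      break;  // list is ordered, so later entries are only newer
    }
  }
  return visible_seqno;
}
```

With such a mapping, the existing sequence-number-based visibility check could serve timestamped reads without storing a timestamp per key.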
I'm fine with proceeding with the originally agreed-on approach despite my question above. I had some comments on the implementation, though.
Force-pushed a62b324 to a80d373.
@jowlyzhang has updated the pull request. You must reimport the pull request before landing.
Thank you for the proposal. Is this for implementing reading the memtable respecting
It's for supporting
Oh, I see, thank you for the detailed explanation. They will probably be very interested in having this capability since the memory is already spent. I will look into this more. Thanks for the proposal and the pointers.
Force-pushed a80d373 to 445ebb2, then to 3f6b5b5, then to 1c556cd.
@jowlyzhang has updated the pull request. You must reimport the pull request before landing.
@jowlyzhang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Looks great! Thanks very much for addressing all the feedback!
db/memtable.cc (Outdated)
@@ -725,8 +728,8 @@ Status MemTable::Add(SequenceNumber s, ValueType type,
    }
  }
  size_t ts_sz = GetInternalKeyComparator().user_comparator()->timestamp_size();
  Slice key_without_ts = StripTimestampFromUserKey(key, ts_sz);
  MaybeUpdateNewestUDT(key_slice);
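The hunk above strips a fixed-width user-defined timestamp (UDT) from the tail of the user key before tracking it. As a hedged illustration of that key layout (not the RocksDB API; StripTs and ExtractTsBigEndian are hypothetical helpers, and the 8-byte big-endian encoding is an assumption so that byte order matches numeric order):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical analogue of StripTimestampFromUserKey: the UDT occupies the
// last ts_sz bytes of the user key.
std::string StripTs(const std::string& user_key, size_t ts_sz) {
  return user_key.substr(0, user_key.size() - ts_sz);
}

// Decode the trailing timestamp, assuming an 8-byte big-endian encoding.
uint64_t ExtractTsBigEndian(const std::string& user_key, size_t ts_sz) {
  assert(ts_sz == sizeof(uint64_t));
  const char* p = user_key.data() + user_key.size() - ts_sz;
  uint64_t ts = 0;
  for (size_t i = 0; i < ts_sz; ++i) {
    ts = (ts << 8) | static_cast<unsigned char>(p[i]);
  }
  return ts;
}
```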
You might only be able to do this in Add() in the !allow_concurrent branch. I know you're already doing it here, but a race condition detector might complain about it at some point. The convention for allow_concurrent appears to be to apply updates requiring serialization in BatchPostProcess().
Thank you for pointing this out! For now I have moved it to only the !allow_concurrent branch and added a TODO for it. It looks like all the member variables that BatchPostProcess() updates have built-in thread safety via std::atomic, so currently its invocation is not specifically serialized. A hacky workaround would be to track the newest UDT as a uint64_t here, since that is the only user-defined timestamp type we support for this feature anyway. In any case, I mentioned this caveat in the feature description and will follow up with MyRocks on the priority of having this.
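The "hacky workaround" mentioned above can be sketched as a lock-free fetch-max on a std::atomic<uint64_t>, the same pattern BatchPostProcess() relies on for its other members. This is an illustrative sketch under that assumption, not the code the PR landed:

```cpp
#include <atomic>
#include <cstdint>

// Track the newest user-defined timestamp seen so far. Safe to call from
// concurrent Add() paths: compare_exchange_weak retries until either our
// value is installed or a newer one is already present.
void UpdateNewestUdt(std::atomic<uint64_t>& newest_udt, uint64_t ts) {
  uint64_t cur = newest_udt.load(std::memory_order_relaxed);
  while (ts > cur &&
         !newest_udt.compare_exchange_weak(cur, ts,
                                           std::memory_order_relaxed)) {
    // cur was refreshed by compare_exchange_weak; loop while ts is newer.
  }
}
```

Relaxed ordering suffices here because only the monotonically increasing maximum matters, not any ordering relative to other memory operations.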
Force-pushed 1c556cd to 6fe8ccf, then to 822b885.
@jowlyzhang has updated the pull request. You must reimport the pull request before landing.
@jowlyzhang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@jowlyzhang merged this pull request in 4ea7b79.
Make flush respect the cutoff timestamp full_history_ts_low as much as possible for the user-defined timestamps in Memtables only feature. We achieve this by not proceeding with the actual flush but instead rescheduling the same FlushRequest, so a follow-up flush job can repeat the check after some interval.
This approach doesn't work well for atomic flush, so the feature is currently not supported in combination with atomic flush. Furthermore, this approach requires a customized method to get the next immediately bigger user-defined timestamp, so it is currently limited to comparators that use uint64_t as the user-defined timestamp format. This support can be extended when we add such a customized method to AdvancedColumnFamilyOptions.
For a non-atomic flush request, at any single time a column family can have at most one FlushRequest in the flush_queue_. Deduplication is done at FlushRequest enqueueing (SchedulePendingFlush) and dequeueing time (PopFirstFromFlushQueue). We hold the db mutex between when a FlushRequest is popped from the queue and when the same FlushRequest gets rescheduled, so no other FlushRequest with a higher max_memtable_id can be added to the flush_queue_ that would block us from re-enqueueing the same FlushRequest.
Flush proceeds nevertheless if postponing it would risk entering write stall mode, e.g. due to accumulation of write buffers exceeding the max_write_buffer_number setting. When this happens, the newest user-defined timestamp in the involved Memtables needs to be tracked, and we use it to increase full_history_ts_low, which is an inclusive cutoff timestamp for which RocksDB promises to keep all user-defined timestamps equal to and newer than it.
Test plan:
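The flush-time decision the description outlines can be sketched as below. This is an illustrative reconstruction, not the PR's actual code: the function, enum, and the imm_memtable_count stall heuristic are assumptions standing in for the real write-stall condition.

```cpp
#include <cstdint>

enum class FlushDecision { kProceed, kReschedule };

// Postpone flushing memtables whose newest user-defined timestamp has not
// yet fallen below full_history_ts_low, unless postponing would risk a
// write stall from accumulated write buffers.
FlushDecision DecideFlush(uint64_t newest_udt, uint64_t full_history_ts_low,
                          int imm_memtable_count, int max_write_buffer_number) {
  // The cutoff already excludes these memtables' timestamps: safe to flush.
  if (newest_udt < full_history_ts_low) {
    return FlushDecision::kProceed;
  }
  // Too many accumulated write buffers: flush anyway to avoid a stall; the
  // caller would then raise full_history_ts_low past newest_udt.
  if (imm_memtable_count + 1 >= max_write_buffer_number) {
    return FlushDecision::kProceed;
  }
  // Otherwise re-enqueue the same FlushRequest and check again later.
  return FlushDecision::kReschedule;
}
```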