Respect cutoff timestamp during flush #11599

jowlyzhang · 2023-07-11T02:56:09Z

Make flush respect the cutoff timestamp full_history_ts_low as much as possible for the "user-defined timestamps in Memtables only" feature. We achieve this by not proceeding with the actual flush and instead rescheduling the same FlushRequest, so that a follow-up flush job can repeat the check after some interval.

This approach doesn't work well for atomic flush, so the feature is currently not supported in combination with atomic flush. The approach also requires a customized method to get the next immediately bigger user-defined timestamp, so it is currently limited to comparators that use uint64_t as the user-defined timestamp format. This support can be extended once we add such a customized method to AdvancedColumnFamilyOptions.
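As an illustration of what "next immediately bigger user-defined timestamp" could mean for the uint64_t format, here is a minimal sketch. It assumes the timestamp is stored as 8 little-endian bytes; the helper names EncodeU64Ts, DecodeU64Ts and NextBiggerU64Ts are hypothetical and are not functions from this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string>

// Hypothetical helper: encode a uint64_t timestamp as 8 little-endian bytes
// (the byte layout is an assumption of this sketch).
inline std::string EncodeU64Ts(uint64_t ts) {
  std::string out(sizeof(uint64_t), '\0');
  for (std::size_t i = 0; i < sizeof(uint64_t); ++i) {
    out[i] = static_cast<char>((ts >> (8 * i)) & 0xff);
  }
  return out;
}

// Hypothetical helper: decode an 8-byte little-endian timestamp.
inline uint64_t DecodeU64Ts(const std::string& buf) {
  assert(buf.size() == sizeof(uint64_t));  // precondition: exactly 8 bytes
  uint64_t ts = 0;
  for (std::size_t i = 0; i < sizeof(uint64_t); ++i) {
    ts |= static_cast<uint64_t>(static_cast<unsigned char>(buf[i])) << (8 * i);
  }
  return ts;
}

// Writes the smallest timestamp strictly greater than `ts` into `*next`.
// Returns false (leaving `*next` untouched) if `ts` is already the maximum
// uint64_t value and therefore has no successor.
inline bool NextBiggerU64Ts(const std::string& ts, std::string* next) {
  const uint64_t value = DecodeU64Ts(ts);
  if (value == std::numeric_limits<uint64_t>::max()) {
    return false;
  }
  *next = EncodeU64Ts(value + 1);
  return true;
}
```

A timestamp format without a natural successor (for example, an arbitrary byte string) would need its own version of this step, which is why the description ties the feature to uint64_t for now.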

For a non-atomic flush request, a column family can have at most one FlushRequest in the flush_queue_ at any time. Deduplication is done both at FlushRequest enqueueing time (SchedulePendingFlush) and at dequeueing time (PopFirstFromFlushQueue). We hold the db mutex from when a FlushRequest is popped from the queue until the same FlushRequest is rescheduled, so no other FlushRequest with a higher max_memtable_id can be added to the flush_queue_ in between and block us from re-enqueueing the same FlushRequest.
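The queueing invariant can be modeled roughly as follows. This is a simplified, single-threaded sketch (the caller is assumed to hold the db mutex); the FlushQueue class and its members are hypothetical and do not mirror the actual DBImpl data structures.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <unordered_set>

// Simplified stand-in for a non-atomic flush request.
struct FlushRequest {
  uint32_t cf_id;            // column family the request belongs to
  uint64_t max_memtable_id;  // flush memtables with IDs up to this one
};

// Hypothetical model of a flush queue that holds at most one request per
// column family. All methods assume the caller holds the db mutex.
class FlushQueue {
 public:
  // Enqueues `req` unless its column family already has a queued request.
  // Returns true if the request was enqueued.
  bool Schedule(const FlushRequest& req) {
    if (!queued_cfs_.insert(req.cf_id).second) {
      return false;  // deduplicated: this column family is already queued
    }
    queue_.push_back(req);
    return true;
  }

  // Removes and returns the oldest request and clears its "queued" marker,
  // so the same column family can be scheduled again.
  FlushRequest PopFirst() {
    assert(!queue_.empty());
    FlushRequest req = queue_.front();
    queue_.pop_front();
    queued_cfs_.erase(req.cf_id);
    return req;
  }

  std::size_t Size() const { return queue_.size(); }

 private:
  std::deque<FlushRequest> queue_;
  std::unordered_set<uint32_t> queued_cfs_;
};
```

Because the mutex is held between PopFirst() and the subsequent Schedule() of the same request, no competing request for that column family can slip into the queue in the meantime, so the re-enqueue in this model always succeeds.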

The flush proceeds nevertheless if postponing it would risk entering write stall mode, e.g. because accumulated write buffers would exceed the max_write_buffer_number setting. When this happens, the newest user-defined timestamp in the involved Memtables needs to be tracked, and we use it to increase full_history_ts_low, which is an inclusive cutoff timestamp: RocksDB promises to keep all user-defined timestamps equal to or newer than it.
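The decision described above can be sketched as a small pure function. This is an illustrative model only: FlushState, FlushDecision, RisksWriteStall, Decide and RaisedCutoff are hypothetical names, the write-stall check is a simplified stand-in for the real condition, and timestamps are assumed to be plain uint64_t values.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

enum class FlushDecision {
  kFlushNow,             // every timestamp is already below the cutoff
  kPostpone,             // reschedule the request and check again later
  kFlushAndRaiseCutoff,  // flush anyway, then move the cutoff forward
};

// Hypothetical snapshot of the inputs to the decision.
struct FlushState {
  uint64_t newest_udt;           // newest timestamp in the memtables to flush
  uint64_t full_history_ts_low;  // inclusive cutoff timestamp
  int unflushed_write_buffers;   // write buffers not yet flushed
  int max_write_buffer_number;   // configured limit
};

// Simplified stand-in for "postponing risks a write stall".
inline bool RisksWriteStall(const FlushState& s) {
  return s.unflushed_write_buffers >= s.max_write_buffer_number;
}

inline FlushDecision Decide(const FlushState& s) {
  if (s.newest_udt < s.full_history_ts_low) {
    return FlushDecision::kFlushNow;
  }
  if (RisksWriteStall(s)) {
    return FlushDecision::kFlushAndRaiseCutoff;
  }
  return FlushDecision::kPostpone;
}

// New cutoff after a forced flush: the next timestamp after the newest one
// flushed, so the inclusive cutoff no longer covers data whose timestamps
// were dropped. Precondition: the newest timestamp has a successor.
inline uint64_t RaisedCutoff(const FlushState& s) {
  assert(s.newest_udt >= s.full_history_ts_low);
  assert(s.newest_udt < std::numeric_limits<uint64_t>::max());
  return s.newest_udt + 1;
}
```

For example, with a cutoff of 50 and a newest timestamp of 100, the model postpones the flush while write buffers are below the limit, and otherwise flushes and raises the cutoff to 101.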

Test plan:

./column_family_test --gtest_filter="*RetainUDT*"
./memtable_list_test --gtest_filter="*WithTimestamp*"
./flush_job_test --gtest_filter="*WithTimestamp*"

facebook-github-bot · 2023-07-18T19:24:28Z

@jowlyzhang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

ajkr

Did we consider using the timestamps in MemTableListVersion::memlist_history_? That should be populated since MyRocks already uses max_write_buffer_size_to_maintain > 0. It wouldn't be easy since lookups don't/can't return results directly from memtable history. But I wonder if we can use the existing history to add minimal support for ReadOptions::timestamp. That is, ensure the visible keys have timestamps below the query timestamp, but not necessarily tell the user the exact timestamp for those keys.

One possible implementation could involve adding an auxiliary ordered list of (seqno,timestamp) to memtables. Then you could translate a ReadOptions::timestamp to a particular sequence number and use that for visibility checking. There'd need to be logic to prevent seqno zeroing in the SST files for seqnos above the seqno corresponding to full_history_ts_low.
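To make the idea concrete, the auxiliary list could look roughly like the sketch below. It is purely illustrative of the proposal, not something implemented in this PR: SeqnoTimestampIndex is a hypothetical class, timestamps are assumed to be uint64_t, and both sequence numbers and timestamps are assumed to be non-decreasing in insertion order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical ordered list of (seqno, timestamp) pairs kept per memtable.
class SeqnoTimestampIndex {
 public:
  // Records that writes at `seqno` carry timestamp `ts`. Both values must be
  // non-decreasing across calls (an assumption of this sketch).
  void Append(uint64_t seqno, uint64_t ts) {
    assert(entries_.empty() || (seqno >= entries_.back().first &&
                                ts >= entries_.back().second));
    entries_.emplace_back(seqno, ts);
  }

  // Translates a read timestamp into the largest recorded sequence number
  // whose timestamp is <= `read_ts`. Returns false if every recorded entry
  // is newer than `read_ts`.
  bool SeqnoForTimestamp(uint64_t read_ts, uint64_t* seqno) const {
    // First entry whose timestamp is strictly greater than read_ts.
    auto it = std::upper_bound(
        entries_.begin(), entries_.end(), read_ts,
        [](uint64_t ts, const std::pair<uint64_t, uint64_t>& entry) {
          return ts < entry.second;
        });
    if (it == entries_.begin()) {
      return false;
    }
    *seqno = (it - 1)->first;
    return true;
  }

 private:
  std::vector<std::pair<uint64_t, uint64_t>> entries_;  // (seqno, timestamp)
};
```

The returned sequence number could then serve as the visibility bound for the read, which is why sequence numbers above the one corresponding to full_history_ts_low would have to survive in the SST files.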

ajkr

I'm fine with proceeding with this originally agreed on approach despite my question about the approach above. Had some comments on the implementation though.

db/column_family.cc

db/db_impl/db_impl_compaction_flush.cc

include/rocksdb/advanced_options.h

db/db_impl/db_impl_compaction_flush.cc

db/flush_job.cc

db/memtable.cc

db/db_impl/db_impl_compaction_flush.cc

facebook-github-bot · 2023-07-24T21:36:32Z

@jowlyzhang has updated the pull request. You must reimport the pull request before landing.

jowlyzhang · 2023-07-24T22:52:09Z

> Did we consider using the timestamps in MemTableListVersion::memlist_history_? That should be populated since MyRocks already uses max_write_buffer_size_to_maintain > 0. It wouldn't be easy since lookups don't/can't return results directly from memtable history. But I wonder if we can use the existing history to add minimal support for ReadOptions::timestamp. That is, ensure the visible keys have timestamps below the query timestamp, but not necessarily tell the user the exact timestamp for those keys.

Thank you for the proposal. Is this for making memtable reads respect ReadOptions::timestamp? I think "ensure the visible keys have timestamps below the query timestamp" is already realized by the internal comparator not placing the target beyond keys that are invisible w.r.t. the timestamp.

ajkr · 2023-07-25T16:26:26Z

> > Did we consider using the timestamps in MemTableListVersion::memlist_history_? That should be populated since MyRocks already uses max_write_buffer_size_to_maintain > 0. It wouldn't be easy since lookups don't/can't return results directly from memtable history. But I wonder if we can use the existing history to add minimal support for ReadOptions::timestamp. That is, ensure the visible keys have timestamps below the query timestamp, but not necessarily tell the user the exact timestamp for those keys.

> Thank you for the proposal. Is this for making memtable reads respect ReadOptions::timestamp? I think "ensure the visible keys have timestamps below the query timestamp" is already realized by the internal comparator not placing the target beyond keys that are invisible w.r.t. the timestamp.

It's for supporting ReadOptions::timestamp values that are older than the oldest memtable accessed by point lookups/iterators. It's possible since MyRocks retains even older memtables that were already flushed, according to max_write_buffer_size_to_maintain. Those memtables are available in MemTableListVersion::memlist_history_. However, read queries do not access memlist_history_; they read that data from SST files instead, so they do not see the timestamps today.

jowlyzhang · 2023-07-25T16:43:36Z

> > > Did we consider using the timestamps in MemTableListVersion::memlist_history_? That should be populated since MyRocks already uses max_write_buffer_size_to_maintain > 0. It wouldn't be easy since lookups don't/can't return results directly from memtable history. But I wonder if we can use the existing history to add minimal support for ReadOptions::timestamp. That is, ensure the visible keys have timestamps below the query timestamp, but not necessarily tell the user the exact timestamp for those keys.

> > Thank you for the proposal. Is this for making memtable reads respect ReadOptions::timestamp? I think "ensure the visible keys have timestamps below the query timestamp" is already realized by the internal comparator not placing the target beyond keys that are invisible w.r.t. the timestamp.

> It's for supporting ReadOptions::timestamp values that are older than the oldest memtable accessed by point lookups/iterators. It's possible since MyRocks retains even older memtables that were already flushed, according to max_write_buffer_size_to_maintain. Those memtables are available in MemTableListVersion::memlist_history_. However, read queries do not access memlist_history_; they read that data from SST files instead, so they do not see the timestamps today.

Oh, I see, thank you for the detailed explanation. They will probably be very interested in having this capability since the memory is already spent. I will look into this more. Thanks for the proposal and the pointers.

db/db_impl/db_impl_compaction_flush.cc

db/memtable.cc

facebook-github-bot · 2023-07-25T20:27:57Z

@jowlyzhang has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2023-07-25T20:32:15Z

@jowlyzhang has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2023-07-25T22:31:51Z

@jowlyzhang has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2023-07-25T23:15:35Z

@jowlyzhang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

ajkr

Looks great! Thanks very much for addressing all the feedback!

ajkr · 2023-07-26T00:42:55Z

db/memtable.cc

@@ -725,8 +728,8 @@ Status MemTable::Add(SequenceNumber s, ValueType type,
 }
 }

- size_t ts_sz = GetInternalKeyComparator().user_comparator()->timestamp_size();
- Slice key_without_ts = StripTimestampFromUserKey(key, ts_sz);
+ MaybeUpdateNewestUDT(key_slice);


You might only be able to do this in Add() in the !allow_concurrent branch. I know you're already doing it here, but a race condition detector might complain about it at some point. The convention for allow_concurrent appears to be to apply updates requiring serialization in BatchPostProcess().

Thank you for pointing this out! For now I have moved it to only the !allow_concurrent branch and added a TODO for it. It looks like all the member variables that BatchPostProcess() updates have built-in thread safety via std::atomic, so its invocation is currently not specifically serialized. A hacky workaround would be to track the newest UDT as a uint64_t here, since we are only supporting that type of user-defined timestamp for this feature already. Anyway, I mentioned this caveat in the feature description and will follow up with MyRocks on the priority of having this.
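A minimal sketch of that workaround, assuming the newest UDT is kept as a std::atomic<uint64_t> and raised with a compare-and-swap loop, might look like this. NewestUdtTracker is a hypothetical class, not code from this PR, and it uses 0 as the "no timestamp seen yet" value.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical thread-safe tracker for the newest (largest) uint64_t
// user-defined timestamp inserted so far.
class NewestUdtTracker {
 public:
  // Raises the tracked value to `ts` if `ts` is newer. Safe to call from
  // concurrent writers without external serialization.
  void Update(uint64_t ts) {
    uint64_t current = newest_.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak reloads `current`, so the loop ends
    // as soon as another writer has stored a value >= ts.
    while (current < ts &&
           !newest_.compare_exchange_weak(current, ts,
                                          std::memory_order_relaxed)) {
    }
  }

  uint64_t Newest() const { return newest_.load(std::memory_order_relaxed); }

 private:
  std::atomic<uint64_t> newest_{0};  // 0 means no timestamp recorded yet
};
```

This only works because the timestamp fits in a single atomic word; a variable-length timestamp would still need serialized updates.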

facebook-github-bot · 2023-07-26T19:03:15Z

@jowlyzhang has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2023-07-26T20:14:52Z

@jowlyzhang has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2023-07-26T21:14:15Z

@jowlyzhang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2023-07-26T23:30:25Z

@jowlyzhang merged this pull request in 4ea7b79.

facebook-github-bot added the CLA Signed label Jul 11, 2023

jowlyzhang marked this pull request as draft July 11, 2023 02:56

jowlyzhang force-pushed the flush_eligibility branch 4 times, most recently from d6c9740 to a62b324 Compare July 11, 2023 19:11

jowlyzhang requested a review from ajkr July 11, 2023 19:58

jowlyzhang marked this pull request as ready for review July 11, 2023 19:58

ajkr reviewed Jul 24, 2023

ajkr requested changes Jul 24, 2023

ajkr reviewed Jul 24, 2023

jowlyzhang force-pushed the flush_eligibility branch from a62b324 to a80d373 Compare July 24, 2023 21:36

jowlyzhang requested a review from ajkr July 24, 2023 22:54

ajkr reviewed Jul 25, 2023

ajkr reviewed Jul 25, 2023

jowlyzhang force-pushed the flush_eligibility branch from a80d373 to 445ebb2 Compare July 25, 2023 20:27

jowlyzhang force-pushed the flush_eligibility branch from 445ebb2 to 3f6b5b5 Compare July 25, 2023 20:32

jowlyzhang force-pushed the flush_eligibility branch from 3f6b5b5 to 1c556cd Compare July 25, 2023 22:31

jowlyzhang requested a review from ajkr July 25, 2023 23:15

ajkr approved these changes Jul 26, 2023

Respect cutoff timestamp during flush

d96af8d

jowlyzhang added 2 commits July 26, 2023 12:01

Address review comments

aaba378

address review comments

8ca8483

jowlyzhang force-pushed the flush_eligibility branch from 1c556cd to 6fe8ccf Compare July 26, 2023 19:03

track newest UDT in live path only when concurrent memtable insertion is disabled

822b885

jowlyzhang force-pushed the flush_eligibility branch from 6fe8ccf to 822b885 Compare July 26, 2023 20:14

facebook-github-bot closed this in 4ea7b79 Jul 26, 2023

facebook-github-bot added the Merged label Jul 26, 2023

jowlyzhang deleted the flush_eligibility branch July 27, 2023 23:21

igorcanadi mentioned this pull request Jan 17, 2024

[SYS-6913] Upgrade RocksDB-Cloud to 8.9.1 rockset/rocksdb-cloud#315

Closed
