Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Respect cutoff timestamp during flush #11599

Closed
wants to merge 4 commits into from

Conversation

jowlyzhang
Copy link
Contributor

@jowlyzhang jowlyzhang commented Jul 11, 2023

Make flush respect the cutoff timestamp full_history_ts_low as much as possible for the user-defined timestamps in Memtables only feature. We achieve this by not proceeding with the actual flushing but instead reschedule the same FlushRequest so a follow up flush job can continue with the check after some interval.

This approach doesn't work well for atomic flush, so this feature currently is not supported in combination with atomic flush. Furthermore, this approach also requires a customized method to get the next immediately bigger user-defined timestamp. So currently it's limited to comparator that use uint64_t as the user-defined timestamp format. This support can be extended when we add such a customized method to AdvancedColumnFamilyOptions.

For non atomic flush request, at any single time, a column family can only have as many as one FlushRequest for it in the flush_queue_. There is deduplication done at FlushRequest enqueueing(SchedulePendingFlush) and dequeueing time (PopFirstFromFlushQueue). We hold the db mutex between when a FlushRequest is popped from the queue and the same FlushRequest get rescheduled, so no other FlushRequest with a higher max_memtable_id can be added to the flush_queue_ blocking us from re-enqueueing the same FlushRequest.

Flush is continued nevertheless if there is risk of entering write stall mode had the flush being postponed, e.g. due to accumulation of write buffers, exceeding the max_write_buffer_number setting. When this happens, the newest user-defined timestamp in the involved Memtables need to be tracked and we use it to increase the full_history_ts_low, which is an inclusive cutoff timestamp for which RocksDB promises to keep all user-defined timestamps equal to and newer than it.

Tet plan:

./column_family_test --gtest_filter="*RetainUDT*"
./memtable_list_test --gtest_filter="*WithTimestamp*"
./flush_job_test --gtest_filter="*WithTimestamp*"

@jowlyzhang jowlyzhang marked this pull request as draft July 11, 2023 02:56
@jowlyzhang jowlyzhang force-pushed the flush_eligibility branch 4 times, most recently from d6c9740 to a62b324 Compare July 11, 2023 19:11
@jowlyzhang jowlyzhang requested a review from ajkr July 11, 2023 19:58
@jowlyzhang jowlyzhang marked this pull request as ready for review July 11, 2023 19:58
@facebook-github-bot
Copy link
Contributor

@jowlyzhang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Contributor

@ajkr ajkr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we consider using the timestamps in MemTableListVersion::memlist_history_? That should be populated since MyRocks already uses max_write_buffer_size_to_maintain > 0. It wouldn't be easy since lookups don't/can't return results directly from memtable history. But I wonder if we can use the existing history to add minimal support for ReadOptions::timestamp. That is, ensure the visible keys have timestamps below the query timestamp, but not necessarily tell the user the exact timestamp for those keys.

One possible implementation could involve adding an auxiliary ordered list of (seqno,timestamp) to memtables. Then you could translate a ReadOptions::timestamp to a particular sequence number and use that for visibility checking. There'd need to be logic to prevent seqno zeroing in the SST files for seqnos above the seqno corresponding to full_history_ts_low.

Copy link
Contributor

@ajkr ajkr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with proceeding with this originally agreed on approach despite my question about the approach above. Had some comments on the implementation though.

db/column_family.cc Outdated Show resolved Hide resolved
db/db_impl/db_impl_compaction_flush.cc Outdated Show resolved Hide resolved
db/db_impl/db_impl_compaction_flush.cc Outdated Show resolved Hide resolved
db/db_impl/db_impl_compaction_flush.cc Outdated Show resolved Hide resolved
include/rocksdb/advanced_options.h Show resolved Hide resolved
db/db_impl/db_impl_compaction_flush.cc Show resolved Hide resolved
db/flush_job.cc Show resolved Hide resolved
db/memtable.cc Outdated Show resolved Hide resolved
@facebook-github-bot
Copy link
Contributor

@jowlyzhang has updated the pull request. You must reimport the pull request before landing.

@jowlyzhang
Copy link
Contributor Author

Did we consider using the timestamps in MemTableListVersion::memlist_history_? That should be populated since MyRocks already uses max_write_buffer_size_to_maintain > 0. It wouldn't be easy since lookups don't/can't return results directly from memtable history. But I wonder if we can use the existing history to add minimal support for ReadOptions::timestamp. That is, ensure the visible keys have timestamps below the query timestamp, but not necessarily tell the user the exact timestamp for those keys.

Thank you for the proposal. Is this for implementing reading the memtable respect ReadOptions.timestamp? I think "ensure the visible keys have timestamps below the query timestamp" are realized by the internal comparator not placing target beyond invisible keys w.r.t timestamp.

@jowlyzhang jowlyzhang requested a review from ajkr July 24, 2023 22:54
@ajkr
Copy link
Contributor

ajkr commented Jul 25, 2023

Did we consider using the timestamps in MemTableListVersion::memlist_history_? That should be populated since MyRocks already uses max_write_buffer_size_to_maintain > 0. It wouldn't be easy since lookups don't/can't return results directly from memtable history. But I wonder if we can use the existing history to add minimal support for ReadOptions::timestamp. That is, ensure the visible keys have timestamps below the query timestamp, but not necessarily tell the user the exact timestamp for those keys.

Thank you for the proposal. Is this for implementing reading the memtable respect ReadOptions.timestamp? I think "ensure the visible keys have timestamps below the query timestamp" are realized by the internal comparator not placing target beyond invisible keys w.r.t timestamp.

It's for supporting ReadOptions::timestamps that are older than the oldest memtable accessed by point lookups/iterators. It's possible since MyRocks retains even older memtables that were already flushed according to max_write_buffer_size_to_maintain. Those memtables are available in MemTableListVersion::memlist_history_. However read queries do not access memlist_history_ but instead read that data from SST files so do not see the timestamps today.

@jowlyzhang
Copy link
Contributor Author

Did we consider using the timestamps in MemTableListVersion::memlist_history_? That should be populated since MyRocks already uses max_write_buffer_size_to_maintain > 0. It wouldn't be easy since lookups don't/can't return results directly from memtable history. But I wonder if we can use the existing history to add minimal support for ReadOptions::timestamp. That is, ensure the visible keys have timestamps below the query timestamp, but not necessarily tell the user the exact timestamp for those keys.

Thank you for the proposal. Is this for implementing reading the memtable respect ReadOptions.timestamp? I think "ensure the visible keys have timestamps below the query timestamp" are realized by the internal comparator not placing target beyond invisible keys w.r.t timestamp.

It's for supporting ReadOptions::timestamps that are older than the oldest memtable accessed by point lookups/iterators. It's possible since MyRocks retains even older memtables that were already flushed according to max_write_buffer_size_to_maintain. Those memtables are available in MemTableListVersion::memlist_history_. However read queries do not access memlist_history_ but instead read that data from SST files so do not see the timestamps today.

Oh, I see, thank you for the detailed explanation. They will probably be very interested in having this capability since the memory is already spent. I will look into this more. Thanks for the proposal and the pointers.

db/db_impl/db_impl_compaction_flush.cc Outdated Show resolved Hide resolved
db/db_impl/db_impl_compaction_flush.cc Outdated Show resolved Hide resolved
db/db_impl/db_impl_compaction_flush.cc Outdated Show resolved Hide resolved
db/db_impl/db_impl_compaction_flush.cc Outdated Show resolved Hide resolved
db/memtable.cc Outdated Show resolved Hide resolved
@facebook-github-bot
Copy link
Contributor

@jowlyzhang has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@jowlyzhang has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@jowlyzhang has updated the pull request. You must reimport the pull request before landing.

@jowlyzhang jowlyzhang requested a review from ajkr July 25, 2023 23:15
@facebook-github-bot
Copy link
Contributor

@jowlyzhang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Contributor

@ajkr ajkr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Thanks very much for addressing all the feedback!

db/memtable.cc Outdated
@@ -725,8 +728,8 @@ Status MemTable::Add(SequenceNumber s, ValueType type,
}
}

size_t ts_sz = GetInternalKeyComparator().user_comparator()->timestamp_size();
Slice key_without_ts = StripTimestampFromUserKey(key, ts_sz);
MaybeUpdateNewestUDT(key_slice);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might only be able to do this in Add() in the !allow_concurrent branch. I know you're already doing it here, but a race condition detector might complain about it at some point. The convention for allow_concurrent appears to be to apply updates requiring serialization in BatchPostProcess()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for pointing this out! I have for now move it to only the !allow_concurrent branch and added a TODO for it. Looks like all the member variables that BatchPostProcess() updates have built-in thread safety support via std::atomic, so currently its invocation is not specifically serialized. A hacky workaround would be to track the newest UDT as uint64_t here since we are only supporting that type of user-defined timestamp for this feature already. Anyways, I mentioned this caveat in the feature description and will follow up with MyRocks on the priority of having this.

@facebook-github-bot
Copy link
Contributor

@jowlyzhang has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@jowlyzhang has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@jowlyzhang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@jowlyzhang merged this pull request in 4ea7b79.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants