feat (tiering): implementing periodic defragmentation for second tier #2595
Conversation
@adiholden For now, I did not scan the reference count table, as that table doesn't provide the bin size... of course, one can always grab a hash, get the iterator, get the length, and then recalculate bin_size, but I felt it is a bit easier to obtain it directly from the PrimeIterator passed to the defrag function from db_slice.
src/server/db_slice.cc
Outdated
};

// Traverse a single segment every time this function is called.
for (int i = 0; i < 10; ++i) {
Why is this hardcoded to 10?
Yeah... this will be changed to a configurable parameter; right now this section's design may still change.
src/server/tiered_storage.cc
Outdated
auto prime_it = db_slice_.GetDBTable(db_index)->prime.FindByHash(key_hash);

// if the key still exists, load the key into memory and reschedule
if (!prime_it.is_done()) {
What if the key still exists but is already in memory? You don't want to call load and schedule an offload for this key.
Add another check to make sure the entry is externalized.
src/server/tiered_storage.cc
Outdated
@@ -225,7 +225,7 @@ void TieredStorage::InflightWriteRequest::Add(const PrimeKey& pk, const PrimeVal
   unsigned bin_size = kSmallBins[bin_index_];
   unsigned max_entries = NumEntriesInSmallBin(bin_size);

-  char* next_hash = block_start_ + entries_.size();
+  char* next_hash = block_start_ + entries_.size() * 8;
please make this 8 a constant
Done.
src/server/tiered_storage.cc
Outdated
size_t bin_size = GetBinSize(len);
unsigned max_entries = NumEntriesInSmallBin(bin_size);
float bin_util = (float)(refcnt_it->second) / (float)max_entries;
float defrag_bin_util_threshold = 0.2;
constexpr double kDefragBinThreshold
I made it a member variable of the tiered storage class.
src/server/tiered_storage.cc
Outdated
  return kSmallBins[bin_index];
}

void TieredStorage::Defrag(DbIndex db_index, PrimeIterator it) {
Last time we spoke, I suggested iterating over page_refcnt_ in the defrag task and, if you find an under-utilized page, loading its values. Are you going to change this PR according to that suggestion?
I just committed the change to implement the above.
src/server/tiered_storage.cc
Outdated
size_t hash_section_len = max_entries * 8;
std::vector<char> hash_section(hash_section_len);

auto ec = Read(offs_page * kBlockLen, hash_section_len, &hash_section[0]);
You should do only one read of the entire page and then populate all the relevant entries. In the current flow you issue max_entries reads, which all fetch the same data from disk and merely copy the relevant bytes at the given offset into your buffer.
Fixed.
src/server/db_slice.cc
Outdated
@@ -1233,6 +1233,12 @@ void DbSlice::ScheduleForOffloadStep(DbIndex db_indx, size_t increase_goal_bytes
   }
 }

+void DbSlice::DefragSecondTierStep(DbIndex db_indx) {
+  DCHECK(shard_owner()->tiered_storage());
+  FiberAtomicGuard guard;
Why do you need to make sure we don't preempt here?
You have a call to Read inside Defrag which can preempt. How come this guard does not fail?
Right... removed.
src/server/tiered_storage.h
Outdated
absl::flat_hash_map<uint32_t, uint8_t> page_refcnt_;
// absl::flat_hash_map<uint32_t, uint8_t> page_refcnt_;

absl::flat_hash_map<uint32_t, std::pair<unsigned, unsigned> > page_refcnt_;
I suggest using a struct instead of std::pair<unsigned, unsigned>; it will make the code clearer about what first and second mean here.
Please address this comment.
This is too simple, and in my opinion not necessary. But I added a comment above this type, which should clarify everything when somebody wants to change it.
src/server/tiered_storage.h
Outdated
@@ -92,12 +97,17 @@ class TieredStorage {
  struct PerDb;
  std::vector<PerDb*> db_arr_;

  absl::flat_hash_map<uint32_t, uint8_t> page_refcnt_;
  // absl::flat_hash_map<uint32_t, uint8_t> page_refcnt_;
Clean this up.
Fixed.
src/server/tiered_storage.h
Outdated
util::fb2::EventCount throttle_ec_;
TieredStats stats_;
size_t max_file_size_;
size_t allocated_size_ = 0;
bool shutdown_ = false;

unsigned int num_pages_to_scan_ = 10;
static constexpr for both
I think it's better this way so that we can later config-set them to tune the defragmentation parameters without restarting the server.
src/server/tiered_storage.cc
Outdated
}

void TieredStorage::Defrag(DbIndex db_index) {
  // start scanning from a random position in the ref_cnt_ table.
Why do we want to start at a random position?
Because if the first few entries never change and have no fragmentation, we would never get to collect the others.
src/server/tiered_storage.cc
Outdated
// start scanning from a random position in the ref_cnt_ table.
auto refcnt_it = std::next(std::begin(page_refcnt_), rand() % page_refcnt_.size());

for (unsigned int i = 0; i < num_pages_to_scan_; ++i) {
I don't think you need to limit the number of entries we check in the map; instead, limit the number of read operations we do. That is, you can go over the entire map: if none of the pages is below the threshold you will finish very fast, since no work is done when no defrag is needed. But you do want to limit the number of reads from disk every time this defrag function is called.
That's certainly another option. It is hard to say which one is better... the defragmentation happens when the CPU is idle.
> I don't think you need to limit the number of entries we check in the map; instead, limit the number of read operations we do.

The other question I had about limiting the number of reads: if all the pages have good utilization and do not need defragmentation, would we keep iterating forever, since the number of reads is always zero? We do not need to read any page to calculate utilization, as we already know the reference count and bin size.
src/server/tiered_storage.cc
Outdated
// if the key still exists, load the key into memory and reschedule offload
if (!prime_it.is_done()) {
  PrimeValue* entry = &prime_it->second;
  auto [offset, len] = entry->GetExternalSlice();
Call GetExternalSlice only after you verify entry->IsExternal().
You also need to verify that the offset points into this page, as we might have hash collisions between keys.
> call GetExternalSlice only after you verify entry->IsExternal()

Fixed.
> you also need to verify that the offset points into this page, as we might have hash collisions between keys

That's a great point! I added a check. If a collision happens and the offset is not the right one, how do we keep retrieving the next entry?
This should be supported by the FindByHash API: you can pass it an iterator parameter. If the iterator is empty it will do the find as you do now; if it is not empty it will retrieve the next iterator that has the same hash.
src/server/tiered_storage.cc
Outdated
}
// remove this entry from page_refcnt_
page_refcnt_.erase(refcnt_it);
alloc_.Free(offs_page * kBlockLen, kBlockLen);
How do you continue the scanning if you free a page? I don't see how this flow is handled: you erase the iterator from page_refcnt_ and then continue in the for loop, which will access the deleted iterator.
I thought page_refcnt_.erase(refcnt_it) would update the iterator? I'm curious: how come absl's erase function doesn't return the new iterator? What is the right way to erase from absl containers?
> how do you continue the scanning if you free a page?

I did a revision; please see if the new change makes sense.
src/server/tiered_storage.h
Outdated
float defrag_bin_util_threshold_ = 0.2;

// a queue of indices of pages that need to be defragmented.
std::queue<unsigned> pages_to_defrag_;
why not use a set?
src/server/tiered_storage.cc
Outdated
if (page_it == page_refcnt_.end())
  continue;

// the page exists, now check if this is still under utilized as this page
If pages_to_defrag_ were a set, you would remove from it on free and would not need these checks.
Changed to use a set.
src/server/tiered_storage.h
Outdated
util::fb2::EventCount throttle_ec_;
TieredStats stats_;
size_t max_file_size_;
size_t allocated_size_ = 0;
bool shutdown_ = false;

float defrag_bin_util_threshold_ = 0.2;
change to const
This parameter should later be configurable at runtime.
If a later PR introduces a flag instead of this variable, you will remove it; but for now this should be a const.
src/server/tiered_storage.cc
Outdated
@@ -163,7 +164,7 @@ class TieredStorage::InflightWriteRequest {
   void Add(const PrimeKey& pk, const PrimeValue& pv);

   // returns how many entries were offloaded.
-  unsigned ExternalizeEntries(PerDb::BinRecord* bin_record, DbSlice* db_slice);
+  std::pair<unsigned, unsigned> ExternalizeEntries(PerDb::BinRecord* bin_record, DbSlice* db_slice);
update the comment above
src/server/tiered_storage.cc
Outdated
  return kSmallBins[bin_index];
}

void TieredStorage::Defrag(DbIndex db_index) {
I see this function is still called from HeartBeat. You should not loop until !pages_to_defrag_.empty(), because this can make the step very long.
I remember last time we discussed that this function will keep being preempted, so is it safe to let it run from beginning to end?
Only if it runs on a dedicated fiber. When I said that, we were discussing calling it from a different fiber that would be invoked when we have free CPU, as we do with the defrag task; that is not the case in the current implementation.

Please add unit tests to your code.
fixes #2433
This update introduces a feature that periodically performs a defragmentation pass by scanning external keys. If a key's SSD page has bin utilization below a certain threshold, all of the page's values are loaded back into memory and offloading is rescheduled for each, consolidating fragmented keys into a new, better-utilized block.
Todo: move the defrag into a different place rather than abusing the heartbeat function.