
feat (tiering): implementing periodic defragmentation for second tier #2595

Closed

Conversation


@theyueli theyueli commented Feb 15, 2024

fixes #2433

This update introduces a new feature that periodically runs a defragmentation pass over the "external keys". If a key's SSD page has a bin utilization below a certain threshold, the values of all keys on that page are loaded back into memory and each is rescheduled for offloading. This consolidates the fragmented keys into a new, better-utilized block.

Todo: move the defrag logic somewhere more appropriate instead of abusing the heartbeat function.
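
In sketch form, the pass described above would look roughly like this; the helpers PageUtilization and LoadAndReschedule are hypothetical, not code from this PR:

```cpp
// Illustrative sketch of the defragmentation pass described above.
for (const auto& [page, usage] : page_refcnt_) {
  // Utilization is known from bookkeeping alone (live entries vs. the
  // page's bin capacity), so this check needs no disk read.
  float util = PageUtilization(usage);
  if (util >= defrag_bin_util_threshold_)
    continue;  // well utilized, nothing to do

  // Under-utilized page: load every surviving value back into memory and
  // reschedule its offload so the values get re-packed into a fresh block.
  LoadAndReschedule(page);
}
```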

@theyueli theyueli added the enhancement New feature or request label Feb 15, 2024

theyueli commented Feb 15, 2024

@adiholden For now, I did not scan the reference count table, as that table doesn't provide the bin size. Of course, one can always grab a hash, get the iterator, get the length, and then recalculate bin_size, but I felt it is a bit easier to obtain it directly from the PrimeIterator passed to the defrag function from db_slice.

@theyueli theyueli self-assigned this Feb 15, 2024
};

// Traverse a single segment every time this function is called.
for (int i = 0; i < 10; ++i) {
Contributor:

Why is this hardcoded to 10?

Contributor Author:

Yeah, this will be changed to a configurable parameter; this section's design may still need to change.
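
For reference, a runtime-tunable knob would typically be an Abseil flag; a minimal sketch, assuming a hypothetical flag name and keeping the current default of 10:

```cpp
#include "absl/flags/flag.h"

// Hypothetical flag; name and default are illustrative only.
ABSL_FLAG(uint32_t, tiered_defrag_segments_per_step, 10,
          "Segments traversed by one defragmentation step");

// At the call site:
//   uint32_t n = absl::GetFlag(FLAGS_tiered_defrag_segments_per_step);
//   for (uint32_t i = 0; i < n; ++i) { ... }
```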

auto prime_it = db_slice_.GetDBTable(db_index)->prime.FindByHash(key_hash);

// if the key still exists, load the key into memory and reschedule
if (!prime_it.is_done()) {
Collaborator:

What if the key still exists but is already in memory? You don't want to call load and schedule an offload for this key.

Contributor Author:

Added another check to make sure the entry is externalized.

@@ -225,7 +225,7 @@ void TieredStorage::InflightWriteRequest::Add(const PrimeKey& pk, const PrimeVal
unsigned bin_size = kSmallBins[bin_index_];
unsigned max_entries = NumEntriesInSmallBin(bin_size);

-  char* next_hash = block_start_ + entries_.size();
+  char* next_hash = block_start_ + entries_.size() * 8;
Collaborator:

please make this 8 a constant

Contributor Author:

done.

size_t bin_size = GetBinSize(len);
unsigned max_entries = NumEntriesInSmallBin(bin_size);
float bin_util = (float)(refcnt_it->second) / (float)max_entries;
float defrag_bin_util_threshold = 0.2;
Collaborator:

constexpr double kDefragBinThreshold

Contributor Author:

I made it a member variable of the tiered storage class.

return kSmallBins[bin_index];
}

void TieredStorage::Defrag(DbIndex db_index, PrimeIterator it) {
Collaborator:

Last time we spoke, I suggested iterating over page_refcnt_ in the defrag task, and if you find an under-utilized page, loading its values. Are you going to change this PR according to that suggestion?

Contributor Author:

I just committed the change to implement the above.

size_t hash_section_len = max_entries * 8;
std::vector<char> hash_section(hash_section_len);

auto ec = Read(offs_page * kBlockLen, hash_section_len, &hash_section[0]);
Collaborator:

You should do only one read of the entire page and then populate all the relevant entries.
In the current flow you issue max_entries reads, each of which reads the same data from disk and just copies into your buffer the relevant slice at the offset you give.
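
A minimal sketch of that single-read flow, reusing the names from the snippet above and assuming kHashSize stands for the 8-byte hash entry size:

```cpp
// Read the whole page once, then slice per-entry hashes out of the
// in-memory copy instead of issuing one disk read per entry.
std::vector<char> page(kBlockLen);
auto ec = Read(offs_page * kBlockLen, kBlockLen, page.data());
if (!ec) {
  for (unsigned i = 0; i < max_entries; ++i) {
    const char* entry_hash = page.data() + i * kHashSize;
    // ... process entry_hash ...
  }
}
```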

Contributor Author:

fixed.

@@ -1233,6 +1233,12 @@ void DbSlice::ScheduleForOffloadStep(DbIndex db_indx, size_t increase_goal_bytes
}
}

void DbSlice::DefragSecondTierStep(DbIndex db_indx) {
DCHECK(shard_owner()->tiered_storage());
FiberAtomicGuard guard;
Collaborator:

Why do you need to make sure we don't preempt here?

Collaborator:

You have a call to Read inside Defrag, which can preempt. How come this guard does not fail?

Contributor Author:

right... removed...

- absl::flat_hash_map<uint32_t, uint8_t> page_refcnt_;
+ // absl::flat_hash_map<uint32_t, uint8_t> page_refcnt_;
+
+ absl::flat_hash_map<uint32_t, std::pair<unsigned, unsigned> > page_refcnt_;
Collaborator:

I suggest using a struct instead of std::pair<unsigned, unsigned>; it will make it clearer what first and second mean here.
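
For example, a sketch of such a struct; the field names are guesses at what the pair's first/second hold, based on the surrounding discussion:

```cpp
struct PageInfo {
  unsigned ref_count;  // live entries referencing the page
  unsigned bin_size;   // small-bin size the page was written with
};

absl::flat_hash_map<uint32_t, PageInfo> page_refcnt_;
```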

Collaborator:

please address this comment

Contributor Author:

In my opinion this is too simple for a struct to be necessary, but I added a comment above the type; that should clarify everything for whoever wants to change it.

@@ -92,12 +97,17 @@ class TieredStorage {
struct PerDb;
std::vector<PerDb*> db_arr_;

- absl::flat_hash_map<uint32_t, uint8_t> page_refcnt_;
+ // absl::flat_hash_map<uint32_t, uint8_t> page_refcnt_;
Collaborator:

clean

Contributor Author:

fixed.

util::fb2::EventCount throttle_ec_;
TieredStats stats_;
size_t max_file_size_;
size_t allocated_size_ = 0;
bool shutdown_ = false;

unsigned int num_pages_to_scan_ = 10;
Collaborator:

static constexpr for both

Contributor Author:

I think it's better this way, so that we can later CONFIG SET them to tune the defragmentation parameters without restarting the server.

}

void TieredStorage::Defrag(DbIndex db_index) {
// start scanning from a random position in the ref_cnt_ table.
Collaborator:

why do we want to start in a random position?

Contributor Author:

Because if the first few entries never change and have no fragmentation, we would never get to collect the others.

// start scanning from a random position in the ref_cnt_ table.
auto refcnt_it = std::next(std::begin(page_refcnt_), rand() % page_refcnt_.size());

for (unsigned int i = 0; i < num_pages_to_scan_; ++i) {
Collaborator:

I don't think you need to limit the number of entries we check in the map; instead, limit the number of read operations we do.
I.e., you can go over the entire map, and if none of the pages are below the threshold you will finish very fast, since there is no work when no defrag is needed.
But you do want to limit the number of reads from disk each time this defrag function is called.
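
A sketch of that read-budget loop; kMaxDefragReads and the helpers IsUnderUtilized/DefragPage are hypothetical:

```cpp
// Scan the whole map cheaply, but cap the expensive disk reads per call.
unsigned reads_left = kMaxDefragReads;  // hypothetical budget constant
for (auto it = page_refcnt_.begin();
     it != page_refcnt_.end() && reads_left > 0; ++it) {
  if (!IsUnderUtilized(it->second))  // pure bookkeeping, no I/O
    continue;
  --reads_left;
  DefragPage(it->first);  // the only step that touches the disk
  // (If DefragPage erases map entries, the iteration needs the erase
  // idiom discussed further down in this thread.)
}
```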

Contributor Author:

That's certainly another option. It is hard to say which one is better; the defragmentation happens when the CPU is idle.

Contributor Author:

> I don't think you need to limit the number of entries we check in the map; instead, limit the number of read operations we do. I.e., you can go over the entire map, and if none of the pages are below the threshold you will finish very fast, since there is no work when no defrag is needed. But you do want to limit the number of reads from disk each time this defrag function is called.

The other question I had about limiting the number of reads: if all the pages have good utilization and do not need defragmentation, would we keep iterating forever, since the number of reads is always zero? We do not need to read any page to calculate utilization, as we already know the reference count and bin size.

// if the key still exists, load the key into memory and reschedule offload
if (!prime_it.is_done()) {
PrimeValue* entry = &prime_it->second;
auto [offset, len] = entry->GetExternalSlice();
Collaborator:

Call GetExternalSlice() only after you verify entry->IsExternal().

Collaborator:

You also need to verify that the offset points into this page, since we might have hash collisions between keys.
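
Taken together, a sketch of both checks, reusing the names from the snippet above:

```cpp
if (!prime_it.is_done() && prime_it->second.IsExternal()) {
  auto [offset, len] = prime_it->second.GetExternalSlice();
  // Hash collision guard: only handle entries whose external slice
  // actually lives on the page being defragmented.
  if (offset / kBlockLen == offs_page) {
    // ... load the value and reschedule its offload ...
  }
}
```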

Contributor Author:

> Call GetExternalSlice() only after you verify entry->IsExternal().

fixed.

Contributor Author:

> You also need to verify that the offset points into this page, since we might have hash collisions between keys.

That's a great point! I added a check.

If a collision happens and the offset is not the right one, how do we keep retrieving the next entry?

Collaborator:

This should be supported by the FindByHash API. You can give it an iterator parameter: if the iterator is empty it will do the find as you do now; if it is not empty it will retrieve the next iterator that has the same hash.
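
In sketch form, resolving collisions with such an overload might look like this; the exact FindByHash signature and the OnThisPage predicate are assumptions:

```cpp
// OnThisPage is a hypothetical predicate wrapping the offset check above.
auto prime_it = db_slice_.GetDBTable(db_index)->prime.FindByHash(key_hash);
while (!prime_it.is_done() && !OnThisPage(prime_it, offs_page)) {
  // Assumed overload: passing the previous iterator returns the next
  // entry with the same hash.
  prime_it = db_slice_.GetDBTable(db_index)->prime.FindByHash(key_hash, prime_it);
}
```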

}
// remove this entry from page_refcnt_
page_refcnt_.erase(refcnt_it);
alloc_.Free(offs_page * kBlockLen, kBlockLen);
Collaborator:

How do you continue the scanning if you free a page? I don't see how this flow is handled: you erase the iterator from page_refcnt_ and then continue in the for loop, which will access the deleted iterator.

Contributor Author:

I thought page_refcnt_.erase(refcnt_it) would update the iterator? I'm curious: how come that erase function from absl doesn't return the new iterator? What is the right way to erase from absl containers?
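
For the record, absl::flat_hash_map::erase(iterator) intentionally returns void (unlike std::unordered_map::erase), so the usual pattern is to grab the successor before erasing; a sketch:

```cpp
// Erasing one element leaves iterators to the others valid, so advance
// a copy first, then erase through the old iterator.
auto next_it = std::next(refcnt_it);
page_refcnt_.erase(refcnt_it);
refcnt_it = next_it;
```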

Contributor Author:

> How do you continue the scanning if you free a page? I don't see how this flow is handled: you erase the iterator from page_refcnt_ and then continue in the for loop, which will access the deleted iterator.

I did a revision; please see if the new change makes sense.

@theyueli theyueli changed the base branch from main to v1.14-branch March 4, 2024 23:27
@theyueli theyueli changed the base branch from v1.14-branch to main March 4, 2024 23:27
float defrag_bin_util_threshold_ = 0.2;

// a queue of indices of pages that need to be defragmented.
std::queue<unsigned> pages_to_defrag_;
Collaborator:

why not use a set?

if (page_it == page_refcnt_.end())
continue;

// the page exists, now check if this is still under utilized as this page
Collaborator:

If pages_to_defrag_ were a set, you would remove from it on free and would not need to do these checks.
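
A sketch of that set-based bookkeeping, assuming the same element type as the map key:

```cpp
// Membership is the only state needed; a freed page is dropped
// immediately, so the defrag loop never sees stale page indices.
absl::flat_hash_set<uint32_t> pages_to_defrag_;

// On page free:
//   pages_to_defrag_.erase(page_index);
```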

Contributor Author:

changed to use a set

util::fb2::EventCount throttle_ec_;
TieredStats stats_;
size_t max_file_size_;
size_t allocated_size_ = 0;
bool shutdown_ = false;

float defrag_bin_util_threshold_ = 0.2;
Collaborator:

change to const

Contributor Author:

This parameter should later be configurable at runtime.

Collaborator:

If in a later PR you introduce a flag instead of this variable, you will remove it then; but for now this is a const.

@@ -163,7 +164,7 @@ class TieredStorage::InflightWriteRequest {
void Add(const PrimeKey& pk, const PrimeValue& pv);

// returns how many entries were offloaded.
- unsigned ExternalizeEntries(PerDb::BinRecord* bin_record, DbSlice* db_slice);
+ std::pair<unsigned, unsigned> ExternalizeEntries(PerDb::BinRecord* bin_record, DbSlice* db_slice);
Collaborator:

update the comment above

return kSmallBins[bin_index];
}

void TieredStorage::Defrag(DbIndex db_index) {
Collaborator:

I see this function is still called from HeartBeat. You should not loop until !pages_to_defrag_.empty(), because that can make this a very long step.
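
A sketch of bounding the per-heartbeat work instead, using the set form adopted earlier in the thread; kPagesPerStep is a hypothetical budget:

```cpp
// Drain at most kPagesPerStep pages per HeartBeat tick, leaving the rest
// for subsequent ticks.
unsigned budget = kPagesPerStep;
while (!pages_to_defrag_.empty() && budget-- > 0) {
  unsigned page = *pages_to_defrag_.begin();
  pages_to_defrag_.erase(pages_to_defrag_.begin());
  // ... defragment `page` ...
}
```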

Contributor Author:

I remember that when we discussed this last time, you mentioned this function will keep getting preempted, so it is safe to let it run from beginning to end?

Collaborator:

Only if it runs on a dedicated fiber. When I said that, it was because we discussed calling it from a different fiber that would be invoked when we have free CPU, as we do with the defrag task; that is not the case in the current implementation.

@adiholden (Collaborator):

please add unit tests to your code

Linked issue: implement page defragmentation (#2433)