
feat (tiering): implementing periodic defragmentation for second tier #2595

Closed

Conversation


@theyueli theyueli commented Feb 15, 2024

fixes #2433

This update introduces a new feature that periodically runs a defragmentation pass over the "external keys". If a key's SSD page has a bin utilization below a certain threshold, the values of all keys on that page are loaded back into memory and each is rescheduled for offloading. This consolidates the fragmented keys into a new, better-utilized block.

Todo: move the defrag logic somewhere more appropriate instead of abusing the heartbeat function.
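
In sketch form, the pass described above would look roughly like this; the helpers PageUtilization and LoadAndReschedule are hypothetical, not code from this PR:

```cpp
// Illustrative sketch of the defragmentation pass described above.
for (const auto& [page, usage] : page_refcnt_) {
  // Utilization is known from bookkeeping alone (live entries vs. the
  // page's bin capacity), so this check needs no disk read.
  float util = PageUtilization(usage);
  if (util >= defrag_bin_util_threshold_)
    continue;  // well utilized, nothing to do

  // Under-utilized page: load every surviving value back into memory and
  // reschedule its offload so the values get re-packed into a fresh block.
  LoadAndReschedule(page);
}
```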

@theyueli theyueli added the enhancement New feature or request label Feb 15, 2024

theyueli commented Feb 15, 2024

@adiholden For now, I did not scan the reference count table, as that table doesn't provide the bin size. Of course, one can always grab a hash, get the iterator, get the length, and then recalculate bin_size, but I felt it is a bit easier to obtain it directly from the PrimeIterator passed to the defrag function from db_slice.

@theyueli theyueli self-assigned this Feb 15, 2024
};

// Traverse a single segment every time this function is called.
for (int i = 0; i < 10; ++i) {
Contributor:

Why is this hardcoded to 10?

Contributor Author:

Yeah, this will be changed to a configurable parameter; this section's design may still need to change.
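
For reference, a runtime-tunable knob would typically be an Abseil flag; a minimal sketch, assuming a hypothetical flag name and keeping the current default of 10:

```cpp
#include "absl/flags/flag.h"

// Hypothetical flag; name and default are illustrative only.
ABSL_FLAG(uint32_t, tiered_defrag_segments_per_step, 10,
          "Segments traversed by one defragmentation step");

// At the call site:
//   uint32_t n = absl::GetFlag(FLAGS_tiered_defrag_segments_per_step);
//   for (uint32_t i = 0; i < n; ++i) { ... }
```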

auto prime_it = db_slice_.GetDBTable(db_index)->prime.FindByHash(key_hash);

// if the key still exists, load the key into memory and reschedule
if (!prime_it.is_done()) {
Collaborator:

What if the key still exists but is already in memory? You don't want to call load and schedule an offload for this key.

Contributor Author:

Added another check to make sure the entry is externalized.

@@ -225,7 +225,7 @@ void TieredStorage::InflightWriteRequest::Add(const PrimeKey& pk, const PrimeVal
unsigned bin_size = kSmallBins[bin_index_];
unsigned max_entries = NumEntriesInSmallBin(bin_size);

-  char* next_hash = block_start_ + entries_.size();
+  char* next_hash = block_start_ + entries_.size() * 8;
Collaborator:

please make this 8 a constant

Contributor Author:

done.

size_t bin_size = GetBinSize(len);
unsigned max_entries = NumEntriesInSmallBin(bin_size);
float bin_util = (float)(refcnt_it->second) / (float)max_entries;
float defrag_bin_util_threshold = 0.2;
Collaborator:

constexpr double kDefragBinThreshold

Contributor Author:

I made it a member variable of the tiered storage class.

return kSmallBins[bin_index];
}

void TieredStorage::Defrag(DbIndex db_index, PrimeIterator it) {
Collaborator:

Last time we spoke, I suggested iterating over page_refcnt_ in the defrag task, and if you find an under-utilized page, loading its values. Are you going to change this PR according to that suggestion?

Contributor Author:

I just committed the change to implement the above.

size_t hash_section_len = max_entries * 8;
std::vector<char> hash_section(hash_section_len);

auto ec = Read(offs_page * kBlockLen, hash_section_len, &hash_section[0]);
Collaborator:

You should do only one read of the entire page and then populate all the relevant entries.
In the current flow you issue max_entries reads, each of which reads the same data from disk and just copies into your buffer the relevant slice at the offset you give.
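
A minimal sketch of that single-read flow, reusing the names from the snippet above and assuming kHashSize stands for the 8-byte hash entry size:

```cpp
// Read the whole page once, then slice per-entry hashes out of the
// in-memory copy instead of issuing one disk read per entry.
std::vector<char> page(kBlockLen);
auto ec = Read(offs_page * kBlockLen, kBlockLen, page.data());
if (!ec) {
  for (unsigned i = 0; i < max_entries; ++i) {
    const char* entry_hash = page.data() + i * kHashSize;
    // ... process entry_hash ...
  }
}
```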

Contributor Author:

fixed.

@@ -1233,6 +1233,12 @@ void DbSlice::ScheduleForOffloadStep(DbIndex db_indx, size_t increase_goal_bytes
}
}

void DbSlice::DefragSecondTierStep(DbIndex db_indx) {
DCHECK(shard_owner()->tiered_storage());
FiberAtomicGuard guard;
Collaborator:

Why do you need to make sure we don't preempt here?

Collaborator:

You have a call to Read inside Defrag, which can preempt. How come this guard does not fail?

Contributor Author:

right... removed...

- absl::flat_hash_map<uint32_t, uint8_t> page_refcnt_;
+ // absl::flat_hash_map<uint32_t, uint8_t> page_refcnt_;
+
+ absl::flat_hash_map<uint32_t, std::pair<unsigned, unsigned> > page_refcnt_;
Collaborator:

I suggest using a struct instead of std::pair<unsigned, unsigned>; it will make it clearer what first and second mean here.
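
For example, a sketch of such a struct; the field names are guesses at what the pair's first/second hold, based on the surrounding discussion:

```cpp
struct PageInfo {
  unsigned ref_count;  // live entries referencing the page
  unsigned bin_size;   // small-bin size the page was written with
};

absl::flat_hash_map<uint32_t, PageInfo> page_refcnt_;
```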

Collaborator:

please address this comment

Contributor Author:

In my opinion this is too simple for a struct to be necessary, but I added a comment above the type; that should clarify everything for whoever wants to change it.

@@ -92,12 +97,17 @@ class TieredStorage {
struct PerDb;
std::vector<PerDb*> db_arr_;

- absl::flat_hash_map<uint32_t, uint8_t> page_refcnt_;
+ // absl::flat_hash_map<uint32_t, uint8_t> page_refcnt_;
Collaborator:

clean

Contributor Author:

fixed.

util::fb2::EventCount throttle_ec_;
TieredStats stats_;
size_t max_file_size_;
size_t allocated_size_ = 0;
bool shutdown_ = false;

unsigned int num_pages_to_scan_ = 10;
Collaborator:

static constexpr for both

Contributor Author:

I think it's better this way, so that we can later CONFIG SET them to tune the defragmentation parameters without restarting the server.

}

void TieredStorage::Defrag(DbIndex db_index) {
// start scanning from a random position in the ref_cnt_ table.
Collaborator:

why do we want to start in a random position?

Contributor Author:

Because if the first few entries never change and have no fragmentation, we would never get to collect the others.

// start scanning from a random position in the ref_cnt_ table.
auto refcnt_it = std::next(std::begin(page_refcnt_), rand() % page_refcnt_.size());

for (unsigned int i = 0; i < num_pages_to_scan_; ++i) {
Collaborator:

I don't think you need to limit the number of entries we check in the map; instead, limit the number of read operations we do.
I.e., you can go over the entire map, and if none of the pages are below the threshold you will finish very fast, since there is no work when no defrag is needed.
But you do want to limit the number of reads from disk each time this defrag function is called.
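
A sketch of that read-budget loop; kMaxDefragReads and the helpers IsUnderUtilized/DefragPage are hypothetical:

```cpp
// Scan the whole map cheaply, but cap the expensive disk reads per call.
unsigned reads_left = kMaxDefragReads;  // hypothetical budget constant
for (auto it = page_refcnt_.begin();
     it != page_refcnt_.end() && reads_left > 0; ++it) {
  if (!IsUnderUtilized(it->second))  // pure bookkeeping, no I/O
    continue;
  --reads_left;
  DefragPage(it->first);  // the only step that touches the disk
  // (If DefragPage erases map entries, the iteration needs the erase
  // idiom discussed further down in this thread.)
}
```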

Contributor Author:

That's certainly another option. It is hard to say which one is better; the defragmentation happens when the CPU is idle.

Contributor Author:

> I don't think you need to limit the number of entries we check in the map; instead, limit the number of read operations we do. I.e., you can go over the entire map, and if none of the pages are below the threshold you will finish very fast, since there is no work when no defrag is needed. But you do want to limit the number of reads from disk each time this defrag function is called.

The other question I had about limiting the number of reads: if all the pages have good utilization and do not need defragmentation, would we keep iterating forever, since the number of reads is always zero? We do not need to read any page to calculate utilization, as we already know the reference count and bin size.

// if the key still exists, load the key into memory and reschedule offload
if (!prime_it.is_done()) {
PrimeValue* entry = &prime_it->second;
auto [offset, len] = entry->GetExternalSlice();
Collaborator:

Call GetExternalSlice() only after you verify entry->IsExternal().

Collaborator:

You also need to verify that the offset points into this page, since we might have hash collisions between keys.
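
Taken together, a sketch of both checks, reusing the names from the snippet above:

```cpp
if (!prime_it.is_done() && prime_it->second.IsExternal()) {
  auto [offset, len] = prime_it->second.GetExternalSlice();
  // Hash collision guard: only handle entries whose external slice
  // actually lives on the page being defragmented.
  if (offset / kBlockLen == offs_page) {
    // ... load the value and reschedule its offload ...
  }
}
```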

Contributor Author:

> Call GetExternalSlice() only after you verify entry->IsExternal().

fixed.

Contributor Author:

> You also need to verify that the offset points into this page, since we might have hash collisions between keys.

That's a great point! I added a check.

If a collision happens and the offset is not the right one, how do we keep retrieving the next entry?

Collaborator:

This should be supported by the FindByHash API. You can give it an iterator parameter: if the iterator is empty it will do the find as you do now; if it is not empty it will retrieve the next iterator that has the same hash.
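
In sketch form, resolving collisions with such an overload might look like this; the exact FindByHash signature and the OnThisPage predicate are assumptions:

```cpp
// OnThisPage is a hypothetical predicate wrapping the offset check above.
auto prime_it = db_slice_.GetDBTable(db_index)->prime.FindByHash(key_hash);
while (!prime_it.is_done() && !OnThisPage(prime_it, offs_page)) {
  // Assumed overload: passing the previous iterator returns the next
  // entry with the same hash.
  prime_it = db_slice_.GetDBTable(db_index)->prime.FindByHash(key_hash, prime_it);
}
```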

}
// remove this entry from page_refcnt_
page_refcnt_.erase(refcnt_it);
alloc_.Free(offs_page * kBlockLen, kBlockLen);
Collaborator:

How do you continue the scanning if you free a page? I don't see how this flow is handled: you erase the iterator from page_refcnt_ and then continue in the for loop, which will access the deleted iterator.

Contributor Author:

I thought page_refcnt_.erase(refcnt_it) would update the iterator? I'm curious: how come that erase function from absl doesn't return the new iterator? What is the right way to erase from absl containers?
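
For the record, absl::flat_hash_map::erase(iterator) intentionally returns void (unlike std::unordered_map::erase), so the usual pattern is to grab the successor before erasing; a sketch:

```cpp
// Erasing one element leaves iterators to the others valid, so advance
// a copy first, then erase through the old iterator.
auto next_it = std::next(refcnt_it);
page_refcnt_.erase(refcnt_it);
refcnt_it = next_it;
```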

Contributor Author:

> How do you continue the scanning if you free a page? I don't see how this flow is handled: you erase the iterator from page_refcnt_ and then continue in the for loop, which will access the deleted iterator.

I did a revision; please see if the new change makes sense.

@theyueli theyueli changed the base branch from main to v1.14-branch March 4, 2024 23:27
@theyueli theyueli changed the base branch from v1.14-branch to main March 4, 2024 23:27
float defrag_bin_util_threshold_ = 0.2;

// a queue of indices of pages that need to be defragmented.
std::queue<unsigned> pages_to_defrag_;
Collaborator:

why not use a set?

if (page_it == page_refcnt_.end())
continue;

// the page exists, now check if this is still under utilized as this page
Collaborator:

If pages_to_defrag_ were a set, you would remove from it on free and would not need to do these checks.
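
A sketch of that set-based bookkeeping, assuming the same element type as the map key:

```cpp
// Membership is the only state needed; a freed page is dropped
// immediately, so the defrag loop never sees stale page indices.
absl::flat_hash_set<uint32_t> pages_to_defrag_;

// On page free:
//   pages_to_defrag_.erase(page_index);
```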

Contributor Author:

changed to use a set

util::fb2::EventCount throttle_ec_;
TieredStats stats_;
size_t max_file_size_;
size_t allocated_size_ = 0;
bool shutdown_ = false;

float defrag_bin_util_threshold_ = 0.2;
Collaborator:

change to const

Contributor Author:

This parameter should later be configurable at runtime.

Collaborator:

If in a later PR you introduce a flag instead of this variable, you will remove it then; but for now this is a const.

@@ -163,7 +164,7 @@ class TieredStorage::InflightWriteRequest {
void Add(const PrimeKey& pk, const PrimeValue& pv);

// returns how many entries were offloaded.
- unsigned ExternalizeEntries(PerDb::BinRecord* bin_record, DbSlice* db_slice);
+ std::pair<unsigned, unsigned> ExternalizeEntries(PerDb::BinRecord* bin_record, DbSlice* db_slice);
Collaborator:

update the comment above

return kSmallBins[bin_index];
}

void TieredStorage::Defrag(DbIndex db_index) {
Collaborator:

I see this function is still called from HeartBeat. You should not loop until !pages_to_defrag_.empty(), because that can make this a very long step.
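
A sketch of bounding the per-heartbeat work instead, using the set form adopted earlier in the thread; kPagesPerStep is a hypothetical budget:

```cpp
// Drain at most kPagesPerStep pages per HeartBeat tick, leaving the rest
// for subsequent ticks.
unsigned budget = kPagesPerStep;
while (!pages_to_defrag_.empty() && budget-- > 0) {
  unsigned page = *pages_to_defrag_.begin();
  pages_to_defrag_.erase(pages_to_defrag_.begin());
  // ... defragment `page` ...
}
```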

Contributor Author:

I remember that when we discussed this last time, you mentioned this function will keep getting preempted, so it is safe to let it run from beginning to end?

Collaborator:

Only if it runs on a dedicated fiber. When I said that, it was because we discussed calling it from a different fiber that would be invoked when we have free CPU, as we do with the defrag task; that is not the case in the current implementation.

@adiholden (Collaborator):

please add unit tests to your code

Linked issue: implement page defragmentation (#2433)