Add LRU in MasterService, complexity O(1) #287

zhaoyongke · 2025-04-23T03:28:40Z

We found that Mooncake Store discards all put() requests once it reaches its maximum capacity. This behavior is not acceptable for our online service. We solved it with an LRU (Least Recently Used) eviction algorithm.
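For readers unfamiliar with how an LRU order can be maintained in O(1), here is a minimal sketch, assuming the usual pairing of a doubly linked list with a hash map of list iterators. The class name `LruKeyTracker` and its methods are hypothetical and are not the code added by this PR; the sketch is also not thread-safe.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>

// Hypothetical helper for illustration only; not the class from this PR.
class LruKeyTracker {
public:
    // Mark `key` as most recently used, inserting it if absent. O(1) on average.
    void Touch(const std::string& key) {
        auto it = index_.find(key);
        if (it != index_.end()) {
            // splice relinks the node in place, so the stored iterator stays valid.
            order_.splice(order_.begin(), order_, it->second);
            return;
        }
        order_.push_front(key);
        index_[key] = order_.begin();
    }

    // Forget `key` (for example after an explicit remove). O(1) on average.
    void Remove(const std::string& key) {
        auto it = index_.find(key);
        if (it == index_.end()) return;
        order_.erase(it->second);
        index_.erase(it);
    }

    // Pop the least recently used key into *victim; returns false if empty.
    bool EvictOne(std::string* victim) {
        if (order_.empty()) return false;
        *victim = order_.back();
        index_.erase(*victim);
        order_.pop_back();
        assert(index_.size() == order_.size());  // list and map stay in sync
        return true;
    }

    std::size_t Size() const { return order_.size(); }

private:
    std::list<std::string> order_;  // front = most recent, back = least recent
    std::unordered_map<std::string, std::list<std::string>::iterator> index_;
};
```

Calling `Touch` from both the get and put paths keeps the back of the list pointing at the coldest key, which is what makes eviction a constant-time `pop_back`.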

xiaguan · 2025-04-23T03:59:35Z

Thank you for your contribution. Here are a few points that might warrant consideration:

First, if eviction support were implemented, the garbage collection (GC) related functionalities could potentially be removed.

Second, as reflected in the codebase, our current implementation utilizes lock sharding, wherein each object is safeguarded by the lock corresponding to its shard. Implementing a Least Recently Used (LRU) eviction strategy, however, would likely necessitate a global lock to manage the LRU queue, potentially introducing performance bottlenecks due to contention.

My proposal involves repurposing the existing GC thread into an eviction thread. When eviction becomes necessary (perhaps triggered by monitoring specific watermarks or thresholds), this thread could select a target shard, acquire the lock exclusive to that shard, and subsequently perform eviction operations solely within its confines. This approach aims to circumvent the performance limitations associated with a global lock.
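The per-shard proposal above can be sketched roughly as follows. This is only an illustration under stated assumptions: `ShardedStore`, `Shard`, `EvictionPass`, the shard count, and the object-count watermarks are all hypothetical names and choices, not Mooncake's actual types, and the victim order inside a shard is plain insertion order for brevity.

```cpp
#include <array>
#include <cstddef>
#include <functional>
#include <list>
#include <mutex>
#include <string>
#include <unordered_map>

constexpr std::size_t kNumShards = 4;  // hypothetical shard count

struct Shard {
    std::mutex mutex;                                      // guards this shard only
    std::unordered_map<std::string, std::size_t> objects;  // key -> size in bytes
    std::list<std::string> insertion_order;                // oldest key at the front
};

class ShardedStore {
public:
    void Put(const std::string& key, std::size_t bytes) {
        Shard& s = ShardFor(key);
        std::lock_guard<std::mutex> lock(s.mutex);
        if (s.objects.emplace(key, bytes).second) s.insertion_order.push_back(key);
    }

    // Evict up to `max_victims` keys from one shard while holding only its lock.
    std::size_t EvictFromShard(std::size_t shard_index, std::size_t max_victims) {
        Shard& s = shards_[shard_index % kNumShards];
        std::lock_guard<std::mutex> lock(s.mutex);
        std::size_t evicted = 0;
        while (evicted < max_victims && !s.insertion_order.empty()) {
            s.objects.erase(s.insertion_order.front());
            s.insertion_order.pop_front();
            ++evicted;
        }
        return evicted;
    }

    // Total object count, taking each shard lock in turn (never two at once).
    std::size_t Count() {
        std::size_t total = 0;
        for (Shard& s : shards_) {
            std::lock_guard<std::mutex> lock(s.mutex);
            total += s.objects.size();
        }
        return total;
    }

private:
    Shard& ShardFor(const std::string& key) {
        return shards_[std::hash<std::string>{}(key) % kNumShards];
    }
    std::array<Shard, kNumShards> shards_;
};

// One pass of a hypothetical eviction thread: do nothing below the high
// watermark; otherwise visit shards round-robin until the low watermark is met.
std::size_t EvictionPass(ShardedStore& store, std::size_t high, std::size_t low) {
    if (store.Count() <= high) return 0;
    std::size_t total = 0;
    std::size_t shard = 0;
    while (store.Count() > low) {
        total += store.EvictFromShard(shard, 1);
        shard = (shard + 1) % kNumShards;
    }
    return total;
}
```

Because the pass never holds more than one shard lock, put and get traffic on other shards is not blocked while a shard is being trimmed. The trade-off is that recency is only approximate across shards.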

xiaguan · 2025-04-23T04:05:45Z

I've conceived a potentially simpler approach that we could discuss. If we modify the garbage collection (GC) producer's operation from 'get' to 'put_end', would this effectively function as a First-In, First-Out (FIFO) eviction mechanism?

zhaoyongke · 2025-04-23T05:55:47Z

> Thank you for your contribution. Here are a few points that might warrant consideration:
>
> First, if eviction support were implemented, the garbage collection (GC) related functionalities could potentially be removed.
>
> Second, as reflected in the codebase, our current implementation utilizes lock sharding, wherein each object is safeguarded by the lock corresponding to its shard. Implementing a Least Recently Used (LRU) eviction strategy, however, would likely necessitate a global lock to manage the LRU queue, potentially introducing performance bottlenecks due to contention.
>
> My proposal involves repurposing the existing GC thread into an eviction thread. When eviction becomes necessary (perhaps triggered by monitoring specific watermarks or thresholds), this thread could select a target shard, acquire the lock exclusive to that shard, and subsequently perform eviction operations solely within its confines. This approach aims to circumvent the performance limitations associated with a global lock.

Thanks for your constructive suggestions!
We noticed the GC option in mooncake master. It simply removes an item after get(), so it does not fit single-put(), multiple-get() scenarios (such as nPmD). We disabled this option by default.

zhaoyongke · 2025-04-23T06:07:35Z

> I've conceived a potentially simpler approach that we could discuss. If we modify the garbage collection (GC) producer's operation from 'get' to 'put_end', would this effectively function as a First-In, First-Out (FIFO) eviction mechanism?

Yes, exactly. We have tested this method. In real workloads, however, FIFO is not always the best eviction policy, because clients frequently get a subset of a large prefilled dataset.
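To make the FIFO versus LRU difference concrete, here is a small simulation sketch. The function `CountHits` and the `Policy` enum are hypothetical and use a linear search for brevity; they only illustrate the policies and are not taken from the PR.

```cpp
#include <algorithm>
#include <cstddef>
#include <list>
#include <string>
#include <vector>

enum class Policy { kFifo, kLru };

// Count cache hits for an access trace under the given policy.
// Lookup is O(n) here to keep the sketch short; a real cache would index keys.
std::size_t CountHits(const std::vector<std::string>& trace,
                      std::size_t capacity, Policy policy) {
    if (capacity == 0) return 0;
    std::list<std::string> cache;  // front = next eviction victim
    std::size_t hits = 0;
    for (const std::string& key : trace) {
        auto it = std::find(cache.begin(), cache.end(), key);
        if (it != cache.end()) {
            ++hits;
            // LRU refreshes recency on a hit; FIFO leaves the order untouched.
            if (policy == Policy::kLru) cache.splice(cache.end(), cache, it);
            continue;
        }
        if (cache.size() == capacity) cache.pop_front();
        cache.push_back(key);
    }
    return hits;
}
```

With a capacity of 2 and the trace `h a h b h c h d h`, where `h` stands for a hot key, FIFO scores 2 hits and LRU scores 4, because FIFO keeps evicting `h` as soon as it becomes the oldest insertion.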

xiaguan · 2025-04-23T12:03:25Z

Can you ensure that the LRU implementation is thread-safe?

zhaoyongke · 2025-04-23T12:31:31Z

> Can you ensure that the LRU implementation is thread-safe?

Work in progress; I will fix it later.

Copilot

Pull Request Overview

This PR introduces an LRU mechanism into the MasterService to prevent discarding put() requests when the store reaches maximum capacity. Key changes include:

Initializing and clearing LRU data structures in the MasterService constructor and destructor.
Updating the Get() and PutStart() methods to incorporate LRU update and eviction logic.
Adding preprocessor definitions and corresponding LRU member variables to the header file.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File	Description
mooncake-store/src/master_service.cpp	Implements LRU update and eviction in various service methods.
mooncake-store/include/master_service.h	Adds preprocessor flags and LRU data structures.

Comments suppressed due to low confidence (2)

mooncake-store/include/master_service.h:18

[nitpick] Consider defining USE_LRU_MASTER without a string value (e.g. using '#define USE_LRU_MASTER') to better align with standard preprocessor flag practices.

#define USE_LRU_MASTER "ON"

mooncake-store/src/master_service.cpp:143

[nitpick] Consider refactoring the duplicate LRU update logic found in Get() and PutStart() into a helper function to improve code maintainability and reduce redundancy.

all_key_list_.push_front(key);

Copilot · 2025-04-27T02:04:32Z

mooncake-store/src/master_service.cpp

+    }
+    all_key_list_.push_front(key);
+    all_key_idx_map_[key] = all_key_list_.begin();
+    if(all_key_list_.size() >= LRU_MAX_CAPACITY)


Verify that the eviction condition correctly reflects the intended capacity constraint; if LRU_MAX_CAPACITY is the max allowed entries, eviction should occur only when adding a new key would exceed that capacity.

Suggested change

if(all_key_list_.size() >= LRU_MAX_CAPACITY)

if(all_key_list_.size() > LRU_MAX_CAPACITY)
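The effect of the two comparisons can be checked with a tiny sketch, assuming the insert-then-check order shown in the diff above. The function `RetainedAfterInserts` is hypothetical and exists only to count how many entries survive.

```cpp
#include <cstddef>
#include <list>

// Insert `inserts` distinct keys, checking the eviction condition after each
// insert, and return how many entries remain in the list.
std::size_t RetainedAfterInserts(std::size_t capacity, std::size_t inserts,
                                 bool strict_greater) {
    std::list<std::size_t> keys;
    for (std::size_t i = 0; i < inserts; ++i) {
        keys.push_front(i);
        const bool over = strict_greater ? keys.size() > capacity
                                         : keys.size() >= capacity;
        if (over) keys.pop_back();  // drop the oldest entry
    }
    return keys.size();
}
```

With a capacity of 3 and 10 inserts, `>=` settles at 2 retained entries while `>` settles at 3, so the strict comparison is the one that lets the list actually hold `LRU_MAX_CAPACITY` entries.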

stmatengss · 2025-04-27T02:06:31Z

Hi @zhaoyongke, thank you for your contribution. master_service_test has failed in CI: https://github.com/kvcache-ai/Mooncake/actions/runs/14609309219/job/40984273235?pr=287. Please check it.

stmatengss · 2025-04-27T03:19:35Z

mooncake-store/include/master_service.h

 #include "allocator.h"
 #include "types.h"

+#define USE_LRU_MASTER "ON"


Define it in the CMake File?

Fine, I'll fix it

stmatengss · 2025-04-27T03:21:25Z

mooncake-store/include/master_service.h


+#ifdef USE_LRU_MASTER
+    // LRU statistics
+    std::list <std::string> all_key_list_;


Use a class (e.g., class LRUList {....}) to wrap these two LRU lists.

yep, and we'd better add some basic tests for the new class.

doujiang24 · 2025-04-27T11:25:07Z

CMakeLists.txt


+if (USE_LRU_MASTER)
+  add_compile_definitions(USE_LRU_MASTER)
+  add_compile_definitions(LRU_MAX_CAPACITY=1000)


we'd better move it to command arguments, so that we can change it on demand.

yuan-luo · 2025-04-28T16:34:53Z

mooncake-store/include/master_service.h

    std::shared_ptr<BufferAllocatorManager> buffer_allocator_manager_;
    std::shared_ptr<AllocationStrategy> allocation_strategy_;

+#ifdef USE_LRU_MASTER


Instead of adding macros here and there, consider introducing an EvictStrategy class, just like AllocationStrategy. FIFO would be the default and LRU the other option. All of the eviction logic can be wrapped in this strategy class.
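One possible shape for such a strategy class is sketched below. `AddKey` and `GetSize` mirror the calls that appear later in this PR's diff; everything else (`EvictionStrategySketch`, `OnAccess`, `EvictKey`, `MakeStrategy`, and the two subclasses) is a hypothetical name chosen for illustration and may differ from what was merged.

```cpp
#include <cstddef>
#include <list>
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical interface; only AddKey/GetSize are names seen in the PR diff.
class EvictionStrategySketch {
public:
    virtual ~EvictionStrategySketch() = default;
    virtual void AddKey(const std::string& key) = 0;
    virtual void OnAccess(const std::string& key) = 0;
    virtual bool EvictKey(std::string* victim) = 0;  // false if nothing to evict
    virtual std::size_t GetSize() const = 0;
};

// Shared bookkeeping: newest key at the front, eviction victim at the back.
class ListBackedStrategy : public EvictionStrategySketch {
public:
    void AddKey(const std::string& key) override {
        if (index_.count(key) != 0) return;  // already tracked
        order_.push_front(key);
        index_[key] = order_.begin();
    }
    bool EvictKey(std::string* victim) override {
        if (order_.empty()) return false;
        *victim = order_.back();
        index_.erase(*victim);
        order_.pop_back();
        return true;
    }
    std::size_t GetSize() const override { return order_.size(); }

protected:
    std::list<std::string> order_;
    std::unordered_map<std::string, std::list<std::string>::iterator> index_;
};

// FIFO ignores accesses, so the victim is always the oldest insertion.
class FifoStrategy : public ListBackedStrategy {
public:
    void OnAccess(const std::string&) override {}
};

// LRU moves an accessed key to the front, so the victim is the coldest key.
class LruStrategy : public ListBackedStrategy {
public:
    void OnAccess(const std::string& key) override {
        auto it = index_.find(key);
        if (it != index_.end()) order_.splice(order_.begin(), order_, it->second);
    }
};

// Select a strategy by name; FIFO is the default, as suggested above.
std::unique_ptr<EvictionStrategySketch> MakeStrategy(const std::string& name) {
    if (name == "lru") return std::unique_ptr<EvictionStrategySketch>(new LruStrategy());
    return std::unique_ptr<EvictionStrategySketch>(new FifoStrategy());
}
```

The service would then hold a single strategy pointer and call it from the put and get paths, which removes the need for `#ifdef` blocks scattered through the header and source.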

We'll think about it, thanks for your advice ~

doujiang24 · 2025-05-03T09:22:09Z

CMakeLists.txt

 option(USE_REDIS "option for enable redis as metadata server" OFF)
 option(USE_HTTP "option for enable http as metadata server" ON)
+option(USE_LRU_MASTER "option for using LRU in master service" OFF)
+set(LRU_MAX_CAPACITY 1000)


we'd better configure it at run time, similar to enable_gc

Mooncake/mooncake-store/src/master.cpp

Line 14 in 14af70c

DEFINE_bool(enable_gc, false, "Enable garbage collection");

doujiang24 · 2025-05-03T09:31:21Z

mooncake-store/src/master_service.cpp

+    }
+    all_key_list_.push_front(key);
+    all_key_idx_map_[key] = all_key_list_.begin();
+    if(all_key_list_.size() >= LRU_MAX_CAPACITY)


A global LRU size limit may not be good for production; we'd better limit the size at the per-client/node level.
Each client has a limited cache size, but the number of clients can scale up or down dynamically in production.

doujiang24 · 2025-05-07T14:10:08Z

mooncake-store/src/master_service.cpp


+    LOG(INFO) << "### LRU Update in Put() ###";
+    eviction_strategy_->AddKey(key);
+    if(eviction_strategy_ -> GetSize() >= LRU_MAX_CAPACITY)


Suggested change

if(eviction_strategy_ -> GetSize() >= LRU_MAX_CAPACITY)

if(eviction_strategy_->GetSize() >= LRU_MAX_CAPACITY)

xiaguan · 2025-05-15T07:52:24Z

Looks good to me! Thanks a lot for your contribution — this is an important feature for mooncake_master.

* Fix nvmeof build issue
* Add LRU in Master Service, complexity O(1)
* Move Macros to CMake Options
* fix lru build options
* Now we can setup LRU_MAX_CAPACITY with cmake commands
* Refactoring LRU to EvictionStrategy Class
* Add tests of eviction strategy
* fix build issues
* Fix eviction strategy test issue
* Resolve conflicts with master
* resolve conflicts, second part
* Change MasterMetrics to get ratio of used storage and total capacity, if more than 80% used, trigger evict

(cherry picked from commit d085d86)
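The last commit message describes a usage-ratio trigger. A minimal sketch of that check follows, assuming byte counters for used and total capacity; `ShouldEvict` and `kEvictionThreshold` are hypothetical names, and how the merged code obtains the ratio from MasterMetrics is not shown here.

```cpp
#include <cstddef>

// 80% threshold taken from the commit message; the constant name is hypothetical.
constexpr double kEvictionThreshold = 0.8;

// Returns true when more than 80% of the total capacity is in use.
bool ShouldEvict(std::size_t used_bytes, std::size_t capacity_bytes) {
    if (capacity_bytes == 0) return false;  // nothing registered yet
    const double ratio = static_cast<double>(used_bytes) /
                         static_cast<double>(capacity_bytes);
    return ratio > kEvictionThreshold;
}
```

Compared with a fixed `LRU_MAX_CAPACITY` entry count, a ratio follows the real capacity as clients join or leave, which speaks to the earlier concern about a global size limit.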

zhaoyongke and others added 4 commits April 16, 2025 12:00

Fix nvmeof build issue

00c660a

Merge branch 'kvcache-ai:main' into main

fb71e56

Merge branch 'kvcache-ai:main' into main

16d6f17

Add LRU in Master Service, complexity O(1)

731d23e

stmatengss mentioned this pull request Apr 23, 2025

[RoadMap] Mooncake Roadmap Q1 & Q2 2025 #44

Open

45 tasks

stmatengss requested review from Copilot, stmatengss and xiaguan and removed request for xiaguan April 25, 2025 09:36

Copilot AI reviewed Apr 27, 2025

stmatengss reviewed Apr 27, 2025

zhaoyongke added 2 commits April 27, 2025 13:47

Move Macros to CMake Options

b9a348b

fix lru build options

e86054a

doujiang24 reviewed Apr 27, 2025

Now we can setup LRU_MAX_CAPACITY with cmake commands

14af70c

yuan-luo reviewed Apr 28, 2025

doujiang24 reviewed May 3, 2025

zhaoyongke added 4 commits May 7, 2025 10:26

Refactoring LRU to EvictionStrategy Class

8492d1f

Add tests of eviction strategy

7d2f451

fix build issues

ae3d416

Fix eviction strategy test issue

4c14565

JasonZhang517 mentioned this pull request May 7, 2025

[RFC]Offloading KVCache to SSD with 3FS #333

Open

zhaoyongke added 3 commits May 7, 2025 18:19

Resolve conflicts with master

eb6bc18

resolve conflicts, second part

771961c

Merge branch 'main' into main

0fd0288

doujiang24 reviewed May 7, 2025

zhaoyongke and others added 2 commits May 12, 2025 10:49

Change MasterMetrics to get ratio of used storage and total capacity, if more than 80% used, trigger evict

e75dfd7

Merge branch 'kvcache-ai:main' into main

9db0ce6

xiaguan reviewed May 15, 2025

mooncake-common/common.cmake

mooncake-store/src/master_service.cpp

mooncake-store/tests/eviction_strategy_test.cpp

xiaguan merged commit d085d86 into kvcache-ai:main May 15, 2025
26 checks passed

stmatengss mentioned this pull request May 20, 2025

[RoadMap] Mooncake Store V2 #378

Open

29 tasks
