
Add remove adapters for LLMpipeline #1852

Open · wants to merge 23 commits into master

Conversation

wenyi5608 (Contributor)

No description provided.

@github-actions bot added labels on Mar 6, 2025: category: continuous batching, category: LLM (LLM pipeline: stateful, static), category: speculative decoding, category: LoRA (low rank adapters), category: GenAI C++ API (changes in GenAI C++ public headers), no-match-files, category: prompt lookup
@ilya-lavrenov ilya-lavrenov requested review from Wovchena and removed request for Wovchena March 6, 2025 07:19
@@ -150,10 +150,11 @@ void ContinuousBatchingPipeline::ContinuousBatchingImpl::initialize_pipeline(
const ov::AnyMap& properties,
const std::vector<KVHeadConfig>& kv_cache_config) {
// apply LoRA
auto filtered_properties = extract_adapters_from_properties(properties, &m_generation_config.adapters);
if (m_generation_config.adapters) {
m_generation_config.adapters->set_tensor_name_prefix("base_model.model.model.");
Contributor:

why have you removed storing adapters in m_generation_config?

Contributor Author:

If we used m_generation_config.adapters to record the adapters, then after pipe.generate returns the adapters would still be recorded in m_generation_config.adapters, and using ~Adapter() to release the buffer would not work. In addition, m_generation_config.adapters is not used.

Contributor:

> m_generation_config.adapters is not used

Users can get the adapters that were set in the ctor and change them later:

Pipeline pipe(path, adapter_config);
// a lot of code, etc
...
// later on
pipe.generate(prompt1);
auto new_adapter_config = *pipe.get_generation_config().adapters;
new_adapter_config.remove_adapter(..); // any modifications here
pipe.generate(prompt2, adapters(new_adapter_config)); // pass updated config

Contributor Author:

Yes, the user can use "new_adapter_config.remove_adapter(..)" to remove adapters from m_generation_config, so this part of the code will not be modified.

}

for (const auto& adapter2 : adapters2) {
adapter2.~Adapter();
Contributor:

why do we need to call an explicit dtor?

Contributor Author:

Do you have any suggestions?

Contributor:

why doesn't the automatic dtor call work here?

Contributor Author:

The adapter objects are allocated by the user; generation_config and m_adapter_controller only reference them. Should the adapter objects be deallocated by the user themselves?

Contributor:

Sure, adapters hold weights and users should understand that: if they hold an adapter, they hold its memory.

We should not implicitly destruct user-owned objects.
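To make the objection concrete, a minimal sketch of the pattern the reviewer is pointing at. It assumes ov::genai::Adapter is an ordinary value type whose destructor releases the loaded weights once the last copy is gone; the header name and file name are placeholders, and adapters2 stands in for the user-provided adapter list from the diff above.

#include <vector>
#include "openvino/genai/lora_adapter.hpp"   // header name assumed

int main() {
    // Stand-in for the user-provided adapter list; the file name is a placeholder.
    std::vector<ov::genai::Adapter> adapters2 = { ov::genai::Adapter("lora_a.safetensors") };

    // What the patch did: explicitly ending the lifetime of objects the container
    // (and possibly the user) still owns means they are destroyed a second time
    // automatically -- undefined behavior.
    // for (const auto& adapter2 : adapters2) {
    //     adapter2.~Adapter();
    // }

    // What the reviewer asks for instead: simply drop the references and let the
    // normal destructors run; the adapter weights are released once the last
    // owner (pipeline config, controller, or user code) lets go of them.
    adapters2.clear();
    return 0;
}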

Contributor Author:

I have deleted this dtor.

@ilya-lavrenov ilya-lavrenov requested a review from iefode March 6, 2025 07:31
@slyalin (Collaborator) left a comment:

What is your use case that demands remove_adapters functionality?

Why not call apply with the adapter removed from the config? If that works for you, you can propagate apply to the upper-level API, as we have already done in the form of the set_adapters method implemented for part of the pipelines.

Can we just wait for the next generate call, at which point unused adapters can be deallocated automatically, provided you don't use them in the generation config and the adapter objects are not kept anywhere else?

Will using mmap for adapter content be helpful in your use case and eliminate the need for remove_adapters?
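For context, a rough sketch of the alternative being suggested here: modify the AdapterConfig and re-apply it instead of adding a dedicated remove_adapters entry point. The remove call mirrors the current_config.remove(adapter1) line quoted below; whether a given pipeline exposes set_adapters (per the comment, only part of the pipelines do), the exact constructor overloads, and the header and file names are assumptions.

#include "openvino/genai/llm_pipeline.hpp"   // header names assumed
#include "openvino/genai/lora_adapter.hpp"

int main() {
    ov::genai::Adapter lora_a("lora_a.safetensors");   // placeholder files
    ov::genai::Adapter lora_b("lora_b.safetensors");
    ov::genai::AdapterConfig config({lora_a, lora_b});

    ov::genai::LLMPipeline pipe("model_dir", "CPU", ov::genai::adapters(config));

    // Instead of a dedicated remove_adapters() API: drop LoRA_A from the config
    // and pass the updated config with the next generate call ...
    config.remove(lora_a);
    pipe.generate("prompt", ov::genai::adapters(config));

    // ... or, on pipelines that expose it, apply the updated config eagerly:
    // pipe.set_adapters(config);
    return 0;
}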

Comment on lines +1079 to +1080
current_config.remove(adapter1);
need_full_apply = true;
Collaborator:

Tensors from the removed adapter may still be used in the model states. This applies to cases where only one adapter is attached to the model and no precision change happens before setting the state. In that case, the tensor passed to set_state is created on top of the shared memory loaded in the Adapter object and, depending on the plugin implementation, it may or may not be read from that original location. The dependency is removed from the model state when you call the apply method. This is a theory, I haven't checked it. The CPU guys said they copy each time, and if they are copying, that is another issue: even if you deallocate the memory for the adapter object, it is still allocated inside the plugin as part of the state.

I think the same situation is possible inside apply if there are no other references to the adapter object being removed. But there it is localized inside apply and cannot interfere with other code that might run inference on the model and segfault because the memory under the state has been deallocated.

@wenyi5608 (Contributor Author), Mar 7, 2025:

> What is your use case that demands remove_adapters functionality?

For example, a customer has four LoRA adapters, namely LoRA_A, LoRA_B, LoRA_C, and LoRA_D. They don't want to load all four at the same time. When they call pipe.generate with LoRA_A, only the LoRA_A adapter is loaded; when they switch to pipe.generate with LoRA_B, they load LoRA_B and unload LoRA_A to save memory.
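A sketch of how that switching pattern would look from user code, assuming the usual adapters(...) property route; the model path, device, and file names are placeholders, and whether LoRA_A's memory is actually released when its handle goes out of scope is exactly what this PR is trying to guarantee.

#include "openvino/genai/llm_pipeline.hpp"   // header names assumed
#include "openvino/genai/lora_adapter.hpp"

int main() {
    ov::genai::LLMPipeline pipe("model_dir", "GPU");   // placeholder model path

    {
        ov::genai::Adapter lora_a("lora_a.safetensors");
        pipe.generate("task A prompt",
                      ov::genai::adapters(ov::genai::AdapterConfig(lora_a)));
    }   // the user-side handle to LoRA_A goes away here

    {
        ov::genai::Adapter lora_b("lora_b.safetensors");
        // Intended behavior for this PR: switching to LoRA_B should also let the
        // pipeline drop whatever it still holds for LoRA_A.
        pipe.generate("task B prompt",
                      ov::genai::adapters(ov::genai::AdapterConfig(lora_b)));
    }
    return 0;
}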

@wenyi5608 (Contributor Author), Mar 7, 2025:

> Why not call apply with the adapter removed from the config? If that works for you, you can propagate apply to the upper-level API, as we have already done in the form of the set_adapters method implemented for part of the pipelines.

The form of remove_adapters is now the same as the form of the set_adapters method implemented for part of the pipelines.

@wenyi5608 (Contributor Author), Mar 7, 2025:

> Can we just wait for the next generate call, at which point unused adapters can be deallocated automatically, provided you don't use them in the generation config and the adapter objects are not kept anywhere else?

The adapter objects are allocated by the user; generation_config and m_adapter_controller only reference them. Should the adapter objects be deallocated by the user themselves?

@wenyi5608 (Contributor Author), Mar 11, 2025:

> Tensors from the removed adapter may still be used in the model states. This applies to cases where only one adapter is attached to the model and no precision change happens before setting the state. In that case, the tensor passed to set_state is created on top of the shared memory loaded in the Adapter object and, depending on the plugin implementation, it may or may not be read from that original location. The dependency is removed from the model state when you call the apply method. This is a theory, I haven't checked it. The CPU guys said they copy each time, and if they are copying, that is another issue: even if you deallocate the memory for the adapter object, it is still allocated inside the plugin as part of the state.

When the user calls the apply method, the adapter tensors are copied into the model state. If the user keeps generating with the same LoRA adapter (LoRA_A), the adapter object (LoRA_A) is not used again and the model state does not change. Case one: when the user switches to another LoRA adapter (LoRA_B), the LoRA_B tensors are copied into the model state and the LoRA_A object is no longer used. Case two: when the user switches to an empty LoRA, the dynamic dimension of the model state is set to 0; when they then switch back to LoRA_A, the LoRA_A tensors are copied into the model state again.

Collaborator:

> What is your use case that demands remove_adapters functionality?

> For example, a customer has four LoRA adapters, namely LoRA_A, LoRA_B, LoRA_C, and LoRA_D. They don't want to load all four at the same time. When they call pipe.generate with LoRA_A, only the LoRA_A adapter is loaded; when they switch to pipe.generate with LoRA_B, they load LoRA_B and unload LoRA_A to save memory.

This is exactly what mmap'ed LoRA adapters will give you: adapter file content is loaded and unloaded automatically depending on system memory availability, like the OS filesystem cache, the same way the OpenVINO IR bin file is loaded. The first PR that introduces this functionality is openvinotoolkit/openvino#29175. There will be another one in GenAI that allocates safetensors with that mmapped functionality.

The question is about the period between the two generates, one with LoRA_A and the next one with LoRA_B: do you want to minimize the memory consumed by LoRA_A in this period, or does it not matter? If yes, you need an API like set_adapters that will allow you to (theoretically) deallocate plugin memory for Adapter states.
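If that window does matter, a hedged sketch of the "theoretical" option described here: apply an adapter configuration without LoRA_A right after the first generate, so the plugin-side LoRA state can shrink before the next call. set_adapters on LLMPipeline, the memory effect inside the plugin, and all names and file paths are assumptions, as the comment itself cautions.

#include "openvino/genai/llm_pipeline.hpp"   // header names assumed
#include "openvino/genai/lora_adapter.hpp"

int main() {
    ov::genai::Adapter lora_a("lora_a.safetensors");   // placeholder files
    ov::genai::Adapter lora_b("lora_b.safetensors");
    ov::genai::LLMPipeline pipe("model_dir", "CPU",
                                ov::genai::adapters(ov::genai::AdapterConfig(lora_a)));

    pipe.generate("prompt with LoRA_A");

    // Between the two generates: re-apply an adapter set without LoRA_A so the
    // LoRA state variables can shrink to zero. Whether the plugin actually frees
    // that memory here is plugin-dependent -- the "theoretically" above.
    // pipe.set_adapters(ov::genai::AdapterConfig());   // if this pipeline exposes it

    pipe.generate("prompt with LoRA_B",
                  ov::genai::adapters(ov::genai::AdapterConfig(lora_b)));
    return 0;
}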

Collaborator:

> Can we just wait for the next generate call, at which point unused adapters can be deallocated automatically, provided you don't use them in the generation config and the adapter objects are not kept anywhere else?

> The adapter objects are allocated by the user; generation_config and m_adapter_controller only reference them. Should the adapter objects be deallocated by the user themselves?

The user side should remove any references to the adapter; I believe it is the same for your PR. If the user cares about the memory allocated by the adapters, then yes, they have to explicitly get rid of all adapter references, including the adapter object on the user side. Again, once the mmap functionality is merged, this memory does not reduce the amount of memory available for other needs in the system and can be counted as free memory.
