
Add remove adapters for LLMpipeline #1852

Open · wants to merge 23 commits into master

Conversation

wenyi5608 (Contributor)

No description provided.

@github-actions bot added labels on Mar 6, 2025: category: continuous batching, category: LLM (LLM pipeline: stateful, static), category: speculative decoding, category: LoRA (low rank adapters), category: GenAI C++ API (changes in GenAI C++ public headers), no-match-files, category: prompt lookup
@ilya-lavrenov ilya-lavrenov requested review from Wovchena and removed request for Wovchena March 6, 2025 07:19
@@ -150,10 +150,11 @@ void ContinuousBatchingPipeline::ContinuousBatchingImpl::initialize_pipeline(
const ov::AnyMap& properties,
const std::vector<KVHeadConfig>& kv_cache_config) {
// apply LoRA
auto filtered_properties = extract_adapters_from_properties(properties, &m_generation_config.adapters);
if (m_generation_config.adapters) {
m_generation_config.adapters->set_tensor_name_prefix("base_model.model.model.");
Contributor:

why have you removed storing adapters in m_generation_config?

Contributor Author:

If we used m_generation_config.adapters to record the adapters, then after pipe.generate returns the adapters would still be recorded in m_generation_config.adapters, and using ~Adapter() to release the buffer would not work. In addition, m_generation_config.adapters is not used.

Contributor:

> m_generation_config.adapters is not used

Users can get the adapters that were set in the ctor and change them later:

Pipeline pipe(path, adapter_config);
// a lot of code, etc
...
// later on
pipe.generate(prompt1);
auto new_adapter_config = *pipe.get_generation_config().adapters;
new_adapter_config.remove_adapter(..); // any modifications here
pipe.generate(prompt2, adapters(new_adapter_config)); // pass updated config

Contributor Author:

Yes, the user can use "new_adapter_config.remove_adapter(..)" to remove adapters from m_generation_config, so this part of the code will not be modified.

}

for (const auto& adapter2 : adapters2) {
adapter2.~Adapter();
Contributor:

why do we need to call an explicit dtor?

Contributor Author:

Do you have any suggestions?

Contributor:

why doesn't the automatic dtor call work here?

Contributor Author:

The adapter objects are allocated by the user; generation_config and m_adapter_controller only reference them. Should the adapter objects be deallocated by the user themselves?

Contributor:

Sure, adapters hold weights and users should understand that: if they hold an adapter, they hold its memory.

We should not implicitly destruct user-owned objects.
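To make the objection concrete, a minimal sketch of the pattern the reviewer is pointing at. It assumes ov::genai::Adapter is an ordinary value type whose destructor releases the loaded weights once the last copy is gone; the header name and file name are placeholders, and adapters2 stands in for the user-provided adapter list from the diff above.

#include <vector>
#include "openvino/genai/lora_adapter.hpp"   // header name assumed

int main() {
    // Stand-in for the user-provided adapter list; the file name is a placeholder.
    std::vector<ov::genai::Adapter> adapters2 = { ov::genai::Adapter("lora_a.safetensors") };

    // What the patch did: explicitly ending the lifetime of objects the container
    // (and possibly the user) still owns means they are destroyed a second time
    // automatically -- undefined behavior.
    // for (const auto& adapter2 : adapters2) {
    //     adapter2.~Adapter();
    // }

    // What the reviewer asks for instead: simply drop the references and let the
    // normal destructors run; the adapter weights are released once the last
    // owner (pipeline config, controller, or user code) lets go of them.
    adapters2.clear();
    return 0;
}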

Contributor Author:

I have deleted this dtor.

@ilya-lavrenov ilya-lavrenov requested a review from iefode March 6, 2025 07:31
@slyalin (Collaborator) left a comment:

What is your use case that demands remove_adapters functionality?

Why not call apply with the adapter removed from the config? If that works for you, you can propagate apply to the upper-level API, as we have already done in the form of the set_adapters method implemented for part of the pipelines.

Can we just wait for the next generate call, at which point unused adapters can be deallocated automatically, provided you don't use them in the generation config and the adapter objects are not kept anywhere else?

Will using mmap for adapter content be helpful in your use case and eliminate the need for remove_adapters?
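For context, a rough sketch of the alternative being suggested here: modify the AdapterConfig and re-apply it instead of adding a dedicated remove_adapters entry point. The remove call mirrors the current_config.remove(adapter1) line quoted below; whether a given pipeline exposes set_adapters (per the comment, only part of the pipelines do), the exact constructor overloads, and the header and file names are assumptions.

#include "openvino/genai/llm_pipeline.hpp"   // header names assumed
#include "openvino/genai/lora_adapter.hpp"

int main() {
    ov::genai::Adapter lora_a("lora_a.safetensors");   // placeholder files
    ov::genai::Adapter lora_b("lora_b.safetensors");
    ov::genai::AdapterConfig config({lora_a, lora_b});

    ov::genai::LLMPipeline pipe("model_dir", "CPU", ov::genai::adapters(config));

    // Instead of a dedicated remove_adapters() API: drop LoRA_A from the config
    // and pass the updated config with the next generate call ...
    config.remove(lora_a);
    pipe.generate("prompt", ov::genai::adapters(config));

    // ... or, on pipelines that expose it, apply the updated config eagerly:
    // pipe.set_adapters(config);
    return 0;
}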

Comment on lines +1079 to +1080
current_config.remove(adapter1);
need_full_apply = true;
Collaborator:

Tensors from the removed adapter may still be used in the model states. This applies to cases where only one adapter is attached to the model and no precision change happens before setting the state. In that case, the tensor passed to set_state is created on top of the shared memory loaded in the Adapter object and, depending on the plugin implementation, it may or may not be read from that original location. The dependency is removed from the model state when you call the apply method. This is a theory, I haven't checked it. The CPU guys said they copy each time, and if they are copying, that is another issue: even if you deallocate the memory for the adapter object, it is still allocated inside the plugin as part of the state.

I think the same situation is possible inside apply if there are no other references to the adapter object being removed. But there it is localized inside apply and cannot interfere with other code that might run inference on the model and segfault because the memory under the state has been deallocated.

@wenyi5608 (Contributor Author), Mar 7, 2025:

> What is your use case that demands remove_adapters functionality?

For example, a customer has four LoRA adapters, namely LoRA_A, LoRA_B, LoRA_C, and LoRA_D. They don't want to load all four at the same time. When they call pipe.generate with LoRA_A, only the LoRA_A adapter is loaded; when they switch to pipe.generate with LoRA_B, they load LoRA_B and unload LoRA_A to save memory.
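A sketch of how that switching pattern would look from user code, assuming the usual adapters(...) property route; the model path, device, and file names are placeholders, and whether LoRA_A's memory is actually released when its handle goes out of scope is exactly what this PR is trying to guarantee.

#include "openvino/genai/llm_pipeline.hpp"   // header names assumed
#include "openvino/genai/lora_adapter.hpp"

int main() {
    ov::genai::LLMPipeline pipe("model_dir", "GPU");   // placeholder model path

    {
        ov::genai::Adapter lora_a("lora_a.safetensors");
        pipe.generate("task A prompt",
                      ov::genai::adapters(ov::genai::AdapterConfig(lora_a)));
    }   // the user-side handle to LoRA_A goes away here

    {
        ov::genai::Adapter lora_b("lora_b.safetensors");
        // Intended behavior for this PR: switching to LoRA_B should also let the
        // pipeline drop whatever it still holds for LoRA_A.
        pipe.generate("task B prompt",
                      ov::genai::adapters(ov::genai::AdapterConfig(lora_b)));
    }
    return 0;
}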

@wenyi5608 (Contributor Author), Mar 7, 2025:

> Why not call apply with the adapter removed from the config? If that works for you, you can propagate apply to the upper-level API, as we have already done in the form of the set_adapters method implemented for part of the pipelines.

The form of remove_adapters is now the same as the form of the set_adapters method implemented for part of the pipelines.

@wenyi5608 (Contributor Author), Mar 7, 2025:

> Can we just wait for the next generate call, at which point unused adapters can be deallocated automatically, provided you don't use them in the generation config and the adapter objects are not kept anywhere else?

The adapter objects are allocated by the user; generation_config and m_adapter_controller only reference them. Should the adapter objects be deallocated by the user themselves?

@wenyi5608 (Contributor Author), Mar 11, 2025:

> Tensors from the removed adapter may still be used in the model states. This applies to cases where only one adapter is attached to the model and no precision change happens before setting the state. In that case, the tensor passed to set_state is created on top of the shared memory loaded in the Adapter object and, depending on the plugin implementation, it may or may not be read from that original location. The dependency is removed from the model state when you call the apply method. This is a theory, I haven't checked it. The CPU guys said they copy each time, and if they are copying, that is another issue: even if you deallocate the memory for the adapter object, it is still allocated inside the plugin as part of the state.

When the user calls the apply method, the adapter tensors are copied into the model state. If the user keeps generating with the same LoRA adapter (LoRA_A), the adapter object (LoRA_A) is not used again and the model state does not change. Case one: when the user switches to another LoRA adapter (LoRA_B), the LoRA_B tensors are copied into the model state and the LoRA_A object is no longer used. Case two: when the user switches to an empty LoRA, the dynamic dimension of the model state is set to 0; when they then switch back to LoRA_A, the LoRA_A tensors are copied into the model state again.

Collaborator:

> What is your use case that demands remove_adapters functionality?

> For example, a customer has four LoRA adapters, namely LoRA_A, LoRA_B, LoRA_C, and LoRA_D. They don't want to load all four at the same time. When they call pipe.generate with LoRA_A, only the LoRA_A adapter is loaded; when they switch to pipe.generate with LoRA_B, they load LoRA_B and unload LoRA_A to save memory.

This is exactly what mmap'ed LoRA adapters will give you: adapter file content is loaded and unloaded automatically depending on system memory availability, like the OS filesystem cache, the same way the OpenVINO IR bin file is loaded. The first PR that introduces this functionality is openvinotoolkit/openvino#29175. There will be another one in GenAI that allocates safetensors with that mmapped functionality.

The question is about the period between the two generates, one with LoRA_A and the next one with LoRA_B: do you want to minimize the memory consumed by LoRA_A in this period, or does it not matter? If yes, you need an API like set_adapters that will allow you to (theoretically) deallocate plugin memory for Adapter states.
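If that window does matter, a hedged sketch of the "theoretical" option described here: apply an adapter configuration without LoRA_A right after the first generate, so the plugin-side LoRA state can shrink before the next call. set_adapters on LLMPipeline, the memory effect inside the plugin, and all names and file paths are assumptions, as the comment itself cautions.

#include "openvino/genai/llm_pipeline.hpp"   // header names assumed
#include "openvino/genai/lora_adapter.hpp"

int main() {
    ov::genai::Adapter lora_a("lora_a.safetensors");   // placeholder files
    ov::genai::Adapter lora_b("lora_b.safetensors");
    ov::genai::LLMPipeline pipe("model_dir", "CPU",
                                ov::genai::adapters(ov::genai::AdapterConfig(lora_a)));

    pipe.generate("prompt with LoRA_A");

    // Between the two generates: re-apply an adapter set without LoRA_A so the
    // LoRA state variables can shrink to zero. Whether the plugin actually frees
    // that memory here is plugin-dependent -- the "theoretically" above.
    // pipe.set_adapters(ov::genai::AdapterConfig());   // if this pipeline exposes it

    pipe.generate("prompt with LoRA_B",
                  ov::genai::adapters(ov::genai::AdapterConfig(lora_b)));
    return 0;
}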

Collaborator:

> Can we just wait for the next generate call, at which point unused adapters can be deallocated automatically, provided you don't use them in the generation config and the adapter objects are not kept anywhere else?

> The adapter objects are allocated by the user; generation_config and m_adapter_controller only reference them. Should the adapter objects be deallocated by the user themselves?

The user side should remove any references to the adapter; I believe it is the same for your PR. If the user cares about the memory allocated by the adapters, then yes, they have to explicitly get rid of all adapter references, including the adapter object on the user side. Again, once the mmap functionality is merged, this memory does not reduce the amount of memory available for other needs in the system and can be counted as free memory.
