Add remove adapters for LLMpipeline #1852
base: master
Conversation
@@ -150,10 +150,11 @@ void ContinuousBatchingPipeline::ContinuousBatchingImpl::initialize_pipeline(
    const ov::AnyMap& properties,
    const std::vector<KVHeadConfig>& kv_cache_config) {
    // apply LoRA
    auto filtered_properties = extract_adapters_from_properties(properties, &m_generation_config.adapters);
    if (m_generation_config.adapters) {
        m_generation_config.adapters->set_tensor_name_prefix("base_model.model.model.");
Why have you removed storing adapters in m_generation_config?
If we keep recording the adapters in m_generation_config.adapters, then after pipe.generate returns the adapter is still referenced by m_generation_config.adapters, so calling ~Adapter() to release the buffer does not work. In addition, m_generation_config.adapters is not used anywhere else.
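For illustration, a minimal standalone sketch of the ownership issue being described (hypothetical MyAdapter/Weights types, not the actual GenAI classes, assuming Adapter keeps its weights behind a shared handle):

#include <memory>
#include <optional>
#include <vector>

// Hypothetical stand-ins: the assumption is that the real Adapter keeps its state
// behind a shared handle, so copies share ownership of the loaded weights.
struct Weights { std::vector<float> data; };

struct MyAdapter {
    std::shared_ptr<Weights> weights = std::make_shared<Weights>();
};

struct MyGenerationConfig {
    std::optional<MyAdapter> adapters;   // stores a copy -> shares ownership of the weights
};

int main() {
    MyGenerationConfig config;
    {
        MyAdapter user_adapter;              // user-side object, weights allocated here
        config.adapters = user_adapter;      // pipeline records a copy in its config
    }                                        // user_adapter is destroyed here...
    // ...but the weights stay alive because config.adapters still holds a shared
    // reference; destroying the user object (explicitly or not) cannot free them.
    config.adapters.reset();                 // only dropping every reference releases the buffer
}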
m_generation_config.adapters is not used
Users can get the adapters that were set via the ctor and change them later:
Pipeline pipe(path, adapter_config);
// a lot of code, etc.
...
// later on
pipe.generate(prompt1);
auto new_adapter_config = *pipe.get_generation_config().adapters;
new_adapter_config.remove_adapter(..); // any modifications here
pipe.generate(prompt2, adapters(new_adapter_config)); // pass the updated config
Yes, users can call new_adapter_config.remove_adapter(..) to remove adapters from m_generation_config, so I will leave this part of the code unmodified.
src/cpp/src/lora_adapter.cpp (Outdated)
}

for (const auto& adapter2 : adapters2) {
    adapter2.~Adapter();
Why do we need to call the dtor explicitly?
Do you have any suggestions?
Why doesn't the automatic dtor call work here?
The adapter objects are allocated by the user; generation_config and m_adapter_controller only hold references to them. Should the adapter objects be deallocated by the user themselves?
Sure, adapters hold weights and users should understand that: if they hold an adapter, they hold memory.
We should not implicitly destruct user objects.
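As a side note, a generic C++ sketch (not GenAI-specific) of why an explicit adapter2.~Adapter() call is dangerous: the owner of the object runs the destructor again later, which is undefined behaviour, so the correct fix is to let every owner simply drop its reference.

#include <memory>
#include <vector>

struct Holder {
    std::shared_ptr<int> payload = std::make_shared<int>(42);   // shared resource, like adapter weights
};

int main() {
    std::vector<Holder> holders(1);
    for (const auto& h : holders) {
        // Explicitly destroying an object the vector still owns would make the vector
        // destroy it a second time when it is itself destroyed, releasing the
        // shared_ptr control block twice: undefined behaviour.
        // h.~Holder();   // do not do this
        (void)h;
    }
    // The safe way: owners drop their references, destructors run exactly once.
    holders.clear();
}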
I have deleted this dtor.
What is your use case that demands the remove_adapters functionality?
Why not call apply with the adapter removed from the config? If it works for you, you can propagate apply to the upper-level API, like we have already done in the form of the set_adapters method implemented for part of the pipelines.
Can we just wait for the next generate call, when unused adapters can be deallocated automatically if you don't use them in the generation config and the adapter objects are not kept anywhere else?
Will using mmap for the adapter content be helpful in your use case and eliminate the need for remove_adapters?
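For reference, a rough sketch of the suggested alternative, based only on names used in this thread (the LLMPipeline ctor arguments, AdapterConfig::remove vs. remove_adapter, and the adapters() property are taken from the snippets above; exact signatures may differ):

// Sketch: drop the adapter from a copy of the config and pass the updated config to the
// next generate() call (or to set_adapters() where a pipeline implements it), instead of
// a dedicated remove_adapters(). Signatures are illustrative, not authoritative.
#include "openvino/genai/llm_pipeline.hpp"

int main() {
    ov::genai::Adapter lora_a("lora_a.safetensors");
    ov::genai::Adapter lora_b("lora_b.safetensors");
    ov::genai::AdapterConfig adapter_config({lora_a, lora_b});

    ov::genai::LLMPipeline pipe("model_dir", "CPU", ov::genai::adapters(adapter_config));
    pipe.generate("prompt 1");                                    // both adapters registered

    auto new_config = *pipe.get_generation_config().adapters;     // copy the current config
    new_config.remove(lora_a);                                     // or remove_adapter(..), as in the snippet above
    pipe.generate("prompt 2", ov::genai::adapters(new_config));   // LoRA_A no longer applied
}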
current_config.remove(adapter1);
need_full_apply = true;
Tensors from the removed adapter may still be used in the model states. This is applicable for cases when only one adapter is attached to the model and no precision change is applied before setting the state. In this case, a tensor set into set_state is created on top of the shared memory loaded in the Adapter object, and, depending on the plugin implementation, it may or may not be used from the original destination. The dependency will be removed from the model state when you call the apply method. This is a theory, I haven't checked it. The CPU guys said they are copying each time. And if they are copying, this is another issue: even if you deallocate memory for the adapter object, it is still allocated inside the plugin as a part of the state.
I think the same situation is possible inside apply if there are no other links to the adapter object that is being removed. But it is localized inside apply and cannot interfere with other code that may infer the model and lead to a segfault because the memory under the state is deallocated.
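To make the lifetime concern concrete, a generic OpenVINO sketch (placeholder model and buffer, not the actual LoRA code) of a state tensor created on top of externally owned memory; whether the plugin copies or aliases that memory is exactly the open question here:

#include <openvino/openvino.hpp>
#include <vector>

int main() {
    ov::Core core;
    auto model = core.read_model("stateful_model.xml");        // placeholder stateful model
    auto compiled = core.compile_model(model, "CPU");
    ov::InferRequest request = compiled.create_infer_request();

    // Externally owned buffer (in the LoRA case: the adapter's loaded weights).
    std::vector<float> adapter_buffer(1024, 0.0f);

    // Tensor created on top of the external memory: no copy happens at this point.
    ov::Tensor state_tensor(ov::element::f32, {1, 1024}, adapter_buffer.data());

    for (auto&& state : request.query_state()) {
        state.set_state(state_tensor);   // whether the plugin copies or aliases is implementation-defined
    }

    // If the plugin aliases the memory, adapter_buffer must outlive the state,
    // i.e. until a later set_state()/apply() replaces it or the request is destroyed.
}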
What is your use case that demands remove_adapters functionality?
For example, a customer has four LoRA adapters: LoRA_A, LoRA_B, LoRA_C and LoRA_D. They don't want to load all four adapters at the same time. When they run pipe.generate with LoRA_A, only the LoRA_A adapter should be loaded; when they switch to pipe.generate with LoRA_B, they load LoRA_B and unload LoRA_A to save memory.
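A hedged sketch of that switching pattern with the names used in this thread (file names and signatures are illustrative; as far as I understand the current API, adapters are registered at pipeline construction and then selected per generate() call, which is exactly why the customer wants a way to unload the unused ones):

#include "openvino/genai/llm_pipeline.hpp"

int main() {
    // Both adapter files are read into memory here: this is the cost the customer wants to avoid.
    ov::genai::Adapter lora_a("lora_a.safetensors");
    ov::genai::Adapter lora_b("lora_b.safetensors");

    // Adapters are registered at construction so the model can be prepared for LoRA.
    ov::genai::LLMPipeline pipe("model_dir", "CPU",
                                ov::genai::adapters(ov::genai::AdapterConfig({lora_a, lora_b})));

    // Select one adapter per call.
    pipe.generate("prompt for A", ov::genai::adapters(ov::genai::AdapterConfig(lora_a)));
    pipe.generate("prompt for B", ov::genai::adapters(ov::genai::AdapterConfig(lora_b)));

    // Between the two calls the LoRA_A tensors may still be held by the controller and the
    // model states, so truly unloading LoRA_A needs more than dropping the lora_a object.
}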
Why not call apply with adapter removed from the config? If it works for you, you can propagate apply to the upper level API like we have already done in the form of set_adapters method implemented for part of the pipelines.
The form of remove_adapters now matches the form of the set_adapters method implemented for part of the pipelines.
Can we just wait for the next generate call when unused adapters can be deallocated automatically if you don't use them in generation config and the adapter objects are not kept anywhere else?
The adapter objects are allocated by the user; generation_config and m_adapter_controller only hold references to them. Should the adapter objects be deallocated by the user themselves?
Tensors from the removed adapter may still be used in the model states. This is applicable for cases when only one adapter is attached to the model and no precision change is applied before setting the state. In this case, a tensor set into set_state is created on top of the shared memory loaded in the Adapter object, and, depending on the plugin implementation, it may or may not be used from the original destination. The dependency will be removed from the model state when you call the apply method. This is a theory, I haven't checked it. The CPU guys said they are copying each time. And if they are copying, this is another issue: even if you deallocate memory for the adapter object, it is still allocated inside the plugin as a part of the state.
When the user calls the apply method, the adapter object is copied into the model state. If the user keeps using this LoRA adapter (LoRA_A) to generate, the adapter object (LoRA_A) is not accessed again and the model state does not change. Case one: when the user switches to another LoRA adapter (LoRA_B), the adapter object (LoRA_B) is copied into the model state and the adapter object (LoRA_A) is no longer used. Case two: when the user switches to an empty LoRA, the dynamic dimension of the model state is set to 0; when they then switch back to the LoRA adapter (LoRA_A), the adapter object (LoRA_A) is copied into the model state again.
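A compact sketch of the two cases described above, using the names from the earlier snippets (signatures illustrative only):

#include "openvino/genai/llm_pipeline.hpp"

int main() {
    ov::genai::Adapter lora_a("lora_a.safetensors");
    ov::genai::LLMPipeline pipe("model_dir", "CPU",
                                ov::genai::adapters(ov::genai::AdapterConfig(lora_a)));

    pipe.generate("p1");                                                   // LoRA_A applied to the states
    pipe.generate("p1 again");                                             // same adapter: states unchanged

    // Case two: an empty config collapses the LoRA dimension of the states to 0 ...
    pipe.generate("p2", ov::genai::adapters(ov::genai::AdapterConfig()));
    // ... and switching back re-applies (copies) LoRA_A into the states.
    pipe.generate("p3", ov::genai::adapters(ov::genai::AdapterConfig(lora_a)));
}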
What is your use case that demands remove_adapters functionality?
For example, a customer has four LoRA adapters: LoRA_A, LoRA_B, LoRA_C and LoRA_D. They don't want to load all four adapters at the same time. When they run pipe.generate with LoRA_A, only the LoRA_A adapter should be loaded; when they switch to pipe.generate with LoRA_B, they load LoRA_B and unload LoRA_A to save memory.
This is exactly what mmap'ed LoRA adapters will give you: the adapter file content will be loaded and unloaded automatically depending on system memory availability, like the OS filesystem cache, in the same way the OpenVINO IR .bin file is loaded. The first PR that introduces this functionality is openvinotoolkit/openvino#29175. There will be another one in GenAI that will allocate safetensors with that mmapped functionality.
The question is about the period between two generates, one with LoRA_A and the next one with LoRA_B: do you want to minimize the memory consumed by LoRA_A in this period, or does it not matter? If yes, you need an API like set_adapters that will allow you to (theoretically) deallocate plugin memory for Adapter states.
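For context, a plain POSIX sketch (illustrative only, not the API added in openvinotoolkit/openvino#29175) of why mmap changes the accounting: a read-only, file-backed mapping lives in the page cache, so the OS can drop and re-read the adapter pages on demand instead of them counting as allocated memory.

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main() {
    const char* path = "lora_a.safetensors";        // placeholder adapter file
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st{};
    fstat(fd, &st);

    // Read-only, shared, file-backed mapping: pages are loaded lazily on access
    // and can be evicted under memory pressure, like filesystem cache.
    void* data = mmap(nullptr, static_cast<size_t>(st.st_size), PROT_READ, MAP_SHARED, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    // ... parse the safetensors header / build tensors over `data` here ...

    munmap(data, static_cast<size_t>(st.st_size));  // drop the mapping when done
    close(fd);
}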
Can we just wait for the next generate call when unused adapters can be deallocated automatically if you don't use them in generation config and the adapter objects are not kept anywhere else?
The adapter objects are allocated by the user; generation_config and m_adapter_controller only hold references to them. Should the adapter objects be deallocated by the user themselves?
The user side should remove any references to the adapter; I believe it is the same for your PR. If the user cares about the memory allocated by the adapters, then yes, it is required to explicitly get rid of all adapter references, including the adapter object on the user side. Again, with the mmap functionality, once it is merged, this amount of memory doesn't affect the amount of memory that can be allocated for other needs in the system and can be counted as free memory.