
langchain_huggingface: Fix multiple GPU usage bug in from_model_id function #23628

Merged
merged 17 commits on Oct 23, 2024

Conversation

kenchanLOL
Contributor

  • Description:
    • pass the device_map into model_kwargs
    • remove the unused device_map argument from the hf_pipeline function call
  • Issue: How do I use multiple GPUs when using a model with Hugging Face? #13128
    When using the from_model_id function to load a Hugging Face model for text generation across multiple GPUs, the model defaults to loading on the CPU even though multiple GPUs are available and the expected format is used:
llm = HuggingFacePipeline.from_model_id(
    model_id="model-id",
    task="text-generation",
    device_map="auto",
)

Currently, to enable multiple GPUs, we have to pass the parameter in this format instead:

llm = HuggingFacePipeline.from_model_id(
    model_id="model-id",
    task="text-generation",
    device=None,
    model_kwargs={
        "device_map": "auto",
    }
)

This issue arises due to improper handling of the device and device_map parameters.

  • Explanation:
  1. In from_model_id, the model is created using model_kwargs and passed as the model argument of the pipeline function. So at that point, to load the model across multiple GPUs, "device_map" needs to be set to "auto" within model_kwargs. Otherwise, the model defaults to loading on the CPU.
  2. The device_map parameter in from_model_id is not utilized correctly. In the pipeline function's source code in transformers:
  • The device_map value is stored in the model_kwargs dictionary (lines 867-878 of transformers/src/transformers/pipelines/__init__.py).
    if device_map is not None:
        ......
        model_kwargs["device_map"] = device_map
  • The model is constructed with model_kwargs containing the device_map value ONLY IF model is a string (lines 893-903 of transformers/src/transformers/pipelines/__init__.py).
    if isinstance(model, str) or framework is None:
        model_classes = {"tf": targeted_task["tf"], "pt": targeted_task["pt"]}
        framework, model = infer_framework_load_model( ... , **model_kwargs, )
  • Consequently, since a model object is already passed to the pipeline function, the device_map argument of from_model_id is never used.
  3. The device_map parameter in from_model_id not only goes unused but can also cause errors. Without explicitly setting device=None, attempting to load the model on multiple GPUs may result in the following error:
 Device has 2 GPUs available. Provide device={deviceId} to `from_model_id` to use available GPUs for execution. deviceId is -1 (default) for CPU and can be a positive integer associated with CUDA device id.
 Traceback (most recent call last):
   File "foo.py", line 15, in <module>
     llm = HuggingFacePipeline.from_model_id(
   File "foo\site-packages\langchain_huggingface\llms\huggingface_pipeline.py", line 217, in from_model_id
     pipeline = hf_pipeline(
   File "foo\lib\site-packages\transformers\pipelines\__init__.py", line 1108, in pipeline
     return pipeline_class(model=model, framework=framework, task=task, **kwargs)
   File "foo\lib\site-packages\transformers\pipelines\text_generation.py", line 96, in __init__
     super().__init__(*args, **kwargs)
   File "foo\lib\site-packages\transformers\pipelines\base.py", line 835, in __init__
     raise ValueError(
 ValueError: The model has been loaded with `accelerate` and therefore cannot be moved to a specific device. Please discard the `device` argument when creating your pipeline object.

This error occurs because the default values in from_model_id for device and device_map are -1 and None, respectively. Since the device_map parameter stays None, the check (device_map is not None and device < 0) below is not satisfied, so device remains -1, and the pipeline function later raises an error when asked to move a model that was loaded with accelerate to a specific device.

if device_map is not None and device < 0:
    device = None
if device is not None and device < 0 and cuda_device_count > 0:
    logger.warning(
        "Device has %d GPUs available. "
        "Provide device={deviceId} to `from_model_id` to use available"
        "GPUs for execution. deviceId is -1 (default) for CPU and "
        "can be a positive integer associated with CUDA device id.",
        cuda_device_count,
    )
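
In short, the fix proposed here is to route a top-level device_map into model_kwargs before the model object is created, so that from_pretrained applies it, and to stop forwarding device_map to hf_pipeline. A simplified sketch of that handling follows (variable names as in from_model_id; the error message wording is illustrative, and the exact diff appears later in the review thread):

if device_map is not None:
    if device is not None:
        # device and device_map must not be combined; the review settled on
        # raising an error here instead of only warning.
        raise ValueError("Please set either `device` or `device_map`, not both.")
    if model_kwargs is None:
        model_kwargs = {}
    # Route device_map into model_kwargs so from_pretrained() applies it when
    # the model object is created, before it is handed to hf_pipeline().
    model_kwargs["device_map"] = device_map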

If no one reviews your PR within a few days, please @-mention one of baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.

kenchanLOL and others added 3 commits June 28, 2024 04:48
- Avoid the ambiguity of device_map variable by raising error when people try to pass device_map as key-value pair in model_kwargs
- removing the unused device_map variable in the hf_pipeline function call
@efriis efriis added the partner label Jun 28, 2024
@efriis efriis self-assigned this Jun 28, 2024
@dosubot dosubot bot added the size:S This PR changes 10-29 lines, ignoring generated files. label Jun 28, 2024

@dosubot dosubot bot added 🔌: huggingface Primarily related to HuggingFace integrations 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature labels Jun 28, 2024

@SaladDay SaladDay left a comment

ok

@kenchanLOL kenchanLOL requested a review from SaladDay June 30, 2024 00:31
@kenchanLOL kenchanLOL marked this pull request as draft July 5, 2024 09:58
@kenchanLOL kenchanLOL marked this pull request as ready for review July 5, 2024 09:59
@efriis
Member

efriis commented Jul 12, 2024

@Jofthomas could you take a look and mark ready to review by the langchain team if it looks good on the huggingface side?

@efriis efriis marked this pull request as draft July 12, 2024 23:25
@kenchanLOL
Contributor Author

@efriis I wanted to kindly follow up on the pull request I submitted over a month ago. I understand that everyone is busy, but I would greatly appreciate it if you could take a moment to check it and let me know if there's anything further I need to address.

@Jofthomas
Collaborator

Hey, Sorry for the wait, I will check it

@efriis
Member

efriis commented Aug 23, 2024

Looking at the transformers.hf_pipeline implementation, I actually think the existing implementation is correct, as device_map is a full parameter in this function and probably shouldn't be plumbed through model_kwargs.

If this is incorrect, should probably fix that in transformers instead of langchain.

https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/__init__.py#L564

@efriis efriis closed this Aug 23, 2024
@kenchanLOL
Contributor Author

kenchanLOL commented Aug 24, 2024

Thanks for the reply. Actually, my PR addresses the issue you mentioned. The current implementation has a problem where the device_map parameter doesn't work as expected unless it is plumbed through model_kwargs to correctly assign model weights based on the user setting. This issue arises from a tricky detail in the transformers.pipeline implementation.

The model parameter in transformers.pipeline accepts either a string or a PreTrainedModel object. When model is a string, the device_map parameter is referenced and stored as a key-value pair in model_kwargs, which is then passed into the from_pretrained() method to create a model object. But when model is an object, device_map is never used. You can see this in https://github.com/huggingface/transformers/blob/0a7af19f4dc868bafc82f35eb7e8d13bac87a594/src/transformers/pipelines/__init__.py#L867-L903
However, in the current from_model_id implementation, the model is first created using model_id and model_kwargs, resulting in a PreTrainedModel object that is then passed into transformers.pipeline. This means that even though the device_map parameter of this function is also passed into transformers.pipeline, it is never actually utilized by the pipeline as we would expect. Therefore, in the current implementation, if users really want to set the device_map for their model, they need to plumb it through model_kwargs instead of the device_map parameter. My PR fixes this and makes the device_map parameter of this function work as expected.
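
To make the distinction concrete, here is a minimal illustration (not part of the PR; gpt2 is used as a small stand-in model, and device_map="auto" assumes accelerate is installed):

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Passing a string: pipeline() calls from_pretrained() itself, so device_map
# is routed through model_kwargs and honored.
pipe_from_string = pipeline("text-generation", model="gpt2", device_map="auto")

# Passing an already-loaded object: from_pretrained() has already run, so the
# device_map argument below is effectively ignored and the model stays where
# it was loaded (CPU by default).
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
pipe_from_object = pipeline(
    "text-generation", model=model, tokenizer=tokenizer, device_map="auto"
)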

I tried my best to clarify the reasoning, but the overlap in variable names between these two functions might make it a bit confusing.

@efriis
Member

efriis commented Aug 30, 2024

doesn't this seem like a bug in transformers though? If so, I don't want logic bypassing their interface.

As a holdover, can't you just pass model_kwargs={"device_map": ...} manually without this PR?

@kenchanLOL
Contributor Author

kenchanLOL commented Aug 30, 2024

@efriis I don't believe this is a bug in transformers because it makes sense not to recreate the model object when the user already passes a model object into the function. However, while the current implementation allows for passing model_kwargs={"device_map": ...} as a workaround, this approach is not well-documented. The device_map parameter, as mentioned in the official documentation (https://python.langchain.com/v0.2/docs/integrations/llms/huggingface_pipelines/), cannot be used as described. The proper solution, which involves passing model_kwargs, is only mentioned in the issue I linked in my initial PR message. This lack of clear and accessible documentation makes it confusing for users like myself to utilize the device_map parameter for multi-GPU setups, despite following the official guide.

Additionally, I believe my implementation won't negatively impact existing users who pass device_map as model_kwargs={"device_map": ...}, since all I'm doing is overwriting the device_map value within model_kwargs. There are no significant logic changes, and the existing functionality remains intact.

GPU Inference
When running on a machine with GPU, you can specify the device=n parameter to put the model on the specified device. Defaults to -1 for CPU inference.

If you have multiple-GPUs and/or the model is too large for a single GPU, you can specify device_map="auto", which requires and uses the [Accelerate](https://huggingface.co/docs/accelerate/index) library to automatically determine how to load the model weights.

Note: both device and device_map should not be specified together and can lead to unexpected behavior.

gpu_llm = HuggingFacePipeline.from_model_id(
    model_id="gpt2",
    task="text-generation",
    device=0,  # replace with device_map="auto" to use the accelerate library.
    pipeline_kwargs={"max_new_tokens": 10},
)

gpu_chain = prompt | gpu_llm

question = "What is electroencephalography?"

print(gpu_chain.invoke({"question": question}))

@efriis
Member

efriis commented Sep 3, 2024

yes but doesn't this documentation make device_map seem like it should be injected into model_kwargs["device_map"] by transformers.pipeline?

https://huggingface.co/docs/transformers/en/main_classes/pipelines#transformers.pipeline

@efriis
Member

efriis commented Sep 3, 2024

(not recreating the model object - just the routing of the parameter)

@kenchanLOL
Contributor Author

@efriis, I’m not entirely clear on the point you made. Both model_kwargs and device_map are only utilized during the model initialization stage. In the documentation you shared, all examples pass a string as the model input parameter, so the model initialization occurs within the pipeline function. However, in the current implementation of from_model_id, this initialization happens before calling transformers.pipeline, where the model input parameter is already a model object.

As a result, regardless of the value or data type of model_kwargs or device_map passed to transformers.pipeline, they have no impact on the code since they are not used at that stage. The reason why passing device_map as model_kwargs["device_map"] seems to work is that model_kwargs is utilized by the from_pretrained method to create the model before these values are passed to transformers.pipeline.

I understand that device_map is just a shortcut for setting model_kwargs["device_map"]. While everything works fine without this PR, I hope to replicate this shortcut from transformers to provide developers with a consistent experience between langchain_huggingface and transformers.

In summary, the existing device_map parameter in from_model_id is ineffective and misleading as currently implemented. It should either be removed or fixed to function as described in the documentation.

_model_kwargs = model_kwargs or {}
tokenizer = AutoTokenizer.from_pretrained(model_id, **_model_kwargs)
try:
    if task == "text-generation":
        if backend == "openvino":
            try:
                from optimum.intel.openvino import (  # type: ignore[import]
                    OVModelForCausalLM,
                )
            except ImportError:
                raise ValueError(
                    "Could not import optimum-intel python package. "
                    "Please install it with: "
                    "pip install 'optimum[openvino,nncf]' "
                )
            try:
                # use local model
                model = OVModelForCausalLM.from_pretrained(
                    model_id, **_model_kwargs
                )
            except Exception:
                # use remote model
                model = OVModelForCausalLM.from_pretrained(
                    model_id, export=True, **_model_kwargs
                )
        else:
            model = AutoModelForCausalLM.from_pretrained(
                model_id, **_model_kwargs
            )

pipeline = hf_pipeline(
    task=task,
    model=model,
    tokenizer=tokenizer,
    device=device,
    device_map=device_map,
    batch_size=batch_size,
    model_kwargs=_model_kwargs,
    **_pipeline_kwargs,
)
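
For reference, these are the two invocations this PR aims to make behave the same (the model id is illustrative, and device_map="auto" assumes accelerate is installed):

from langchain_huggingface import HuggingFacePipeline

# Workaround that already works today: plumb device_map through model_kwargs
# and explicitly disable the legacy device argument.
llm_workaround = HuggingFacePipeline.from_model_id(
    model_id="gpt2",
    task="text-generation",
    device=None,
    model_kwargs={"device_map": "auto"},
)

# Intended usage after this PR: the top-level device_map parameter is honored.
llm_fixed = HuggingFacePipeline.from_model_id(
    model_id="gpt2",
    task="text-generation",
    device_map="auto",
)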

@efriis
Member

efriis commented Sep 8, 2024

I think you might be confusing the tokenizer and the pipeline object.

In order to help me understand your case, could you write 2 code snippets:

  1. an initialization of transformers pipeline(model_obj, device_map="auto") that works
  2. an initialization of HuggingFacePipeline.from_model_id(device_map="auto") that doesn't work as expected

pipeline (renamed hf_pipeline in this implementation) takes the same device_map argument in this instance, so I think the behavior should be identical to what transformers does when passed a device_map right?

@kenchanLOL
Contributor Author

@efriis First of all, I didn't mix up the tokenizer and pipeline. The tokenizer is included for the sake of code completeness, and I double-checked the transformers source code to make sure it's not related to the bug we are discussing. All my discussion is about the initialization and data type of the model variable only.
Second, here are the code snippets.

  1. an initialization of transformers pipeline(model_obj, device_map="auto") that works
from transformers import pipeline
pipe = pipeline(model="meta-llama/Meta-Llama-3-8B-Instruct", device_map="auto")
# checking GPU memory usage with nvidia-smi, the model is properly distributed across multiple GPUs
  2. an initialization of HuggingFacePipeline.from_model_id(device_map="auto") that doesn't work as expected
from langchain_huggingface import HuggingFacePipeline
pipe = HuggingFacePipeline.from_model_id(
    model_id = "meta-llama/Meta-Llama-3-8B-Instruct",
    task="text-generation",
    device_map="auto"
)
#  the following message is printed in the terminal, showing that no device is used
#  Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.

Lastly, for your question, the answer is no for the current implementation, and that's exactly why I am making this PR to fix this counter-intuitive bug. Also, I realized that this bug persists across two HuggingFacePipeline implementations, one from langchain_huggingface and the other from langchain_community.

@efriis
Member

efriis commented Sep 11, 2024

thank you for clarifying! I misunderstood the original point - sorry about that!

if we pass a model object instead of a model string to pipeline(model=model_obj, device_map="auto"), it will fail to set device_map (and that's the behavior we see in from_model_id, because we're initializing from_pretrained model objects instead of passing the model string directly).

I'll reopen and plan on merging after the 0.3 changes go in. The main thing that would be great to get into the 0.3 releases would be to raise an exception instead of a warning when both are set.

@efriis efriis reopened this Sep 11, 2024
@efriis efriis marked this pull request as ready for review September 11, 2024 23:25
@dosubot dosubot bot added the langchain Related to the langchain package label Sep 11, 2024
if model_kwargs is None:
    model_kwargs = {}
model_kwargs["device_map"] = device_map
device = None
Member


Suggested change
device = None

Contributor Author

@kenchanLOL kenchanLOL Sep 12, 2024


Same reason as the comment above: this line was used as a temporary fix. If the default value of device is changed, this line is no longer needed.


if device_map is not None:
    if device is not None:
        logger.warning(
Member


Suggested change
logger.warning(
raise ValueError(

Contributor Author

@kenchanLOL kenchanLOL Sep 12, 2024


If we simply escalate this from a warning to an error, it would always be triggered whenever device_map is set, because the default value of device is -1 (CPU) in the current implementation. I checked the transformers documentation and realized that the default value of device for pipeline changed from -1 to None. I tested both values, and they have the same behavior of loading the model onto the CPU; the only difference is the logging message in the terminal.

pipe = HuggingFacePipeline.from_model_id(model_id = "microsoft/phi-2",task="text-generation",device=None)
# message : Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.

pipe = HuggingFacePipeline.from_model_id(model_id = "microsoft/phi-2",task="text-generation",device=-1)
# message : Device has 1 GPUs available. Provide device={deviceId} to `from_model_id` to use availableGPUs for execution. deviceId is -1 (default) for CPU and can be a positive integer associated with CUDA device id.

Although the log message for device=-1 looks better, it is a legacy value that could cause conflicts; updating the default to None aligns the behavior with the latest practices and prevents unexpected errors in the future. So I think we should also change the default value of device in line 77 while changing this from a warning to an error.

@@ -218,7 +230,6 @@ def from_model_id(
model=model,
tokenizer=tokenizer,
device=device,
Member


Suggested change
device=device,
device=device,
device_map=device_map,

would it make sense to keep this passthrough in case transformers uses it in the future?

Contributor Author


Similar reason as above. There is also a parameter validation in pipeline that checks whether both device and device_map are not None. It would always be triggered, because the current default value for device is -1 and device_map is also not None (either 'auto' or a dictionary). So we can't simply keep the passthrough without also considering the device parameter's value.

kenchanLOL and others added 2 commits September 12, 2024 11:42
…ce_pipeline.py

Co-authored-by: Erick Friis <erickfriis@gmail.com>
Co-authored-by: Erick Friis <erickfriis@gmail.com>
@kenchanLOL
Contributor Author

@efriis sorry, I am not very familiar with git and accidentally committed all your suggestions. In short, your suggestions might cause a bug without also changing the default value of the device parameter. Please check my comments on your suggestions for details.

@ccurme ccurme removed the langchain Related to the langchain package label Sep 20, 2024
@kenchanLOL
Contributor Author

@efriis I’ve reviewed your suggestions and made the necessary changes accordingly. Please have a look and let me know if anything else needs further adjustment.

@kenchanLOL
Contributor Author

@efriis
I wanted to kindly follow up on this PR. I believe it's nearly complete and ready for review. When you have a moment, I'd appreciate any feedback or approval. I understand you may be busy, so I just wanted to ensure this doesn't slip through the cracks.

@efriis
Member

efriis commented Oct 18, 2024

hey! sorry for the delay here - having some trouble getting a hold of the huggingface team regarding their open PRs, and I'm not as familiar with huggingface's multiple GPU setup, which also doesn't get tested in our CI to confirm this works or not.

@Jofthomas if you could please take a look. Otherwise going to ship it as-is early next week

@kenchanLOL
Contributor Author

Thanks for the update.

@efriis efriis assigned vbarda and unassigned efriis Oct 21, 2024

@SunMarc SunMarc left a comment


Accelerate/transformers maintainer here! LGTM! As you said, we need to pass the device_map in the _model_kwargs if we are going to load the model before passing it to the pipeline.

@vbarda vbarda merged commit 6d2a76a into langchain-ai:master Oct 23, 2024
19 checks passed