
Argument passing for TP degree #772

Open

VenkateshPasumarti opened this issue Jan 29, 2025 · 5 comments

Comments

VenkateshPasumarti commented Jan 29, 2025

Feature request

I was trying to generate text embeddings for a Mistral-based model using sentence-transformers, but I am facing a memory issue: the complete model is loaded onto a single NeuronCore, which triggers out-of-memory errors, since the Mistral model requires 16 GB and a single NeuronCore has only 16 GB of memory. So I wanted an argument to activate multiple cores when generating with optimum-neuron.

Motivation

I need to activate multiple cores, and also to be able to run two models in parallel on different cores.

Your contribution

I was able to run smaller models, but I am facing issues with larger ones.

dacorvo (Collaborator) commented Jan 29, 2025

@VenkateshPasumarti thank you for your feedback. You can use the num_cores argument to increase the number of cores on which the model is deployed.
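For reference, a minimal sketch of such an export using the CLI documented in the guide linked below; the model id, shapes, and core count here are illustrative:

```shell
optimum-cli export neuron \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  --batch_size 1 \
  --sequence_length 4096 \
  --auto_cast_type bf16 \
  --num_cores 8 \
  mistral_neuron/
```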

dacorvo (Collaborator) commented Jan 29, 2025

https://huggingface.co/docs/optimum-neuron/guides/export_model#exporting-llms-to-neuron

VenkateshPasumarti (Author)

Thanks for the reply @dacorvo. Can you tell me how I can run different models in parallel, something like locking certain cores to a certain task?
For example, on an inf2.48xlarge I have multiple cores, so I want to run a sentence-transformers-based embeddings model on the first two cores and a re-ranking model on the next few cores.

dacorvo (Collaborator) commented Jan 30, 2025

@VenkateshPasumarti you can restrict the number of visible cores by using environment variables, but for that each model must run in a separate process (please refer to the AWS Neuron SDK documentation to see how).
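A sketch of the separate-process approach: the serving scripts here are hypothetical, and the core ranges assume an inf2.48xlarge. `NEURON_RT_VISIBLE_CORES` restricts which NeuronCores the Neuron runtime in a given process can see.

```shell
# Pin each model to a disjoint set of NeuronCores.
NEURON_RT_VISIBLE_CORES=0-1 python serve_embeddings.py &  # hypothetical script
NEURON_RT_VISIBLE_CORES=2-5 python serve_reranker.py &    # hypothetical script
```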
Alternatively, you can run each model in a container, mapping only some devices to each container (you can see an example of that using docker compose in the benchmark/text-generation-inference/performance subdirectory).
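The container variant of the same idea, sketched with plain docker run rather than the docker compose file referenced above (the image names are placeholders; on inf2, each /dev/neuronX device exposes two NeuronCores):

```shell
# Embeddings model sees only device 0 (cores 0-1).
docker run -d --device=/dev/neuron0 my-embeddings-image

# Re-ranking model sees devices 1 and 2 (cores 2-5).
docker run -d --device=/dev/neuron1 --device=/dev/neuron2 my-reranker-image
```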

VenkateshPasumarti (Author)

Thanks for the reply and information @dacorvo
