Open source export and deploy modules #8743
Conversation
There are some major and minor comments that require addressing; overall, though, the PR is very well done.
from .deploy_base import DeployBase


class DeployPyTriton(DeployBase):
The class name has insufficient context: deploy PyTriton with which model? We are planning to deploy PyTriton for streaming ASR and TTS as new tools too, so please call this class DeployPytritonLLM or something else that denotes what it is deploying.
There is no need for a specific model; the goal is to support deployment of any model. You can either pass a NeMo checkpoint using the checkpoint_path param or an in-memory model using the model param. Full in-memory model support will be added later, though.
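In other words, the two construction paths being described would look roughly like this sketch (the import path, the triton_model_name argument, and the placeholder model are illustrative assumptions, not the PR's exact API):

from nemo.deploy import DeployPyTriton  # import path assumed

# path 1: deploy straight from a NeMo checkpoint on disk
nm = DeployPyTriton(checkpoint_path="/models/gpt.nemo", triton_model_name="llm")

# path 2: deploy an in-memory model (full support to be added later, per the reply above)
in_memory_model = ...  # placeholder for a loaded NeMo model
nm = DeployPyTriton(model=in_memory_model, triton_model_name="llm")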
Hmm, so you're saying the same streaming code will apply to both LLM and ASR? Then there's no need to act on this comment, but let's keep it unresolved so the ASR team can look into it later.
@titu1994 Thanks for your review and all of the suggestions. I tried to address all of your comments; I agree with most of them and updated the code based on your suggestions, and I left replies on the rest. Can you please do another review? I'd love to merge this ASAP because many tasks depend on this PR. @ericharper Please let us know what you think as well.
Let's add these folders to the ignore list in setup.py; then the PR is ready to merge.
@titu1994 Added deploy and export to the exclude-package list here: https://github.com/oyilmaz-nvidia/NeMo/blob/oss-export-deploy/setup.py#L237
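For reference, excluding packages from the pip wheel in setup.py typically looks like this setuptools sketch (the exact patterns used in the PR may differ):

from setuptools import find_packages, setup

setup(
    name="nemo_toolkit",  # name assumed for illustration
    # keep the new modules out of the published package, as requested above
    packages=find_packages(exclude=["nemo.deploy", "nemo.deploy.*", "nemo.export", "nemo.export.*"]),
)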
Looks good for now, thanks for all the changes!
jenkins
CodeQL found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.
# Similar to split_save_weight but done on GPU for performance
@torch.no_grad()
def save_weight_torch(tp_rank, saved_dir, split_factor, key, vals, storage_type, act_range, config):
We convert the NeMo weights to the TRT-LLM format, and we have two functions to do this, one using numpy and one using torch; I wonder if they could share some code. Instead of using a chain of if/else statements, the mapping of NeMo names to TRT-LLM names could be stored in a dict that keys index into. That would also make this code easier to maintain, since NeMo param names have been known to change.
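A minimal sketch of the suggested refactor, assuming a shared lookup table keyed by NeMo parameter-name fragments (the entries below are illustrative, not the real mapping in this PR):

# illustrative NeMo -> TRT-LLM name table shared by the numpy and torch converters
NEMO_TO_TRTLLM = {
    "self_attention.query_key_value.weight": "attention.qkv.weight",
    "self_attention.dense.weight": "attention.dense.weight",
    "mlp.dense_h_to_4h.weight": "mlp.fc.weight",
    "mlp.dense_4h_to_h.weight": "mlp.proj.weight",
}

def map_nemo_to_trtllm(nemo_key: str) -> str:
    """Translate a NeMo parameter name to its TRT-LLM counterpart."""
    for nemo_frag, trtllm_frag in NEMO_TO_TRTLLM.items():
        if nemo_frag in nemo_key:
            return nemo_key.replace(nemo_frag, trtllm_frag)
    raise KeyError(f"no TRT-LLM mapping for {nemo_key}")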
"""Returns the tensor_parallel_group config based on tensor_parallel.""" | ||
from mpi4py import MPI | ||
|
||
mpi_rank = MPI.COMM_WORLD.Get_rank() |
Could we use tensorrt_llm.mpi_rank() for self-consistency?
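For reference, the suggested change would look roughly like this (assuming the pinned TRT-LLM version exposes tensorrt_llm.mpi_rank() at the top level):

import tensorrt_llm

# use the TRT-LLM helper instead of going through mpi4py directly
mpi_rank = tensorrt_llm.mpi_rank()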
index += 1


def rename_key(old_key: str, pp_rank: int, num_layers: int, pp_size: int):
Why do we need to rename keys here? Don't we already map the NeMo param names to TRT-LLM names inside split_and_save_weight()?
Overall a great PR, just some minor comments. Could you take a pass through the CodeQL findings? There are a lot of unused imports that need to be cleaned up.
Also, will you follow up with a PR for developer docs? It would be helpful for NeMo developers who want to use these new modules.
share_embedding_table = False
share_weight = None
if share_embedding_table:
    share_weight = self.embedding.vocab_embedding.weight
CodeQL (code scanning) warning: unreachable code.
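The warning fires because share_embedding_table is hardcoded to False, so the branch body can never execute. One hedged way to resolve it, assuming the flag should be driven by the model config rather than a literal (the attribute name is an assumption):

# hypothetical: derive the flag from config so the branch is reachable
share_embedding_table = getattr(config, "share_embedding_table", False)
share_weight = None
if share_embedding_table:
    share_weight = self.embedding.vocab_embedding.weight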
nemo/export/trt_llm/nemo_utils.py (outdated):
import shutil
import sys
import tempfile
import typing
CodeQL (code scanning) note: module is imported with both 'import' and 'import from'.
from pathlib import Path

import numpy as np
import tensorstore  # this is important even though not used
CodeQL (code scanning) note: unused import.
nemo/export/tensorrt_llm.py (outdated):
)

return weights.cpu().detach()
return None
CodeQL (code scanning) warning: unreachable code.
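A minimal sketch of one way to make the fallback reachable, assuming weights can legitimately be None when nothing was loaded (the surrounding function body is not shown here):

# guard the tensor return so the None fallback is no longer dead code
if weights is not None:
    return weights.cpu().detach()
return None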
tests/export/test_nemo_export.py (outdated):
(
    trtllm_accuracy,
    trtllm_accuracy_relaxed,
    trtllm_deployed_accuracy,
    trtllm_deployed_accuracy_relaxed,
) = run_trt_llm_inference(
    model_name=args.model_name,
    model_type=args.model_type,
    prompt=prompt_template,
    checkpoint_path=args.checkpoint_dir,
    trt_llm_model_dir=args.trt_llm_model_dir,
    n_gpu=n_gpus,
    max_batch_size=args.max_batch_size,
    max_input_token=args.max_input_token,
    max_output_token=args.max_output_token,
    ptuning=args.ptuning,
    p_tuning_checkpoint=args.p_tuning_checkpoint,
    lora=args.lora,
    lora_checkpoint=args.lora_checkpoint,
    tp_size=args.tp_size,
    pp_size=args.pp_size,
    top_k=args.top_k,
    top_p=args.top_p,
    temperature=args.temperature,
    run_accuracy=args.run_accuracy,
    debug=args.debug,
    streaming=args.streaming,
    test_deployment=args.test_deployment,
)
CodeQL (code scanning) error: mismatch in multiple assignment (test).
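This error means CodeQL believes the unpacked tuple's arity may not match what run_trt_llm_inference returns on every code path. A defensive sketch (the explicit guard here is an illustration, not necessarily the fix the PR adopted):

results = run_trt_llm_inference(...)  # same keyword arguments as in the snippet above
if not isinstance(results, tuple) or len(results) != 4:
    raise ValueError(f"expected 4 accuracy values, got {results!r}")
(
    trtllm_accuracy,
    trtllm_accuracy_relaxed,
    trtllm_deployed_accuracy,
    trtllm_deployed_accuracy_relaxed,
) = results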
nemo/export/__init__.py (outdated):
# except Exception as e:
#     LOGGER.warning("TensorRTLLM could not be imported.")
CodeQL (code scanning) note: commented-out code.
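The commented-out lines, together with the "wrap imports with try except" commit below, point at a guarded optional import. A sketch of that pattern (the logger name and import path are assumptions):

import logging

LOGGER = logging.getLogger("NeMo")

try:
    from nemo.export.tensorrt_llm import TensorRTLLM  # path assumed
except Exception:
    LOGGER.warning("TensorRTLLM could not be imported.")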
jenkins
LGTM. Thanks!
* export and deploy modules
* Add export tests
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* Address PR reviews
* Add try except
* Moved query_llm to nlp folder
* removed lambada.json
* Reverting the Jenkinsfile
* Exclude deploy and export from the pip
* Address the CodeQL issues
* Addressing reviews
* remove deploy test for now
* Addressing CodeQL comments
* wrap imports with try except
* Add test data param and fix codeql issue

Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <complex451@gmail.com>
What does this PR do?
This PR open sources the export and deploy modules, which export a NeMo checkpoint to TensorRT-LLM and serve it using PyTriton.
Collection: [None]
Changelog
Usage
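As a hedged illustration, usage of the new modules might look like the following, based on the classes and parameters discussed in this review (constructor and method names are assumptions, not necessarily the PR's exact API):

from nemo.export import TensorRTLLM     # exporter, path assumed from the diff
from nemo.deploy import DeployPyTriton  # deploy class discussed above

# export a NeMo checkpoint to a TensorRT-LLM engine (signature assumed)
exporter = TensorRTLLM(model_dir="/tmp/trt_llm_model_dir")
exporter.export(nemo_checkpoint_path="model.nemo", n_gpus=1)

# serve the exported model with PyTriton (parameters assumed)
server = DeployPyTriton(model=exporter, triton_model_name="llm")
server.deploy()
server.serve()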
Jenkins CI
To run Jenkins, a NeMo user with write access must comment jenkins on the PR.
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.