Add TRT-LLM params like max_num_tokens and opt_num_tokens #9210

Merged: 14 commits into NVIDIA:main, May 21, 2024

Conversation

oyilmaz-nvidia (Collaborator) commented:

What does this PR do?

Adds TensorRT-LLM build parameters, max_num_tokens and opt_num_tokens, to the NeMo TRT-LLM export and deploy paths.

Collection: export

Changelog

  • Add max_num_tokens and opt_num_tokens to the TRT-LLM exporter and pass them through the export and deploy scripts.
  • Add a remove_input_padding option (exposed as -drip/--disable_remove_input_padding).
  • Drop the context_fmha option for now.

Usage
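A minimal sketch of passing the new parameters through the Python exporter (the checkpoint path, model type, and exact API shape are assumed from nemo.export, not taken verbatim from this PR):

from nemo.export import TensorRTLLM

trt_llm_exporter = TensorRTLLM(model_dir="/tmp/trt_llm_engine")
trt_llm_exporter.export(
    nemo_checkpoint_path="/path/to/model.nemo",  # placeholder checkpoint path
    model_type="llama",                          # placeholder model type
    max_input_token=1024,
    max_batch_size=8,
    max_num_tokens=1638,  # e.g. int(max_input_token * max_batch_size * 0.2), the heuristic used in this PR's tests
    opt_num_tokens=60,    # the test-script value discussed in the review below
)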

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove and re-add the label.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

  • Related to # (issue)

oyilmaz-nvidia and others added 4 commits May 15, 2024 (signed off by Onur Yilmaz <oyilmaz@nvidia.com> and oyilmaz-nvidia <oyilmaz-nvidia@users.noreply.github.com>).

@oyilmaz-nvidia marked this pull request as ready for review May 17, 2024, then added 2 commits (signed off by Onur Yilmaz <oyilmaz@nvidia.com>).
@github-actions bot removed the NLP label May 17, 2024 and again May 20, 2024.
oyilmaz-nvidia and others added 2 commits May 20, 2024 (signed off by oyilmaz-nvidia <oyilmaz-nvidia@users.noreply.github.com>).
max_lora_rank (int): maximum lora rank.
max_num_tokens (int):
opt_num_tokens (int):
save_nemo_model_config (bool):
Collaborator: Missing descriptions for the last three params.
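For reference, plausible one-line descriptions based on what these options mean in TensorRT-LLM (my wording, not taken from the PR):

max_num_tokens (int): maximum total number of batched input tokens the engine is built to handle (most relevant when input padding is removed).
opt_num_tokens (int): number of tokens the engine is optimized for; should not exceed max_num_tokens.
save_nemo_model_config (bool): whether to save the NeMo model config alongside the exported engine.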

"-dcf",
"--disable_context_fmha",
"-drip",
"--disable_remove_input_padding",
Collaborator: Could we use --remove_input_padding instead of --disable_remove_input_padding, to avoid the negation?

Collaborator (reply): That would require providing --remove_input_padding in customer code to keep the current behavior unchanged (i.e., to get remove_input_padding=True when no parameters are specified). It is only worthwhile if it leads to better naming consistency with the TensorRT-LLM code.
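For what it's worth, argparse.BooleanOptionalAction (available since Python 3.9) would offer both spellings while keeping the default True, sidestepping that trade-off; a sketch only, not what this PR does:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--remove_input_padding",
    action=argparse.BooleanOptionalAction,  # also generates --no-remove_input_padding
    default=True,                           # preserves current behavior when no flag is passed
    help="Remove input padding in the TRT-LLM engine.",
)
print(parser.parse_args([]).remove_input_padding)  # True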

@@ -214,6 +214,8 @@ def run_trt_llm_inference(
max_prompt_embedding_table_size=max_prompt_embedding_table_size,
use_lora_plugin=use_lora_plugin,
lora_target_modules=lora_target_modules,
max_num_tokens=int(max_input_token * max_batch_size * 0.2),
@janekl (Collaborator), May 21, 2024: Could we add a comment on this line explaining max_input_token * max_batch_size * 0.2, e.g. what the 0.2 stands for and how it is derived in general?

opt_num_tokens=60,
Collaborator: Let's define this 60 in the place where all the other defaults are; all other params are defined elsewhere.
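I.e., roughly (the surrounding signature is abridged, and the neighboring defaults are hypothetical):

def run_trt_llm_inference(
    max_input_token: int = 256,  # hypothetical existing default
    max_batch_size: int = 8,     # hypothetical existing default
    opt_num_tokens: int = 60,    # the literal 60, moved next to the other defaults
):
    ...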

paged_kv_cache: bool = True,
remove_input_padding: bool = True,
max_num_tokens: int = None,
opt_num_tokens: int = None,
Collaborator: Formally, the max_num_tokens and opt_num_tokens parameters should have Optional[int] typing.
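I.e., along these lines (the function name is illustrative; the other parameters are just context from the diff above):

from typing import Optional

def build(
    paged_kv_cache: bool = True,
    remove_input_padding: bool = True,
    max_num_tokens: Optional[int] = None,
    opt_num_tokens: Optional[int] = None,
):
    ...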

use_lora_plugin=args.use_lora_plugin,
lora_target_modules=args.lora_target_modules,
max_lora_rank=args.max_lora_rank,
save_nemo_model_config=True,
Collaborator: Maybe also here:

  • save_nemo_model_config=True -> args.save_nemo_model_config
  • pipeline_parallel_size=1 -> args.pipeline_parallel_size (if it is possible to change this at all currently)

@janekl (Collaborator) commented May 21, 2024:

Are all argparse args in https://github.com/NVIDIA/NeMo/blob/main/scripts/export/export_to_trt_llm.py#L24 and https://github.com/NVIDIA/NeMo/blob/main/scripts/deploy/nlp/deploy_triton.py#L28 the same?

If yes, or if a reasonable common subset exists, then we could perhaps define them in a single location somewhere in nemo.export and import them elsewhere for transparency (fine to do this outside this PR).
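One possible shape for such a shared location (the module path, helper name, and short flags below are hypothetical, not part of this PR):

import argparse

def add_trt_llm_build_args(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
    """Register TRT-LLM build arguments shared by export_to_trt_llm.py and deploy_triton.py."""
    parser.add_argument("-mnt", "--max_num_tokens", type=int, default=None)
    parser.add_argument("-ont", "--opt_num_tokens", type=int, default=None)
    return parser

# Each script would then call:
# parser = add_trt_llm_build_args(argparse.ArgumentParser())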

@janekl (Collaborator) commented May 21, 2024:

I left some comments, mostly minor things like parameter default organization, naming, etc.

@oyilmaz-nvidia merged commit 2e1814c into NVIDIA:main May 21, 2024. 130 checks passed.
pablo-garay added a commit that referenced this pull request May 22, 2024
* Add params like max_num_tokens and opt_num_tokens

Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com>

* remove padding param added

* update params like max_num_token

Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: oyilmaz-nvidia <oyilmaz-nvidia@users.noreply.github.com>

* remove context context_fmha param for now

Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com>

* add params like max num token to the script

Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: oyilmaz-nvidia <oyilmaz-nvidia@users.noreply.github.com>

---------

Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com>
Signed-off-by: oyilmaz-nvidia <oyilmaz-nvidia@users.noreply.github.com>
Co-authored-by: oyilmaz-nvidia <oyilmaz-nvidia@users.noreply.github.com>
Co-authored-by: Pablo Garay <palenq@gmail.com>
BoxiangW pushed a commit to BoxiangW/NeMo that referenced this pull request Jun 5, 2024, with the same commit message as above, additionally signed off by Boxiang Wang <boxiangw@nvidia.com>.
janekl pushed a commit that referenced this pull request Jun 12, 2024, with the same commit message as above, additionally signed off by Jan Lasek <janek.lasek@gmail.com>.
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024, with the same commit message as above.
@ko3n1g mentioned this pull request Jul 18, 2024.