Prompt formatter API and canary transcribe tensor input support #9206
Conversation
Initial comments
```diff
-        tokens, prompts = [], []
+        prompts_with_answers, prompts = [], []
         for cut in cuts:
             if isinstance(cut, MixedCut):
                 cut = cut._first_non_padding_cut
             assert isinstance(cut, MonoCut), "Expected MonoCut."
```
Better to change this to raising a TypeError that says something like "expected input audio to have a single channel", since users might not know what "MonoCut" means.
+1
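For reference, the suggested replacement could look roughly like this (a sketch of the review suggestion, not code from the PR):

```python
if not isinstance(cut, MonoCut):
    raise TypeError(
        f"Expected input audio to have a single channel, but got {type(cut).__name__}."
    )
```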
```python
            prompt = prompt.replace(_mangled(slot), value)
        return self._apply_tokenizer(prompt, lang=slot_values.get(self.PROMPT_LANGUAGE_SLOT))

    def encode_dialog(self, turns: list[dict]) -> dict[str, torch.Tensor]:
```
If I understand correctly, this PR is for encoder-decoder models like Canary/BESTOW, where all of the (multi-turn) dialogue should be in text.
Can we think a little bit about supporting the audio modality in slot values as well? (Maybe we should keep audio slots untokenized and replace them with "audio features" later; one way is for the prompt formatter to return something like a list of lists.)
+1 to this. Piotr told me he is planning on this as v2; maybe we can resume the discussion at that time.
Sounds good.
@pzelasko if possible, let's try to put the skeleton in place, e.g. if the slot value needs to be re-defined as a (value, modality) tuple, the return needs to be a list of lists/tuples, etc.
@krishnacpuvvada I'm thinking that for multimodal we'll add a method that returns a "formatted prompt" as a sequence of embeddings instead. The benefit of using embeddings rather than token IDs is that we can support models with non-discrete latent spaces in addition to discretized ones. There are a few options:
- Initialize it with (or register post-init) a dict of {modality: nn.Module} that is used internally to convert "raw" modality input to a sequence of embeddings; the prompt formatter is then used at the beginning of the forward step so that you can train these modules.
- Provide the sequence of embeddings directly; but even then you still need to use the formatter in the forward step, as it's unlikely you'll embed audio/images/video in the dataloader process on a CPU fast enough.

In terms of skeletons, I've already put in the Modality type with a single type, text, which is used in the slot schema definition and in validating that a value "is" from a given modality. I'm 90% confident it'll be sufficient to extend to other modalities in v2.
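To make the skeleton discussion concrete, here is a rough sketch of the shape being described; every name below is invented for illustration, none of it is code from this PR:

```python
import torch
import torch.nn as nn


# Hypothetical v2 skeleton: slot values are (value, modality) tuples, and a
# registry of per-modality nn.Modules converts raw inputs into embedding
# sequences inside the forward step, so the embedding modules stay trainable.
class MultimodalPromptFormatter:
    def __init__(self, embedders: dict[str, nn.Module]):
        # e.g. {"text": token_embedding, "audio": audio_encoder}
        self.embedders = embedders

    def encode_turn_as_embeddings(self, slots: dict[str, tuple[torch.Tensor, str]]) -> torch.Tensor:
        # Each slot carries a modality tag; concatenate per-slot embedding
        # sequences into one (seq_len, hidden) prompt representation.
        parts = [self.embedders[modality](value) for value, modality in slots.values()]
        return torch.cat(parts, dim=0)
```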
Sounds good.
Agreed; any audio encoder (especially our 600M ones) has to be run on a GPU.
Nice work! LGTM
A thought: can we also add a sample.py/simple.py template with the simplest possible template, plus a few comments about which routines need to be defined? (This is mainly coming from the case where a user wants to create their own custom template; I know there are plenty of examples already.)
I like this; yeah, a canonical template to copy-paste and directly modify.
OK
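For illustration, here is roughly what such a minimal copy-paste template could look like, inferred from the slot and pipe conventions visible in this PR's diff; the import path and class attributes are assumptions rather than the PR's actual file:

```python
from nemo.collections.common.prompts.formatter import Modality, PromptFormatter


class SimplestPromptFormatter(PromptFormatter):
    """The smallest useful template: one user turn and one assistant turn."""

    # Name under which PromptFormatter.resolve(...) finds this class.
    NAME = "simplest"
    # The role whose turns are treated as the answer during training.
    OUTPUT_ROLE = "assistant"
    # Pipes delimit slots inside template definitions only.
    TEMPLATE = {
        "user": {
            "template": "|message|",
            "slots": {"message": Modality.Text},
        },
        "assistant": {
            "template": "|message|",
            "slots": {"message": Modality.Text},
        },
    }
```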
It looks really good now; minor comments from me. Let's address the rest and merge.
```diff
@@ -134,6 +131,12 @@ def __init__(self, cfg: DictConfig, trainer: Trainer = None):
         super().__init__(cfg=cfg, trainer=trainer)

+        prompt_cls = PromptFormatter.resolve(self.prompt_format)
+        self.prompt = prompt_cls(
```
Not important for this PR, but I was thinking of serializing the keys of the prompt format into config for user visibility.
```diff
@@ -977,3 +1002,78 @@ def predict_step(self, batch, batch_idx=0, dataloader_idx=0, has_processed_signa
         text = [self.decoding.strip_special_tokens(t) for t in text]
         return text

+
+def parse_multitask_prompt(prompt: dict | None) -> list[dict]:
```
Very nice!
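To illustrate what this normalization enables (the signature is from the diff above, but the slot names below are assumptions), the legacy flat dict and the explicit turn form could both be accepted and normalized to the same list-of-turns structure:

```python
# Hypothetical inputs, both normalized to list[dict] by parse_multitask_prompt:
legacy = {"source_lang": "en", "target_lang": "en", "task": "asr", "pnc": "yes"}
single_turn = {"role": "user", "slots": legacy}

# Either form would come out as something like:
# [{"role": "user", "slots": {"source_lang": "en", "target_lang": "en", "task": "asr", "pnc": "yes"}}]
```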
Amazing work!
Great work! LGTM
…IA#9206)

* Apply CanaryPromptFormatter in dataset/inference
* Working inference with CanaryPromptFormatter
* Minimum working example of Canary.transcribe() with tensors
* training fix
* Update to the new 'chat' based prompt formatting API
* Prompt formatters for popular models and partial unit test coverage
* Updated documentation
* Improved test coverage + proper preamble support
* Fix usage of PromptFormatter for MT-AED class + fix tokenization/formatting issues
* Move some canary hacks to canary prompt formatter, improve validation, add tests for aggtok
* aed_model.transcribe(**slots) support, rename all slots to lowercase and drop pipes everywhere except template definition
* truly generic version
* making transcribe_speech.py work with prompt slots + syntactic sugar
* update streaming_utils.py
* fix
* code review: partial
* Accept multi-turn, single-turn, and legacy prompt format in transcribe() and transcribe_speech.py
* Address code reviews
* Add support for SPE special tokens bos/eos in prompt templates and ensure Llama2 format gives identical results with the reference implementation
* Fix tests and add llama2 prompt formatter tests
* Fix tests

Signed-off-by: Piotr Żelasko <petezor@gmail.com>
What does this PR do?
A generic prompt formatter for the text modality, with several out-of-the-box prompt format definitions. See the class documentation for more details.
It also enables support for tensor/array inputs in Canary. Example snippet:
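(The original snippet did not survive this page's formatting; below is a sketch of the kind of call this enables. The model name and keyword arguments are assumptions, not verbatim from the PR.)

```python
import torch
from nemo.collections.asr.models import EncDecMultiTaskModel

model = EncDecMultiTaskModel.from_pretrained("nvidia/canary-1b")

# A 16 kHz mono waveform passed as an array instead of a path to a file.
audio = torch.randn(4 * 16000).numpy()  # 4 seconds of dummy audio

hypotheses = model.transcribe(audio=audio, batch_size=1)
```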
We can also now provide these values dynamically to transcribe_speech.py from the CLI. Example:
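(A sketch of such an invocation; the override keys are assumptions based on the prompt-slot naming in this PR.)

```bash
python examples/asr/transcribe_speech.py \
    pretrained_name="nvidia/canary-1b" \
    audio_dir=/path/to/audio \
    +prompt.source_lang=en \
    +prompt.target_lang=en \
    +prompt.task=asr \
    +prompt.pnc=yes
```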
Collection: ASR

Changelog
Usage
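A sketch of the formatter API based on the calls visible in this PR's diff (PromptFormatter.resolve and encode_dialog); the import path, constructor signature, slot names, and returned keys are assumptions:

```python
from nemo.collections.common.prompts.formatter import PromptFormatter

# Look up a registered prompt formatter by name and bind it to a tokenizer
# ('tokenizer' is assumed to be the model's tokenizer, already in scope).
prompt_cls = PromptFormatter.resolve("canary")
formatter = prompt_cls(tokenizer)

# Encode a dialog into token tensors; returns a dict[str, torch.Tensor].
encoded = formatter.encode_dialog(
    turns=[
        {
            "role": "user",
            "slots": {"source_lang": "en", "target_lang": "en", "task": "asr", "pnc": "yes"},
        },
        {"role": "assistant", "slots": {"text": "hello world"}},
    ]
)
```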
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The contributor guidelines contain specific people who can review PRs to various areas.
Additional Information