specaug speedup #6347

Merged
merged 5 commits into from
Apr 6, 2023

1-800-BAD-CODE
Contributor

What does this PR do ?

Faster implementation of non-numba specaug

Collection: ASR

Changelog

Rather than repeatedly modify the features tensor in-place, build a mask and fill the features tensor once. By my measurements, about 20x faster on GPU.
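
The idea described above can be illustrated with a small sketch. This is not the code from the PR: the helper names `mask_in_place` and `mask_once`, the `spans` list, and the toy shapes are all assumptions made for illustration, and only frequency masks are shown. It contrasts one slice assignment per span with building a boolean mask on the host and applying it with a single `masked_fill` call.

```python
import random

import torch


def mask_in_place(spec, spans, mask_value=0.0):
    """Baseline: one slice assignment per span, modifying `spec` in place."""
    for idx, start, width in spans:
        spec[idx, start:start + width, :] = mask_value
    return spec


def mask_once(spec, spans, mask_value=0.0):
    """Collect all spans into one boolean mask, then fill the tensor once."""
    mask = torch.zeros(spec.shape, dtype=torch.bool)  # built on the host
    for idx, start, width in spans:
        mask[idx, start:start + width, :] = True
    # masked_fill is out-of-place: `spec` itself is left untouched.
    return spec.masked_fill(mask.to(spec.device), mask_value)


# Hypothetical toy setup: 2 frequency masks per batch element.
rng = random.Random(0)
batch, feat_dim, time_steps, freq_width = 4, 80, 50, 27
spans = []
for idx in range(batch):
    for _ in range(2):
        start = rng.randint(0, feat_dim - freq_width)
        width = rng.randint(0, freq_width)
        spans.append((idx, start, width))

spec = torch.randn(batch, feat_dim, time_steps)
expected = mask_in_place(spec.clone(), spans)
actual = mask_once(spec, spans)
assert torch.equal(actual, expected)
```

Because both helpers consume the same precomputed spans, the outputs can be compared exactly, which is the property the PR relies on when checking against the original implementation.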

It could be faster still, but it gets hard to compare output exactly to the original implementation to verify correctness.

Also, the original implementation seems to be biased away from the upper frequency bins, but again, fixing that would make it impossible to compare the outputs against the original implementation.
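
One plausible reading of this bias, offered as an assumption rather than the author's stated analysis: the start index is drawn from `[0, num_bins - freq_width]` and the width separately from `[0, freq_width]`, so the topmost bins are only reached when both draws are near their maximum. The sketch below enumerates every (start, width) pair for a single mask and computes the exact coverage probability per bin. The function name `bin_mask_probability` is hypothetical.

```python
from fractions import Fraction


def bin_mask_probability(num_bins, max_width):
    """Exact probability that each bin is covered by one mask.

    Mirrors the sampling in `forward_original`:
      start = randint(0, num_bins - max_width)  (inclusive bounds)
      width = randint(0, max_width)             (inclusive bounds)
    """
    starts = range(0, num_bins - max_width + 1)
    widths = range(0, max_width + 1)
    total = len(starts) * len(widths)
    counts = [0] * num_bins
    for start in starts:
        for width in widths:
            for b in range(start, start + width):
                counts[b] += 1
    return [Fraction(c, total) for c in counts]


probs = bin_mask_probability(num_bins=80, max_width=27)
print("lowest bin: ", probs[0])
print("middle bin: ", probs[40])
print("highest bin:", probs[-1])
```

With the settings used in the test snippet (80 bins, `freq_width=27`), a single mask covers the lowest bin with probability 1/56 but the highest bin with probability only 1/1512.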

Usage

Running this snippet will run both the original and this implementation, verify similar output, and print latencies.

Test snippet
import random
import time
import torch

from nemo.collections.asr.parts.submodules.spectr_augment import SpecAugment
from nemo.core.classes import typecheck


# Copy of original `SpecAugment.forward` for comparison
@typecheck()
@torch.no_grad()
def forward_original(self, input_spec, length):
    sh = input_spec.shape

    for idx in range(sh[0]):
        for i in range(self.freq_masks):
            x_left = self._rng.randint(0, sh[1] - self.freq_width)

            w = self._rng.randint(0, self.freq_width)

            input_spec[idx, x_left : x_left + w, :] = self.mask_value

        for i in range(self.time_masks):
            if self.adaptive_temporal_width:
                time_width = max(1, int(length[idx] * self.time_width))
            else:
                time_width = self.time_width

            y_left = self._rng.randint(0, max(1, length[idx] - time_width))

            w = self._rng.randint(0, time_width)

            input_spec[idx, :, y_left : y_left + w] = self.mask_value

    return input_spec


seed = 12345
batch_size = 128
feat_dim = 80
max_length = 1000
device = "cuda"
num_iterations = 100

# Generate some inputs
spec = torch.randn(size=[batch_size, feat_dim, max_length], device=device)
lengths = torch.randint(low=max_length // 10, high=max_length, size=[batch_size], device=device)
# Usually, at least one element in a batch is the max length.
lengths[0] = max_length
print(f"Testing with input features shape {spec.shape}")

# This version has the new forward
augmentor: SpecAugment = SpecAugment(
    freq_masks=2, time_masks=10, freq_width=27, time_width=0.05, rng=random.Random(seed)
).to(device)
# cache the first output for comparison (need to be careful of RNG values)
first_output_new = augmentor(input_spec=spec, length=lengths)

# Warm up
augmented_spec = augmentor(input_spec=spec, length=lengths)
# Loop and record time
start = time.time()
for _ in range(num_iterations):
    augmented_spec = augmentor(input_spec=spec, length=lengths)
    if device == "cuda":
        torch.cuda.synchronize()
stop = time.time()
mean_duration = (stop - start) / num_iterations
print(f"Mean duration for new implementation: {mean_duration * 1000:0.1f} ms")

# Re-instantiate a spec aug with same RNG
augmentor: SpecAugment = SpecAugment(
    freq_masks=2, time_masks=10, freq_width=27, time_width=0.05, rng=random.Random(seed)
).to(device)
# Set class's forward method to original one
setattr(SpecAugment, "forward", forward_original)
# Get baseline output. Note that the original implementation modifies in-place and returns a reference to `spec`, so
# use a clone of `spec` since we'll call forward some more for getting times
first_output_old = augmentor(input_spec=spec.clone(), length=lengths)

# Warm up
augmented_spec = augmentor(input_spec=spec, length=lengths)
# Loop and record time
start = time.time()
for _ in range(num_iterations):
    augmented_spec = augmentor(input_spec=spec, length=lengths)
    if device == "cuda":
        torch.cuda.synchronize()
stop = time.time()
mean_duration = (stop - start) / num_iterations
print(f"Mean duration for original specaug: {mean_duration * 1000:0.1f} ms")

# Compare the output of each module. Can use hard tolerance since op is a constant fill.
print("Asserting old and new augmentations are all close...")
torch.testing.assert_close(actual=first_output_new, expected=first_output_old, rtol=0.0, atol=0.0)
print("Old and new outputs match")

Expected output:

# With GPU
Testing with input features shape torch.Size([128, 80, 1000])
Mean duration for new implementation: 5.0 ms
Mean duration for original specaug: 100.1 ms
Asserting old and new augmentations are all close...
Old and new outputs match

# with CPU
Testing with input features shape torch.Size([128, 80, 1000])
Mean duration for new implementation: 8.5 ms
Mean duration for original specaug: 42.5 ms
Asserting old and new augmentations are all close...
Old and new outputs match

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

@github-actions github-actions bot added the ASR label Apr 2, 2023
Collaborator

@titu1994 titu1994 left a comment

Very cool PR! The speedup puts this on par with the Numba method (Numba is maybe a few milliseconds faster, but that no longer matters for CPU use, and the speedup on GPU is substantial).

Pr looks good, but I'll ask @VahidooX for final review

@@ -44,15 +45,15 @@ def input_types(self):
"""Returns definitions of module input types
"""
return {
"input_spec": NeuralType(('B', 'D', 'T'), SpectrogramType()),
Collaborator

Could you revert these changes ?

Contributor Author

Done. This was black's default string normalization, which was applied automatically.

@@ -120,13 +126,13 @@ class SpecCutout(nn.Module, Typing):
def input_types(self):
"""Returns definitions of module input types
"""
- return {"input_spec": NeuralType(('B', 'D', 'T'), SpectrogramType())}
+ return {"input_spec": NeuralType(("B", "D", "T"), SpectrogramType())}
Collaborator

Revert below.

Contributor Author

done

Collaborator

@titu1994 titu1994 left a comment

Thanks !

VahidooX previously approved these changes Apr 4, 2023
Collaborator

@VahidooX VahidooX left a comment

LGTM, thanks for the contribution!

Collaborator

titu1994 commented Apr 4, 2023

Seems like you need to sign your commits - https://github.com/NVIDIA/NeMo/pull/6347/checks?check_run_id=12461951009

titu1994 and others added 4 commits April 4, 2023 19:17
…VIDIA#6346)

Signed-off-by: smajumdar <titu1994@gmail.com>
Signed-off-by: shane carroll <shane.carroll@utsa.edu>
Signed-off-by: shane carroll <shane.carroll@utsa.edu>
Signed-off-by: shane carroll <shane.carroll@utsa.edu>
for more information, see https://pre-commit.ci

Signed-off-by: shane carroll <shane.carroll@utsa.edu>
@raman-r-4978

Just a suggestion: please have a look at torchaudio.functional.mask_along_axis_iid, which I think does the same thing.

@github-actions github-actions bot removed the core and NLP labels Apr 5, 2023
Collaborator

titu1994 commented Apr 5, 2023

Torchaudio is an optional dependency in NeMo and is not installed automatically, because it depends on a fixed PyTorch version and requires compilation. This PR is more generic: it has no requirements other than torch.
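
For readers unfamiliar with how an optional dependency is typically handled, here is a generic sketch of an import guard using only the standard library. It is not taken from NeMo: `has_module`, `HAVE_TORCHAUDIO`, and `pick_masking_backend` are hypothetical names, and the sketch only shows how code could check for torchaudio without requiring it.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


# Evaluated once at import time; False on installs without torchaudio.
HAVE_TORCHAUDIO = has_module("torchaudio")


def pick_masking_backend():
    """Hypothetical selector: use torchaudio if present, else a torch-only path."""
    return "torchaudio" if HAVE_TORCHAUDIO else "torch"


print(pick_masking_backend())
```

A torch-only implementation avoids this branching altogether, which is the point made in the comment above.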

@titu1994 titu1994 merged commit 8aec729 into NVIDIA:main Apr 6, 2023
Collaborator

titu1994 commented Apr 6, 2023

@1-800-BAD-CODE thanks for the PR !

hsiehjackson pushed a commit to hsiehjackson/NeMo that referenced this pull request Jun 2, 2023
* [Core] return_config=True now extracts just config, not full tarfile (NVIDIA#6346)

Signed-off-by: smajumdar <titu1994@gmail.com>
Signed-off-by: shane carroll <shane.carroll@utsa.edu>

* specaug speedup

Signed-off-by: shane carroll <shane.carroll@utsa.edu>

* comments

Signed-off-by: shane carroll <shane.carroll@utsa.edu>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: shane carroll <shane.carroll@utsa.edu>

---------

Signed-off-by: smajumdar <titu1994@gmail.com>
Signed-off-by: shane carroll <shane.carroll@utsa.edu>
Co-authored-by: Somshubra Majumdar <titu1994@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>
4 participants