Update and move convention section to CONTRIBUTING.md #1635

Merged
merged 1 commit into from
Jul 28, 2021
33 changes: 33 additions & 0 deletions CONTRIBUTING.md
@@ -129,6 +129,39 @@ make html

The built docs should now be available in `docs/build/html`

## Conventions

As a good software development practice, we try to stick to existing variable
names and shapes (for tensors).
The following are some of the conventions that we follow.

- We use an ellipsis "..." as a placeholder for the rest of the dimensions of a
tensor, e.g. optional batching and channel dimensions. If batching, the
"batch" dimension should come first.
- Tensors are assumed to have the "channel" dimension coming before the "time"
dimension. The bins in the frequency domain (freq and mel) are assumed to come
before the "time" dimension but after the "channel" dimension. This
ordering makes the tensors consistent with PyTorch's dimensions.
- For size names, the prefix `n_` is used (e.g. "a tensor of size (`n_freq`,
`n_mels`)") whereas dimension names do not have this prefix (e.g. "a tensor of
dimension (channel, time)")
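
As a rough illustration of the ellipsis convention, the sketch below splits a shape into its leading "..." dimensions and its named trailing dimensions. The helper `split_dims` is hypothetical and is not part of torchaudio; it works on plain tuples so that it runs without PyTorch installed.

```python
from typing import Dict, Sequence, Tuple


def split_dims(
    shape: Sequence[int], names: Sequence[str]
) -> Tuple[Tuple[int, ...], Dict[str, int]]:
    """Split `shape` into the leading "..." dimensions and the named trailing ones.

    Hypothetical helper for illustration only; not a torchaudio function.
    """
    if len(shape) < len(names):
        raise ValueError(
            f"expected at least {len(names)} dimensions, got {len(shape)}"
        )
    n_leading = len(shape) - len(names)
    leading = tuple(shape[:n_leading])            # whatever "..." stands for
    named = dict(zip(names, shape[n_leading:]))   # e.g. channel, freq, time
    return leading, named


# Unbatched stereo waveform: (channel, time); "..." is empty.
print(split_dims((2, 16000), ("channel", "time")))
# Batch of 8 stereo waveforms: the batch dimension comes first, under "...".
print(split_dims((8, 2, 16000), ("channel", "time")))
# Batched spectrogram: freq sits after channel and before time.
print(split_dims((8, 2, 201, 81), ("channel", "freq", "time")))
```

The same call works whether or not a batch dimension is present, which is the point of writing shapes as (..., channel, time).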

Here are some examples of commonly used variables with their names,
meanings, and shapes (or units):

* `waveform`: a tensor of audio samples with dimensions (..., channel, time)
* `sample_rate`: the rate of audio samples (samples per second)
* `specgram`: a tensor of spectrogram with dimensions (..., channel, freq, time)
* `mel_specgram`: a mel spectrogram with dimensions (..., channel, mel, time)
* `hop_length`: the number of samples between the starts of consecutive frames
* `n_fft`: the number of Fourier bins
* `n_mels`, `n_mfcc`: the number of mel and MFCC bins
* `n_freq`: the number of bins in a linear spectrogram
* `f_min`: the lowest frequency of the lowest band in a spectrogram
* `f_max`: the highest frequency of the highest band in a spectrogram
* `win_length`: the length of the STFT window
* `window_fn`: a function that creates a window, e.g. `torch.hann_window`
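
The sketch below shows how the size names above might relate to the shapes in this list. It assumes a one-sided STFT with centered framing, which is an assumption of this example rather than something the list states. The helpers `specgram_shape` and `mel_specgram_shape` are hypothetical and operate on plain shape tuples.

```python
from typing import Tuple


def specgram_shape(
    waveform_shape: Tuple[int, ...], n_fft: int, hop_length: int
) -> Tuple[int, ...]:
    """Map (..., channel, time) to (..., channel, freq, time).

    Hypothetical helper; assumes a one-sided, centered STFT.
    """
    *leading, n_samples = waveform_shape
    # A one-sided spectrum keeps only the non-negative frequency bins.
    n_freq = n_fft // 2 + 1
    # With centered framing, one frame starts every `hop_length` samples.
    n_frames = 1 + n_samples // hop_length
    return (*leading, n_freq, n_frames)


def mel_specgram_shape(
    specgram_shape_: Tuple[int, ...], n_mels: int
) -> Tuple[int, ...]:
    """Map (..., channel, freq, time) to (..., channel, mel, time)."""
    *leading, _n_freq, n_frames = specgram_shape_
    return (*leading, n_mels, n_frames)


waveform_shape = (2, 16000)  # (channel, time): one second of stereo at 16 kHz
spec = specgram_shape(waveform_shape, n_fft=400, hop_length=200)
mel = mel_specgram_shape(spec, n_mels=128)
print(spec)  # (2, 201, 81)
print(mel)   # (2, 128, 81)
```

Because the leading dimensions are passed through untouched, a batched input such as `(8, 2, 16000)` yields `(8, 2, 201, 81)` under the same assumptions.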
Collaborator


Looking at the code, the explanation, and your comment, I think only `waveform` and `sample_rate` are standard and relevant across the code base. If there are only one or two usages, then it's not really a convention or standard. And when a new contributor adds a new feature and we review it, as a good software development practice, we naturally try to stick to existing variable names. So I doubt we need this.
Rather, this list of conventions is confusing and limiting because, IIRC, there are cases where `waveform` has a batch dimension in addition to channel and time, which seems not to follow the convention.
How about removing this?

Contributor Author


I actually think we should keep it. I think this short table can help as a quick reference. I've fixed the batch dimension problem and rewritten this paragraph a bit.
Please let me know what you think about this new version.
Thanks.


## License

By contributing to Torchaudio, you agree that your contributions will be licensed
39 changes: 0 additions & 39 deletions README.md
@@ -138,45 +138,6 @@ API Reference

API Reference is located here: http://pytorch.org/audio/

Conventions
-----------

With torchaudio being a machine learning library and built on top of PyTorch,
torchaudio is standardized around the following naming conventions. Tensors are
assumed to have "channel" as the first dimension and time as the last
dimension (when applicable). Both of these dimensions make the tensors consistent with PyTorch's dimensions.
For size names, the prefix `n_` is used (e.g. "a tensor of size (`n_freq`, `n_mel`)")
whereas dimension names do not have this prefix (e.g. "a tensor of
dimension (channel, time)")

* `waveform`: a tensor of audio samples with dimensions (channel, time)
* `sample_rate`: the rate of audio dimensions (samples per second)
* `specgram`: a tensor of spectrogram with dimensions (channel, freq, time)
* `mel_specgram`: a mel spectrogram with dimensions (channel, mel, time)
* `hop_length`: the number of samples between the starts of consecutive frames
* `n_fft`: the number of Fourier bins
* `n_mel`, `n_mfcc`: the number of mel and MFCC bins
* `n_freq`: the number of bins in a linear spectrogram
* `min_freq`: the lowest frequency of the lowest band in a spectrogram
* `max_freq`: the highest frequency of the highest band in a spectrogram
* `win_length`: the length of the STFT window
* `window_fn`: for functions that creates windows e.g. `torch.hann_window`

Transforms expect and return the following dimensions.

* `Spectrogram`: (channel, time) -> (channel, freq, time)
* `AmplitudeToDB`: (channel, freq, time) -> (channel, freq, time)
* `MelScale`: (channel, freq, time) -> (channel, mel, time)
* `MelSpectrogram`: (channel, time) -> (channel, mel, time)
* `MFCC`: (channel, time) -> (channel, mfcc, time)
* `MuLawEncode`: (channel, time) -> (channel, time)
* `MuLawDecode`: (channel, time) -> (channel, time)
* `Resample`: (channel, time) -> (channel, time)
* `Fade`: (channel, time) -> (channel, time)
* `Vol`: (channel, time) -> (channel, time)

Here, and in the documentation, we use an ellipsis "..." as a placeholder for the rest of the dimensions of a tensor, e.g. optional batching and channel dimensions.

Contributing Guidelines
-----------------------
