Merged
1 change: 1 addition & 0 deletions README.md
@@ -403,6 +403,7 @@ You can refine your search by selecting the task you're interested in (e.g., [te
1. **[OWL-ViT](https://huggingface.co/docs/transformers/model_doc/owlvit)** (from Google AI) released with the paper [Simple Open-Vocabulary Object Detection with Vision Transformers](https://huggingface.co/papers/2205.06230) by Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby.
1. **[OWLv2](https://huggingface.co/docs/transformers/model_doc/owlv2)** (from Google AI) released with the paper [Scaling Open-Vocabulary Object Detection](https://huggingface.co/papers/2306.09683) by Matthias Minderer, Alexey Gritsenko, Neil Houlsby.
1. **[PaliGemma](https://huggingface.co/docs/transformers/main/model_doc/paligemma)** (from Google) released with the papers [PaliGemma: A versatile 3B VLM for transfer](https://huggingface.co/papers/2407.07726) and [PaliGemma 2: A Family of Versatile VLMs for Transfer](https://huggingface.co/papers/2412.03555) by the PaliGemma Google team.
1. **[Parakeet](https://huggingface.co/docs/transformers/main/model_doc/parakeet)** (from NVIDIA) released with the blog post [Introducing the Parakeet ASR family](https://developer.nvidia.com/blog/pushing-the-boundaries-of-speech-recognition-with-nemo-parakeet-asr-models/) by the NVIDIA NeMo team.
1. **[PatchTSMixer](https://huggingface.co/docs/transformers/main/model_doc/patchtsmixer)** (from IBM) released with the paper [TSMixer: Lightweight MLP-Mixer Model for Multivariate Time Series Forecasting](https://huggingface.co/papers/2306.09364) by Vijay Ekambaram, Arindam Jati, Nam Nguyen, Phanwadee Sinthong, Jayant Kalagnanam.
1. **[PatchTST](https://huggingface.co/docs/transformers/main/model_doc/patchtst)** (from Princeton University, IBM) released with the paper [A Time Series is Worth 64 Words: Long-term Forecasting with Transformers](https://huggingface.co/papers/2211.14730) by Yuqi Nie, Nam H. Nguyen, Phanwadee Sinthong, Jayant Kalagnanam.
1. **[Phi](https://huggingface.co/docs/transformers/main/model_doc/phi)** (from Microsoft) released with the papers - [Textbooks Are All You Need](https://huggingface.co/papers/2306.11644) by Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee and Yuanzhi Li, [Textbooks Are All You Need II: phi-1.5 technical report](https://huggingface.co/papers/2309.05463) by Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar and Yin Tat Lee.
1 change: 1 addition & 0 deletions docs/snippets/6_supported-models.snippet
@@ -117,6 +117,7 @@
1. **[OWL-ViT](https://huggingface.co/docs/transformers/model_doc/owlvit)** (from Google AI) released with the paper [Simple Open-Vocabulary Object Detection with Vision Transformers](https://huggingface.co/papers/2205.06230) by Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby.
1. **[OWLv2](https://huggingface.co/docs/transformers/model_doc/owlv2)** (from Google AI) released with the paper [Scaling Open-Vocabulary Object Detection](https://huggingface.co/papers/2306.09683) by Matthias Minderer, Alexey Gritsenko, Neil Houlsby.
1. **[PaliGemma](https://huggingface.co/docs/transformers/main/model_doc/paligemma)** (from Google) released with the papers [PaliGemma: A versatile 3B VLM for transfer](https://huggingface.co/papers/2407.07726) and [PaliGemma 2: A Family of Versatile VLMs for Transfer](https://huggingface.co/papers/2412.03555) by the PaliGemma Google team.
1. **[Parakeet](https://huggingface.co/docs/transformers/main/model_doc/parakeet)** (from NVIDIA) released with the blog post [Introducing the Parakeet ASR family](https://developer.nvidia.com/blog/pushing-the-boundaries-of-speech-recognition-with-nemo-parakeet-asr-models/) by the NVIDIA NeMo team.
1. **[PatchTSMixer](https://huggingface.co/docs/transformers/main/model_doc/patchtsmixer)** (from IBM) released with the paper [TSMixer: Lightweight MLP-Mixer Model for Multivariate Time Series Forecasting](https://huggingface.co/papers/2306.09364) by Vijay Ekambaram, Arindam Jati, Nam Nguyen, Phanwadee Sinthong, Jayant Kalagnanam.
1. **[PatchTST](https://huggingface.co/docs/transformers/main/model_doc/patchtst)** (from Princeton University, IBM) released with the paper [A Time Series is Worth 64 Words: Long-term Forecasting with Transformers](https://huggingface.co/papers/2211.14730) by Yuqi Nie, Nam H. Nguyen, Phanwadee Sinthong, Jayant Kalagnanam.
1. **[Phi](https://huggingface.co/docs/transformers/main/model_doc/phi)** (from Microsoft) released with the papers - [Textbooks Are All You Need](https://huggingface.co/papers/2306.11644) by Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee and Yuanzhi Li, [Textbooks Are All You Need II: phi-1.5 technical report](https://huggingface.co/papers/2309.05463) by Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar and Yin Tat Lee.
16 changes: 16 additions & 0 deletions src/models.js
@@ -6155,6 +6155,21 @@ export class Wav2Vec2ForAudioFrameClassification extends Wav2Vec2PreTrainedModel
}
//////////////////////////////////////////////////

//////////////////////////////////////////////////
// Parakeet models
export class ParakeetPreTrainedModel extends PreTrainedModel { };
export class ParakeetForCTC extends ParakeetPreTrainedModel {
/**
* @param {Object} model_inputs
* @param {Tensor} model_inputs.input_features Log-mel features extracted from the raw speech waveform.
* @param {Tensor} model_inputs.attention_mask Mask to avoid performing convolution and attention on padding frame indices. Mask values selected in `[0, 1]`.
*/
async _call(model_inputs) {
return new CausalLMOutput(await super._call(model_inputs));
}
}
//////////////////////////////////////////////////


//////////////////////////////////////////////////
// PyAnnote models
@@ -8140,6 +8155,7 @@ const MODEL_FOR_CTC_MAPPING_NAMES = new Map([
['unispeech-sat', ['UniSpeechSatForCTC', UniSpeechSatForCTC]],
['wavlm', ['WavLMForCTC', WavLMForCTC]],
['hubert', ['HubertForCTC', HubertForCTC]],
['parakeet_ctc', ['ParakeetForCTC', ParakeetForCTC]],
]);

const MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING_NAMES = new Map([
1 change: 1 addition & 0 deletions src/models/feature_extractors.js
@@ -5,6 +5,7 @@ export * from './clap/feature_extraction_clap.js';
export * from './dac/feature_extraction_dac.js';
export * from './gemma3n/feature_extraction_gemma3n.js';
export * from './moonshine/feature_extraction_moonshine.js';
export * from './parakeet/feature_extraction_parakeet.js';
export * from './pyannote/feature_extraction_pyannote.js';
export * from './seamless_m4t/feature_extraction_seamless_m4t.js';
export * from './snac/feature_extraction_snac.js';
121 changes: 121 additions & 0 deletions src/models/parakeet/feature_extraction_parakeet.js
@@ -0,0 +1,121 @@
import { FeatureExtractor, validate_audio_inputs } from '../../base/feature_extraction_utils.js';
import { Tensor } from '../../utils/tensor.js';
import { mel_filter_bank, spectrogram, window_function } from '../../utils/audio.js';

const EPSILON = 1e-5;

export class ParakeetFeatureExtractor extends FeatureExtractor {

constructor(config) {
super(config);

// Prefer given `mel_filters` from preprocessor_config.json, or calculate them if they don't exist.
this.config.mel_filters ??= mel_filter_bank(
Math.floor(1 + this.config.n_fft / 2), // num_frequency_bins
this.config.feature_size, // num_mel_filters
0.0, // min_frequency
this.config.sampling_rate / 2, // max_frequency
this.config.sampling_rate, // sampling_rate
"slaney", // norm
"slaney", // mel_scale
);

const window = window_function(this.config.win_length, 'hann', {
periodic: false,
});

this.window = new Float64Array(this.config.n_fft);
const offset = Math.floor((this.config.n_fft - this.config.win_length) / 2);
this.window.set(window, offset);
}

/**
* Computes the log-Mel spectrogram of the provided audio waveform.
* @param {Float32Array|Float64Array} waveform The audio waveform to process.
* @returns {Promise<Tensor>} The log-Mel spectrogram as a Tensor of shape `(num_frames, num_mel_filters)`.
*/
async _extract_fbank_features(waveform) {
// Parakeet uses a custom preemphasis strategy: preemphasis is applied to the entire waveform at once (rather than per frame)
const preemphasis = this.config.preemphasis;
waveform = new Float64Array(waveform); // Clone to avoid destructive changes
for (let j = waveform.length - 1; j >= 1; --j) {
waveform[j] -= preemphasis * waveform[j - 1];
}

const features = await spectrogram(
waveform,
this.window, // window
this.window.length, // frame_length
this.config.hop_length, // hop_length
{
fft_length: this.config.n_fft,
power: 2.0,
mel_filters: this.config.mel_filters,
log_mel: 'log',
mel_floor: -Infinity,
pad_mode: 'constant',
center: true,

// Custom
transpose: true,
mel_offset: 2 ** -24,
}
)

return features;
}

/**
* Asynchronously extracts features from a given audio using the provided configuration.
* @param {Float32Array|Float64Array} audio The raw audio waveform to process.
* @returns {Promise<{ input_features: Tensor; attention_mask: Tensor; }>} A Promise resolving to an object containing the extracted input features and the corresponding attention mask, both as Tensors.
*/
async _call(audio) {
validate_audio_inputs(audio, 'ParakeetFeatureExtractor');

const features = await this._extract_fbank_features(audio);

const features_length = Math.floor(
(audio.length + Math.floor(this.config.n_fft / 2) * 2 - this.config.n_fft) / this.config.hop_length
);

const features_data = /** @type {Float32Array} */ (features.data);
features_data.fill(0, features_length * features.dims[1]);

// normalize mel features, ignoring padding
const [num_frames, num_features] = features.dims;
const sum = new Float64Array(num_features);
const sum_sq = new Float64Array(num_features);

for (let i = 0; i < features_length; ++i) {
const offset = i * num_features;
for (let j = 0; j < num_features; ++j) {
const val = features_data[offset + j];
sum[j] += val;
sum_sq[j] += val * val;
}
}

// Calculate mean and standard deviation, then normalize
const divisor = features_length > 1 ? features_length - 1 : 1;
for (let j = 0; j < num_features; ++j) {
const mean = sum[j] / features_length;
const variance = (sum_sq[j] - features_length * mean * mean) / divisor;
const std = Math.sqrt(variance) + EPSILON;
const inv_std = 1 / std;

for (let i = 0; i < features_length; ++i) {
const index = i * num_features + j;
features_data[index] = (features_data[index] - mean) * inv_std;
}
}

const mask_data = new BigInt64Array(num_frames);
mask_data.fill(1n, 0, features_length);

return {
input_features: features.unsqueeze_(0),
attention_mask: new Tensor('int64', mask_data, [1, num_frames]),
};
}
}
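The two Parakeet-specific steps in the extractor above — whole-waveform preemphasis and per-feature mean/variance normalization over only the non-padded frames — can be sketched in isolation. This is a toy standalone version (hypothetical helper names, plain typed arrays instead of the library's `Tensor`), not the library's API:

```javascript
// Whole-waveform preemphasis: y[n] = x[n] - a * x[n-1].
// Iterating in reverse means each subtraction reads the *original*
// previous sample, matching the in-place loop in _extract_fbank_features.
function preemphasize(waveform, a = 0.97) {
  const out = Float64Array.from(waveform);
  for (let j = out.length - 1; j >= 1; --j) {
    out[j] -= a * out[j - 1];
  }
  return out;
}

// Per-feature normalization over the first `validFrames` rows of a
// (numFrames x numFeatures) row-major matrix, using the unbiased (n - 1)
// divisor for the variance, as in the extractor's _call.
function normalizeFeatures(data, numFeatures, validFrames, epsilon = 1e-5) {
  for (let j = 0; j < numFeatures; ++j) {
    let sum = 0, sumSq = 0;
    for (let i = 0; i < validFrames; ++i) {
      const v = data[i * numFeatures + j];
      sum += v;
      sumSq += v * v;
    }
    const mean = sum / validFrames;
    const divisor = validFrames > 1 ? validFrames - 1 : 1;
    const std = Math.sqrt((sumSq - validFrames * mean * mean) / divisor) + epsilon;
    for (let i = 0; i < validFrames; ++i) {
      const idx = i * numFeatures + j;
      data[idx] = (data[idx] - mean) / std;
    }
  }
  return data;
}

// A constant signal keeps only its first sample after preemphasis:
// [1, 1, 1, 1] becomes roughly [1, 0.03, 0.03, 0.03] with a = 0.97.
const wave = preemphasize([1, 1, 1, 1], 0.97);
```

Normalizing per feature (column) rather than per frame is what makes the zero-filled padding frames safe to exclude: they would otherwise drag every column's mean toward zero.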
3 changes: 2 additions & 1 deletion src/pipelines.js
@@ -1750,6 +1750,7 @@ export class AutomaticSpeechRecognitionPipeline extends (/** @type {new (options
case 'unispeech':
case 'unispeech-sat':
case 'hubert':
case 'parakeet_ctc':
return this._call_wav2vec2(audio, kwargs)
case 'moonshine':
return this._call_moonshine(audio, kwargs)
@@ -1790,7 +1791,7 @@
for (const item of logits) {
predicted_ids.push(max(item.data)[1])
}
const predicted_sentences = this.tokenizer.decode(predicted_ids)
const predicted_sentences = this.tokenizer.decode(predicted_ids, { skip_special_tokens: true }).trim();
toReturn.push({ text: predicted_sentences })
}
return single ? toReturn[0] : toReturn;
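The greedy decoding path above — per-frame argmax over the logits, then decode with special tokens stripped and whitespace trimmed — can be sketched without a real tokenizer. The `argmax`/`greedyDecode` helpers and toy vocabulary below are illustrative only, and this simplification omits the blank-collapsing that a real CTC tokenizer performs during decode:

```javascript
// Index of the highest-scoring entry in one frame of logits,
// mirroring `max(item.data)[1]` in the pipeline.
function argmax(row) {
  let best = 0;
  for (let i = 1; i < row.length; ++i) {
    if (row[i] > row[best]) best = i;
  }
  return best;
}

// Toy stand-in for `tokenizer.decode(ids, { skip_special_tokens: true }).trim()`:
// drop special-token ids, map the rest through a vocabulary, join, and trim.
function greedyDecode(logits, vocab, specialIds = new Set([0])) {
  const ids = logits.map(argmax);
  return ids
    .filter((id) => !specialIds.has(id))
    .map((id) => vocab[id])
    .join('')
    .trim();
}

// Toy vocabulary: id 0 is a special <unk>-style token that gets skipped.
const vocab = ['<unk>', 'K', 'A'];
const logits = [
  [0.9, 0.05, 0.05], // frame 1 → id 0, skipped as special
  [0.1, 0.8, 0.1],   // frame 2 → id 1, 'K'
];
greedyDecode(logits, vocab); // → 'K'
```

This also illustrates why the updated unit test below expects `"K"` rather than `"<unk>K"`: passing `skip_special_tokens: true` removes special tokens from the decoded text.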
24 changes: 19 additions & 5 deletions src/utils/audio.js
@@ -470,6 +470,7 @@ function power_to_db(spectrogram, reference = 1.0, min_value = 1e-10, db_range =
* @param {number} [options.min_num_frames=null] If provided, ensures the number of frames to compute is at least this value.
* @param {boolean} [options.do_pad=true] If `true`, pads the output spectrogram to have `max_num_frames` frames.
* @param {boolean} [options.transpose=false] If `true`, the returned spectrogram will have shape `(num_frames, num_frequency_bins/num_mel_filters)`. If `false`, the returned spectrogram will have shape `(num_frequency_bins/num_mel_filters, num_frames)`.
* @param {number} [options.mel_offset=0] Offset to add to the mel spectrogram to avoid taking the log of zero.
* @returns {Promise<Tensor>} Spectrogram of shape `(num_frequency_bins, length)` (regular spectrogram) or shape `(num_mel_filters, length)` (mel spectrogram).
*/
export async function spectrogram(
@@ -498,6 +499,7 @@
max_num_frames = null,
do_pad = true,
transpose = false,
mel_offset = 0,
} = {}
) {
const window_length = window.length;
@@ -530,11 +532,23 @@
}

if (center) {
if (pad_mode !== 'reflect') {
throw new Error(`pad_mode="${pad_mode}" not implemented yet.`)
switch (pad_mode) {
case 'reflect': {
const half_window = Math.floor((fft_length - 1) / 2) + 1;
waveform = padReflect(waveform, half_window, half_window);
break;
}
case 'constant': {
const padding = Math.floor(fft_length / 2);
// @ts-expect-error ts(2351)
const padded = new waveform.constructor(waveform.length + 2 * padding);
padded.set(waveform, padding);
waveform = padded;
break;
}
default:
throw new Error(`pad_mode="${pad_mode}" not implemented yet.`);
}
const half_window = Math.floor((fft_length - 1) / 2) + 1;
waveform = padReflect(waveform, half_window, half_window);
}

// split waveform into frames of frame_length size
@@ -641,7 +655,7 @@

const mel_spec_data = /** @type {Float32Array} */(mel_spec.data);
for (let i = 0; i < mel_spec_data.length; ++i) {
mel_spec_data[i] = Math.max(mel_floor, mel_spec_data[i]);
mel_spec_data[i] = mel_offset + Math.max(mel_floor, mel_spec_data[i]);
}

if (power !== null && log_mel !== null) {
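The new `constant` branch zero-pads `floor(fft_length / 2)` samples on each side of the waveform; since typed arrays are zero-initialized, only the middle needs to be set. A standalone sketch of that padding (the helper name is illustrative, not the library's API):

```javascript
// Center the waveform by zero-padding floor(fftLength / 2) samples per side,
// mirroring the `pad_mode: 'constant'` branch added to spectrogram().
function padConstant(waveform, fftLength) {
  const padding = Math.floor(fftLength / 2);
  // Typed arrays are zero-filled on construction, so the leading and
  // trailing `padding` entries are already 0.
  const padded = new Float64Array(waveform.length + 2 * padding);
  padded.set(waveform, padding);
  return padded;
}

padConstant(Float64Array.from([1, 2, 3]), 4); // length 3 + 2*2 = 7
```

Note the asymmetry with the existing `reflect` branch, which pads `floor((fft_length - 1) / 2) + 1` samples per side; the two modes follow different centering conventions, so the frame count they produce can differ by one.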
@@ -114,7 +114,7 @@ export default () => {
"default",
async () => {
const output = await pipe(audios[0], { max_new_tokens });
const target = { text: "<unk>K" };
const target = { text: "K" };
expect(output).toEqual(target);
},
MAX_TEST_EXECUTION_TIME,