How to obtain moshi response using API #157

treya-lin · 2024-11-21T09:11:22Z

Due diligence

I have done my due diligence in trying to find the answer myself.

Topic

The pytorch implementation

Question

Hello, thanks for your great work. I am trying the Python API to see whether I can feed existing audio files as simulated streaming input and obtain Moshi's reply, but it didn't work as expected, so I assume I am not using it properly. Could you kindly take a look?

My main question:

When I have an existing audio file and want Moshi to listen and respond to it, it always responds with a greeting first, and then either stays silent or (sometimes) says something. Does this mean I need to add a very long pause to wait for its reply? What is the best practice for making it reply to a given piece of speech?

Some other questions:
2. If I want to use my earlier input, Moshi's reply, and my new input together to get a new round of reply, how should I form the input? (That is, how should I hack it so that Moshi knows what she replied earlier?)
3. Can I exert more control over how it replies? Say, if I already have a script, can I make Moshi follow that script when conversing with me?

The code I used when trying to solve my main question (question 1):

Mostly borrowed from Moshi's README.
Modifications:
(1) I padded my input audio so that the number of samples is a multiple of 1920.
(2) I put the models in a local directory, so I changed loaders.DEFAULT_REPO.
(3) I added 4 seconds of silence at the end of my audio. Initially I didn't add silence and Moshi didn't produce a reply, so I thought I might need to simulate a human pause, but either way it didn't work properly.
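
The frame arithmetic behind modifications (1) and (3) can be sketched on its own, without loading any model. This is only an illustration under stated assumptions: the 24 kHz sample rate and the 1920-sample frame come from the post, while the helper name `pad_with_silence` and the 2.5 s input length are hypothetical.

```python
import numpy as np

SAMPLE_RATE = 24_000   # Hz, the rate passed to librosa.load in the post
FRAME_SIZE = 1_920     # samples per frame, the multiple used in the post


def pad_with_silence(wav: np.ndarray, silence_s: float) -> np.ndarray:
    """Append silence_s seconds of zeros, then zero-pad to a whole number of frames."""
    silence = np.zeros(int(SAMPLE_RATE * silence_s), dtype=wav.dtype)
    wav = np.concatenate((wav, silence))
    remainder = (-wav.shape[0]) % FRAME_SIZE  # samples missing from the last frame
    return np.pad(wav, (0, remainder))


# Hypothetical 2.5 s mono input, used only to show the arithmetic.
speech = np.zeros(int(2.5 * SAMPLE_RATE), dtype=np.float32)
padded = pad_with_silence(speech, silence_s=4.0)

n_frames = padded.shape[0] // FRAME_SIZE
frame_ms = 1000 * FRAME_SIZE / SAMPLE_RATE
print(f"{n_frames} frames of {frame_ms:.0f} ms each")
```

Each frame covers 80 ms, so 4 seconds of trailing silence gives the model only 50 extra steps in which to answer; trying a longer pause just means raising `silence_s`.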

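import os

import librosa
import numpy as np
import torch
import torchaudio
from moshi.models import loaders

# The models live in a local directory, so DEFAULT_REPO points at a path here.
loaders.DEFAULT_REPO = "/data/resources/models/kyutai/moshika-pytorch-bf16/"
device = "cuda"
mimi_weight = os.path.join(loaders.DEFAULT_REPO, loaders.MIMI_NAME)
mimi = loaders.get_mimi(mimi_weight, device=device)
mimi.set_num_codebooks(8)  # up to 32 for mimi, but limited to 8 for moshi.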

def padding(wav: np.ndarray, multiple: int = 1920) -> np.ndarray:
    """
    Pads the audio signal to make its length a multiple of the specified value.

    Parameters:
    - wav (np.ndarray): The input audio signal, shape: [T].
    - multiple (int): The target multiple to pad the length to.

    Returns:
    - np.ndarray: The zero-padded signal, whose length is a multiple of `multiple`.
    """
    if not isinstance(wav, np.ndarray):
        raise ValueError("Input wav must be a NumPy array.")

    if multiple <= 0:
        raise ValueError("Multiple must be a positive integer.")

    # Calculate the current length and the padding needed
    current_length = wav.shape[0]
    padding_length = (multiple - (current_length % multiple)) % multiple

    # Add zero-padding to the end of the audio
    if padding_length > 0:
        wav = np.pad(wav, (0, padding_length), mode='constant', constant_values=0)

    return wav

# Load the speech audio (wavpath is the path to the input file) and add 4 s of silence at the end.
wav, sr = librosa.load(wavpath, sr=24000, mono=True)
silence_duration = 4  # seconds
silence = np.zeros(int(sr * silence_duration), dtype=np.float32)
wav = np.concatenate((wav, silence))
wav = padding(wav)
wav = torch.tensor(wav).unsqueeze(0).unsqueeze(0).to(device)  # Shape: [B=1, C=1, T]

# encode the input
with torch.no_grad():
    nonstream_codes = mimi.encode(wav)  # [B, K = 8, T]
    non_stream_decoded = mimi.decode(nonstream_codes)

    # Supports streaming too.
    frame_size = int(mimi.sample_rate / mimi.frame_rate) # 1920
    all_codes = []
    with mimi.streaming(batch_size=1):
        for offset in range(0, wav.shape[-1], frame_size):
            frame = wav[:, :, offset: offset + frame_size]
            codes = mimi.encode(frame)
            assert codes.shape[-1] == 1, codes.shape
            all_codes.append(codes)

import gc
def clear_cache():
    gc.collect()
    torch.cuda.empty_cache()
    
out_wav_chunks = []
# NOTE: lm_gen is not defined in the snippet as posted; it is presumably the LMGen
# wrapper around the Moshi LM, built as in the README.
# Now we will stream over both Moshi I/O, and decode on the fly with Mimi.
with torch.no_grad(), lm_gen.streaming(1), mimi.streaming(1):
    for idx, code in enumerate(all_codes):
        tokens_out = lm_gen.step(code.cuda())
        # tokens_out is [B, 1 + 8, 1], with tokens_out[:, 0] representing the text token.
        if tokens_out is not None:
            wav_chunk = mimi.decode(tokens_out[:, 1:])
            out_wav_chunks.append(wav_chunk)
        print(idx, end='\r')
out_wav = torch.cat(out_wav_chunks, dim=-1)
clear_cache()

# save the output file
out_wav_np = out_wav.squeeze().cpu().numpy()
output_path = "output_moshi.wav"
torchaudio.save(output_path, torch.tensor(out_wav_np).unsqueeze(0), sample_rate=24000)

I tried many times with audio files of different lengths, but it always just returned Moshi saying something like "hey what's up" or "hey how's it going". Once or twice it replied with something meaningful after the greeting, but I still hope it can just "listen" to my words and reply without always greeting first. I am trying to look into the code too, but I think I am not doing it the proper way. Could you please give more guidance on how to use the API to play around with it? Thank you! Any suggestion is much appreciated!
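
To pin down when the greeting starts and how long the output then stays silent, one option is to measure the level of each 80 ms frame of the generated audio. The sketch below is a generic NumPy diagnostic, not part of the Moshi API: the function names and the 0.01 RMS threshold are assumptions, and the synthetic `demo` signal merely stands in for `out_wav_np`.

```python
import numpy as np

SAMPLE_RATE = 24_000  # Hz, as in the post
FRAME_SIZE = 1_920    # samples per 80 ms frame, as in the post


def frame_rms(wav: np.ndarray, frame_size: int = FRAME_SIZE) -> np.ndarray:
    """Return the RMS level of each full frame of a mono signal of shape [T]."""
    n_frames = wav.shape[0] // frame_size
    frames = wav[: n_frames * frame_size].reshape(n_frames, frame_size)
    return np.sqrt(np.mean(frames.astype(np.float64) ** 2, axis=1))


def active_frames(wav: np.ndarray, threshold: float = 0.01) -> np.ndarray:
    """Indices of frames whose RMS exceeds the (arbitrarily chosen) threshold."""
    return np.flatnonzero(frame_rms(wav) > threshold)


# Synthetic stand-in for out_wav_np: 3 silent frames, 2 tone frames, 3 silent frames.
t = np.arange(2 * FRAME_SIZE) / SAMPLE_RATE
demo = np.concatenate((
    np.zeros(3 * FRAME_SIZE, dtype=np.float32),
    (0.5 * np.sin(2 * np.pi * 220 * t)).astype(np.float32),
    np.zeros(3 * FRAME_SIZE, dtype=np.float32),
))

idx = active_frames(demo)
print("active frames:", idx.tolist())
print("first activity at %.2f s" % (idx[0] * FRAME_SIZE / SAMPLE_RATE))
```

Running the same function on both the input and the output waveform would show whether the greeting overlaps the input speech and whether anything is produced during the trailing silence.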


treya-lin · 2024-11-21T10:08:31Z

Examples (GitHub does not accept WAV uploads, so I had to upload them as WebM, sorry):

In this example, Moshi only greets and does not reply with any meaningful content; it greets and then remains silent until the end.
A_4.webm
output_moshi_4.webm
This is one of the very few times when it did reply, but it does not respond like this consistently; sometimes it only greets. And I don't understand why it greets when the input is talking?
A_0.webm
output_moshi.webm

treya-lin added the "question" label (Further information is requested) on Nov 21, 2024