BGM removal unusable #424

Closed
starloreh opened this issue Dec 15, 2024 · 8 comments
Labels: bug (Something isn't working)

@starloreh

Which OS are you using?

  • OS: Colab
    The progress stays at 0%, then the process consumes all the RAM, and finally it reports this error:
 /usr/local/lib/python3.10/dist-packages/onnx2pytorch/convert/layer.py:30: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:206.)
  layer.weight.data = torch.from_numpy(numpy_helper.to_array(weight))
/usr/local/lib/python3.10/dist-packages/uvr/utils/fastio.py:46: UserWarning: PySoundFile failed. Trying audioread instead.
  signal, sampling_rate = librosa.load(path, sr=None, mono=False)
/usr/local/lib/python3.10/dist-packages/librosa/core/audio.py:184: FutureWarning: librosa.core.audio.__audioread_load
	Deprecated as of librosa version 0.10.0.
	It will be removed in librosa version 1.0.
  y, sr_native = __audioread_load(path, offset, duration, dtype)
/usr/local/lib/python3.10/dist-packages/uvr/models_dir/mdx/mdx_interface.py:254: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:278.)
  mix_part = torch.tensor([mix_part_], dtype=torch.float32).to(device)
/usr/local/lib/python3.10/dist-packages/torch/functional.py:704: UserWarning: stft with return_complex=False is deprecated. In a future pytorch release, stft will return complex tensors for all inputs, and return_complex=False will raise an error.
Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at ../aten/src/ATen/native/SpectralOps.cpp:873.)
  return _VF.stft(  # type: ignore[attr-defined]
/usr/local/lib/python3.10/dist-packages/uvr/models_dir/mdx/mdx_interface.py:189: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  return stft.inverse(torch.tensor(spec_pred).to(device)).cpu().detach().numpy()

On my CPU-only machine (5700G, 32 GB RAM), the same thing happened. What could be the cause? Just in case, I converted the Opus audio to MP3 and Ogg Vorbis on that machine, and the same thing still happened.

@starloreh starloreh added the bug Something isn't working label Dec 15, 2024
@jhj0517
Owner

jhj0517 commented Dec 15, 2024

Hi @starloreh. The stack trace doesn't show any errors, only warnings, which you can ignore.
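
To illustrate the difference, here is a minimal sketch using only Python's standard `warnings` module. The function names `load_weights` and `load_missing_file` are hypothetical and are not taken from the project or from UVR; they only stand in for "code that warns" and "code that fails".

```python
import warnings


def load_weights():
    # A warning reports a potential problem but does not stop execution.
    warnings.warn("array is not writable", UserWarning)
    return "weights loaded"


def load_missing_file():
    # An exception, by contrast, aborts the call unless it is caught.
    raise FileNotFoundError("model file not found")


# Record warnings instead of printing them, so they can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = load_weights()

print(result)                                  # the call still returned
print([w.category.__name__ for w in caught])   # categories of recorded warnings

try:
    load_missing_file()
    failed = False
except FileNotFoundError as exc:
    failed = True
    print("error:", exc)
```

Lines tagged `UserWarning` or `FutureWarning` in the log above fall in the first group: the call that emitted them still completed.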

The progress is stuck at 0% because it doesn't actually track the job; I simply set it to 0% until the separation job is fully completed.

This is confusing and should be avoided; I should have said so explicitly in the progress message.
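
For reference, one way real progress could be reported is to process the input in pieces and invoke a callback after each piece. This is only a hypothetical sketch: `process_with_progress`, `process_chunk`, and `on_progress` are illustrative names, and the project's actual separation code is not shown in this thread.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def process_with_progress(
    chunks: Sequence[T],
    process_chunk: Callable[[T], R],
    on_progress: Callable[[float], None],
) -> List[R]:
    """Process chunks in order and report the completed fraction after each."""
    results: List[R] = []
    total = len(chunks)
    for index, chunk in enumerate(chunks, start=1):
        results.append(process_chunk(chunk))
        on_progress(index / total)  # fraction in (0, 1]
    return results


# Toy usage: "processing" a chunk just sums it.
reported: List[float] = []
outputs = process_with_progress(
    [[1, 2], [3, 4], [5, 6], [7, 8]],
    process_chunk=sum,
    on_progress=reported.append,
)
print(outputs)
print(reported)
```

In a UI, the callback would update the progress bar instead of appending to a list.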

The model was probably performing the separation, just very slowly on the CPU.

The UVR models are meant to run on a GPU, not a CPU. I believe the speed difference is more than 10x.
If you test with just 2 or 3 seconds of audio, you will see the job finish within 2 minutes on the CPU.
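
A short test clip can be cut with Python's standard `wave` module. This is a minimal sketch that assumes an uncompressed PCM WAV input; Opus, MP3, or Ogg files would need to be converted first with another tool. `trim_wav` is a hypothetical helper, not part of the WebUI, and the demo generates its own silent file so it runs on its own.

```python
import os
import tempfile
import wave


def trim_wav(src_path: str, dst_path: str, seconds: float) -> int:
    """Copy the first `seconds` of a PCM WAV file; return the frames written."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        n_frames = min(src.getnframes(), int(seconds * src.getframerate()))
        frames = src.readframes(n_frames)
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)    # the frame count in the header is fixed on close
        dst.writeframes(frames)
    return n_frames


# Demo input: 5 s of 16-bit mono silence at 8 kHz.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "full.wav")
dst_path = os.path.join(tmp_dir, "sample.wav")
with wave.open(src_path, "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)
    f.setframerate(8000)
    f.writeframes(b"\x00\x00" * 8000 * 5)

written = trim_wav(src_path, dst_path, seconds=3)
with wave.open(dst_path, "rb") as f:
    trimmed_frames = f.getnframes()
print(written, trimmed_frames)
```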

Because UVR models are so slow on the CPU, running them on a GPU is strongly recommended.

If you're using Colab, you can try the T4 GPU runtime (a free runtime).

@starloreh
Author

But I was using cuda with T4!
The Japanese video is 1 hour 40 minutes long, though. The system RAM fills up, but the GPU RAM doesn't seem to budge.
[Screenshots: T4, cuda]
Great WebUI, btw! faster-whisper-large-v3-turbo-ct2 is pretty fast on my CPU and good enough for transcribing Western European languages.
I wanted to use faster-whisper-large-v2 with Silero and BGM removal to improve Japanese transcription as well as translation to English.

@jhj0517
Owner

jhj0517 commented Dec 15, 2024

> I wanted to use faster-whisper-large-v2 with silero and BGM removal

There's a suspected bug that seems to be related to this:
if you run really long audio (more than 1 hour, like yours) with VAD, it causes an OOM error.

So I guess it's related to the VAD, not the BGM separation.
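
As a rough sense of scale for long inputs, the arithmetic below estimates the size of one fully decoded copy of the audio. The 44.1 kHz sample rate, stereo layout, and float32 samples are assumptions; the thread does not state the actual format, and `decoded_size_gib` is a hypothetical helper.

```python
def decoded_size_gib(
    duration_s: float,
    sample_rate: int = 44_100,   # assumed, not stated in the thread
    channels: int = 2,           # assumed stereo
    bytes_per_sample: int = 4,   # assumed float32
) -> float:
    """Size in GiB of one uncompressed, in-memory copy of the audio."""
    total_bytes = duration_s * sample_rate * channels * bytes_per_sample
    return total_bytes / 2**30


duration_s = 1 * 3600 + 40 * 60   # 1 hour 40 minutes
size_gib = decoded_size_gib(duration_s)
print(f"{size_gib:.2f} GiB per decoded copy")
```

Under these assumptions a single copy is close to 2 GiB, and any step that holds several full-length arrays at once multiplies that figure.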

This is now fixed in faster-whisper, but the new version hasn't been released yet.

I'm considering installing faster-whisper directly from the repository.

@jhj0517
Owner

jhj0517 commented Dec 18, 2024

This should be fixed in #428.

Please feel free to reopen!

@jhj0517 jhj0517 closed this as completed Dec 18, 2024
@starloreh
Author

Actually, it still doesn't work. I just tried it on the new Colab, and it runs out of RAM without touching the GPU, even though the cuda option was chosen.
[Screenshot: notenoughram]

@jhj0517
Owner

jhj0517 commented Dec 21, 2024

@starloreh I tried to reproduce it myself with a 2-hour video, but I couldn't. Everything worked fine on my end.

I'm not sure what caused the problem.
Just to be sure, are you using the latest version of the notebook here?

@starloreh
Author

Yes. To compare, I tried https://github.com/Eddycrack864/UVR5-UI with UVR-MDX-NET-Inst_HQ_5, and it almost finished before it crashed, but that was a different audio file of 2 h 20 min.
[Screenshots: 2024-12-21 11-01-02, 2024-12-21 10-57-32]
I'll keep trying and report back.

@starloreh
Author

Now I've tried it with shorter audio, and it does work. Sadly, the output of faster-whisper large-v2 for Japanese is still pretty disappointing, even with the BGM removed. 🥲
Anyway, great WebUI. 😄
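
For anyone hitting the same limit with long files, one possible workaround is to split the audio into shorter segments and process them one at a time. The sketch below only computes the segment boundaries; `chunk_ranges` is a hypothetical helper, the 30-minute segment length is an arbitrary assumption, and cutting at fixed times can split a sentence in half, so the boundaries may need adjusting by hand.

```python
from typing import List, Tuple


def chunk_ranges(total_s: int, chunk_s: int) -> List[Tuple[int, int]]:
    """Return (start, end) second offsets covering [0, total_s) in order."""
    if chunk_s <= 0:
        raise ValueError("chunk_s must be positive")
    ranges = []
    start = 0
    while start < total_s:
        end = min(start + chunk_s, total_s)  # last segment may be shorter
        ranges.append((start, end))
        start = end
    return ranges


# A 1 h 40 min file (6000 s) in 30-minute (1800 s) segments.
ranges = chunk_ranges(6000, 1800)
print(ranges)
```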
