f5-tts is making weird dubbing , you can see in provided audio and srt its horrible, but its working fine in WebView, why cant it create audio properly pyvideotrans ? #636

Open
abhijeet12s opened this issue Nov 25, 2024 · 25 comments

Comments

@abhijeet12s

abhijeet12s commented Nov 25, 2024

Error message
f5-tts is producing weird dubbing; as you can hear in the provided audio and SRT, the result is horrible. It works fine in the WebUI, so why can't pyvideotrans create the audio properly?

srt : 1
00:00:00,000 --> 00:00:02,366
In a world where everyone has awakened, a world of advanced talents,

2
00:00:02,500 --> 00:00:05,716
a man chooses to become a jobless wanderer. His classmates mercilessly mock him,

3
00:00:05,783 --> 00:00:08,366
saying this talent can't compare to the advanced skills gained after a job change.

4
00:00:08,433 --> 00:00:09,933
They tell Yun Chen to quickly find a place to work.

5
00:00:09,933 --> 00:00:12,783
Even Teacher Rose advises Yun Chen to choose a professional talent soon,

6
00:00:12,833 --> 00:00:14,816
because the benefits after changing jobs are much greater.

Audio generated by f5-tts:
https://drive.google.com/file/d/1ZRgKFunyf-LQiLfpvqs5fhCC3kNj2v-x/view?usp=sharing

Reproduction steps

  1. Which feature was used
  2. faster mode / openai mode?
  3. Name of the model used

Operating system

abhijeet12s changed the title from "f5-tts is making weird dubbing , you can see in provided audio and srt its horrible" to "f5-tts is making weird dubbing , you can see in provided audio and srt its horrible, but its working fine in WebView, why cant it create audio properly pyvideotrans ?" on Nov 25, 2024
@jianchang512
Owner

Numbers in English text are not normalized yet. I will change that later.

@abhijeet12s
Author

> Figures in English are not normalized. Will change it later

Still not working after the update to 3.20.

@abhijeet12s
Author

Still not working in pyvideotrans version 3.21, as you can hear in the audio linked below. When will it be fixed?
srt : 1
00:00:00,000 --> 00:00:02,366
In a world where everyone has awakened, a world of advanced talents,

2
00:00:02,500 --> 00:00:05,716
a man chooses to become a jobless wanderer. His classmates mercilessly mock him,

3
00:00:05,783 --> 00:00:08,366
saying this talent can't compare to the advanced skills gained after a job change.

4
00:00:08,433 --> 00:00:09,933
They tell Yun Chen to quickly find a place to work.

5
00:00:09,933 --> 00:00:12,783
Even Teacher Rose advises Yun Chen to choose a professional talent soon,

6
00:00:12,833 --> 00:00:14,816
because the benefits after changing jobs are much greater.

audio created : https://drive.google.com/file/d/1uDIe3hgjYU2vt1XKkFIid3-C_jKKeXou/view?usp=sharing

@jianchang512
Owner

Please use plain text or a valid SRT file for dubbing. Do not add other characters before the subtitles, otherwise the timestamps will be read aloud as well.

Use the import function to load a valid local SRT file directly for dubbing.

@abhijeet12s
Author

abhijeet12s commented Nov 26, 2024

It is still not working. I am doing everything correctly, as you can see in the video I uploaded. Could you please look into it? You can hear the audio it created at 2:34.
Link: https://drive.google.com/file/d/1nkhXfwMBTCQrj5E_Tnabs533AgsYL8Rr/view?usp=sharing

Recording.2024-11-26.172353.mp4

@jianchang512
Owner

Please explain in words what the problem is.

Is it reading out the line numbers and the timestamps as well?

@jianchang512
Owner

jianchang512 commented Nov 26, 2024

Delete <b> and other HTML tags from the SRT file.

@abhijeet12s
Author

No. I have shown the audio created both with the <b> tag and without it, using a plain SRT, but it generates weird sounds instead of reading the SRT.

@jianchang512
Owner

Make sure the SRT is valid and contains no HTML tags or similar markup, then rename the subtitle file to exp-01.srt and test it again.
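For readers who want to check a file before dubbing, here is a minimal Python sketch of the two checks suggested in this thread: every cue should have a numeric index and a well-formed timing line, and HTML-style tags should be removed from the subtitle text. The function name `clean_srt` and the regular expressions are illustrative assumptions, not part of pyvideotrans.

```python
import re

# Matches simple HTML-style tags such as <b>, </b>, or <font color="red">.
TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")
# Matches an SRT timing line, e.g. 00:00:02,500 --> 00:00:05,716
TIMING_RE = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$")


def clean_srt(text: str) -> str:
    """Return SRT text with HTML tags removed; raise ValueError on a malformed cue."""
    cleaned_blocks = []
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [line.strip() for line in block.splitlines()]
        # A cue needs an index line, a timing line, and at least one text line.
        if len(lines) < 3 or not lines[0].isdigit() or not TIMING_RE.match(lines[1]):
            raise ValueError(f"malformed cue: {block!r}")
        body = [TAG_RE.sub("", line) for line in lines[2:]]
        cleaned_blocks.append("\n".join(lines[:2] + body))
    return "\n\n".join(cleaned_blocks) + "\n"
```

A cue whose first line is something like `srt : 1` instead of a bare number is rejected by this sketch, which is the kind of stray leading text warned about earlier in the thread.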

@abhijeet12s
Author

abhijeet12s commented Nov 26, 2024

I tried again as you suggested. You can hear the sound it created at 02:58. Does it support any format other than SRT?
You can listen to it here:

video2_2.mp4

srt :
0
00:00:00,000 --> 00:00:02,366
In a world where everyone has awakened, a world of advanced talents,

1
00:00:02,500 --> 00:00:05,716
a man chooses to become a jobless wanderer. His classmates mercilessly mock him,

2
00:00:05,783 --> 00:00:08,366
saying this talent can't compare to the advanced skills gained after a job change.

3
00:00:08,433 --> 00:00:09,933
They tell Yun Chen to quickly find a place to work.

4
00:00:09,933 --> 00:00:12,783
Even Teacher Rose advises Yun Chen to choose a professional talent soon,

5
00:00:12,833 --> 00:00:14,816
because the benefits after changing jobs are much greater.

@abhijeet12s
Author

It always sounds like a mix of English and French.

@jianchang512
Owner

You could have just typed the text in like this.

image

If the problem is pronunciation rather than formatting, this will not solve it.

Alternatively, open the api.py file under f5-tts-api and modify it, using the source code as a reference.

@abhijeet12s
Author

abhijeet12s commented Nov 26, 2024 via email

@jianchang512
Owner

image
You can enter text directly for dubbing.

If the dubbed audio is otherwise fine and only the pronunciation is off, as you describe (like a mix of English and French), then it is not an error.

@abhijeet12s
Author

abhijeet12s commented Nov 26, 2024

I did that, but it still doesn't work and produces audio that sounds weird. Does it sound all right on your PC?
Here you can listen to the audio it created:

audio.mp4

@jianchang512
Owner

1
00:00:00,000 --> 00:00:02,366
In a world where everyone has awakened, a world of advanced talents,

2
00:00:02,500 --> 00:00:05,716
a man chooses to become a jobless wanderer. His classmates mercilessly mock him,

3
00:00:05,783 --> 00:00:08,366
saying this talent can't compare to the advanced skills gained after a job change.

4
00:00:08,433 --> 00:00:09,933
They tell Yun Chen to quickly find a place to work.

5
00:00:09,933 --> 00:00:12,783
Even Teacher Rose advises Yun Chen to choose a professional talent soon,

6
00:00:12,833 --> 00:00:14,816
because the benefits after changing jobs are much greater.

image

I tested it and had no problem.

@abhijeet12s
Author

abhijeet12s commented Nov 28, 2024

I tested the cloned voice using f5-tts in pyvideotrans 3.25, but the issue of the voice being unrecognizable is still unresolved. Interestingly, the same voice works perfectly in WEBUI, but not in pyvideotrans.

To rule out system-specific issues, I also tested it on my friend's PC. Unfortunately, it didn’t work in pyvideotrans there either, though it still worked fine in WEBUI.

@jianchang512
Owner

The WebUI performs recognition with openai-whisper's large-v3-turbo model, and the audio is split using VAD before recognition.

In pyvideotrans, the API performs recognition with the model you specify, and the audio is split differently.

Differences between the two are normal: the models are different and the splitting parameters are different, so the results cannot be the same.

@abhijeet12s
Author

When will this issue be solved? I can't clone a voice in pyvideotrans.

@jianchang512
Owner

I don't understand what you mean. If you mean that it works well in the WebUI and poorly through the API, then that is normal.

If you mean that it works fine in the WebUI but the audio cloned through the API does not correspond to the actual text at all, then I have not tested that.

@abhijeet12s
Author

abhijeet12s commented Nov 29, 2024 via email

@abhijeet12s
Author

abhijeet12s commented Nov 29, 2024 via email

@abhijeet12s
Author

abhijeet12s commented Nov 29, 2024

I gave it an SRT of only 22 seconds, but it created audio made up of repeated nonsense running to 2 minutes 26 seconds.
The SRT I gave it to clone the voice:
1
00:00:00,000 --> 00:00:05,660
在入学典礼上,一群充满期待的学生挤满了体育场。

2
00:00:06,020 --> 00:00:08,440
他们的注意力全都集中在舞台上。

3
00:00:08,600 --> 00:00:14,240
一位宿舍老师在台上宣布,让我们热烈欢迎军队武术教官——

4
00:00:14,240 --> 00:00:18,680
张凯教官上台。

5
00:00:18,880 --> 00:00:22,060
他既是你们的校长,也是你们的老师。

Transcription of the cloned voice it created:
1
00:00:00,000 --> 00:00:10,940
一位宿舍是内安中,

2
00:00:11,799 --> 00:00:12,740
Hello, my friend,

3
00:00:12,740 --> 00:00:13,980
我是盲学生店里上进。

4
00:00:16,400 --> 00:00:17,860
并发财能出,

5
00:00:18,440 --> 00:00:21,440
仅满的学生充满学育群。

6
00:00:22,020 --> 00:00:23,640
又存在my friend,

7
00:00:24,520 --> 00:00:27,160
一台在入学的入学店里的

8
00:00:27,160 --> 00:00:27,840
老朋友。

9
00:00:30,000 --> 00:00:30,960
他需要联系

10
00:00:30,960 --> 00:00:33,360
我们这一封说愿素脚本

11
00:00:33,360 --> 00:00:35,320
但牛顿需要确定的词

12
00:00:36,140 --> 00:00:36,700

13
00:00:37,699 --> 00:00:39,060
How do you

14
00:00:40,239 --> 00:00:42,340
Is Lynn all my friend?

15
00:00:43,640 --> 00:00:44,720
Cassandra 应

16
00:00:45,400 --> 00:00:46,900
一全都集中在

17
00:00:47,980 --> 00:00:50,540
他们居然一全在舞台上

18
00:00:51,300 --> 00:00:52,740
天然的人生

19
00:00:52,740 --> 00:00:54,420
You know my dear Freddie Frank

20
00:00:54,420 --> 00:00:54,640
E

21
00:00:56,339 --> 00:00:57,680
The English Beeman

22
00:00:57,680 --> 00:00:59,880
一郎出现了

23
00:01:00,540 --> 00:01:01,920
一位宿舍是

24
00:01:02,900 --> 00:01:03,820
内安中

25
00:01:04,660 --> 00:01:06,860
Hello 吗是盲学生店里上进

26
00:01:09,060 --> 00:01:11,200
被引发财云抽而上

27
00:01:11,200 --> 00:01:12,720
挤满了学生

28
00:01:12,720 --> 00:01:14,280
充满了山雨群

29
00:01:14,979 --> 00:01:16,520
又存在my friend
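The length mismatch reported in this comment can be quantified by reading the end time of the last cue in each SRT. The following is a small Python sketch for that; the helper name `srt_end_seconds` is an assumption made for illustration and is not a pyvideotrans function.

```python
import re

# Captures start and end timestamps of an SRT timing line as eight number groups.
TIMING_RE = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})"
)


def srt_end_seconds(text: str) -> float:
    """Return the latest cue end time in seconds (0.0 if no timing line is found)."""
    end = 0.0
    for match in TIMING_RE.finditer(text):
        # Groups 5 to 8 hold the end timestamp: hours, minutes, seconds, milliseconds.
        hours, minutes, seconds, millis = (int(g) for g in match.groups()[4:])
        end = max(end, hours * 3600 + minutes * 60 + seconds + millis / 1000)
    return end
```

Applied to the input SRT quoted in this comment, whose last cue ends at 00:00:22,060, the function returns about 22.06 seconds, which can then be compared with the length of the generated audio file.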

@jianchang512
Owner

jianchang512 commented Nov 29, 2024

en.mp4
1
00:00:01,950 --> 00:00:04,430
Several molecules have been found in the Five Elder Star Systems,

2
00:00:04,720 --> 00:00:06,780
We are still a long way from the third kind of contact.

3
00:00:07,260 --> 00:00:09,880
We have really started the photography mission on Weibo for a year,

4
00:00:10,140 --> 00:00:12,920
Recently,many photos that were difficult to take in the past have been uploaded.

5
00:00:13,440 --> 00:00:17,500
In early June,astronomers published this photo in Nature Periodicals,


I tested it without any problem.

Please make sure that f5-tts-api has the patch package installed and that pyvideotrans is upgraded to 3.26, and please make sure the reference audio and reference text are correct.

It is normal for the subtitle duration to differ from the dubbing duration.

image

The 5s.wav in the picture above is the reference audio, and the text after # is the transcript of that reference audio.

5s.wav is stored in the f5-tts folder, in the same directory as sp.exe.
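As a rough illustration of the pairing described above, the Python sketch below splits an entry of the form `audio file#reference text` and checks that both halves are present and that the audio file exists. The exact entry syntax is inferred from the description here (the screenshot is not reproduced), and `parse_reference` is a hypothetical helper, not part of pyvideotrans or f5-tts-api.

```python
from pathlib import Path


def parse_reference(entry: str, ref_dir: Path) -> tuple[Path, str]:
    """Split '<audio file>#<reference text>' and check that both parts are usable."""
    audio_name, sep, ref_text = entry.partition("#")
    audio_name, ref_text = audio_name.strip(), ref_text.strip()
    if not sep or not audio_name or not ref_text:
        raise ValueError("expected '<audio file>#<text spoken in that audio>'")
    # Assumption: reference audio lives in the given folder (e.g. the f5-tts folder).
    audio_path = ref_dir / audio_name
    if not audio_path.is_file():
        raise FileNotFoundError(f"reference audio not found: {audio_path}")
    return audio_path, ref_text
```

A check like this cannot tell whether the text actually matches what is spoken in the audio; that still has to be verified by listening to the reference clip.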

f5-tts-api patch update

https://github.com/jianchang512/f5-tts-api/releases/tag/v0.1

https://github.com/jianchang512/f5-tts-api/releases/download/v0.1/2024-1127-buding.7z

@abhijeet12s
Author

abhijeet12s commented Nov 29, 2024

It's solved! I realized I was making one fatal mistake, which is why the audio was pronouncing words out of recognition when cloning. The mistake was that after the #, I was putting whatever I wanted. I thought it was just for testing whether the API worked or not.

However, I realized from the recent solution you provided that the text after the # should correspond to the reference audio.

I'm sorry for causing extra work due to my mistake.😔😔😭
Screenshot 2024-11-29 221615
