Add WORLD pitch estimators and F0 range as hyperparameters (openvpi#149)

* Added WORLD pitch estimators
  i also removed hardcoded F0 ranges because what the heck is that 800 Hz max pitch in parselmouth that is way too low
* Update README.md
  okay maybe don't add ur own flare in the readme if u actually want to create a pull req
* Apply dtype change
  i saw it in the parselmouth thing might as well put it in to make sure
* Update pw.py
  oops
* fix dtype mismatch
  for some reason pyworld only likes float64?
* add f0 range as a hyperparameter
  why isn't it a hyperparameter in the first place
* move pad_frames to pw
  i think world is p accurate with the frames stuff but it's just to ensure
* change padding algorithm
  it's just to be similar to the parselmouth one.. it makes sense to not center the F0 after all
* remove duplicate line
  yeah
* Remove DIO, change default range, add docs
* Add notice for F0 range
UtaUtaUtau · Nov 20, 2023 · 931df27
1 parent fbff2e8
commit 931df27

Showing 5 changed files with 48 additions and 3 deletions.
diff --git a/configs/base.yaml b/configs/base.yaml
@@ -76,6 +76,8 @@ train_set_name: 'train'
 valid_set_name: 'valid'
 pe: 'parselmouth'
 pe_ckpt: ''
+f0_min: 65
+f0_max: 800
 vocoder: ''
 vocoder_ckpt: ''
 num_valid_plots: 10
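
Review note: since `f0_min` and `f0_max` are now user-editable, a sanity check at load time may be worth considering. The sketch below is not part of this commit. `resolve_f0_range` is a hypothetical helper, and the fallback values 65 and 800 simply mirror the defaults added to `configs/base.yaml` above.

```python
def resolve_f0_range(hparams):
    """Read the F0 range from a config dict and sanity-check it.

    Hypothetical helper for illustration; not part of this commit.
    Falls back to the defaults added to configs/base.yaml (65 and 800 Hz).
    """
    f0_min = float(hparams.get("f0_min", 65))
    f0_max = float(hparams.get("f0_max", 800))
    nyquist = hparams["audio_sample_rate"] / 2

    # The floor must be positive and strictly below the ceiling.
    if not 0 < f0_min < f0_max:
        raise ValueError(f"Invalid F0 range: f0_min={f0_min}, f0_max={f0_max}")
    # A ceiling at or above the Nyquist frequency cannot be detected.
    if f0_max >= nyquist:
        raise ValueError(f"f0_max={f0_max} must be below Nyquist ({nyquist} Hz)")
    return f0_min, f0_max


if __name__ == "__main__":
    print(resolve_f0_range({"audio_sample_rate": 44100, "f0_min": 80, "f0_max": 1000}))
```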

diff --git a/docs/BestPractices.md b/docs/BestPractices.md
@@ -206,6 +206,21 @@ pe: rmvpe
 pe_ckpt: checkpoints/rmvpe/model.pt
 ```
 
+### Harvest
+
+[Harvest](https://github.com/mmorise/World) (Harvest: A high-performance fundamental frequency estimator from speech signals) is the recommended pitch extractor from Masanori Morise's WORLD, free software for high-quality speech analysis, manipulation and synthesis. It is a state-of-the-art algorithmic pitch estimator designed for speech, but it has also seen use in singing voice synthesis. It is the slowest of the extractors listed here, but on clean, normal recordings it provides more accurate F0 than parselmouth.
+
+To use Harvest, simply include the following line in your configuration file:
+```yaml
+pe: harvest
+```
+
+**Note:** It is also recommended to change the F0 detection range for Harvest in accordance with your dataset, as the range limits are hard boundaries for this algorithm and the defaults might not suffice for most use cases. To change the F0 detection range, include or edit this part of the configuration file:
+```yaml
+f0_min: 65 # Minimum F0 to detect
+f0_max: 800 # Maximum F0 to detect
+```
+
 ## Performance tuning
 
 This section is about accelerating training and utilizing hardware.
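
Review note: the docs above recommend matching the F0 range to the dataset but do not say how. One possible approach, sketched here under assumptions, is to derive the bounds from the lowest and highest notes in the dataset. `midi_to_hz` and `suggest_f0_range` are hypothetical helpers, and the 3-semitone safety margin is an arbitrary choice, not something this commit prescribes.

```python
import math


def midi_to_hz(note):
    """Convert a MIDI note number to Hz (equal temperament, A4 = 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((note - 69) / 12)


def suggest_f0_range(lowest_midi, highest_midi, margin_semitones=3):
    """Suggest (f0_min, f0_max) in Hz from a dataset's note range.

    Hypothetical helper: widens the range by a margin (assumed 3 semitones)
    so vibrato and pitch slides near the extremes are not clipped.
    """
    f0_min = midi_to_hz(lowest_midi - margin_semitones)
    f0_max = midi_to_hz(highest_midi + margin_semitones)
    # Round outward so the suggested range never shrinks.
    return math.floor(f0_min), math.ceil(f0_max)


if __name__ == "__main__":
    # Example: a dataset spanning A2 (MIDI 45) to C6 (MIDI 84).
    f0_min, f0_max = suggest_f0_range(45, 84)
    print(f"f0_min: {f0_min}")
    print(f"f0_max: {f0_max}")
```

The printed values can be pasted into the configuration file as `f0_min` and `f0_max`.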

diff --git a/modules/pe/__init__.py b/modules/pe/__init__.py
@@ -1,6 +1,7 @@
 from utils import hparams
 
 from .pm import ParselmouthPE
+from .pw import HarvestPE
 from .rmvpe import RMVPE
 
 
@@ -11,5 +12,7 @@ def initialize_pe():
         return ParselmouthPE()
     elif pe == 'rmvpe':
         return RMVPE(pe_ckpt)
+    elif pe == 'harvest':
+        return HarvestPE()
     else:
         raise ValueError(f" [x] Unknown f0 extractor: {pe}")
diff --git a/modules/pe/pw.py b/modules/pe/pw.py
@@ -0,0 +1,25 @@
+from basics.base_pe import BasePE
+import numpy as np
+import pyworld as pw
+from utils.pitch_utils import interp_f0
+
+class HarvestPE(BasePE):
+    def get_pitch(self, waveform, length, hparams, interp_uv=False, speed=1):
+        hop_size = int(np.round(hparams['hop_size'] * speed))
+
+        time_step = 1000 * hop_size / hparams['audio_sample_rate']
+        f0_floor = hparams['f0_min']
+        f0_ceil = hparams['f0_max']
+
+        f0, _ = pw.harvest(waveform.astype(np.float64), hparams['audio_sample_rate'], f0_floor=f0_floor, f0_ceil=f0_ceil, frame_period=time_step)
+        f0 = f0.astype(np.float32)
+
+        if f0.size < length:
+            f0 = np.pad(f0, (0, length - f0.size))
+        f0 = f0[:length]
+        uv = f0 == 0
+
+        if interp_uv:
+            f0, uv = interp_f0(f0, uv)
+        return f0, uv
+
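
Review note: the arithmetic in `HarvestPE.get_pitch` can be checked in isolation. The sketch below restates three of its steps using only the standard library, so it runs without `pyworld` or `numpy`. The function names are hypothetical and the list-based logic is a stand-in for the `np.pad` and slicing calls in the diff, not the code that ships.

```python
def frame_period_ms(hop_size, sample_rate):
    """Hop size in samples -> frame period in milliseconds.

    The diff passes this value to pw.harvest as frame_period.
    """
    return 1000 * hop_size / sample_rate


def fit_to_length(f0, length):
    """Right-pad with zeros or truncate so the curve has exactly `length` frames."""
    padded = list(f0) + [0.0] * max(0, length - len(f0))
    return padded[:length]


def unvoiced_mask(f0):
    """Frames with an F0 of exactly 0 are treated as unvoiced."""
    return [value == 0 for value in f0]


if __name__ == "__main__":
    print(frame_period_ms(512, 44100))       # hop of 512 samples at 44.1 kHz
    curve = fit_to_length([220.0, 0.0, 221.5], 5)
    print(curve)
    print(unvoiced_mask(curve))
```

Note that the zero padding is applied only on the right, so padded frames are reported as unvoiced unless `interp_uv` later fills them in.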
diff --git a/utils/binarizer_utils.py b/utils/binarizer_utils.py
@@ -32,13 +32,13 @@ def get_pitch_parselmouth(wav_data, length, hparams, speed=1, interp_uv=False):
     """
     hop_size = int(np.round(hparams['hop_size'] * speed))
     time_step = hop_size / hparams['audio_sample_rate']
-    f0_min = 65
-    f0_max = 800
+    f0_min = hparams['f0_min']
+    f0_max = hparams['f0_max']
 
     l_pad = int(np.ceil(1.5 / f0_min * hparams['audio_sample_rate']))
     r_pad = hop_size * ((len(wav_data) - 1) // hop_size + 1) - len(wav_data) + l_pad + 1
     wav_data = np.pad(wav_data, (l_pad, r_pad))
-    
+
     # noinspection PyArgumentList
     s = parselmouth.Sound(wav_data, sampling_frequency=hparams['audio_sample_rate']).to_pitch_ac(
         time_step=time_step, voicing_threshold=0.6,