What to do about misattribution? #310

ErfolgreichCharismatisch · 2025-02-28T22:43:33Z

Is there a threshold that can be set to improve correct attribution?

MahmoudAshraf97 · 2025-03-01T09:58:31Z

These are the settings you can play with to change the diarization output:

whisper-diarization/nemo_msdd_configs/diar_infer_telephonic.yaml

Lines 47 to 67 in 6c9047d

    clustering:
      parameters:
        oracle_num_speakers: False # If True, use num of speakers value provided in manifest file.
        max_num_speakers: 8 # Max number of speakers for each recording. If an oracle number of speakers is passed, this value is ignored.
        enhanced_count_thres: 80 # If the number of segments is lower than this number, enhanced speaker counting is activated.
        max_rp_threshold: 0.25 # Determines the range of p-value search: 0 < p <= max_rp_threshold.
        sparse_search_volume: 30 # The higher the number, the more values will be examined with more time.
        maj_vote_spk_count: False # If True, take a majority vote on multiple p-values to estimate the number of speakers.
        chunk_cluster_count: 50 # Number of forced clusters (overclustering) per unit chunk in long-form audio clustering.
        embeddings_per_chunk: 10000 # Number of embeddings in each chunk for long-form audio clustering. Adjust based on GPU memory capacity. (default: 10000, approximately 40 mins of audio)

    msdd_model:
      model_path: diar_msdd_telephonic # .nemo local model path or pretrained model name for multiscale diarization decoder (MSDD)
      parameters:
        use_speaker_model_from_ckpt: True # If True, use speaker embedding model in checkpoint. If False, the provided speaker embedding model in config will be used.
        infer_batch_size: 25 # Batch size for MSDD inference.
        sigmoid_threshold: [0.7] # Sigmoid threshold for generating binarized speaker labels. The smaller the more generous on detecting overlaps.
        seq_eval_mode: False # If True, use oracle number of speaker and evaluate F1 score for the given speaker sequences. Default is False.
        split_infer: True # If True, break the input audio clip to short sequences and calculate cluster average embeddings for inference.
        diar_window_length: 50 # The length of split short sequence when split_infer is True.
        overlap_infer_spk_limit: 5 # If the estimated number of speakers are larger than this number, overlap speech is not estimated.
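
One way to experiment with these settings without editing the YAML file by hand is to load it into a nested dict and apply overrides per run. The following is a minimal sketch, not part of the repository: `apply_overrides` is a hypothetical helper, `base_cfg` copies only a few keys and values from the excerpt above, and the override values are illustrative rather than recommendations. In the full config file these sections may sit under a parent key that the excerpt does not show, so the dotted paths would need adjusting.

```python
from copy import deepcopy

# Nested dict mirroring part of the config excerpt (values copied from it).
# In practice this would come from loading the YAML file with a YAML loader.
base_cfg = {
    "clustering": {
        "parameters": {
            "oracle_num_speakers": False,
            "max_num_speakers": 8,
            "max_rp_threshold": 0.25,
        }
    },
    "msdd_model": {
        "parameters": {
            "sigmoid_threshold": [0.7],
            "overlap_infer_spk_limit": 5,
        }
    },
}


def apply_overrides(cfg, overrides):
    """Return a copy of cfg with dotted-key overrides applied.

    Hypothetical helper: raises KeyError when a key does not already
    exist, so a typo in a setting name fails loudly instead of silently
    adding an unused key.
    """
    new_cfg = deepcopy(cfg)
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = new_cfg
        for name in parents:
            node = node[name]
        if leaf not in node:
            raise KeyError(dotted_key)
        node[leaf] = value
    return new_cfg


# Illustrative values only: cap the speaker count for a recording assumed
# to have two speakers, and raise the overlap threshold.
tuned_cfg = apply_overrides(
    base_cfg,
    {
        "clustering.parameters.max_num_speakers": 2,
        "msdd_model.parameters.sigmoid_threshold": [0.9],
    },
)
```

Because the original dict is left unchanged, several override sets can be tried against the same recording and the resulting speaker labels compared side by side.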
