Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

What to do about misattribution? #310

Open
ErfolgreichCharismatisch opened this issue Feb 28, 2025 · 1 comment
Open

What to do about misattribution? #310

ErfolgreichCharismatisch opened this issue Feb 28, 2025 · 1 comment

Comments

@ErfolgreichCharismatisch

Is there a threshold that can be set to improve correct attribution?

@MahmoudAshraf97
Copy link
Owner

These are the settings you can play with to change the diarization output:

clustering:
parameters:
oracle_num_speakers: False # If True, use num of speakers value provided in manifest file.
max_num_speakers: 8 # Max number of speakers for each recording. If an oracle number of speakers is passed, this value is ignored.
enhanced_count_thres: 80 # If the number of segments is lower than this number, enhanced speaker counting is activated.
max_rp_threshold: 0.25 # Determines the range of p-value search: 0 < p <= max_rp_threshold.
sparse_search_volume: 30 # The higher the number, the more values will be examined with more time.
maj_vote_spk_count: False # If True, take a majority vote on multiple p-values to estimate the number of speakers.
chunk_cluster_count: 50 # Number of forced clusters (overclustering) per unit chunk in long-form audio clustering.
embeddings_per_chunk: 10000 # Number of embeddings in each chunk for long-form audio clustering. Adjust based on GPU memory capacity. (default: 10000, approximately 40 mins of audio)
msdd_model:
model_path: diar_msdd_telephonic # .nemo local model path or pretrained model name for multiscale diarization decoder (MSDD)
parameters:
use_speaker_model_from_ckpt: True # If True, use speaker embedding model in checkpoint. If False, the provided speaker embedding model in config will be used.
infer_batch_size: 25 # Batch size for MSDD inference.
sigmoid_threshold: [0.7] # Sigmoid threshold for generating binarized speaker labels. The smaller the more generous on detecting overlaps.
seq_eval_mode: False # If True, use oracle number of speaker and evaluate F1 score for the given speaker sequences. Default is False.
split_infer: True # If True, break the input audio clip to short sequences and calculate cluster average embeddings for inference.
diar_window_length: 50 # The length of split short sequence when split_infer is True.
overlap_infer_spk_limit: 5 # If the estimated number of speakers are larger than this number, overlap speech is not estimated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants