Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mutox classifier #44

Draft
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

David-OC17
Copy link
Contributor

Elements to include in PR

  • Migrate Mutox code from Seamless Communication
  • Add unit tests for the pipeline
  • (Optional) add softmax computation on top of classifier

Note: A follow-up PR will be needed in the Seamless Communication repository to account for the migration being made here.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 26, 2024
mutox/__init__.py Outdated Show resolved Hide resolved
mutox/speech_pipeline.py Outdated Show resolved Hide resolved
pyproject.toml Outdated Show resolved Hide resolved
mutox/cli/README.md Outdated Show resolved Hide resolved
mutox/cli/README.md Outdated Show resolved Hide resolved
sonar/models/mutox/classifier.py Show resolved Hide resolved
) -> None:
super().__init__(encoder)
self.model.to(device).eval()
self.mutox_classifier = mutox_classifier.to(device).eval()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the type annotation, mutox_classifier can be either a str or a MutoxClassifier. The latter would work fine here, but for the former, we'll have to load the classifier using the string name.

The same holds for the model part one line above.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@David-OC17 you marked this conversation as resolved, but the type annotation for mutox_classifier is still not totally consistent with the method body: either remove str from the annotation or add a load_mutox_model to the body if the input is a string.

sonar/inference_pipelines/mutox_speech.py Outdated Show resolved Hide resolved
README.md Show resolved Hide resolved
README.md Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
"name": "stdout",
"output_type": "stream",
"text": [
"/tmp/tmpqasvhgx6/commonvoice_example_en_clocks.wav\t-42.40079116821289\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These scores, -42 and -47, are not straightforward to interpret.
We should either mention in the top of the notebook that the model has been trained with a "Binary Cross Entropy loss with logits" objective (according) to the paper), and thus, to turn its output into a probability, we need to feed it to a sigmoid layer.

Copy link
Contributor

@avidale avidale Nov 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same thing: you resolved the comment, but the notebook still doesn't feature a code change or an explanation of what kind of output the current code generates.

README.md Outdated Show resolved Hide resolved
README.md Show resolved Hide resolved
x = classifier(emb.to(device).to(dtype)) # tensor([[-19.7812]], device='cuda:0', dtype=torch.float16)

with torch.inference_mode():
emb = t2vec_model.predict(["She worked hard and made a significant contribution to the team."], source_lang='fra_Latn')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

source_lang='eng_Latn'?

@@ -142,6 +142,49 @@ print(blaser_qe(src=src_embs, mt=mt_embs).item()) # 4.708
Detailed model cards with more examples: [facebook/blaser-2.0-ref](https://huggingface.co/facebook/blaser-2.0-ref),
[facebook/blaser-2.0-qe](https://huggingface.co/facebook/blaser-2.0-qe).

### Classifying the toxicity of sentences with MuTox

[MuTox](https://github.com/facebookresearch/seamless_communication/tree/main/src/seamless_communication/cli/toxicity/mutox), the first highly multilingual audio-based classifier (binary) and dataset with toxicity labels. The dataset consists of 20k audio utterances for English and Spanish, and 4k for the other 19 languages, and uses the multi-model and multilingual encoders from SONAR. The output of the MuTox classifier is a probability of the evaluated being _"toxic"_, according to the definition adopted in the corresponding dataset.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This paragraph says that "The output of the MuTox classifier is a probability", but the outputs in your examples are tensors like -58.0625 which are clearly not probabilities.
We should either adjust the description (by saying that the output is a logit) or the output (by feeding it to a sigmoid function to turn it into a probability).

model_type: mutox_classifier
model_arch: mutox
checkpoint: "https://dl.fbaipublicfiles.com/seamless/models/mutox.pt"
input_size: 1024
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

input_size: int

# add sigmoid as last layer to output probability
output_prob: bool = False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would probably make output_prob not a property of the config but an optional argument in the MutoxClassifier.forward method (which defaults to False, which is the current behavior).

) -> None:
super().__init__(encoder)
self.model.to(device).eval()
self.mutox_classifier = mutox_classifier.to(device).eval()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@David-OC17 you marked this conversation as resolved, but the type annotation for mutox_classifier is still not totally consistent with the method body: either remove str from the annotation or add a load_mutox_model to the body if the input is a string.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants