
@rice-e commented Oct 22, 2025

Draft PR: the core implementation is complete. Seeking feedback on the design and approach before finalizing. Thank you!

What does this PR do?

Adds safety checking infrastructure for text generation. Provides base classes, configuration, and processors that integrate with the generation pipeline. Users implement their own safety checkers for specific needs (harm, bias, PII, etc.).

Fixes #41740
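
To make the proposed shape concrete, here is a minimal, self-contained sketch of the kind of interface the description above implies. The names `SafetyChecker`, `SafetyConfig`, and `check` are assumptions for illustration; only the `safety_config` parameter is actually named in this PR.

```python
# Hypothetical sketch of the proposed design -- class/method names are
# assumptions, not the PR's actual API.
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
import re


class SafetyChecker(ABC):
    """Hypothetical base class; users subclass it for harm, bias, PII, etc."""

    @abstractmethod
    def check(self, text: str) -> bool:
        """Return True if `text` is safe to emit."""


@dataclass
class SafetyConfig:
    """Hypothetical config bundling checkers with a violation policy."""

    checkers: list = field(default_factory=list)
    on_violation: str = "stop"  # e.g. stop generation vs. redact the span


class EmailPIIChecker(SafetyChecker):
    """Toy concrete checker: flags text containing email-like strings."""

    def check(self, text: str) -> bool:
        return re.search(r"[\w.+-]+@[\w-]+\.\w+", text) is None


config = SafetyConfig(checkers=[EmailPIIChecker()])
print(all(c.check("Contact me at foo@bar.com") for c in config.checkers))  # False
```

Keeping the base class abstract leaves the library agnostic about what "unsafe" means, which matches the goal of letting users target contexts that commercial moderation systems do not cover.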

Motivation

As stated in the issue I opened, production LLMs ship with built-in safety moderation systems, but these are often insufficient and can still produce unexpected harmful behavior, especially over long conversations. As open-source text generation models become more capable and widely used, harm mitigation and user safety should be built-in capabilities; as far as I am aware, Transformers currently has no infrastructure to support this. The most effective approaches moderate during inference, which is non-trivial for users to implement on their own. In addition, configurable safety settings and custom classifiers would let users address harms in more specialized contexts than commercial LLMs currently cover.

Before submitting

Provides infrastructure for runtime safety checking via a safety_config parameter.
Includes base classes, configuration, and processors.
Users implement concrete checkers for their specific needs; see the sketch below for how such a checker can hook into generation.
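
For context, behavior like this can already be approximated with the existing `StoppingCriteria` API, which is one plausible way a `safety_config`-driven processor could hook into the generation loop. The blocklist checker below is purely illustrative and is not code from this PR.

```python
# Approximating inference-time safety checking with the existing
# StoppingCriteria API; the blocklist checker is illustrative only.
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    StoppingCriteria,
    StoppingCriteriaList,
)


class BlocklistCriteria(StoppingCriteria):
    """Stops generation once any blocklisted phrase appears in the output."""

    def __init__(self, tokenizer, blocklist):
        self.tokenizer = tokenizer
        self.blocklist = [phrase.lower() for phrase in blocklist]

    def __call__(self, input_ids, scores, **kwargs):
        texts = self.tokenizer.batch_decode(input_ids, skip_special_tokens=True)
        done = [any(p in t.lower() for p in self.blocklist) for t in texts]
        return torch.tensor(done, dtype=torch.bool, device=input_ids.device)


tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The weather today is", return_tensors="pt")
criteria = StoppingCriteriaList([BlocklistCriteria(tokenizer, ["badword"])])
output = model.generate(**inputs, stopping_criteria=criteria, max_new_tokens=30)
print(tokenizer.batch_decode(output, skip_special_tokens=True)[0])
```

A True entry for a row halts that sequence; the processors in this PR would presumably hide this glue behind `safety_config` and layer policy handling (block, redact, regenerate) on top.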
