
@rice-e commented Oct 22, 2025

Draft PR: the core implementation is complete. Seeking feedback on the design and approach before finalizing. Thank you!

What does this PR do?

Adds safety checking infrastructure for text generation. Provides base classes, configuration, and processors that integrate with the generation pipeline. Users implement their own safety checkers for specific needs (harm, bias, PII, etc.).

Fixes #41740
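
To make the proposed shape concrete, here is a minimal, self-contained sketch of the kind of interface the description above implies. The names `SafetyChecker`, `SafetyConfig`, and `check` are assumptions for illustration; only the `safety_config` parameter is actually named in this PR.

```python
# Hypothetical sketch of the proposed design -- class/method names are
# assumptions, not the PR's actual API.
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
import re


class SafetyChecker(ABC):
    """Hypothetical base class; users subclass it for harm, bias, PII, etc."""

    @abstractmethod
    def check(self, text: str) -> bool:
        """Return True if `text` is safe to emit."""


@dataclass
class SafetyConfig:
    """Hypothetical config bundling checkers with a violation policy."""

    checkers: list = field(default_factory=list)
    on_violation: str = "stop"  # e.g. stop generation vs. redact the span


class EmailPIIChecker(SafetyChecker):
    """Toy concrete checker: flags text containing email-like strings."""

    def check(self, text: str) -> bool:
        return re.search(r"[\w.+-]+@[\w-]+\.\w+", text) is None


config = SafetyConfig(checkers=[EmailPIIChecker()])
print(all(c.check("Contact me at foo@bar.com") for c in config.checkers))  # False
```

Keeping the base class abstract leaves the library agnostic about what "unsafe" means, which matches the goal of letting users target contexts that commercial moderation systems do not cover.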

Motivation

As stated in the issue I opened, production LLMs ship with built-in safety moderation systems, but these are often insufficient and can still produce unexpected harmful behavior, especially over long conversations. As open-source text generation models become more capable and widely used, harm mitigation and user safety should be built-in capabilities; as far as I am aware, Transformers currently has no infrastructure to support this. The most effective approaches moderate during inference, which is non-trivial for users to implement on their own. In addition, configurable safety settings and custom classifiers would let users address harms in more specialized contexts than commercial LLMs currently cover.

Before submitting

Provides infrastructure for runtime safety checking via a safety_config parameter.
Includes base classes, configuration, and processors.
Users implement concrete checkers for their specific needs; see the sketch below for how such a checker can hook into generation.
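
For context, behavior like this can already be approximated with the existing `StoppingCriteria` API, which is one plausible way a `safety_config`-driven processor could hook into the generation loop. The blocklist checker below is purely illustrative and is not code from this PR.

```python
# Approximating inference-time safety checking with the existing
# StoppingCriteria API; the blocklist checker is illustrative only.
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    StoppingCriteria,
    StoppingCriteriaList,
)


class BlocklistCriteria(StoppingCriteria):
    """Stops generation once any blocklisted phrase appears in the output."""

    def __init__(self, tokenizer, blocklist):
        self.tokenizer = tokenizer
        self.blocklist = [phrase.lower() for phrase in blocklist]

    def __call__(self, input_ids, scores, **kwargs):
        texts = self.tokenizer.batch_decode(input_ids, skip_special_tokens=True)
        done = [any(p in t.lower() for p in self.blocklist) for t in texts]
        return torch.tensor(done, dtype=torch.bool, device=input_ids.device)


tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The weather today is", return_tensors="pt")
criteria = StoppingCriteriaList([BlocklistCriteria(tokenizer, ["badword"])])
output = model.generate(**inputs, stopping_criteria=criteria, max_new_tokens=30)
print(tokenizer.batch_decode(output, skip_special_tokens=True)[0])
```

A True entry for a row halts that sequence; the processors in this PR would presumably hide this glue behind `safety_config` and layer policy handling (block, redact, regenerate) on top.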
