Safety Checking Infrastructure for Text Generation #41740

@rice-e

Description

Feature request

Add safety moderation infrastructure for text generation models, similar to the approaches used in production LLMs (e.g., ChatGPT, Gemini, Claude), so that open-source models can ship with built-in content filtering.

This could provide:

  • Abstract base class for implementing safety checkers/classifiers (see the sketch after this list)
  • Integration hooks during generation (logits processing and stopping criteria)
  • Pipeline support for safe generation
  • Configuration for safety settings
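A minimal sketch of how the first and last items might fit together. Nothing here is an existing API: SafetyChecker, SafetyConfig, and their fields are hypothetical names used only to illustrate the shape of the interface.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import List


@dataclass
class SafetyConfig:
    """Hypothetical safety settings; field names are illustrative only."""
    blocked_categories: List[str] = field(default_factory=lambda: ["violence", "self-harm"])
    threshold: float = 0.5          # flag text scoring at or above this harm probability
    check_every_n_tokens: int = 16  # how often to run the checker during generation


class SafetyChecker(ABC):
    """Hypothetical base class that users would subclass with their own classifier."""

    def __init__(self, config: SafetyConfig):
        self.config = config

    @abstractmethod
    def score(self, text: str) -> float:
        """Return a harm probability in [0, 1] for the given text."""

    def is_safe(self, text: str) -> bool:
        return self.score(text) < self.config.threshold
```

A concrete subclass could wrap any text classifier (a fine-tuned moderation model, a keyword filter, or an external API), and the generation hooks would only need to call is_safe on the decoded continuation.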

Motivation

While production LLMs have built-in safety moderation systems, those systems are often insufficient and can fail to prevent harmful behavior, especially over long conversations. As open-source text generation models become more capable and widely used, mitigating harm and ensuring user safety should be a built-in capability. As far as I am aware, there is currently no built-in infrastructure in Transformers to support this. The most effective approaches involve moderation during inference, which is non-trivial for Transformers users to implement on their own. In addition, supporting custom safety settings and classifiers would let users address harms in more specialized contexts than commercial LLMs currently cover.

Your contribution

I would like to work on a PR for this feature if there is interest. It seems that Diffusers' StableDiffusionSafetyChecker provides precedent for the checker itself, and the existing LogitsProcessor and StoppingCriteria abstractions would support the inference-time hooks.
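To make the last point concrete, here is a rough proof-of-concept sketch of how a safety check can already be wired in through the existing StoppingCriteria hook, assuming a recent transformers version where stopping criteria return a per-sequence boolean tensor. The gpt2 checkpoint and the trivial always-safe classifier below are placeholders, not part of the proposal.

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    StoppingCriteria,
    StoppingCriteriaList,
)


class SafetyStoppingCriteria(StoppingCriteria):
    """Illustration only: stop generating once a user-supplied classifier flags the text so far."""

    def __init__(self, tokenizer, safety_checker, threshold=0.5):
        self.tokenizer = tokenizer
        self.safety_checker = safety_checker  # any callable: str -> harm score in [0, 1]
        self.threshold = threshold

    def __call__(self, input_ids, scores, **kwargs):
        # Decode the full sequences generated so far (prompt included) and score each one.
        texts = self.tokenizer.batch_decode(input_ids, skip_special_tokens=True)
        flags = [self.safety_checker(text) >= self.threshold for text in texts]
        return torch.tensor(flags, dtype=torch.bool, device=input_ids.device)


tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

always_safe = lambda text: 0.0  # placeholder; a real moderation classifier would go here
inputs = tokenizer("Once upon a time", return_tensors="pt")
output = model.generate(
    **inputs,
    max_new_tokens=40,
    stopping_criteria=StoppingCriteriaList(
        [SafetyStoppingCriteria(tokenizer, always_safe)]
    ),
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

The proposed infrastructure would essentially standardize this pattern, plus a logits-processor variant for suppressing specific continuations, behind a shared checker/configuration interface so users do not have to reimplement it themselves.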
