Safety Checking Infrastructure for Text Generation #41740

@rice-e

Description

Feature request

Add safety moderation infrastructure for text generation models, similar to the approaches used in production LLMs (e.g., ChatGPT, Gemini, Claude), so that open-source models can ship with built-in content filtering.

This could provide:

  • Abstract base class for implementing safety checkers/classifiers (see the sketch after this list)
  • Integration hooks during generation (logits processing and stopping criteria)
  • Pipeline support for safe generation
  • Configuration for safety settings
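A minimal sketch of how the first and last items might fit together. Nothing here is an existing API: SafetyChecker, SafetyConfig, and their fields are hypothetical names used only to illustrate the shape of the interface.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import List


@dataclass
class SafetyConfig:
    """Hypothetical safety settings; field names are illustrative only."""
    blocked_categories: List[str] = field(default_factory=lambda: ["violence", "self-harm"])
    threshold: float = 0.5          # flag text scoring at or above this harm probability
    check_every_n_tokens: int = 16  # how often to run the checker during generation


class SafetyChecker(ABC):
    """Hypothetical base class that users would subclass with their own classifier."""

    def __init__(self, config: SafetyConfig):
        self.config = config

    @abstractmethod
    def score(self, text: str) -> float:
        """Return a harm probability in [0, 1] for the given text."""

    def is_safe(self, text: str) -> bool:
        return self.score(text) < self.config.threshold
```

A concrete subclass could wrap any text classifier (a fine-tuned moderation model, a keyword filter, or an external API), and the generation hooks would only need to call is_safe on the decoded continuation.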

Motivation

While production LLMs have built-in safety moderation systems, those systems are often insufficient and can fail to prevent harmful behavior, especially over long conversations. As open-source text generation models become more capable and widely used, mitigating harm and ensuring user safety should be a built-in capability. As far as I am aware, there is currently no built-in infrastructure in Transformers to support this. The most effective approaches involve moderation during inference, which is non-trivial for Transformers users to implement on their own. In addition, supporting custom safety settings and classifiers would let users address harms in more specialized contexts than commercial LLMs currently cover.

Your contribution

I would like to work on a PR for this feature if there is interest. It seems that Diffusers' StableDiffusionSafetyChecker provides precedent for the checker itself, and the existing LogitsProcessor and StoppingCriteria abstractions would support the inference-time hooks.
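To make the last point concrete, here is a rough proof-of-concept sketch of how a safety check can already be wired in through the existing StoppingCriteria hook, assuming a recent transformers version where stopping criteria return a per-sequence boolean tensor. The gpt2 checkpoint and the trivial always-safe classifier below are placeholders, not part of the proposal.

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    StoppingCriteria,
    StoppingCriteriaList,
)


class SafetyStoppingCriteria(StoppingCriteria):
    """Illustration only: stop generating once a user-supplied classifier flags the text so far."""

    def __init__(self, tokenizer, safety_checker, threshold=0.5):
        self.tokenizer = tokenizer
        self.safety_checker = safety_checker  # any callable: str -> harm score in [0, 1]
        self.threshold = threshold

    def __call__(self, input_ids, scores, **kwargs):
        # Decode the full sequences generated so far (prompt included) and score each one.
        texts = self.tokenizer.batch_decode(input_ids, skip_special_tokens=True)
        flags = [self.safety_checker(text) >= self.threshold for text in texts]
        return torch.tensor(flags, dtype=torch.bool, device=input_ids.device)


tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

always_safe = lambda text: 0.0  # placeholder; a real moderation classifier would go here
inputs = tokenizer("Once upon a time", return_tensors="pt")
output = model.generate(
    **inputs,
    max_new_tokens=40,
    stopping_criteria=StoppingCriteriaList(
        [SafetyStoppingCriteria(tokenizer, always_safe)]
    ),
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

The proposed infrastructure would essentially standardize this pattern, plus a logits-processor variant for suppressing specific continuations, behind a shared checker/configuration interface so users do not have to reimplement it themselves.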
