Feature request
Add moderation to text generation models, similar to the approaches used in production LLMs (e.g. ChatGPT, Gemini, Claude), so that open-source models can have built-in content filtering.
This could provide (a rough sketch follows the list):
- Abstract base class for implementing safety checkers/classifiers
- Integration hooks during generation (logits processing and stopping criteria)
- Pipeline support for safe generation
- Configuration for safety settings
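To make the shape of the proposal concrete, here is a minimal sketch of what the abstract base class and configuration could look like. All names here (SafetyConfig, SafetyChecker, score, is_unsafe, the category names, and the threshold) are hypothetical and purely illustrative, not existing transformers API:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SafetyConfig:
    """Hypothetical configuration object for safe generation."""

    enabled: bool = True
    # Harm categories to screen for; values are purely illustrative.
    blocked_categories: List[str] = field(
        default_factory=lambda: ["violence", "self-harm"]
    )
    # Score above which a continuation is treated as unsafe.
    threshold: float = 0.8


class SafetyChecker(ABC):
    """Hypothetical abstract base class for moderation classifiers."""

    def __init__(self, config: SafetyConfig):
        self.config = config

    @abstractmethod
    def score(self, text: str) -> Dict[str, float]:
        """Return a mapping of harm category to a score in [0, 1]."""

    def is_unsafe(self, text: str) -> bool:
        """True if any blocked category meets the configured threshold."""
        if not self.config.enabled:
            return False
        scores = self.score(text)
        return any(
            scores.get(category, 0.0) >= self.config.threshold
            for category in self.config.blocked_categories
        )
```

A concrete subclass could implement `score()` with anything from a keyword list to a dedicated classifier model from the Hub, which is what would make the configuration useful for specialized domains.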
Motivation
While production LLMs ship with built-in moderation systems, those systems are often insufficient and harmful behavior can still surface unexpectedly, especially over long conversations. As open-source text generation models become more capable and widely used, mitigating harm and ensuring user safety should be a built-in capability. As far as I am aware, Transformers currently has no infrastructure to support this. The most effective approaches apply moderation during inference, which is non-trivial for Transformers users to implement on their own. In addition, letting users configure safety with custom settings and classifiers would cover more specialized contexts than commercial LLMs currently address.
Your contribution
I would like to work on a PR for this feature if there is interest. Diffusers' StableDiffusionSafetyChecker provides a precedent, and the existing LogitsProcessor and StoppingCriteria abstractions would support the development of this feature; a sketch of a StoppingCriteria-based integration follows.
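As a rough illustration of the inference-time integration, the existing StoppingCriteria hook can already host a moderation check on the decoded continuation. The `KeywordSafetyChecker` and `SafetyStoppingCriteria` classes below are hypothetical stand-ins (a real implementation would call a proper classifier), not part of the current transformers API:

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    StoppingCriteria,
    StoppingCriteriaList,
)


class KeywordSafetyChecker:
    """Toy checker that flags text containing any blocked keyword."""

    def __init__(self, blocked_keywords):
        self.blocked_keywords = [k.lower() for k in blocked_keywords]

    def is_unsafe(self, text: str) -> bool:
        lowered = text.lower()
        return any(keyword in lowered for keyword in self.blocked_keywords)


class SafetyStoppingCriteria(StoppingCriteria):
    """Stops generation once the decoded continuation is flagged unsafe."""

    def __init__(self, tokenizer, checker, prompt_length: int):
        self.tokenizer = tokenizer
        self.checker = checker
        self.prompt_length = prompt_length

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # Only inspect tokens generated after the prompt.
        new_text = self.tokenizer.decode(
            input_ids[0, self.prompt_length:], skip_special_tokens=True
        )
        return self.checker.is_unsafe(new_text)


tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("Write a short story about", return_tensors="pt")

criteria = StoppingCriteriaList(
    [
        SafetyStoppingCriteria(
            tokenizer,
            KeywordSafetyChecker(["blocked phrase"]),
            prompt_length=inputs["input_ids"].shape[1],
        )
    ]
)
outputs = model.generate(**inputs, max_new_tokens=50, stopping_criteria=criteria)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

A LogitsProcessor-based hook could complement this by steering or masking tokens before they are sampled, rather than only stopping after the fact; I am happy to flesh out both paths in the PR.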