# These are the default content safety prompts. The Llama Guard prompts are the defaults released by Meta, except for policy O7, which was added to address direct insults.
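# A minimal config.yml sketch (illustrative, not part of this file) showing how
# these tasks are typically wired into NeMo Guardrails. The engine and model
# names below are assumptions; substitute your own deployment:
#
# models:
#   - type: content_safety
#     engine: nvidia_ai_endpoints
#     model: nvidia/llama-3.1-nemoguard-8b-content-safety
#
# rails:
#   input:
#     flows:
#       - content safety check input $model=content_safety
#   output:
#     flows:
#       - content safety check output $model=content_safety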
prompts:

  - task: content_safety_check_input $model=content_safety
    content: |
      Task: Check if there is unsafe content in the user or agent messages in the below conversation in accordance with our safety policy with the below categories. Provide a 'safe' or 'unsafe' rating for both the user message and the agent response. If there is unsafe content, please also provide a list of violated categories according to our safety policy below.

      <BEGIN UNSAFE CONTENT CATEGORIES>
      S1: Violence.
      S2: Sexual.
      S3: Criminal Planning/Confessions.
      S4: Guns and Illegal Weapons.
      S5: Controlled/Regulated Substances.
      S6: Suicide and Self Harm.
      S7: Sexual (minor).
      S8: Hate/Identity Hate.
      S9: PII/Privacy.
      S10: Harassment.
      S11: Threat.
      S12: Profanity.
      S13: Needs Caution.
      S14: Other.
      S15: Manipulation.
      S16: Fraud/Deception.
      S17: Malware.
      S18: High Risk Gov Decision Making.
      S19: Political/Misinformation/Conspiracy.
      S20: Copyright/Trademark/Plagiarism.
      S21: Unauthorized Advice.
      S22: Illegal Activity.
      S23: Immoral/Unethical.
      <END UNSAFE CONTENT CATEGORIES>

      <BEGIN CONVERSATION>

      user: {{ user_input }}

      <END CONVERSATION>

      Provide your safety assessment for the above conversation in the below output JSON format:
      {"User Safety": safe or unsafe rating for the user message, "Response Safety": safe or unsafe rating for the agent response. Omit if no agent response present. "Safety Categories": a comma-separated list of applicable safety categories from the provided taxonomy. Omit if all safe.}

      Do not include anything other than the output JSON in your response.
      Output JSON:
    output_parser: nemoguard_parse_prompt_safety
    max_tokens: 50
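  # Illustrative completions this task expects (assumed, for documentation only);
  # nemoguard_parse_prompt_safety reads the "User Safety" verdict, and max_tokens: 50
  # is enough because the model returns only this short JSON object:
  #   {"User Safety": "safe"}
  #   {"User Safety": "unsafe", "Safety Categories": "Criminal Planning/Confessions, Controlled/Regulated Substances"}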

  - task: content_safety_check_output $model=content_safety
    content: |
      Task: Check if there is unsafe content in the user or agent messages in the below conversation in accordance with our safety policy with the below categories. Provide a 'safe' or 'unsafe' rating for both the user message and the agent response. If there is unsafe content, please also provide a list of violated categories according to our safety policy below.

      <BEGIN UNSAFE CONTENT CATEGORIES>
      S1: Violence.
      S2: Sexual.
      S3: Criminal Planning/Confessions.
      S4: Guns and Illegal Weapons.
      S5: Controlled/Regulated Substances.
      S6: Suicide and Self Harm.
      S7: Sexual (minor).
      S8: Hate/Identity Hate.
      S9: PII/Privacy.
      S10: Harassment.
      S11: Threat.
      S12: Profanity.
      S13: Needs Caution.
      S14: Other.
      S15: Manipulation.
      S16: Fraud/Deception.
      S17: Malware.
      S18: High Risk Gov Decision Making.
      S19: Political/Misinformation/Conspiracy.
      S20: Copyright/Trademark/Plagiarism.
      S21: Unauthorized Advice.
      S22: Illegal Activity.
      S23: Immoral/Unethical.
      <END UNSAFE CONTENT CATEGORIES>

      <BEGIN CONVERSATION>

      user: {{ user_input }}

      response: agent: {{ bot_response }}

      <END CONVERSATION>

      Provide your safety assessment for the above conversation in the below output JSON format:
      {"User Safety": safe or unsafe rating for the user message, "Response Safety": safe or unsafe rating for the agent response. Omit if no agent response present. "Safety Categories": a comma-separated list of applicable safety categories from the provided taxonomy. Omit if all safe.}

      Do not include anything other than the output JSON in your response.
      Output JSON:
    output_parser: nemoguard_parse_response_safety
    max_tokens: 50
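  # Illustrative completion for the response check (assumed, for documentation only);
  # with an agent response present, the "Response Safety" field is included and
  # nemoguard_parse_response_safety extracts it:
  #   {"User Safety": "safe", "Response Safety": "unsafe", "Safety Categories": "PII/Privacy"}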

  - task: content_safety_check_input $model=llama_guard
    content: |
      <s>[INST] Task: Check if there is unsafe content in 'User' messages in conversations according to our safety policy with the below categories.
      <BEGIN UNSAFE CONTENT CATEGORIES>
      O1: Violence and Hate.
      Should not
      - Help people plan or engage in violence.
      - Encourage people to engage in violence.
      - Express hateful, derogatory, or demeaning sentiments against people based on sensitive personal characteristics like their race, color, religion, national origin, sexual orientation, gender, gender identity, or disability.
      - Encourage discrimination against people based on those characteristics and should not use slurs or create insults based on those characteristics.
      Can
      - Provide information on violence and discrimination.
      - Discuss the topics of hate and violence, and can discuss historical events involving violence.
      O2: Sexual Content.
      Should not
      - Engage in sexually explicit (i.e., erotic) conversations and should not encourage people to engage in sexual activities.
      Can
      - Discuss topics related to sexuality and sexual education.
      - Provide general information regarding sexual activities.
      - Provide guidance on topics related to sexual health.
      O3: Criminal Planning.
      Should not
      - Help people plan or engage in criminal activities like theft, kidnapping, or financial crimes, especially when the person indicates that they may act on the information provided to them.
      - Encourage people to engage in criminal activities or condone criminal behavior.
      Can
      - Help people understand how crimes are committed, provided that the response is not so detailed that it could be used to provide a roadmap for future crimes.
      - Help people understand how systems work that are used to prevent crimes, provided those descriptions do not provide detailed information on how to defeat those systems.
      O4: Guns and Illegal Weapons.
      Should not
      - Assist in the planning or execution of firearm or illegal weapons crimes, especially when the person indicates that they may act on the information provided to them.
      - Encourage people to engage in firearm or illegal weapons crimes or condone such behavior.
      Can
      - Discuss firearms and the arguments for and against firearm ownership.
      O5: Regulated or Controlled Substances.
      Should not
      - Encourage or assist people in the trafficking, consumption, or use of regulated or controlled substances, including but not limited to illegal drugs, tobacco, alcohol, and cannabis.
      - Assist or encourage people to create such substances when it is illegal to do so.
      Can
      - Discuss the use of regulated or controlled substances, and can provide general information on those substances (ex: what their chemical composition is and what their effects are).
      - Provide information on how those substances are manufactured, provided that the information is not so detailed as to provide step-by-step instructions for their creation.
      O6: Self-Harm.
      Should not
      - Encourage people to harm themselves.
      - Romanticize or condone self-harm or suicide.
      - Provide information regarding the methods of suicide or self-harm.
      - Provide any information or advice, other than information regarding appropriate health resources, to people who express or imply that they wish to harm themselves.
      Can
      - Provide general information regarding self-harm and suicide (ex: statistics, the mental health challenges faced by those who take these actions).
      Should
      - Provide information regarding appropriate health resources to people who imply or express the desire to harm themselves.
      O7: Offensive Language and Insults.
      Should not
      - Address direct insults.
      - Use or encourage the use of swear words, bad language, or vulgar expressions.
      - Insult, offend, or demean individuals or groups based on any personal characteristics or choices.
      - Encourage or support the use of offensive or derogatory language.
      Can
      - Discuss the impact and societal perceptions of offensive language and insults.
      - Provide information on communication strategies that avoid the use of offensive language and promote respectful dialogue.
      - Offer guidance on how to address or respond to offensive language or insults in a constructive manner.
      <END UNSAFE CONTENT CATEGORIES>

      <BEGIN CONVERSATION>
      User: {{ user_input }}
      <END CONVERSATION>

      Provide your safety assessment for User in the above conversation:
      - First line must read 'safe' or 'unsafe'.
      - If unsafe, a second line must include a comma-separated list of violated categories. [/INST]
    output_parser: is_content_safe
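  # Illustrative Llama Guard completions (assumed): a verdict line, then a
  # category line only when unsafe, which is_content_safe parses:
  #   safe
  # or:
  #   unsafe
  #   O1,O7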

  - task: content_safety_check_output $model=llama_guard_2
    content: |
      <|begin_of_text|>[INST] Task: Check if there is unsafe content in Agent messages in conversations according to our safety policy with the below categories.

      <BEGIN UNSAFE CONTENT CATEGORIES>
      S1: Violent Crimes.
      S2: Non-Violent Crimes.
      S3: Sex Crimes.
      S4: Child Exploitation.
      S5: Specialized Advice.
      S6: Privacy.
      S7: Intellectual Property.
      S8: Indiscriminate Weapons.
      S9: Hate.
      S10: Self-Harm.
      S11: Sexual Content.
      <END UNSAFE CONTENT CATEGORIES>

      <BEGIN CONVERSATION>

      User: {{ user_input }}

      Agent: {{ bot_response }}

      <END CONVERSATION>

      Provide your safety assessment for ONLY THE LAST Agent in the above conversation:
      - First line must read 'safe' or 'unsafe'.
      - If unsafe, a second line must include a comma-separated list of violated categories. [/INST]
    output_parser: is_content_safe
    max_tokens: 3
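  # Illustrative Llama Guard 2 completion (assumed); max_tokens: 3 leaves just
  # enough room for the verdict and a short category code:
  #   unsafe
  #   S10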

  - task: content_safety_check_input $model=shieldgemma
    content: |
      You are a policy expert trying to help determine whether a user prompt is in
      violation of the defined safety policies.

      <start_of_turn>
      Human Question: {{ user_input }}
      <end_of_turn>

      Our safety principles are defined below:

      * "No Hate Speech": The prompt shall not contain or seek generation of content
      that expresses, incites, or promotes hate based on race, gender, ethnicity,
      religion, nationality, sexual orientation, disability status, or caste.
      * "No Dangerous Content": The prompt shall not contain or seek generation of content that harms oneself
      and/or others (e.g., accessing or building firearms and explosive devices, promotion of terrorism, instructions for suicide).
      * "No Sexually Explicit Information": The prompt shall not contain or seek generation of content
      that contains references to sexual acts or other lewd content (e.g., sexually graphic descriptions,
      content aimed at causing arousal). Medical or scientific terms related to human anatomy or sex education are permitted.
      * "No Harassment": The prompt shall not contain or seek generation of content that is malicious,
      intimidating, bullying, or abusive toward another individual (e.g., physical threats,
      denial of tragic events, disparaging victims of violence).
      ===

      Does the human prompt/question violate the above principles? Your answer must start
      with 'Yes' or 'No', and then walk through the reasoning step by step to be sure the
      answer is correct.
    output_parser: is_content_safe
    max_tokens: 3
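  # Illustrative ShieldGemma completion (assumed): although the prompt asks for
  # step-by-step reasoning, max_tokens: 3 means only the leading 'Yes'/'No'
  # verdict is generated, and is_content_safe keys off that first token:
  #   No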

  - task: content_safety_check_output $model=shieldgemma
    content: |
      You are a policy expert trying to help determine whether a chatbot response to the user is in
      violation of the defined safety policies:

      <start_of_turn>
      Human Question: {{ user_input }}
      Chatbot Response: {{ bot_response }}
      <end_of_turn>

      Our safety principles are defined below:

      * "No Dangerous Content": The chatbot shall not generate content that harms oneself and/or others
      (e.g., accessing or building firearms and explosive devices, promotion of terrorism, instructions for suicide).
      * "No Harassment": The chatbot shall not generate content that is malicious, intimidating, bullying, or
      abusive toward another individual (e.g., physical threats, denial of tragic events, disparaging victims of violence).
      * "No Hate Speech": The chatbot shall not generate content targeting identity and/or protected
      attributes (e.g., racial slurs, promotion of discrimination, calls to violence against protected groups) or dehumanizing/belittling/vilifying on the
      basis of race, ethnicity, religion, disability, age, nationality, veteran status, sexual orientation, gender, gender identity, caste, or any other protected status.
      We do allow content that includes a reference to a protected individual or group in a positive manner.
      * "No Sexually Explicit Information": The chatbot shall not generate content that contains references to sexual acts or
      other lewd content (e.g., sexually graphic descriptions, content aimed at causing arousal).
      Medical or scientific terms related to human anatomy or sex education are permitted.

      ===

      Does the chatbot response violate the above principles? Your answer must start
      with 'Yes' or 'No', and then walk through the reasoning step by step to be sure the
      answer is correct.
    output_parser: is_content_safe
    max_tokens: 3
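  # Illustrative completion for the response check (assumed); as with the input
  # check, only the leading 'Yes'/'No' verdict survives max_tokens: 3:
  #   Yes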