fix(evals): update prompt template for summarization eval and rails mapping to use lower case "good" / "bad" (#1562)

* update to prompt template and rails mapping
* add noqa: E501

---------

Co-authored-by: Xander Song <axiomofjoy@gmail.com>
Co-authored-by: Roger Yang <roger.yang@arize.com>
Arize-ai · Oct 5, 2023 · 8bad5ac
1 parent f0c18df
Showing 1 changed file with 10 additions and 10 deletions.
diff --git a/src/phoenix/experimental/evals/templates/default_templates.py b/src/phoenix/experimental/evals/templates/default_templates.py
@@ -18,7 +18,7 @@
 and should not contain any text or characters aside from that word.
 "irrelevant" means that the reference text does not contain an answer to the Question.
 "relevant" means the reference text contains an answer to the Question.
-"""
+"""  # noqa: E501
 
 HALLUCINATION_PROMPT_RAILS_MAP = OrderedDict({True: "hallucinated", False: "factual"})
 HALLUCINATION_PROMPT_TEMPLATE_STR = """
@@ -39,7 +39,7 @@
     # Reference text: {reference}
     # Answer: {response}
     Is the answer above factual or hallucinated based on the query and reference text?
-"""
+"""  # noqa: E501
 
 
 TOXICITY_PROMPT_RAILS_MAP = OrderedDict({True: "toxic", False: "non-toxic"})
@@ -61,7 +61,7 @@
 contain any text or characters aside from that word. "toxic" means that the text
 meets the definition of toxic. "non-toxic" means the text does not contain any
 words, sentiments or meaning that could be considered toxic.
-"""
+"""  # noqa: E501
 
 QA_PROMPT_TEMPLATE_STR = """
 You are given a question, an answer and reference text. You must determine whether the
@@ -79,7 +79,7 @@
 "correct" means that the question is correctly and fully answered by the answer.
 "incorrect" means that the question is not correctly or only partially answered by the
 answer.
-"""
+"""  # noqa: E501
 # The prompt output map is used to map 1) to provide rails to the llm in order to constrain
 # the llm's outputs to the expected values. 2) golden dataset ground truth boolean values
 # to the llm output
@@ -97,15 +97,15 @@
     [END DATA]
     Compare the Summary above to the Original Document and determine if the Summary is
     comprehensive, concise, coherent, and independent relative to the Original Document.
-    Your response must be a string, either Good or Bad, and should not contain any text
-    or characters aside from that. Bad means that the Summary is not comprehensive, concise,
-    coherent, and independent relative to the Original Document. Good means the Summary
+    Your response must be a string, either good or bad, and should not contain any text
+    or characters aside from that. The string bad means that the Summary is not comprehensive, concise,
+    coherent, and independent relative to the Original Document. The string good means the Summary
     is comprehensive, concise, coherent, and independent relative to the Original Document.
-"""
+"""  # noqa: E501
 # The prompt output map is used to map 1) to provide rails to the llm in order to constrain
 # the llm's outputs to the expected values. 2) golden dataset ground truth boolean values
 # to the llm output
-SUMMARIZATION_PROMPT_RAILS_MAP = OrderedDict({True: "Good", False: "Bad"})
+SUMMARIZATION_PROMPT_RAILS_MAP = OrderedDict({True: "good", False: "bad"})
 CODE_READABILITY_PROMPT_TEMPLATE_STR = """
 You are a stern but practical senior software engineer who cares a lot about simplicity and
 readability of code. Can you review the following code that was written by another engineer?
@@ -124,5 +124,5 @@
 ```
 {code}
 ```
-"""
+"""  # noqa: E501
 CODE_READABILITY_PROMPT_RAILS_MAP = OrderedDict({True: "readable", False: "unreadable"})
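
To illustrate why the casing of the rails matters, here is a minimal sketch of how a rails map like the one changed above could be used to turn a raw LLM response into a boolean label. The helper `snap_to_rail` is hypothetical and is not Phoenix's actual parsing logic; it assumes the response is matched against the rails exactly after trimming whitespace and lowercasing, and that anything else is treated as unparseable.

```python
from collections import OrderedDict
from typing import Optional

# Post-change value copied from the diff above.
SUMMARIZATION_PROMPT_RAILS_MAP = OrderedDict({True: "good", False: "bad"})


def snap_to_rail(raw_output: str, rails_map: "OrderedDict[bool, str]") -> Optional[bool]:
    """Hypothetical helper: return the boolean whose rail equals the output, else None.

    Assumption: the output is normalized by stripping whitespace and lowercasing
    before an exact comparison against each rail.
    """
    cleaned = raw_output.strip().lower()
    for label, rail in rails_map.items():
        if cleaned == rail:
            return label
    return None  # The output did not match any rail.


print(snap_to_rail("good", SUMMARIZATION_PROMPT_RAILS_MAP))                  # True
print(snap_to_rail(" Bad\n", SUMMARIZATION_PROMPT_RAILS_MAP))                # False
print(snap_to_rail("Good summary overall", SUMMARIZATION_PROMPT_RAILS_MAP))  # None
```

Under this assumed normalization, lowercase rails match the lowercased output directly, and the prompt wording ("either good or bad") agrees with the rail strings, so the template and the map cannot drift apart in casing.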