Prompt Experimentation and UI updates (#36)
* updates to app

* added prompt experiments

Co-authored-by: Aakanksha Duggal <aduggal@redhat.com>

---------

Co-authored-by: Aakanksha Duggal <aduggal@redhat.com>
oindrillac and aakankshaduggal authored Feb 28, 2024
1 parent 33364b2 commit 2e70c97
Showing 10 changed files with 45,316 additions and 107 deletions.
13 changes: 9 additions & 4 deletions README.md
@@ -1,6 +1,6 @@
# Generating API docs using Generative AI methods

The objective of this project is to conduct a Proof of Concept (POC) exploring the feasibility of utilizing generative AI (GenAI) for writing API documentation. The primary focus involves taking the codebase for the Red Hat Trusted Artifact Signer (RHTAS) project and generating API documentation from it using generative AI. This documentation will then be compared against manually written human documentation to evaluate its overall quality.
The objective of this project is to conduct a Proof of Concept (POC) exploring the feasibility of utilizing generative AI (GenAI) for writing API documentation. The POC takes the codebase for the Red Hat Trusted Artifact Signer (RHTAS) project and generates API documentation from it using generative AI. This documentation is then evaluated in various ways for quality and correctness.

## Links

@@ -22,18 +22,23 @@ The objective of this project is to conduct a Proof of Concept (POC) exploring t

## Results

* Out of the models we tried, OpenAI GPT 3.5 generates the most readable and well-structured API documentation, it also parses code structures well and generates documentation for those.
### Models
* Out of the models we tried, OpenAI GPT 3.5 generates the most readable and well-structured API documentation; it also parses code structures well and generates documentation for them. See comparison data against IBM Granite in this [notebook](notebooks/evaluation/prompt_experiments.ipynb).

* Among the rest of the models, llama-2-70b also seems to do well; it roughly follows the expected API documentation structure and parses most simple code structures.

* Contrary to what we expected, code-based models like IBM granite-20b-code-instruct and codellama-34b-instruct are sometimes good at following the expected structure, but at other times do not generate very readable documentation.

### Evaluation
Automated evaluation of LLM-generated text is a challenging task. We looked at methods like similarity scores (cosine similarity, ROUGE scores) to compare the generated text against the actual human-created documentation. We also looked at readability scores and grammatical checks to ensure that the text is readable and grammatically correct. However, none of these scores gives us a good understanding of the quality and usability of the generated output, or of its correctness and accuracy.
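
A minimal sketch of these similarity and readability checks is shown below (assuming scikit-learn, textstat, and the `rouge-score` package are installed; the function and variable names are illustrative, not part of the app):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from rouge_score import rouge_scorer
import textstat


def similarity_and_readability(generated: str, reference: str) -> dict:
    # Cosine similarity over TF-IDF vectors: 0 = no lexical overlap, 1 = identical wording
    tfidf = TfidfVectorizer().fit_transform([reference, generated])
    cosine = cosine_similarity(tfidf[0], tfidf[1])[0][0]

    # ROUGE-L measures longest-common-subsequence overlap with the reference text
    rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = rouge.score(reference, generated)["rougeL"].fmeasure

    # Flesch Reading Ease: higher scores indicate easier-to-read text
    flesch = textstat.flesch_reading_ease(generated)

    return {"cosine": cosine, "rougeL": rouge_l, "flesch": flesch}
```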

Thus we began to explore ways to use LLMs to evaluate LLM outputs. After experimentation and comparing the results against human evaluation, we narrowed down on LangChain criteria-based evaluation, which uses OpenAI GPT in the background to score the generated text on helpfulness, correctness, and logic (this list can be expanded based on requirements). The [study](notebooks/evaluation/quantitative_evaluation.ipynb) revealed these scores to be more effective than the other metrics we tried and consistent with human scoring the majority of the time.
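
Below is a minimal sketch of this criteria-based evaluation, mirroring the approach in `eval_using_langchain` in `app/utils.py` (import paths may differ across LangChain versions, and the placeholder strings are illustrative):

```python
from langchain.chat_models import ChatOpenAI
from langchain.evaluation import EvaluatorType, load_evaluator

# Placeholder inputs -- in the app these come from the generation step
prompt_text = "Generate API documentation for the following Python function: ..."
generated_doc = "..."  # documentation produced by the model
human_doc = "..."      # manually written reference documentation

llm = ChatOpenAI(model="gpt-4", temperature=0)

# Custom criterion: the evaluator returns a dict with "score" (0 or 1), "value", and "reasoning"
logic_eval = load_evaluator(
    EvaluatorType.CRITERIA,
    llm=llm,
    criteria={"logical": "Is the output logical and complete? Does it capture all required fields?"},
)
print(logic_eval.evaluate_strings(prediction=generated_doc, input=prompt_text)["score"])

# Built-in "correctness" criterion scores the output against the human-written reference
correctness_eval = load_evaluator("labeled_criteria", llm=llm, criteria="correctness")
print(
    correctness_eval.evaluate_strings(
        prediction=generated_doc, input=prompt_text, reference=human_doc
    )["score"]
)
```

Because the judge is itself an LLM, these scores are best treated as a screening signal and spot-checked against human review.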

## Overall Viability

* Generative AI models that we tried are performing sufficiently well and can be used for API documentation generation.
* Generative AI models like GPT-3.5 can read code and generate decently structured API documentation that follows a desired structure.

* These models can be integrated into a tool which can be used by developers and maintainers, streamlining the documentation process.
* These models could be integrated into a tool for developers and maintainers, streamlining the documentation process.

* Existing tools like GitHub Copilot provide a mature interface and state-of-the-art results. In scenarios where API docs must be generated for proprietary code, we see a use case for leveraging self-hosted models.

80 changes: 26 additions & 54 deletions app/app.py
@@ -138,18 +138,24 @@ def OPENAI_API_KEY() -> str:
instruction = st.text_area(
"Instruction",
"""
You are an AI system specialized at generating API documentation for the provided Python code. You will be provided functions, classes, or Python scripts. Your documentation should include:
1. Introduction: Briefly describe the purpose of the API and its intended use.
2. Functions: Document each API function, including:
- Description: Clearly explain what the endpoint or function does.
- Parameters: List and describe each parameter, including data types and any constraints.
- Return Values: Specify the data type and possible values returned.
3. Error Handling: Describe possible error responses and their meanings.
Make sure to follow this output structure to create API documentation that is clear, concise, accurate, and user-centric. Avoid speculative information and prioritize accuracy and completeness.
Generate API documentation for Python code provided in the prompt. Ensure clarity, accuracy, and user-centricity.
If no code is provided, do not speculate or generate generic examples. Instead, leave this section blank or state "No code provided".
If Python code is provided:
1. Introduction:
2. Class Documentation:
- Document each class present in the code, including:
- Class Name and Description
- Class Attributes and Data types
- Documentation for each method within the class, following the instructions below.
3. Function Documentation:
- For each function in the code:
- Function Description
- Parameters, including names and data types.
- Return values, including data types.
4. Error Handling:
Describe possible error responses and how they are handled in the code.
""",
)

@@ -244,60 +244,26 @@ def main(prompt_success: bool, prompt_diff: int, actual_doc: str):

with col3:
st.subheader("Evaluation Metrics")
st.markdown(
"**GenAI evaluation on Overall Quality:**",
help="Use OpenAI GPT 3 to evaluate the result of the generated API doc",
)

score = eval_using_model(result, openai_key=OPENAI_API_KEY())
st.write(score)

st.markdown(
"**LangChain evaluation on grammar, descriptiveness and helpfulness:**",
help="Use Langchain to evaluate on cutsom criteria (this list can be updated based on what we are looking to see from the generated docs"
"**LLM based evaluation on logic, correctness and helpfulness:**",
help="Use Langchain Criteria based Eval to evaluate on cutsom criteria (this list can be updated based on what we are looking to see from the generated docs). Note this is language mo0del based evaluation and not always a true indication of the quality of the output that is generatged."
)

lc_score = eval_using_langchain(prompt, result)
st.markdown(
f"Grammatical: {lc_score[0]['score']}",
help="Checks if the output grammatically correct. Binary integer 0 to 1, where 1 would mean that the output is gramatically accurate and 0 means it is not",
f"Logical: {lc_score[0]['score']}",
help="Checks if the output is logical. Binary integer 0 to 1, where 1 would mean that the output is logical and 0 means it is not",
)

st.markdown(
f"Descriptiveness: {lc_score[1]['score']}",
help="Checks if the output descriptive. Binary integer 0 to 1, where 1 would mean that the output is descriptive and 0 means it is not",
)

st.markdown(
f"Helpfulness: {lc_score[2]['score']}",
help="Checks if the output helpful for the end user. Binary integer 0 to 1, where 1 would mean that the output is helpful and 0 means it is not"
f"Helpfulness: {lc_score[1]['score']}",
help="Checks if the output helpful for the end user. Binary integer 0 to 1, where 1 would mean that the output is helpful and 0 means it is not",
)

st.markdown(
"**Consistency:**",
help="Evaluate how similar or divergent the generated document is to the actual documentation",
)

# calc cosine similarity
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform([actual_doc, result])
cosine_sim = cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])
st.markdown(
f"Cosine Similarity Score: {cosine_sim[0][0]:.2f}",
help="0 cosine similarity means no similarity between generated and actual API documentation, 1 means they are same",
)
st.markdown("###") # add a line break

st.markdown(
"**Readability Scores:**",
help="Evaluate how readable the generated text is",
)

# Flesch Reading Ease
flesch_reading_ease = textstat.flesch_reading_ease(result)
st.markdown(
f"Flesch Reading Ease: {flesch_reading_ease:.2f}",
help="Flesch Reading Ease measures how easy a text is to read. Higher scores indicate easier readability. Ranges 0-100 and a negative score indicates a more challenging text.",
f"Correctness: {lc_score[2]['score']}",
help="Checks if the output correct. Binary integer 0 to 1, where 1 would mean that the output is correct and accurate and 0 means it is not"
)

if st.button("Generate API Documentation"):
59 changes: 12 additions & 47 deletions app/utils.py
@@ -48,91 +48,62 @@ def generate_prompt(
classes_doc_text_joined = "\n".join(classes_doc_text)
# print(functions_text_joined)

prompt = f"""{instruction}\n"""
prompt = f"""{instruction}"""

if functions and functions_text_joined:
prompt += f"""
Function:
{functions_text_joined}
"""

if functions_code and functions_code_text_joined:
prompt += f"""
Function Code:
{functions_code_text_joined}
Function Documentation:
"""

if functions_doc and functions_doc_text_joined:
prompt += f"""
Function Docstrings:
{functions_doc_text_joined}
Documentation:
"""

if classes and classes_text_joined:
prompt += f"""
Class:
{classes_text_joined}
"""
if classes_code and classes_code_text_joined:
prompt += f"""
Class code:
{classes_code_text_joined}
Class Documentation:
"""
if classes_doc and classes_doc_text_joined:
prompt += f"""
Classes Docstrings:
{classes_doc_text_joined}
Documentation
"""

if documentation and documentation_text_joined:
prompt += f"""
Here is some code documentation for reference:
{documentation_text_joined}
"""

if imports and imports_text_joined:
prompt += f"""
Here are the import statements for reference:
{imports_text_joined}
"""

if other and other_text:
prompt += f"""
Here are other lines of code for reference:
{other_text_joined}
"""

return prompt
@@ -275,27 +246,21 @@ def eval_using_langchain(prediction: str, query: str):
evaluation = []
llm = ChatOpenAI(model="gpt-4", temperature=0)

# If you wanted to specify multiple criteria. Generally not recommended
custom_criterion_1 = {
"grammatical": "Is the output grammatically correct?",
}

eval_chain = load_evaluator(EvaluatorType.CRITERIA, llm=llm, criteria=custom_criterion_1)

# 1
custom_criteria_1 = {
"logical": "Is the output logical and complete? Does it capture all required fields"
}
eval_chain = load_evaluator(EvaluatorType.CRITERIA, llm=llm, criteria=custom_criteria_1)
eval_result = eval_chain.evaluate_strings(prediction=prediction, input=query)
evaluation.append(eval_result)

custom_criterion_2 = {
"descriptive": "Does the output describe a piece of code and its intended functionality?"
}

eval_chain = load_evaluator(EvaluatorType.CRITERIA, llm=llm, criteria=custom_criterion_2)

eval_result = eval_chain.evaluate_strings(prediction=prediction, input=query)

# 2
evaluator = load_evaluator("labeled_criteria", llm=llm, criteria="correctness")
# NOTE: assumes actual_doc (the human-written reference documentation) is available in this scope
eval_result = evaluator.evaluate_strings(prediction=prediction, input=query, reference=actual_doc)
evaluation.append(eval_result)


# 3
evaluator = load_evaluator("criteria", llm=llm, criteria="helpfulness")

eval_result = evaluator.evaluate_strings(prediction=prediction,input=query)
evaluation.append(eval_result)
