
Commit

[data gen] streaming log batch run progress, modify debug info, modif…
chjinche authored Feb 4, 2024
1 parent e17e992 commit feb8063
Showing 18 changed files with 213 additions and 266 deletions.
23 changes: 12 additions & 11 deletions docs/cloud/azureai/generate-test-data-cloud.md
@@ -4,27 +4,28 @@ This guide will help you learn how to generate test data on Azure AI, so that yo

## Prerequisites

1. Go through local test data generation [guide](../../how-to-guides/generate-test-data.md) and prepare your test data generation flow.
2. Go to the [gen_test_data](../../../examples/gen_test_data) folder and run command `pip install -r requirements_cloud.txt` to prepare local environment.
1. Go through [local test data generation guide](../../how-to-guides/generate-test-data.md) and prepare your [test data generation flow](../../../examples/gen_test_data/gen_test_data/generate_test_data_flow/).
2. Go to the [example_gen_test_data](../../../examples/gen_test_data) folder and run `pip install -r requirements_cloud.txt` to prepare the local environment.
3. Prepare cloud environment.
- Navigate to file [conda.yml](../../../examples/gen_test_data/conda.yml).
- For specific document file types, you may need to add extra packages in `conda.yml`:
> !Note: We use llama index `SimpleDirectoryReador` in this process. For the latest information on required packages, please check [here](https://docs.llamaindex.ai/en/stable/examples/data_connectors/simple_directory_reader.html).
- .docx - `docx2txt`
- .pdf - `pypdf`
- .ipynb - `nbconvert`
- For specific document file types, you may need to install extra packages:
- .docx - `pip install docx2txt`
- .pdf - `pip install pypdf`
- .ipynb - `pip install nbconvert`
> !Note: We use llama index `SimpleDirectoryReader` in this process. For the latest information on required packages, please check [here](https://docs.llamaindex.ai/en/stable/examples/data_connectors/simple_directory_reader.html).
4. Prepare Azure AI resources in cloud.
- An Azure AI ML workspace - [Create workspace resources you need to get started with Azure AI](https://learn.microsoft.com/en-us/azure/machine-learning/quickstart-create-resources?view=azureml-api-2).
- A compute target - [Learn more about compute cluster](https://learn.microsoft.com/en-us/azure/machine-learning/concept-compute-target?view=azureml-api-2).
5. Create cloud connection: [Create a connection](https://microsoft.github.io/promptflow/cloud/azureai/quick-start.html#create-necessary-connections)
5. [Create cloud connection](https://microsoft.github.io/promptflow/cloud/azureai/quick-start.html#create-necessary-connections)

6. Prepare config.ini
- Navigate to [gen_test_data](../../../examples/gen_test_data) folder.
- Run command to copy `config.ini.example` and update the configurations in the `configs.ini` file
- Navigate to the [example_gen_test_data](../../../examples/gen_test_data) folder.
- Run the command below to copy [`config.ini.example`](../../../examples/gen_test_data/config.ini.example).
```
cp config.ini.example config.ini
```
- Fill in the values in `COMMON` and `CLOUD` section.
- Update the configurations in `config.ini`. Fill in the values in the `COMMON` and `CLOUD` sections following the inline comment instructions.
## Generate test data in the cloud
72 changes: 34 additions & 38 deletions docs/how-to-guides/generate-test-data.md
@@ -1,6 +1,6 @@
# How to generate test data based on documents
This guide will instruct you on how to generate test data for RAG systems using pre-existing documents.
This approach eliminates the need for manual data creation, which is typically time-consuming and labor-intensive, or the expensive option of purchasing pre-packaged test data.
In this doc, you will learn how to generate test data based on your documents for your RAG app.
This approach helps reduce the effort of manual data creation, which is typically time-consuming and labor-intensive, and avoids the expensive option of purchasing pre-packaged test data.
By leveraging the capabilities of LLMs, this guide streamlines the test data generation process, making it more efficient and cost-effective.


@@ -14,64 +14,60 @@ By leveraging the capabilities of llm, this guide streamlines the test data gene

**Limitations:**

- While the test data generator works well with standard documents, it may face challenges with API introduction documents or reference documents.
- The test data generator may not function effectively for non-Latin characters, such as Chinese. These limitations may be due to the text loader capabilities, such as `pypdf`.
- The test data generator may not function effectively for non-Latin characters, such as Chinese, in certain document types. This limitation is caused by the capabilities of the underlying text loaders, such as `pypdf`.
- The test data generator may not generate meaningful questions if the document is not well-organized or contains a large number of code snippets/links, such as API introduction documents or reference documents.

2. Go to the [gen_test_data](../../examples/gen_test_data) folder and install required packages.
- Run in local: `pip install -r requirements.txt`
- Run in cloud: `pip install -r requirements_cloud.txt`
2. Prepare the local environment. Go to the [example_gen_test_data](../../examples/gen_test_data) folder and install the required packages: `pip install -r requirements.txt`

For specific document file types, you will need to install extra packages:
> !Note: We use llama index `SimpleDirectoryReador` in this process. For the latest information on required packages, please check [here](https://docs.llamaindex.ai/en/stable/examples/data_connectors/simple_directory_reader.html).
For specific document file types, you may need to install extra packages:
- .docx - `pip install docx2txt`
- .pdf - `pip install pypdf`
- .ipynb - `pip install nbconvert`
> !Note: We use llama index `SimpleDirectoryReader` in this process. For the latest information on required packages, please check [here](https://docs.llamaindex.ai/en/stable/examples/data_connectors/simple_directory_reader.html).
3. Install VSCode extension `Prompt flow`.

3. Install VSCode extension and create connections refer to [Create a connection](https://microsoft.github.io/promptflow/how-to-guides/manage-connections.html#create-a-connection)
4. [Create connections](https://microsoft.github.io/promptflow/how-to-guides/manage-connections.html#create-a-connection)

5. Prepare config.ini
- Navigate to the [example_gen_test_data](../../examples/gen_test_data) folder.
- Run the command below to copy [`config.ini.example`](../../examples/gen_test_data/config.ini.example).
```
cp config.ini.example config.ini
```
- Update the configurations in `config.ini`. Fill in the values in the `COMMON` and `LOCAL` sections following the inline comment instructions.
## Create a test data generation flow
- Open the [generate_test_data_flow](../../examples/gen_test_data/generate_test_data_flow/) folder in VSCode.
- Open the [sample test data generation flow](../../examples/gen_test_data/gen_test_data/generate_test_data_flow/) in VSCode. This flow is designed to generate a question and suggested answer pair based on the given text chunk. The flow also includes validation prompts to ensure the quality of the generated test data.
- Fill in node inputs including `connection`, `model_or_deployment_name`, `response_format`, `score_threshold`, and other parameters. Click the run button to test the flow in VSCode by referring to [Test flow with VS Code Extension](https://microsoft.github.io/promptflow/how-to-guides/init-and-test-a-flow.html#visual-editor-on-the-vs-code-for-prompt-flow).
> !Note: Recommend to use `gpt-4` series models than the `gpt-3.5` for better performance.
> !Note: We recommend using the `gpt-4` model (Azure OpenAI `gpt-4` model with version `0613`) rather than the `gpt-4-turbo` model (Azure OpenAI `gpt-4` model with version `1106`) for better performance. Due to the inferior performance of the `gpt-4-turbo` model, when you use it you might sometimes need to set the `response_format` input of the `validate_text_chunk`, `validate_question`, and `validate_suggested_answer` nodes to `json`, in order to make sure the LLM can generate a valid JSON response.
- [*Optional*] Customize your test data generation logic referring to [tune-prompts-with-variants](https://microsoft.github.io/promptflow/how-to-guides/tune-prompts-with-variants.html).
**Understand the prompts**
The test data generation flow contains five different prompts, classified into two categories based on their roles: generation prompts and validation prompts. Generation prompts are used to create questions, suggested answers, etc., while validation prompts are used to verify the validity of the text trunk, generated question or answer.
The test data generation flow contains five prompts, classified into two categories based on their roles: generation prompts and validation prompts. Generation prompts are used to create questions, suggested answers, etc., while validation prompts are used to verify the validity of the text chunk, the generated question, or the generated answer.
- Generation prompts
- *generate question prompt*: frame a question based on the given text trunk.
- *generate suggested answer prompt*: generate suggested answer for the question based on the given text trunk.
- [*generate question prompt*](../../examples/gen_test_data/gen_test_data/generate_test_data_flow/generate_question_prompt.jinja2): frame a question based on the given text chunk.
- [*generate suggested answer prompt*](../../examples/gen_test_data/gen_test_data/generate_test_data_flow/generate_suggested_answer_prompt.jinja2): generate suggested answer for the question based on the given text chunk.
- Validation prompts
- *score text trunk prompt*: validate if the given text trunk is worthy of framing a question. If the score is lower than score_threshold, validation fails.
- *validate seed/test question prompt*: validate if the generated question can be clearly understood.
- *validate suggested answer*: validate if the generated suggested answer is clear and certain.

If the validation fails, the corresponding output would be an empty string so that the invalid data would not be incorporated into the final test data set.



- Fill in the necessary flow/node inputs, and run the flow in VSCode refering to [Test flow with VS Code Extension](https://microsoft.github.io/promptflow/how-to-guides/init-and-test-a-flow.html#visual-editor-on-the-vs-code-for-prompt-flow).
- [*score text chunk prompt*](../../examples/gen_test_data/gen_test_data/generate_test_data_flow/score_text_chunk_prompt.jinja2): score 0-10 to validate if the given text chunk is worthy of framing a question. If the score is lower than `score_threshold` (default 4), validation fails.
- [*validate question prompt*](../../examples/gen_test_data/gen_test_data/generate_test_data_flow/validate_question_prompt.jinja2): validate if the generated question is good.
- [*validate suggested answer*](../../examples/gen_test_data/gen_test_data/generate_test_data_flow/validate_suggested_answer_prompt.jinja2): validate if the generated suggested answer is good.
**Set the appropriate model and corresponding response format.** The `gpt-4` model is recommended. The default prompt may yield better results with this model compared to the gpt-3 series.
- For the `gpt-4` model with version `0613`, use the response format `text`.
- For the `gpt-4` model with version `1106`, use the response format `json`.
If a validation fails, it results in an empty `question`/`suggested_answer` string, and such rows are removed from the final output test data set.
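To make the control flow concrete, here is a minimal plain-Python sketch of the generate-then-validate logic described above. It is illustrative only: `call_prompt` is a hypothetical helper standing in for the flow's LLM prompt nodes, and the yes/no validation outputs are an assumption made for readability rather than the flow's actual output format.
```python
SCORE_THRESHOLD = 4  # default threshold mentioned above; configurable via the `score_threshold` input


def generate_test_data(text_chunk: str, call_prompt) -> dict:
    """Illustrative outline of the flow's logic; not the actual flow implementation."""
    empty = {"question": "", "suggested_answer": ""}

    # 1. Score the text chunk; low-value chunks are dropped early.
    if float(call_prompt("score_text_chunk_prompt", text_chunk)) < SCORE_THRESHOLD:
        return empty

    # 2. Frame a question from the chunk, then validate it
    #    (validation responses are assumed to be "yes"/"no" here).
    question = call_prompt("generate_question_prompt", text_chunk)
    if call_prompt("validate_question_prompt", question).strip().lower() != "yes":
        return empty

    # 3. Generate a suggested answer for the question, then validate it.
    answer = call_prompt("generate_suggested_answer_prompt", question + "\n" + text_chunk)
    if call_prompt("validate_suggested_answer_prompt", answer).strip().lower() != "yes":
        return {"question": question, "suggested_answer": ""}

    return {"question": question, "suggested_answer": answer}
```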
## Generate test data
- Navigate to the [example_gen_test_data](../../examples/gen_test_data) folder.
## Generate test data locally
- Navigate to the [gen_test_data](../../examples/gen_test_data) folder.

- Run command to copy `config.ini.example` and update the `COMMON` and `LOCAL` configurations in the `configs.ini` file
```
cp config.ini.example config.ini
```
- After configuration, run the following command to generate the test data set:
```bash
python -m gen_test_data.run
```
- The generated test data will be a data jsonl file located in the path you configured in `config.ini`.

- The generated test data will be a JSONL file. Check the detailed log printed in the console ("Saved ... valid test data to ...") to find its location.
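For a quick sanity check of the output, here is a small sketch that loads the generated file and prints a few rows. The path is a placeholder (use the location reported in the console log), and the file name `test_data_set.jsonl` follows the cloud component's output name, so it may differ in your setup.
```python
import json
from pathlib import Path

# Placeholder path: use the file reported by the "Saved ... valid test data to ..." log line.
test_data_path = Path("<your-output-folder>/test_data_set.jsonl")

with open(test_data_path, "r", encoding="utf-8") as f:
    rows = [json.loads(line) for line in f if line.strip()]

print(f"Loaded {len(rows)} generated test data rows.")
for row in rows[:3]:
    # Each row holds a generated question and its suggested answer.
    print(row["question"], "->", row["suggested_answer"])
```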

## Generate test data in the cloud
For handling larger test data, you can leverage the PRS component to run flow in cloud. Please refer to this [guide](../cloud/azureai/generate-test-data-cloud.md) for more information.
If you expect to generate a large amount of test data beyond your local compute capability, you can generate the test data in the cloud instead; please see this [guide](../cloud/azureai/generate-test-data-cloud.md) for more detailed steps.
4 changes: 2 additions & 2 deletions examples/gen_test_data/config.ini.example
@@ -18,7 +18,7 @@ connection_name = "<your-connection-name>"
[LOCAL]
; This section is for local test data generation related configuration.
output_folder = "<your-output-folder-abspath>"
flow_batch_run_size = 10
flow_batch_run_size = 4


[CLOUD]
Expand All @@ -31,7 +31,7 @@ aml_cluster = "<your-compute-name>"
; Parallel run step configs
prs_instance_count = 2
prs_mini_batch_size = 2
prs_max_concurrency_per_instance = 10
prs_max_concurrency_per_instance = 4
prs_max_retry_count = 3
prs_run_invocation_time = 800
prs_allowed_failed_count = -1
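As a rough illustration of how these settings could be consumed, below is a minimal sketch using Python's standard `configparser`. The key names mirror the example above; whether the actual scripts read the file exactly this way is an assumption.
```python
import configparser

# Minimal sketch: read the copied config.ini and pull a few of the values shown above.
# Values in config.ini.example are quoted, so strip the quotes before use.
config = configparser.ConfigParser(inline_comment_prefixes=(";", "#"))
config.read("config.ini")

def unquote(value: str) -> str:
    return value.strip().strip('"')

output_folder = unquote(config["LOCAL"]["output_folder"])
flow_batch_run_size = config["LOCAL"].getint("flow_batch_run_size")
aml_cluster = unquote(config["CLOUD"]["aml_cluster"])
prs_instance_count = config["CLOUD"].getint("prs_instance_count")
prs_max_concurrency = config["CLOUD"].getint("prs_max_concurrency_per_instance")

print(output_folder, flow_batch_run_size, aml_cluster, prs_instance_count, prs_max_concurrency)
```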
56 changes: 50 additions & 6 deletions examples/gen_test_data/gen_test_data/common.py
@@ -1,4 +1,7 @@
import json
import sys
import re
import time
import typing as t
from pathlib import Path

@@ -19,7 +22,7 @@ def split_document(chunk_size, documents_folder, document_node_output):

logger = get_logger("doc.split")
logger.info("Step 1: Start to split documents to document nodes...")
# count the number of files in documents_folder, including subfolders, use pathlib
# count the number of files in documents_folder, including subfolders.
num_files = sum(1 for _ in Path(documents_folder).rglob("*") if _.is_file())
logger.info(f"Found {num_files} files in the documents folder '{documents_folder}'. Using chunk size: {chunk_size} to split.")
# `SimpleDirectoryReader` by default chunk the documents based on heading tags and paragraphs, which may lead to small chunks.
@@ -39,7 +42,7 @@ def split_document(chunk_size, documents_folder, document_node_output):
return str((Path(document_node_output) / "document_nodes.jsonl"))


def clean_data_and_save(test_data_set: list, test_data_output_path: str):
def clean_data(test_data_set: list, test_data_output_path: str):
logger = get_logger("data.clean")
logger.info("Step 3: Start to clean invalid test data...")
logger.info(f"Collected {len(test_data_set)} test data after the batch run.")
@@ -49,20 +52,61 @@ def clean_data_and_save(test_data_set: list, test_data_output_path: str):
if test_data and all(
val and val != "(Failed)" for key, val in test_data.items() if key.lower() != "line_number"
):
cleaned_data.append(test_data)
data_line = {"question": test_data["question"], "suggested_answer": test_data["suggested_answer"]}
cleaned_data.append(data_line)

jsonl_str = "\n".join(map(json.dumps, cleaned_data))
with open(test_data_output_path, "wt") as text_file:
print(f"{jsonl_str}", file=text_file)

# TODO: aggregate invalid data root cause and count, and log it.
# log debug info path.
logger.info(f"Removed {len(test_data_set) - len(cleaned_data)} invalid test data.")
logger.info(f"Saved {len(cleaned_data)} valid test data to {test_data_output_path}.")
logger.info(f"Removed {len(test_data_set) - len(cleaned_data)} invalid test data. "
f"Saved {len(cleaned_data)} valid test data to '{test_data_output_path}'.")


def count_non_blank_lines(file_path):
with open(file_path, 'r') as file:
lines = file.readlines()

non_blank_lines = len([line for line in lines if line.strip()])
return non_blank_lines
return non_blank_lines


def print_progress(log_file_path: str):
logger = get_logger("data.gen")
logger.info(f"Showing progress log, or you can click '{log_file_path}' and see detailed batch run log...")
log_pattern = re.compile(r".*execution.bulk\s+INFO\s+Finished (\d+) / (\d+) lines\.")
# wait for the log file to be created
start_time = time.time()
while not Path(log_file_path).is_file():
time.sleep(1)
# if the log file is not created within 5 minutes, raise an error
if time.time() - start_time > 300:
raise Exception(f"Log file '{log_file_path}' is not created within 5 minutes.")

try:
last_data_time = time.time()
with open(log_file_path, 'r') as f:
while True:
line = f.readline().strip()
if line:
last_data_time = time.time() # Update the time when the last data was received
match = log_pattern.match(line)
if not match:
continue

sys.stdout.write("\r" + line) # \r will move the cursor back to the beginning of the line
sys.stdout.flush() # flush the buffer to ensure the log is displayed immediately
finished, total = map(int, match.groups())
if finished == total:
logger.info("Batch run is completed.")
break
elif time.time() - last_data_time > 300:
logger.info("No new log line received for 5 minutes. Stop reading. See the log file for more details.")
break # Stop reading
else:
time.sleep(1) # wait for 1 second if no new line is available
except KeyboardInterrupt:
sys.stdout.write("\n") # ensure to start on a new line when the user interrupts
sys.stdout.flush()
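For reference, a minimal usage sketch of the new `print_progress` helper, assuming it is called from the batch-run driver script after the run is submitted; the log path below is a placeholder, not a path produced by the tool.
```python
# Hypothetical usage: stream "Finished X / Y lines" progress from the batch run log
# to the console until the run completes or the log goes quiet.
if __name__ == "__main__":
    print_progress("<path-to-the-batch-run-log-file>")
```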
8 changes: 4 additions & 4 deletions examples/gen_test_data/gen_test_data/components.py
@@ -1,7 +1,7 @@
import json
from pathlib import Path

from common import clean_data_and_save, split_document
from common import clean_data, split_document
from mldesigner import Input, Output, command_component

conda_file = Path(__file__).parent.parent / "conda.yml"
@@ -34,15 +34,15 @@ def split_document_component(


@command_component(
name="clean_data_and_save_component",
name="clean_data_component",
display_name="clean dataset",
description="Clean test data set to remove empty lines.",
environment=dict(
conda_file=conda_file,
image=env_image,
),
)
def clean_data_and_save_component(
def clean_data_component(
test_data_set_folder: Input(type="uri_folder"), test_data_output: Output(type="uri_folder")
) -> str:
test_data_set_path = Path(test_data_set_folder) / "parallel_run_step.jsonl"
@@ -51,6 +51,6 @@ def clean_data_and_save_component(
data = [json.loads(line) for line in f]

test_data_output_path = test_data_output / Path("test_data_set.jsonl")
clean_data_and_save(data, test_data_output_path)
clean_data(data, test_data_output_path)

return str(test_data_output_path)
1 change: 1 addition & 0 deletions examples/gen_test_data/gen_test_data/constants.py
@@ -1,2 +1,3 @@
DOCUMENT_NODE = "document_node"
TEXT_CHUNK = "text_chunk"
DETAILS_FILE_NAME ="test-data-gen-details.jsonl"
