From 4022ad1d7b2651fea4943f3c6a8a135ab8215fc1 Mon Sep 17 00:00:00 2001 From: Colin O'Sullivan Date: Wed, 16 Oct 2024 22:45:16 +0100 Subject: [PATCH 1/4] community-sourced SDG guidelines Signed-off-by: Colin O'Sullivan --- docs/taxonomy/sdg_guidelines.md | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) create mode 100644 docs/taxonomy/sdg_guidelines.md diff --git a/docs/taxonomy/sdg_guidelines.md b/docs/taxonomy/sdg_guidelines.md new file mode 100644 index 0000000..a1ee43f --- /dev/null +++ b/docs/taxonomy/sdg_guidelines.md @@ -0,0 +1,19 @@ +## Synthetic Data Generation Guidelines + +The OIC initiative promoted discussion spaces for sharing and recording best practices for generating seed data in qna.yaml files to be ingested by InstructLab. The guidelines were categorized, evolved with community contributions, and were validated by IBM practitioners against real-world scenarios. This first version presents ten categories of guidelines that the development community can use to improve qna.yaml files. + +| Topic | Issue | Guideline | | --- | --- | --- | | Things to Avoid | Historically, LLMs are bad at math | Do not provide complex math calculations in Q&A seeds. | | Context | What if the knowledge is based on documents that do not exist in the base model? | In the qna.yaml file, you can pass context as a chunk of information (text from the document that the Q&A pairs are based on). Adding context to the skill QnA file might generate better-quality data. | | Formatting (front-end specific; may change) | How should data be formatted in the Q&A file, especially tables? | Currently, only files in Markdown format are supported. If the files are in any other format, they must be converted to Markdown. For automatic converters, we recommend experimenting with output formats such as ‘markdown_strict’, ‘asciidoc’ and ‘gfm’. | | Intervene in Training | Can I use generated json files to prompt-tuning (watsonx.ai) or using HuggingFace directly? | The output of SDG is in json format and can also be used for traditional fine-tuning. | | Quantities | The number of seed examples: how many seeds should I provide? | The number of seeds: 01 - Generating ~300 QnA pairs from ~5 seed examples is recommended by the InstructLab product team. 02 - Knowledge requires 5 pieces of context from the document, each with 3 QnA pairs specific to that context piece, for a total of 15 QnA pairs. 03 - We tried with fewer than 300 QnA pairs but found the QnA quality only satisfactory. | | Task description | | The task description should be grounded in the domain/document. - Due to this recommendation we should keep in mind that much complex cases can be splitted into smaller chunks of information | | Context size limitation | What is the size limit of the context window in the Q&A file (qna.yaml)? | There is a ~2300 context size limit in the qna.yaml file. It is advised to keep the ground-truth answers concise to respect this limit. | | After Training | How to check the quality of the data in a large data set generated from the qna.yaml file? | You do not have to check all of the synthetic data generated by the SDG process. After generating synthetic data internally, the IBM Research team samples it to check quality (there is no need to check every example, especially for an extensive set). | | Quality | How to measure the quality of the obtained data? | To evaluate SDG, you can use following rating range (1-5): 01 - Irrelevant answer. 02 - Relevant but not close to the ground truth; the model might be hallucinating. 
03 - Relevant, model not hallucinating, partly matching the ground truth. 04 - Relevant, model not hallucinating, but adding irrelevant/unnecessary information. 05 - Excellent answer, matching closely with the ground truth. | +| Manual validation | During manual validation, the reviewer identified the entity and intent of the question and searched for the same entity and intent in the corresponding document. The document information was provided in the generated JSON file. | At the next step, manual search validated it the steps or definitions contained in the answer were indeed in the corresponding document. | +| Quality of data generated | How to enhance the quality of the data generated in SDG? | - Task description: Add a task description relevant to the knowledge documents. We tried adding a custom task description to improve the SDG.
- Prompt template: Add guidelines for the instructions and output to stick to document-related keywords and to generate instructions from tables. We specifically added these instructions to the prompt template.
- Chunk word count: Increase the word count to increase the chunk sizes taken from the documents in SDG for long-answer Q&A pairs.
- Rouge threshold: To enforce data quality more strictly, one can increase the rouge threshold in the ilab generate command.
- The question and answer pairs should be complete sentences, well formed, and use proper grammar. Longer answers are better than a short yes or no.
- Also, the question and answer pairs must be answerable from the associated context. | +| Formatting | How many leaf nodes are kept in the taxonomy after adding a Q&A file? | The documents are kept in a single leaf node, which has one qna.yaml file and one attribution.txt file. | From db26849e07368c2942cff4559edd734cb266f33a Mon Sep 17 00:00:00 2001 From: Colin O'Sullivan Date: Wed, 16 Oct 2024 23:54:08 +0100 Subject: [PATCH 2/4] minor formatting changes Signed-off-by: Colin O'Sullivan --- docs/taxonomy/sdg_guidelines.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/taxonomy/sdg_guidelines.md b/docs/taxonomy/sdg_guidelines.md index a1ee43f..3e757a0 100644 --- a/docs/taxonomy/sdg_guidelines.md +++ b/docs/taxonomy/sdg_guidelines.md @@ -5,15 +5,15 @@ The OIC initiative promoted discussion spaces for sharing and recording best pra | Topic | Issue | Guideline | | --- | --- | --- | | Things to Avoid | Historically, LLMs are bad at math | Do not provide complex math calculations in Q&A seeds. | -| Context | What if the knowledge is based on documents that do not exist in the base model? | In the qna.yaml file, you can pass context as a chunk of information (text from the document that the Q&A pairs are based on). Adding context to the skill QnA file might generate better-quality data. | +| Context | What if the knowledge is based on documents that do not exist in the base model? | In the qna.yaml file, you can pass context as a chunk of information (text from the document that the Q&A pairs are based on). Adding context to the skill QnA file might generate better quality data. | | Formatting (front-end specific; may change) | How should data be formatted in the Q&A file, especially tables? | Currently, only files in Markdown format are supported. If the files are in any other format, they must be converted to Markdown. For automatic converters, we recommend experimenting with output formats such as ‘markdown_strict’, ‘asciidoc’ and ‘gfm’. | | Intervene in Training | Can I use generated json files to prompt-tuning (watsonx.ai) or using HuggingFace directly? | The output of SDG is in json format and can also be used for traditional fine-tuning. | -| Quantities | The number of seed examples: how many seeds should I provide? | The number of seeds: 01 - Generating ~300 QnA pairs from ~5 seed examples is recommended by the InstructLab product team. 02 - Knowledge requires 5 pieces of context from the document, each with 3 QnA pairs specific to that context piece, for a total of 15 QnA pairs. 03 - We tried with fewer than 300 QnA pairs but found the QnA quality only satisfactory. | +| Quantities | How many seeds should I provide? | The number of seeds:
  1. Generating ~300 QnA pairs from ~5 seed examples is recommended by the InstructLab product team.
  2. Knowledge requires 5 pieces of context from the document, each with 3 QnA pairs specific to that context piece, for a total of 15 QnA pairs.
  3. We tried with fewer than 300 QnA pairs but found the QnA quality only satisfactory.
    | | Task description | | The task description should be grounded in the domain/document. - Due to this recommendation we should keep in mind that much complex cases can be splitted into smaller chunks of information | | Context size limitation | What is the size limit of the context window in the Q&A file (qna.yaml)? | There is a ~2300 context size limit in the qna.yaml file. It is advised to keep the ground-truth answers concise to respect this limit. | | After Training | How to check the quality of the data in a large data set generated from the qna.yaml file? | You do not have to check all of the synthetic data generated by the SDG process. After generating synthetic data internally, the IBM Research team samples it to check quality (there is no need to check every example, especially for an extensive set). | -| Quality | How to measure the quality of the obtained data? | To evaluate SDG, you can use following rating range (1-5): 01 - Irrelevant answer. 02 - Relevant but not close to the ground truth; the model might be hallucinating. 03 - Relevant, model not hallucinating, partly matching the ground truth. 04 - Relevant, model not hallucinating, but adding irrelevant/unnecessary information. 05 - Excellent answer, matching closely with the ground truth. | +| Quality | How to measure the quality of the obtained data? | To evaluate SDG, you can use following rating range (1-5):
      1. Irrelevant answer.
      2. Relevant but not close to the ground truth; the model might be hallucinating.
      3. Relevant, model not hallucinating, partly matching the ground truth.
      4. Relevant, model not hallucinating, but adding irrelevant/unnecessary information.
      5. Excellent answer, matching closely with the ground truth.
        | | Manual validation | During manual validation, the reviewer identified the entity and intent of the question and searched for the same entity and intent in the corresponding document. The document information was provided in the generated JSON file. | At the next step, manual search validated it the steps or definitions contained in the answer were indeed in the corresponding document. | -| Quality of data generated | How to enhance the quality of the data generated in SDG? | - Task description: Add a task description relevant to the knowledge documents. We tried adding a custom task description to improve the SDG.
          - Prompt template: Add guidelines for the instructions and output to stick to document-related keywords and to generate instructions from tables. We specifically added these instructions to the prompt template.
          - Chunk word count: Increase the word count to increase the chunk sizes taken from the documents in SDG for long-answer Q&A pairs.
          - Rouge threshold: To enforce data quality more strictly, one can increase the rouge threshold in the ilab generate command.
          - The question and answer pairs should be complete sentences, well formed, and use proper grammar. Longer answers are better than a short yes or no.
          - Also, the question and answer pairs must be answerable from the associated context. | +| Quality of data generated | How to enhance the quality of the data generated in SDG? |
          • Task description: Add a task description relevant to the knowledge documents. We tried adding a custom task description to improve the SDG.
          • Prompt template: Add guidelines for the instructions and output to stick to document-related keywords and to generate instructions from tables. We specifically added these instructions to the prompt template.
          • Chunk word count: Increase the word count to increase the chunk sizes taken from the documents in SDG for long-answer Q&A pairs.
          • Rouge threshold: To enforce data quality more strictly, one can increase the rouge threshold in the ilab generate command.
          • The question and answer pairs should be complete sentences, well formed, and use proper grammar. Longer answers are better than a short yes or no.
          • Also, the question and answer pairs must be answerable from the associated context.
          | | Formatting | How many leaf nodes are kept in the taxonomy after adding a Q&A file? | The documents are kept in a single leaf node, which has one qna.yaml file and one attribution.txt file. | From 32d8586600362f932874e03e4bf7125817153c40 Mon Sep 17 00:00:00 2001 From: Colin O'Sullivan Date: Thu, 17 Oct 2024 00:24:42 +0100 Subject: [PATCH 3/4] adding new SDG guidelines doc Signed-off-by: Colin O'Sullivan --- mkdocs.yml | 1 + 1 file changed, 1 insertion(+) diff --git a/mkdocs.yml b/mkdocs.yml index f7995c7..f3ab09c 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -34,6 +34,7 @@ nav: - Knowledge Overview: taxonomy/knowledge/index.md - Knowledge Guide: taxonomy/knowledge/guide.md - Knowledge Contribution Details: taxonomy/knowledge/contribution_details.md + - Synthetic Data Generation Guidelines: taxonomy/sdg_guidelines.md - Community Model Build: - About Community Model Build: cmb/index.md - Community Model Build Process: cmb/build_process.md From 0e7ee56d4c508c51e371fce3b00458d586c7fd84 Mon Sep 17 00:00:00 2001 From: Colin O'Sullivan Date: Tue, 17 Dec 2024 16:11:01 +0000 Subject: [PATCH 4/4] minor grammar changes Signed-off-by: Colin O'Sullivan --- docs/taxonomy/sdg_guidelines.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/taxonomy/sdg_guidelines.md b/docs/taxonomy/sdg_guidelines.md index 3e757a0..06bce3a 100644 --- a/docs/taxonomy/sdg_guidelines.md +++ b/docs/taxonomy/sdg_guidelines.md @@ -7,13 +7,13 @@ The OIC initiative promoted discussion spaces for sharing and recording best pra | Things to Avoid | Historically, LLMs are bad at math | Do not provide complex math calculations in Q&A seeds. | | Context | What if the knowledge is based on documents that do not exist in the base model? | In the qna.yaml file, you can pass context as a chunk of information (text from the document that the Q&A pairs are based on). Adding context to the skill QnA file might generate better quality data. | | Formatting (front-end specific; may change) | How should data be formatted in the Q&A file, especially tables? | Currently, only files in Markdown format are supported. If the files are in any other format, they must be converted to Markdown. For automatic converters, we recommend experimenting with output formats such as ‘markdown_strict’, ‘asciidoc’ and ‘gfm’. | -| Intervene in Training | Can I use generated json files to prompt-tuning (watsonx.ai) or using HuggingFace directly? | The output of SDG is in json format and can also be used for traditional fine-tuning. | +| Intervene in Training | Can I use the generated JSON files for prompt tuning (watsonx.ai) or with HuggingFace directly? | The output of SDG is in JSON format and can also be used for traditional fine-tuning. | | Quantities | How many seeds should I provide? | The number of seeds:
          1. Generating ~300 QnA pairs from ~5 seed examples is recommended by the InstructLab product team.
          2. Knowledge requires 5 pieces of context from the document, each with 3 QnA pairs specific to that context piece, for a total of 15 QnA pairs.
          3. We tried with fewer than 300 QnA pairs but found the QnA quality only satisfactory.
            | -| Task description | | The task description should be grounded in the domain/document. - Due to this recommendation we should keep in mind that much complex cases can be splitted into smaller chunks of information | +| Task description | | The task description should be grounded in the domain/document. Given this recommendation, keep in mind that complex cases can be split into smaller chunks of information. | | Context size limitation | What is the size limit of the context window in the Q&A file (qna.yaml)? | There is a ~2300 context size limit in the qna.yaml file. It is advised to keep the ground-truth answers concise to respect this limit. | | After Training | How to check the quality of the data in a large data set generated from the qna.yaml file? | You do not have to check all of the synthetic data generated by the SDG process. After generating synthetic data internally, the IBM Research team samples it to check quality (there is no need to check every example, especially for an extensive set). | -| Quality | How to measure the quality of the obtained data? | To evaluate SDG, you can use following rating range (1-5):
              1. Irrelevant answer.
              2. Relevant but not close to the ground truth; the model might be hallucinating.
              3. Relevant, model not hallucinating, partly matching the ground truth.
              4. Relevant, model not hallucinating, but adding irrelevant/unnecessary information.
              5. Excellent answer, matching closely with the ground truth.
                | +| Quality | How to measure the quality of the obtained data? | To evaluate SDG, you can use the following rating range (1-5):
                  1. Irrelevant answer.
                  2. Relevant but not close to the ground truth; the model might be hallucinating.
                  3. Relevant, model not hallucinating, partly matching the ground truth.
                  4. Relevant, model not hallucinating, but adding irrelevant/unnecessary information.
                  5. Excellent answer, matching closely with the ground truth.
                    | -| Manual validation | During manual validation, the reviewer identified the entity and intent of the question and searched for the same entity and intent in the corresponding document. The document information was provided in the generated JSON file. | At the next step, manual search validated it the steps or definitions contained in the answer were indeed in the corresponding document. | +| Manual validation | During manual validation, the reviewer identifies the entity and intent of the question and searches for the same entity and intent in the corresponding document. The document information is provided in the generated JSON file. | In the next step, a manual search validates that the steps or definitions contained in the answer are indeed in the corresponding document. | | Quality of data generated | How to enhance the quality of the data generated in SDG? |
                      • Task description: Add a task description relevant to the knowledge documents. We tried adding a custom task description to improve the SDG.
                      • Prompt template: Add guidelines for the instructions and output to stick to document-related keywords and to generate instructions from tables. We specifically added these instructions to the prompt template.
                      • Chunk word count: Increase the word count to increase the chunk sizes taken from the documents in SDG for long-answer Q&A pairs.
                      • Rouge threshold: To enforce data quality more strictly, one can increase the rouge threshold in the ilab generate command.
                      • The question and answer pairs should be complete sentences, well formed, and use proper grammar. Longer answers are better than a short yes or no.
                      • Also, the question and answer pairs must be answerable from the associated context.
                      | | Formatting | How many leaf nodes are kept in the taxonomy after adding a Q&A file? | The documents are kept in a single leaf node, which has one qna.yaml file and one attribution.txt file. |
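
### Illustrative examples

The Context and Quantities rows above describe what a knowledge seed should contain; here is a minimal sketch of a knowledge qna.yaml entry to make that concrete. The field names follow the InstructLab knowledge schema (version 3); the domain, document reference, and question-and-answer text are hypothetical placeholders, not content from this repository.

```yaml
# Hypothetical knowledge qna.yaml sketch; every value below is a placeholder.
version: 3
domain: astronomy
created_by: your_github_username
seed_examples:
  # The Quantities guideline above calls for 5 context blocks,
  # each with 3 question-and-answer pairs; only one block is shown here.
  - context: |
      Phoenix is a minor constellation in the southern sky.
      It was first depicted on a celestial atlas in 1603.
    questions_and_answers:
      - question: Where in the sky is the Phoenix constellation located?
        answer: Phoenix is a minor constellation in the southern sky.
      - question: When was the Phoenix constellation first depicted?
        answer: The Phoenix constellation was first depicted on a celestial atlas in 1603.
      - question: What kind of constellation is Phoenix?
        answer: Phoenix is a minor constellation.
document_outline: Overview of the Phoenix constellation
document:
  repo: https://github.com/your-org/your-knowledge-docs
  commit: <commit SHA of the document version>
  patterns:
    - phoenix.md
```

Note how each answer is a complete sentence that can be answered from its context block alone, and how each context block stays well under the ~2300 context size limit mentioned in the table.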
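For the Formatting row's advice to convert source documents to Markdown, one possible route (an illustration, not the only supported tool) is pandoc, which can emit the variants named above: for example, `pandoc -f docx -t gfm report.docx -o report.md`, or `-t markdown_strict` for a stricter dialect. Check your converter's documentation for the exact formats it supports.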
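As a hypothetical application of the 1-5 Quality scale: if the ground truth says a refund takes 5-7 business days, a generated answer of "refunds are processed within a week, and shipping fees are non-refundable" would rate a 4 (relevant and not hallucinating, but adding unnecessary information), while an answer about resetting account passwords would rate a 1.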
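On the Rouge threshold bullet: ROUGE scores how similar a newly generated sample is to existing ones, and the generation command filters samples against a threshold on that score. An invocation along the lines of `ilab data generate --rouge-threshold 0.9` is an assumption here, not a documented guarantee; the exact subcommand, flag name, and the direction in which the threshold tightens filtering vary by InstructLab version, so verify against the `--help` output of your installation.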