From f6b1e5c6286f22889c04e44d2df7183aebbaea06 Mon Sep 17 00:00:00 2001 From: Laura Santamaria Date: Wed, 5 Feb 2025 09:38:46 -0600 Subject: [PATCH 1/9] build(gitignore): hide jetbrains directory from git Signed-off-by: Laura Santamaria --- .gitignore | 3 +++ 1 file changed, 3 insertions(+) diff --git a/.gitignore b/.gitignore index 29b8574d..8c708da7 100644 --- a/.gitignore +++ b/.gitignore @@ -12,3 +12,6 @@ venv .\#* .projectile .dir-locals.el + +# PyCharm/JetBrains +.idea \ No newline at end of file From 5db42867dfb312827befc732ee821b1f1176ae04 Mon Sep 17 00:00:00 2001 From: Laura Santamaria Date: Wed, 5 Feb 2025 12:49:59 -0600 Subject: [PATCH 2/9] feat(taxonomy): start proposal on revamp of taxonomy concept Signed-off-by: Laura Santamaria --- docs/taxonomy-revamp-2025.md | 72 ++++++++++++++++++++++++++++++++++++ 1 file changed, 72 insertions(+) create mode 100644 docs/taxonomy-revamp-2025.md diff --git a/docs/taxonomy-revamp-2025.md b/docs/taxonomy-revamp-2025.md new file mode 100644 index 00000000..c843199c --- /dev/null +++ b/docs/taxonomy-revamp-2025.md @@ -0,0 +1,72 @@ +--- +author: Laura Santamaria (@nimbinatus) +date: 05 February 2025 +status: proposed +--- + +## Issues + +Our taxonomy tree structure and knowledge/skill file structure was designed with upstream taxonomy submissions in mind. An end user working with a taxonomy locally using InstructLab has to follow all of those requirements, increasing complexity of their work. + +The end user typically gets hung up on where to place the file in a massive file tree where sub-branches are not defined. For someone not working in the upstream taxonomy, this is basically bikeshedding[^bikeshed]. The only requirement for the SDG process is sorting things into `knowledge` and `skills`. + +The user experience of working with the `qna.yaml` file is poor for a handful of reasons: + +- Most of the fields in the `qna.yaml` file are unnecessary save for use in the upstream taxonomy. 
+- YAML is a notoriously complex, loose format with a lot of potholes. + - YAML files of different specifications parse completely differently (e.g., 1.2 vs 1.1). + - Note that PyYAML, our base tool, parses YAML 1.1, not 1.2. There is a long way to go[^PyYAML] to support 1.2, which has been the latest spec since 2009. As such, even if someone were to search the Internet for a solution because they are not familiar with YAML, they likely will stumble across 1.2 solutions that don't work for 1.1. + - There are at least 9 different ways to indicate a multi-line string in YAML[^9ways], depending on which block scalar indicator[^blockscalar] is used and which block chomping indicator[^blockchomping] is used (this does **not** count the indentation indicator[^blockindentation]!). Then there are double-quoted flow scalar multilines[^doublequotedflowscalar] and single-quoted flow scalar multilines[^singlequotedflowscalar], which can cause more problems. +- The linting system, intended to ensure the YAML file is readable by the SDG process, adds more burden on the non-technical user. + - The linter for YAML enforces an 80-character line length by default. That makes sense if you're working on code read from a terminal, but not to a typical end user used to working with rich text editors for a reading comprehension experience working with paragraphs. + - The linter also complains about trailing whitespace, another common thing that the typical end user won't understand why everything is failing. + +From a code perspective, + +- We are already using JSON in the datamixing process in SDG[^datamixing]. +- Docling also exports JSON as input and output[^docling]. +- JSON is also much more friendly to UI work, which is a primary path we would like people to use. + +Overall, the `qna.yaml` file needs to have fewer knobs and fewer pitfalls. + +The process of writing question and answer sets also is more like writing reading comprehension sets from a standardized exam. 
It would be better to frame this hands-on part of the process as similar to the passage and question sets from English reading comprehension exams + +## Proposed solution + +- Drop the folder structure in favor of a schema field for submission type and even domain, if necessary. + - The schema field can be entered automatically via the UI through a user selecting `knowledge` or `skill`. +- Streamline the schema. + - Make `created_by`, `domain`, and `document_outline` optional fields. Enforce those fields for the upstream taxonomy though documentation, CI, and review rather than require them for everyone. + - +- Switch to JSON and Markdown for the `qna.yaml` document. + - Allow the user to use Markdown in a WYSIWYG experience, and then use a Markdown-to-JSON converter to handle the conversion to a code-friendly format. + - Markdown is very user-friendly, and converters handle a lot of the issues with encoding and special characters that happen in situations like working in other languages. We don't have to worry about a linter arguing about line length with the end user, and we wouldn't have to think about whether the user used tabs or spaces or forgot to strip whitespace at the end of a line. +- Frame the Q&A writing process as a reading comprehension process. + - Write documentation and tutorials based on existing tutorials on writing reading comprehension questions and example answers for standardized exams. + - Most people can understand reading to learn versus learning to read type questions. The new, streamlined schema that matches the most simple needs could help here along with a solid set of docs and tutorials on how to write reading comprehension sets. We could borrow heavily from the standard tutorials for writing standardized exams that are out there for free and already battle-tested. + +[^bikeshed]: The story of the bikeshed is a common metaphor. 
The story goes that a group that is working on the approvals for the construction plan of a nuclear power plant gets stuck on what color to paint the bike shed at one of the entrances to the plant. Mutliple meetings are scheduled to hash out the issue of the color of the bike shed, with heated arguments. However, the rest of the plan for the power plant is not examined in detail or critiqued. People have an easier time evaluating and having an opinion on something that is as trivial as a bike shed's color when faced with complex decisions on other systems. https://en.wiktionary.org/wiki/bikeshedding +[^9ways]: You can experience this issue in action with the interactive experience on https://yaml-multiline.info/. +[^blockscalar]: https://yaml.org/spec/1.2.2/#81-block-scalar-styles + > YAML provides two block scalar styles, literal and folded. Each provides a different trade-off between readability and expressive power. +[^blockchomping]: https://yaml.org/spec/1.2.2/#8112-block-chomping-indicator + > Chomping controls how final line breaks and trailing empty lines are interpreted. YAML provides three chomping methods: +[^blockindentation]: https://yaml.org/spec/1.2.2/#8111-block-indentation-indicator + > Every block scalar has a content indentation level. The content of the block scalar excludes a number of leading spaces on each line up to the content indentation level. + > + > If a block scalar has an indentation indicator, then the content indentation level of the block scalar is equal to the indentation level of the block scalar plus the integer value of the indentation indicator character. + > + > If no indentation indicator is given, then the content indentation level is equal to the number of leading spaces on the first non-empty line of the contents. If there is no non-empty line then the content indentation level is equal to the number of spaces on the longest line. 
+ > + >It is an error if any non-empty line does not begin with a number of spaces greater than or equal to the content indentation level. + > + >It is an error for any of the leading empty lines to contain more spaces than the first non-empty line. + > + >A YAML processor should only emit an explicit indentation indicator for cases where detection will fail. +[^doublequotedflowscalar]: https://yaml.org/spec/1.2.2/#double-quoted-style + > In a multi-line double-quoted scalar, line breaks are subject to flow line folding, which discards any trailing white space characters. It is also possible to escape the line break character. In this case, the escaped line break is excluded from the content and any trailing white space characters that precede the escaped line break are preserved. Combined with the ability to escape white space characters, this allows double-quoted lines to be broken at arbitrary positions. +[^singlequotedflowscalar]: https://yaml.org/spec/1.2.2/#single-quoted-style + > In addition, it is only possible to break a long single-quoted line where a space character is surrounded by non-spaces. [...] All leading and trailing white space characters are excluded from the content. Each continuation line must therefore contain at least one non-space character. Empty lines, if any, are consumed as part of the line folding. +[^datamixing]: stuff +[^docling]: https://ds4sd.github.io/docling/supported_formats/ notes docling supports JSON-serialized Docling Documents and Markdown as input and JSON and Markdown as outputs. 
+[^PyYAML]: https://github.com/yaml/pyyaml/issues/486 \ No newline at end of file From c4b8e315eddb1658c9834792301d6ba9cc3b8581 Mon Sep 17 00:00:00 2001 From: Laura Santamaria Date: Wed, 5 Feb 2025 15:28:07 -0600 Subject: [PATCH 3/9] docs(more): more things Signed-off-by: Laura Santamaria --- docs/taxonomy-revamp-2025.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/docs/taxonomy-revamp-2025.md b/docs/taxonomy-revamp-2025.md index c843199c..5fe0c7d7 100644 --- a/docs/taxonomy-revamp-2025.md +++ b/docs/taxonomy-revamp-2025.md @@ -15,7 +15,7 @@ The user experience of working with the `qna.yaml` file is poor for a handful of - Most of the fields in the `qna.yaml` file are unnecessary save for use in the upstream taxonomy. - YAML is a notoriously complex, loose format with a lot of potholes. - YAML files of different specifications parse completely differently (e.g., 1.2 vs 1.1). - - Note that PyYAML, our base tool, parses YAML 1.1, not 1.2. There is a long way to go[^PyYAML] to support 1.2, which has been the latest spec since 2009. As such, even if someone were to search the Internet for a solution because they are not familiar with YAML, they likely will stumble across 1.2 solutions that don't work for 1.1. + - Note that PyYAML, our base tool [^tooling], parses YAML 1.1, not 1.2. There is a long way to go[^PyYAML] to support 1.2, which has been the latest spec since 2009. As such, even if someone were to search the Internet for a solution because they are not familiar with YAML, they likely will stumble across 1.2 solutions that don't work for 1.1. - There are at least 9 different ways to indicate a multi-line string in YAML[^9ways], depending on which block scalar indicator[^blockscalar] is used and which block chomping indicator[^blockchomping] is used (this does **not** count the indentation indicator[^blockindentation]!). 
Then there are double-quoted flow scalar multilines[^doublequotedflowscalar] and single-quoted flow scalar multilines[^singlequotedflowscalar], which can cause more problems. - The linting system, intended to ensure the YAML file is readable by the SDG process, adds more burden on the non-technical user. - The linter for YAML enforces an 80-character line length by default. That makes sense if you're working on code read from a terminal, but not to a typical end user used to working with rich text editors for a reading comprehension experience working with paragraphs. @@ -69,4 +69,5 @@ The process of writing question and answer sets also is more like writing readin > In addition, it is only possible to break a long single-quoted line where a space character is surrounded by non-spaces. [...] All leading and trailing white space characters are excluded from the content. Each continuation line must therefore contain at least one non-space character. Empty lines, if any, are consumed as part of the line folding. [^datamixing]: stuff [^docling]: https://ds4sd.github.io/docling/supported_formats/ notes docling supports JSON-serialized Docling Documents and Markdown as input and JSON and Markdown as outputs. 
-[^PyYAML]: https://github.com/yaml/pyyaml/issues/486 \ No newline at end of file +[^PyYAML]: https://github.com/yaml/pyyaml/issues/486 +[^tooling]: https://github.com/instructlab/schema/blob/main/pyproject.toml#L27-L30 and \ No newline at end of file From 4ed5233f00227610e4120ebc9fbf03e35b01d8c9 Mon Sep 17 00:00:00 2001 From: Laura Santamaria Date: Wed, 5 Feb 2025 17:00:33 -0600 Subject: [PATCH 4/9] style(lint): go away linter Signed-off-by: Laura Santamaria --- .spellcheck-en-custom.txt | 2 ++ docs/taxonomy-revamp-2025.md | 28 ++++++++++++++-------------- 2 files changed, 16 insertions(+), 14 deletions(-) diff --git a/.spellcheck-en-custom.txt b/.spellcheck-en-custom.txt index 38ab79d5..ffd79008 100644 --- a/.spellcheck-en-custom.txt +++ b/.spellcheck-en-custom.txt @@ -15,6 +15,7 @@ backend backends benchmarking Bhandwaldar +bikeshedding brainer Cappi checkpointing @@ -214,6 +215,7 @@ SaaS safetensor safetensors Salawu +Santamaria scalable SDG sdg diff --git a/docs/taxonomy-revamp-2025.md b/docs/taxonomy-revamp-2025.md index 5fe0c7d7..c1f224b7 100644 --- a/docs/taxonomy-revamp-2025.md +++ b/docs/taxonomy-revamp-2025.md @@ -8,7 +8,7 @@ status: proposed Our taxonomy tree structure and knowledge/skill file structure was designed with upstream taxonomy submissions in mind. An end user working with a taxonomy locally using InstructLab has to follow all of those requirements, increasing complexity of their work. -The end user typically gets hung up on where to place the file in a massive file tree where sub-branches are not defined. For someone not working in the upstream taxonomy, this is basically bikeshedding[^bikeshed]. The only requirement for the SDG process is sorting things into `knowledge` and `skills`. +The end user typically gets hung up on where to place the file in a massive file tree where sub-branches are not defined. For someone not working in the upstream taxonomy, this is basically bikeshedding[^bike shed]. 
The only requirement for the SDG process is sorting things into `knowledge` and `skills`. The user experience of working with the `qna.yaml` file is poor for a handful of reasons: @@ -16,14 +16,14 @@ The user experience of working with the `qna.yaml` file is poor for a handful of - YAML is a notoriously complex, loose format with a lot of potholes. - YAML files of different specifications parse completely differently (e.g., 1.2 vs 1.1). - Note that PyYAML, our base tool [^tooling], parses YAML 1.1, not 1.2. There is a long way to go[^PyYAML] to support 1.2, which has been the latest spec since 2009. As such, even if someone were to search the Internet for a solution because they are not familiar with YAML, they likely will stumble across 1.2 solutions that don't work for 1.1. - - There are at least 9 different ways to indicate a multi-line string in YAML[^9ways], depending on which block scalar indicator[^blockscalar] is used and which block chomping indicator[^blockchomping] is used (this does **not** count the indentation indicator[^blockindentation]!). Then there are double-quoted flow scalar multilines[^doublequotedflowscalar] and single-quoted flow scalar multilines[^singlequotedflowscalar], which can cause more problems. + - There are at least 9 different ways to indicate a multi-line string in YAML[^9 ways], depending on which block scalar indicator[^block scalar] is used and which block chomping indicator[^block chomping] is used (this does **not** count the indentation indicator[^block indentation]!). Then there are double-quoted flow scalar multilines[^double quoted flow scalar] and single-quoted flow scalar multilines[^single quoted flow scalar], which can cause more problems. - The linting system, intended to ensure the YAML file is readable by the SDG process, adds more burden on the non-technical user. - The linter for YAML enforces an 80-character line length by default. 
That makes sense if you're working on code read from a terminal, but not to a typical end user used to working with rich text editors for a reading comprehension experience working with paragraphs. - The linter also complains about trailing whitespace, another common thing that the typical end user won't understand why everything is failing. From a code perspective, -- We are already using JSON in the datamixing process in SDG[^datamixing]. +- We are already using JSON in the data mixing process in SDG[^data mixing]. - Docling also exports JSON as input and output[^docling]. - JSON is also much more friendly to UI work, which is a primary path we would like people to use. @@ -45,13 +45,13 @@ The process of writing question and answer sets also is more like writing readin - Write documentation and tutorials based on existing tutorials on writing reading comprehension questions and example answers for standardized exams. - Most people can understand reading to learn versus learning to read type questions. The new, streamlined schema that matches the most simple needs could help here along with a solid set of docs and tutorials on how to write reading comprehension sets. We could borrow heavily from the standard tutorials for writing standardized exams that are out there for free and already battle-tested. -[^bikeshed]: The story of the bikeshed is a common metaphor. The story goes that a group that is working on the approvals for the construction plan of a nuclear power plant gets stuck on what color to paint the bike shed at one of the entrances to the plant. Mutliple meetings are scheduled to hash out the issue of the color of the bike shed, with heated arguments. However, the rest of the plan for the power plant is not examined in detail or critiqued. People have an easier time evaluating and having an opinion on something that is as trivial as a bike shed's color when faced with complex decisions on other systems. 
https://en.wiktionary.org/wiki/bikeshedding -[^9ways]: You can experience this issue in action with the interactive experience on https://yaml-multiline.info/. -[^blockscalar]: https://yaml.org/spec/1.2.2/#81-block-scalar-styles +[^bike shed]: The story of the bikeshed is a common metaphor. The story goes that a group that is working on the approvals for the construction plan of a nuclear power plant gets stuck on what color to paint the bike shed at one of the entrances to the plant. Multiple meetings are scheduled to hash out the issue of the color of the bike shed, with heated arguments. However, the rest of the plan for the power plant is not examined in detail or critiqued. People have an easier time evaluating and having an opinion on something that is as trivial as a bike shed's color when faced with complex decisions on other systems. [Wiktionary entry](https://en.wiktionary.org/wiki/bikeshedding) +[^9 ways]: You can experience this issue in action with the interactive experience on [yaml-multiline.info](https://yaml-multiline.info/). +[^block scalar]: [YAML Spec v1.2.2 on block scalar styles](https://yaml.org/spec/1.2.2/#81-block-scalar-styles) > YAML provides two block scalar styles, literal and folded. Each provides a different trade-off between readability and expressive power. -[^blockchomping]: https://yaml.org/spec/1.2.2/#8112-block-chomping-indicator +[^block chomping]: [YAML Spec v1.2.2 on block chomping indicators](https://yaml.org/spec/1.2.2/#8112-block-chomping-indicator) > Chomping controls how final line breaks and trailing empty lines are interpreted. YAML provides three chomping methods: -[^blockindentation]: https://yaml.org/spec/1.2.2/#8111-block-indentation-indicator +[^block indentation]: [YAML Spec v1.2.2 on block indentation indicators](https://yaml.org/spec/1.2.2/#8111-block-indentation-indicator) > Every block scalar has a content indentation level. 
The content of the block scalar excludes a number of leading spaces on each line up to the content indentation level. > > If a block scalar has an indentation indicator, then the content indentation level of the block scalar is equal to the indentation level of the block scalar plus the integer value of the indentation indicator character. @@ -63,11 +63,11 @@ The process of writing question and answer sets also is more like writing readin >It is an error for any of the leading empty lines to contain more spaces than the first non-empty line. > >A YAML processor should only emit an explicit indentation indicator for cases where detection will fail. -[^doublequotedflowscalar]: https://yaml.org/spec/1.2.2/#double-quoted-style +[^double quoted flow scalar]: [YAML Spec v1.2.2 on the double-quoted flow scalar](https://yaml.org/spec/1.2.2/#double-quoted-style) > In a multi-line double-quoted scalar, line breaks are subject to flow line folding, which discards any trailing white space characters. It is also possible to escape the line break character. In this case, the escaped line break is excluded from the content and any trailing white space characters that precede the escaped line break are preserved. Combined with the ability to escape white space characters, this allows double-quoted lines to be broken at arbitrary positions. -[^singlequotedflowscalar]: https://yaml.org/spec/1.2.2/#single-quoted-style +[^single quoted flow scalar]: [YAML Spec v1.2.2 on the single-quoted flow scalar](https://yaml.org/spec/1.2.2/#single-quoted-style) > In addition, it is only possible to break a long single-quoted line where a space character is surrounded by non-spaces. [...] All leading and trailing white space characters are excluded from the content. Each continuation line must therefore contain at least one non-space character. Empty lines, if any, are consumed as part of the line folding. 
-[^datamixing]: stuff -[^docling]: https://ds4sd.github.io/docling/supported_formats/ notes docling supports JSON-serialized Docling Documents and Markdown as input and JSON and Markdown as outputs. -[^PyYAML]: https://github.com/yaml/pyyaml/issues/486 -[^tooling]: https://github.com/instructlab/schema/blob/main/pyproject.toml#L27-L30 and \ No newline at end of file +[^data mixing]: stuff +[^docling]: [The Docling documentation](https://ds4sd.github.io/docling/supported_formats/) notes docling supports JSON-serialized Docling Documents and Markdown as input and JSON and Markdown as outputs. +[^PyYAML]: [yaml/pyyaml#486](https://github.com/yaml/pyyaml/issues/486) +[^tooling]: [Our tooling dependencies](https://github.com/instructlab/schema/blob/main/pyproject.toml#L27-L30) \ No newline at end of file From 5f289457d361085f678f38933a48eceb8fd68c33 Mon Sep 17 00:00:00 2001 From: Laura Santamaria Date: Wed, 5 Feb 2025 18:04:54 -0600 Subject: [PATCH 5/9] feat(solutions): reorganize and add to solutions Signed-off-by: Laura Santamaria --- docs/taxonomy-revamp-2025.md | 46 ++++++++++++++++++++++++++---------- 1 file changed, 33 insertions(+), 13 deletions(-) diff --git a/docs/taxonomy-revamp-2025.md b/docs/taxonomy-revamp-2025.md index c1f224b7..442df953 100644 --- a/docs/taxonomy-revamp-2025.md +++ b/docs/taxonomy-revamp-2025.md @@ -31,19 +31,39 @@ Overall, the `qna.yaml` file needs to have fewer knobs and fewer pitfalls. The process of writing question and answer sets also is more like writing reading comprehension sets from a standardized exam. It would be better to frame this hands-on part of the process as similar to the passage and question sets from English reading comprehension exams -## Proposed solution - -- Drop the folder structure in favor of a schema field for submission type and even domain, if necessary. - - The schema field can be entered automatically via the UI through a user selecting `knowledge` or `skill`. -- Streamline the schema. 
- - Make `created_by`, `domain`, and `document_outline` optional fields. Enforce those fields for the upstream taxonomy though documentation, CI, and review rather than require them for everyone. - - -- Switch to JSON and Markdown for the `qna.yaml` document. - - Allow the user to use Markdown in a WYSIWYG experience, and then use a Markdown-to-JSON converter to handle the conversion to a code-friendly format. - - Markdown is very user-friendly, and converters handle a lot of the issues with encoding and special characters that happen in situations like working in other languages. We don't have to worry about a linter arguing about line length with the end user, and we wouldn't have to think about whether the user used tabs or spaces or forgot to strip whitespace at the end of a line. -- Frame the Q&A writing process as a reading comprehension process. - - Write documentation and tutorials based on existing tutorials on writing reading comprehension questions and example answers for standardized exams. - - Most people can understand reading to learn versus learning to read type questions. The new, streamlined schema that matches the most simple needs could help here along with a solid set of docs and tutorials on how to write reading comprehension sets. We could borrow heavily from the standard tutorials for writing standardized exams that are out there for free and already battle-tested. +## Proposed solutions + +To fix the user experience when working with data, I propose the following ideas. In general, the basic idea is "Keep It Simple; Make It Tick." + +### Use a schema field rather than directory tree structure + +Drop the folder structure in favor of a schema field for submission type and even domain, if necessary. The schema field can be entered automatically via the UI through a user selecting `knowledge` or `skill`. + +### Streamline the schema + +Make `created_by`, `domain`, and `document_outline` optional fields. 
Enforce those fields for the upstream taxonomy through documentation, CI, and review rather than require them for everyone. + +### Switch to JSON and Markdown for the `qna.yaml` document + +Allow the user to use Markdown in a WYSIWYG experience, and then use a Markdown-to-JSON converter to handle the conversion to a code-friendly format. + +Markdown is very user-friendly, and converters handle a lot of the issues with encoding and special characters that happen in situations like working in other languages. We wouldn't have to worry about a linter arguing about line length with the end user, and we wouldn't have to think about whether the user used tabs or spaces or forgot to strip whitespace at the end of a line. + +This would also make it a lot easier for the UI to work with contributions. JSON plays well with JavaScript overall without importing more libraries and creating dependency issues, and Python has a very good built-in for working with JSON files. + +Users who decide to build the file without the converter are likely familiar with JSON. There are also fewer pitfalls and less likelihood of tooling choices impacting meaning, as JSON's standard has not changed since 2017 and has barely changed from the original standard. + +### Reframe the Q&A writing process as a reading comprehension process + +Write documentation and tutorials based on existing tutorials on writing reading comprehension questions and example answers for standardized exams. + +Most people can understand reading-to-learn versus learning-to-read questions. The new, streamlined schema that matches the simplest needs could help here along with a solid set of docs and tutorials on how to write reading comprehension sets. We could borrow heavily from the standard tutorials for writing standardized exams that are out there for free and already battle-tested. + +## Unaddressed concerns + +The issue of needing a git repository for document storage is possibly out of scope of this document. 
However, I'm adding it as something that may need its own ADR/dev doc. The end user experience of needing a git repository is needlessly complex and also still follows the idea of the upstream taxonomy and community model build. A user working with InstructLab locally does not need the version tracking provided by git and likely already has a document storage system. I propose changing the general idea from a git repository to a simple address, whether that's local storage, remote storage, or a version-controlled repository. Make it more flexible. + +∎ [^bike shed]: The story of the bikeshed is a common metaphor. The story goes that a group that is working on the approvals for the construction plan of a nuclear power plant gets stuck on what color to paint the bike shed at one of the entrances to the plant. Multiple meetings are scheduled to hash out the issue of the color of the bike shed, with heated arguments. However, the rest of the plan for the power plant is not examined in detail or critiqued. People have an easier time evaluating and having an opinion on something that is as trivial as a bike shed's color when faced with complex decisions on other systems. [Wiktionary entry](https://en.wiktionary.org/wiki/bikeshedding) [^9 ways]: You can experience this issue in action with the interactive experience on [yaml-multiline.info](https://yaml-multiline.info/). 
From 70d58df4525174e1038a1162a6aa805b4c2d5ddc Mon Sep 17 00:00:00 2001 From: Laura Santamaria Date: Wed, 5 Feb 2025 18:12:27 -0600 Subject: [PATCH 6/9] fix(footnotes): patch the footnotes by just using numbers (should fix spelling, too) Signed-off-by: Laura Santamaria --- docs/taxonomy-revamp-2025.md | 32 ++++++++++++++++---------------- 1 file changed, 16 insertions(+), 16 deletions(-) diff --git a/docs/taxonomy-revamp-2025.md b/docs/taxonomy-revamp-2025.md index 442df953..afdda68b 100644 --- a/docs/taxonomy-revamp-2025.md +++ b/docs/taxonomy-revamp-2025.md @@ -8,23 +8,23 @@ status: proposed Our taxonomy tree structure and knowledge/skill file structure was designed with upstream taxonomy submissions in mind. An end user working with a taxonomy locally using InstructLab has to follow all of those requirements, increasing complexity of their work. -The end user typically gets hung up on where to place the file in a massive file tree where sub-branches are not defined. For someone not working in the upstream taxonomy, this is basically bikeshedding[^bike shed]. The only requirement for the SDG process is sorting things into `knowledge` and `skills`. +The end user typically gets hung up on where to place the file in a massive file tree where sub-branches are not defined. For someone not working in the upstream taxonomy, this is basically bikeshedding[^1]. The only requirement for the SDG process is sorting things into `knowledge` and `skills`. The user experience of working with the `qna.yaml` file is poor for a handful of reasons: - Most of the fields in the `qna.yaml` file are unnecessary save for use in the upstream taxonomy. - YAML is a notoriously complex, loose format with a lot of potholes. - YAML files of different specifications parse completely differently (e.g., 1.2 vs 1.1). - - Note that PyYAML, our base tool [^tooling], parses YAML 1.1, not 1.2. There is a long way to go[^PyYAML] to support 1.2, which has been the latest spec since 2009. 
As such, even if someone were to search the Internet for a solution because they are not familiar with YAML, they likely will stumble across 1.2 solutions that don't work for 1.1. - - There are at least 9 different ways to indicate a multi-line string in YAML[^9 ways], depending on which block scalar indicator[^block scalar] is used and which block chomping indicator[^block chomping] is used (this does **not** count the indentation indicator[^block indentation]!). Then there are double-quoted flow scalar multilines[^double quoted flow scalar] and single-quoted flow scalar multilines[^single quoted flow scalar], which can cause more problems. + - Note that PyYAML, our base tool [^2], parses YAML 1.1, not 1.2. There is a long way to go[^3] to support 1.2, which has been the latest spec since 2009. As such, even if someone were to search the Internet for a solution because they are not familiar with YAML, they likely will stumble across 1.2 solutions that don't work for 1.1. + - There are at least 9 different ways to indicate a multi-line string in YAML[^4], depending on which block scalar indicator[^5] is used and which block chomping indicator[^6] is used (this does **not** count the indentation indicator[^7]!). Then there are double-quoted flow scalar multilines[^8] and single-quoted flow scalar multilines[^9], which can cause more problems. - The linting system, intended to ensure the YAML file is readable by the SDG process, adds more burden on the non-technical user. - The linter for YAML enforces an 80-character line length by default. That makes sense if you're working on code read from a terminal, but not to a typical end user used to working with rich text editors for a reading comprehension experience working with paragraphs. - The linter also complains about trailing whitespace, another common thing that the typical end user won't understand why everything is failing. 
From a code perspective, -- We are already using JSON in the data mixing process in SDG[^data mixing]. -- Docling also exports JSON as input and output[^docling]. +- We are already using JSON in the data mixing process in SDG[^10]. +- Docling also exports JSON as input and output[^11]. - JSON is also much more friendly to UI work, which is a primary path we would like people to use. Overall, the `qna.yaml` file needs to have fewer knobs and fewer pitfalls. @@ -65,13 +65,15 @@ The issue of needing a git repository for document storage is possibly out of sc ∎ -[^bike shed]: The story of the bikeshed is a common metaphor. The story goes that a group that is working on the approvals for the construction plan of a nuclear power plant gets stuck on what color to paint the bike shed at one of the entrances to the plant. Multiple meetings are scheduled to hash out the issue of the color of the bike shed, with heated arguments. However, the rest of the plan for the power plant is not examined in detail or critiqued. People have an easier time evaluating and having an opinion on something that is as trivial as a bike shed's color when faced with complex decisions on other systems. [Wiktionary entry](https://en.wiktionary.org/wiki/bikeshedding) -[^9 ways]: You can experience this issue in action with the interactive experience on [yaml-multiline.info](https://yaml-multiline.info/). -[^block scalar]: [YAML Spec v1.2.2 on block scalar styles](https://yaml.org/spec/1.2.2/#81-block-scalar-styles) +[^1]: The story of the bikeshed is a common metaphor. The story goes that a group that is working on the approvals for the construction plan of a nuclear power plant gets stuck on what color to paint the bike shed at one of the entrances to the plant. Multiple meetings are scheduled to hash out the issue of the color of the bike shed, with heated arguments. However, the rest of the plan for the power plant is not examined in detail or critiqued. 
People have an easier time evaluating and having an opinion on something that is as trivial as a bike shed's color when faced with complex decisions on other systems. [Wiktionary entry](https://en.wiktionary.org/wiki/bikeshedding) +[^2]: [Our tooling dependencies](https://github.com/instructlab/schema/blob/main/pyproject.toml#L27-L30) +[^3]: [yaml/pyyaml#486](https://github.com/yaml/pyyaml/issues/486) +[^4]: You can experience this issue in action with the interactive experience on [yaml-multiline.info](https://yaml-multiline.info/). +[^5]: [YAML Spec v1.2.2 on block scalar styles](https://yaml.org/spec/1.2.2/#81-block-scalar-styles) > YAML provides two block scalar styles, literal and folded. Each provides a different trade-off between readability and expressive power. -[^block chomping]: [YAML Spec v1.2.2 on block chomping indicators](https://yaml.org/spec/1.2.2/#8112-block-chomping-indicator) +[^6]: [YAML Spec v1.2.2 on block chomping indicators](https://yaml.org/spec/1.2.2/#8112-block-chomping-indicator) > Chomping controls how final line breaks and trailing empty lines are interpreted. YAML provides three chomping methods: -[^block indentation]: [YAML Spec v1.2.2 on block indentation indicators](https://yaml.org/spec/1.2.2/#8111-block-indentation-indicator) +[^7]: [YAML Spec v1.2.2 on block indentation indicators](https://yaml.org/spec/1.2.2/#8111-block-indentation-indicator) > Every block scalar has a content indentation level. The content of the block scalar excludes a number of leading spaces on each line up to the content indentation level. > > If a block scalar has an indentation indicator, then the content indentation level of the block scalar is equal to the indentation level of the block scalar plus the integer value of the indentation indicator character. 
@@ -83,11 +85,9 @@ The issue of needing a git repository for document storage is possibly out of sc >It is an error for any of the leading empty lines to contain more spaces than the first non-empty line. > >A YAML processor should only emit an explicit indentation indicator for cases where detection will fail. -[^double quoted flow scalar]: [YAML Spec v1.2.2 on the double-quoted flow scalar](https://yaml.org/spec/1.2.2/#double-quoted-style) +[^8]: [YAML Spec v1.2.2 on the double-quoted flow scalar](https://yaml.org/spec/1.2.2/#double-quoted-style) > In a multi-line double-quoted scalar, line breaks are subject to flow line folding, which discards any trailing white space characters. It is also possible to escape the line break character. In this case, the escaped line break is excluded from the content and any trailing white space characters that precede the escaped line break are preserved. Combined with the ability to escape white space characters, this allows double-quoted lines to be broken at arbitrary positions. -[^single quoted flow scalar]: [YAML Spec v1.2.2 on the single-quoted flow scalar](https://yaml.org/spec/1.2.2/#single-quoted-style) +[^9]: [YAML Spec v1.2.2 on the single-quoted flow scalar](https://yaml.org/spec/1.2.2/#single-quoted-style) > In addition, it is only possible to break a long single-quoted line where a space character is surrounded by non-spaces. [...] All leading and trailing white space characters are excluded from the content. Each continuation line must therefore contain at least one non-space character. Empty lines, if any, are consumed as part of the line folding. -[^data mixing]: stuff -[^docling]: [The Docling documentation](https://ds4sd.github.io/docling/supported_formats/) notes docling supports JSON-serialized Docling Documents and Markdown as input and JSON and Markdown as outputs. 
-[^PyYAML]: [yaml/pyyaml#486](https://github.com/yaml/pyyaml/issues/486) -[^tooling]: [Our tooling dependencies](https://github.com/instructlab/schema/blob/main/pyproject.toml#L27-L30) \ No newline at end of file +[^10]: stuff +[^11]: [The Docling documentation](https://ds4sd.github.io/docling/supported_formats/) notes docling supports JSON-serialized Docling Documents and Markdown as input and JSON and Markdown as outputs. From 179aef80321dfe561f037b9f1d9c6f40876adf5c Mon Sep 17 00:00:00 2001 From: Laura Santamaria Date: Thu, 6 Feb 2025 09:11:24 -0600 Subject: [PATCH 7/9] style(linter): add to spelling list and remove random whitespace Signed-off-by: Laura Santamaria --- .spellcheck-en-custom.txt | 9 +++++++++ docs/taxonomy-revamp-2025.md | 4 ++-- 2 files changed, 11 insertions(+), 2 deletions(-) diff --git a/.spellcheck-en-custom.txt b/.spellcheck-en-custom.txt index ffd79008..00d8cd24 100644 --- a/.spellcheck-en-custom.txt +++ b/.spellcheck-en-custom.txt @@ -15,6 +15,7 @@ backend backends benchmarking Bhandwaldar +bikeshed bikeshedding brainer Cappi @@ -121,6 +122,7 @@ Langgraph leaderboard lifecycle lignment +linter LLM LLMs llms @@ -146,6 +148,8 @@ MMLU modularize modularized MTEB +multiline +multilines Murdock mvp Nakamura @@ -194,6 +198,7 @@ PyPI pyproject PyTorch pyyaml +PyYAML qlora qna quantized @@ -202,6 +207,8 @@ Radeon RDNA README rebase +Reframe +reframe Ren repo repos @@ -273,8 +280,10 @@ Vishnoi vLLM vllm watsonx +whitespace Wikisource wikisql +Wiktionary WIP WSL xcode diff --git a/docs/taxonomy-revamp-2025.md b/docs/taxonomy-revamp-2025.md index afdda68b..580f2e5a 100644 --- a/docs/taxonomy-revamp-2025.md +++ b/docs/taxonomy-revamp-2025.md @@ -51,7 +51,7 @@ Markdown is very user-friendly, and converters handle a lot of the issues with e This would also make it a lot easier for the UI to work with contributions. 
JSON plays well with JavaScript overall without importing more libraries and creating dependency issues, and Python has a very good built-in for working with JSON files. -Users who decide to build it without needing the converter are likely familiar with JSON, and there are fewer pitfalls and less likelihood of tooling choices impacting meaning as JSON's standard has not changed since 2017, and barely changed from the original standard. +Users who decide to build it without needing the converter are likely familiar with JSON, and there are fewer pitfalls and less likelihood of tooling choices impacting meaning as the JSON standard has not changed since 2017, and barely changed from the original standard. ### Reframe the Q&A writing process as a reading comprehension process @@ -66,7 +66,7 @@ The issue of needing a git repository for document storage is possibly out of sc ∎ [^1]: The story of the bikeshed is a common metaphor. The story goes that a group that is working on the approvals for the construction plan of a nuclear power plant gets stuck on what color to paint the bike shed at one of the entrances to the plant. Multiple meetings are scheduled to hash out the issue of the color of the bike shed, with heated arguments. However, the rest of the plan for the power plant is not examined in detail or critiqued. People have an easier time evaluating and having an opinion on something that is as trivial as a bike shed's color when faced with complex decisions on other systems. [Wiktionary entry](https://en.wiktionary.org/wiki/bikeshedding) -[^2]: [Our tooling dependencies](https://github.com/instructlab/schema/blob/main/pyproject.toml#L27-L30) +[^2]: [Our tooling dependencies](https://github.com/instructlab/schema/blob/main/pyproject.toml#L27-L30) [^3]: [yaml/pyyaml#486](https://github.com/yaml/pyyaml/issues/486) [^4]: You can experience this issue in action with the interactive experience on [yaml-multiline.info](https://yaml-multiline.info/). 
[^5]: [YAML Spec v1.2.2 on block scalar styles](https://yaml.org/spec/1.2.2/#81-block-scalar-styles) From 1d730f90d7a7a0329f79c204db2517e0a752921c Mon Sep 17 00:00:00 2001 From: Laura Santamaria Date: Thu, 6 Feb 2025 09:25:14 -0600 Subject: [PATCH 8/9] fix(content): add a couple more thoughts that were kicking around Signed-off-by: Laura Santamaria --- docs/taxonomy-revamp-2025.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/docs/taxonomy-revamp-2025.md b/docs/taxonomy-revamp-2025.md index 580f2e5a..3d0cad76 100644 --- a/docs/taxonomy-revamp-2025.md +++ b/docs/taxonomy-revamp-2025.md @@ -8,12 +8,12 @@ status: proposed Our taxonomy tree structure and knowledge/skill file structure was designed with upstream taxonomy submissions in mind. An end user working with a taxonomy locally using InstructLab has to follow all of those requirements, increasing complexity of their work. -The end user typically gets hung up on where to place the file in a massive file tree where sub-branches are not defined. For someone not working in the upstream taxonomy, this is basically bikeshedding[^1]. The only requirement for the SDG process is sorting things into `knowledge` and `skills`. +The end user gets hung up on where to place the file in a massive file tree where sub-branches are not defined. For someone not working in the upstream taxonomy, this is basically bikeshedding[^1]. The only requirement for the SDG process is sorting things into `knowledge` and `skills`. The user experience of working with the `qna.yaml` file is poor for a handful of reasons: -- Most of the fields in the `qna.yaml` file are unnecessary save for use in the upstream taxonomy. -- YAML is a notoriously complex, loose format with a lot of potholes. +- Many of the fields in the `qna.yaml` file are unnecessary save for use in the upstream taxonomy. +- YAML is a notoriously complex, loose format with a lot of potholes. 
As a couple of examples: - YAML files of different specifications parse completely differently (e.g., 1.2 vs 1.1). - Note that PyYAML, our base tool [^2], parses YAML 1.1, not 1.2. There is a long way to go[^3] to support 1.2, which has been the latest spec since 2009. As such, even if someone were to search the Internet for a solution because they are not familiar with YAML, they likely will stumble across 1.2 solutions that don't work for 1.1. - There are at least 9 different ways to indicate a multi-line string in YAML[^4], depending on which block scalar indicator[^5] is used and which block chomping indicator[^6] is used (this does **not** count the indentation indicator[^7]!). Then there are double-quoted flow scalar multilines[^8] and single-quoted flow scalar multilines[^9], which can cause more problems. @@ -45,13 +45,13 @@ Make `created_by`, `domain`, and `document_outline` optional fields. Enforce tho ### Switch to JSON and Markdown for the `qna.yaml` document -Allow the user to use Markdown in a WYSIWYG experience, and then use a Markdown-to-JSON converter to handle the conversion to a code-friendly format. +Allow the user to use Markdown in a WYSIWYG experience, and then use a Markdown-to-JSON converter to handle the conversion to a code-friendly and machine-readable format. Markdown is very user-friendly, and converters handle a lot of the issues with encoding and special characters that happen in situations like working in other languages. We don't have to worry about a linter arguing about line length with the end user, and we wouldn't have to think about whether the user used tabs or spaces or forgot to strip whitespace at the end of a line. -This would also make it a lot easier for the UI to work with contributions. JSON plays well with JavaScript overall without importing more libraries and creating dependency issues, and Python has a very good built-in for working with JSON files. 
+This would also make it a lot easier for the UI to work with contributions. JSON plays well with JavaScript overall without importing more libraries and creating dependency issues, and Python has a very good built-in module (`json`) for working with JSON files. Fewer dependencies also mean a smaller attack surface. -Users who decide to build it without needing the converter are likely familiar with JSON, and there are fewer pitfalls and less likelihood of tooling choices impacting meaning as the JSON standard has not changed since 2017, and barely changed from the original standard. +Users who decide to build the file without the converter are likely familiar with JSON. There are also fewer pitfalls, and tooling choices are less likely to change meaning (e.g., where line breaks fall in paragraph structures), because the JSON standard has not changed since 2017 and has barely changed from the original standard. ### Reframe the Q&A writing process as a reading comprehension process @@ -61,7 +61,7 @@ Most people can understand reading to learn versus learning to read type questio ## Unaddressed concerns -The issue of needing a git repository for document storage is possibly out of scope of this document. However, I'm adding it as something that may need its own ADR/dev doc. The end user experience of needing a git repository is needlessly complex and also still follows the idea of the upstream taxonomy and community model build. A user working with InstructLab locally does not need the version tracking provided by git and likely probably already has a document storage system. I propose changing the general idea from a git repository to a simple address, whether that's local storage, remote storage, or a version-controlled repository. Make it more flexible. +The issue of needing a git repository for document storage is possibly out of scope of this document. However, I'm adding it as something that may need its own ADR/dev doc.
The end user experience of needing a git repository is needlessly complex and also still follows the idea of the upstream taxonomy and community model build. A user working with InstructLab locally does not need the version tracking provided by git and likely already has a document storage system. I propose changing the general idea from a git repository to a simple address, whether that's local storage, remote storage, or a version-controlled repository. Make it more flexible and extensible to match where someone chooses to store their data; one implementation example would be setting the address through an environment variable. This could also decouple the documentation process from the SDG process: the end-user subject-matter expert could create and upload content to a central store without ever touching InstructLab's tooling chains, and an end-user operations or development specialist could then run the InstructLab tooling separately. ∎ From c4785f491c07bac2f2bf3c807ceb1ad018183f99 Mon Sep 17 00:00:00 2001 From: Laura Santamaria Date: Thu, 6 Feb 2025 09:40:52 -0600 Subject: [PATCH 9/9] fix(footnote): fixed forgotten footnote reference Signed-off-by: Laura Santamaria --- docs/taxonomy-revamp-2025.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/taxonomy-revamp-2025.md b/docs/taxonomy-revamp-2025.md index 3d0cad76..9c32bdb3 100644 --- a/docs/taxonomy-revamp-2025.md +++ b/docs/taxonomy-revamp-2025.md @@ -89,5 +89,5 @@ The issue of needing a git repository for document storage is possibly out of sc > In a multi-line double-quoted scalar, line breaks are subject to flow line folding, which discards any trailing white space characters. It is also possible to escape the line break character. In this case, the escaped line break is excluded from the content and any trailing white space characters that precede the escaped line break are preserved.
Combined with the ability to escape white space characters, this allows double-quoted lines to be broken at arbitrary positions. [^9]: [YAML Spec v1.2.2 on the single-quoted flow scalar](https://yaml.org/spec/1.2.2/#single-quoted-style) > In addition, it is only possible to break a long single-quoted line where a space character is surrounded by non-spaces. [...] All leading and trailing white space characters are excluded from the content. Each continuation line must therefore contain at least one non-space character. Empty lines, if any, are consumed as part of the line folding. -[^10]: stuff +[^10]: Based on [the SDG documentation](https://github.com/instructlab/sdg/blob/main/docs/dataset_formats.md#data-mixing-recipes-and-mixed-dataset-output), the data mixing process outputs the mixed datasets in JSON formats. [^11]: [The Docling documentation](https://ds4sd.github.io/docling/supported_formats/) notes docling supports JSON-serialized Docling Documents and Markdown as input and JSON and Markdown as outputs.