docs/taxonomy-revamp-2025.md (+14 -14)
@@ -8,22 +8,22 @@ status: proposed
Our taxonomy tree structure and knowledge/skill file structure was designed with upstream taxonomy submissions in mind. An end user working with a taxonomy locally using InstructLab has to follow all of those requirements, increasing complexity of their work.
- The end user typically gets hung up on where to place the file in a massive file tree where sub-branches are not defined. For someone not working in the upstream taxonomy, this is basically bikeshedding[^bikeshed]. The only requirement for the SDG process is sorting things into `knowledge` and `skills`.
+ The end user typically gets hung up on where to place the file in a massive file tree where sub-branches are not defined. For someone not working in the upstream taxonomy, this is basically bikeshedding[^bike shed]. The only requirement for the SDG process is sorting things into `knowledge` and `skills`.
The user experience of working with the `qna.yaml` file is poor for a handful of reasons:
- Most of the fields in the `qna.yaml` file are unnecessary save for use in the upstream taxonomy.
- YAML is a notoriously complex, loose format with a lot of potholes.
- YAML files of different specifications parse completely differently (e.g., 1.2 vs 1.1).
- Note that PyYAML, our base tool [^tooling], parses YAML 1.1, not 1.2. There is a long way to go[^PyYAML] to support 1.2, which has been the latest spec since 2009. As such, even if someone were to search the Internet for a solution because they are not familiar with YAML, they likely will stumble across 1.2 solutions that don't work for 1.1.
- - There are at least 9 different ways to indicate a multi-line string in YAML[^9ways], depending on which block scalar indicator[^blockscalar] is used and which block chomping indicator[^blockchomping] is used (this does **not** count the indentation indicator[^blockindentation]!). Then there are double-quoted flow scalar multilines[^doublequotedflowscalar] and single-quoted flow scalar multilines[^singlequotedflowscalar], which can cause more problems.
+ - There are at least 9 different ways to indicate a multi-line string in YAML[^9 ways], depending on which block scalar indicator[^block scalar] is used and which block chomping indicator[^block chomping] is used (this does **not** count the indentation indicator[^block indentation]!). Then there are double-quoted flow scalar multilines[^double quoted flow scalar] and single-quoted flow scalar multilines[^single quoted flow scalar], which can cause more problems.
- The linting system, intended to ensure the YAML file is readable by the SDG process, adds more burden on the non-technical user.
- The linter for YAML enforces an 80-character line length by default. That makes sense if you're working on code read from a terminal, but not to a typical end user used to working with rich text editors for a reading comprehension experience working with paragraphs.
- The linter also complains about trailing whitespace, another common thing that the typical end user won't understand why everything is failing.
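The two lint complaints described above can be reproduced with a small sketch. This is not the project's linter; `lint_lines` and the sample paragraph are hypothetical, and the only assumption carried over from the text is the 80-character default. It shows how a single paragraph typed the way a rich text editor user would type it trips both rules at once.

```python
MAX_LINE_LENGTH = 80  # default limit mentioned in the text above


def lint_lines(text, max_length=MAX_LINE_LENGTH):
    """Return (line_number, problem) pairs for over-long lines and trailing whitespace.

    Hypothetical helper for illustration only; it mimics the two rules
    described in the text, not the behavior of any specific linter.
    """
    problems = []
    for number, line in enumerate(text.split("\n"), start=1):
        if len(line) > max_length:
            problems.append((number, "line too long"))
        if line != line.rstrip():
            problems.append((number, "trailing whitespace"))
    return problems


# A paragraph typed as one long line, with a stray space at the end.
paragraph = (
    "A paragraph typed in a rich text editor is usually one long line, "
    "often with a stray space at the end. \n"
    "A short second line."
)

for number, problem in lint_lines(paragraph):
    print(f"line {number}: {problem}")
```

Neither complaint says anything about whether the content is usable by the SDG process, which is the burden the list above describes.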
From a code perspective,
- - We are already using JSON in the datamixing process in SDG[^datamixing].
+ - We are already using JSON in the data mixing process in SDG[^data mixing].
- Docling also exports JSON as input and output[^docling].
- JSON is also much more friendly to UI work, which is a primary path we would like people to use.
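As a rough illustration of why JSON is easier to handle from code and UI work, the sketch below round-trips a multi-line answer through Python's standard `json` module. The `question` and `answer` field names are hypothetical and are not the proposed schema. The point is that JSON has one string syntax, so a line break is always written as an escaped `\n` inside double quotes.

```python
import json

# Hypothetical record; the field names are illustrative, not the proposed schema.
record = {
    "question": "What is a block scalar?",
    "answer": "A multi-line string style in YAML.\nIt comes in literal and folded forms.",
}

# Serialize: the embedded newline becomes the two-character escape \n.
encoded = json.dumps(record, indent=2)
print(encoded)

# Parse it back: the escape turns into a real newline again.
decoded = json.loads(encoded)
assert decoded == record
```

A UI can produce and consume this form without exposing any quoting or indentation choices to the person writing the question and answer set.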
@@ -45,13 +45,13 @@ The process of writing question and answer sets also is more like writing readin
- Write documentation and tutorials based on existing tutorials on writing reading comprehension questions and example answers for standardized exams.
- Most people can understand reading to learn versus learning to read type questions. The new, streamlined schema that matches the most simple needs could help here along with a solid set of docs and tutorials on how to write reading comprehension sets. We could borrow heavily from the standard tutorials for writing standardized exams that are out there for free and already battle-tested.
- [^bikeshed]: The story of the bikeshed is a common metaphor. The story goes that a group that is working on the approvals for the construction plan of a nuclear power plant gets stuck on what color to paint the bike shed at one of the entrances to the plant. Mutliple meetings are scheduled to hash out the issue of the color of the bike shed, with heated arguments. However, the rest of the plan for the power plant is not examined in detail or critiqued. People have an easier time evaluating and having an opinion on something that is as trivial as a bike shed's color when faced with complex decisions on other systems. https://en.wiktionary.org/wiki/bikeshedding
- [^9ways]: You can experience this issue in action with the interactive experience on https://yaml-multiline.info/.
+ [^bike shed]: The story of the bikeshed is a common metaphor. The story goes that a group that is working on the approvals for the construction plan of a nuclear power plant gets stuck on what color to paint the bike shed at one of the entrances to the plant. Multiple meetings are scheduled to hash out the issue of the color of the bike shed, with heated arguments. However, the rest of the plan for the power plant is not examined in detail or critiqued. People have an easier time evaluating and having an opinion on something that is as trivial as a bike shed's color when faced with complex decisions on other systems. [Wiktionary entry](https://en.wiktionary.org/wiki/bikeshedding)
+ [^9 ways]: You can experience this issue in action with the interactive experience on [yaml-multiline.info](https://yaml-multiline.info/).
+ [^block scalar]: [YAML Spec v1.2.2 on block scalar styles](https://yaml.org/spec/1.2.2/#81-block-scalar-styles)
> YAML provides two block scalar styles, literal and folded. Each provides a different trade-off between readability and expressive power.
[^block indentation]: [YAML Spec v1.2.2 on block indentation indicators](https://yaml.org/spec/1.2.2/#8111-block-indentation-indicator)
> Every block scalar has a content indentation level. The content of the block scalar excludes a number of leading spaces on each line up to the content indentation level.
>
> If a block scalar has an indentation indicator, then the content indentation level of the block scalar is equal to the indentation level of the block scalar plus the integer value of the indentation indicator character.
@@ -63,11 +63,11 @@ The process of writing question and answer sets also is more like writing readin
>It is an error for any of the leading empty lines to contain more spaces than the first non-empty line.
>
>A YAML processor should only emit an explicit indentation indicator for cases where detection will fail.
[^double quoted flow scalar]: [YAML Spec v1.2.2 on the double-quoted flow scalar](https://yaml.org/spec/1.2.2/#double-quoted-style)
> In a multi-line double-quoted scalar, line breaks are subject to flow line folding, which discards any trailing white space characters. It is also possible to escape the line break character. In this case, the escaped line break is excluded from the content and any trailing white space characters that precede the escaped line break are preserved. Combined with the ability to escape white space characters, this allows double-quoted lines to be broken at arbitrary positions.
[^single quoted flow scalar]: [YAML Spec v1.2.2 on the single-quoted flow scalar](https://yaml.org/spec/1.2.2/#single-quoted-style)
> In addition, it is only possible to break a long single-quoted line where a space character is surrounded by non-spaces. [...] All leading and trailing white space characters are excluded from the content. Each continuation line must therefore contain at least one non-space character. Empty lines, if any, are consumed as part of the line folding.
- [^datamixing]: stuff
- [^docling]: https://ds4sd.github.io/docling/supported_formats/ notes docling supports JSON-serialized Docling Documents and Markdown as input and JSON and Markdown as outputs.
[^tooling]: https://github.com/instructlab/schema/blob/main/pyproject.toml#L27-L30 and
+ [^data mixing]: stuff
+ [^docling]: [The Docling documentation](https://ds4sd.github.io/docling/supported_formats/) notes docling supports JSON-serialized Docling Documents and Markdown as input and JSON and Markdown as outputs.