Enhancement Proposal: Taxonomy Updates #188

Open · wants to merge 9 commits into `main`
3 changes: 3 additions & 0 deletions .gitignore
@@ -12,3 +12,6 @@ venv
 .\#*
 .projectile
 .dir-locals.el
+
+# PyCharm/JetBrains
+.idea
11 changes: 11 additions & 0 deletions .spellcheck-en-custom.txt
@@ -15,6 +15,8 @@ backend
 backends
 benchmarking
 Bhandwaldar
+bikeshed
+bikeshedding
 brainer
 Cappi
 checkpointing
@@ -120,6 +122,7 @@ Langgraph
 leaderboard
 lifecycle
 lignment
+linter
 LLM
 LLMs
 llms
@@ -145,6 +148,8 @@ MMLU
 modularize
 modularized
 MTEB
+multiline
+multilines
 Murdock
 mvp
 Nakamura
@@ -193,6 +198,7 @@ PyPI
 pyproject
 PyTorch
 pyyaml
+PyYAML
 qlora
 qna
 quantized
@@ -201,6 +207,8 @@ Radeon
 RDNA
 README
 rebase
+Reframe
+reframe
 Ren
 repo
 repos
@@ -214,6 +222,7 @@ SaaS
 safetensor
 safetensors
 Salawu
+Santamaria
 scalable
 SDG
 sdg
@@ -271,8 +280,10 @@ Vishnoi
 vLLM
 vllm
 watsonx
+whitespace
 Wikisource
 wikisql
+Wiktionary
 WIP
 WSL
 xcode
93 changes: 93 additions & 0 deletions docs/taxonomy-revamp-2025.md
@@ -0,0 +1,93 @@
---
author: Laura Santamaria (@nimbinatus)
date: 05 February 2025
status: proposed
---

## Issues

Our taxonomy tree structure and knowledge/skill file structure were designed with upstream taxonomy submissions in mind. An end user working with a taxonomy locally through InstructLab still has to follow all of those requirements, which increases the complexity of their work.

The end user typically gets hung up on where to place the file in a massive file tree where sub-branches are not defined. For someone not working in the upstream taxonomy, this is basically bikeshedding[^1]. The only requirement for the SDG process is sorting things into `knowledge` and `skills`.

The user experience of working with the `qna.yaml` file is poor for a handful of reasons:

- Most of the fields in the `qna.yaml` file are unnecessary save for use in the upstream taxonomy.
- YAML is a notoriously complex, loose format with a lot of potholes.
- YAML files of different specifications parse completely differently (e.g., 1.2 vs 1.1).
- Note that PyYAML, our base tool [^2], parses YAML 1.1, not 1.2. There is a long way to go[^3] to support 1.2, which has been the latest spec since 2009. As such, even if someone were to search the Internet for a solution because they are not familiar with YAML, they likely will stumble across 1.2 solutions that don't work for 1.1.
- There are at least 9 different ways to indicate a multi-line string in YAML[^4], depending on which block scalar indicator[^5] is used and which block chomping indicator[^6] is used (this does **not** count the indentation indicator[^7]!). Then there are double-quoted flow scalar multilines[^8] and single-quoted flow scalar multilines[^9], which can cause more problems.
- The linting system, intended to ensure the YAML file is readable by the SDG process, adds more burden on the non-technical user.
- The linter for YAML enforces an 80-character line length by default. That makes sense for code read in a terminal, but not for a typical end user who is used to writing paragraphs in a rich text editor.
- The linter also complains about trailing whitespace, another common failure whose cause the typical end user won't understand.
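To make two of these pitfalls concrete, here is a small illustrative sketch (not real `qna.yaml` content) using PyYAML, which parses YAML 1.1:

```python
# Two YAML pitfalls demonstrated with PyYAML (a YAML 1.1 parser).
# The snippets below are hypothetical examples, not real qna.yaml content.
import yaml

# Pitfall 1: YAML 1.1 coerces unquoted yes/no/on/off into booleans.
# A YAML 1.2 parser would read these as plain strings.
doc = yaml.safe_load("answer: no")
print(doc["answer"])  # False, not the string "no"

# Pitfall 2: block scalar style and chomping change the parsed string.
literal_clip = yaml.safe_load("text: |\n  line one\n  line two\n")["text"]
folded_strip = yaml.safe_load("text: >-\n  line one\n  line two\n")["text"]
print(repr(literal_clip))  # 'line one\nline two\n' (newlines kept, one trailing)
print(repr(folded_strip))  # 'line one line two'    (folded, trailing newline stripped)
```

A writer who picks `|` versus `>-` gets materially different strings, and nothing in the file warns them.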

From a code perspective,

- We are already using JSON in the data mixing process in SDG[^10].
- Docling also exports JSON as input and output[^11].
- JSON is also much more friendly to UI work, which is a primary path we would like people to use.

Overall, the `qna.yaml` file needs to have fewer knobs and fewer pitfalls.

The process of writing question and answer sets is also much like writing reading comprehension sets for a standardized exam. It would be better to frame this hands-on part of the process as similar to the passage and question sets from English reading comprehension exams.

## Proposed solutions

To fix the user experience when working with data, I propose the following ideas. In general, the basic idea is "Keep It Simple; Make It Tick."

### Use a schema field rather than directory tree structure

Drop the folder structure in favor of a schema field for submission type and even domain, if necessary. The schema field can be filled in automatically when a user selects `knowledge` or `skill` in the UI.
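As a purely hypothetical sketch, a flat submission carrying its type in a schema field might look like the following. All field names here are illustrative assumptions, not a settled format:

```python
# Hypothetical sketch: a flat submission record that carries its type in a
# "schema" field instead of encoding it in a directory path.
# Field names are assumptions for illustration only.
import json

submission = {
    "schema": "knowledge",  # selected in the UI: "knowledge" or "skill"
    "categorization": "animals/reptiles/turtles",  # optional, free-form grouping
    "seed_examples": [
        {
            "question": "How long do sea turtles live?",
            "answer": "Many sea turtles live 50 years or more.",
        },
    ],
}

# The file on disk would simply be this object serialized as JSON.
print(json.dumps(submission, indent=2))
```

The optional `categorization` field shows one way the old directory hierarchy could survive as metadata without being required.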
Contributor:

Another thing the directory tree structure lets you do is organize at multiple levels, e.g., knowledge/animals/reptiles/turtles. I don't know how important that ability is. I guess we could encode that in the schema field (e.g., have the value be knowledge/animals/reptiles/turtles), but I think it would make more sense to have a separate field for this purpose, e.g., the schema could be knowledge and the categorization could be animals/reptiles/turtles.

If we don't have the ability to nest knowledge and skills into groups and subgroups, then I would say it is not really a taxonomy.

Member Author:

I honestly think that's really the question here, and I honestly don't know. I can't find reference to it being useful in the code, but I may be missing something. Is the whole nested structure really necessary?

Contributor:

Here's one reason why it could turn out to be useful eventually: Imagine you have too much data and you want to do some sort of subset selection, either before running SDG (generate less data to begin with) or after running SDG (remove some of the data you just generated). In either case you might want to use the hierarchy to constrain what gets discarded. Maybe you want to ensure the coverage stays as wide as possible, so you would rather discard half of the stuff in animals/reptiles and half of the stuff in animals/insects than discard all of the stuff in one and none of the stuff in the other. Or maybe you want to go the other way: you would rather teach to mastery in some subjects than teach to partial mastery in more subjects, in which case you would rather discard an entire branch of the taxonomy than discard lots of pieces of lots of different branches. In either case, having that structure would be useful.

On the other hand, in general it is not a great idea to include stuff in a schema just because there's a hypothetical argument for why it might be useful someday. So maybe we should put more thought into whether we really do want to do these things in the foreseeable future before making a commitment one way or the other.

Contributor:

Oh, here's another related reason: say you are building a large taxonomy (either the community taxonomy or maybe a private internal taxonomy for some customer). Maybe in phase one the taxonomy covers a large variety of topics, and then in phase two it gets so large that you decide what you really want is to split the taxonomy into pieces so that you can train separate models for each piece. In that case, having the hierarchy could be super useful, because you've already done the work of splitting it up as you added stuff to the taxonomy, instead of waiting until you want to do the splitting and finding yourself with a huge undifferentiated blob to deal with. That might be a better argument for keeping the hierarchy than the one I mentioned above around subsetting.

Contributor:

File directory structure could end up being metadata on ingested chunks as well, which could be useful in multiple ways.

Commenter:

Would it work if we allow but not require the directory structure, while making sure the filepath is part of metadata? Then it's up to the user to decide how they want to organize their files.

Member Author:

I think so. The important thing, from the code perspective, is that the actual directory names and nested tree structure are not really used other than to name temporary files and identify where there are new files. I've dug through the codebase for SDG. A user can organize their files in a taxonomy like science > biology > ornithology or in one like documents > reports > lab, and the SDG process doesn't know the difference. A taxonomy with knowledge categorization is a human construct for human use, and I don't think we should define that for the end user if we want this to be generalizable to any end user's system. We can encourage sharing that information with us through metadata, sure, but it's not something we use otherwise.

In short, I think maintaining a document store and version control for a user, whether forcing a specific way to do it or providing a system ourselves, is outside of our remit with the project.


### Streamline the schema

Make `created_by`, `domain`, and `document_outline` optional fields. Enforce those fields for the upstream taxonomy through documentation, CI, and review rather than require them for everyone.
Commenter:

nit: though -> through
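One way this split could work, sketched here with assumed field names: a base check that requires only the essentials for everyone, plus a stricter required set that upstream CI applies on top.

```python
# Hypothetical sketch of two-tier field requirements: local users need only
# the essentials, while upstream CI checks the full set. The field names
# (schema, seed_examples, etc.) are assumptions for illustration only.
BASE_REQUIRED = {"schema", "seed_examples"}
UPSTREAM_REQUIRED = BASE_REQUIRED | {"created_by", "domain", "document_outline"}

def missing_fields(doc: dict, upstream: bool = False) -> set:
    """Return the set of required fields absent from a submission."""
    required = UPSTREAM_REQUIRED if upstream else BASE_REQUIRED
    return required - doc.keys()

local_doc = {"schema": "knowledge", "seed_examples": []}
print(missing_fields(local_doc))                 # empty: fine locally
print(missing_fields(local_doc, upstream=True))  # the upstream-only fields
```

The same document passes locally and fails upstream review, which keeps the burden on the community repo rather than on every end user.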


### Switch to JSON and Markdown for the `qna.yaml` document
Member:

The issue with using markdown in our projects is that you often want to include markdown in your content, so doing it this way will break the parsing unless we adopt a hacky solution.

JSON I like but it's not easy for humans. I do think we should support it, but I wouldn't rely on it as a primary "user-facing" format.

Member Author:

For using Markdown, the idea is a simple user-friendly writing format outside of the UI (which would be the happy path in my mind). Since we already require the markdown to be parsed when it gets taken into the SDG process, I think it makes sense to let the user see the output rendered, allowing them some generic ability to understand whether they converted things correctly.

JSON is the transport format, basically. An end user writing seed examples would not see it unless they explicitly choose to skip the markdown/UI formats and write it themselves. I would imagine that someone making that decision knows how to use it. But this allows us to add a programmatic guardrail in the conversion process to handle any and all translation layers, and I think that the conversion could be just as well handled by Docling as anything else, further reducing our dependency footprint and attack surface.

Member (@RobotSail, Feb 8, 2025):

@nimbinatus I appreciate your thorough response, however I'm still having a hard time understanding your idea around Markdown. Could you please provide an example of what you're thinking of? I think that will help clear up my confusion.

I agree with your point around JSON, that makes sense 👍


Allow the user to use Markdown in a WYSIWYG experience, and then use a Markdown-to-JSON converter to handle the conversion to a code-friendly format.
Contributor:

I am a little worried that this would lead to a situation where it is easy to write Markdown that looks good but is not usable by SDG after it goes through the Markdown-to-JSON converter. However, I am not sure how big a deal that would be. What might help provide some intuition is an example of what such Markdown would look like; that would give more of a sense of how hard it is to write SDG-compliant Markdown.

Contributor:

(Also, FWIW, I agree with the premise here that YAML is a big part of the problem with our current taxonomy format.)


Markdown is very user-friendly, and converters handle a lot of the issues with encoding and special characters that happen in situations like working in other languages. We don't have to worry about a linter arguing about line length with the end user, and we wouldn't have to think about whether the user used tabs or spaces or forgot to strip whitespace at the end of a line.

This would also make it a lot easier for the UI to work with contributions. JSON plays well with JavaScript overall without importing more libraries and creating dependency issues, and Python has a very good built-in for working with JSON files.

Users who decide to write the JSON directly, skipping the converter, are likely already familiar with the format. There are fewer pitfalls and less likelihood of tooling choices affecting meaning, since the JSON standard has not changed since 2017 and has barely changed from the original specification.
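To make the converter idea concrete, here is a rough sketch using only the Python standard library. The Markdown convention (`## Q:` headings followed by answer text) and the JSON field names are assumptions for illustration, not a proposed format:

```python
# Rough sketch of a Markdown-to-JSON converter for Q&A seed examples.
# The "## Q:" heading convention and the output field names are
# illustrative assumptions, not a settled InstructLab format.
import json
import re

def markdown_to_seed_examples(md: str) -> list:
    """Split '## Q: ...' headings into question/answer pairs."""
    pairs = []
    # Each block starts at a "## Q:" heading and runs until the next one.
    for block in re.split(r"^## Q:\s*", md, flags=re.MULTILINE)[1:]:
        question, _, answer = block.partition("\n")
        pairs.append({"question": question.strip(), "answer": answer.strip()})
    return pairs

md = """## Q: How long do sea turtles live?
Many sea turtles live 50 years or more.

## Q: What do sea turtles eat?
Diets vary by species, from seagrass to jellyfish.
"""

print(json.dumps(markdown_to_seed_examples(md), indent=2))
```

A writer never touches quoting, indentation, or chomping indicators; line breaks and trailing whitespace inside an answer simply don't matter.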

### Reframe the Q&A writing process as a reading comprehension process
Member:

I think this is good for knowledge, I believe @abhi1092 is actually working on something along the lines of using natural concepts from reading comprehension for the Knowledge 1.5 pipeline.

For skills training, though, it's probably a bit different, since there we want the model to learn how to transform and permute different kinds of data.


Write documentation and tutorials based on existing tutorials on writing reading comprehension questions and example answers for standardized exams.

Most people can understand "reading to learn" versus "learning to read" type questions. The new, streamlined schema that matches the simplest needs could help here, along with a solid set of docs and tutorials on how to write reading comprehension sets. We could borrow heavily from the standard tutorials for writing standardized exams that are out there for free and already battle-tested.
Contributor:

Question answering and reading comprehension are different tasks. Reading comprehension is about pulling information from a given text, while question answering can involve reasoning and synthesis when a question is not concretely answered by some piece of text.

Is training on reading comprehension questions enough? @jwm4, thoughts?


## Unaddressed concerns

The issue of needing a git repository for document storage is possibly out of scope for this document. However, I'm adding it as something that may need its own ADR/dev doc. The end user experience of needing a git repository is needlessly complex and still follows the model of the upstream taxonomy and community model build. A user working with InstructLab locally does not need the version tracking provided by git and likely already has a document storage system. I propose changing the general idea from a git repository to a simple address, whether that's local storage, remote storage, or a version-controlled repository. Make it more flexible.
Contributor:

Under "Streamline the schema", you proposed "Enforce those fields for the upstream taxonomy though documentation, CI, and review rather than require them for everyone." I feel we could do the same for the git repository issue: i.e., community submissions are required to link to git and provide commit hashes but that's enforced in the community repo not in InstructLab.

Member Author:

Agreed. I have that in the Unaddressed Concerns section as I'm not sure whether that's in scope for this specific change. Which team owns that part of the process isn't quite clear, so I didn't want to go tromping on toes...

Contributor:

I guess I would prefer to go the other way: Include the git stuff in this proposal and then see if anyone pushes back. If they have a good argument for keeping the git enforcement in InstructLab (instead of enforcing it at the community level), they can make it.



[^1]: The story of the bikeshed is a common metaphor. The story goes that a group working on approvals for the construction plan of a nuclear power plant gets stuck on what color to paint the bike shed at one of the entrances to the plant. Multiple meetings, with heated arguments, are scheduled to hash out the color of the bike shed; meanwhile, the rest of the plan for the power plant is never examined in detail or critiqued. When faced with complex decisions on other systems, people have an easier time evaluating and forming an opinion on something as trivial as a bike shed's color. [Wiktionary entry](https://en.wiktionary.org/wiki/bikeshedding)
[^2]: [Our tooling dependencies](https://github.com/instructlab/schema/blob/main/pyproject.toml#L27-L30)
[^3]: [yaml/pyyaml#486](https://github.com/yaml/pyyaml/issues/486)
[^4]: You can experience this issue in action with the interactive experience on [yaml-multiline.info](https://yaml-multiline.info/).
[^5]: [YAML Spec v1.2.2 on block scalar styles](https://yaml.org/spec/1.2.2/#81-block-scalar-styles)
> YAML provides two block scalar styles, literal and folded. Each provides a different trade-off between readability and expressive power.
[^6]: [YAML Spec v1.2.2 on block chomping indicators](https://yaml.org/spec/1.2.2/#8112-block-chomping-indicator)
> Chomping controls how final line breaks and trailing empty lines are interpreted. YAML provides three chomping methods:
[^7]: [YAML Spec v1.2.2 on block indentation indicators](https://yaml.org/spec/1.2.2/#8111-block-indentation-indicator)
> Every block scalar has a content indentation level. The content of the block scalar excludes a number of leading spaces on each line up to the content indentation level.
>
> If a block scalar has an indentation indicator, then the content indentation level of the block scalar is equal to the indentation level of the block scalar plus the integer value of the indentation indicator character.
>
> If no indentation indicator is given, then the content indentation level is equal to the number of leading spaces on the first non-empty line of the contents. If there is no non-empty line then the content indentation level is equal to the number of spaces on the longest line.
>
>It is an error if any non-empty line does not begin with a number of spaces greater than or equal to the content indentation level.
>
>It is an error for any of the leading empty lines to contain more spaces than the first non-empty line.
>
>A YAML processor should only emit an explicit indentation indicator for cases where detection will fail.
[^8]: [YAML Spec v1.2.2 on the double-quoted flow scalar](https://yaml.org/spec/1.2.2/#double-quoted-style)
> In a multi-line double-quoted scalar, line breaks are subject to flow line folding, which discards any trailing white space characters. It is also possible to escape the line break character. In this case, the escaped line break is excluded from the content and any trailing white space characters that precede the escaped line break are preserved. Combined with the ability to escape white space characters, this allows double-quoted lines to be broken at arbitrary positions.
[^9]: [YAML Spec v1.2.2 on the single-quoted flow scalar](https://yaml.org/spec/1.2.2/#single-quoted-style)
> In addition, it is only possible to break a long single-quoted line where a space character is surrounded by non-spaces. [...] All leading and trailing white space characters are excluded from the content. Each continuation line must therefore contain at least one non-space character. Empty lines, if any, are consumed as part of the line folding.
[^10]: stuff
[^11]: [The Docling documentation](https://ds4sd.github.io/docling/supported_formats/) notes docling supports JSON-serialized Docling Documents and Markdown as input and JSON and Markdown as outputs.