Enhancement Proposal: Taxonomy Updates #188

nimbinatus · 2025-02-05T21:28:45Z

This proposal discusses how the concept of the taxonomy should change and evolve to meet user needs.

Signed-off-by: Laura Santamaria <nimbinatus@users.noreply.github.com>

jwm4

This basically looks good to me but I had some review comments too.

jwm4 · 2025-02-06T15:24:57Z

docs/taxonomy-revamp-2025.md

+
+### Use a schema field rather than directory tree structure
+
+Drop the folder structure in favor of a schema field for submission type and even domain, if necessary. The schema field can be entered automatically via the UI through a user selecting `knowledge` or `skill`.


Another thing the directory tree structure lets you do is organize at multiple levels, e.g., knowledge/animals/reptiles/turtles. I don't know how important that ability is. I guess we could encode that in the schema field (e.g., have the value be knowledge/animals/reptiles/turtles), but I think it would make more sense to have a separate field for this purpose, e.g., the schema could be knowledge and the categorization could be animals/reptiles/turtles.

If we don't have the ability to nest knowledge and skills into groups and subgroups, then I would say it is not really a taxonomy.
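To make the two-field idea concrete, here is a rough sketch of what an entry might look like if the type and the nesting were stored separately. The class name, the field names `schema` and `categorization`, and the helper methods are all made up for illustration; none of this is InstructLab's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaxonomyEntry:
    """Hypothetical record; field names are illustrative only."""

    schema: str          # "knowledge" or "skill"
    categorization: str  # slash-separated path, e.g. "animals/reptiles/turtles"

    def levels(self) -> list[str]:
        # Split the path into its nesting levels, ignoring empty segments.
        return [part for part in self.categorization.split("/") if part]

    def is_under(self, prefix: str) -> bool:
        # True when this entry sits at or below the given branch.
        wanted = [part for part in prefix.split("/") if part]
        return self.levels()[: len(wanted)] == wanted


entry = TaxonomyEntry(schema="knowledge", categorization="animals/reptiles/turtles")
```

With something like this, the nesting survives as data (`entry.is_under("animals/reptiles")`) without any requirement on where the file lives on disk.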

I honestly think that's really the question here, and I honestly don't know. I can't find reference to it being useful in the code, but I may be missing something. Is the whole nested structure really necessary?

Here's one reason why it could turn out to be useful eventually: Imagine you have too much data and you want to do some sort of subset selection either before running SDG (generate less data to begin with) or after running SDG (removing some of the data that you just generated). In either case you might want to use the hierarchy to constrain what gets discarded. Maybe you want to ensure the coverage stays as wide as possible, so you would rather discard half of the stuff in animals/reptiles and half of the stuff in animals/insects than discard all of the stuff in one and none of the stuff in the other. Or maybe you want to go the other way: you would rather teach to mastery in some subjects than teach to partial mastery in more subjects, so you would rather discard an entire branch of the taxonomy than discard lots of pieces of lots of different branches. In either case having that structure would be useful.
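The two selection strategies described here can be sketched in a few lines, assuming each item carries a slash-separated category path. The function names, the `depth` parameter, and the sample paths are hypothetical; SDG does not do this today.

```python
import random
from collections import defaultdict


def branch_of(path: str, depth: int = 2) -> str:
    # "animals/reptiles/turtles" -> "animals/reptiles" when depth is 2.
    return "/".join(path.split("/")[:depth])


def keep_wide(paths: list[str], fraction: float, seed: int = 0) -> list[str]:
    """Keep roughly `fraction` of the items in every branch (coverage stays wide)."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for path in paths:
        groups[branch_of(path)].append(path)
    kept = []
    for members in groups.values():
        # Always keep at least one item per branch, never more than it has.
        count = min(len(members), max(1, round(len(members) * fraction)))
        kept.extend(rng.sample(members, count))
    return kept


def keep_deep(paths: list[str], branches_to_keep: set[str]) -> list[str]:
    """Keep whole branches and discard the rest (fewer subjects, taught to mastery)."""
    return [p for p in paths if branch_of(p) in branches_to_keep]
```

Both functions only need the path string, so they would work equally well whether the hierarchy comes from a directory tree or from a metadata field.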

On the other hand, in general it is not a great idea to include stuff in a schema just because there's a hypothetical argument for why it might be useful someday. So maybe we should put more thought into whether we really do want to do these things in the foreseeable future before making a commitment one way or the other.

Oh, here's another related reason: say you are building a large taxonomy (either the community taxonomy or maybe a private internal taxonomy for some customer). Maybe in phase one the taxonomy covers a large variety of topics, and then in phase two it gets so large that you decide what you really want is to split up the taxonomy into pieces so that you can train separate models for each piece. In that case, having the hierarchy could be super useful because you've already done the work to split it up into pieces as you added stuff to the taxonomy, instead of waiting until you want to do the splitting, at which point you have a huge undifferentiated blob to deal with. That might be a better argument for keeping the hierarchy than the one that I mentioned above around subsetting.

File directory structure could end up being metadata on ingested chunks as well, which could be useful in multiple ways.

Would it work if we allow but not require the directory structure, while making sure the filepath is part of metadata? Then it's up to the user to decide how they want to organize their files.

I think so. The important thing, from the code perspective, is that the actual directory names and nested tree structure are not really used other than to name temporary files and identify where there are new files. I've dug through the codebase for SDG. A user can organize their files in a taxonomy like science > biology > ornithology or in one like documents > reports > lab, and the SDG process doesn't know the difference. A taxonomy with knowledge categorization is a human construct for human use, and I don't think we should define that for the end user if we want this to be generalizable to any end user's system. We can encourage sharing that information with us through metadata, sure, but it's not something we use otherwise.

In short, I think maintaining a document store and version control for a user, whether forcing a specific way to do it or providing a system ourselves, is outside of our remit with the project.
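As a rough illustration of the "allow but not require" idea from this thread, the sketch below walks whatever folder layout the user chose and records each file's relative path as metadata without interpreting it. The function name and the dictionary keys are assumptions for the example, not existing InstructLab code.

```python
from pathlib import Path


def collect_documents(root: Path) -> list[dict]:
    """Record each file's path relative to `root` as plain metadata."""
    records = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root)
            records.append({
                "name": path.name,
                # The folder layout is preserved but never interpreted.
                "filepath": relative.as_posix(),
                "folders": list(relative.parts[:-1]),
            })
    return records
```

A flat folder and a deeply nested one go through the same code path; the only difference is how many entries end up in `folders`.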

jwm4 · 2025-02-06T15:28:55Z

docs/taxonomy-revamp-2025.md

+
+### Switch to JSON and Markdown for the `qna.yaml` document
+
+Allow the user to use Markdown in a WYSIWYG experience, and then use a Markdown-to-JSON converter to handle the conversion to a code-friendly format.


I am a little worried that this would lead to a situation where it is easy to write Markdown that looks good but is not usable by SDG after it goes through the Markdown-to-JSON converter. However, I am not sure how big a deal that would be. What might help provide some intuitions is an example of what such a Markdown would look like; that might provide more of a sense of how hard it is to write SDG-compliant Markdown.

(Also, FWIW, I agree with the premise here that YAML is a big part of the problem with our current taxonomy format.)
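For discussion purposes only, here is one guess at what such a Markdown file and a strict converter could look like. The heading layout (`# Context`, `## Question`, `## Answer`), the function name, and the output keys are invented for this sketch; the proposal does not define a Markdown format, and this is not the converter it has in mind.

```python
import re

SAMPLE = """\
# Context

Turtles are reptiles with a bony shell.

## Question

What kind of animal is a turtle?

## Answer

A turtle is a reptile.
"""


def markdown_to_seed(markdown: str) -> dict:
    # Split on level-1 and level-2 headings; the capture group keeps the titles.
    parts = re.split(r"^#{1,2} +(.+?) *$", markdown, flags=re.MULTILINE)
    # parts[0] is text before the first heading; then title, body, title, body, ...
    sections = [(t.lower(), b.strip()) for t, b in zip(parts[1::2], parts[2::2])]
    if not sections or sections[0][0] != "context":
        raise ValueError("expected the document to start with a '# Context' heading")
    pairs = sections[1:]
    titles = [title for title, _ in pairs]
    if not titles or titles != ["question", "answer"] * (len(titles) // 2):
        raise ValueError("expected alternating '## Question' and '## Answer' headings")
    if any(not body for _, body in sections):
        raise ValueError("every section needs non-empty text")
    return {
        "context": sections[0][1],
        "questions_and_answers": [
            {"question": pairs[i][1], "answer": pairs[i + 1][1]}
            for i in range(0, len(pairs), 2)
        ],
    }
```

The converter refuses anything that does not match the expected shape, which is one way to catch Markdown that renders fine but is unusable downstream. It also shows a real limitation: content that itself contains level-1 or level-2 headings would be misread as structure.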

jwm4 · 2025-02-06T15:32:10Z

docs/taxonomy-revamp-2025.md

+
+## Unaddressed concerns
+
+The issue of needing a git repository for document storage is possibly out of scope of this document. However, I'm adding it as something that may need its own ADR/dev doc. The end user experience of needing a git repository is needlessly complex and also still follows the idea of the upstream taxonomy and community model build. A user working with InstructLab locally does not need the version tracking provided by git and likely probably already has a document storage system. I propose changing the general idea from a git repository to a simple address, whether that's local storage, remote storage, or a version-controlled repository. Make it more flexible.


Under "Streamline the schema", you proposed "Enforce those fields for the upstream taxonomy though documentation, CI, and review rather than require them for everyone." I feel we could do the same for the git repository issue: i.e., community submissions are required to link to git and provide commit hashes but that's enforced in the community repo not in InstructLab.

Agreed. I have that in the Unaddressed Concerns section as I'm not sure whether that's in scope for this specific change. Which team owns that part of the process isn't quite clear, so I didn't want to go tromping on toes...

I guess I would prefer to go the other way: Include the git stuff in this proposal and then see if anyone pushes back. If they have a good argument for keeping the git enforcement in InstructLab (instead of enforcing it at the community level), they can make it.
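One way to picture "enforced in the community repo, not in InstructLab" is a single check with a stricter mode that only the upstream CI turns on. The sketch below is an assumption about how that might be wired: the function name, the `upstream` flag, and treating `seed_examples` as the only always-required field are illustrative, not the project's current validation.

```python
# Fields the proposal suggests making optional for local users.
UPSTREAM_ONLY_FIELDS = ("created_by", "domain", "document_outline")


def missing_fields(entry: dict, upstream: bool = False) -> list[str]:
    """List the fields a submission still needs; stricter when upstream=True."""
    required = ["seed_examples"]
    if upstream:
        required += list(UPSTREAM_ONLY_FIELDS)
    # A field counts as missing when it is absent or empty.
    return [name for name in required if not entry.get(name)]
```

A local user would call `missing_fields(entry)`, while a community CI job would call `missing_fields(entry, upstream=True)` and could add its own git and commit-hash checks on top.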

alinaryan

These are great ideas! I would also suggest addressing the git dependency, upstream taxonomy, and community model builds in this doc. Alternatively, opening dev docs simultaneously to this one could be helpful, because each idea informs the other.

alinaryan · 2025-02-06T15:50:27Z

docs/taxonomy-revamp-2025.md

+
+The user experience of working with the `qna.yaml` file is poor for a handful of reasons:
+
+- Many of the fields in the `qna.yaml` file are unnecessary save for use in the upstream taxonomy.


can you reference an example of this?

I read this as referring to the stuff in the "Streamline the schema" section below; if that's what is meant, then maybe a note like "See 'Streamline the schema' below" would be helpful here.

nimbinatus · 2025-02-06T21:06:50Z

I would also suggest addressing the git dependency, upstream taxonomy, and community model builds in this doc. Alternatively, opening dev docs simultaneously to this one could be helpful, because each idea informs the other.

I think the questions of what to do with the upstream taxonomy and community model build are separate situations and not relevant to this document; there are references to the difference in a few places:

Our taxonomy tree structure and knowledge/skill file structure was designed with upstream taxonomy submissions in mind. An end user working with a taxonomy locally using InstructLab has to follow all of those requirements, increasing complexity of their work.

and

The end user gets hung up on where to place the file in a massive file tree where sub-branches are not defined. For someone not working in the upstream taxonomy, this is basically bikeshedding[^1]. The only requirement for the SDG process is sorting things into knowledge and skills.

and

Enforce those fields for the upstream taxonomy though documentation, CI, and review rather than require them for everyone.

are some examples.

For the git dependency, I addressed an initial thought to solving that problem in the unaddressed concerns section, as I mentioned to Bill above. If there is a consensus that I fold that into this document, though, I am happy to do so.

anastasds · 2025-02-07T14:04:03Z

docs/taxonomy-revamp-2025.md

+
+Write documentation and tutorials based on existing tutorials on writing reading comprehension questions and example answers for standardized exams.
+
+Most people can understand reading to learn versus learning to read type questions. The new, streamlined schema that matches the most simple needs could help here along with a solid set of docs and tutorials on how to write reading comprehension sets. We could borrow heavily from the standard tutorials for writing standardized exams that are out there for free and already battle-tested.


Question answering and reading comprehension are different tasks. Reading comprehension is about pulling information given a text, while question answering can involve reasoning / synthesis when a question is not concretely answered by some piece of text.

Is training on reading comprehension questions enough? @jwm4, thoughts?

anastasds · 2025-02-07T14:05:44Z

docs/taxonomy-revamp-2025.md

+
+## Unaddressed concerns
+
+The issue of needing a git repository for document storage is possibly out of scope of this document. However, I'm adding it as something that may need its own ADR/dev doc. The end user experience of needing a git repository is needlessly complex and also still follows the idea of the upstream taxonomy and community model build. A user working with InstructLab locally does not need the version tracking provided by git and likely probably already has a document storage system. I propose changing the general idea from a git repository to a simple address, whether that's local storage, remote storage, or a version-controlled repository. Make it more flexible and extensible to match where someone chooses to store their data, perhaps through an environment variable to set as one implementation example. This could also decouple the documentation process from the SDG process by allowing the end-user subject-matter expert to create and upload content to a central store without ever touching InstructLab's tooling chains and then a end-user operations or development specialist to run the InstructLab tooling separately.


I would call out that one thing that git gives is a built-in data provenance system: change tracking, change attribution, and modification dates (consider enterprise use cases where you might have conflicting texts and want to pick the more recent ones, such as HR policy documents).

I agree that dependence on git is best removed, but I think it is important not to lose data provenance. This will also become important when we want to robustly handle document updates and re-ingestion.

I suspect most end users already have their own system, whether it solves the data provenance problem or not, and are reluctant to add another to their stack. It's much easier to use, say, the versioning available on Azure for a document rather than teach their end users about git, and we don't have to maintain the system for them as an open source project.
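The "simple address" idea quoted above (local storage, remote storage, or a version-controlled repository, possibly set through an environment variable) could be sketched as below. The variable name `DOCUMENT_SOURCE`, the function name, and the scheme lists are placeholders chosen for this example; nothing like this exists in InstructLab today, and provenance handling is deliberately left out.

```python
import os
from urllib.parse import urlparse


def classify_source(address: str) -> str:
    """Return 'git', 'remote', or 'local' for a document-store address."""
    parsed = urlparse(address)
    if parsed.scheme in ("git", "ssh") or address.endswith(".git"):
        return "git"
    if parsed.scheme in ("http", "https", "s3"):
        return "remote"
    # Anything else is treated as a path on the local filesystem.
    return "local"


# DOCUMENT_SOURCE is a made-up variable name used only for this sketch.
source = os.environ.get("DOCUMENT_SOURCE", "./documents")
kind = classify_source(source)
```

Each kind could then be handed to a different loader, and only the git loader (or a store with its own versioning) would be expected to supply change history.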

anastasds · 2025-02-07T14:06:09Z

docs/taxonomy-revamp-2025.md

+The user experience of working with the `qna.yaml` file is poor for a handful of reasons:
+
+- Many of the fields in the `qna.yaml` file are unnecessary save for use in the upstream taxonomy.
+- YAML is a notoriously complex, loose format with a lot of potholes. As a couple of examples:


nimbinatus · 2025-02-07T16:22:24Z

/hold

Comments and reviews are welcome; I just want to be sure the SDG team gets time to review this :)

RobotSail · 2025-02-07T19:52:21Z

docs/taxonomy-revamp-2025.md

+
+Make `created_by`, `domain`, and `document_outline` optional fields. Enforce those fields for the upstream taxonomy though documentation, CI, and review rather than require them for everyone.
+
+### Switch to JSON and Markdown for the `qna.yaml` document


The issue with using Markdown in our projects is that you often want to include Markdown in your content, so doing it this way will break the parsing unless we add a hacky workaround.

JSON I like but it's not easy for humans. I do think we should support it, but I wouldn't rely on it as a primary "user-facing" format.

For using Markdown, the idea is a simple user-friendly writing format outside of the UI (which would be the happy path in my mind). Since we already require the markdown to be parsed when it gets taken into the SDG process, I think it makes sense to let the user see the output rendered, allowing them some generic ability to understand whether they converted things correctly.

JSON is the transport format, basically. An end user writing seed examples would not see it unless they explicitly choose to skip the markdown/UI formats and write it themselves. I would imagine that someone making that decision knows how to use it. But this allows us to add a programmatic guardrail in the conversion process to handle any and all translation layers, and I think that the conversion could be just as well handled by Docling as anything else, further reducing our dependency footprint and attack surface.

@nimbinatus I appreciate your thorough response; however, I'm still having a hard time understanding your idea around Markdown. Could you please provide an example of what you're thinking of? I think that will help clear up my confusion.

I agree with your point around JSON, that makes sense 👍

RobotSail · 2025-02-07T19:53:57Z

docs/taxonomy-revamp-2025.md

+
+Users who decide to build it without needing the converter are likely familiar with JSON, and there are fewer pitfalls and less likelihood of tooling choices impacting meaning (e.g., where line breaks are for paragraph structures) as the JSON standard has not changed since 2017, and barely changed from the original standard.
+
+### Reframe the Q&A writing process as a reading comprehension process


I think this is good for knowledge. I believe @abhi1092 is actually working on something along the lines of using natural concepts from reading comprehension for the Knowledge 1.5 pipeline.

For skills training though it's probably a bit different since there we want the model to learn how to transform and permute different data.

RobotSail

A few points but I like the overall idea

nimbinatus · 2025-02-07T20:18:22Z

Note: I've been given new information about the taxonomy file structure. I may be updating this document soon, but please still leave me comments.

booxter · 2025-02-10T21:55:12Z

docs/taxonomy-revamp-2025.md

+
+### Use a schema field rather than directory tree structure
+
+Drop the folder structure in favor of a schema field for submission type and even domain, if necessary. The schema field can be entered automatically via the UI through a user selecting `knowledge` or `skill`.


Would it work if we allow but not require the directory structure, while making sure the filepath is part of metadata? Then it's up to the user to decide how they want to organize their files.

booxter · 2025-02-10T21:55:33Z

docs/taxonomy-revamp-2025.md

+
+### Streamline the schema
+
+Make `created_by`, `domain`, and `document_outline` optional fields. Enforce those fields for the upstream taxonomy though documentation, CI, and review rather than require them for everyone.


nit: though -> through

nimbinatus added 3 commits February 5, 2025 09:38

build(gitignore): hide jetbrains directory from git

f6b1e5c

Signed-off-by: Laura Santamaria <nimbinatus@users.noreply.github.com>

feat(taxonomy): start proposal on revamp of taxonomy concept

5db4286

Signed-off-by: Laura Santamaria <nimbinatus@users.noreply.github.com>

docs(more): more things

c4b8e31

Signed-off-by: Laura Santamaria <nimbinatus@users.noreply.github.com>

nimbinatus marked this pull request as draft February 5, 2025 21:51

style(lint): go away linter

4ed5233

Signed-off-by: Laura Santamaria <nimbinatus@users.noreply.github.com>

nimbinatus force-pushed the taxonomy-revamp-2025 branch from 0287599 to 4ed5233 Compare February 5, 2025 23:02

nimbinatus added 4 commits February 5, 2025 18:04

feat(solutions): reorganize and add to solutions

5f28945

Signed-off-by: Laura Santamaria <nimbinatus@users.noreply.github.com>

fix(footnotes): patch the footnotes by just using numbers (should fix…

70d58df

… spelling, too) Signed-off-by: Laura Santamaria <nimbinatus@users.noreply.github.com>

style(linter): add to spelling list and remove random whitespace

179aef8

Signed-off-by: Laura Santamaria <nimbinatus@users.noreply.github.com>

fix(content): add a couple more thoughts that were kicking around

1d730f9

Signed-off-by: Laura Santamaria <nimbinatus@users.noreply.github.com>

nimbinatus marked this pull request as ready for review February 6, 2025 15:26

jwm4 requested changes Feb 6, 2025

nimbinatus changed the title ~~Taxonomy revamp 2025~~ Enhancement Proposal: Taxonomy Updates Feb 6, 2025

fix(footnote): fixed forgotten footnote reference

c4785f4

Signed-off-by: Laura Santamaria <nimbinatus@users.noreply.github.com>

alinaryan reviewed Feb 6, 2025

anastasds reviewed Feb 7, 2025

RobotSail reviewed Feb 7, 2025

RobotSail approved these changes Feb 7, 2025

nimbinatus mentioned this pull request Feb 7, 2025

Review taxonomy update proposal instructlab/sdg#544

Closed

booxter reviewed Feb 10, 2025
