Enhancement Proposal: Taxonomy Updates #188
Signed-off-by: Laura Santamaria <nimbinatus@users.noreply.github.com>
0287599 to 4ed5233
… spelling, too) Signed-off-by: Laura Santamaria <nimbinatus@users.noreply.github.com>
This basically looks good to me but I had some review comments too.
### Use a schema field rather than directory tree structure

Drop the folder structure in favor of a schema field for submission type and even domain, if necessary. The schema field can be entered automatically via the UI through a user selecting `knowledge` or `skill`.
Another thing the directory tree structure lets you do is organize at multiple levels, e.g., `knowledge/animals/reptiles/turtles`. I don't know how important that ability is. I guess we could encode that in the schema field (e.g., have the value be `knowledge/animals/reptiles/turtles`), but I think it would make more sense to have a separate field for this purpose, e.g., the `schema` could be `knowledge` and the `categorization` could be `animals/reptiles/turtles`.
If we don't have the ability to nest knowledge and skills into groups and subgroups, then I would say it is not really a taxonomy.
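A minimal sketch of the two-field idea in the comment above. The field names `schema` and `categorization` come from the comment; the `Submission` class and its `levels` helper are hypothetical and are not part of any existing InstructLab schema.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Submission:
    """Hypothetical record, used only to illustrate the two-field proposal."""

    schema: str          # submission type: "knowledge" or "skill"
    categorization: str  # slash-separated nesting, e.g. "animals/reptiles/turtles"

    def levels(self) -> list[str]:
        # Split the categorization into nesting levels, skipping empty segments.
        return [part for part in self.categorization.split("/") if part]


turtles = Submission(schema="knowledge", categorization="animals/reptiles/turtles")
print(turtles.levels())
```

Keeping the nesting in its own field means the hierarchy survives even if files are stored flat.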
I honestly think that's really the question here, and I honestly don't know. I can't find reference to it being useful in the code, but I may be missing something. Is the whole nested structure really necessary?
Here's one reason why it could turn out to be useful eventually: Imagine you have too much data and you want to do some sort of subset selection either before running SDG (generating less data to begin with) or after running SDG (removing some of the data that you just generated). In either case you might want to use the hierarchy to constrain what gets discarded. Maybe you want to ensure the coverage stays as wide as possible, so you would rather discard half of the stuff in `animals/reptiles` and half of the stuff in `animals/insects` than discard all of the stuff in one and none of the stuff in the other. Or maybe you want to go the other way: you would rather teach to mastery in some subjects than teach to partial mastery in more subjects, so you would rather discard an entire branch of the taxonomy than discard lots of pieces of lots of different branches. In either case having that structure would be useful.
On the other hand, in general it is not a great idea to include stuff in a schema because there's a hypothetical argument for why it might be useful someday. So maybe we should put more thought into whether we really do want to do these things in the foreseeable future before making a commitment one way or the other.
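The two subsetting strategies described above can be sketched roughly as follows. This is an illustration only: the function names, the `depth` parameter, and the deterministic "keep every other item" rule are assumptions, not anything taken from the SDG codebase.

```python
from __future__ import annotations

from collections import defaultdict


def branch_of(path: str, depth: int = 2) -> str:
    """Return the first `depth` levels of a slash-separated taxonomy path."""
    return "/".join(path.split("/")[:depth])


def keep_half_per_branch(paths: list[str], depth: int = 2) -> list[str]:
    """Breadth-preserving subset: keep every other item within each branch."""
    by_branch: dict[str, list[str]] = defaultdict(list)
    for path in paths:
        by_branch[branch_of(path, depth)].append(path)
    kept: list[str] = []
    for branch in sorted(by_branch):
        kept.extend(sorted(by_branch[branch])[::2])
    return kept


def drop_branch(paths: list[str], branch: str) -> list[str]:
    """Depth-preserving subset: discard one whole branch, keep the rest intact."""
    prefix = branch.rstrip("/") + "/"
    return [path for path in paths if not path.startswith(prefix)]


paths = [
    "animals/reptiles/turtles",
    "animals/reptiles/snakes",
    "animals/insects/ants",
    "animals/insects/bees",
]
print(keep_half_per_branch(paths))
print(drop_branch(paths, "animals/insects"))
```

Both functions need only the path string, so they would work whether the hierarchy lives in directories or in a metadata field.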
Oh, here's another related reason: say you are building a large taxonomy (either the community taxonomy or maybe a private internal taxonomy for some customer). Maybe in phase one the taxonomy covers a large variety of topics, and then in phase two it gets so large that you decide what you really want is to split the taxonomy into pieces so that you can train separate models for each piece. In that case, having the hierarchy could be super useful because you've already done the work of splitting it into pieces as you added stuff to the taxonomy, instead of waiting until you want to do the splitting and finding that you have a huge undifferentiated blob to deal with. That might be a better argument for keeping the hierarchy than the one that I mentioned above around subsetting.
File directory structure could end up being metadata on ingested chunks as well, which could be useful in multiple ways.
Would it work if we allow but not require the directory structure, while making sure the filepath is part of metadata? Then it's up to the user to decide how they want to organize their files.
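As a rough sketch of "allow but not require" a directory structure, the snippet below records each `qna.yaml` file's relative path as metadata, whether the user nested their files or kept them flat. The `collect_metadata` function and the `filepath` and `categories` keys are hypothetical names chosen for illustration.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def collect_metadata(root: Path) -> list[dict]:
    """Record each qna.yaml path relative to `root`, however the user nested it."""
    records = []
    for file in sorted(root.rglob("qna.yaml")):
        relative = file.relative_to(root)
        records.append({
            "filepath": relative.as_posix(),
            # Directory names become optional categories; empty for a flat layout.
            "categories": list(relative.parent.parts),
        })
    return records


# Demo with one nested file and one flat file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "science" / "biology").mkdir(parents=True)
    (root / "science" / "biology" / "qna.yaml").write_text("")
    (root / "qna.yaml").write_text("")
    records = collect_metadata(root)

print(records)
```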
I think so. The important thing, from the code perspective, is that the actual directory names and nested tree structure are not really used other than to name temporary files and identify where there are new files. I've dug through the codebase for SDG. A user can organize their files in a taxonomy like `science > biology > ornithology` or in one like `documents > reports > lab`, and the SDG process doesn't know the difference. A taxonomy with knowledge categorization is a human construct for human use, and I don't think we should define that for the end user if we want this to be generalizable to any end user's system. We can encourage sharing that information with us through metadata, sure, but it's not something we use otherwise.
In short, I think maintaining a document store and version control for a user, whether forcing a specific way to do it or providing a system ourselves, is outside of our remit with the project.
docs/taxonomy-revamp-2025.md
Outdated
### Switch to JSON and Markdown for the `qna.yaml` document

Allow the user to use Markdown in a WYSIWYG experience, and then use a Markdown-to-JSON converter to handle the conversion to a code-friendly format.
I am a little worried that this would lead to a situation where it is easy to write Markdown that looks good but is not usable by SDG after it goes through the Markdown-to-JSON converter. However, I am not sure how big a deal that would be. What might help provide some intuitions is an example of what such a Markdown would look like; that might provide more of a sense of how hard it is to write SDG-compliant Markdown.
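Purely as an illustration of the kind of example being asked for, here is one hypothetical Markdown layout and a tiny converter for it. The `## Question` / `## Answer` headings, the function name, and the output keys are all assumptions made for this sketch; the proposal does not define a format, and the thread suggests a real converter might be built on other tooling.

```python
from __future__ import annotations

import json


def markdown_to_seed_examples(markdown: str) -> list[dict]:
    """Convert a hypothetical '## Question' / '## Answer' layout into dicts."""
    examples: list[dict] = []
    field = None
    for line in markdown.splitlines():
        heading = line.strip().lower()
        if heading == "## question":
            examples.append({"question": [], "answer": []})
            field = "question"
        elif heading == "## answer":
            if not examples:
                raise ValueError("'## Answer' appears before any '## Question'")
            field = "answer"
        elif field is not None:
            examples[-1][field].append(line)

    result = []
    for example in examples:
        question = "\n".join(example["question"]).strip()
        answer = "\n".join(example["answer"]).strip()
        # Guardrail: reject Markdown that renders fine but is unusable downstream.
        if not question or not answer:
            raise ValueError("every question needs non-empty text and an answer")
        result.append({"question": question, "answer": answer})
    return result


SAMPLE = """\
## Question
Where do sea turtles lay their eggs?

## Answer
On sandy beaches.
"""

converted = markdown_to_seed_examples(SAMPLE)
print(json.dumps(converted, indent=2))
```

The `ValueError` checks are where a conversion step could catch Markdown that looks good but is incomplete. Note that this naive parser would misread content that itself contains a `## Question` heading.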
(Also, FWIW, I agree with the premise here that YAML is a big part of the problem with our current taxonomy format.)
docs/taxonomy-revamp-2025.md
Outdated
## Unaddressed concerns

The issue of needing a git repository for document storage is possibly out of scope of this document. However, I'm adding it as something that may need its own ADR/dev doc. The end user experience of needing a git repository is needlessly complex and also still follows the idea of the upstream taxonomy and community model build. A user working with InstructLab locally does not need the version tracking provided by git and likely probably already has a document storage system. I propose changing the general idea from a git repository to a simple address, whether that's local storage, remote storage, or a version-controlled repository. Make it more flexible.
Under "Streamline the schema", you proposed "Enforce those fields for the upstream taxonomy though documentation, CI, and review rather than require them for everyone." I feel we could do the same for the git repository issue: i.e., community submissions are required to link to git and provide commit hashes but that's enforced in the community repo not in InstructLab.
Agreed. I have that in the Unaddressed Concerns section as I'm not sure whether that's in scope for this specific change. Which team owns that part of the process isn't quite clear, so I didn't want to go tromping on toes...
I guess I would prefer to go the other way: Include the git stuff in this proposal and then see if anyone pushes back. If they have a good argument for keeping the git enforcement in InstructLab (instead of enforcing it at the community level), they can make it.
These are great ideas! I would also suggest addressing the git dependency, upstream taxonomy, and community model builds in this doc. Alternatively, opening dev docs simultaneously to this one could be helpful, because each idea informs the other.
The user experience of working with the `qna.yaml` file is poor for a handful of reasons:

- Many of the fields in the `qna.yaml` file are unnecessary save for use in the upstream taxonomy.
Can you reference an example of this?
I read this as referring to the stuff in the "Streamline the schema" section below; if that's what is meant, then maybe a note like "See 'Streamline the schema' below" would be helpful here.
> These are great ideas! I would also suggest addressing the git dependency, upstream taxonomy, and community model builds in this doc. Alternatively, opening dev docs simultaneously to this one could be helpful, because each idea informs the other.

I think the questions of what to do with the upstream taxonomy and community model build are separate situations and not relevant to this document; there are references to the difference in a few places:
and
and
are some examples. For the git dependency, I addressed an initial thought on solving that problem in the Unaddressed concerns section, as I mentioned to Bill above. If there is a consensus that I fold that into this document, though, I am happy to do so.
Write documentation and tutorials based on existing tutorials on writing reading comprehension questions and example answers for standardized exams.

Most people can understand reading to learn versus learning to read type questions. The new, streamlined schema that matches the most simple needs could help here along with a solid set of docs and tutorials on how to write reading comprehension sets. We could borrow heavily from the standard tutorials for writing standardized exams that are out there for free and already battle-tested.
Question answering and reading comprehension are different tasks. Reading comprehension is about pulling information given a text, while question answering can involve reasoning / synthesis when a question is not concretely answered by some piece of text.
Is training on reading comprehension questions enough? @jwm4, thoughts?
## Unaddressed concerns

The issue of needing a git repository for document storage is possibly out of scope of this document. However, I'm adding it as something that may need its own ADR/dev doc. The end user experience of needing a git repository is needlessly complex and also still follows the idea of the upstream taxonomy and community model build. A user working with InstructLab locally does not need the version tracking provided by git and likely probably already has a document storage system. I propose changing the general idea from a git repository to a simple address, whether that's local storage, remote storage, or a version-controlled repository. Make it more flexible and extensible to match where someone chooses to store their data, perhaps through an environment variable to set as one implementation example. This could also decouple the documentation process from the SDG process by allowing the end-user subject-matter expert to create and upload content to a central store without ever touching InstructLab's tooling chains and then a end-user operations or development specialist to run the InstructLab tooling separately.
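One way the "simple address set through an environment variable" idea might look is sketched below. The variable name `TAXONOMY_DOCS_ADDRESS`, the default path, and the three labels are hypothetical; InstructLab does not define them.

```python
import os
from urllib.parse import urlparse


def classify_address(address: str) -> str:
    """Label an address as 'git', 'remote', or 'local' (illustrative labels)."""
    parsed = urlparse(address)
    if address.endswith(".git") or parsed.scheme in {"git", "git+ssh"}:
        return "git"
    if parsed.scheme in {"http", "https", "s3"}:
        return "remote"
    # Anything else, including relative and absolute paths, is treated as local.
    return "local"


def resolve_address(default: str = "./documents") -> str:
    # TAXONOMY_DOCS_ADDRESS is a made-up variable name for this sketch.
    return os.environ.get("TAXONOMY_DOCS_ADDRESS", default)


address = resolve_address()
print(address, classify_address(address))
```

With this shape, the tooling would only need to know how to read from each kind of address, not how the user manages the documents behind it.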
I would call out that one thing that git gives is a built-in data provenance system: change tracking, change attribution, and modification dates (consider enterprise use cases where you might have conflicting texts and want to pick the more recent ones, such as HR policy documents).
I agree that dependence on git is best removed, but I think it is important not to lose data provenance. This will also become important when we want to robustly handle document updates and re-ingestion.
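To make the provenance point concrete, here is a minimal sketch of tracking a content hash, modification date, and author without git. The `Provenance` record, the helper names, and the sample values are all hypothetical and only illustrate picking the more recent of two conflicting documents.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Provenance:
    """Hypothetical provenance record kept alongside an ingested document."""

    source: str
    sha256: str
    modified: datetime
    author: str


def record(source: str, content: bytes, modified: datetime, author: str) -> Provenance:
    # The content hash detects changes; the other fields give date and attribution.
    return Provenance(source, hashlib.sha256(content).hexdigest(), modified, author)


def most_recent(records: list) -> Provenance:
    """Pick the most recently modified record among conflicting versions."""
    return max(records, key=lambda item: item.modified)


# Made-up sample data: two versions of the same policy document.
older = record("hr/leave-policy.md", b"v1", datetime(2024, 1, 5, tzinfo=timezone.utc), "alice")
newer = record("hr/leave-policy.md", b"v2", datetime(2025, 3, 1, tzinfo=timezone.utc), "bob")
print(most_recent([older, newer]).author)
```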
I suspect most end users already have their own system, whether it solves the data provenance problem or not, and are reluctant to add another to their stack. It's much easier to use, say, the versioning available on Azure for a document rather than teach their end users about git, and we don't have to maintain the system for them as an open source project.
The user experience of working with the `qna.yaml` file is poor for a handful of reasons:

- Many of the fields in the `qna.yaml` file are unnecessary save for use in the upstream taxonomy.
- YAML is a notoriously complex, loose format with a lot of potholes. As a couple of examples:
👍
/hold Comments and reviews are welcome; I just want to be sure the SDG team gets time to review this :)
Make `created_by`, `domain`, and `document_outline` optional fields. Enforce those fields for the upstream taxonomy though documentation, CI, and review rather than require them for everyone.

### Switch to JSON and Markdown for the `qna.yaml` document
The issue with using markdown in our projects is that often times you want to include markdown in your content, so doing it this way will break the parsing without having a hacky solution.
JSON I like but it's not easy for humans. I do think we should support it, but I wouldn't rely on it as a primary "user-facing" format.
For using Markdown, the idea is a simple user-friendly writing format outside of the UI (which would be the happy path in my mind). Since we already require the markdown to be parsed when it gets taken into the SDG process, I think it makes sense to let the user see the output rendered, allowing them some generic ability to understand whether they converted things correctly.
JSON is the transport format, basically. An end user writing seed examples would not see it unless they explicitly choose to skip the markdown/UI formats and write it themselves. I would imagine that someone making that decision knows how to use it. But this allows us to add a programmatic guardrail in the conversion process to handle any and all translation layers, and I think that the conversion could be just as well handled by Docling as anything else, further reducing our dependency footprint and attack surface.
@nimbinatus I appreciate your thorough response; however, I'm still having a hard time understanding your idea around Markdown. Could you please provide an example of what you're thinking of? I think that will help clear up my confusion.
I agree with your point around JSON, that makes sense 👍
Users who decide to build it without needing the converter are likely familiar with JSON, and there are fewer pitfalls and less likelihood of tooling choices impacting meaning (e.g., where line breaks are for paragraph structures) as the JSON standard has not changed since 2017, and barely changed from the original standard.

### Reframe the Q&A writing process as a reading comprehension process
I think this is good for knowledge, I believe @abhi1092 is actually working on something along the lines of using natural concepts from reading comprehension for the Knowledge 1.5 pipeline.
For skills training though it's probably a bit different since there we want the model to learn how to transform and permute different data.
A few points but I like the overall idea
Note: I've been given new information about the taxonomy file structure. I may be updating this document soon, but please still leave me comments.
### Streamline the schema

Make `created_by`, `domain`, and `document_outline` optional fields. Enforce those fields for the upstream taxonomy though documentation, CI, and review rather than require them for everyone.
nit: though -> through
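A small sketch of how "optional for everyone, enforced for the upstream taxonomy" could be checked in CI. The three field names come from the proposal text; the `missing_fields` function and its `upstream` flag are hypothetical.

```python
from __future__ import annotations

# Fields the proposal would make optional locally but enforce upstream.
UPSTREAM_ONLY_FIELDS = ("created_by", "domain", "document_outline")


def missing_fields(qna: dict, upstream: bool = False) -> list[str]:
    """Return upstream-only fields that are absent or empty.

    For local use (upstream=False) nothing is required, so the list is empty.
    """
    if not upstream:
        return []
    return [name for name in UPSTREAM_ONLY_FIELDS if not qna.get(name)]


local_doc = {"domain": "animals"}
print(missing_fields(local_doc))
print(missing_fields(local_doc, upstream=True))
```

An upstream CI job could fail when the second call returns a non-empty list, while local runs never see the check.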
This proposal discusses how the concept of the taxonomy should change and evolve to meet user needs.