From 984af494ed45c3b49eb03882e805b398c3053e76 Mon Sep 17 00:00:00 2001 From: Bill Murdock Date: Thu, 14 Nov 2024 16:27:41 -0500 Subject: [PATCH 01/11] Create sdg-refactor.md Signed-off-by: Bill Murdock --- docs/sdg/sdg-refactor.md | 136 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 136 insertions(+) create mode 100644 docs/sdg/sdg-refactor.md diff --git a/docs/sdg/sdg-refactor.md b/docs/sdg/sdg-refactor.md new file mode 100644 index 00000000..8ad5c1b6 --- /dev/null +++ b/docs/sdg/sdg-refactor.md @@ -0,0 +1,136 @@ +# Refactor preprocessing and postprocessing in SDG + +## Context + +The existing synthetic data generation (SDG) repository includes several related pieces of functionality: + +- Traverse an InstructLab taxonomy to identify unstaged qna.yaml files to generate data from. +- From each qna.yaml file, extract the example context/question/answer tuples for use as seed data. +- From each *knowledge* qna.yaml file, also fetch the document referenced by this file. Use [Docling](https://github.com/DS4SD/docling) to convert the file to a JSON format and then split the file into chunks that are small enough to be used as contexts for synthetic data generation. +- *Given the seed data and the document chunks if any, generate additional synthetic context/question/answer tuples.* +- Mix the outputs with some pre-computed data sets when applicable +- Split the data into train and test + +Of all of these, only the one emphasized (*Given the seed data ... generate ... tuples*) is core SDG functionality. The others are essentially preprocessing and postprocessing steps to enable the core SDG functionality and produce outputs useable for future steps. In the current flow, preprocessing has a taxonomy with some new seed data added to it as input. The output of preprocessing includes a set of context/question/answer tuples for both knowledge and skill taxonomy nodes. For knowledge taxonomy nodes it also includes a set of document chunks. SDG uses the context/question/answer tuples as seed examples, and it uses the document chunks (if there are any) as example contexts from which to generate additional data. That additional data is then sent to the postprocessing step to produce the final outputs. + +We have heard that some users want a stand-alone SDG capability that includes only the core SDG functionality. Specifically, they already have a set of seed context/question/answer tuples and optionallly a set of document chunks. All they want from SDG is to take that input and produce an new sythetic data set as output without doing any mixing into pre-computed data or splitting into train and test. The preprocessing and postprocessing capabilities currently in SDG are not relevant to those users. + +Also as context, in the near future we are absorbing a set of updates to the core SDG functionality to make it more modularized and flexible. That might turn out to be irrelevant to this document which is focused on what to do with the non-core functionality (preprocessing and postprocessing). However, it is mentioned here in the context section in case that context winds up being useful. + +Furthermore, in 2025 we are hoping to have some sort of retrieval-augmented generation (RAG) capability that is either part of or tightly integrated with InstructLab. Such a capability would have signficant overlap with the functionality of the preprocessing for SDG. 
As noted above, when a taxonomy has a knowledge qna.yaml file that references a document, SDG uses Docling to convert the file to JSON and then splits the file into chunks of appropriate size for SDG. The RAG capability would _also_ want the same Docling JSON output but would need to split it into chunks that are sized appropriately for vector retrieval (i.e., that fit within the context window of the semantic encoding model). + +## Question 1: What user flows should be supported? + +Here are some user flows that seem like they might be valuable: + +1. User installs the full InstructLab (CLI and/or GUI). They want any of the following using CLI or GUI interactions: + - 1.1. They have a taxonomy with some new seed data they added to it. They want to run the full pipeline including SDG and model training and evaluation. + - 1.2. They have a taxonomy with some new seed data they added to it. They want to run SDG and then evaluate an existing model on the outputs of that SDG. + - 1.3. They have a taxonomy with some new seed data they added to it. They want to run SDG only. + - 1.3.1. They also want to see the _inputs_ to SDG that get extracted from the taxonomy (i.e., a set of seed context/question/answer tuples and optionallly a set of document chunks). + - 1.3.2. Alternatively, maybe they _only_ want to see the _inputs_ to SDG -- they don't actually want to run SDG. + - 1.4. They have a set of seed context/question/answer tuples and optionallly a set of document chunks. They want to run the full pipeline including SDG and model training and evaluation. + - 1.5. They have a set of seed context/question/answer tuples and optionallly a set of document chunks. They want to run SDG and then evaluate an existing model on the outputs of that SDG. + - 1.6. They have a set of seed context/question/answer tuples and optionallly a set of document chunks. They want to run SDG only. +2. User installs the SDG library only. They want to invoke any of the following as a library call from code they write: + - 2.1. They have a taxonomy with some new seed data they added to it. They want to run SDG only without any postprocessing. + - 2.2. They have a set of seed context/question/answer tuples and optionallly a set of document chunks. They want to run SDG only without any postprocessing. + +If I understand the guidance from our PM (William Caban), the flows we are being asked to support here are 1.1., 1.2., 1.3, 1.3.1, and 2.2. However, I am not sure that I understand the guidance. Are we confident that we _do_ want to support those flows and do not want to support any of the others? More specifically, do we think the users who want 2.2 might be satisfied with 1.6 instead? Both of those are SDG-only flows, but the latter is more developer focused and the former is more business-user focused. Also, are we really confident that all of the "SDG only" customers really want to do their own document chunking? Might there be some customers that have seed context/question/answer tuples and _documents_ and want us to chunk the documents for them? + +Also note, that for all of the flows above that start with "they have a taxonomy", there are open questions around the complexity of the taxonomy format. Some users want to be able to provide seed data and documents without explicitly or implicitly extending some sort of base taxonomy. 
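For illustration, a stand-alone seed input of that kind might look roughly like the sketch below. This is a hypothetical shape, not the current qna.yaml schema or any agreed-upon format; the field names, placeholder strings, and URL are invented for this example.

```python
# Hypothetical stand-alone seed input for the SDG-only flows (1.4-1.6 and 2.2).
# The structure and field names are illustrative, not an agreed schema.
seed_input = [
    {
        "kind": "knowledge",
        "seed_examples": [
            {
                "context": "<a short passage taken from the source document>",
                "question": "<a question answerable from that passage>",
                "answer": "<the corresponding answer>",
            },
        ],
        # Users with raw documents would supply a reference like this ...
        "document": "https://example.com/some-source-document.pdf",  # hypothetical URL
        # ... while users who have already chunked their documents
        # would supply the chunks directly instead.
        "document_chunks": ["<chunk 1 text>", "<chunk 2 text>"],
    },
    {
        "kind": "skill",
        "seed_examples": [
            {
                "context": "<optional grounding text for the skill>",
                "question": "<an example instruction or question>",
                "answer": "<an example response>",
            },
        ],
    },
]
```

Whether such an input should carry full documents, pre-chunked text, or both is exactly the open question raised above.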
If we had something simpler than the existing taxonomy format that still included some way to specify context/question/answer tuples and references to full documents (i.e., not _chunks_), would we even call that a taxonomy or would it be something else? If we call it something else, then some or all of the flows that take in a taxonomy might also be applicable for that something else. For the remainder of this document, however, we will assume that any sort of simplified/easier-to-use variant of a taxonomy will also be called a "taxonomy". + +## Question 2: What should the commands be in the CLI? + +One way to support both 1.3.1 and 1.3.2 would be to have separate CLI commands for the preprocessing, core SDG, and postprocessing step . Alternatively, a single CLI command that does all of these and also saves the outputs of preprocessing to disk would support 1.3.1 but _not_ 1.3.2. Even if we only want to support 1.3.1, having separate CLI commands for each step might be desirable because it is just more intuitive that if a user wants to save the outputs of preprocessing to disk to have a command to do that instead of having it be a "side effect" of an omnibus SDG command. Here is a rough outline of what the separate commands would be: + +- `ilab data prep` would handle all the preprocessing (the first three bullets in the Context section above, plus any additional preprocessing we add in the future). +- `ilab data generate` would take as input some data in the same format that `ilab data prep` produces and would run the core synthetic data generation *only*. Note that this is a breaking change from the current behavior of `ilab data generate`, but that may be acceptable because the user base is still small. +- `ilab data process` would take as input some data in the same format that `ilab data generate` produces and would run the postprocessing (the last two bullets in the Context section above, plus any additional postprocessing we add in the future). + +Detailed technical specifications for these commands are outside the scope of this document and should appear in a future document instead. + +## Question 3: Where should the preprocessing and postprocessing code go? + +As noted earlier, currently the preprocessing and postprocessing code is in the SDG library. Here are some options for what to do with it. + +### Option 1: Leave preprocessing and postprocessing in SDG + +Currently there is no documentation that I know of that explains how to do 2.1 or 2.2 (or anything else, really) with the SDG library by itself. However, with some additional documenting and _maybe_ some refactoring, it should be feasible to support both 2.1 and 2.2 in SDG. With that said, if 2.1 is not needed and 2.2 is, then it would _also_ be possible to move the preprocessing and postprocessing code out of SDG. Some pros and cons of leaving in SDG: + +Pro: + +- Future changes to the input format for preprocessing and/or the output format for postprocessing (e.g., adding more expressive power to the taxonomy format) require changes to the core SDG *and* the preprocssing/postprocessing. That's easier to do if they are in the same repository because they can be done in a single PR instead of multiple PRs that need to be coordinated. +- It is simpler to leave things where they are. 
+- If we're not totally sure which of the options we want, then it might make more sense to stick with this option for now since it avoids doing a work to move preprocessing and postprocessing *now* that could then be followed by more work to move preprocessing and postprocessing *again* after we decide where it goes. + +Con: + +- The core logic of SDG is inherently complex and represents some of the most sophisticated and differentiating elements of InstructLab. For that reason, it would be nice to have it in its own repository by itself. New contributors to that core logic find it challenging enough to navigate the core functionality without having to also figure out where the core logic starts and the preprocessing and postprocessing capabilities end. This could be mitigated by having better technical documentation (README, comments) for the SDG library. +- As noted in the Context section earlier, in the near future we are absorbing a set of updates to the core SDG functionality. Absorbing those updates is somewhat simpler if the core SDG logic is all alone in a repository of its own. +- Keeping interconnected components in the same repository provides less pressure to consistently document the API contracts between them. We certainly _could_ have well documented API contracts for preprocessing and postprocessing and core SDG functionality that makes it clear how they interact even when both of these exist in the same repository, but it is probably more likely that we _will_ do so if they are separated. +- The logic behind the core SDG algorithms are mainly developed and maintained by the Red Hat AI Innovations team (commonly referred to as the "research" team because many people on that team used to work for IBM Research) while the logic behind the preprocessing and postprocessing is mainly developed and maintained by the Red Hat AI engineering "data" team. Having multiple teams working on a component increases the amount of coordination required. Note, however, that preprocessing, postprocessing and core SDG all belong to the entire InstructLab commmunity and *not* Red Hat (much less any one team in Red Hat). So the teams really need to keep collaborating with the entire community at all times and not get into a mindset of "owning" a single piece of code. +- The ezxpected RAG functionality in 2025 will have some complex interactions with both preprocessing and postprocessing, perhaps even involving user flows in which the core SDG functionality is not needed. In that case, it would be confusing to have the code path for RAG include a call out to the SDG library for doing preprocessing but not actually doing the core SDG. +- It would just be simpler to explain to all stakeholders if the functionality that I've been calling "core SDG" was really just called "SDG". We can't do that now because the SDG library has preprocessing and postprocessing in it too. + +Conclusion: + +- While the cons here are substantial, so are the pros. None of the cons really seem disqualifying. The first pro (future changes to the formats can be more self-contained) seems particularly compelling because this is a rapidly evolving field and adding new expressive power seems like something we will want frequently. + +### Option 2: Move preprocessing and postprocessing into a new repository + +We could have a new repository for preprocessing and postprocessing and move all the preprocessing and postprocessing code there. + +Pro: + +- Avoids all the cons of Option 1. 
+- Preprocessing and postprocessing are a coherent pieces of functionality that *could* have their own library (or libraries, FWIW). + +Cons: + +- Avoids all the pros of Option 1. +- Having a separate repository with its own library brings in an enormous amount of overhead in maintaining that repository (e.g., CI/CD). +- Having a separate repository with its own library also brings in an enormous amount of overhead in maintaining the CLI repository's dependency on all of those libraries. +- Does not allow user flow 2.1 but maybe that's OK because it is not a priority and anyway the users could approximate that flow by also installing the ingestion library. + +Conclusion: + +- The cost of having a separate reposity is so high that we would only consider this option as a last resort. + +### Option 3: Move preprocessing and postprocessing into the CLI repository + +Pro: + +- The CLI already has a lot of "supporting" (non-core) functionality, so it would respect established precedent to include preprocessing and/or postprocessing here. +- Supporting user flow 1.3.2 requires separate CLI commands for preprocessing and core SDG. This is slightly simpler if preprocessing is implemented in CLI. If preprocessing remains in the SDG library instead then the CLI would need to make _separate_ calls to the SDG library for preprocessing and core SDG to support user flow 1.3.2. That adds a little complexity. +- Avoids some of the cons of Option 1, but see below for some overlap. +- Avoids some of the cons of Option 2, but see below for some overlap. + +Con: + +- Avoids the pros of both Option 1 and Option 2. +- As with Option 1, this approach involves a lot of coordination. There are a lot of stakehoders involved in the CLI and locating preprocessing and postprocessing there drags those stakeholders into issues relating to preprocessing and postprocessing. However, as with Option 1, coordinating across stakeholders is something an open source project needs to do well anyway to remain engaged with the community. +- As with Option 1, this approach suffers from the fact that keeping interconnected components in the same repository provides less pressure to consistently document the API contracts between them. In the case of Option 1, the interconnected components that would not have as much pressure to be documented would be preprocessing/postprocessing and core SDG. In the case of Option 3, the interconnected components that would not have as much pressure to be documented would be the CLI and preprocessing/postprocessing. However, in both cases, this con could be alleviated by just having the discipline to document the APIs well even without such pressure. +- As with Option 2, this approach would not enable user flow 2.1. Maybe that's fine since it is not on our requirements list. + +Conclusion: + +- This seems like a reasonable option. The cons are mostly manageable. However, overall the pros of Option 1 seem more compelling. + +### Option 4: Preprocessing and postprocessing go to different locations + +We could also mix and match any of the above options separately for preprocessing and postprocessing. For example, preprocessing could move to the CLI repo and postprocessing could stay in the SDG repo. Or preprocessing could move to a new repository and postprocessing could move to a different new repository or the _same_ new repository. Enumerating all possible permutations of where each could go and enumerating pros and cons of each of them would make this document unbearably long. 
If anyone wants to advocate for a small number of specific permutations, we will add them to this document. + +## Question 4: Should preprocessing, postprocessing, and core SDG be separate Python packages? + +If we choose Option 1 (leave preprocessing and postprocessing in SDG) then we still have the option to separate them into distinct Python packages. That would get us some of the benefits of Option 2 (moving them to a different repository) while avoiding *some* of the costs of Option 2. In particular, it would make the boundaries clearer and put more pressure on the developers of preprocessing, postprocessing, and core SDG to have well documented contracts for how each of these elements interact. With that said, it would also bring in some additional complexity and increase the amount of work involved. + +## Decisions + +Since this is a draft, no decisions are made yet. However, here are the current draft decisions: + +- We will support the following user flows: 1.1., 1.2., 1.3, 1.3.1, 1.3.2, 2.1, and 2.1 as documented in the Question 1 section above. +- We will adopt the updates to the CLI that will be documented in Question 2 above. +- We will leave the preprocessing in SDG as described in Question 3: Option 1. +- We will leave the postprocessing in SDG as described in Question 3: Option 1. +- We will not separate preprocessing, postprocessing, and SDG into separate packages. From 36f4cebb33d2d1e6f58e0ed45b7a7a470ad0377d Mon Sep 17 00:00:00 2001 From: Bill Murdock Date: Thu, 14 Nov 2024 16:57:00 -0500 Subject: [PATCH 02/11] Fix linting and spelling errors Signed-off-by: Bill Murdock --- docs/sdg/sdg-refactor.md | 30 +++++++++++++++--------------- 1 file changed, 15 insertions(+), 15 deletions(-) diff --git a/docs/sdg/sdg-refactor.md b/docs/sdg/sdg-refactor.md index 8ad5c1b6..4fd555b6 100644 --- a/docs/sdg/sdg-refactor.md +++ b/docs/sdg/sdg-refactor.md @@ -11,13 +11,13 @@ The existing synthetic data generation (SDG) repository includes several related - Mix the outputs with some pre-computed data sets when applicable - Split the data into train and test -Of all of these, only the one emphasized (*Given the seed data ... generate ... tuples*) is core SDG functionality. The others are essentially preprocessing and postprocessing steps to enable the core SDG functionality and produce outputs useable for future steps. In the current flow, preprocessing has a taxonomy with some new seed data added to it as input. The output of preprocessing includes a set of context/question/answer tuples for both knowledge and skill taxonomy nodes. For knowledge taxonomy nodes it also includes a set of document chunks. SDG uses the context/question/answer tuples as seed examples, and it uses the document chunks (if there are any) as example contexts from which to generate additional data. That additional data is then sent to the postprocessing step to produce the final outputs. +Of all of these, only the one emphasized (*Given the seed data ... generate ... tuples*) is core SDG functionality. The others are essentially preprocessing and postprocessing steps to enable the core SDG functionality and produce outputs usable for future steps. In the current flow, preprocessing has a taxonomy with some new seed data added to it as input. The output of preprocessing includes a set of context/question/answer tuples for both knowledge and skill taxonomy nodes. For knowledge taxonomy nodes it also includes a set of document chunks. 
SDG uses the context/question/answer tuples as seed examples, and it uses the document chunks (if there are any) as example contexts from which to generate additional data. That additional data is then sent to the postprocessing step to produce the final outputs. -We have heard that some users want a stand-alone SDG capability that includes only the core SDG functionality. Specifically, they already have a set of seed context/question/answer tuples and optionallly a set of document chunks. All they want from SDG is to take that input and produce an new sythetic data set as output without doing any mixing into pre-computed data or splitting into train and test. The preprocessing and postprocessing capabilities currently in SDG are not relevant to those users. +We have heard that some users want a stand-alone SDG capability that includes only the core SDG functionality. Specifically, they already have a set of seed context/question/answer tuples and optionallly a set of document chunks. All they want from SDG is to take that input and produce an new synthetic data set as output without doing any mixing into pre-computed data or splitting into train and test. The preprocessing and postprocessing capabilities currently in SDG are not relevant to those users. Also as context, in the near future we are absorbing a set of updates to the core SDG functionality to make it more modularized and flexible. That might turn out to be irrelevant to this document which is focused on what to do with the non-core functionality (preprocessing and postprocessing). However, it is mentioned here in the context section in case that context winds up being useful. -Furthermore, in 2025 we are hoping to have some sort of retrieval-augmented generation (RAG) capability that is either part of or tightly integrated with InstructLab. Such a capability would have signficant overlap with the functionality of the preprocessing for SDG. As noted above, when a taxonomy has a knowledge qna.yaml file that references a document, SDG uses Docling to convert the file to JSON and then splits the file into chunks of appropriate size for SDG. The RAG capability would _also_ want the same Docling JSON output but would need to split it into chunks that are sized appropriately for vector retrieval (i.e., that fit within the context window of the semantic encoding model). +Furthermore, in 2025 we are hoping to have some sort of retrieval-augmented generation (RAG) capability that is either part of or tightly integrated with InstructLab. Such a capability would have significant overlap with the functionality of the preprocessing for SDG. As noted above, when a taxonomy has a knowledge qna.yaml file that references a document, SDG uses Docling to convert the file to JSON and then splits the file into chunks of appropriate size for SDG. The RAG capability would *also* want the same Docling JSON output but would need to split it into chunks that are sized appropriately for vector retrieval (i.e., that fit within the context window of the semantic encoding model). ## Question 1: What user flows should be supported? @@ -27,8 +27,8 @@ Here are some user flows that seem like they might be valuable: - 1.1. They have a taxonomy with some new seed data they added to it. They want to run the full pipeline including SDG and model training and evaluation. - 1.2. They have a taxonomy with some new seed data they added to it. They want to run SDG and then evaluate an existing model on the outputs of that SDG. - 1.3. 
They have a taxonomy with some new seed data they added to it. They want to run SDG only. - - 1.3.1. They also want to see the _inputs_ to SDG that get extracted from the taxonomy (i.e., a set of seed context/question/answer tuples and optionallly a set of document chunks). - - 1.3.2. Alternatively, maybe they _only_ want to see the _inputs_ to SDG -- they don't actually want to run SDG. + - 1.3.1. They also want to see the inputs to SDG that get extracted from the taxonomy (i.e., a set of seed context/question/answer tuples and optionallly a set of document chunks). + - 1.3.2. Alternatively, maybe they only want to see the inputs to SDG -- they don't actually want to run SDG. - 1.4. They have a set of seed context/question/answer tuples and optionallly a set of document chunks. They want to run the full pipeline including SDG and model training and evaluation. - 1.5. They have a set of seed context/question/answer tuples and optionallly a set of document chunks. They want to run SDG and then evaluate an existing model on the outputs of that SDG. - 1.6. They have a set of seed context/question/answer tuples and optionallly a set of document chunks. They want to run SDG only. @@ -36,13 +36,13 @@ Here are some user flows that seem like they might be valuable: - 2.1. They have a taxonomy with some new seed data they added to it. They want to run SDG only without any postprocessing. - 2.2. They have a set of seed context/question/answer tuples and optionallly a set of document chunks. They want to run SDG only without any postprocessing. -If I understand the guidance from our PM (William Caban), the flows we are being asked to support here are 1.1., 1.2., 1.3, 1.3.1, and 2.2. However, I am not sure that I understand the guidance. Are we confident that we _do_ want to support those flows and do not want to support any of the others? More specifically, do we think the users who want 2.2 might be satisfied with 1.6 instead? Both of those are SDG-only flows, but the latter is more developer focused and the former is more business-user focused. Also, are we really confident that all of the "SDG only" customers really want to do their own document chunking? Might there be some customers that have seed context/question/answer tuples and _documents_ and want us to chunk the documents for them? +If I understand the guidance from our product management, the flows that our users want us to support here are 1.1., 1.2., 1.3, 1.3.1, and 2.2. However, I am not sure that I understand the guidance. Are we confident that we do want to support those flows and do not want to support any of the others? More specifically, do we think the users who want 2.2 might be satisfied with 1.6 instead? Both of those are SDG-only flows, but the latter is more developer focused and the former is more business-user focused. Also, are we really confident that all of the "SDG only" customers really want to do their own document chunking? Might there be some customers that have seed context/question/answer tuples and documents and want us to chunk the documents for them? -Also note, that for all of the flows above that start with "they have a taxonomy", there are open questions around the complexity of the taxonomy format. Some users want to be able to provide seed data and documents without explicitly or implicitly extending some sort of base taxonomy. 
If we had something simpler than the existing taxonomy format that still included some way to specify context/question/answer tuples and references to full documents (i.e., not _chunks_), would we even call that a taxonomy or would it be something else? If we call it something else, then some or all of the flows that take in a taxonomy might also be applicable for that something else. For the remainder of this document, however, we will assume that any sort of simplified/easier-to-use variant of a taxonomy will also be called a "taxonomy". +Also note, that for all of the flows above that start with "they have a taxonomy", there are open questions around the complexity of the taxonomy format. Some users want to be able to provide seed data and documents without explicitly or implicitly extending some sort of base taxonomy. If we had something simpler than the existing taxonomy format that still included some way to specify context/question/answer tuples and references to full documents (i.e., not *chunks*), would we even call that a taxonomy or would it be something else? If we call it something else, then some or all of the flows that take in a taxonomy might also be applicable for that something else. For the remainder of this document, however, we will assume that any sort of simplified/easier-to-use variant of a taxonomy will also be called a "taxonomy". ## Question 2: What should the commands be in the CLI? -One way to support both 1.3.1 and 1.3.2 would be to have separate CLI commands for the preprocessing, core SDG, and postprocessing step . Alternatively, a single CLI command that does all of these and also saves the outputs of preprocessing to disk would support 1.3.1 but _not_ 1.3.2. Even if we only want to support 1.3.1, having separate CLI commands for each step might be desirable because it is just more intuitive that if a user wants to save the outputs of preprocessing to disk to have a command to do that instead of having it be a "side effect" of an omnibus SDG command. Here is a rough outline of what the separate commands would be: +One way to support both 1.3.1 and 1.3.2 would be to have separate CLI commands for the preprocessing, core SDG, and postprocessing step . Alternatively, a single CLI command that does all of these and also saves the outputs of preprocessing to disk would support 1.3.1 but *not* 1.3.2. Even if we only want to support 1.3.1, having separate CLI commands for each step might be desirable because it is just more intuitive that if a user wants to save the outputs of preprocessing to disk to have a command to do that instead of having it be a "side effect" of an omnibus SDG command. Here is a rough outline of what the separate commands would be: - `ilab data prep` would handle all the preprocessing (the first three bullets in the Context section above, plus any additional preprocessing we add in the future). - `ilab data generate` would take as input some data in the same format that `ilab data prep` produces and would run the core synthetic data generation *only*. Note that this is a breaking change from the current behavior of `ilab data generate`, but that may be acceptable because the user base is still small. @@ -56,7 +56,7 @@ As noted earlier, currently the preprocessing and postprocessing code is in the ### Option 1: Leave preprocessing and postprocessing in SDG -Currently there is no documentation that I know of that explains how to do 2.1 or 2.2 (or anything else, really) with the SDG library by itself. 
However, with some additional documenting and _maybe_ some refactoring, it should be feasible to support both 2.1 and 2.2 in SDG. With that said, if 2.1 is not needed and 2.2 is, then it would _also_ be possible to move the preprocessing and postprocessing code out of SDG. Some pros and cons of leaving in SDG: +Currently there is no documentation that I know of that explains how to do 2.1 or 2.2 (or anything else, really) with the SDG library by itself. However, with some additional documenting and *maybe* some refactoring, it should be feasible to support both 2.1 and 2.2 in SDG. With that said, if 2.1 is not needed and 2.2 is, then it would *also* be possible to move the preprocessing and postprocessing code out of SDG. Some pros and cons of leaving in SDG: Pro: @@ -68,9 +68,9 @@ Con: - The core logic of SDG is inherently complex and represents some of the most sophisticated and differentiating elements of InstructLab. For that reason, it would be nice to have it in its own repository by itself. New contributors to that core logic find it challenging enough to navigate the core functionality without having to also figure out where the core logic starts and the preprocessing and postprocessing capabilities end. This could be mitigated by having better technical documentation (README, comments) for the SDG library. - As noted in the Context section earlier, in the near future we are absorbing a set of updates to the core SDG functionality. Absorbing those updates is somewhat simpler if the core SDG logic is all alone in a repository of its own. -- Keeping interconnected components in the same repository provides less pressure to consistently document the API contracts between them. We certainly _could_ have well documented API contracts for preprocessing and postprocessing and core SDG functionality that makes it clear how they interact even when both of these exist in the same repository, but it is probably more likely that we _will_ do so if they are separated. +- Keeping interconnected components in the same repository provides less pressure to consistently document the API contracts between them. We certainly *could* have well documented API contracts for preprocessing and postprocessing and core SDG functionality that makes it clear how they interact even when both of these exist in the same repository, but it is probably more likely that we *will* do so if they are separated. - The logic behind the core SDG algorithms are mainly developed and maintained by the Red Hat AI Innovations team (commonly referred to as the "research" team because many people on that team used to work for IBM Research) while the logic behind the preprocessing and postprocessing is mainly developed and maintained by the Red Hat AI engineering "data" team. Having multiple teams working on a component increases the amount of coordination required. Note, however, that preprocessing, postprocessing and core SDG all belong to the entire InstructLab commmunity and *not* Red Hat (much less any one team in Red Hat). So the teams really need to keep collaborating with the entire community at all times and not get into a mindset of "owning" a single piece of code. -- The ezxpected RAG functionality in 2025 will have some complex interactions with both preprocessing and postprocessing, perhaps even involving user flows in which the core SDG functionality is not needed. 
In that case, it would be confusing to have the code path for RAG include a call out to the SDG library for doing preprocessing but not actually doing the core SDG. +- The expected RAG functionality in 2025 will have some complex interactions with both preprocessing and postprocessing, perhaps even involving user flows in which the core SDG functionality is not needed. In that case, it would be confusing to have the code path for RAG include a call out to the SDG library for doing preprocessing but not actually doing the core SDG. - It would just be simpler to explain to all stakeholders if the functionality that I've been calling "core SDG" was really just called "SDG". We can't do that now because the SDG library has preprocessing and postprocessing in it too. Conclusion: @@ -95,21 +95,21 @@ Cons: Conclusion: -- The cost of having a separate reposity is so high that we would only consider this option as a last resort. +- The cost of having a separate repository is so high that we would only consider this option as a last resort. ### Option 3: Move preprocessing and postprocessing into the CLI repository Pro: - The CLI already has a lot of "supporting" (non-core) functionality, so it would respect established precedent to include preprocessing and/or postprocessing here. -- Supporting user flow 1.3.2 requires separate CLI commands for preprocessing and core SDG. This is slightly simpler if preprocessing is implemented in CLI. If preprocessing remains in the SDG library instead then the CLI would need to make _separate_ calls to the SDG library for preprocessing and core SDG to support user flow 1.3.2. That adds a little complexity. +- Supporting user flow 1.3.2 requires separate CLI commands for preprocessing and core SDG. This is slightly simpler if preprocessing is implemented in CLI. If preprocessing remains in the SDG library instead then the CLI would need to make separate calls to the SDG library for preprocessing and core SDG to support user flow 1.3.2. That adds a little complexity. - Avoids some of the cons of Option 1, but see below for some overlap. - Avoids some of the cons of Option 2, but see below for some overlap. Con: - Avoids the pros of both Option 1 and Option 2. -- As with Option 1, this approach involves a lot of coordination. There are a lot of stakehoders involved in the CLI and locating preprocessing and postprocessing there drags those stakeholders into issues relating to preprocessing and postprocessing. However, as with Option 1, coordinating across stakeholders is something an open source project needs to do well anyway to remain engaged with the community. +- As with Option 1, this approach involves a lot of coordination. There are a lot of stakeholders involved in the CLI and locating preprocessing and postprocessing there drags those stakeholders into issues relating to preprocessing and postprocessing. However, as with Option 1, coordinating across stakeholders is something an open source project needs to do well anyway to remain engaged with the community. - As with Option 1, this approach suffers from the fact that keeping interconnected components in the same repository provides less pressure to consistently document the API contracts between them. In the case of Option 1, the interconnected components that would not have as much pressure to be documented would be preprocessing/postprocessing and core SDG. 
In the case of Option 3, the interconnected components that would not have as much pressure to be documented would be the CLI and preprocessing/postprocessing. However, in both cases, this con could be alleviated by just having the discipline to document the APIs well even without such pressure. - As with Option 2, this approach would not enable user flow 2.1. Maybe that's fine since it is not on our requirements list. @@ -119,7 +119,7 @@ Conclusion: ### Option 4: Preprocessing and postprocessing go to different locations -We could also mix and match any of the above options separately for preprocessing and postprocessing. For example, preprocessing could move to the CLI repo and postprocessing could stay in the SDG repo. Or preprocessing could move to a new repository and postprocessing could move to a different new repository or the _same_ new repository. Enumerating all possible permutations of where each could go and enumerating pros and cons of each of them would make this document unbearably long. If anyone wants to advocate for a small number of specific permutations, we will add them to this document. +We could also mix and match any of the above options separately for preprocessing and postprocessing. For example, preprocessing could move to the CLI repo and postprocessing could stay in the SDG repo. Or preprocessing could move to a new repository and postprocessing could move to a different new repository or the same new repository. Enumerating all possible permutations of where each could go and enumerating pros and cons of each of them would make this document unbearably long. If anyone wants to advocate for a small number of specific permutations, we will add them to this document. ## Question 4: Should preprocessing, postprocessing, and core SDG be separate Python packages? From 3d41e04ec0aea33082a98cf543908474b8df8992 Mon Sep 17 00:00:00 2001 From: Bill Murdock Date: Fri, 15 Nov 2024 12:34:10 -0500 Subject: [PATCH 03/11] Update in response to review comments Signed-off-by: Bill Murdock --- docs/sdg/sdg-refactor.md | 45 +++++++++++++++++++++++++--------------- 1 file changed, 28 insertions(+), 17 deletions(-) diff --git a/docs/sdg/sdg-refactor.md b/docs/sdg/sdg-refactor.md index 4fd555b6..d3f4ef6e 100644 --- a/docs/sdg/sdg-refactor.md +++ b/docs/sdg/sdg-refactor.md @@ -19,26 +19,30 @@ Also as context, in the near future we are absorbing a set of updates to the cor Furthermore, in 2025 we are hoping to have some sort of retrieval-augmented generation (RAG) capability that is either part of or tightly integrated with InstructLab. Such a capability would have significant overlap with the functionality of the preprocessing for SDG. As noted above, when a taxonomy has a knowledge qna.yaml file that references a document, SDG uses Docling to convert the file to JSON and then splits the file into chunks of appropriate size for SDG. The RAG capability would *also* want the same Docling JSON output but would need to split it into chunks that are sized appropriately for vector retrieval (i.e., that fit within the context window of the semantic encoding model). +An additional complication is the fact that InstructLab's existing "taxonomy" structure is a tree structure encoded as a git repo that can be cloned/pushed/shared using the normal git constructs and flow. A taxonomy has *staged* nodes that are presumed to already be fully addressed by the model and *unstaged* nodes that are not, which is why the first item in the list above involves identifying only the unstaged qna.yaml files. 
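To make the overlap with RAG described a few paragraphs above more concrete, the sketch below shows one shared document conversion feeding two different chunk sizes, one for SDG contexts and one for vector retrieval. It assumes Docling's `DocumentConverter` interface; the source URL, the size limits, and the naive character-based splitter are placeholders rather than real settings or a real chunking strategy.

```python
# Illustrative sketch: convert a referenced document once with Docling, then
# produce SDG-sized and retrieval-sized chunks from the same converted output.
# The current flow exports Docling's JSON; markdown is used here only to keep
# the chunking illustration simple.
from docling.document_converter import DocumentConverter


def convert_document(source: str) -> str:
    """Convert a source document (path or URL) once and return its text."""
    result = DocumentConverter().convert(source)
    return result.document.export_to_markdown()


def chunk_text(text: str, max_chars: int) -> list[str]:
    """Naive fixed-size splitting; real chunking would respect structure and tokens."""
    return [text[i : i + max_chars] for i in range(0, len(text), max_chars)]


if __name__ == "__main__":
    text = convert_document("https://example.com/knowledge-source.pdf")  # hypothetical document
    sdg_chunks = chunk_text(text, max_chars=4000)  # sized for SDG contexts (illustrative)
    rag_chunks = chunk_text(text, max_chars=1000)  # sized for an embedding model's window (illustrative)
```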
However, some users might have the essential elements of a taxonomy (seed context/question/answer tuples for both skills and knowledge plus documents for knowledge) but do not want to put that information in a tree it a git repo. For the purposes of this document, we will refer to those essential elements as "raw seed content". The "raw seed content" includes all of the things that go into a qna.yanl file. In the current code base, the way InstructLab gets to the raw seed content is by identifying unstaged qna.yaml files from a local clone of a taxonomy. However, in the future we might add functionality that allows users to simply point at some raw seed content without having to tie it to a github respository for a taxonomy. If the raw seed content includes knowledge elements (not just skills) then those knowledge elements will have references to documents. When the raw seed content is processed, the documents are fetched, converted, and chunked (the third step in the list above). For this document, we will use the term "processed seed content" to refer to the outputs of that processing. So to summarize the data structure terms being discussed here: + +- *Raw seed content* -- A set of elements each of which has a set of context/question/answer tuples. Some elements may be *knowledge* elements which also have references to documents. +- *Processed seed content* -- The same as raw seed content except all references to documents are replaced with a set of document chunks of approrpriate size for SDG. +- *Taxonomy* -- A tree structure encoded as a git repo. Some leaves of the taxonomy are unstaged, indicating that they should be used for raw seed content. + ## Question 1: What user flows should be supported? Here are some user flows that seem like they might be valuable: 1. User installs the full InstructLab (CLI and/or GUI). They want any of the following using CLI or GUI interactions: - - 1.1. They have a taxonomy with some new seed data they added to it. They want to run the full pipeline including SDG and model training and evaluation. - - 1.2. They have a taxonomy with some new seed data they added to it. They want to run SDG and then evaluate an existing model on the outputs of that SDG. - - 1.3. They have a taxonomy with some new seed data they added to it. They want to run SDG only. - - 1.3.1. They also want to see the inputs to SDG that get extracted from the taxonomy (i.e., a set of seed context/question/answer tuples and optionallly a set of document chunks). + - 1.1. They have raw seed content. They want to run the full pipeline including SDG and model training and evaluation. + - 1.2. They have raw seed content. They want to run SDG and then evaluate an existing model on the outputs of that SDG. + - 1.3. They have raw seed content. They want to run SDG only. + - 1.3.1. They also want to see the inputs to SDG that get extracted from the raw seed content (i.e., a set of seed context/question/answer tuples and with document chunks for the knowledge if any). - 1.3.2. Alternatively, maybe they only want to see the inputs to SDG -- they don't actually want to run SDG. - - 1.4. They have a set of seed context/question/answer tuples and optionallly a set of document chunks. They want to run the full pipeline including SDG and model training and evaluation. - - 1.5. They have a set of seed context/question/answer tuples and optionallly a set of document chunks. They want to run SDG and then evaluate an existing model on the outputs of that SDG. - - 1.6. 
They have a set of seed context/question/answer tuples and optionallly a set of document chunks. They want to run SDG only. + - 1.4. They have processed seed content. They want to run the full pipeline including SDG and model training and evaluation. + - 1.5. They have processed seed content. They want to run SDG and then evaluate an existing model on the outputs of that SDG. + - 1.6. They have processed seed content. They want to run SDG only. 2. User installs the SDG library only. They want to invoke any of the following as a library call from code they write: - - 2.1. They have a taxonomy with some new seed data they added to it. They want to run SDG only without any postprocessing. - - 2.2. They have a set of seed context/question/answer tuples and optionallly a set of document chunks. They want to run SDG only without any postprocessing. - -If I understand the guidance from our product management, the flows that our users want us to support here are 1.1., 1.2., 1.3, 1.3.1, and 2.2. However, I am not sure that I understand the guidance. Are we confident that we do want to support those flows and do not want to support any of the others? More specifically, do we think the users who want 2.2 might be satisfied with 1.6 instead? Both of those are SDG-only flows, but the latter is more developer focused and the former is more business-user focused. Also, are we really confident that all of the "SDG only" customers really want to do their own document chunking? Might there be some customers that have seed context/question/answer tuples and documents and want us to chunk the documents for them? + - 2.1. They have raw seed content. They want to run SDG only without any postprocessing. + - 2.2. They have processed seed content. They want to run SDG only without any postprocessing. -Also note, that for all of the flows above that start with "they have a taxonomy", there are open questions around the complexity of the taxonomy format. Some users want to be able to provide seed data and documents without explicitly or implicitly extending some sort of base taxonomy. If we had something simpler than the existing taxonomy format that still included some way to specify context/question/answer tuples and references to full documents (i.e., not *chunks*), would we even call that a taxonomy or would it be something else? If we call it something else, then some or all of the flows that take in a taxonomy might also be applicable for that something else. For the remainder of this document, however, we will assume that any sort of simplified/easier-to-use variant of a taxonomy will also be called a "taxonomy". +If I understand the latest guidance from our product management, the flows that our users want us to support here are 1.1., 1.2., 1.3, 1.3.1, and 1.6. In an earlier draft of this proposal, I had said that I thought product management also wanted 2.2, but the latest guidance doesn't seem consistent with that understanding. I am still not sure, so more clarification would be helpful. ## Question 2: What should the commands be in the CLI? @@ -60,13 +64,14 @@ Currently there is no documentation that I know of that explains how to do 2.1 o Pro: -- Future changes to the input format for preprocessing and/or the output format for postprocessing (e.g., adding more expressive power to the taxonomy format) require changes to the core SDG *and* the preprocssing/postprocessing. 
That's easier to do if they are in the same repository because they can be done in a single PR instead of multiple PRs that need to be coordinated. +- Future changes to the input format for preprocessing and/or the output format for postprocessing (e.g., adding more expressive power to the taxonomy format) require changes to the core SDG *and* the preprocessing/postprocessing. That's easier to do if they are in the same repository because they can be done in a single PR instead of multiple PRs that need to be coordinated. - It is simpler to leave things where they are. - If we're not totally sure which of the options we want, then it might make more sense to stick with this option for now since it avoids doing a work to move preprocessing and postprocessing *now* that could then be followed by more work to move preprocessing and postprocessing *again* after we decide where it goes. Con: - The core logic of SDG is inherently complex and represents some of the most sophisticated and differentiating elements of InstructLab. For that reason, it would be nice to have it in its own repository by itself. New contributors to that core logic find it challenging enough to navigate the core functionality without having to also figure out where the core logic starts and the preprocessing and postprocessing capabilities end. This could be mitigated by having better technical documentation (README, comments) for the SDG library. +- To the extent that the plan is for SDG to be run independently, then there will be tooling built around the SDG repo. The more tooling built around just running SDG independently the more risk of breaking contracts for that tooling. The more functionality living in SDG that isn't SDG, the more surface area there is to break. - As noted in the Context section earlier, in the near future we are absorbing a set of updates to the core SDG functionality. Absorbing those updates is somewhat simpler if the core SDG logic is all alone in a repository of its own. - Keeping interconnected components in the same repository provides less pressure to consistently document the API contracts between them. We certainly *could* have well documented API contracts for preprocessing and postprocessing and core SDG functionality that makes it clear how they interact even when both of these exist in the same repository, but it is probably more likely that we *will* do so if they are separated. - The logic behind the core SDG algorithms are mainly developed and maintained by the Red Hat AI Innovations team (commonly referred to as the "research" team because many people on that team used to work for IBM Research) while the logic behind the preprocessing and postprocessing is mainly developed and maintained by the Red Hat AI engineering "data" team. Having multiple teams working on a component increases the amount of coordination required. Note, however, that preprocessing, postprocessing and core SDG all belong to the entire InstructLab commmunity and *not* Red Hat (much less any one team in Red Hat). So the teams really need to keep collaborating with the entire community at all times and not get into a mindset of "owning" a single piece of code. @@ -91,7 +96,7 @@ Cons: - Avoids all the pros of Option 1. - Having a separate repository with its own library brings in an enormous amount of overhead in maintaining that repository (e.g., CI/CD). 
- Having a separate repository with its own library also brings in an enormous amount of overhead in maintaining the CLI repository's dependency on all of those libraries. -- Does not allow user flow 2.1 but maybe that's OK because it is not a priority and anyway the users could approximate that flow by also installing the ingestion library. +- Does not allow user flow 2.1 (because that flow explicitly excludes installing the CLI) but maybe that's OK because it is not a priority and anyway the users could approximate that flow by also installing the ingestion library. Conclusion: @@ -101,15 +106,21 @@ Conclusion: Pro: -- The CLI already has a lot of "supporting" (non-core) functionality, so it would respect established precedent to include preprocessing and/or postprocessing here. -- Supporting user flow 1.3.2 requires separate CLI commands for preprocessing and core SDG. This is slightly simpler if preprocessing is implemented in CLI. If preprocessing remains in the SDG library instead then the CLI would need to make separate calls to the SDG library for preprocessing and core SDG to support user flow 1.3.2. That adds a little complexity. +- The CLI already has a lot of "supporting" (non-core) functionality. It contains most user facing logic aside from what we call the "core" parts of the workflow (SDG, Train, Eval). Since the preprocessing and postprocessing are non-code parts of SDG, this change would respect established precedent. Examples of existing functionality that follow this pattern include all of the following and more: + - download + - serve + - chat + - list + - edit + - init +- Supporting user flow 1.3.2 requires separate CLI commands for preprocessing and core SDG. This is slightly simpler if preprocessing is implemented in CLI. If preprocessing remains in the SDG library instead then the CLI would need to make separate calls to the SDG library for preprocessing and core SDG to support user flow 1.3.2. That adds a little complexity. - Avoids some of the cons of Option 1, but see below for some overlap. - Avoids some of the cons of Option 2, but see below for some overlap. Con: - Avoids the pros of both Option 1 and Option 2. -- As with Option 1, this approach involves a lot of coordination. There are a lot of stakeholders involved in the CLI and locating preprocessing and postprocessing there drags those stakeholders into issues relating to preprocessing and postprocessing. However, as with Option 1, coordinating across stakeholders is something an open source project needs to do well anyway to remain engaged with the community. +- As with Option 1, this approach involves a lot of coordination. There are a lot of stakeholders involved in the CLI and locating preprocessing and postprocessing there drags those stakeholders into issues relating to preprocessing and postprocessing. However, as with Option 1, coordinating across stakeholders is something an open source project needs to do well anyway to remain engaged with the community. - As with Option 1, this approach suffers from the fact that keeping interconnected components in the same repository provides less pressure to consistently document the API contracts between them. In the case of Option 1, the interconnected components that would not have as much pressure to be documented would be preprocessing/postprocessing and core SDG. In the case of Option 3, the interconnected components that would not have as much pressure to be documented would be the CLI and preprocessing/postprocessing. 
However, in both cases, this con could be alleviated by just having the discipline to document the APIs well even without such pressure. - As with Option 2, this approach would not enable user flow 2.1. Maybe that's fine since it is not on our requirements list. From 630d63708afbed30f0d1ec04c3947b12c90de37b Mon Sep 17 00:00:00 2001 From: Bill Murdock Date: Sun, 17 Nov 2024 19:40:15 -0500 Subject: [PATCH 04/11] Update draft conclusions Signed-off-by: Bill Murdock --- docs/sdg/sdg-refactor.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/sdg/sdg-refactor.md b/docs/sdg/sdg-refactor.md index d3f4ef6e..0b1e9f8b 100644 --- a/docs/sdg/sdg-refactor.md +++ b/docs/sdg/sdg-refactor.md @@ -142,6 +142,6 @@ Since this is a draft, no decisions are made yet. However, here are the current - We will support the following user flows: 1.1., 1.2., 1.3, 1.3.1, 1.3.2, 2.1, and 2.1 as documented in the Question 1 section above. - We will adopt the updates to the CLI that will be documented in Question 2 above. -- We will leave the preprocessing in SDG as described in Question 3: Option 1. -- We will leave the postprocessing in SDG as described in Question 3: Option 1. +- We will move preprocessing to the CLI repository as described in Question 3: Option 3. +- We will move postprocessing to the CLI repository as described in Question 3: Option 3. - We will not separate preprocessing, postprocessing, and SDG into separate packages. From 92eb6b56de5a96a85304b42268bff0fdaa51bd17 Mon Sep 17 00:00:00 2001 From: Bill Murdock Date: Sun, 17 Nov 2024 19:58:57 -0500 Subject: [PATCH 05/11] Fix some typos Signed-off-by: Bill Murdock --- docs/sdg/sdg-refactor.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/sdg/sdg-refactor.md b/docs/sdg/sdg-refactor.md index 0b1e9f8b..d2eeddea 100644 --- a/docs/sdg/sdg-refactor.md +++ b/docs/sdg/sdg-refactor.md @@ -13,16 +13,16 @@ The existing synthetic data generation (SDG) repository includes several related - Mix the outputs with some pre-computed data sets when applicable - Split the data into train and test Of all of these, only the one emphasized (*Given the seed data ... generate ... tuples*) is core SDG functionality. The others are essentially preprocessing and postprocessing steps to enable the core SDG functionality and produce outputs usable for future steps. In the current flow, preprocessing has a taxonomy with some new seed data added to it as input. The output of preprocessing includes a set of context/question/answer tuples for both knowledge and skill taxonomy nodes. For knowledge taxonomy nodes it also includes a set of document chunks. SDG uses the context/question/answer tuples as seed examples, and it uses the document chunks (if there are any) as example contexts from which to generate additional data. That additional data is then sent to the postprocessing step to produce the final outputs. -We have heard that some users want a stand-alone SDG capability that includes only the core SDG functionality.
Specifically, they already have a set of seed context/question/answer tuples and optionally a set of document chunks. All they want from SDG is to take that input and produce an new synthetic data set as output without doing any mixing into pre-computed data or splitting into train and test. The preprocessing and postprocessing capabilities currently in SDG are not relevant to those users. Also as context, in the near future we are absorbing a set of updates to the core SDG functionality to make it more modularized and flexible. That might turn out to be irrelevant to this document which is focused on what to do with the non-core functionality (preprocessing and postprocessing). However, it is mentioned here in the context section in case that context winds up being useful. Furthermore, in 2025 we are hoping to have some sort of retrieval-augmented generation (RAG) capability that is either part of or tightly integrated with InstructLab. Such a capability would have significant overlap with the functionality of the preprocessing for SDG. As noted above, when a taxonomy has a knowledge qna.yaml file that references a document, SDG uses Docling to convert the file to JSON and then splits the file into chunks of appropriate size for SDG. The RAG capability would *also* want the same Docling JSON output but would need to split it into chunks that are sized appropriately for vector retrieval (i.e., that fit within the context window of the semantic encoding model). -An additional complication is the fact that InstructLab's existing "taxonomy" structure is a tree structure encoded as a git repo that can be cloned/pushed/shared using the normal git constructs and flow. A taxonomy has *staged* nodes that are presumed to already be fully addressed by the model and *unstaged* nodes that are not, which is why the first item in the list above involves identifying only the unstaged qna.yaml files. However, some users might have the essential elements of a taxonomy (seed context/question/answer tuples for both skills and knowledge plus documents for knowledge) but do not want to put that information in a tree it a git repo. For the purposes of this document, we will refer to those essential elements as "raw seed content". The "raw seed content" includes all of the things that go into a qna.yanl file. In the current code base, the way InstructLab gets to the raw seed content is by identifying unstaged qna.yaml files from a local clone of a taxonomy. However, in the future we might add functionality that allows users to simply point at some raw seed content without having to tie it to a github respository for a taxonomy. If the raw seed content includes knowledge elements (not just skills) then those knowledge elements will have references to documents. When the raw seed content is processed, the documents are fetched, converted, and chunked (the third step in the list above). For this document, we will use the term "processed seed content" to refer to the outputs of that processing. So to summarize the data structure terms being discussed here: +An additional complication is the fact that InstructLab's existing "taxonomy" structure is a tree structure encoded as a git repo that can be cloned/pushed/shared using the normal git constructs and flow. A taxonomy has *staged* nodes that are presumed to already be fully addressed by the model and *unstaged* nodes that are not, which is why the first item in the list above involves identifying only the unstaged qna.yaml files. 
However, some users might have the essential elements of a taxonomy (seed context/question/answer tuples for both skills and knowledge plus documents for knowledge) but do not want to put that information in a tree it a git repo. For the purposes of this document, we will refer to those essential elements as "raw seed content". The "raw seed content" includes all of the things that go into a qna.yanl file. In the current code base, the way InstructLab gets to the raw seed content is by identifying unstaged qna.yaml files from a local clone of a taxonomy. However, in the future we might add functionality that allows users to simply point at some raw seed content without having to tie it to a github repository for a taxonomy. If the raw seed content includes knowledge elements (not just skills) then those knowledge elements will have references to documents. When the raw seed content is processed, the documents are fetched, converted, and chunked (the third step in the list above). For this document, we will use the term "processed seed content" to refer to the outputs of that processing. So to summarize the data structure terms being discussed here: - *Raw seed content* -- A set of elements each of which has a set of context/question/answer tuples. Some elements may be *knowledge* elements which also have references to documents. -- *Processed seed content* -- The same as raw seed content except all references to documents are replaced with a set of document chunks of approrpriate size for SDG. +- *Processed seed content* -- The same as raw seed content except all references to documents are replaced with a set of document chunks of appropriate size for SDG. - *Taxonomy* -- A tree structure encoded as a git repo. Some leaves of the taxonomy are unstaged, indicating that they should be used for raw seed content. ## Question 1: What user flows should be supported? @@ -71,10 +71,10 @@ Pro: Con: - The core logic of SDG is inherently complex and represents some of the most sophisticated and differentiating elements of InstructLab. For that reason, it would be nice to have it in its own repository by itself. New contributors to that core logic find it challenging enough to navigate the core functionality without having to also figure out where the core logic starts and the preprocessing and postprocessing capabilities end. This could be mitigated by having better technical documentation (README, comments) for the SDG library. -- To the extent that the plan is for SDG to be run independently, then there will be tooling built around the SDG repo. The more tooling built around just running SDG independently the more risk of breaking contracts for that tooling. The more functionality living in SDG that isn't SDG, the more surface area there is to break. +- To the extent that the plan is for SDG to be run independently, then there will be tooling built around the SDG repo. The more tooling built around just running SDG independently the more risk of breaking contracts for that tooling. The more functionality living in SDG that isn't SDG, the more surface area there is to break. - As noted in the Context section earlier, in the near future we are absorbing a set of updates to the core SDG functionality. Absorbing those updates is somewhat simpler if the core SDG logic is all alone in a repository of its own. - Keeping interconnected components in the same repository provides less pressure to consistently document the API contracts between them. 
We certainly *could* have well documented API contracts for preprocessing and postprocessing and core SDG functionality that makes it clear how they interact even when both of these exist in the same repository, but it is probably more likely that we *will* do so if they are separated. -- The logic behind the core SDG algorithms are mainly developed and maintained by the Red Hat AI Innovations team (commonly referred to as the "research" team because many people on that team used to work for IBM Research) while the logic behind the preprocessing and postprocessing is mainly developed and maintained by the Red Hat AI engineering "data" team. Having multiple teams working on a component increases the amount of coordination required. Note, however, that preprocessing, postprocessing and core SDG all belong to the entire InstructLab commmunity and *not* Red Hat (much less any one team in Red Hat). So the teams really need to keep collaborating with the entire community at all times and not get into a mindset of "owning" a single piece of code. +- The logic behind the core SDG algorithms are mainly developed and maintained by the Red Hat AI Innovations team (commonly referred to as the "research" team because many people on that team used to work for IBM Research) while the logic behind the preprocessing and postprocessing is mainly developed and maintained by the Red Hat AI engineering "data" team. Having multiple teams working on a component increases the amount of coordination required. Note, however, that preprocessing, postprocessing and core SDG all belong to the entire InstructLab community and *not* Red Hat (much less any one team in Red Hat). So the teams really need to keep collaborating with the entire community at all times and not get into a mindset of "owning" a single piece of code. - The expected RAG functionality in 2025 will have some complex interactions with both preprocessing and postprocessing, perhaps even involving user flows in which the core SDG functionality is not needed. In that case, it would be confusing to have the code path for RAG include a call out to the SDG library for doing preprocessing but not actually doing the core SDG. - It would just be simpler to explain to all stakeholders if the functionality that I've been calling "core SDG" was really just called "SDG". We can't do that now because the SDG library has preprocessing and postprocessing in it too. From f05ecf40f3976a2e230b211be058e49bb283d455 Mon Sep 17 00:00:00 2001 From: Bill Murdock Date: Thu, 21 Nov 2024 13:06:40 -0500 Subject: [PATCH 06/11] Update sdg-refactor.md Signed-off-by: Bill Murdock --- docs/sdg/sdg-refactor.md | 46 ++++++++++++++++++++++++++-------------- 1 file changed, 30 insertions(+), 16 deletions(-) diff --git a/docs/sdg/sdg-refactor.md b/docs/sdg/sdg-refactor.md index d2eeddea..067bc384 100644 --- a/docs/sdg/sdg-refactor.md +++ b/docs/sdg/sdg-refactor.md @@ -19,7 +19,7 @@ Also as context, in the near future we are absorbing a set of updates to the cor Furthermore, in 2025 we are hoping to have some sort of retrieval-augmented generation (RAG) capability that is either part of or tightly integrated with InstructLab. Such a capability would have significant overlap with the functionality of the preprocessing for SDG. As noted above, when a taxonomy has a knowledge qna.yaml file that references a document, SDG uses Docling to convert the file to JSON and then splits the file into chunks of appropriate size for SDG. 
The RAG capability would *also* want the same Docling JSON output but would need to split it into chunks that are sized appropriately for vector retrieval (i.e., that fit within the context window of the semantic encoding model). -An additional complication is the fact that InstructLab's existing "taxonomy" structure is a tree structure encoded as a git repo that can be cloned/pushed/shared using the normal git constructs and flow. A taxonomy has *staged* nodes that are presumed to already be fully addressed by the model and *unstaged* nodes that are not, which is why the first item in the list above involves identifying only the unstaged qna.yaml files. However, some users might have the essential elements of a taxonomy (seed context/question/answer tuples for both skills and knowledge plus documents for knowledge) but do not want to put that information in a tree it a git repo. For the purposes of this document, we will refer to those essential elements as "raw seed content". The "raw seed content" includes all of the things that go into a qna.yanl file. In the current code base, the way InstructLab gets to the raw seed content is by identifying unstaged qna.yaml files from a local clone of a taxonomy. However, in the future we might add functionality that allows users to simply point at some raw seed content without having to tie it to a github repository for a taxonomy. If the raw seed content includes knowledge elements (not just skills) then those knowledge elements will have references to documents. When the raw seed content is processed, the documents are fetched, converted, and chunked (the third step in the list above). For this document, we will use the term "processed seed content" to refer to the outputs of that processing. So to summarize the data structure terms being discussed here: +An additional complication is the fact that InstructLab's existing "taxonomy" structure is a tree structure encoded as a git repo that can be cloned/pushed/shared using the normal git constructs and flow. A taxonomy has *staged* nodes that are presumed to already be fully addressed by the model and *unstaged* nodes that are not, which is why the first item in the list above involves identifying only the unstaged qna.yaml files. However, some users might have the essential elements of a taxonomy (seed context/question/answer tuples for both skills and knowledge plus documents for knowledge) but do not want to put that information in a tree it a git repo. For the purposes of this document, we will refer to those essential elements as "raw seed content". The "raw seed content" includes all of the things that go into a qna.yaml file. In the current code base, the way InstructLab gets to the raw seed content is by identifying unstaged qna.yaml files from a local clone of a taxonomy. However, in the future we might add functionality that allows users to simply point at some raw seed content without having to tie it to a github repository for a taxonomy. If the raw seed content includes knowledge elements (not just skills) then those knowledge elements will have references to documents. When the raw seed content is processed, the documents are fetched, converted, and chunked (the third step in the list above). For this document, we will use the term "processed seed content" to refer to the outputs of that processing. So to summarize the data structure terms being discussed here: - *Raw seed content* -- A set of elements each of which has a set of context/question/answer tuples. 
Some elements may be *knowledge* elements which also have references to documents. - *Processed seed content* -- The same as raw seed content except all references to documents are replaced with a set of document chunks of appropriate size for SDG. @@ -29,7 +29,7 @@ An additional complication is the fact that InstructLab's existing "taxonomy" st Here are some user flows that seem like they might be valuable: -1. User installs the full InstructLab (CLI and/or GUI). They want any of the following using CLI or GUI interactions: +1. User installs the full InstructLab (command-line interface and/or graphical interface). They want any of the following using command-line or graphical interactions: - 1.1. They have raw seed content. They want to run the full pipeline including SDG and model training and evaluation. - 1.2. They have raw seed content. They want to run SDG and then evaluate an existing model on the outputs of that SDG. - 1.3. They have raw seed content. They want to run SDG only. @@ -44,9 +44,9 @@ Here are some user flows that seem like they might be valuable: If I understand the latest guidance from our product management, the flows that our users want us to support here are 1.1., 1.2., 1.3, 1.3.1, and 1.6. In an earlier draft of this proposal, I had said that I thought product management also wanted 2.2, but the latest guidance doesn't seem consistent with that understanding. I am still not sure, so more clarification would be helpful. -## Question 2: What should the commands be in the CLI? +## Question 2: What should the commands be in the command-line interface? -One way to support both 1.3.1 and 1.3.2 would be to have separate CLI commands for the preprocessing, core SDG, and postprocessing step . Alternatively, a single CLI command that does all of these and also saves the outputs of preprocessing to disk would support 1.3.1 but *not* 1.3.2. Even if we only want to support 1.3.1, having separate CLI commands for each step might be desirable because it is just more intuitive that if a user wants to save the outputs of preprocessing to disk to have a command to do that instead of having it be a "side effect" of an omnibus SDG command. Here is a rough outline of what the separate commands would be: +One way to support both 1.3.1 and 1.3.2 would be to have separate commands for the preprocessing, core SDG, and postprocessing step . Alternatively, a single command that does all of these and also saves the outputs of preprocessing to disk would support 1.3.1 but *not* 1.3.2. Even if we only want to support 1.3.1, having separate commands for each step might be desirable because it is just more intuitive that if a user wants to save the outputs of preprocessing to disk to have a command to do that instead of having it be a "side effect" of an omnibus SDG command. Here is a rough outline of what the separate commands would be: - `ilab data prep` would handle all the preprocessing (the first three bullets in the Context section above, plus any additional preprocessing we add in the future). - `ilab data generate` would take as input some data in the same format that `ilab data prep` produces and would run the core synthetic data generation *only*. Note that this is a breaking change from the current behavior of `ilab data generate`, but that may be acceptable because the user base is still small. @@ -95,34 +95,34 @@ Cons: - Avoids all the pros of Option 1. 
- Having a separate repository with its own library brings in an enormous amount of overhead in maintaining that repository (e.g., CI/CD). -- Having a separate repository with its own library also brings in an enormous amount of overhead in maintaining the CLI repository's dependency on all of those libraries. -- Does not allow user flow 2.1 (because that flow explicitly excludes installing the CLI) but maybe that's OK because it is not a priority and anyway the users could approximate that flow by also installing the ingestion library. +- Having a separate repository with its own library also brings in an enormous amount of overhead in maintaining the `instructlab/instructlab` repository's dependency on all of those libraries. +- Does not allow user flow 2.1 (because that flow includes installing *only* the SDG repository and requires preprocessing). That's OK because it is not a priority and anyway the users could approximate that flow by also installing the ingestion library. Conclusion: - The cost of having a separate repository is so high that we would only consider this option as a last resort. -### Option 3: Move preprocessing and postprocessing into the CLI repository +### Option 3: Move preprocessing and postprocessing into the instructlab/instructlab repository Pro: -- The CLI already has a lot of "supporting" (non-core) functionality. It contains most user facing logic aside from what we call the "core" parts of the workflow (SDG, Train, Eval). Since the preprocessing and postprocessing are non-code parts of SDG, this change would respect established precedent. Examples of existing functionality that follow this pattern include all of the following and more: +- The `instructlab/instructlab` repository already has a lot of "supporting" (non-core) functionality. It contains most user facing logic aside from what we call the "core" parts of the workflow (SDG, Train, Eval). Since the preprocessing and postprocessing are non-code parts of SDG, this change would respect established precedent. Examples of existing functionality that follow this pattern include all of the following and more: - download - serve - chat - list - edit - init -- Supporting user flow 1.3.2 requires separate CLI commands for preprocessing and core SDG. This is slightly simpler if preprocessing is implemented in CLI. If preprocessing remains in the SDG library instead then the CLI would need to make separate calls to the SDG library for preprocessing and core SDG to support user flow 1.3.2. That adds a little complexity. +- Supporting user flow 1.3.2 requires separate commands for preprocessing and core SDG. This is slightly simpler if preprocessing is implemented in the `instructlab/instructlab` repository. If preprocessing remains in the SDG library instead then the code in the `instructlab/instructlab` repository would need to make separate calls to the SDG library for preprocessing and core SDG to support user flow 1.3.2. That adds a little complexity. - Avoids some of the cons of Option 1, but see below for some overlap. - Avoids some of the cons of Option 2, but see below for some overlap. Con: - Avoids the pros of both Option 1 and Option 2. -- As with Option 1, this approach involves a lot of coordination. There are a lot of stakeholders involved in the CLI and locating preprocessing and postprocessing there drags those stakeholders into issues relating to preprocessing and postprocessing. 
However, as with Option 1, coordinating across stakeholders is something an open source project needs to do well anyway to remain engaged with the community. -- As with Option 1, this approach suffers from the fact that keeping interconnected components in the same repository provides less pressure to consistently document the API contracts between them. In the case of Option 1, the interconnected components that would not have as much pressure to be documented would be preprocessing/postprocessing and core SDG. In the case of Option 3, the interconnected components that would not have as much pressure to be documented would be the CLI and preprocessing/postprocessing. However, in both cases, this con could be alleviated by just having the discipline to document the APIs well even without such pressure. -- As with Option 2, this approach would not enable user flow 2.1. Maybe that's fine since it is not on our requirements list. +- As with Option 1, this approach involves a lot of coordination. There are a lot of stakeholders involved in the `instructlab/instructlab` repository and locating preprocessing and postprocessing there drags those stakeholders into issues relating to preprocessing and postprocessing. However, as with Option 1, coordinating across stakeholders is something an open source project needs to do well anyway to remain engaged with the community. +- As with Option 1, this approach suffers from the fact that keeping interconnected components in the same repository provides less pressure to consistently document the API contracts between them. In the case of Option 1, the interconnected components that would not have as much pressure to be documented would be preprocessing/postprocessing and core SDG. In the case of Option 3, the interconnected components that would not have as much pressure to be documented would be the command-line interface code and preprocessing/postprocessing. However, in both cases, this con could be alleviated by just having the discipline to document the APIs well even without such pressure. +- As with Option 2, this approach would not enable user flow 2.1. That's fine since it is not on our requirements list. Conclusion: @@ -130,7 +130,7 @@ Conclusion: ### Option 4: Preprocessing and postprocessing go to different locations -We could also mix and match any of the above options separately for preprocessing and postprocessing. For example, preprocessing could move to the CLI repo and postprocessing could stay in the SDG repo. Or preprocessing could move to a new repository and postprocessing could move to a different new repository or the same new repository. Enumerating all possible permutations of where each could go and enumerating pros and cons of each of them would make this document unbearably long. If anyone wants to advocate for a small number of specific permutations, we will add them to this document. +We could also mix and match any of the above options separately for preprocessing and postprocessing. For example, preprocessing could move to the `instructlab/instructlab` repository and postprocessing could stay in the SDG repo. Or preprocessing could move to a new repository and postprocessing could move to a different new repository or the same new repository. Enumerating all possible permutations of where each could go and enumerating pros and cons of each of them would make this document unbearably long. If anyone wants to advocate for a small number of specific permutations, we will add them to this document. 
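To make the Option 3 division of responsibilities above more concrete, here is a minimal structural sketch of how code living in the core repository might call preprocessing, the SDG library, and postprocessing as separate steps (which is what enables flows 1.3.1 and 1.3.2). Every function name below is hypothetical; this is not the actual InstructLab or SDG API.

```python
# Hypothetical sketch only: none of these names are real InstructLab or SDG APIs.
from pathlib import Path


def run_preprocessing(taxonomy_dir: Path, out_dir: Path) -> Path:
    # Under Option 3 this would live in the core repository: extract seed
    # context/question/answer tuples from unstaged qna.yaml files and chunk any
    # referenced documents, writing the processed seed content to out_dir.
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir


def run_generation(processed_seed_dir: Path, out_dir: Path) -> Path:
    # Thin wrapper around the SDG library: processed seed content in,
    # synthetic data out, nothing else.
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir


def run_postprocessing(generated_dir: Path, out_dir: Path) -> Path:
    # Under Option 3 this would also live in the core repository: mix with
    # pre-computed data sets and split into train and test.
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir


def run_pipeline(taxonomy_dir: Path, workdir: Path) -> None:
    # Because each step writes its output to disk, a user can stop after
    # preprocessing (flow 1.3.2) or inspect the inputs handed to SDG (flow 1.3.1).
    processed = run_preprocessing(taxonomy_dir, workdir / "preprocessed")
    generated = run_generation(processed, workdir / "generated")
    run_postprocessing(generated, workdir / "postprocessed")
```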
## Question 4: Should preprocessing, postprocessing, and core SDG be separate Python packages? @@ -140,8 +140,22 @@ If we choose Option 1 (leave preprocessing and postprocessing in SDG) then we st Since this is a draft, no decisions are made yet. However, here are the current draft decisions: +- The SDG codebase will be refactored in order to modularize based on pre-processing, data generation, and post-processing steps. - We will support the following user flows: 1.1., 1.2., 1.3, 1.3.1, 1.3.2, 2.1, and 2.1 as documented in the Question 1 section above. -- We will adopt the updates to the CLI that will be documented in Question 2 above. -- We will move preprocessing to the CLI repository as described in Question 3: Option 3. -- We will move preprocessing to the CLI repository as described in Question 3: Option 3. +- We will adopt the updates to the command-line interface that will be documented in Question 2 above. +- Pre-processing logic for SDG will be moved into the `instructlab/instructlab` repository as discussed in Option 3 above. +- Post-processing logic for SDG will be moved into the `instructlab/instructlab` repository as discussed in Option 3 above. +- The SDG codebase will be designed around the principle of "dataset in, dataset out". - We will not separate preprocessing, postprocessing, and SDG into separate packages. + +## Status + +- Proposed + +## Consequences + +Some of the consequences are covered earlier in the pros and cons for Option 3. Here is a brief recap of the most important of those: + +- SDG preprocessing and postprocessing will join a wide variety of glue/data-format capablities in that repository, increasing consistency. +- In the future changes to the kinds of content that SDG takes as inputs will require changes across both the SDG repository and the `instructlab/instructlab` repository. +- There will be less pressure to have a clear and well documented separation between the library APIs and the command-line interface for these functions because both are located in the same repository. We will mitigate this consequence by being disciplined about the separation. 
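As one illustration of the "dataset in, dataset out" principle adopted in the decisions above, the SDG entry point could look roughly like the sketch below. The record shape and function signature are assumptions made for illustration only, not the agreed-upon library API.

```python
# Illustrative only: the record shape and signature are assumptions, not the
# agreed-upon SDG library API.
from dataclasses import dataclass


@dataclass
class Record:
    context: str    # optional grounding chunk; empty for skills
    question: str
    answer: str


def generate(seeds: list[Record], samples_per_seed: int = 3) -> list[Record]:
    # "Dataset in, dataset out": the library does not read a taxonomy, mix in
    # pre-computed data, or split train/test; it only turns seed records into
    # more records.
    synthetic: list[Record] = []
    for seed in seeds:
        for i in range(samples_per_seed):
            # A real implementation would prompt a teacher model here; this stub
            # only shows the shape of the contract.
            synthetic.append(
                Record(
                    context=seed.context,
                    question=f"(synthetic variant {i}) {seed.question}",
                    answer=seed.answer,
                )
            )
    return synthetic
```

Under this contract, preprocessing is responsible for producing the input records and postprocessing for everything that happens to the output records, which keeps the SDG repository's mission narrow.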
From 7c336b1b90084d1dfd0b63e46c0796bb8febe98c Mon Sep 17 00:00:00 2001 From: Bill Murdock Date: Thu, 21 Nov 2024 13:14:07 -0500 Subject: [PATCH 07/11] Update .spellcheck-en-custom.txt Signed-off-by: Bill Murdock --- .spellcheck-en-custom.txt | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/.spellcheck-en-custom.txt b/.spellcheck-en-custom.txt index 5b57c755..4a0cec11 100644 --- a/.spellcheck-en-custom.txt +++ b/.spellcheck-en-custom.txt @@ -40,6 +40,7 @@ Dependabot dev disambiguating ditaa +Docling docstring dr Dropdown @@ -62,6 +63,7 @@ gguf GGUFs ggufs GiB +github Gmail GPTDolomite gpu @@ -77,6 +79,7 @@ instantiation instructlab io ISA +init iters Jie JIT @@ -105,6 +108,8 @@ mixtral MLX mlx MMLU +modularize +modularized Nakamura num NVidia @@ -128,14 +133,17 @@ PNG POC Podman podman +postprocessing pre preprint +preprocessing PR's pyenv PyPI pyproject PyTorch qlora +qna quantized Quantizing Radeon @@ -189,6 +197,7 @@ triagers UI ui unquantized +unstaged USM UX venv From 850300f840c79c955e00dcad32f6b1009202f105 Mon Sep 17 00:00:00 2001 From: Bill Murdock Date: Thu, 21 Nov 2024 13:19:46 -0500 Subject: [PATCH 08/11] Fix spell error Signed-off-by: Bill Murdock --- docs/sdg/sdg-refactor.md | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/docs/sdg/sdg-refactor.md b/docs/sdg/sdg-refactor.md index 067bc384..b46ed719 100644 --- a/docs/sdg/sdg-refactor.md +++ b/docs/sdg/sdg-refactor.md @@ -138,8 +138,6 @@ If we choose Option 1 (leave preprocessing and postprocessing in SDG) then we st ## Decisions -Since this is a draft, no decisions are made yet. However, here are the current draft decisions: - - The SDG codebase will be refactored in order to modularize based on pre-processing, data generation, and post-processing steps. - We will support the following user flows: 1.1., 1.2., 1.3, 1.3.1, 1.3.2, 2.1, and 2.1 as documented in the Question 1 section above. - We will adopt the updates to the command-line interface that will be documented in Question 2 above. @@ -156,6 +154,6 @@ Since this is a draft, no decisions are made yet. However, here are the current Some of the consequences are covered earlier in the pros and cons for Option 3. Here is a brief recap of the most important of those: -- SDG preprocessing and postprocessing will join a wide variety of glue/data-format capablities in that repository, increasing consistency. +- SDG preprocessing and postprocessing will join a wide variety of glue/data-format capabilities in that repository, increasing consistency. - In the future changes to the kinds of content that SDG takes as inputs will require changes across both the SDG repository and the `instructlab/instructlab` repository. - There will be less pressure to have a clear and well documented separation between the library APIs and the command-line interface for these functions because both are located in the same repository. We will mitigate this consequence by being disciplined about the separation. From 4c3c4dac6410d936c8f4dc410be058cf0947e562 Mon Sep 17 00:00:00 2001 From: Bill Murdock Date: Wed, 11 Dec 2024 16:19:50 -0500 Subject: [PATCH 09/11] Use new name for core repo. Also, less specifics about CLI. 
Signed-off-by: Bill Murdock --- docs/sdg/sdg-refactor.md | 34 +++++++++++++++++----------------- 1 file changed, 17 insertions(+), 17 deletions(-) diff --git a/docs/sdg/sdg-refactor.md b/docs/sdg/sdg-refactor.md index b46ed719..bb22c960 100644 --- a/docs/sdg/sdg-refactor.md +++ b/docs/sdg/sdg-refactor.md @@ -11,11 +11,11 @@ The existing synthetic data generation (SDG) repository includes several related - Mix the outputs with some pre-computed data sets when applicable - Split the data into train and test -Of all of these, only the one emphasized (*Given the seed data ... generate ... tuples*) is core SDG functionality. The others are essentially preprocessing and postprocessing steps to enable the core SDG functionality and produce outputs usable for future steps. In the current flow, preprocessing has a taxonomy with some new seed data added to it as input. The output of preprocessing includes a set of context/question/answer tuples for both knowledge and skill taxonomy nodes. For knowledge taxonomy nodes it also includes a set of document chunks. SDG uses the context/question/answer tuples as seed examples, and it uses the document chunks (if there are any) as example contexts from which to generate additional data. That additional data is then sent to the postprocessing step to produce the final outputs. +Of all of these, only the one emphasized (*Given the seed data ... generate ... tuples*) is SDG functionality. The others are essentially preprocessing and postprocessing steps to enable the SDG functionality and produce outputs usable for future steps. In the current flow, preprocessing has a taxonomy with some new seed data added to it as input. The output of preprocessing includes a set of context/question/answer tuples for both knowledge and skill taxonomy nodes. For knowledge taxonomy nodes it also includes a set of document chunks. SDG uses the context/question/answer tuples as seed examples, and it uses the document chunks (if there are any) as example contexts from which to generate additional data. That additional data is then sent to the postprocessing step to produce the final outputs. -We have heard that some users want a stand-alone SDG capability that includes only the core SDG functionality. Specifically, they already have a set of seed context/question/answer tuples and optionally a set of document chunks. All they want from SDG is to take that input and produce an new synthetic data set as output without doing any mixing into pre-computed data or splitting into train and test. The preprocessing and postprocessing capabilities currently in SDG are not relevant to those users. +We have heard that some users want a stand-alone SDG capability that includes only the SDG functionality. Specifically, they already have a set of seed context/question/answer tuples and optionally a set of document chunks. All they want from SDG is to take that input and produce an new synthetic data set as output without doing any mixing into pre-computed data or splitting into train and test. The preprocessing and postprocessing capabilities currently in SDG are not relevant to those users. -Also as context, in the near future we are absorbing a set of updates to the core SDG functionality to make it more modularized and flexible. That might turn out to be irrelevant to this document which is focused on what to do with the non-core functionality (preprocessing and postprocessing). However, it is mentioned here in the context section in case that context winds up being useful. 
+Also as context, in the near future we are absorbing a set of updates to the SDG functionality to make it more modularized and flexible. That might turn out to be irrelevant to this document which is focused on what to do with the preprocessing and postprocessing. However, it is mentioned here in the context section in case that context winds up being useful. Furthermore, in 2025 we are hoping to have some sort of retrieval-augmented generation (RAG) capability that is either part of or tightly integrated with InstructLab. Such a capability would have significant overlap with the functionality of the preprocessing for SDG. As noted above, when a taxonomy has a knowledge qna.yaml file that references a document, SDG uses Docling to convert the file to JSON and then splits the file into chunks of appropriate size for SDG. The RAG capability would *also* want the same Docling JSON output but would need to split it into chunks that are sized appropriately for vector retrieval (i.e., that fit within the context window of the semantic encoding model). @@ -46,11 +46,11 @@ If I understand the latest guidance from our product management, the flows that ## Question 2: What should the commands be in the command-line interface? -One way to support both 1.3.1 and 1.3.2 would be to have separate commands for the preprocessing, core SDG, and postprocessing step . Alternatively, a single command that does all of these and also saves the outputs of preprocessing to disk would support 1.3.1 but *not* 1.3.2. Even if we only want to support 1.3.1, having separate commands for each step might be desirable because it is just more intuitive that if a user wants to save the outputs of preprocessing to disk to have a command to do that instead of having it be a "side effect" of an omnibus SDG command. Here is a rough outline of what the separate commands would be: +One way to support both 1.3.1 and 1.3.2 would be to have separate commands for the preprocessing, SDG, and postprocessing step . Alternatively, a single command that does all of these and also saves the outputs of preprocessing to disk would support 1.3.1 but *not* 1.3.2. Even if we only want to support 1.3.1, having separate commands for each step might be desirable because it is just more intuitive that if a user wants to save the outputs of preprocessing to disk to have a command to do that instead of having it be a "side effect" of an omnibus SDG command. Here is a rough outline of what the separate commands would be: -- `ilab data prep` would handle all the preprocessing (the first three bullets in the Context section above, plus any additional preprocessing we add in the future). -- `ilab data generate` would take as input some data in the same format that `ilab data prep` produces and would run the core synthetic data generation *only*. Note that this is a breaking change from the current behavior of `ilab data generate`, but that may be acceptable because the user base is still small. -- `ilab data process` would take as input some data in the same format that `ilab data generate` produces and would run the postprocessing (the last two bullets in the Context section above, plus any additional postprocessing we add in the future). +- There would be a command to handle all the preprocessing (the first three bullets in the Context section above, plus any additional preprocessing we add in the future). +- There would be a command that runs the synthetic data generation *only*. 
If this command replaces the existing `ilab data generate`, that would be a breaking change from the current behavior of `ilab data generate`, but that may be acceptable because the user base is still small. +- There would be one or more commands to handle all the postprocessing. This includes data mixing and arguably other postprocessing depending on exactly how one defines "data mixing". Detailed technical specifications for these commands are outside the scope of this document and should appear in a future document instead. @@ -64,13 +64,13 @@ Currently there is no documentation that I know of that explains how to do 2.1 o Pro: -- Future changes to the input format for preprocessing and/or the output format for postprocessing (e.g., adding more expressive power to the taxonomy format) require changes to the core SDG *and* the preprocessing/postprocessing. That's easier to do if they are in the same repository because they can be done in a single PR instead of multiple PRs that need to be coordinated. +- Future changes to the input format for preprocessing and/or the output format for postprocessing (e.g., adding more expressive power to the taxonomy format) require changes to SDG *and* the preprocessing/postprocessing. That's easier to do if they are in the same repository because they can be done in a single PR instead of multiple PRs that need to be coordinated. - It is simpler to leave things where they are. - If we're not totally sure which of the options we want, then it might make more sense to stick with this option for now since it avoids doing a work to move preprocessing and postprocessing *now* that could then be followed by more work to move preprocessing and postprocessing *again* after we decide where it goes. Con: -- The core logic of SDG is inherently complex and represents some of the most sophisticated and differentiating elements of InstructLab. For that reason, it would be nice to have it in its own repository by itself. New contributors to that core logic find it challenging enough to navigate the core functionality without having to also figure out where the core logic starts and the preprocessing and postprocessing capabilities end. This could be mitigated by having better technical documentation (README, comments) for the SDG library. +- The logic of SDG is inherently complex and represents some of the most sophisticated and differentiating elements of InstructLab. For that reason, it would be nice to have it in its own repository by itself. New contributors to that core logic find it challenging enough to navigate the core functionality without having to also figure out where the core logic starts and the preprocessing and postprocessing capabilities end. This could be mitigated by having better technical documentation (README, comments) for the SDG library. - To the extent that the plan is for SDG to be run independently, then there will be tooling built around the SDG repo. The more tooling built around just running SDG independently the more risk of breaking contracts for that tooling. The more functionality living in SDG that isn't SDG, the more surface area there is to break. - As noted in the Context section earlier, in the near future we are absorbing a set of updates to the core SDG functionality. Absorbing those updates is somewhat simpler if the core SDG logic is all alone in a repository of its own. - Keeping interconnected components in the same repository provides less pressure to consistently document the API contracts between them. 
We certainly *could* have well documented API contracts for preprocessing and postprocessing and core SDG functionality that makes it clear how they interact even when both of these exist in the same repository, but it is probably more likely that we *will* do so if they are separated. @@ -95,32 +95,32 @@ Cons: - Avoids all the pros of Option 1. - Having a separate repository with its own library brings in an enormous amount of overhead in maintaining that repository (e.g., CI/CD). -- Having a separate repository with its own library also brings in an enormous amount of overhead in maintaining the `instructlab/instructlab` repository's dependency on all of those libraries. +- Having a separate repository with its own library also brings in an enormous amount of overhead in maintaining the core (`instructlab/instructlab`) repository's dependency on all of those libraries. - Does not allow user flow 2.1 (because that flow includes installing *only* the SDG repository and requires preprocessing). That's OK because it is not a priority and anyway the users could approximate that flow by also installing the ingestion library. Conclusion: - The cost of having a separate repository is so high that we would only consider this option as a last resort. -### Option 3: Move preprocessing and postprocessing into the instructlab/instructlab repository +### Option 3: Move preprocessing and postprocessing into the core repository Pro: -- The `instructlab/instructlab` repository already has a lot of "supporting" (non-core) functionality. It contains most user facing logic aside from what we call the "core" parts of the workflow (SDG, Train, Eval). Since the preprocessing and postprocessing are non-code parts of SDG, this change would respect established precedent. Examples of existing functionality that follow this pattern include all of the following and more: +- The core (`instructlab/instructlab`) repository already has a lot of "supporting" functionality. It contains most user facing logic aside from what we call the central parts of the workflow that have their own libraries (SDG, Train, Eval). Since the preprocessing and postprocessing are non-central parts of SDG, this change would respect established precedent. Examples of existing functionality that follow this pattern include all of the following and more: - download - serve - chat - list - edit - init -- Supporting user flow 1.3.2 requires separate commands for preprocessing and core SDG. This is slightly simpler if preprocessing is implemented in the `instructlab/instructlab` repository. If preprocessing remains in the SDG library instead then the code in the `instructlab/instructlab` repository would need to make separate calls to the SDG library for preprocessing and core SDG to support user flow 1.3.2. That adds a little complexity. +- Supporting user flow 1.3.2 requires separate commands for preprocessing and core SDG. This is slightly simpler if preprocessing is implemented in the core repository. If preprocessing remains in the SDG library instead then the code in the core repository would need to make separate calls to the SDG library for preprocessing and core SDG to support user flow 1.3.2. That adds a little complexity. - Avoids some of the cons of Option 1, but see below for some overlap. - Avoids some of the cons of Option 2, but see below for some overlap. Con: - Avoids the pros of both Option 1 and Option 2. -- As with Option 1, this approach involves a lot of coordination. 
There are a lot of stakeholders involved in the `instructlab/instructlab` repository and locating preprocessing and postprocessing there drags those stakeholders into issues relating to preprocessing and postprocessing. However, as with Option 1, coordinating across stakeholders is something an open source project needs to do well anyway to remain engaged with the community. +- As with Option 1, this approach involves a lot of coordination. There are a lot of stakeholders involved in the core repository and locating preprocessing and postprocessing there drags those stakeholders into issues relating to preprocessing and postprocessing. However, as with Option 1, coordinating across stakeholders is something an open source project needs to do well anyway to remain engaged with the community. - As with Option 1, this approach suffers from the fact that keeping interconnected components in the same repository provides less pressure to consistently document the API contracts between them. In the case of Option 1, the interconnected components that would not have as much pressure to be documented would be preprocessing/postprocessing and core SDG. In the case of Option 3, the interconnected components that would not have as much pressure to be documented would be the command-line interface code and preprocessing/postprocessing. However, in both cases, this con could be alleviated by just having the discipline to document the APIs well even without such pressure. - As with Option 2, this approach would not enable user flow 2.1. That's fine since it is not on our requirements list. @@ -130,7 +130,7 @@ Conclusion: ### Option 4: Preprocessing and postprocessing go to different locations -We could also mix and match any of the above options separately for preprocessing and postprocessing. For example, preprocessing could move to the `instructlab/instructlab` repository and postprocessing could stay in the SDG repo. Or preprocessing could move to a new repository and postprocessing could move to a different new repository or the same new repository. Enumerating all possible permutations of where each could go and enumerating pros and cons of each of them would make this document unbearably long. If anyone wants to advocate for a small number of specific permutations, we will add them to this document. +We could also mix and match any of the above options separately for preprocessing and postprocessing. For example, preprocessing could move to the core repository and postprocessing could stay in the SDG repo. Or preprocessing could move to a new repository and postprocessing could move to a different new repository or the same new repository. Enumerating all possible permutations of where each could go and enumerating pros and cons of each of them would make this document unbearably long. If anyone wants to advocate for a small number of specific permutations, we will add them to this document. ## Question 4: Should preprocessing, postprocessing, and core SDG be separate Python packages? @@ -141,8 +141,8 @@ If we choose Option 1 (leave preprocessing and postprocessing in SDG) then we st - The SDG codebase will be refactored in order to modularize based on pre-processing, data generation, and post-processing steps. - We will support the following user flows: 1.1., 1.2., 1.3, 1.3.1, 1.3.2, 2.1, and 2.1 as documented in the Question 1 section above. - We will adopt the updates to the command-line interface that will be documented in Question 2 above. 
-- Pre-processing logic for SDG will be moved into the `instructlab/instructlab` repository as discussed in Option 3 above. -- Post-processing logic for SDG will be moved into the `instructlab/instructlab` repository as discussed in Option 3 above. +- Pre-processing logic for SDG will be moved into the core repository as discussed in Option 3 above. +- Post-processing logic for SDG will be moved into the core repository as discussed in Option 3 above. - The SDG codebase will be designed around the principle of "dataset in, dataset out". - We will not separate preprocessing, postprocessing, and SDG into separate packages. From a268c7b432db6bd530d4cf1d58602adfaa56d3b7 Mon Sep 17 00:00:00 2001 From: Bill Murdock Date: Wed, 11 Dec 2024 16:27:57 -0500 Subject: [PATCH 10/11] More changes in response to review comments Signed-off-by: Bill Murdock --- docs/sdg/sdg-refactor.md | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/docs/sdg/sdg-refactor.md b/docs/sdg/sdg-refactor.md index bb22c960..5a45cea1 100644 --- a/docs/sdg/sdg-refactor.md +++ b/docs/sdg/sdg-refactor.md @@ -1,5 +1,9 @@ # Refactor preprocessing and postprocessing in SDG +## Goals + +We want to modularize the parts of the codebase that deal with the data augmentation phase of the end to end workflow. In order to modularize it effectively, we need to identify and distinguish pre-processing, data generation, and post-processing. Each of these elements need to be located somewhere. This document discusses pros and cons of different options and proposes specific conclusions. + ## Context The existing synthetic data generation (SDG) repository includes several related pieces of functionality: @@ -126,7 +130,7 @@ Con: Conclusion: -- This seems like a reasonable option. The cons are mostly manageable. However, overall the pros of Option 1 seem more compelling. +- This seems like a reasonable option. The cons are mostly manageable. ### Option 4: Preprocessing and postprocessing go to different locations From b1e1a8393898e81b3d62af87a2b949a986d1542b Mon Sep 17 00:00:00 2001 From: Bill Murdock Date: Fri, 13 Dec 2024 12:22:21 -0500 Subject: [PATCH 11/11] Update sdg-refactor.md Signed-off-by: Bill Murdock --- docs/sdg/sdg-refactor.md | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/docs/sdg/sdg-refactor.md b/docs/sdg/sdg-refactor.md index 5a45cea1..4c0e9c22 100644 --- a/docs/sdg/sdg-refactor.md +++ b/docs/sdg/sdg-refactor.md @@ -2,7 +2,13 @@ ## Goals -We want to modularize the parts of the codebase that deal with the data augmentation phase of the end to end workflow. In order to modularize it effectively, we need to identify and distinguish pre-processing, data generation, and post-processing. Each of these elements need to be located somewhere. This document discusses pros and cons of different options and proposes specific conclusions. +We want to modularize the parts of the codebase that deal with the data augmentation phase of the end to end workflow. In order to modularize it effectively, we need to identify and distinguish pre-processing, data generation, and post-processing. Each of these elements need to be located somewhere. This document discusses pros and cons of different options and proposes specific conclusions. Specifically, it concludes: + +- The synthetic data generation will remain in the SDG repository. +- The preprocessing that is used for synthetic data generation (e.g., document conversion) will move to the core repository. 
+- The postprocessing that is used for synthetic data generation (e.g., data mixing) will move to the core repository. + +Keeping *only* synthetic data generation in the SDG repository ensures that this component has a clear, well-defined mission. Furthermore, moving preprocessing and postprocessing to the core repository will make it easier for those capabilities to be used by other components in the future. For example, some of the same preprocessing that is done for SDG (e.g., document conversion) is also useful for indexing content for RAG. ## Context
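To illustrate the reuse argument at the end of the Goals section above (one document conversion shared between SDG and RAG), here is a toy sketch of chunking the same converted text twice with different size budgets. The helper and the size numbers are made up for illustration; this is not Docling's API or the planned chunking implementation.

```python
# Toy illustration only: not Docling's API and not the planned implementation.
def chunk_by_words(text: str, max_words: int) -> list[str]:
    # Naive fixed-size chunking; real chunking would follow the structure in the
    # converted document (sections, tables, captions, and so on).
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


converted_text = "..."  # stand-in for text pulled from the shared Docling JSON output
sdg_chunks = chunk_by_words(converted_text, max_words=1000)  # size chosen arbitrarily for SDG contexts
rag_chunks = chunk_by_words(converted_text, max_words=200)   # size chosen arbitrarily for an embedding window
```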