introduce document for instructlab-sdk #184
Force-pushed from 95d2b47 to 7bcad36
This is a good document overall and a step in the right direction.
Just one question on my end :) And it may be too far out of scope; feel free to ignore it!
This seems like a good idea. A lot of work though, of course.
Overall looking good to me, a few comments
Is this really an SDK? Is this not just systematization of interfaces in Python modules, that is, Python APIs? For example, there is no "SDK" artifact to download.
This is actually about cleanly modularizing the codebase, isn't it?
The goal is to give folks who are currently haphazardly consuming different parts of InstructLab a centralized "core" of code that is versioned and static. I don't feel strongly about what we call it, but I believe it is pretty close to an SDK. If there is some sort of standardization we need to fall into, we can. Other projects like llama stack do something similar, though they package it separately. We could scope packaging it separately, but having it live within the
```
instructlab.core.system.info
instructlab.core.rag.ingest
instructlab.core.rag.convert
```
The primary intent of the Python SDK should be to serve the data scientists and AI engineers doing experiments; secondarily, it should provide the constructs required for embedding or extending InstructLab from other products and platforms.
For the data science experience, I propose taking inspiration from the high-level structure that sklearn, Keras, or PyTorch provide for this persona.
For example, what would it take to have an experience similar to:

```python
from instructlab import InstructLab, SDG, Train, Eval

ilab = InstructLab(taxonomy=<path_to_taxonomy>)  # on initialization, the taxonomy is diffed & loaded

# like in Pandas DataFrame.describe()
ilab.describe()  # prints a summary of the taxonomy diff and attributes

ilab_df = ilab.data.ingest()  # ingest the data from the taxonomy and return a dataset as a Pandas DataFrame

oa_client = {
    "url": "https://api.openactive.io/v1",
    "api_key": "YOUR_API_KEY"
}

##
# SDG interactions
##
ilab_sdg = SDG(
    client=oa_client,
    data=ilab_df,
    teacher_model={<model_and_attributes>},
)  # ilab SDG class

ilab_sdg.load_pipeline(<path_to_pipeline>)  # load a pipeline from a file
for block in ilab_sdg.pipeline.get_blocks():
    # do block customization
    ilab_sdg.pipeline[block].something()
    ilab_sdg.pipeline[block].block.foo = "bar"

ilab_sdg.run(
    callback=<callback_function>,  # called after executing each block (e.g. to report progress, or save intermediate results)
)  # generate the SDG data

##
# Training interactions
##
ilab_train = Train(
    dataset=ilab_sdg.data,
    student_model={<model_and_attributes>},
)  # ilab Train class

ilab_train.run(
    callback=<callback_function>,  # called after each cycle/iteration/epoch (e.g. to report progress, save intermediate results, or stop the training if stopping criteria are met)
)  # execute the training loop

ilab_train.model_save(<path_to_save>, resolution={}, model_format={gguf|safe_tensors|onnx|etc})  # save the model to a file in the specified format

##
# Evaluation
##
ilab_eval = Eval(
    dataset=[<path_to_eval_dataset>],
    endpoints=[<path_to_endpoints_for_eval>],
)  # ilab Eval class

ilab_eval.evals(
    [list_of_evaluations],  # a list of evaluations to run (e.g. accuracy, precision, dk-bench, etc.)
)
ilab_eval.run(
    callback=<callback_function>,  # called after each evaluation is run (e.g. to report progress, or save intermediate results)
)
ilab_eval.summary()  # print a summary of the evaluation results
ilab_eval.save(<path_to_save>, format={jsonl|parquet|csv|xls})  # save the evaluation results to a file
ilab_eval.export_report(<path_to_save>, format={html|pdf})  # report with the evaluation results
```
I would propose to focus on accelerating the goal of the persona (the data scientists), and not on mapping or exposing the InstructLab internal architecture.
I think this is possibly a good end goal, but would require significant re-architecture of InstructLab, and the libraries (what they expose).
For the purposes of this dev-doc, I can incorporate some of this into what I am proposing, but for an alpha SDK, keeping the interactions as simple yet expandable as possible is my goal.
So we should aim to not require library adjustments/changes at first, and then we can add functionality once the structure is in place.
Let me incorporate some of this and I will update the PR
I think @williamcaban has a point here about the persona we're building this SDK for. Who are we making this SDK for? Data scientists? People trying to build REST API-based InstructLab services? The design of the SDK might vary based on who its target user is.
It's definitely important to understand the persona and goal of the SDK. William is proposing a fairly large re-architecting of the APIs, while this dev doc seems mostly focused on exposing existing CLI flows and functionality via a Python SDK. Is our goal to expose the existing end-to-end flow via a Python SDK? Or to provide a new way to interact with various InstructLab components that's more granular and flexible in how you compose your own end-to-end workflow from the pieces we're exposing?
I am making some updates to the PR (some are already up), aiming to meld the two approaches a bit, lmk what you think
I think the updates look like a reasonable enough place for work to start.
The actual Python code and list of APIs is only illustrative, right? From reading this, my understanding is that the actual parameters and methods on individual classes shown here are just examples, since you call out that future work is to design the actual SDK based on the structure above and negotiate user contracts with library maintainers. For example, if SDG points out that the `Data.Generator` constructor needs different params, or that we have to actually use the taxonomy diff results as input to data generation as opposed to just checking whether there is a diff, this kind of thing can be ironed out later?
One additional point I want to raise: what we definitely don't want is to expose pure library functionality through a class in the Core repo - users that want that should just import the libraries directly.
As Charlie notes above, what we want here is an SDK for the opinionated `ilab` workflow that this package is centered around - for example, our Data class wouldn't expose every public function in the SDG library, but it would expose the functions we have within `src/instructlab/data`, such as `generate_data` and `list_data`.
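To make the facade idea concrete, here is a minimal, self-contained sketch. All names here (`Data`, `generate_data`, `list_data`, `_library_generate`) are illustrative stand-ins, not real InstructLab APIs: the point is only that a core class re-exposes a handful of workflow-level operations while the library internals stay hidden behind it.

```python
def _library_generate(taxonomy_path: str) -> list[str]:
    # Stand-in for the lower-level SDG library machinery that the
    # facade deliberately does NOT re-export.
    return [f"{taxonomy_path}/generated.jsonl"]

class Data:
    """Facade exposing only the opinionated `ilab data`-style operations."""

    def __init__(self, taxonomy_path: str):
        self.taxonomy_path = taxonomy_path
        self._datasets: list[str] = []

    def generate_data(self) -> list[str]:
        # Delegates to the library, but callers only ever see this method.
        self._datasets = _library_generate(self.taxonomy_path)
        return self._datasets

    def list_data(self) -> list[str]:
        return list(self._datasets)

client = Data(taxonomy_path="taxonomy")
paths = client.generate_data()
```

Users who need the full breadth of the SDG library would import it directly, as the comment above suggests; the facade stays small on purpose.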
@bbrowning yep, the code is pretty much illustrative. The structure of classes is what I am focusing on the most. The actual arguments, the functions within the `Data` class, etc. can be ironed out later on. If we were to figure that out now, it would take forever!
This is a great idea. We should incorporate real experience from current consumers into the design of the Python public API.
docs/sdk/instructlab-sdk.md
```python
data_paths = instructlab.core.data.generate(...)
some_custom_handling(data_paths)
instructlab.core.model.train(data_path=..., strategy=lab_multiphase)
```
Not an issue with the proposal itself since the snippet is only illustrative, but I'd consider whether the ilab SDK should promote using file paths for any artifacts. I'd rather have ilab encapsulate the knowledge about storage options (for models and datasets) and then allow SDK users to refer to artifacts by their names. (This would be in line with the spirit of instructlab/instructlab#1832 and some other issues under instructlab/instructlab#1871.)
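As a rough illustration of the name-based idea (purely hypothetical: `ArtifactStore`, the method names, and the paths below are all invented for this sketch, not real InstructLab APIs), an SDK could resolve logical artifact names to storage locations internally, so callers never pass raw paths:

```python
class ArtifactStore:
    """Hypothetical registry mapping logical artifact names to storage locations."""

    def __init__(self):
        self._index: dict[str, str] = {}

    def register(self, name: str, location: str) -> None:
        # In a real implementation this might come from config or discovery,
        # not manual registration.
        self._index[name] = location

    def resolve(self, name: str) -> str:
        return self._index[name]

store = ArtifactStore()
store.register("granite-7b-lab", "/var/lib/instructlab/models/granite-7b-lab")

# SDK callers refer to the model by name; the store owns path knowledge.
model_location = store.resolve("granite-7b-lab")
```

The storage backend (local disk, OCI registry, object store) then becomes an implementation detail behind `resolve`.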
Force-pushed from ae451c3 to 5cc519f
@cdoern I want to revisit and re-emphasize the point I raised earlier. I really do not see how this is an SDK. This is about Python API design and modularization. The act of logically grouping operations, giving them good names and understandable function signatures - of defining the "nouns" and "verbs" in the application - is an act of... wait for it... domain modeling! The work being proposed here is fundamentally a modeling and modularization effort; it is not about creating an SDK.
I really like the design changes represented in this document.
One consideration I'd raise as we proceed is that we may want to avoid the object-oriented pattern that the HF Trainer object is representative of, and prefer a functional programming convention.
The ravioli-code "Trainer" / "Generator" classes are a problematic design pattern elsewhere in this problem space. `instructlab/training` has avoided duplicating this convention because it's harder to reason about and encourages large procedural instance methods.
Instead, it's more readable, hackable, and composable to have an SDK expose functional building blocks, with attention to the interfaces informing the call.
For illustration, why ever use:

```python
trainer = Trainer(
    data=data,
    model=model,
    epochs=5,
    lr=1e-4,
    ...
)
trainer.train()
```

when one could just run:

```python
train(
    data=data,
    model=model,
    epochs=5,
    lr=1e-4,
    ...
)
```
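A runnable toy version of this contrast (everything here is illustrative, not InstructLab code) shows why the class form adds little: it reduces to a thin wrapper over the function, holding nothing but the stored arguments.

```python
def train(data: list[float], epochs: int, lr: float) -> float:
    """Functional building block: all inputs explicit, no hidden state."""
    weight = 0.0
    for _ in range(epochs):
        for x in data:
            weight += lr * x  # stand-in for a gradient step
    return weight

class Trainer:
    """Class form of the same thing: just a box around the arguments."""

    def __init__(self, data: list[float], epochs: int, lr: float):
        self.data, self.epochs, self.lr = data, epochs, lr

    def train(self) -> float:
        return train(self.data, self.epochs, self.lr)

functional = train([1.0, 2.0], epochs=3, lr=0.1)
object_oriented = Trainer([1.0, 2.0], epochs=3, lr=0.1).train()
# Both produce the identical result; only the functional form is
# trivially composable and testable without constructing an object.
```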
Agreed with @JamesKunstle on NOT adopting object-oriented primitives to describe process flow (the ilab methodology being a pipeline of processes with - hopefully pure - inputs and outputs). (As an anecdote: we have a Java-style RAG implementation in tree, with factories and all, which is very hard to follow and unravel due to all the layers of indirection, and I don't think this is a good pattern to continue in this code base, especially for an "SDK".)

@anastasds I wonder what makes something an SDK (SDK being Software Development Kit) if not a "pack of compute primitives meant to be consumed externally", which this proposal is(?). (Obviously, proper modeling is part of any good API and SDK design; but this seems wider.) AFAIU a primary problem being solved here is that "current consumers of InstructLab are finding themselves importing different library private and public APIs", which is being solved by giving these consumers something that is meant to be consumed. This is an SDK.
@booxter in that general of a sense, every source codebase is an SDK.
Force-pushed from 9c37dd2 to dad44a8
@booxter @JamesKunstle given the consumption patterns of those who are going to use this, while also trying to the best of our ability to align with InstructLab's core values, here is what I propose: rather than a "spaghetti" structure like:

```python
from instructlab.core import Data

sdg_client = Data.Generator(<args>)
sdg_client.generate()
```

I propose instead to at least do:

```python
from instructlab.core import Data, Config

config_object = Config.init(auto_detect=True)
data_client = Data(taxonomy_path="", cfg=config_object)
data_client.generate_data()
```

I think this structure at a minimum is what we need for the following reasons:

In general, usability, searchability (is that a word?), and our inheritance structure would suffer from not having at least top-level command classes. @nathan-weinberg do you agree with the above?
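A minimal runnable sketch of the proposed shape, with hypothetical stand-ins for `Config.init`, `Data`, and `generate_data` (these are not the real `instructlab.core` APIs; the behavior below is invented purely to make the shape executable):

```python
from dataclasses import dataclass, field

@dataclass
class Config:
    settings: dict = field(default_factory=dict)

    @classmethod
    def init(cls, auto_detect: bool = False) -> "Config":
        # In the real SDK this might probe hardware or read a config file;
        # here we just record the flag to keep the sketch self-contained.
        return cls(settings={"auto_detect": auto_detect})

@dataclass
class Data:
    taxonomy_path: str
    cfg: Config

    def generate_data(self) -> str:
        # Placeholder behavior: fall back to a default when no path is given.
        return f"generated from {self.taxonomy_path or '<default taxonomy>'}"

config_object = Config.init(auto_detect=True)
data_client = Data(taxonomy_path="", cfg=config_object)
result = data_client.generate_data()
```

The design point is that the `Config` object is constructed once and threaded through each command-group class, mirroring how the CLI resolves its configuration before dispatching commands.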
Indeed - I think from a usability perspective it makes sense that the SDK classes would align with the CLI command groups. Ultimately we're exposing the same (if not more) functionality through a different interface, but keeping the naming conventions of our original interface (the CLI) is a good thing IMO. I also echo what @JamesKunstle says above about having sub-classes like
+1. Keeping the existing click command group hierarchy reflected in the "SDK" while avoiding popping up classes is a good balance. Represent entities with objects and processes with methods / functions (not with a "factory" to execute processes or to produce objects).

On the question of whether something is an SDK or not: the main point is the intent. If the intent is for external consumers to take it and integrate with it, then it's an SDK. Whether it's shipped in the same artifact is an implementation detail. That said, as long as the name for this "thing" is not just "API" (which will be ambiguous with the upcoming REST API) but e.g. "Python API", then I am ok with renaming too.
This is correctly directed. We can flesh out details (verbiage etc.) during the development phase. What I'd encourage is to always refer to this as the Python API/SDK going forward.
Signed-off-by: Charlie Doern <cdoern@redhat.com>
This seems like a fine start to me and I am fine for this to merge as is and drive the initial alpha API. However, I think there is a big gap here because the proposed API doesn't include something that lets you send in a query and get back a response (both with and without RAG). This would fill the role of `ilab model chat`, but in an API-user-friendly way. Hopefully that can be added in a fast follow.
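To sketch what filling that gap might look like, here is a hedged, hypothetical shape that mimics the common chat-completion request/response convention. `rag_retrieve` and `chat_completion` are stubs invented for this illustration; neither is an InstructLab API, and a real implementation would call a served model.

```python
def rag_retrieve(query: str) -> list[str]:
    # Stub standing in for a retrieval step over an ingested corpus.
    return [f"context for: {query}"]

def chat_completion(messages: list[dict], use_rag: bool = False) -> dict:
    """Toy chat call: optionally augments the query with retrieved context."""
    query = messages[-1]["content"]
    context = rag_retrieve(query) if use_rag else []
    reply = f"answer({query})" + (f" [with {len(context)} docs]" if context else "")
    # Response shaped like the widely used chat-completion format.
    return {"choices": [{"message": {"role": "assistant", "content": reply}}]}

resp = chat_completion([{"role": "user", "content": "what is SDG?"}], use_rag=True)
```

The same entry point could serve both the plain-chat and RAG-chat cases via the `use_rag` toggle, which is one way the `ilab model chat` role could surface in the SDK.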
Here is the further development needed:
- As @jwm4 pointed out, there are no calls for chat / RAG chat
- As @booxter pointed out, we should refer to it as the "Python API/SDK"
Question:
If I was working on an API/SDK for, say, SDG: instead of something like `instructlab.core.Data.generate_data`, where the namespace is `instructlab.core.*`, would the namespace be `instructlab.data.*`, e.g. `instructlab.data.generate_data`?
I am going to approve this because it seems we've gotten a lot of reviewers, thoughtful comments, edits and updates and the basics seem like what is needed for a good start. I would like to see the two points above addressed in a subsequent draft.
thanks @mairin! a few notes on the feedback:

as for the question: I think, if I am understanding correctly, if users want to use the SDG library Python API directly, they would instead use

Do you want me to open the draft before merging this, or can we do that as a follow up?
Nah, I'll just merge it now; a follow up is fine.
FWIW, we're planning to also wrap the RAG pipeline in a typical chat completion API (for a variety of reasons, e.g., ease of integration with the existing eval code), so this might also fill that need. That might become part of this SDK, but it seems debatable since it uses a pre-existing API. Something to consider for round 2 of this work.
This document outlines the structure of the InstructLab SDK, its purpose, and its scope. Implementation specifics will be dealt with in subsequent implementation PRs.