
introduce document for instructlab-sdk #184

Merged: 1 commit into instructlab:main on Feb 1, 2025

Conversation

@cdoern (Contributor) commented Jan 24, 2025

This document outlines the structure, purpose, and scope of the instructlab-sdk. Implementation specifics will be dealt with in subsequent implementation PRs.

@cdoern force-pushed the sdk branch 3 times, most recently from 95d2b47 to 7bcad36 (January 24, 2025 20:58)
@RobotSail (Member) left a comment
This is a good document overall and a step in the right direction.

@nimbinatus (Member) left a comment

Just one question on my end :) And it may be too far out of scope; feel free to ignore it!

@jwm4 (Contributor) left a comment

This seems like a good idea. A lot of work though, of course.

@nathan-weinberg (Member) left a comment

Overall looking good to me, a few comments

@anastasds (Contributor) left a comment

Is this really an SDK? Is this not just systematization of interfaces in Python modules, that is, Python APIs? For example, there is no "SDK" artifact to download.

This is actually about cleanly modularizing the codebase, isn't it?

@cdoern (Contributor, Author) commented Jan 27, 2025

@anastasds:

> Is this really an SDK? Is this not just systematization of interfaces in Python modules, that is, Python APIs? For example, there is no "SDK" artifact to download.
>
> This is actually about cleanly modularizing the codebase, isn't it?

The goal of this is to give folks who are currently haphazardly consuming different parts of InstructLab a centralized "core" of code that is versioned and static. I don't feel strongly about what we call it, but I believe it is pretty close to an SDK. If there is some sort of standardization we need to fall into, we can, but other projects like llama stack do something similar, though they package it separately.

We could scope it so that it is packaged separately but still lives within the instructlab repo.

@williamcaban left a comment on this section of the proposed document:

```
instructlab.core.system.info
instructlab.core.rag.ingest
instructlab.core.rag.convert
```

The intent of the Python SDK should be to serve data scientists and AI engineers doing experiments as the primary use case, and then the constructs required for embedding or extending InstructLab from other products and platforms.

For the data science experience, I propose taking inspiration from the high-level structure that sklearn, keras, or pytorch provide for this persona.

For example, what would it take to have an experience similar to:

```
from instructlab import InstructLab, SDG, Train, Eval

ilab = InstructLab(taxonomy=<path_to_taxonomy>)  # on initialization, the taxonomy is diffed & loaded

# like in Pandas DataFrame.describe()
ilab.describe()  # prints a summary of the taxonomy diff and attributes
ilab_df = ilab.data.ingest()  # ingest the data from the taxonomy and return a dataset as a Pandas DataFrame

oa_client = {
    "url": "https://api.openactive.io/v1",
    "api_key": "YOUR_API_KEY"
}

##
# SDG interactions
##
ilab_sdg = SDG(client=oa_client,
               data=ilab_df,
               teacher_model={<model_and_attributes>},
               )  # ilab SDG class

ilab_sdg.load_pipeline(<path_to_pipeline>)  # load a pipeline from a file

for block in ilab_sdg.pipeline.get_blocks():
    # do block customization
    ilab_sdg.pipeline[block].something()
    ilab_sdg.pipeline[block].block.foo = "bar"

ilab_sdg.run(
    callback=<callback_function>,  # a function to call after executing each block (e.g. to report progress, or save intermediate results)
)  # generate the SDG data

##
# Training interactions
##
ilab_train = Train(dataset=ilab_sdg.data,
                   student_model={<model_and_attributes>},
                   )  # ilab Train class

ilab_train.run(
    callback=<callback_function>,  # a function to call after each cycle/iteration/epoch loop is run (e.g. to report progress, save intermediate results, or stop the training if certain stopping criteria are met)
)  # execute the training loop

ilab_train.model_save(<path_to_save>, resolution={}, model_format={gguf|safe_tensors|onnx|etc})  # save the model to a file in the specified format

##
# Evaluation
##
ilab_eval = Eval(dataset=[<path_to_eval_dataset>],
                 endpoints=[<path_to_endpoints_for_eval>],
                 )  # ilab Eval class

ilab_eval.evals(
    [list_of_evaluations],  # a list of evaluations to run (e.g. accuracy, precision, dk-bench, etc.)
)

ilab_eval.run(
    callback=<callback_function>,  # a function to call after each evaluation is run (e.g. to report progress, or save intermediate results)
)

ilab_eval.summary()  # print a summary of the evaluation results

ilab_eval.save(<path_to_save>, format={jsonl|parquet|csv|xls})  # save the evaluation results to a file

ilab_eval.export_report(<path_to_save>, format={html|pdf})  # report with the evaluation results
```

I would propose focusing on accelerating the goals of the persona (the data scientists), and not on mapping or exposing InstructLab's internal architecture.

@cdoern (Contributor, Author) replied

I think this is possibly a good end goal, but it would require significant re-architecture of InstructLab and of the libraries (what they expose).

For the purposes of this dev doc, I can incorporate some of this into what I am proposing, but for an alpha SDK my goal is to keep the interactions as simple yet expandable as possible.

So we should aim not to require library adjustments/changes at first, and then we can add functionality once the structure is in place.

Let me incorporate some of this and I will update the PR.

A Contributor left a comment

I think @williamcaban has a point here about the persona we're building this SDK for. Who are we making this SDK for? Data scientists? People trying to build REST-API-based InstructLab services? The design of the SDK might vary based on who its target user is.

A reviewer left a comment

It's definitely important to understand the persona and goal of the SDK. William is proposing a fairly large re-architecting of the APIs, while this dev doc seems mostly focused on exposing existing CLI flows and functionality via a Python SDK. Is our goal to expose the existing end-to-end flow via a Python SDK? Or to provide a new way to interact with various InstructLab components that's more granular and flexible in how you compose your own end-to-end workflow from the pieces we're exposing?

@cdoern (Contributor, Author) replied

I am making some updates to the PR (some are already up), aiming to meld the two approaches a bit, lmk what you think

@bbrowning left a comment

I think the updates look like a reasonable enough place for work to start.

The actual Python code and list of APIs are only illustrative, right? From reading this, my understanding is that the parameters and methods on the individual classes shown here are just an example, since you call out that future work is to design the actual SDK based on the structure above and to negotiate user contracts with library maintainers. For example, if SDG points out that the `Data.Generator` constructor needs different params, or that we actually have to use the taxonomy diff results as input to data generation rather than just checking whether there is a diff, that kind of thing can be ironed out later?

A Member left a comment

One additional point I want to raise: what we definitely don't want is to expose pure library functionality through a class in the Core repo; users that want that should just import the libraries directly.

As Charlie notes above, what we want here is an SDK for the opinionated ilab workflow that this package is centered around. For example, our `Data` class wouldn't expose every public function in the SDG library, but it would expose functions we have within `src/instructlab/data` such as `generate_data` and `list_data`.
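
As a rough sketch of that boundary (the class shape, constructor arguments, and method bodies below are assumptions for illustration, not an agreed API):

```python
# Hypothetical sketch only: names follow the comment above, not a published API.
from pathlib import Path


class Data:
    """Workflow-level wrapper around the opinionated ilab data flow."""

    def __init__(self, taxonomy_path: str, output_dir: str = "datasets") -> None:
        self.taxonomy_path = Path(taxonomy_path)
        self.output_dir = Path(output_dir)

    def generate_data(self) -> None:
        # Would drive the same code path as generate_data in src/instructlab/data,
        # which in turn orchestrates the SDG library; stubbed here.
        raise NotImplementedError

    def list_data(self) -> list[Path]:
        # Would mirror list_data in src/instructlab/data.
        return sorted(self.output_dir.glob("*.jsonl"))


# Pipelines, blocks, and other SDG internals are deliberately not re-exported;
# consumers who need them import the instructlab-sdg library directly.
```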

@cdoern (Contributor, Author) replied

@bbrowning yep, the code is pretty much illustrative. The structure of the classes is what I am focusing on the most. The actual arguments, functions within the `Data` class, etc. can be ironed out later on. If we were to figure all of that out now, it would take forever!

@booxter left a comment

This is a great idea. We should incorporate real experience from current consumers into the design of the Python public API.

```
data_paths = instructlab.core.data.generate(...)
some_custom_handling(data_paths)
instructlab.core.model.train(data_path=..., strategy=lab_multiphase)
```

Not an issue with the proposal itself since the snippet is only illustrative, but I'd consider whether ilab SDK should promote using file paths for any artifacts. I'd rather have ilab encapsulate the knowledge about storage options (for models and datasets) and then allow SDK users to refer to the artifacts by their names. (This would be in line with the spirit of instructlab/instructlab#1832 and some other issues under instructlab/instructlab#1871)
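
A minimal sketch of what name-based resolution could look like, assuming a hypothetical registry class and an indicative storage layout (neither is part of this proposal):

```python
# Hypothetical sketch: the SDK owns the storage convention, so callers refer to
# artifacts by name instead of building file paths themselves.
from pathlib import Path


class DatasetRegistry:
    """Maps user-facing dataset names to ilab-managed storage locations."""

    def __init__(self, root: str = "~/.local/share/instructlab/datasets") -> None:
        # The root shown here is only indicative of an XDG-style ilab data dir.
        self.root = Path(root).expanduser()

    def resolve(self, name: str) -> Path:
        return self.root / f"{name}.jsonl"


registry = DatasetRegistry()
dataset_path = registry.resolve("my_generated_dataset")  # caller passes a name...
# ...rather than hard-coding whatever path happens to back that dataset today.
```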

@cdoern force-pushed the sdk branch 2 times, most recently from ae451c3 to 5cc519f (January 29, 2025 16:16)
@anastasds (Contributor) commented Jan 29, 2025

@cdoern I want to revisit and re-emphasize the point I raised earlier.

I really do not see how this is an SDK. This is about Python API design and modularization.

The act of logically grouping operations, giving them good names and understandable function signatures - of defining the "nouns" and "verbs" in the application - is an act of... wait for it... domain modeling!

The work being proposed here is fundamentally a modeling and modularization effort - it is not about creating an SDK.
(And I fully support it!)


@JamesKunstle left a comment

I really like the design changes represented in this document.

One consideration I'd raise as we proceed is that we may want to avoid the object-oriented pattern that the HF Trainer object is representative of, and instead prefer a functional programming convention.

The ravioli-code "Trainer" / "Generator" classes are a problematic design pattern elsewhere in this problem space. instructlab/training has avoided duplicating this convention because it's harder to reason about and encourages large procedural instance methods.

Instead, it's more readable, hackable, and composable to have an SDK expose functional building blocks, with attention to the interfaces informing the call.

For illustration, why ever use:

```
trainer = Trainer(
    data=data,
    model=model,
    epochs=5,
    lr=1e-4,
    ...
)

trainer.train()
```

when one could just run:

```
train(
    data=data,
    model=model,
    epochs=5,
    lr=1e-4,
    ...
)
```

@booxter commented Jan 30, 2025

Agreed with @JamesKunstle on NOT adopting object-oriented primitives to describe process flow (the ilab methodology being a pipeline of processes with - hopefully pure - inputs and outputs). As an anecdote: we have a Java-style RAG implementation in-tree, with factories and all, which is very hard to follow and unravel due to all the layers of indirection, and I don't think this is a good pattern to continue in this code base, especially for an "SDK".


@anastasds I wonder what makes something an SDK (SDK being Software Development Kit) if not a "pack of compute primitives meant to be consumed externally", which this proposal is(?). Obviously, proper modeling is part of any good API and SDK design, but this seems wider.

AFAIU the primary problem being solved here is that "current consumers of InstructLab are finding themselves importing different library private and public APIs", and it is being solved by giving these consumers something that is actually meant to be consumed. That is an SDK.

@anastasds (Contributor) replied

@booxter in so general a sense, every source codebase is an SDK.

An `sdk` package that contains all the functions, classes, constants, etc. meant to be used for custom development would reasonably be called an SDK. It would unarguably be an SDK if it were its own additional artifact that could be optionally downloaded by those intending to develop custom functionality, e.g. `pip install instructlab[sdk]`.

Having `*.api.*` or `*.core.*` submodules interspersed throughout is just Python APIs.

@cdoern force-pushed the sdk branch 2 times, most recently from 9c37dd2 to dad44a8 (January 30, 2025 18:29)
@cdoern (Contributor, Author) commented Jan 30, 2025

@booxter @JamesKunstle given the consumption patterns of who is going to use this, while also trying to the best of our ability to align with InstructLab's core values, here is what I propose:

Rather than a "spaghetti" structure like:

```
from instructlab.core import Data

sdg_client = Data.Generator(<args>)
sdg_client.generate()
```

I propose instead to at least do:

```
from instructlab.core import Data, Config

config_object = Config.init(auto_detect=True)
data_client = Data(taxonomy_path="", cfg=config_object)
data_client.generate_data()
```

I think this structure at a minimum is what we need, for the following reasons:

  1. "Parent" classes align well with the ethos of the command-line structure. A `Data` class can be instantiated with common values meant to be inherited by other commands, so args are not necessary each time (see the sketch below).
  2. Current consumers expect a hierarchical structure of InstructLab, rather than doing something like `from instructlab.core import generate_data`.
  3. Adding a level of abstraction allows the codebase to meld better with this new "core" package. Alternatively, moving key functions into a core package at the top level would make the codebase confusing to understand, as opposed to grouping them under a class like `Data`.

In general, usability, searchability (is that a word?), and our inheritance structure would suffer from not having at least top-level command classes.
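
To make point 1 concrete, here is a sketch of the shared config carrying common values across command classes. It builds on the proposed (not yet existing) instructlab.core surface above, and the `Model` class and its arguments are assumptions for illustration only:

```python
# Illustrative only: instructlab.core does not exist yet, and Model is assumed.
from instructlab.core import Config, Data, Model

config_object = Config.init(auto_detect=True)   # detect hardware, default paths, etc.

data_client = Data(taxonomy_path="taxonomy/", cfg=config_object)
data_client.generate_data()                     # no repeated args; defaults come from the config

model_client = Model(cfg=config_object)         # the same config is reused, not re-specified
model_client.train(data_path="datasets/")       # only call-specific arguments are passed here
```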

@nathan-weinberg do you agree with the above?

@nathan-weinberg (Member) replied

Indeed - I think from a usability perspective it makes sense for the SDK classes to align with the CLI command groups. Ultimately we're exposing the same (if not more) functionality through a different interface, but keeping the naming conventions of our original interface (the CLI) is a good thing IMO.

I also echo what @JamesKunstle says above about having sub-classes like trainers and generators etc.: it's overengineering in my opinion, though I could be convinced otherwise given a realistic example.

@booxter commented Jan 30, 2025

+1. Keeping the existing click command-group hierarchy reflected in the "SDK" while avoiding piling on classes is a good balance. Represent entities with objects and processes with methods/functions (not with a "factory" to execute processes or to produce objects).

On the question of whether something is an SDK or not, the main point is the intent. If the intent is for external consumers to take it and integrate with it, then it's an SDK. Whether it's shipped in the same artifact is an implementation detail.

That said, as long as the name for this "thing" is not just "API" (which would be ambiguous with the upcoming REST API) but e.g. "Python API", then I am OK with renaming too.

@booxter left a comment

This is correctly directed. We can flesh out details (verbiage etc.) during the development phase. What I'd encourage is to always refer to this as the Python API/SDK going forward.

Signed-off-by: Charlie Doern <cdoern@redhat.com>
@jwm4 (Contributor) left a comment

This seems like a fine start to me and I am fine for this to merge as-is and drive the initial alpha API. However, I think there is a big gap here, because the proposed API doesn't include anything that lets you send in a query and get back a response (both with and without RAG). This would fill the role of `ilab model chat`, but in an API-user-friendly way. Hopefully that can be added in a fast follow.
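
For what it's worth, a purely hypothetical sketch of the kind of entry point being asked for here; the `Chat` class, `ask()` method, and `rag` flag are illustrative names and not part of the proposed SDK:

```python
# Hypothetical only: none of these names exist in the proposal.
from instructlab.core import Chat

chat = Chat(endpoint="http://localhost:8000/v1", model="my-finetuned-model")

# Plain query/response, roughly the role `ilab model chat` plays interactively today.
answer = chat.ask("How do I contribute a new knowledge document?")

# The same call with retrieval over previously ingested documents.
answer = chat.ask("Summarize our onboarding guide.", rag=True)
```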

@mairin (Member) left a comment

Further development is needed here:

  • As @jwm4 pointed out, there are no calls for chat / RAG chat.
  • As @booxter pointed out, we should refer to it as the "Python API/SDK".

Question:

If I were working on an API/SDK for, say, SDG, then instead of something like `instructlab.core.Data.generate_data`, where the namespace is `instructlab.core.*`, would the namespace be `instructlab.data.*`, e.g. `instructlab.data.generate_data`?

I am going to approve this because we've gotten a lot of reviewers, thoughtful comments, edits, and updates, and the basics seem like what is needed for a good start. I would like to see the two points above addressed in a subsequent draft.

@cdoern (Contributor, Author) commented Feb 1, 2025

thanks @mairin!

a few notes on the feedback:

  1. I initially had chat as an endpoint, but both William and Oleg wanted to table that for now, since users might initially opt to just use chat completion requests manually from their server: introduce document for instructlab-sdk #184 (comment)
  2. Yeah, I can do a follow-up switching all instances of SDK to Python SDK.

As for the question:

I think, if I am understanding correctly: if users want to use the SDG library's Python API directly, they would use something like `instructlab-sdg.utils.taxonomy_to_leaf_nodes` instead. The library APIs are meant to be the smaller pieces of what something like `generate_data` would orchestrate. This fits the overall design shift some of the libraries are going for: they want to deal only with functionality pertaining to the inputs/outputs of their library, rather than orchestration. Since today `generate_data` is basically an orchestration of multiple SDG functions, I think core (instructlab/instructlab) would be the only owner of something like that.
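
To illustrate that layering, here is a sketch where all function names are placeholders (they are not real instructlab or instructlab-sdg APIs):

```python
# Placeholder sketch: core owns the orchestration, the SDG library owns the pieces.

def sdg_read_taxonomy(taxonomy_path):
    """Placeholder for a library-level API that turns a taxonomy diff into leaf nodes."""
    ...


def sdg_run_pipeline(leaf_nodes, client):
    """Placeholder for a library-level API that runs an SDG pipeline against a teacher model."""
    ...


def generate_data(taxonomy_path, teacher_client):
    """Core-owned orchestration, roughly the role instructlab/instructlab keeps today."""
    leaf_nodes = sdg_read_taxonomy(taxonomy_path)
    return sdg_run_pipeline(leaf_nodes, client=teacher_client)
```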

Do you want me to open the draft before merging this, or can we do that as a follow-up?

@mairin (Member) commented Feb 1, 2025

Nah, I'll just merge it now; a follow-up is fine.

@mairin merged commit b232636 into instructlab:main on Feb 1, 2025 (4 checks passed).
@jwm4 (Contributor) commented Feb 1, 2025

> since users might initially opt to just use chat completion requests manually from their server

FWIW, we're also planning to wrap the RAG pipeline in a typical chat completion API (for a variety of reasons, e.g., ease of integration with the existing eval code), so this might also fill that need. That might become part of this SDK, but it seems debatable since it is using a pre-existing API. Something to consider for round 2 of this work.
