
introduce document for instructlab-sdk #184

Merged: 1 commit into instructlab:main on Feb 1, 2025

Conversation

@cdoern (Contributor) commented Jan 24, 2025

This document outlines the structure, purpose, and scope of the instructlab-sdk. Implementation specifics will be dealt with in subsequent implementation PRs.

@cdoern force-pushed the sdk branch 3 times, most recently from 95d2b47 to 7bcad36 (January 24, 2025 20:58)
@RobotSail (Member) left a comment
This is a good document overall and a step in the right direction.

@nimbinatus (Member) left a comment

Just one question on my end :) And it may be too far out of scope; feel free to ignore it!

@jwm4 (Contributor) left a comment

This seems like a good idea. A lot of work though, of course.

@nathan-weinberg (Member) left a comment

Overall looking good to me, a few comments

@anastasds (Contributor) left a comment

Is this really an SDK? Is this not just systematization of interfaces in Python modules, that is, Python APIs? For example, there is no "SDK" artifact to download.

This is actually about cleanly modularizing the codebase, isn't it?

@cdoern (Contributor, Author) commented Jan 27, 2025

@anastasds:

> Is this really an SDK? Is this not just systematization of interfaces in Python modules, that is, Python APIs? For example, there is no "SDK" artifact to download.
>
> This is actually about cleanly modularizing the codebase, isn't it?

The goal of this is to give folks who are currently haphazardly consuming different parts of InstructLab a centralized "core" of code that is versioned and static. I don't feel strongly about what we call it, but I believe it is pretty close to an SDK. If there is some sort of standardization we need to fall into, we can, but other projects like llama stack do something similar, though they package it separately.

We could scope it so that it is packaged separately but still lives within the instructlab repo.

@williamcaban left a comment on this section of the proposed document:

```
instructlab.core.system.info
instructlab.core.rag.ingest
instructlab.core.rag.convert
```

The intent of the Python SDK should be to serve data scientists and AI engineers doing experiments as the primary use case, and then the constructs required for embedding or extending InstructLab from other products and platforms.

For the data science experience, I propose taking inspiration from the high-level structure that sklearn, keras, or pytorch provide for this persona.

For example, what would it take to have an experience similar to:

```
from instructlab import InstructLab, SDG, Train, Eval

ilab = InstructLab(taxonomy=<path_to_taxonomy>)  # on initialization, the taxonomy is diffed & loaded

# like in Pandas DataFrame.describe()
ilab.describe()  # prints a summary of the taxonomy diff and attributes
ilab_df = ilab.data.ingest()  # ingest the data from the taxonomy and return a dataset as a Pandas DataFrame

oa_client = {
    "url": "https://api.openactive.io/v1",
    "api_key": "YOUR_API_KEY"
}

##
# SDG interactions
##
ilab_sdg = SDG(client=oa_client,
               data=ilab_df,
               teacher_model={<model_and_attributes>},
               )  # ilab SDG class

ilab_sdg.load_pipeline(<path_to_pipeline>)  # load a pipeline from a file

for block in ilab_sdg.pipeline.get_blocks():
    # do block customization
    ilab_sdg.pipeline[block].something()
    ilab_sdg.pipeline[block].block.foo = "bar"

ilab_sdg.run(
    callback=<callback_function>,  # a function to call after executing each block (e.g. to report progress, or save intermediate results)
)  # generate the SDG data

##
# Training interactions
##
ilab_train = Train(dataset=ilab_sdg.data,
                   student_model={<model_and_attributes>},
                   )  # ilab Train class

ilab_train.run(
    callback=<callback_function>,  # a function to call after each cycle/iteration/epoch loop is run (e.g. to report progress, save intermediate results, or stop the training if certain stopping criteria are met)
)  # execute the training loop

ilab_train.model_save(<path_to_save>, resolution={}, model_format={gguf|safe_tensors|onnx|etc})  # save the model to a file in the specified format

##
# Evaluation
##
ilab_eval = Eval(dataset=[<path_to_eval_dataset>],
                 endpoints=[<path_to_endpoints_for_eval>],
                 )  # ilab Eval class

ilab_eval.evals(
    [list_of_evaluations],  # a list of evaluations to run (e.g. accuracy, precision, dk-bench, etc.)
)

ilab_eval.run(
    callback=<callback_function>,  # a function to call after each evaluation is run (e.g. to report progress, or save intermediate results)
)

ilab_eval.summary()  # print a summary of the evaluation results

ilab_eval.save(<path_to_save>, format={jsonl|parquet|csv|xls})  # save the evaluation results to a file

ilab_eval.export_report(<path_to_save>, format={html|pdf})  # report with the evaluation results
```

I would propose focusing on accelerating the goals of the persona (the data scientists), and not on mapping or exposing InstructLab's internal architecture.

@cdoern (Contributor, Author) replied

I think this is possibly a good end goal, but it would require significant re-architecture of InstructLab and of the libraries (what they expose).

For the purposes of this dev doc, I can incorporate some of this into what I am proposing, but for an alpha SDK my goal is to keep the interactions as simple yet expandable as possible.

So we should aim not to require library adjustments/changes at first, and then we can add functionality once the structure is in place.

Let me incorporate some of this and I will update the PR.

A Contributor left a comment

I think @williamcaban has a point here about the persona we're building this SDK for. Who are we making this SDK for? Data scientists? People trying to build REST-API-based InstructLab services? The design of the SDK might vary based on who its target user is.

A reviewer left a comment

It's definitely important to understand the persona and goal of the SDK. William is proposing a fairly large re-architecting of the APIs, while this dev doc seems mostly focused on exposing existing CLI flows and functionality via a Python SDK. Is our goal to expose the existing end-to-end flow via a Python SDK? Or to provide a new way to interact with various InstructLab components that's more granular and flexible in how you compose your own end-to-end workflow from the pieces we're exposing?

@cdoern (Contributor, Author) replied

I am making some updates to the PR (some are already up), aiming to meld the two approaches a bit, lmk what you think

@bbrowning left a comment

I think the updates look like a reasonable enough place for work to start.

The actual Python code and list of APIs are only illustrative, right? From reading this, my understanding is that the parameters and methods on the individual classes shown here are just an example, since you call out that future work is to design the actual SDK based on the structure above and to negotiate user contracts with library maintainers. For example, if SDG points out that the `Data.Generator` constructor needs different params, or that we actually have to use the taxonomy diff results as input to data generation rather than just checking whether there is a diff, that kind of thing can be ironed out later?

A Member left a comment

One additional point I want to raise: what we definitely don't want is to expose pure library functionality through a class in the Core repo; users that want that should just import the libraries directly.

As Charlie notes above, what we want here is an SDK for the opinionated ilab workflow that this package is centered around. For example, our `Data` class wouldn't expose every public function in the SDG library, but it would expose functions we have within `src/instructlab/data` such as `generate_data` and `list_data`.
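
As a rough sketch of that boundary (the class shape, constructor arguments, and method bodies below are assumptions for illustration, not an agreed API):

```python
# Hypothetical sketch only: names follow the comment above, not a published API.
from pathlib import Path


class Data:
    """Workflow-level wrapper around the opinionated ilab data flow."""

    def __init__(self, taxonomy_path: str, output_dir: str = "datasets") -> None:
        self.taxonomy_path = Path(taxonomy_path)
        self.output_dir = Path(output_dir)

    def generate_data(self) -> None:
        # Would drive the same code path as generate_data in src/instructlab/data,
        # which in turn orchestrates the SDG library; stubbed here.
        raise NotImplementedError

    def list_data(self) -> list[Path]:
        # Would mirror list_data in src/instructlab/data.
        return sorted(self.output_dir.glob("*.jsonl"))


# Pipelines, blocks, and other SDG internals are deliberately not re-exported;
# consumers who need them import the instructlab-sdg library directly.
```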

@cdoern (Contributor, Author) replied

@bbrowning yep, the code is pretty much illustrative. The structure of the classes is what I am focusing on the most. The actual arguments, functions within the `Data` class, etc. can be ironed out later on. If we were to figure all of that out now, it would take forever!

@booxter left a comment

This is a great idea. We should incorporate real experience from current consumers into the design of the Python public API.

```
data_paths = instructlab.core.data.generate(...)
some_custom_handling(data_paths)
instructlab.core.model.train(data_path=..., strategy=lab_multiphase)
```

Not an issue with the proposal itself since the snippet is only illustrative, but I'd consider whether ilab SDK should promote using file paths for any artifacts. I'd rather have ilab encapsulate the knowledge about storage options (for models and datasets) and then allow SDK users to refer to the artifacts by their names. (This would be in line with the spirit of instructlab/instructlab#1832 and some other issues under instructlab/instructlab#1871)
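
A minimal sketch of what name-based resolution could look like, assuming a hypothetical registry class and an indicative storage layout (neither is part of this proposal):

```python
# Hypothetical sketch: the SDK owns the storage convention, so callers refer to
# artifacts by name instead of building file paths themselves.
from pathlib import Path


class DatasetRegistry:
    """Maps user-facing dataset names to ilab-managed storage locations."""

    def __init__(self, root: str = "~/.local/share/instructlab/datasets") -> None:
        # The root shown here is only indicative of an XDG-style ilab data dir.
        self.root = Path(root).expanduser()

    def resolve(self, name: str) -> Path:
        return self.root / f"{name}.jsonl"


registry = DatasetRegistry()
dataset_path = registry.resolve("my_generated_dataset")  # caller passes a name...
# ...rather than hard-coding whatever path happens to back that dataset today.
```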

@cdoern force-pushed the sdk branch 2 times, most recently from ae451c3 to 5cc519f (January 29, 2025 16:16)
@anastasds (Contributor) commented Jan 29, 2025

@cdoern I want to revisit and re-emphasize the point I raised earlier.

I really do not see how this is an SDK. This is about Python API design and modularization.

The act of logically grouping operations, giving them good names and understandable function signatures - of defining the "nouns" and "verbs" in the application - is an act of... wait for it... domain modeling!

The work being proposed here is fundamentally a modeling and modularization effort - it is not about creating an SDK.
(And I fully support it!)


@JamesKunstle left a comment

I really like the design changes represented in this document.

One consideration I'd raise as we proceed is that we may want to avoid the object-oriented pattern that the HF Trainer object is representative of, and instead prefer a functional programming convention.

The ravioli-code "Trainer" / "Generator" classes are a problematic design pattern elsewhere in this problem space. instructlab/training has avoided duplicating this convention because it's harder to reason about and encourages large procedural instance methods.

Instead, it's more readable, hackable, and composable to have an SDK expose functional building blocks, with attention to the interfaces informing the call.

For illustration, why ever use:

```
trainer = Trainer(
    data=data,
    model=model,
    epochs=5,
    lr=1e-4,
    ...
)

trainer.train()
```

when one could just run:

```
train(
    data=data,
    model=model,
    epochs=5,
    lr=1e-4,
    ...
)
```

@booxter commented Jan 30, 2025

Agreed with @JamesKunstle on NOT adopting object-oriented primitives to describe process flow (the ilab methodology being a pipeline of processes with - hopefully pure - inputs and outputs). As an anecdote: we have a Java-style RAG implementation in-tree, with factories and all, which is very hard to follow and unravel due to all the layers of indirection, and I don't think this is a good pattern to continue in this code base, especially for an "SDK".


@anastasds I wonder what makes something an SDK (SDK being Software Development Kit) if not a "pack of compute primitives meant to be consumed externally", which this proposal is(?). Obviously, proper modeling is part of any good API and SDK design, but this seems wider.

AFAIU the primary problem being solved here is that "current consumers of InstructLab are finding themselves importing different library private and public APIs", and it is being solved by giving these consumers something that is actually meant to be consumed. That is an SDK.

@anastasds (Contributor) replied

@booxter in so general a sense, every source codebase is an SDK.

An `sdk` package that contains all the functions, classes, constants, etc. meant to be used for custom development would reasonably be called an SDK. It would unarguably be an SDK if it were its own additional artifact that could be optionally downloaded by those intending to develop custom functionality, e.g. `pip install instructlab[sdk]`.

Having `*.api.*` or `*.core.*` submodules interspersed throughout is just Python APIs.

@cdoern force-pushed the sdk branch 2 times, most recently from 9c37dd2 to dad44a8 (January 30, 2025 18:29)
@cdoern (Contributor, Author) commented Jan 30, 2025

@booxter @JamesKunstle given the consumption patterns of who is going to use this, while also trying to the best of our ability to align with InstructLab's core values, here is what I propose:

Rather than a "spaghetti" structure like:

```
from instructlab.core import Data

sdg_client = Data.Generator(<args>)
sdg_client.generate()
```

I propose instead to at least do:

```
from instructlab.core import Data, Config

config_object = Config.init(auto_detect=True)
data_client = Data(taxonomy_path="", cfg=config_object)
data_client.generate_data()
```

I think this structure at a minimum is what we need, for the following reasons:

  1. "Parent" classes align well with the ethos of the command-line structure. A `Data` class can be instantiated with common values meant to be inherited by other commands, so args are not necessary each time (see the sketch below).
  2. Current consumers expect a hierarchical structure of InstructLab, rather than doing something like `from instructlab.core import generate_data`.
  3. Adding a level of abstraction allows the codebase to meld better with this new "core" package. Alternatively, moving key functions into a core package at the top level would make the codebase confusing to understand, as opposed to grouping them under a class like `Data`.

In general, usability, searchability (is that a word?), and our inheritance structure would suffer from not having at least top-level command classes.
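
To make point 1 concrete, here is a sketch of the shared config carrying common values across command classes. It builds on the proposed (not yet existing) instructlab.core surface above, and the `Model` class and its arguments are assumptions for illustration only:

```python
# Illustrative only: instructlab.core does not exist yet, and Model is assumed.
from instructlab.core import Config, Data, Model

config_object = Config.init(auto_detect=True)   # detect hardware, default paths, etc.

data_client = Data(taxonomy_path="taxonomy/", cfg=config_object)
data_client.generate_data()                     # no repeated args; defaults come from the config

model_client = Model(cfg=config_object)         # the same config is reused, not re-specified
model_client.train(data_path="datasets/")       # only call-specific arguments are passed here
```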

@nathan-weinberg do you agree with the above?

@nathan-weinberg (Member) replied

Indeed - I think from a usability perspective it makes sense for the SDK classes to align with the CLI command groups. Ultimately we're exposing the same (if not more) functionality through a different interface, but keeping the naming conventions of our original interface (the CLI) is a good thing IMO.

I also echo what @JamesKunstle says above about having sub-classes like trainers and generators etc.: it's overengineering in my opinion, though I could be convinced otherwise given a realistic example.

@booxter commented Jan 30, 2025

+1. Keeping the existing click command-group hierarchy reflected in the "SDK" while avoiding piling on classes is a good balance. Represent entities with objects and processes with methods/functions (not with a "factory" to execute processes or to produce objects).

On the question of whether something is an SDK or not, the main point is the intent. If the intent is for external consumers to take it and integrate with it, then it's an SDK. Whether it's shipped in the same artifact is an implementation detail.

That said, as long as the name for this "thing" is not just "API" (which would be ambiguous with the upcoming REST API) but e.g. "Python API", then I am OK with renaming too.

@booxter left a comment

This is correctly directed. We can flesh out details (verbiage etc.) during the development phase. What I'd encourage is to always refer to this as the Python API/SDK going forward.

Signed-off-by: Charlie Doern <cdoern@redhat.com>
@jwm4 (Contributor) left a comment

This seems like a fine start to me and I am fine for this to merge as-is and drive the initial alpha API. However, I think there is a big gap here, because the proposed API doesn't include anything that lets you send in a query and get back a response (both with and without RAG). This would fill the role of `ilab model chat`, but in an API-user-friendly way. Hopefully that can be added in a fast follow.
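
For what it's worth, a purely hypothetical sketch of the kind of entry point being asked for here; the `Chat` class, `ask()` method, and `rag` flag are illustrative names and not part of the proposed SDK:

```python
# Hypothetical only: none of these names exist in the proposal.
from instructlab.core import Chat

chat = Chat(endpoint="http://localhost:8000/v1", model="my-finetuned-model")

# Plain query/response, roughly the role `ilab model chat` plays interactively today.
answer = chat.ask("How do I contribute a new knowledge document?")

# The same call with retrieval over previously ingested documents.
answer = chat.ask("Summarize our onboarding guide.", rag=True)
```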

@mairin (Member) left a comment

Further development is needed here:

  • As @jwm4 pointed out, there are no calls for chat / RAG chat.
  • As @booxter pointed out, we should refer to it as the "Python API/SDK".

Question:

If I were working on an API/SDK for, say, SDG, then instead of something like `instructlab.core.Data.generate_data`, where the namespace is `instructlab.core.*`, would the namespace be `instructlab.data.*`, e.g. `instructlab.data.generate_data`?

I am going to approve this because we've gotten a lot of reviewers, thoughtful comments, edits, and updates, and the basics seem like what is needed for a good start. I would like to see the two points above addressed in a subsequent draft.

@cdoern (Contributor, Author) commented Feb 1, 2025

thanks @mairin!

a few notes on the feedback:

  1. I initially had chat as an endpoint, but both William and Oleg wanted to table that for now, since users might initially opt to just use chat completion requests manually from their server: introduce document for instructlab-sdk #184 (comment)
  2. Yeah, I can do a follow-up switching all instances of SDK to Python SDK.

As for the question:

I think, if I am understanding correctly: if users want to use the SDG library's Python API directly, they would use something like `instructlab-sdg.utils.taxonomy_to_leaf_nodes` instead. The library APIs are meant to be the smaller pieces of what something like `generate_data` would orchestrate. This fits the overall design shift some of the libraries are going for: they want to deal only with functionality pertaining to the inputs/outputs of their library, rather than orchestration. Since today `generate_data` is basically an orchestration of multiple SDG functions, I think core (instructlab/instructlab) would be the only owner of something like that.
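
To illustrate that layering, here is a sketch where all function names are placeholders (they are not real instructlab or instructlab-sdg APIs):

```python
# Placeholder sketch: core owns the orchestration, the SDG library owns the pieces.

def sdg_read_taxonomy(taxonomy_path):
    """Placeholder for a library-level API that turns a taxonomy diff into leaf nodes."""
    ...


def sdg_run_pipeline(leaf_nodes, client):
    """Placeholder for a library-level API that runs an SDG pipeline against a teacher model."""
    ...


def generate_data(taxonomy_path, teacher_client):
    """Core-owned orchestration, roughly the role instructlab/instructlab keeps today."""
    leaf_nodes = sdg_read_taxonomy(taxonomy_path)
    return sdg_run_pipeline(leaf_nodes, client=teacher_client)
```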

Do you want me to open the draft before merging this, or can we do that as a follow-up?

@mairin (Member) commented Feb 1, 2025

Nah, I'll just merge it now; a follow-up is fine.

@mairin merged commit b232636 into instructlab:main on Feb 1, 2025 (4 checks passed).
@jwm4 (Contributor) commented Feb 1, 2025

> since users might initially opt to just use chat completion requests manually from their server

FWIW, we're also planning to wrap the RAG pipeline in a typical chat completion API (for a variety of reasons, e.g., ease of integration with the existing eval code), so this might also fill that need. That might become part of this SDK, but it seems debatable since it is using a pre-existing API. Something to consider for round 2 of this work.
