
introduce document for instructlab-sdk #184

Merged
merged 1 commit (Feb 1, 2025)
4 changes: 4 additions & 0 deletions .spellcheck-en-custom.txt
@@ -6,6 +6,8 @@ agentic
Akash
AMDGPU
Anil
API
api
arge
args
arXiv
@@ -215,6 +217,8 @@ Salawu
scalable
SDG
sdg
SDK
sdk
semvar
sexualized
SHA
153 changes: 153 additions & 0 deletions docs/sdk/instructlab-sdk.md
@@ -0,0 +1,153 @@
# InstructLab Python SDK

## Motivation

Today, the only way to "drive" the InstructLab opinionated workflow is via the `ilab` CLI. While this process provides a succinct way for everyday users to initialize a config, generate synthetic data, train a model, and evaluate it, the guardrails are quite limiting, both in what a user can do and in what the development team can add as features exposed directly to the user over time.

Additionally, current consumers of InstructLab find themselves importing various libraries' private and public APIs in combination with CLI functionality to achieve the workflows they want. While these more advanced usage patterns are not for everyone, providing standardized and safe ways for the community to run bespoke and piecemeal workflows is a necessity.

Unifying this range of advanced workflows under an overarching `InstructLab Python SDK` will allow for new usage patterns and a clearer story about what InstructLab can and should provide as user-accessible endpoints.

While each library can and _should_ have its own publicly accessible SDK, not all functionality being added to SDG, Training, and Eval needs to be correlated directly to the "InstructLab workflow". This Python SDK should, as the CLI does, expose an opinionated flow that uses functionality from the various libraries. The InstructLab SDK should be derived from the library APIs, not the other way around. SDG, for example, currently has a `generate_data` method, meant to be accessed only by InstructLab; it simply calls other publicly available SDG functionality. Orchestration of the InstructLab flow like this should not be a concern of the individual libraries; it should instead be handled by the overarching InstructLab SDK, which will maintain the user contracts. The InstructLab SDK will need to work within the bounds of what the libraries expose as public APIs.

The benefit of the above is that the opinionated flow can be accessed in a more nuanced and piecemeal way while also gaining the potential for more advanced features. Say a consumer wants to:

1. Set up a custom config file for ilab (optional)
2. Initialize a taxonomy
3. Ensure their taxonomy is valid
4. Ingest some data for RAG and SDG (SDG coming soon)
5. Generate synthetic data using an InstructLab pipeline
6. Do some custom handling per their use case
7. Fine-tune a model using the custom config they initialized for their hardware
8. Evaluate their model after training using various benchmarks

A user could do this if they had an SDK.

(the structure of the SDK and actual arguments are discussed below)

Today, however, users are forced to run a sequence of commands tailored to work only with the proper directory structure on the system.

## Major Goals

1. Modularize the InstructLab workflow such that any part can be run independently
2. Allow users to choose whether or not to take advantage of the config/system-profile method of running InstructLab, meaning they do not need any pre-existing configuration to run the SDK.
3. Standardize user contracts for the existing functionality of the InstructLab workflow. Existing CLI commands should be using the SDK once past click parsing, not separate code.
4. Define contracts loosely enough that functionality can be expanded as more advanced features are released.
5. Document SDK usage in upcoming InstructLab releases.

## Non-Goals

1. Exposing all library functionality immediately
2. Replacing CLI
3. Shipping a generally available SDK, as opposed to v1alpha1 or v1beta1.

## Design

### Versioning

The SDK would start at version v1alpha1 such that it can change/break at any time for the first few iterations as libraries adjust their API surface.

### Structure

This SDK should preferably live in a net-new package inside of `instructlab/instructlab`, rather than in a new repository, to limit unnecessary imports. The SDK could be imported as `instructlab.core...`

The user surface initially should look like this:

`instructlab.core` contains all SDK definitions. Users can `from instructlab.core import...` to use specific SDK classes

For most of the existing InstructLab command groups, there should be a class:

`from instructlab.core import Config, Taxonomy, Data, Model, RAG, System`

The full list of classes and their methods for now (subject to change during the development process):

```console
instructlab.core.Config
instructlab.core.Config.init
instructlab.core.Config.show (get)
instructlab.core.Taxonomy
instructlab.core.Taxonomy.diff
instructlab.core.System
instructlab.core.System.info
instructlab.core.Data
instructlab.core.Data.ingest
instructlab.core.Data.generate_data
instructlab.core.Model
instructlab.core.Model.serve
instructlab.core.Model.train_model
instructlab.core.Model.process_data (calling the training library's data process class in a safe way)
instructlab.core.Model.evaluate_mt_bench
instructlab.core.Model.evaluate_dk_bench
instructlab.core.Model.evaluate_mmlu_bench
instructlab.core.RAG.ingest
instructlab.core.RAG.convert
```

a brief example:

```python
from instructlab.core import Config, Taxonomy, Data, Model

config_object = Config.init(...)
diff = Taxonomy.diff()

if diff:
    data_client = Data(data_path="", teacher_model="", num_cpus="", taxonomy_path="")

    # not in v1alpha1
    data_path = data_client.ingest()

    openai_compat_client = some_server()

    data_jsonls = data_client.generate_data(client=openai_compat_client, data=data_path)

    some_custom_handling(data_jsonls)

    # you can either use a config obj or pass trainer args
    model_client = Model(student_model=path_to_student_model, configuration=config_object)

    model_path = model_client.train_model()

    # since we initialized the model client with the config, the training args are passed implicitly
    eval_output = model_client.evaluate_mt_bench(model_path=model_path)
```


The intent of the Python SDK should be to serve the data scientists and AI engineers running experiments as the primary use case, and only then the constructs required for embedding or extending InstructLab from other products and platforms.

For a data science experience, I propose taking inspiration from the high-level structure that sklearn, keras, or pytorch provide for this persona.

For example, what would it take to have an experience similar to:

```python
from instructlab import InstructLab, SDG, Train, Eval

ilab = InstructLab(taxonomy=<path_to_taxonomy>)  # on initialization, the taxonomy is diffed & loaded

# like in Pandas DataFrame.describe()
ilab.describe()  # prints a summary of the taxonomy diff and attributes
ilab_df = ilab.data.ingest()  # ingest the data from the taxonomy and return a dataset as a Pandas DataFrame

oa_client = {
    "url": "https://api.openactive.io/v1",
    "api_key": "YOUR_API_KEY"
}

##
# SDG interactions
##
ilab_sdg = SDG(client=oa_client,
               data=ilab_df,
               teacher_model={<model_and_attributes>},
               )  # ilab SDG class

ilab_sdg.load_pipeline(<path_to_pipeline>)  # load a pipeline from a file

for block in ilab_sdg.pipeline.get_blocks():
    # do block customization
    ilab_sdg.pipeline[block].something()
    ilab_sdg.pipeline[block].block.foo = "bar"

ilab_sdg.run(
    callback=<callback_function>,  # a function to call after executing each block (e.g. to report progress, or save intermediate results)
)  # generate the SDG data

##
# Training interactions
##
ilab_train = Train(dataset=ilab_sdg.data,
                   student_model={<model_and_attributes>},
                   )  # ilab Train class

ilab_train.run(
    callback=<callback_function>,  # a function to call after each cycle/iteration/epoch loop is run (e.g. to report progress, save intermediate results, or stop the training if certain stopping criteria are met)
)  # execute the training loop

ilab_train.model_save(<path_to_save>, resolution={}, model_format={gguf|safe_tensors|onnx|etc})  # save the model to a file in the specified format

##
# Evaluation
##
ilab_eval = Eval(dataset=[<path_to_eval_dataset>],
                 endpoints=[<path_to_endpoints_for_eval>],
                 )  # ilab Eval class

ilab_eval.evals(
    [list_of_evaluations],  # a list of evaluations to run (e.g. accuracy, precision, dk-bench, etc.)
)

ilab_eval.run(
    callback=<callback_function>,  # a function to call after each evaluation is run (e.g. to report progress, or save intermediate results)
)

ilab_eval.summary()  # print a summary of the evaluation results

ilab_eval.save(<path_to_save>, format={jsonl|parquet|csv|xls})  # save the evaluation results to a file

ilab_eval.export_report(<path_to_save>, format={html|pdf})  # report with the evaluation results
```

I would propose to focus on accelerating the goal of the persona (the data scientists), and not on mapping or exposing the InstructLab internal architecture.

Contributor Author

I think this is possibly a good end goal, but would require significant re-architecture of InstructLab, and the libraries (what they expose).

For the purposes of this dev-doc, I can incorporate some of this into what I am proposing, but for an alpha SDK, keeping the interactions as simple yet expandable as possible is my goal.

So we should aim to not require library adjustments/changes at first, and then we can add functionality once the structure is in place.

Let me incorporate some of this and I will update the PR

Contributor

I think @williamcaban has a point here about the persona we're building this SDK for. Who are we making this SDK for? Data scientists? People trying to build REST API-based InstructLab services? The design of the SDK might vary based on who our target user of the SDK is.


It's definitely important to understand the persona and goal of the SDK. William is proposing a fairly large re-architecting of the APIs, while this dev doc seems mostly focused on exposing existing CLI flows and functionality via a Python SDK. Is our goal to expose the existing end-to-end flow via a Python SDK? Or to provide a new way to interact with various InstructLab components that's more granular and flexible in how you compose your own end-to-end workflow from the pieces we're exposing?

Contributor Author

I am making some updates to the PR (some are already up), aiming to meld the two approaches a bit, lmk what you think


I think the updates look like a reasonable enough place for work to start.

The actual Python code and list of APIs is only illustrative, right? From reading this, it's my understanding that the actual parameters and methods on individual classes shown here are just an example since you call out future work is to design the actual SDK based on the structure above and negotiate user contracts with library maintainers. For example, if SDG points out that the Data.Generator constructor needs different params or that we have to actually use the taxonomy diff results as input to data generation as opposed to just checking if there is a diff, this kind of thing can be ironed out later?

Member

One additional point I want to raise - what we definitely don't want is to be exposing pure library functionalities through a class in the Core repo - users that want that should just be importing the libraries directly.

As Charlie notes above what we want here is an SDK for the opinionated ilab workflow that this package is centered around - for example, our Data class wouldn't expose every public function in the SDG library, but it would expose functions we have within src/instructlab/data such as generate_data and list_data

Contributor Author

@bbrowning yep, the code is pretty much illustrative. The structure of classes I think is what I am focusing on the most. The actual arguments, functions within the Data class, etc can be ironed out later on. If we were to figure that out now, it would take forever!


The above example utilizes the configuration object to instantiate the `Model` class. However, a user could also pass `training_args=` directly to `model_client.train_model` to override the defaults from the configuration object. This allows the SDK to utilize the System Profiles of the ilab CLI without relying on them too heavily.

Presumably, the distinct methods under each class will grow, which is why I am opting to make very distinct classes per command group. Another benefit to the parent classes is that individual methods can inherit defaults from the instantiation of the object.
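The defaults-inheritance pattern described above can be sketched with a minimal stub. All names here (the `Model` class shape, the `train` config key, the argument names) are hypothetical stand-ins for illustration, not the actual SDK contract:

```python
# Minimal sketch of the proposed pattern: the class captures configuration at
# instantiation time, and each method accepts per-call overrides. The Model
# name and its arguments are illustrative only, not the real SDK surface.
class Model:
    def __init__(self, student_model, configuration=None):
        self.student_model = student_model
        # defaults inherited from the config object (e.g. a parsed config.yaml)
        self.configuration = configuration or {}

    def train_model(self, training_args=None):
        # explicit training_args win over the configuration defaults
        return {**self.configuration.get("train", {}), **(training_args or {})}


client = Model("models/student", configuration={"train": {"epochs": 10, "lr": 2e-5}})
client.train_model()                             # uses the config defaults
client.train_model(training_args={"epochs": 1})  # overrides a single default
```

The same merge-with-overrides idea would let every method on `Data`, `Model`, etc. inherit from the object's instantiation while still supporting one-off, config-free calls.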

These initial exposed functions can expand to include any new functionality from the various libraries that is more SDK-oriented. For example, if SDG adds something like subset selection, teacher-as-annotator, data mixing, etc., we could expose an `instructlab.core.Data.annotate` or `instructlab.core.Data.mix` that could be invoked in sequence in a user's script with other parts of the ilab workflow. Some things make _less_ sense to be exposed via a CLI but are still critical to ensuring users get a good model and properly generated data.

There are certain things that currently exist only in `ilab`, as well as functionality that is going to be moving there, such as data ingestion, RAG, etc. Forming an SDK for `instructlab` allows us to capture all of these concerns under one API.

These endpoints, in combination with the curated InstructLab config file, will open up these workflows to users and allow InstructLab to be easily incorporated into other projects. Allowing people to run things like data generation and full fine-tuning via an SDK that pulls in their pre-existing `config.yaml` but can also be run independently will open new avenues for InstructLab adoption and extensibility.

## Changes to the CLI

The `ilab` CLI will need to adapt to this new structure. Commands like `ilab data generate` should, in terms of code, follow this flow:

1. `src/instructlab/cli/data/generate.py`
2. `src/instructlab/data/generate.py`
3. `src/instructlab/process.py`
4. `src/instructlab/core/data/generate.py`

So the general flow is: CLI -> process-management package to kick off a sub-process -> internal handling package -> core SDK (public definitions) -> library code.

The flow of the CLI today is such that the CLI package for a command (`src/instructlab/cli/data/generate.py`) parses the command-line options, manages creating a sub-process, and passes control to the core code (`src/instructlab/core/data/generate.py`), which then calls out to the library APIs.

The internal handling package is necessary as it allows us to split off a sub-process when it makes the most sense for us before calling the library code directly. This is how the CLI works today.

The difference with an SDK is that we would eventually want to end up executing `core/data/generate.py`, the actual publicly consumable Python SDK. This ensures that the CLI can do whatever custom handling it needs on top, but it must eventually boil down to the `core` package, which uses publicly available methods from the various libraries.
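The layering above can be sketched with stub functions standing in for the real modules. The function names and parameters are assumptions for illustration only, and `argparse` stands in for click so the sketch stays self-contained:

```python
import argparse

# src/instructlab/core/data/generate.py -- the public SDK surface; in the
# real flow this would call publicly available SDG library methods.
def core_generate_data(taxonomy_path: str, num_instructions: int) -> dict:
    return {"taxonomy": taxonomy_path, "generated": num_instructions}

# src/instructlab/data/generate.py -- internal handling; in the real flow,
# src/instructlab/process.py would split off a sub-process here before
# calling into the core SDK.
def internal_generate(taxonomy_path: str, num_instructions: int) -> dict:
    return core_generate_data(taxonomy_path, num_instructions)

# src/instructlab/cli/data/generate.py -- the CLI layer (click in reality).
def cli_generate(argv: list) -> dict:
    parser = argparse.ArgumentParser(prog="ilab data generate")
    parser.add_argument("--taxonomy-path", default="taxonomy")
    parser.add_argument("--num-instructions", type=int, default=10)
    args = parser.parse_args(argv)
    # past argument parsing, the CLI goes through the SDK, not separate code
    return internal_generate(args.taxonomy_path, args.num_instructions)


cli_generate(["--taxonomy-path", "my-taxonomy", "--num-instructions", "5"])
```

The key property is that only the bottom layer touches library APIs, so a script importing `instructlab.core` directly gets exactly the same behavior as the CLI.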

## Scope of work

In upcoming releases the InstructLab team should aim to:

1. Design the SDK given the structure above
2. Converse with Library maintainers to negotiate user contracts
3. Begin work to re-architect how the CLI works using the SDK
4. Publish an alpha SDK for public consumption

After this initial work, the team can scope adding net new functionality that is not in the CLI to the SDK.