-
import abc
import dataclasses as dc
import typing as t

import requests

from superduperdb.base.artifact import Artifact
from superduperdb.ext.utils import ensure_initialized


@dc.dataclass
class SuperduperModel(abc.ABC):
    preprocess: t.Union[t.Callable, Artifact, None] = None
    postprocess: t.Union[t.Callable, Artifact, None] = None
    lazy_loading: bool = True

    def __post_init__(self):
        if not isinstance(self.preprocess, Artifact):
            self.preprocess = Artifact(self.preprocess)
        if not isinstance(self.postprocess, Artifact):
            self.postprocess = Artifact(self.postprocess)
        if not self.lazy_loading:
            self.init()

    @ensure_initialized
    def predict(self, X, **kwargs):
        if self.preprocess:
            X = self.preprocess.artifact(X)
        y = self._predict(X)
        if self.postprocess:
            y = self.postprocess.artifact(y)
        return y

    @ensure_initialized
    def batch_predict(self, Xs, **kwargs):
        if self.preprocess:
            Xs = self.preprocess.artifact(Xs)
        ys = self._batch_predict(Xs)
        if self.postprocess:
            ys = self.postprocess.artifact(ys)
        return ys

    def db_predict(self, *args, **kwargs):
        # same as _Predict.predict
        # (fetch_data_from_db, convert_outputs_to_datas, save_to_db are placeholders)
        datas = fetch_data_from_db()
        outputs = self.batch_predict(datas, *args, **kwargs)
        save_datas = convert_outputs_to_datas(outputs)
        save_to_db(save_datas)

    def db_train(self, *args, **kwargs):
        # same as _Train.train
        datas = fetch_data_from_db()
        self.train(datas, *args, **kwargs)
        update_model_to_db(self)
        ...

    def init(self):
        pass

    @abc.abstractmethod
    def _predict(self, X, **kwargs):
        pass

    @abc.abstractmethod
    def _batch_predict(self, Xs, **kwargs):
        pass

    def train(self, datasets, *args, **kwargs):
        # same as _Train.train
        raise NotImplementedError


#############################################################
# The following is how to create a new model class

@dc.dataclass
class OpenAI(SuperduperModel):
    model: str = "gpt-4"

    # Lazy loading using init
    def init(self):
        from openai import Client
        self.client = Client()

    def __post_init__(self):
        return super().__post_init__()

    def _predict(self, X, **kwargs):
        return self.client.complete(X, model=self.model)

    def _batch_predict(self, Xs, **kwargs):
        return self.client.complete(Xs, model=self.model)


@dc.dataclass
class ObjectModel(SuperduperModel):
    object: t.Union[Artifact, t.Any, None] = None
    predict_method: t.Optional[str] = None

    def __post_init__(self):
        super().__post_init__()
        if not isinstance(self.object, Artifact):
            self.object = Artifact(self.object)

    def _predict(self, X, **kwargs):
        func = getattr(self.object.artifact, self.predict_method)
        return func(X, **kwargs)

    def _batch_predict(self, Xs, **kwargs):
        func = getattr(self.object.artifact, self.predict_method)
        return func(Xs, **kwargs)


@dc.dataclass
class APIModel(SuperduperModel):
    api: str = "http://localhost:8000"

    def _predict(self, X, *args, **kwargs):
        return requests.post(self.api, json=X).json()

    def _batch_predict(self, Xs, *args, **kwargs):
        return requests.post(self.api, json=Xs).json()


@dc.dataclass
class Ollama(SuperduperModel):
    model_name: str = "ollama"

    def init(self):
        from ollama import Ollama
        self.model = Ollama(self.model_name)

    def _predict(self, X, **kwargs):
        return self.model.predict(X, **kwargs)

    def _batch_predict(self, Xs, **kwargs):
        return self.model.predict(Xs, **kwargs)


@dc.dataclass
class Huggingface(SuperduperModel):
    model_name: str = "gpt2"

    def init(self):
        from transformers import AutoModelForCausalLM
        self.model = AutoModelForCausalLM.from_pretrained(self.model_name)

    def _predict(self, X, **kwargs):
        return self.model.generate(X, **kwargs)

    def _batch_predict(self, Xs, **kwargs):
        return self.model.generate(Xs, **kwargs)

    def train(self, datasets, *args, **kwargs):
        # same as _Train.train
        from transformers import Trainer
        ...
model = OpenAI(preprocess=..., postprocess=...)

The main point is that all new models only need to focus on their own initialization, prediction, and training functions; they do not need to care about the interaction with the db, including preprocess and postprocess. If we need functional enhancements, such as multi-threaded prediction for API models, we can just write an ApiBatchPredictMixin to add them. If we need to handle context in the LLM scenario, we can write a Mixin that processes the context, or directly inherit from the existing LLM class and override how predict handles its input X.
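For example, the multi-threaded API prediction could be a small mixin along these lines. This is only a sketch: the mixin name and the thread-pool approach are illustrative, not part of the proposal.

import concurrent.futures

class ApiBatchPredictMixin:
    # Illustrative sketch: fan the single-item _predict out over a thread pool,
    # so API-backed models get concurrent batch prediction without other changes.
    max_workers: int = 8

    def _batch_predict(self, Xs, **kwargs):
        with concurrent.futures.ThreadPoolExecutor(max_workers=self.max_workers) as pool:
            return list(pool.map(lambda x: self._predict(x, **kwargs), Xs))

# Everything else (preprocess/postprocess, db interaction) stays in SuperduperModel:
# class ThreadedAPIModel(ApiBatchPredictMixin, APIModel):
#     pass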
-
An additional suggestion (TBD): the artifact conversion operation could also be left to the parent class. That way, at the model level there is no need to convert between an object and its Artifact wrapper; the attribute can just be used directly. Otherwise, every model has to do the conversion itself.
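A minimal sketch of the difference, reusing ObjectModel._predict from the first comment; the second variant assumes (hypothetically) that the parent class has already unwrapped the Artifact before the model-level code runs:

# today: the model-level code reaches through the Artifact wrapper itself
def _predict(self, X, **kwargs):
    func = getattr(self.object.artifact, self.predict_method)
    return func(X, **kwargs)

# with conversion done by the parent class, the wrapper never shows up here
def _predict(self, X, **kwargs):
    func = getattr(self.object, self.predict_method)
    return func(X, **kwargs)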
-
# implemented once by SuperDuperDB: the user-facing signature mirrors sklearn
def fit(self, X, y, ...):
    ...

# so that usage stays as close as possible to plain sklearn:
model = sklearn.svm.SVC()
model.fit(X, y)

# implemented for each type of model by the community
def _fit(self, train_dataset, valid_dataset, train_func: t.Optional[t.Callable] = None):
    X = []
    for r in train_dataset:
        X.append(r)
    self.train_func = train_func
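One way the two layers could connect, shown only as a sketch; helper names such as build_datasets are hypothetical, not existing SuperDuperDB API:

# sketch: the once-implemented fit prepares datasets (from arrays or from the db)
# and delegates the model-specific work to _fit
def fit(self, X=None, y=None, *, db=None, select=None, **kwargs):
    train_dataset, valid_dataset = build_datasets(X, y, db=db, select=select)  # hypothetical helper
    return self._fit(train_dataset, valid_dataset, **kwargs)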
-
@jieguangzhou to provide ideas on how we can simplify the preprocess -> forward -> postprocess abstraction without losing the low-code aspect.
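One possible shape for such a simplification, purely as a sketch; the CallableModel wrapper below is hypothetical and not an existing SuperDuperDB API, it only illustrates a low-code form of the same three-step pipeline:

import dataclasses as dc
import typing as t

@dc.dataclass
class CallableModel:
    # hypothetical low-code wrapper: three plain callables, no subclassing needed
    preprocess: t.Optional[t.Callable] = None
    forward: t.Optional[t.Callable] = None
    postprocess: t.Optional[t.Callable] = None

    def predict(self, X):
        if self.preprocess:
            X = self.preprocess(X)
        y = self.forward(X)
        if self.postprocess:
            y = self.postprocess(y)
        return y

model = CallableModel(
    preprocess=str.lower,
    forward=lambda text: {"text": text[::-1]},   # stand-in for a real model call
    postprocess=lambda d: d["text"],
)
print(model.predict("Hello"))  # -> "olleh"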