
[ENHANCEMENT] Use Phoenix to run Evals using Internal Hosted LLMs #2280

Closed
amank94 opened this issue Feb 14, 2024 · 4 comments
Labels
c/dx Developer experience c/evals enhancement New feature or request

Comments

@amank94
Contributor

amank94 commented Feb 14, 2024

Is your feature request related to a problem? Please describe.
My organization manages and runs our own LLMs and serves them internally with a custom API. We would like to use Phoenix to run Evals using our internal LLMs.

Describe the solution you'd like
I'd like a higher-level interface or abstract class I can inherit from to implement a Phoenix BaseEvalModel when using my own internally hosted LLM that is served via a REST API. I would like to implement only two methods: one to set up an httpx client configured to send requests to a specific endpoint, and a second to parse the response into an LLM output string.

  • As a user, I can manage my own Auth and keys for the request

The resulting class should be a fully Phoenix-compatible BaseEvalModel that I can use with all evals features and built-in prompt templates.
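
For illustration, a minimal sketch of the kind of base class described above (the class and method names here are hypothetical and not an existing Phoenix API; Phoenix would own everything except the two endpoint-specific hooks):

from abc import ABC, abstractmethod
from typing import Any

import httpx


class RestEvalModel(ABC):  # hypothetical helper; would plug into Phoenix's BaseEvalModel
    """Subclasses describe only how to reach their endpoint and how to parse its response."""

    @abstractmethod
    def create_client(self) -> httpx.Client:
        """Return an httpx.Client whose base_url and headers point at the internal endpoint."""

    @abstractmethod
    def parse_response(self, response: httpx.Response) -> str:
        """Extract the completion string from the endpoint's raw response."""

    def _generate(self, prompt: str, **kwargs: Any) -> str:
        # Phoenix would own retries, async execution, and token accounting;
        # only the two hooks above are endpoint-specific.
        client = self.create_client()
        response = client.post(client.base_url, json={"prompt": prompt, **kwargs})
        response.raise_for_status()
        return self.parse_response(response)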

Describe alternatives you've considered

  1. LiteLLM provides the ability to wrap custom models as long as they are served on an OpenAI-compatible completions API. Not all of our internal models are served on the same endpoints, so we need something more general-purpose.
  2. The barrier to implementing our own version of a BaseEvalModel seems high, as it's hard to know whether we've done so correctly in a way that will work with all Phoenix eval features. Since we host and spin up many different company models for many different purposes, we would like it to be easy to configure one for use with Phoenix at any time.

Example curl command below, conforming to the OpenAI completions endpoint:

curl --location 'https://genai.<domain>.com/api/v1/genai/completions' \
--data '{
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "prompt": "is it a holiday?",
    "presence_penalty": 1.15,
    "max_tokens": 200
}' 

We may also pass additional headers, e.g. ‘Context-Name’.

An example implementation of a BaseEvalModel subclass can be seen in the code example below:

import subprocess
from typing import Any, Dict, List

import httpx

# Adjust this import to the layout of your Phoenix version.
from phoenix.experimental.evals.models import BaseEvalModel


class CustomRequestModel(BaseEvalModel):
    temperature: float = 0.3
    max_tokens: int = 1024000
    top_p: float = 1
    top_k: int = 10

    def __post_init__(self) -> None:
        # Fetch a short-lived access token via gcloud and build the HTTP clients
        # used for both sync and async requests.
        process = subprocess.run(
            ["gcloud", "auth", "print-access-token"], capture_output=True, text=True
        )
        self._access_token = process.stdout.strip()
        self._headers = {
            "Authorization": f"Bearer {self._access_token}",
            "Content-Type": "application/json",
        }
        self.endpoint = "https://us-central1-aiplatform.googleapis.com/v1/projects/<id>/locations/us-central1/endpoints/<id>:predict"
        self.client = httpx.Client(headers=self._headers)
        self.async_client = httpx.AsyncClient(headers=self._headers)

    def _generate(self, prompt: str, **kwargs: Any) -> str:
        # POST the formatted request and pull the completion text out of the response.
        res = self.client.post(self.endpoint, json=self._format_request(prompt))
        res.raise_for_status()
        completion = res.json()["predictions"][0]
        return completion.split("Output:")[-1]

    async def _async_generate(self, prompt: str, **kwargs: Any) -> str:
        # Async variant of _generate using the shared httpx.AsyncClient.
        res = await self.async_client.post(self.endpoint, json=self._format_request(prompt))
        res.raise_for_status()
        completion = res.json()["predictions"][0]
        return completion.split("Output:")[-1]

    def _format_request(self, prompt: str) -> Dict[str, Any]:
        # Map the prompt and generation parameters onto the endpoint's request schema.
        return {
            "instances": [
                {
                    "prompt": prompt,
                    "max_tokens": self.max_tokens,
                    "temperature": self.temperature,
                    "top_p": self.top_p,
                    "top_k": self.top_k,
                }
            ],
        }

    # The remaining abstract members are difficult to implement meaningfully for a
    # custom REST endpoint, which is part of the motivation for this request.
    @property
    def encoder(self) -> "Encoding":
        ...

    @property
    def max_context_size(self) -> int:
        ...

    def get_text_from_tokens(self, tokens: List[int]) -> str:
        ...

    def get_tokens_from_text(self, text: str) -> List[int]:
        ...
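
For context, the goal is that such a class can be dropped in wherever a built-in model is accepted, for example (illustrative only, using the phoenix.experimental.evals API current at the time of writing):

import pandas as pd

from phoenix.experimental.evals import (
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    llm_classify,
)

# A toy dataframe with the columns the built-in relevancy template expects.
df = pd.DataFrame(
    {
        "input": ["is it a holiday?"],
        "reference": ["A retrieved document about public holidays."],
    }
)

model = CustomRequestModel()
results = llm_classify(
    dataframe=df,
    model=model,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    rails=list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values()),
)
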
@amank94 amank94 added enhancement New feature or request triage issues that need triage labels Feb 14, 2024
@github-project-automation github-project-automation bot moved this to 📘 Todo in phoenix Feb 14, 2024

dosubot bot commented Feb 14, 2024

🤖

That sounds like a great enhancement request for the Arize Phoenix software! It's fantastic that you've already put thought into the solution and provided an example implementation of a BaseEvalClass. Your proposed solution for a higher level interface or abstract class to work with internally-hosted LLMs via a REST API seems well-considered. I encourage you to go ahead and create a pull request with your proposed changes to the repository. Your contribution will be valuable to the community and the project. Thank you for your initiative!


@dosubot dosubot bot added c/dx Developer experience c/evals labels Feb 14, 2024
@mikeldking
Contributor

Discussed a few different strategies with @amank94:

  1. If the API is OpenAI-compatible (e.g. https://docs.anyscale.com/endpoints/model-serving/openai-migration-guide), the user should just be able to use model_kwargs for now; this requires some testing. As a fallback we can suggest LiteLLM. (A rough sketch of this approach follows the list.)
  2. Alternatively, we can give the user the ability to implement the model interface. For this to be forward-compatible we first need to complete 🗺️ [Evals] Promote evals out of experimental #2142; @anticorrelator is scoping out the minimum interface possible there.
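
A rough sketch of option 1 against the endpoint from the issue description, assuming the Phoenix OpenAIModel accepts a base_url pointing at an OpenAI-compatible server (parameter names differ slightly across Phoenix releases, so treat this as illustrative):

import os

from phoenix.experimental.evals import OpenAIModel

# Route Phoenix's OpenAI wrapper to the internally hosted, OpenAI-compatible endpoint.
# The URL and token are placeholders for the user's own deployment.
model = OpenAIModel(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # some releases name this parameter model_name
    base_url="https://genai.<domain>.com/api/v1/genai",
    api_key=os.environ.get("INTERNAL_GENAI_API_KEY", "unused"),
)

# model can then be passed to llm_classify / llm_generate like any built-in model.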

@mikeldking mikeldking removed the triage issues that need triage label Feb 14, 2024
mkhludnev added a commit to mkhludnev/phoenix that referenced this issue Mar 2, 2024
@mkhludnev
Contributor

I think LiteLLMModel with model_kwargs, or just with os.environ, is the way to go. If anyone would like to have #2423 extended with an OpenAI API example, I can add it. A rough sketch of the os.environ approach is below.
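
(A sketch based on LiteLLM's environment-variable conventions and the self-hosted ollama setup from #2423; adjust names to your Phoenix version.)

import os

from phoenix.experimental.evals import LiteLLMModel  # import path varies by Phoenix version

# LiteLLM reads the serving endpoint from environment variables, so a self-hosted
# model needs no custom Phoenix code. For an ollama server:
os.environ["OLLAMA_API_BASE"] = "http://localhost:11434"

model = LiteLLMModel(model="ollama/llama2")

# An OpenAI-compatible internal server works the same way via
# OPENAI_API_BASE / OPENAI_API_KEY and model="openai/<model-name>".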

mkhludnev added a commit to mkhludnev/phoenix that referenced this issue Mar 5, 2024
anticorrelator pushed a commit that referenced this issue Mar 7, 2024

* demonstrate using selfhosted ollama as Eval Model. #2280

* openAI mock test

* sweep

* move, rename

* imports
@mikeldking
Copy link
Contributor

Marking this as resolved via LiteLLM for now.

@github-project-automation github-project-automation bot moved this from 📘 Todo to ✅ Done in phoenix May 13, 2024