
[Feature]: Add efficient interface for evaluating probabilities of fixed prompt-completion pairs #5234

@xinyangz

Description


Proposed Feature

Add an efficient interface for evaluating generation probabilities of fixed prompt and completion pairs. For example:

# ... load LLM or engine
prompt_completion_pairs = [
    ("1 + 1 = ", "2"),
    ("1 + 1 = ", "3"),
]
prompts, completions = list(zip(*prompt_completion_pairs))
probs = llm.completion_logprobs(prompts=prompts, completions=completions)

Alternatively, the interface could evaluate the probabilities of a fixed prompt with multiple generation options to better leverage prefix caching:

prompt = "1 + 1 = "
completions = ["2", "3", "4"]
probs = llm.completion_logprobs(prompt=prompt, completions=completions)

Currently, SamplingParams has interfaces to return the log probabilities of the prompt (prompt_logprobs) and of the generated tokens (logprobs). However, they are either inefficient or have incomplete support for this use case.
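For reference, these are the two existing knobs on SamplingParams (real parameters, shown here only for context):

from vllm import SamplingParams

# logprobs=k returns the top-k log probabilities for each generated token;
# prompt_logprobs=k additionally returns them for every prompt token.
params = SamplingParams(temperature=0, max_tokens=1, logprobs=1, prompt_logprobs=1)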

Motivation

The motivation for this feature comes from LLM evaluation on multiple-choice questions (e.g., MMLU). vLLM is a popular tool adopted by mainstream LLM evaluation frameworks (e.g., lm-evaluation-harness) for this purpose.

Using the following example:

Question: Which of the following is true?
(A) ABC
(B) DEF
The answer is:

Evaluating a base LLM on this question involves calculating the probability of each choice, $P_{\text{LLM}}(\text{choice} \mid \text{question})$, and selecting the choice with the highest probability.
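For a multi-token choice, this probability factorizes over the choice tokens by the chain rule, so per-token log probabilities are sufficient:

$$\log P_{\text{LLM}}(\text{choice} \mid \text{question}) = \sum_{t=1}^{T} \log P_{\text{LLM}}(c_t \mid \text{question}, c_{<t})$$

where $c_1, \dots, c_T$ are the tokens of the choice.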

Current solution

Currently, lm-evaluation-harness runs one generation per choice and evaluates the full prompt probabilities for this purpose:

question = "1 + 1 = "
choices = ["2", "3", "4"]
prompts = [question + c for c in choices]
sampling_params = SamplingParams(temperature=0, max_tokens=1, prompt_logprobs=1)
outputs = llm.generate(prompts=prompts, sampling_params=sampling_params)

Instead of evaluating probabilities only on the choice tokens, it evaluates them on the full question + choice strings and runs one generation per choice, because of limitations in vLLM's user interface.
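As a minimal sketch (not lm-evaluation-harness's actual code), the per-choice log probabilities can be recovered from the outputs above by summing the prompt_logprobs entries that correspond to the choice tokens at the end of each prompt. This continues from the snippet above and assumes the choice tokenizes to the same token ids when appended to the question:

# Hedged sketch: extract log P(choice | question) from the prompt_logprobs above.
tokenizer = llm.get_tokenizer()
choice_logprobs = []
for choice, output in zip(choices, outputs):
    # Number of tokens the choice occupies at the end of question + choice
    # (assumes the question/choice boundary does not change tokenization).
    n_choice_tokens = len(tokenizer.encode(choice, add_special_tokens=False))
    # prompt_logprobs[i] is a dict {token_id: Logprob} for prompt token i
    # (the very first entry is None, since the first token has no context).
    per_token = output.prompt_logprobs[-n_choice_tokens:]
    token_ids = output.prompt_token_ids[-n_choice_tokens:]
    choice_logprobs.append(sum(d[tid].logprob for d, tid in zip(per_token, token_ids)))

best_choice = choices[choice_logprobs.index(max(choice_logprobs))]

This works, but every prompt token pays the prompt_logprobs cost even though only the last few positions are actually needed.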

Efficiency issue with current solution

The issue with prompt_logprobs is that it is very inefficient on long prompts.

Let's use the following minimal profiling example (profiling.py):

import time

import numpy as np
from vllm import LLM, SamplingParams

# 100 random "prompts" of 4,000 tokens each (token ids drawn uniformly below 10,000).
n_seqs = 100
vocab_size = 10_000
seq_len = 4000
data = np.random.randint(0, vocab_size, (n_seqs, seq_len)).tolist()

llm = LLM("mistral-community/Mistral-7B-v0.2", max_model_len=8000, gpu_memory_utilization=0.6)
sampling_params = SamplingParams(temperature=0, max_tokens=1, prompt_logprobs=1)

start = time.perf_counter()
outputs = llm.generate(prompts=None, prompt_token_ids=data, sampling_params=sampling_params)
end = time.perf_counter()

print(f"Inference took {end - start:.4f} seconds")

Running the code with vLLM's official docker image:

docker run --gpus all --shm-size=10g --rm -e HF_TOKEN=[token] -v "$(pwd):/app" --entrypoint python3 vllm/vllm-openai:v0.4.3 /app/profiling.py

On a single A100-40G GPU, the script takes around 500 seconds with prompt_logprobs=1 and only 27 seconds without prompt_logprobs. Moreover, we can fit much longer input prompts if we turn it off.

Analysis on the efficiency issue

A quick search for prompt_logprobs leads to the Sampler.forward method in vllm/model_executor/layers/sampler.py.

First, we noticed that the shape of the logits tensor grows from (1, vocab_size) to (input_len, vocab_size) when prompt_logprobs is set. Second, we found that get_logprobs involves many Python for loops and CPU-GPU transfers.
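A back-of-the-envelope calculation (assuming Mistral's ~32k vocabulary and fp16 logits; neither number appears in the profiling script above) shows how much extra data each sequence drags through the sampler and the Python-side logprob extraction:

# Rough size of the per-sequence logits when prompt_logprobs is set.
input_len, vocab_size, bytes_per_elem = 4000, 32_000, 2  # assumed fp16
print(f"{input_len * vocab_size * bytes_per_elem / 2**20:.0f} MiB per sequence")  # ~244 MiB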

Potential Changes

I see two ways to fix the efficiency issue with the current approach.

Option 1: Use prompt_logprobs but don't calculate on the full prompt

We could reuse prompt_logprobs but limit the probability calculation to the final few tokens, so we don't have to pass around the large logits array.
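A hypothetical shape for such an interface (the parameter name below does not exist in vLLM and is only illustrative):

from vllm import SamplingParams

# Hypothetical: only compute prompt logprobs for the final N prompt tokens,
# so the full (input_len, vocab_size) logits never need to be materialized.
sampling_params = SamplingParams(
    temperature=0,
    max_tokens=1,
    prompt_logprobs=1,
    prompt_logprobs_last_n=4,  # hypothetical parameter, not in vLLM
)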

Option 2: Use the sampling logprobs but constrain the generation to the choices

Currently, there is a controlled (guided) generation interface for the OpenAI-compatible server, but not for offline inference.
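For context, a sketch of what the server-side interface already allows (guided_choice restricts decoding to the listed strings; exact request fields may differ between versions, so treat this as an approximation):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.completions.create(
    model="mistral-community/Mistral-7B-v0.2",
    prompt="1 + 1 = ",
    max_tokens=1,
    temperature=0,
    logprobs=1,
    extra_body={"guided_choice": ["2", "3", "4"]},  # server-only feature today
)
print(resp.choices[0].logprobs.token_logprobs)

Note that with guided decoding the returned logprobs are typically computed after the logits mask is applied, so they may be renormalized over the allowed tokens rather than the full vocabulary; the offline interface proposed here would return unconstrained probabilities.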

Alternatives

No response

Additional context

No response
