-
Description
I recently tried to implement parallel processing of tokens, inspired by baby-llama, i.e. I'm trying to change the dimension of tokens from [1 x N] to [M x N] to process several tokens in parallel at once. Here you can find my fork with the first experiment. Note: I'm using an Apple M2 Max.

Problem
Results on CPU and GPU differ. First I built my experimental app (input-batches-experiment) with -DLLAMA_METAL=OFF, and the output was correct.
But when I built the same app with -DLLAMA_METAL=ON, I got totally inconsistent results (only the first batch is generated correctly).
Question
It looks like I missed something in the Metal initialisation, or I need to make changes to that part of the project.
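For context, here is a minimal sketch of the "M tokens per forward pass" idea expressed with the llama_batch API that llama.cpp exposes (this may differ from what my fork does internally, and eval_tokens_in_parallel is a hypothetical helper name — treat it as an illustration of the idea, not the code from the experiment):

```cpp
#include "llama.h"

// Hypothetical helper: evaluate M prompt tokens of a single sequence
// in one llama_decode call instead of feeding them one at a time.
static int eval_tokens_in_parallel(llama_context * ctx, const llama_token * tokens, int m) {
    llama_batch batch = llama_batch_init(m, /*embd =*/ 0, /*n_seq_max =*/ 1);

    for (int i = 0; i < m; i++) {
        batch.token[i]     = tokens[i];
        batch.pos[i]       = i;     // position within the sequence
        batch.n_seq_id[i]  = 1;
        batch.seq_id[i][0] = 0;     // all tokens belong to sequence 0
        batch.logits[i]    = false; // logits only needed for the last token
    }
    batch.n_tokens      = m;
    batch.logits[m - 1] = true;

    const int ret = llama_decode(ctx, batch); // one forward pass over all M tokens
    llama_batch_free(batch);
    return ret;
}
```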
-
There is a batched example available now for parallel decoding.
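A minimal sketch of the pattern that example demonstrates, assuming the llama_batch API (decode_parallel is a hypothetical helper name): one llama_decode call advances every parallel sequence by one token.

```cpp
#include "llama.h"
#include <vector>

// Hypothetical helper: decode one new token for each of n_parallel
// sequences in a single forward pass, as in the batched example.
static int decode_parallel(llama_context * ctx,
                           const std::vector<llama_token> & next_tokens, // one token per sequence
                           llama_pos pos) {
    const int n_parallel = (int) next_tokens.size();
    llama_batch batch = llama_batch_init(n_parallel, /*embd =*/ 0, /*n_seq_max =*/ n_parallel);

    for (int s = 0; s < n_parallel; s++) {
        batch.token[s]     = next_tokens[s];
        batch.pos[s]       = pos;  // same position, different sequences
        batch.n_seq_id[s]  = 1;
        batch.seq_id[s][0] = s;    // each token extends its own sequence
        batch.logits[s]    = true; // sample from every stream afterwards
    }
    batch.n_tokens = n_parallel;

    const int ret = llama_decode(ctx, batch); // one pass advances all streams
    llama_batch_free(batch);
    return ret;
}
```

Afterwards the logits for stream s can be read with llama_get_logits_ith(ctx, s), since logits were requested for every token in the batch.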
-
What does the prompt eval speed look like?
-
@Xarbirus do you have any plans to merge your experiments into …