Llava Initial approach to clear images #664

Merged
merged 5 commits into SciSharp:master from LLavaResetOnImageChange on Apr 16, 2024

Conversation

SignalRT
Collaborator

This changes the InteractiveExecutor so that the conversation can continue when new images are introduced. The example (outside the executor) clears the KV_CACHE to discard the previous images.
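
For illustration, a minimal C# sketch of the pattern described above: an InteractiveExecutor with LLaVA weights, where the KV cache is cleared from outside the executor whenever the conversation moves to a new image. This is not the PR's actual example; the model paths, prompt format, Images collection, and KvCacheClear call are assumptions and may differ between LLamaSharp versions.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;
using LLama;
using LLama.Common;

// Load the language model and the LLaVA projection (paths and parameters are placeholders).
var parameters = new ModelParams("<llava-model.gguf>") { ContextSize = 4096, GpuLayerCount = 32 };
using var model = LLamaWeights.LoadFromFile(parameters);
using var clip = LLavaWeights.LoadFromFile("<llava-mmproj.gguf>");
using var context = model.CreateContext(parameters);

var executor = new InteractiveExecutor(context, clip);
var inferenceParams = new InferenceParams { AntiPrompts = new List<string> { "\nUSER:" } };

async Task AskAboutImageAsync(string imagePath, string question)
{
    // Clear the KV cache so tokens from the previous image do not leak into this
    // turn (method name assumed; some versions expose KvCacheRemove or the native
    // llama_kv_cache_clear instead).
    context.NativeHandle.KvCacheClear();

    // Attach the new image (collection name assumed; other versions take image paths).
    executor.Images.Clear();
    executor.Images.Add(await File.ReadAllBytesAsync(imagePath));

    await foreach (var token in executor.InferAsync($"USER: <image>\n{question}\nASSISTANT:", inferenceParams))
        Console.Write(token);
}

await AskAboutImageAsync("scene-1.jpg", "Describe the image.");
await AskAboutImageAsync("scene-2.jpg", "Describe the image.");
```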

@SignalRT
Collaborator Author

This is ready to review.

  • I modified the example to show how to manage the conversation when we want to switch to a different image.
  • I updated the documentation with the examples and the ILLamaExecutor interface change.

As an example:

(attached demo: Llava3)

@zsogitbe
Contributor

zsogitbe commented Apr 14, 2024

SignalRT, 3 short remarks.

  • What is your aim with having several images per prompt? The context size will limit this: each added image enlarges the prompt, so the maximum answer length (the difference between the context size and the prompt size) shrinks. I am not sure this is a feasible approach on the GPU, and it will probably make everything slower on the CPU. I also think it makes the example complicated.
    Are you sure that llama.cpp handles several images per prompt well?
  • I have noticed that llava uses 20% more GPU memory than before. I am not sure whether this comes from the C# code or the C++ code. I was running some tests with IntPtrMax's code and that seems to use 20% less GPU memory. Can you think of any reason for the higher GPU memory usage?
  • I think that using ModelContext.GetState() and ModelContext.LoadState(state) is a better solution (both in llama and llava); see the sketch after this list. I am not sure about the KV_CACHE, and from ggerganov's remark I think he was also not sure about it.
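
For reference, a minimal sketch of the state save/restore alternative mentioned in the last remark, using LLamaContext's SaveState/LoadState (the method names, the file-based variant, and the placeholder paths are assumptions; in-memory GetState()/LoadState(State) variants exist in some versions). As discussed later in the thread, this approach did not work reliably for the llava case.

```csharp
using LLama;
using LLama.Common;

// Reset a context from a saved state instead of clearing the KV cache.
// The file path and the SaveState/LoadState API names are assumptions.
var parameters = new ModelParams("<llava-model.gguf>");
using var model = LLamaWeights.LoadFromFile(parameters);
using var context = model.CreateContext(parameters);

// 1. Snapshot the freshly created context, before any image has been evaluated.
context.SaveState("clean-context.bin");

// ... evaluate image A and chat about it ...

// 2. Restore the snapshot so the next image starts from a clean context.
context.LoadState("clean-context.bin");
```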

@martindevans
Member

This feels a bit like a hack to work around not having proper KV cache management in the executors, to be honest. But without a redesign of the way executors work with sequence IDs etc. (which is what I'm working on, slowly, with BatchedExecutor), this is probably the best option for now.

@AsakusaRinne
Collaborator

But without a redesign of the way executors work with sequence IDs etc. (which is what I'm working on, slowly, with BatchedExecutor), this is probably the best option for now.

@martindevans I've recently been working on such things as well (experimental), inspired by vLLM. There is a lot of work to do, so I think I'll add it to a preview branch or put it under an experimental namespace at first. Since you are also working on this, I'll push a draft ASAP so we can discuss how to build it together. :) I'd also be more than happy to see your ideas. Please feel free to ping me if there is anything I can help with.

@SignalRT
Collaborator Author

@zsogitbe

(quoting the three remarks above)
  1. The goal is to support the capabilities of the model: Multiple images for eval.run_llava haotian-liu/LLaVA#432. You can use one image if you want; I don't see a problem with keeping an interface that can pass more than one image (see the sketch after this list). Other multimodal models also support several images.
  2. With my configuration I cannot see any difference in GPU memory usage between this PR and the previous version. I cannot compare with other architectures. Did you test this with exactly the same llama/llava DLL versions in all scenarios?
  3. I gave your proposal a try, without success: the model began to mix images in its answers. This is the answer it gave for the third image of my previous demo, mixing two different scenes: "This is a color photograph depicting a young woman standing on a wooden bridge overlooking a canal with a cup of coffee. The woman appears to be looking off into the distance, away from her red shirt. In the background, there are trees and a building in the background."
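
To make the interface argument concrete, here is a hedged sketch of attaching more than one image through the executor-level Images collection. The property name, the ILLamaExecutor shape, and the <image> placeholder convention are assumptions and may differ between LLamaSharp/llava versions.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;
using LLama.Abstractions;
using LLama.Common;

// Hypothetical helper (not part of the PR): attach two images to a single prompt.
static class MultiImageDemo
{
    public static async Task DescribeBothAsync(ILLamaExecutor executor, string pathA, string pathB)
    {
        // Attach both images; callers that only need one image simply add one.
        executor.Images.Clear();
        executor.Images.Add(await File.ReadAllBytesAsync(pathA));
        executor.Images.Add(await File.ReadAllBytesAsync(pathB));

        var inferenceParams = new InferenceParams { AntiPrompts = new List<string> { "\nUSER:" } };

        // One prompt that references both image placeholders.
        await foreach (var token in executor.InferAsync(
                           "USER: <image>\n<image>\nDescribe both images.\nASSISTANT:", inferenceParams))
        {
            Console.Write(token);
        }
    }
}
```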

@SignalRT
Collaborator Author

@martindevans, @AsakusaRinne
I'm also trying out a new batch executor (currently outside LLamaSharp, to simplify my tests). I'm focusing on concurrent server usage. I'm gathering some performance metrics for this approach and will share them with you once the comparison is done.

I will integrate this PR "as is". We will probably need to change llava support in the near future (ggerganov/llama.cpp#4216 (comment)).

@zsogitbe
Contributor

zsogitbe commented Apr 16, 2024

Thank you for your answer, SignalRT.
In general the llava code is very well done! Thank you for your effort.

After reading the reference you provided above, and taking into account that the models are trained with only one image per input record, I do not think it makes sense to use multiple images as input for one prompt. I also do not immediately see a good use case for such functionality. I think the model will be confused; in the best case it will describe the multiple images as if they were a single image (for example, a collage), or it will just describe one of them.
I also think the example would be easier to understand with only one input image. It is a bit confusing what is happening in the example now (I had to think very hard to understand it, and that should not be the aim: an example should be simple and easy to understand).
But it is up to you to decide.

I noticed the higher GPU usage because I have an example that no longer runs on the GPU (out of memory) but used to fit into GPU memory. I do not know what happened or where; I only see the result. Maybe it has to do with the GPU memory allocation type (automatic thread pool with overhead). I think the GPU memory allocation type has changed in the C# code or in the C++ code. I hope we will get some option to optimize GPU memory usage, because it is very important for speed (everything is much faster on the GPU).

I have also tried saving/loading the state and it did not work: after loading the context state the model produces garbage in my case. Probably a bug in state management somewhere. Clearing the KV cache does work.

@martindevans, @AsakusaRinne, @SignalRT, does state saving/loading work with the general llama code base? This is important because I was counting on it (I use it to reset the context without recreating it). Or do we need to use KV cache clearing there as well?

@martindevans
Member

I'm not aware of any issues with state load/save at the moment. It's something I'm going to be actively working on soon (load/save conversations in the BatchedExecutor), so if there are any issues they'll be fixed soon :)

SignalRT merged commit 399e81d into SciSharp:master on Apr 16, 2024
3 checks passed
SignalRT deleted the LLavaResetOnImageChange branch on April 20, 2024