Llama-3.2 11B Vision Support #9643

yukiarimo · 2024-09-25T20:00:17Z

Is it working right now in any way?

stduhpf · 2024-09-25T20:14:31Z

Most likely not. I believe its architecture would be closer to Pixtral (which is unsupported) than to Llava.

mirek190 · 2024-09-25T23:02:40Z

Currently any new vision+text models are not supported.

If llamacpp wants to exist in the future, it must implement text+vision models, as those will become more and more common.
Voice will probably follow soon as well.

yukiarimo · 2024-09-25T23:14:33Z

True. Is it possible to run it somehow on macOS not using llama.cpp?

Animaxx · 2024-09-26T02:52:49Z

Not sure if MLX can run it...?

MaratG2 · 2024-09-26T06:32:13Z

I got an error that the architecture is unsupported, which was to be expected.

MoonRide303 · 2024-09-26T07:26:10Z

First milestone could be ignoring non-text capabilities, and simply let people use text input/output to interact with multimodal models.

Below is the output from the current (llama.cpp b3827) convert_hf_to_gguf.py (so that people searching for this error can find this issue):

INFO:hf-to-gguf:Loading model: Llama-3.2-11B-Vision-Instruct
ERROR:hf-to-gguf:Model MllamaForConditionalGeneration is not supported
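
For anyone hitting this error: the architecture name in the message (MllamaForConditionalGeneration) is the one listed in the architectures field of the checkpoint's config.json, so that field can be inspected before attempting a conversion. The sketch below is a minimal illustration, not the converter's own code. The helper names and the SUPPORTED_ARCHITECTURES set are hypothetical; the real list of supported architectures lives in convert_hf_to_gguf.py.

```python
import json
from pathlib import Path

# Hypothetical allow-list for illustration only; the actual converter
# maintains its own registry of supported architectures.
SUPPORTED_ARCHITECTURES = {"LlamaForCausalLM", "MistralForCausalLM"}


def read_architectures(model_dir):
    """Return the 'architectures' list from a checkpoint's config.json."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    return config.get("architectures", [])


def unsupported_architectures(model_dir, supported=SUPPORTED_ARCHITECTURES):
    """Return the architectures in config.json that are not in `supported`."""
    return [arch for arch in read_architectures(model_dir) if arch not in supported]
```

Calling unsupported_architectures on a local copy of Llama-3.2-11B-Vision-Instruct should return a list containing MllamaForConditionalGeneration, the name that appears in the error above.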

thiswillbeyourgithub · 2024-09-26T13:13:12Z

FYI the repo owner has a related statement here:

chigkim My PoV is that adding multimodal support is a great opportunity for new people with good software architecture skills to get involved in the project. The general low to mid level patterns and details needed for the implementation are already available in the codebase - from model conversion, to data loading, backend usage and inference. It would take some high-level understanding of the project architecture in order to implement support for the vision models and extend the API in the correct way.

We really need more people with this sort of skillset, so at this point I feel it is better to wait and see if somebody will show up and take the opportunity to help out with the project long-term. Otherwise, I'm afraid we won't be able to sustain the quality of the project.

So if you know anyone with strong skills or have influence over big tech and can motivate them please do!

HanClinto · 2024-09-26T15:53:51Z

It may be helpful to draw a distinction between multimodal / vision support in the core llama.cpp library vs. multimodal / vision support in llama.cpp's server.

Multimodal support was removed from the server in #5882, but it was not removed from the core library / command-line. I believe that ggerganov's comments re: looking for new developers to support vision models in the API is talking about the server -- not the core library.

This is how wrappers (such as ollama) are still able to provide an API to serve multimodal models via llama.cpp as a back-end. Ollama provides the HTTP server, and llama.cpp still does the core processing (including multi-modal support).

Long-term, I kinda' wonder if it isn't in llama.cpp's interests to stop supporting the HTTP server altogether, and instead farm that out to other wrapper projects (such as ollama), while we focus on enhancing the capabilities of the core API.

thiswillbeyourgithub · 2024-09-26T15:57:33Z

Oh thank you immensely for that clarification. I wasn't even aware that llama.cpp had a server as it seems so redundant with the other efforts such as ollama so I agree with you. Thanks a lot!

JohannesGaessler · 2024-09-26T17:35:32Z

Long-term, I kinda' wonder if it isn't in llama.cpp's interests to stop supporting the HTTP server altogether, and instead farm that out to other wrapper projects (such as ollama), while we focus on enhancing the capabilities of the core API.

I may be misremembering, but I think ollama internally forwards its calls to a llama.cpp HTTP server. In any case, my personal opinion is that I would rather have a server be part of the project instead of having to rely on a third party.

HanClinto · 2024-09-26T18:03:28Z

Sorry to take this issue off topic, but hopefully it's relevant enough to warrant continuing it here.

Thank you for the correction -- it looks like you are right!

Best I can tell, ollama maintains a fork of llama.cpp's server that branched off of b2356 (our last release version that supported multimodal). Since then, they have continued updating that branch of server.cpp to add new features, remove the web front-end that we include in ours, maintain multimodal support, etc. I'm going through the diff to try and parse through what's being brought over and what's not (I'm not fully clear on their update strategy / method), but it seems to be a full-on fork at this point.

I haven't yet figured out how much their server maintains the spirit of the refactoring from #5882, or if merging their version of server.cpp into ours would be too much of a regress. If we're going to continue this discussion much further, perhaps opening a new issue to discuss sync'ing our version of server.cpp with ollama's would be useful?

Thellton · 2024-09-27T03:19:41Z

It may be helpful to draw a distinction between multimodal / vision support in the core llama.cpp library vs. multimodal / vision support in llama.cpp's server.

Multimodal support was removed from the server in #5882, but it was not removed from the core library / command-line. I believe that ggerganov's comments re: looking for new developers to support vision models in the API is talking about the server -- not the core library.

This is how wrappers (such as ollama) are still able to provide an API to serve multimodal models via llama.cpp as a back-end. Ollama provides the HTTP server, and llama.cpp still does the core processing (including multi-modal support).

Long-term, I kinda' wonder if it isn't in llama.cpp's interests to stop supporting the HTTP server altogether, and instead farm that out to other wrapper projects (such as ollama), while we focus on enhancing the capabilities of the core API.

Honestly, I agree with that last point. Maybe the server should be deprecated in favour of an API that lets others wrap a server around it, so that llamacpp can focus on its core competency whilst others handle the specifics of actually serving inference. Having an actual server included as part of the program has been good, but are the program and the community well served by clinging to it now and into the future, when it's very apparent that others are perfectly happy to address the issue of serving models, as is happening with ollama?

The same might even be said of the number of inference backends (CUDA, ROCm, various CPUs of both x86 and ARM flavours, Vulkan, and SYCL), as there is technically a great deal of duplication of effort in that regard. It'd be interesting to see how technically feasible it would be for, say, Vulkan to reach feature parity with CUDA, ROCm, and CPU. Of course, at the end of the day, I'm basically all talk on this matter, as it'd be years before I'd have the competence to contribute :/

JohannesGaessler · 2024-09-27T07:25:54Z

Open source projects aren't run like a company. There isn't a boss at the top directing people to work on specific things, people are choosing what to work on out of their own volition. Removing the server isn't going to result in more resources going towards other aspects of the project, it's going to result in the people who are currently choosing to contribute to and maintain the server being frustrated with their efforts being "wasted". The only aspect where I think there is a real zero sum game is code review since the vast majority of work is done by Georgi and slaren.

Thellton · 2024-09-27T10:45:07Z

yeah... kind of just accentuates just how frustrating it is to want to help, to contribute and yet knowing it'd be years before I could do so in a way that is useful :/ basically just a plain bummer.

jrp2014 · 2024-09-29T17:23:37Z

It works OK on a MacBook Pro using the example code on the Hugging Face page. I tried to get it to caption and keyword some pics. It doesn't seem to understand what a keyword is and instead produces a good deal of prose. I preferred mlx-vlm + Llava 1.6, which was also a bit faster to run.

yukiarimo · 2024-09-30T17:56:19Z

@jrp2014 Can you please provide the exact code you used?

I tried loading the full model, but got zsh killed :(

vietvudanh · 2024-10-02T02:26:31Z

True. Is it possible to run it somehow on macOS not using llama.cpp?

You can use MLX. I have run Phi3.5 Vision Q4 with mlx-vlm. I have not tried Llama 3.2 yet.

vietvudanh · 2024-10-07T04:20:09Z

Currently any new vision+text models are not supported.
If llamacpp want to exist in the future must to implement text+vision models as that will be more and more common. Soon probably voice as well.

Keep an eye on the news, they are allegedly moving towards text+vision support. I think they understand the necessity.

https://ollama.com/blog/llama3.2

Ollama is not llama.cpp

muzhig · 2024-10-07T05:20:52Z

https://github.com/EricLBuehler/mistral.rs can run this model, as well as a bunch of others. It also supports quantization including GGUF & acceleration including Apple Silicon. It's more experimental though, less stable, less documented.

jrp2014 · 2024-10-11T21:15:25Z

@jrp2014 Can you please provide the exact code you used?

I tried loading the full model, but got zsh killed :(

import requests
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor
import os
from pathlib import Path

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="mps",
)
processor = AutoProcessor.from_pretrained(model_id)

picpath = "/Users/xxx/Pictures/Processed"
pics = sorted(Path(picpath).iterdir(), key=os.path.getmtime, reverse=True)
pic = str(pics[0])
print(pic)

#url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
#image = Image.open(requests.get(url, stream=True).raw)

image = Image.open(pic)

messages = [
    {"role": "user", "content": [
        {"type": "image"},
 #       {"type": "text", "text": "Provide a caption and a list of keywords for this image, suitable for microstock."}
 #       {"type": "text", "text": "Analyze this image and provide a title (max 70 characters), description (max 200 characters), and comma-separated keywords suitable for microstock photography websites."}
        {"type": "text", "text": "You are an AI assistant that helps people craft a clear and detailed sentence that describes the content depicted in an image.  Then generate a list of descriptive, comma-separated tags for the following image. Analyze the image carefully and produce tags that accurately represent the image. Ensure the tags are relevant."}
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, input_text, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=500)
print(processor.decode(output[0]))

Dampfinchen · 2024-10-19T10:40:21Z

Really sad to see the lack of support for vision models like Llama Vision and countless others. Not even the Llama Vision models, the namesake of llama.cpp, are supported.

It seems llama.cpp lacks talent for multimodal models currently. I truly hope new talent will arrive at some point, because I fear for the future of the project as multimodal support gets increasingly popular (in fact, there's a high chance Llama 4 might even be omnimodal, which would be a death blow to llama.cpp in its current state).

Fingers crossed.

cognitivetech · 2024-10-19T11:01:57Z

these low-key threats to "the developers" about the future of this project are super cringe

it's entirely possible to run those models from your computer without llama.cpp.

why not use the provided code, and adapt it to whatever will fit locally?

you could even *gasp* ask chatGPT how to do it...

really embarrassing the conduct of people impatiently waiting for changes they aren't willing to do an ounce of work towards!

Dampfinchen · 2024-10-19T11:26:00Z

There's no need to be so hostile. It wasn't meant as a threat at all, and I respect the developers and the huge amount of work that went into this project very much. Which is exactly why I have these concerns, otherwise I would just ignore it.

All I'm saying is the future is clearly multimodal and I fear that the project might get less relevant when it doesn't adapt to the changing landscape. It's unfathomable how this could be taken as a threat against the skilled developers here.

Also no, I cannot run these models without llama.cpp as I just have 6 GB VRAM. And you know fully well how huge the llama.cpp code is, it's not possible to just ask ChatGPT to code multimodal support into llama.cpp (otherwise it would already be done by now).

cognitivetech · 2024-10-19T13:16:45Z

don't act shocked and offended. this isn't the first time I saw such a message, here, and I finally am sick of it, regardless of if you were the OP of that particular negotiation tactic.

you don't need llama-cpp to serve a quantized version of the full model.. that's what I mean about asking chatgpt

import requests
import torch
from PIL import Image
from transformers import AutoProcessor, BitsAndBytesConfig, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision"

# Configure 4-bit quantization (requires the bitsandbytes package)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load the model with 4-bit quantization and CPU offloading.
# Mllama checkpoints load through MllamaForConditionalGeneration, as in the
# example posted earlier in this thread.
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    max_memory={0: "5GiB", "cpu": "30GiB"},  # Adjust these values based on your system
    torch_dtype=torch.bfloat16,
)

processor = AutoProcessor.from_pretrained(model_id)

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
image = Image.open(requests.get(url, stream=True).raw)

prompt = "<|image|><|begin_of_text|>If I had to write a haiku for this one"
inputs = processor(image, prompt, return_tensors="pt").to(model.device)

# Generate under bfloat16 autocast (torch.cuda.amp.autocast is deprecated)
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    output = model.generate(**inputs, max_new_tokens=30, do_sample=True, temperature=0.7)

print(processor.decode(output[0], skip_special_tokens=True))

this code is adapted from the original, I made it load in 4 bit, and you can specify max memory to load... so you could even run the full model, but only load some to memory.. though the 4 bit version will fit most of the way onto your gpu.

the rudeness and bullying is born entirely of ignorance of this technology

Dampfinchen · 2024-10-19T13:28:24Z

I'm not acting shocked or offended, nor am I either of those. I'm completely neutral. Also, I really didn't mean to come off as rude or anything like that.

The code you've presented would allow me to load the model in 4 bit, however, I could still not run it because CPU offloading via transformers/bnb is unbearably slow. LLama.cpp is much faster in that regard. It would only work for models that fit entirely in VRAM, like 8B or below.
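
To put rough numbers on the VRAM question, here is a back-of-the-envelope sketch of the memory needed for the weights alone. It assumes a nominal 11 billion parameters (taken from the model name, not an exact count) and ignores activations, the KV cache, and quantization metadata, so real usage will be higher. The function name is made up for illustration.

```python
def weight_memory_gib(n_params, bits_per_weight):
    """Approximate size of the model weights in GiB, ignoring all overhead."""
    return n_params * bits_per_weight / 8 / 2**30


N_PARAMS = 11e9  # nominal parameter count (assumption based on the "11B" name)

bf16_gib = weight_memory_gib(N_PARAMS, 16)  # roughly 20.5 GiB
nf4_gib = weight_memory_gib(N_PARAMS, 4)    # roughly 5.1 GiB

print(f"bf16 weights:  {bf16_gib:.1f} GiB")
print(f"4-bit weights: {nf4_gib:.1f} GiB")
```

Under these assumptions the 4-bit weights alone already sit near the limit of a 6 GB card, which is consistent with part of the model being offloaded to the CPU in the setup discussed above.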

JohannesGaessler · 2024-10-19T13:31:55Z

@Dampfinchen there is no "death blow" risk for llama.cpp from a lack of multimodal support. At this point the library is too deeply ingrained as essentially the only option that isn't based on PyTorch.

@cognitivetech I am not perceiving Dampfinchen's words as a "threat". He is free to argue in favor of vision support and I would agree that vision support would be beneficial. I just don't agree that it's of higher priority than what I'm working on right now (general GGML training support).

cognitivetech · 2024-10-19T13:36:29Z

thanks for the update Johannes.

I've been watching this thread for a couple weeks and maybe it was not a threat but waving over developers head the idea the project will become obsolete.. now the 3rd time I read this same comment here in 2 weeks

there is certainly an implied "or else" as though the developers are just clueless about the state of the ecosystem for vision support. meanwhile this is among the most highly demanded and rapidly maintained softwares on the market.

anyways.. that's my 2 cents, which I felt like were my right to air after repeatedly seeing that message, here. (or maybe it was the same message in multiple threads on vision support, tbh)

Dampfinchen · 2024-10-19T13:55:44Z

@Dampfinchen there is no "death blow" risk for llama.cpp from a lack of multimodal support. At this point the library is too deeply ingrained as essentially the only option that isn't based on PyTorch.

@cognitivetech I am not perceiving Dampfinchen's words as a "threat". He is free to argue in favor of vision support and I would agree that vision support would be beneficial. I just don't agree that it's of higher priority than what I'm working on right now (general GGML training support).

Yeah you are right, in hindsight, my choice of using "death blow" might have been exaggerated, but nice that you can see my point anyways! My poor choice of wording might have been the cause for cognitivetech's annoyance.

Training support for GGML would be excellent, I agree. I could see ggml speeding up my training immensely. With that, better LoRA adapter support without the need to merge the LoRA into a model to avoid a heavy slowdown would be amazing as well. Right now, I don't perceive multimodal support as that important either (still important, yes, but not like it has to happen right now). There's a good selection of multimodal models at this time, but it's mostly vision input rather than omnimodality (vision+text+audio input/output, trained natively with multimodality in mind). Text generation models are still very much preferred at this time, and I don't see that changing until perhaps early next year, hopefully with Llama 4 being omnimodal.

In any case, I have high hopes more talented people will deliver multimodal support for llama.cpp, specifically from companies like Microsoft and Google who have an interest of running these on low powered devices. In turn, the code could get backported to llama.cpp to benefit from it. There's simply no other inference engine that efficient on low powered devices.

mudler · 2024-10-21T09:30:18Z

JFYI ollama merged llama3.2 vision support a few days ago: ollama/ollama#6963

I really have mixed feelings about the ollama community: why not help llama.cpp and upstream the changes instead of "hard-forking" part of it? Why split the effort rather than contribute to the upstream project here?

Is there a PR here from the ollama guys that isn't getting attention and that I'm missing?

Animaxx · 2024-10-24T20:19:28Z

Ollama has image preprocessing in Golang for vision; is there any plan to have something similar here? ollama/ollama#7300 (comment)

mirek190 · 2024-10-24T20:58:46Z

...good question

jhowilbur · 2024-10-31T01:11:40Z

@jrp2014 Can you please provide the exact code you used?

I tried loading the full model, but got zsh killed :(

https://huggingface.co/blog/llama32#llamacpp--llama-cpp-python

But I'm not sure if it will work with the vision part

jrp2014 · 2024-11-01T22:31:41Z

I provided it above 3 weeks ago, copying directly from Hugging Face, I think. It may be that the particular set of installed Python libraries etc. is what made the difference? I try to use the latest versions.

mirek190 · 2024-11-01T23:31:57Z

More than a month has passed and no one has even cared to add vision support. 😅

Llama 4 will be released soon, and it will also be multimodal....

jhowilbur · 2024-11-02T02:21:49Z

Ollama has image preparsing in golang for the vision, does there any plan to have the similar? ollama/ollama#7300 (comment)

With some effort, I believe it will be possible to use the Golang binding to c++
we did it with whisper.cpp
https://github.com/ggerganov/whisper.cpp/tree/master/bindings/go

To our surprise, it calls the same core library that llama.cpp uses for its tensor computations: GGML.

qnixsynapse · 2024-11-18T05:40:41Z

Weird that llama.cpp still doesn't have support for llama 3.2 vision. Most of the stuff can be implemented easily including the unpad operation in ggml and cross attention in llama.cpp as they are in C/C++.

Only the image processing that ollama does in Golang needs to be implemented in the cli/server.
Operations such as image loading, resizing, cropping, conversion to different color spaces, and more can easily be done with the opencv cpp library.

cc @ngxson Can llamacpp use opencv?

Edit: I am pasting a C++ port of ollama's Go code. Feel free to fix it, improve it and use it!

#include <opencv2/opencv.hpp>
#include <algorithm>
#include <array>
#include <cmath>
#include <iterator>
#include <numeric>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

using namespace cv;

std::vector<Point> GetSupportedAspectRatios(int maxTiles) {
    std::vector<Point> ratios;
    for (int w = 0; w < maxTiles; ++w) {
        for (int h = 0; h < maxTiles; ++h) {
            if ((w + 1) * (h + 1) <= maxTiles) {
                ratios.emplace_back(w + 1, h + 1);
            }
        }
    }
    return ratios;
}

int clip(int a, int a_min, int a_max) {
    return std::min(std::max(a, a_min), a_max);
}

Point getImageSizeFitToCanvas(Point imageSize, Point canvasSize, int tileSize) {
    int targetWidth = clip(imageSize.x, tileSize, canvasSize.x);
    int targetHeight = clip(imageSize.y, tileSize, canvasSize.y);

    double scaleWidth = static_cast<double>(targetWidth) / imageSize.x;
    double scaleHeight = static_cast<double>(targetHeight) / imageSize.y;

    int w, h;
    if (scaleWidth < scaleHeight) {
        w = targetWidth;
        h = std::min(static_cast<int>(std::floor(imageSize.y * scaleWidth)), targetHeight);
    } else {
        w = std::min(static_cast<int>(std::floor(imageSize.x * scaleHeight)), targetWidth);
        h = targetHeight;
    }

    return Point(w, h);
}

Point getOptimalTiledCanvas(Point imageSize, int maxImageTiles, int tileSize) {
    std::vector<Point> possibleTileArrangements = GetSupportedAspectRatios(maxImageTiles);
    std::vector<Point> possibleCanvasSizes;

    for (const auto& pta : possibleTileArrangements) {
        possibleCanvasSizes.emplace_back(pta.x * tileSize, pta.y * tileSize);
    }

    std::vector<double> scales;
    for (const auto& pcs : possibleCanvasSizes) {
        double scaleHeight = static_cast<double>(pcs.y) / imageSize.y;
        double scaleWidth = static_cast<double>(pcs.x) / imageSize.x;
        scales.push_back(std::min(scaleWidth, scaleHeight));
    }

    double minUpscale = 0, maxDownscale = 0;
    bool upscale = false;
    for (double s : scales) {
        if (s > 1.0) {
            upscale = true;
            if (minUpscale == 0) {
                minUpscale = s;
            } else {
                minUpscale = std::min(minUpscale, s);
            }
        } else {
            maxDownscale = std::max(maxDownscale, s);
        }
    }

    double selectedScale = upscale ? minUpscale : maxDownscale;
    Point selectedCanvas;
    for (size_t n = 0; n < possibleCanvasSizes.size(); ++n) {
        if (scales[n] == selectedScale) {
            if (selectedCanvas == Point(0, 0) || 
                (possibleCanvasSizes[n].x * possibleCanvasSizes[n].y < selectedCanvas.x * selectedCanvas.y)) {
                selectedCanvas = possibleCanvasSizes[n];
            }
        }
    }
    return selectedCanvas;
}

std::vector<Mat> splitToTiles(const Mat& img, Point numTilesSize) {
    int tileWidth = img.cols / numTilesSize.x;
    int tileHeight = img.rows / numTilesSize.y;
    std::vector<Mat> tiles;

    for (int h = 0; h < numTilesSize.y; ++h) {
        for (int w = 0; w < numTilesSize.x; ++w) {
            Rect roi(tileWidth * w, tileHeight * h, tileWidth, tileHeight);
            tiles.push_back(img(roi));
        }
    }

    return tiles;
}

Mat compositeImage(const Mat& img) {
    Mat dst(img.size(), CV_8UC3, Scalar(255, 255, 255));
    if (img.channels() == 4) {
        Mat imgConverted;
        cvtColor(img, imgConverted, COLOR_BGRA2BGR);
        imgConverted.copyTo(dst);
    } else {
        img.copyTo(dst);
    }
    return dst;
}

Mat ResizeImage(const Mat& img, const std::string& format, Point outputSize, int maxImageTiles) {
    Mat processedImg = img.clone();
    if (format == "png" && img.channels() == 4) {
        processedImg = compositeImage(img);
    }

    Point canvasSize = getOptimalTiledCanvas(img.size(), maxImageTiles, outputSize.y);
    Point aspectRatio(canvasSize.x / outputSize.y, canvasSize.y / outputSize.y);
    Point newSize = getImageSizeFitToCanvas(img.size(), canvasSize, outputSize.y);

    Mat resizedImg;
    resize(processedImg, resizedImg, Size(newSize.x, newSize.y), 0, 0, INTER_LINEAR);

    return resizedImg;
}

Mat PadImage(const Mat& img, Point outputSize, Point aspectRatio) {
    Point paddedSize(outputSize.x * aspectRatio.x, outputSize.y * aspectRatio.y);
    Mat paddedImg(paddedSize.y, paddedSize.x, img.type(), Scalar(0, 0, 0));
    img.copyTo(paddedImg(Rect(0, 0, img.cols, img.rows)));
    return paddedImg;
}

std::vector<float> PackImages(const Mat& img, Point aspectRatio, const std::array<float, 3>& mean, const std::array<float, 3>& std) {
    std::vector<Mat> subImages = splitToTiles(img, aspectRatio);
    std::vector<float> pixelVals;

    for (const auto& subImg : subImages) {
        for (int y = 0; y < subImg.rows; ++y) {
            for (int x = 0; x < subImg.cols; ++x) {
                Vec3b color = subImg.at<Vec3b>(y, x);
                for (int c = 0; c < 3; ++c) {
                    float normalized = (color[c] / 255.0f - mean[c]) / std[c];
                    pixelVals.push_back(normalized);
                }
            }
        }
    }

    return pixelVals;
}

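std::pair<std::vector<float>, int> Preprocess(const std::vector<uchar>& imageData) {
    // IMREAD_COLOR always yields 8-bit, 3-channel BGR data, which is what
    // PackImages assumes when it reads Vec3b pixels. Any alpha channel is
    // dropped here rather than composited onto white.
    Mat img = imdecode(imageData, IMREAD_COLOR);
    if (img.empty()) {
        throw std::runtime_error("Failed to decode image");
    }

    Point outputSize(560, 560);
    int maxTiles = 4;

    // The normalization constants are in RGB order, but OpenCV decodes to BGR,
    // so convert first so that channel c lines up with mean[c] and stddev[c].
    cvtColor(img, img, COLOR_BGR2RGB);
    std::array<float, 3> mean = {0.48145466f, 0.4578275f, 0.40821073f};
    std::array<float, 3> stddev = {0.26862954f, 0.26130258f, 0.27577711f};

    // Derive the tile grid from the chosen canvas, not from the resized image:
    // the resized image is usually smaller than the canvas, so dividing its
    // dimensions by the tile size can round down to 0.
    Point canvasSize = getOptimalTiledCanvas(Point(img.cols, img.rows), maxTiles, outputSize.y);
    Point aspectRatio(canvasSize.x / outputSize.x, canvasSize.y / outputSize.y);

    Mat resizedImg = ResizeImage(img, "jpg", outputSize, maxTiles);
    Mat paddedImg = PadImage(resizedImg, outputSize, aspectRatio);
    std::vector<float> data = PackImages(paddedImg, aspectRatio, mean, stddev);

    // 1-based index of the chosen tile grid within the supported arrangements
    auto supportedRatios = GetSupportedAspectRatios(maxTiles);
    auto it = std::find(supportedRatios.begin(), supportedRatios.end(), aspectRatio);
    int aspectRatioIndex = static_cast<int>(std::distance(supportedRatios.begin(), it)) + 1;

    return {data, aspectRatioIndex};
}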
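
The canvas-selection step is the least obvious part of the code above, so here is a small pure-Python sketch of the same logic that can be run without OpenCV. It mirrors GetSupportedAspectRatios and getOptimalTiledCanvas as posted; the Python function names are made up for illustration, and the defaults (4 tiles of 560 pixels) are taken from Preprocess above. It is a reading aid, not a reference implementation.

```python
def supported_aspect_ratios(max_tiles):
    """All (tiles_wide, tiles_high) grids that use at most max_tiles tiles."""
    return [
        (w, h)
        for w in range(1, max_tiles + 1)
        for h in range(1, max_tiles + 1)
        if w * h <= max_tiles
    ]


def optimal_tiled_canvas(image_size, max_tiles=4, tile_size=560):
    """Pick the canvas (in pixels) that the image should be fitted into."""
    img_w, img_h = image_size
    canvases = [(w * tile_size, h * tile_size)
                for w, h in supported_aspect_ratios(max_tiles)]
    # Scale factor needed to fit the image inside each candidate canvas.
    scales = [min(cw / img_w, ch / img_h) for cw, ch in canvases]

    # Prefer the smallest upscale; if every canvas needs a downscale,
    # take the one that shrinks the image the least.
    upscales = [s for s in scales if s > 1.0]
    selected = min(upscales) if upscales else max(scales)

    # Several canvases can share the selected scale; keep the smallest area.
    candidates = [c for c, s in zip(canvases, scales) if s == selected]
    return min(candidates, key=lambda c: c[0] * c[1])


canvas = optimal_tiled_canvas((1000, 500))
tile_grid = (canvas[0] // 560, canvas[1] // 560)
print(canvas, tile_grid)  # (1120, 560) (2, 1)
```

For a 1000x500 image, four canvases tie at a scale of 1.12, and the smallest of them (two tiles wide, one tile high) is chosen. That tile grid is the aspect ratio the C++ code passes on to the padding and packing steps.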

Thellton · 2024-11-18T23:56:56Z

JFYI ollama merged llama3.2 vision support few days ago: ollama/ollama#6963

I really have mixed feelings on the ollama community - why not helping llama.cpp and upstream the changes instead of "hard-forking" part of it? Why splitting the effort and not contribute to the upstream project here?

is there a PR here from the ollama guys that is not having attention and I'm missing ?

Apparently they've never submitted an issue or created a pull request in llamacpp, at least as Ollama itself...

boom-bang · 2024-12-05T17:17:36Z

Currently, to perform inference on LoRA / QLoRA fine-tuned versions of the latest multimodal language models (that are vital in real world projects instead of almost useful zero-shot versions of LLMs) , we are limited to using solutions like Transformers or Unsloth inference, that can be good for development and testing but are not good enough for production ready environments.

Many inference solutions, such as Ollama or vLLM, depend on llama.cpp for GGUF support. Therefore, it's crucial for the community that llama.cpp advances in this area.

ngxson · 2024-12-05T17:24:10Z

The operations such as image loading, resizing, cropping, conversion to different color spaces, and more can be easily done with opencv cpp library.

@qnixsynapse I have no strong preference on that, but opencv is quite a huge library and the operations that you list are just a small portion of it (see https://github.com/opencv/opencv/blob/4.x/include/opencv2/opencv.hpp). So I think it would be quite wasteful to include the whole opencv.hpp in the project.

Also, most of the operations that you listed can already be done with stb_image, and in fact clip.cpp does just that. So I'm not sure why we need to use opencv here.

qnixsynapse · 2024-12-06T03:07:46Z

I think there should be no problem in implementing then. I haven't checked clip.cpp yet. But if everything is there in clip.cpp then I guess there should be no problem in implementing LLaMA Vision support here. (Please correct me if I'm wrong here).

boom-bang · 2024-12-08T18:18:43Z

I think there should be no problem in implementing then. I haven't checked clip.cpp yet. But if everything is there in clip.cpp then I guess there should be no problem in implementing LLaMA Vision support here. (Please correct me if I'm wrong here).

Hey, @qnixsynapse, got a chance to check clip.cpp?

danbev · 2025-01-06T10:12:00Z

I've been looking into this and have something working which can be found here.

This builds upon the vision API proposal (well, kind of at least; I did not want to change anything there, so this is just implemented alongside the current vision implementation). I mainly wanted to post this comment to show that some progress is being made on this.

HanGyeol-Yoo · 2025-01-06T10:19:36Z

I've been looking into this and have something working which can be found here.

This builds upon the vision API proposal (well, kind of at least; I did not want to change anything there, so this is just implemented alongside the current vision implementation). I mainly wanted to post this comment to show that some progress is being made on this.

Does it support Ollama?

danbev · 2025-01-06T11:44:36Z

Does it support Ollama?

I'm not sure I fully understand your question, but I'll try to answer.

Ollama has had support for Llama 3.2 Vision for quite some time, and I looked at how they implemented it when working on this. But there are differences: for example, in llama.cpp's case there is only a single .gguf model that contains both the vision encoder and the language model, whereas Ollama has two.

Does that address your question, or did you have something else in mind when you say "support Ollama"?

This was referenced Sep 26, 2024:

Add the new Multi-Modal model of mistral AI: pixtral-12b mudler/LocalAI#3535 (Open)

llama3.2 vision models mudler/LocalAI#3669 (Open)

giladgd mentioned this issue Sep 26, 2024: feat: pass an image as part of the evaluation withcatai/node-llama-cpp#88 (Open)

hahuyhoang411 mentioned this issue Oct 22, 2024: Llama3.2-11b-Vision janhq/models#25 (Open)

LostRuins mentioned this issue Nov 11, 2024: Llama-3.2 11B Vision Support LostRuins/koboldcpp#1209 (Open)

wirthual mentioned this issue Nov 13, 2024: Bug: Cannot use llama 3.2 vision Mozilla-Ocho/llamafile#615 (Closed)

XuGW-Kevin mentioned this issue Dec 2, 2024: Local run instructions please PKU-YuanGroup/LLaVA-CoT#14 (Closed)

LFsWang mentioned this issue Dec 10, 2024: feat: Support Llama-3.2-11B-Vision WasmEdge/WasmEdge#3911 (Open)
