
Couldn't run any vision model #935

Open
GraphicalDot opened this issue Nov 26, 2024 · 20 comments
Labels: bug (Something isn't working)

@GraphicalDot

Describe the bug

Hey everyone,

I’m trying to run vision models in Rust on my M4 Pro (48GB RAM). After some research, I found Mistral.rs, which seems like the best library out there for running vision models locally. However, I’ve been running into some serious roadblocks, and I’m hoping someone here can help!
What I Tried

Running Vision Models Locally: I tried running the following commands:

cargo run --features metal --release -- -i --isq Q4K vision-plain -m lamm-mit/Cephalo-Llama-3.2-11B-Vision-Instruct-128k -a vllama

cargo run --features metal --release -- -i vision-plain -m Qwen/Qwen2-VL-2B-Instruct -a qwen2vl

Neither of these worked. When I tried to process an image using Qwen2-VL-2B-Instruct, I got the following error:


> \image /Users/sauravverma/Desktop/theMeme.png describe the image

thread '<unnamed>' panicked at mistralrs-core/src/vision_models/qwen2vl/inputs_processor.rs:265:30:

Preprocessing failed: Msg("Num channels must match number of mean and std.")

This means the preprocessing step failed. Not sure how to fix this.

  1. Quantization Runtime Issues: The commands above download the entire model and perform runtime quantization. This consumes a huge amount of resources and isn't feasible for my setup.

  2. Hosting as a Server: I tried running the model as an HTTP server using mistralrs-server:

./mistralrs-server gguf -m /Users/sauravverma/.pyano/models/ -f Llama-3.2-11B-Vision-Instruct.Q4_K_M.gguf

This gave me the following error:

thread 'main' panicked at mistralrs-core/src/gguf/content.rs:94:22:

called `Result::unwrap()` on an `Err` value: Unknown GGUF architecture `mllama`

However, when I tried running another model:

./mistralrs-server -p 52554 gguf -m /Users/sauravverma/.pyano/models/ -f MiniCPM-V-2_6-Q6_K_L.gguf

What I Need Help With

Fixing the Preprocessing Issue:

    How do I resolve the Num channels must match number of mean and std. error for Qwen2-VL-2B-Instruct?

Avoiding Runtime Quantization:

    Is there a way to pre-quantize the models or avoid the heavy resource consumption during runtime quantization?

Using the HTTP Server for Inference:

    The server starts successfully for some models, but there’s no documentation on how to send an image and get predictions. Has anyone managed to do this?
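My guess is that the server exposes an OpenAI-compatible /v1/chat/completions endpoint (the same shape the official OpenAI clients use), so a request along the following lines should work, but I could not find documentation confirming it. This is only a sketch: the port, payload shape, and model name are assumptions, and it needs the reqwest (blocking + json features) and serde_json crates.

// Hypothetical sketch: send an image request to a running mistralrs-server.
// Endpoint path and JSON shape are assumed, not taken from the docs.
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let body = json!({
        // Placeholder; use whatever model id the server was started with.
        "model": "some-vision-model",
        "messages": [{
            "role": "user",
            "content": [
                // A local file would presumably have to be inlined as a
                // base64 data URL instead of a plain path.
                { "type": "image_url", "image_url": { "url": "https://example.com/some_image.png" } },
                { "type": "text", "text": "Describe this image." }
            ]
        }],
        "max_tokens": 256
    });

    let response = reqwest::blocking::Client::new()
        .post("http://localhost:1234/v1/chat/completions")
        .json(&body)
        .send()?
        .text()?;
    println!("{response}");
    Ok(())
}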

Latest commit or version

Latest commit

GraphicalDot added the `bug` (Something isn't working) label on Nov 26, 2024
@EricLBuehler (Owner)

Hi @GraphicalDot! Thanks for the issue. I see there are a few problems; I've enumerated and addressed them below:

  1. The Qwen2-VL model gives the image preprocessing error.
    This error specifically means that the image does not have three channels, as the processor expects. Always cast image to rgb8 for qwenvl2 #936 should address this (a minimal sketch of the idea is at the end of this comment).
  2. ISQ downloads the entire model and quantizes it in place.
    This can be solved by using UQFF. Docs are here, and the model collection is here - including vision models.
  3. When you tried running the server, you used the Llama 3.2 Vision model in GGUF form. Llama.cpp does not officially support this model, so we do not want to support anything nonstandard. I would recommend using the ISQ or UQFF versions of the model instead:
cargo run --release --features metal -- --port 1234 vision-plain -m EricB/Llama-3.2-11B-Vision-Instruct-UQFF -a vllama --from-uqff llama3.2-vision-instruct-q4k.uqff
  • Can you please run
git pull

to test out the Qwen2-VL model, as I have merged #936?

  • Can you please let me know if the error for the Llama 3.2 model persists as you reported?

Thanks!
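For reference, the gist of that fix is just forcing three channels before normalization. A minimal sketch of the idea with the `image` crate (illustrative only; the actual code in the Qwen2-VL input processor differs in detail):

use image::DynamicImage;

/// Force exactly three channels so that the per-channel normalization
/// (three means, three stds) always matches the pixel data. RGBA PNGs,
/// grayscale images, etc. all end up as plain RGB.
fn to_three_channels(img: DynamicImage) -> DynamicImage {
    DynamicImage::ImageRgb8(img.to_rgb8())
}

fn main() -> Result<(), image::ImageError> {
    let img = image::open("theMeme.png")?; // e.g. an RGBA screenshot
    let _rgb = to_three_channels(img);     // guaranteed 3 channels from here on
    Ok(())
}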

@GraphicalDot (Author)

Thanks for the prompt response.

I took a git pull and ran this command:

 cargo run --release --features metal -- --port 1234 vision-plain -m EricB/Llama-3.2-11B-Vision-Instruct-UQFF -a vllama --from-uqff llama3.2-vision-instruct-q4k.uqff

However, it first downloaded the weights and the .uqff file. The server started smoothly.

I used this script to test the model, as-is:
https://github.com/EricLBuehler/mistral.rs/blob/master/examples/server/llama_vision.py

The error that I got from the server is:

2024-11-26T16:01:57.374399Z ERROR mistralrs_core::engine: prompt step - Model failed with error: Msg("The number of images in each batch [2] should be the same as the number of images [1]. The model cannot support a different number of images per patch. Perhaps you forgot a `<|image|>` tag?")
2024-11-26T16:02:03.736493Z ERROR mistralrs_core::engine: prompt step - Model failed with error: Msg("The number of images in each batch [2] should be the same as the number of images [1]. The model cannot support a different number of images per patch. Perhaps you forgot a `<|image|>` tag?")
2024-11-26T16:02:06.550921Z ERROR mistralrs_core::engine: prompt step - Model failed with error: Msg("The number of images in each batch [2] should be the same as the number of images [1]. The model cannot support a different number of images per patch. Perhaps you forgot a `<|image|>` tag?")

The error I got from the Python script is:

InternalServerError: Error code: 500 - {'message': 'The number of images in each batch [2] should be the same as the number of images [1]. The model cannot support a different number of images per patch. Perhaps you forgot a `<|image|>` tag?', 'partial_response': {'id': '3', 'choices': [{'finish_reason': 'error', 'index': 0, 'message': {'content': '', 'role': 'assistant', 'tool_calls': []}, 'logprobs': None}], 'created': 1732636926, 'model': 'EricB/Llama-3.2-11B-Vision-Instruct-UQFF', 'system_fingerprint': 'local', 'object': 'chat.completion', 'usage': {'completion_tokens': 0, 'prompt_tokens': 28, 'total_tokens': 28, 'avg_tok_per_sec': 14000.0, 'avg_prompt_tok_per_sec': None, 'avg_compl_tok_per_sec': None, 'total_time_sec': 0.002, 'total_prompt_time_sec': 0.0, 'total_completion_time_sec': 0.0}}}

This is for the Llama 3.2 vision model.

@EricLBuehler (Owner)

@GraphicalDot I merged #937, can you please try the server again after git pull?

@GraphicalDot (Author)

Thanks a ton!
It worked.
However, it consumed more than 30GB of VRAM and almost 100% GPU on my Mac M4 48GB machine. Is there a way to reduce this memory footprint?

What else needs to be done to run Qwen/Qwen2-VL-2B-Instruct?

Could you please guide me on what needs to be done to support new upcoming models for vision?

@EricLBuehler (Owner)

@GraphicalDot I think it should run Qwen2-VL now.

If it's using 100% GPU, that's good! It means that it is using the resources efficiently. Memory footprint, however, is different. Is the 30GB during inference or during loading? Does it spike or does it stay about constant? If you could provide a graph, that would be great!

What other models do you have in mind? Right now we have Phi 3.5, the Llava models, Idefics 2, Llama 3.2 V, and Qwen2-VL.

@GraphicalDot (Author)

Please see the attached video.

https://youtu.be/EuVoU65eJds

It increases and then peaks at around 30 GB during inference only, much like llama.cpp behaves during inference. It consumes around 5 GB of VRAM while the server is just running.

@GraphicalDot (Author)

Do I need to convert https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct to .uqff format before running it, like I did with the Llama 3.2 11B vision model?

Do I also need to do the same for Flux-architecture models, i.e. convert them to .uqff format?

@EricLBuehler (Owner)

@GraphicalDot I merged #938 which brings the usage down significantly on my machine. Can you please confirm it works for you after git pull?

Do I need to convert https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct to .uqff format before running it, like I did with the Llama 3.2 11B vision model?

Yes.

@GraphicalDot (Author)

GraphicalDot commented Nov 27, 2024

Yes, now it consumes only 7-8 GB of RAM on my M4 Pro machine.
This is huge!

The other models that we are trying to run are:

HuggingFaceTB/SmolVLM-Instruct
openbmb/MiniCPM-V-2_6
google/paligemma-3b-mix-448

Second question: Is there a way to just provide the path of a locally stored .uqff file rather than providing the Hugging Face repo name? That is, going from

./mistralrs-server --port 52554 vision-plain -m EricB/Llama-3.2-11B-Vision-Instruct-UQFF -a vllama --from-uqff llama3.2-vision-instruct-q4k.uqff

to

./mistralrs-server --port 52554 vision-plain -a vllama --from-uqff path/to/local/.uqff

@openmynet (Contributor)

@GraphicalDot

create

cd H:\ml\Qwen2-VL-2B-Instruct
E:\dev\rust\mistral.rs\target\release\mistralrs-server.exe --isq Q4K -i vision-plain -m ./ -a  qwen2vl --write-uqff ./uqff/q4k.uqff

load

cd uqff
E:\dev\rust\mistral.rs\target\release\mistralrs-server.exe  -i vision-plain -a  qwen2vl -m ./ --from-uqff ./q4k.uqff

@EricLBuehler (Owner)

@GraphicalDot, following up:

  1. The other models you wanted were: HuggingFaceTB/SmolVLM-Instruct, openbmb/MiniCPM-V-2_6 and google/paligemma-3b-mix-448.

I added HuggingFaceTB/SmolVLM-Instruct and Idefics 3 support!

  2. Is there a way to just provide the path of a locally stored .uqff file rather than providing the Hugging Face repo name?

Yes, here:

./mistralrs-server  --port 52554 vision-plain -m path/to/local -a vllama --from-uqff uqff_file_name.uqff

Please let me know if there are any other questions!

@GraphicalDot (Author)

GraphicalDot commented Dec 1, 2024

This has stopped working

➜  mistral.rs git:(master) cargo run --release --features metal -- -i vision-plain -m EricB/Llama-3.2-11B-Vision-Instruct-UQFF -a vllama --from-uqff llama3.2-vision-instruct-q4k.uqff
   Compiling candle-metal-kernels v0.8.0 (https://github.com/EricLBuehler/candle.git?rev=e5685ce#e5685cee)
   Compiling mistralrs-quant v0.3.4 (/Users/sauravverma/programs/pyano/mistral.rs/mistralrs-quant)
   Compiling mistralrs-core v0.3.4 (/Users/sauravverma/programs/pyano/mistral.rs/mistralrs-core)
   Compiling candle-core v0.8.0 (https://github.com/EricLBuehler/candle.git?rev=e5685ce#e5685cee)
   Compiling candle-nn v0.8.0 (https://github.com/EricLBuehler/candle.git?rev=e5685ce#e5685cee)
   Compiling mistralrs-vision v0.3.4 (/Users/sauravverma/programs/pyano/mistral.rs/mistralrs-vision)
   Compiling mistralrs-server v0.3.4 (/Users/sauravverma/programs/pyano/mistral.rs/mistralrs-server)
    Finished `release` profile [optimized] target(s) in 34.93s
     Running `target/release/mistralrs-server -i vision-plain -m EricB/Llama-3.2-11B-Vision-Instruct-UQFF -a vllama --from-uqff llama3.2-vision-instruct-q4k.uqff`
2024-12-01T07:26:52.358118Z  INFO mistralrs_server: avx: false, neon: true, simd128: false, f16c: false
2024-12-01T07:26:52.358171Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2024-12-01T07:26:52.358241Z  INFO mistralrs_server: Model kind is: normal (no adapters)
2024-12-01T07:26:52.358360Z  INFO candle_hf_hub: Token file not found "/Users/sauravverma/.cache/huggingface/token"
2024-12-01T07:26:52.358407Z  INFO mistralrs_core::utils::tokens: Could not load token at "/Users/sauravverma/.cache/huggingface/token", using no HF token.
2024-12-01T07:26:52.358919Z  INFO mistralrs_core::pipeline::vision: Loading `tokenizer.json` at `EricB/Llama-3.2-11B-Vision-Instruct-UQFF`
2024-12-01T07:26:52.359165Z  INFO mistralrs_core::pipeline::vision: Loading `config.json` at `EricB/Llama-3.2-11B-Vision-Instruct-UQFF`
2024-12-01T07:26:52.788467Z  INFO mistralrs_core::pipeline::paths: Found model weight filenames ["residual.safetensors"]
2024-12-01T07:26:53.049264Z  INFO mistralrs_core::pipeline::vision: Loading `generation_config.json` at `EricB/Llama-3.2-11B-Vision-Instruct-UQFF`
2024-12-01T07:26:53.380426Z  INFO mistralrs_core::pipeline::vision: Loading `preprocessor_config.json` at `EricB/Llama-3.2-11B-Vision-Instruct-UQFF`
2024-12-01T07:26:53.790561Z  INFO mistralrs_core::pipeline::vision: Loading `tokenizer_config.json` at `EricB/Llama-3.2-11B-Vision-Instruct-UQFF`
2024-12-01T07:26:54.053908Z  INFO candle_hf_hub: Token file not found "/Users/sauravverma/.cache/huggingface/token"
2024-12-01T07:26:54.055947Z  INFO mistralrs_core::pipeline::vision: Loading model `EricB/Llama-3.2-11B-Vision-Instruct-UQFF` on metal[4294968573].
2024-12-01T07:26:54.056867Z  INFO mistralrs_core::pipeline::vision: Model config: MLlamaConfig { vision_config: MLlamaVisionConfig { hidden_size: 1280, hidden_act: Gelu, num_hidden_layers: 32, num_global_layers: 8, num_attention_heads: 16, num_channels: 3, intermediate_size: 5120, vision_output_dim: 7680, image_size: 560, patch_size: 14, norm_eps: 1e-5, max_num_tiles: 4, intermediate_layers_indices: [3, 7, 15, 23, 30], supported_aspect_ratios: [(1, 1), (1, 2), (1, 3), (1, 4), (2, 1), (2, 2), (3, 1), (4, 1)] }, text_config: MLlamaTextConfig { rope_scaling: Some(MLlamaRopeScaling { rope_type: Llama3, factor: Some(8.0), original_max_position_embeddings: 8192, attention_factor: None, beta_fast: None, beta_slow: None, short_factor: None, long_factor: None, low_freq_factor: Some(1.0), high_freq_factor: Some(4.0) }), vocab_size: 128256, hidden_size: 4096, hidden_act: Silu, num_hidden_layers: 40, num_attention_heads: 32, num_key_value_heads: 8, intermediate_size: 14336, rope_theta: 500000.0, rms_norm_eps: 1e-5, max_position_embeddings: 131072, tie_word_embeddings: false, cross_attention_layers: [3, 8, 13, 18, 23, 28, 33, 38], use_flash_attn: false, quantization_config: None } }
2024-12-01T07:26:54.065800Z  INFO mistralrs_core::utils::normal: DType selected is BF16.
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 362/362 [00:03<00:00, 38650.50it/s]
2024-12-01T07:26:59.949592Z  INFO mistralrs_core::pipeline::isq: Loaded in-situ quantization artifacts into 464 total tensors. Took 1.88s
2024-12-01T07:27:00.098696Z  INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|begin_of_text|>", eos_toks = "<|eot_id|>", "<|end_of_text|>", "<|eom_id|>", unk_tok = `None`
2024-12-01T07:27:00.108598Z  INFO mistralrs_server: Model loaded.
2024-12-01T07:27:00.108719Z  INFO mistralrs_core: Beginning dummy run.
thread '<unnamed>' panicked at mistralrs-core/src/attention.rs:151:17:
Error:
xcrun: error: unable to find utility "metal", not a developer tool or in PATH

note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
2024-12-01T07:27:00.148505Z  WARN mistralrs_core: Dummy run failed!
2024-12-01T07:27:00.148546Z  INFO mistralrs_server::interactive_mode: Starting interactive loop with sampling params: SamplingParams { temperature: Some(0.1), top_k: Some(32), top_p: Some(0.1), min_p: Some(0.05), top_n_logprobs: 0, frequency_penalty: Some(0.1), presence_penalty: Some(0.1), stop_toks: None, max_len: Some(4096), logits_bias: None, n_choices: 1, dry_params: Some(DrySamplingParams { sequence_breakers: ["\n", ":", "\"", "*"], multiplier: 0.0, base: 1.75, allowed_length: 2 }) }
====================
Welcome to interactive mode! Because this model is a vision model, you can enter prompts and chat with the model.

To specify a message with an image, use the `\image` command detailed below.

Commands:
- `\help`: Display this message.
- `\exit`: Quit interactive mode.
- `\system <system message here>`:
    Add a system message to the chat without running the model.
    Ex: `\system Always respond as a pirate.`
- `\image <image URL or local path here> <message here>`:
    Add a message paired with an image. The image will be fed to the model as if it were the first item in this prompt.
    You do not need to modify your prompt for specific models.
    Ex: `\image path/to/image.jpg Describe what is in this image.`
====================
> \image /Users/sauravverma/Desktop/theMeme.png describe this image
thread 'main' panicked at mistralrs-server/src/interactive_mode.rs:402:32:
called `Result::unwrap()` on an `Err` value: SendError { .. }

If I try to run it in server mode, it just hangs there:

➜  mistral.rs git:(master) cargo run --release --features metal -- --port 1234 vision-plain -m EricB/Llama-3.2-11B-Vision-Instruct-UQFF -a vllama --from-uqff llama3.2-vision-instruct-q4k.uqff
    Finished `release` profile [optimized] target(s) in 0.17s
     Running `target/release/mistralrs-server --port 1234 vision-plain -m EricB/Llama-3.2-11B-Vision-Instruct-UQFF -a vllama --from-uqff llama3.2-vision-instruct-q4k.uqff`
2024-12-01T07:28:10.786623Z  INFO mistralrs_server: avx: false, neon: true, simd128: false, f16c: false
2024-12-01T07:28:10.786665Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2024-12-01T07:28:10.786721Z  INFO mistralrs_server: Model kind is: normal (no adapters)
2024-12-01T07:28:10.786830Z  INFO candle_hf_hub: Token file not found "/Users/sauravverma/.cache/huggingface/token"
2024-12-01T07:28:10.786869Z  INFO mistralrs_core::utils::tokens: Could not load token at "/Users/sauravverma/.cache/huggingface/token", using no HF token.
2024-12-01T07:28:10.788073Z  INFO mistralrs_core::pipeline::vision: Loading `tokenizer.json` at `EricB/Llama-3.2-11B-Vision-Instruct-UQFF`
2024-12-01T07:28:10.788164Z  INFO mistralrs_core::pipeline::vision: Loading `config.json` at `EricB/Llama-3.2-11B-Vision-Instruct-UQFF`
2024-12-01T07:28:11.063779Z  INFO mistralrs_core::pipeline::paths: Found model weight filenames ["residual.safetensors"]
2024-12-01T07:28:11.748672Z  INFO mistralrs_core::pipeline::vision: Loading `generation_config.json` at `EricB/Llama-3.2-11B-Vision-Instruct-UQFF`
2024-12-01T07:28:11.996777Z  INFO mistralrs_core::pipeline::vision: Loading `preprocessor_config.json` at `EricB/Llama-3.2-11B-Vision-Instruct-UQFF`
2024-12-01T07:28:12.267768Z  INFO mistralrs_core::pipeline::vision: Loading `tokenizer_config.json` at `EricB/Llama-3.2-11B-Vision-Instruct-UQFF`
2024-12-01T07:28:12.537370Z  INFO candle_hf_hub: Token file not found "/Users/sauravverma/.cache/huggingface/token"
2024-12-01T07:28:12.537533Z  INFO mistralrs_core::pipeline::vision: Loading model `EricB/Llama-3.2-11B-Vision-Instruct-UQFF` on metal[4294968573].
2024-12-01T07:28:12.537941Z  INFO mistralrs_core::pipeline::vision: Model config: MLlamaConfig { vision_config: MLlamaVisionConfig { hidden_size: 1280, hidden_act: Gelu, num_hidden_layers: 32, num_global_layers: 8, num_attention_heads: 16, num_channels: 3, intermediate_size: 5120, vision_output_dim: 7680, image_size: 560, patch_size: 14, norm_eps: 1e-5, max_num_tiles: 4, intermediate_layers_indices: [3, 7, 15, 23, 30], supported_aspect_ratios: [(1, 1), (1, 2), (1, 3), (1, 4), (2, 1), (2, 2), (3, 1), (4, 1)] }, text_config: MLlamaTextConfig { rope_scaling: Some(MLlamaRopeScaling { rope_type: Llama3, factor: Some(8.0), original_max_position_embeddings: 8192, attention_factor: None, beta_fast: None, beta_slow: None, short_factor: None, long_factor: None, low_freq_factor: Some(1.0), high_freq_factor: Some(4.0) }), vocab_size: 128256, hidden_size: 4096, hidden_act: Silu, num_hidden_layers: 40, num_attention_heads: 32, num_key_value_heads: 8, intermediate_size: 14336, rope_theta: 500000.0, rms_norm_eps: 1e-5, max_position_embeddings: 131072, tie_word_embeddings: false, cross_attention_layers: [3, 8, 13, 18, 23, 28, 33, 38], use_flash_attn: false, quantization_config: None } }
2024-12-01T07:28:12.540692Z  INFO mistralrs_core::utils::normal: DType selected is BF16.
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 362/362 [00:00<00:00, 2665.96it/s]
2024-12-01T07:28:14.085225Z  INFO mistralrs_core::pipeline::isq: Loaded in-situ quantization artifacts into 464 total tensors. Took 0.49s
2024-12-01T07:28:14.232502Z  INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|begin_of_text|>", eos_toks = "<|eot_id|>", "<|end_of_text|>", "<|eom_id|>", unk_tok = `None`
2024-12-01T07:28:14.242518Z  INFO mistralrs_server: Model loaded.
2024-12-01T07:28:14.242639Z  INFO mistralrs_core: Beginning dummy run.
thread '<unnamed>' panicked at mistralrs-core/src/attention.rs:151:17:
Error:
xcrun: error: unable to find utility "metal", not a developer tool or in PATH

note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
2024-12-01T07:28:14.260873Z  WARN mistralrs_core: Dummy run failed!
2024-12-01T07:28:14.261758Z  INFO mistralrs_server: Serving on http://0.0.0.0:1234.
2024-12-01T07:28:18.927620Z  WARN mistralrs_core: Engine is dead, rebooting
2024-12-01T07:28:18.927694Z  INFO mistralrs_core: Successfully rebooted engine and updated sender + engine handler
2024-12-01T07:28:18.930773Z ERROR mistralrs_core::engine: prompt step - Model failed with error: Msg("The number of images in each batch [0] should be the same as the number of images [1]. The model cannot support a different number of images per patch. Perhaps you forgot a `<|image|>` tag?")

I am also getting weird errors.
If I compile it without "metal", it works fine but is super slow, since it does all processing on the CPU.

➜  mistral.rs git:(master) cargo build --release
   Compiling mistralrs-pyo3 v0.3.4 (/Users/sauravverma/programs/pyano/mistral.rs/mistralrs-pyo3)
   Compiling mistralrs-server v0.3.4 (/Users/sauravverma/programs/pyano/mistral.rs/mistralrs-server)
    Finished `release` profile [optimized] target(s) in 4.86s
➜  mistral.rs git:(master) cd target/release
➜  release git:(master) ./mistralrs-server -i vision-plain -m HuggingFaceTB/SmolVLM-Instruct -a idefics3
2024-12-01T07:37:59.756973Z  INFO mistralrs_server: avx: false, neon: true, simd128: false, f16c: false
2024-12-01T07:37:59.757001Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2024-12-01T07:37:59.757120Z  INFO mistralrs_server: Model kind is: normal (no adapters)
2024-12-01T07:37:59.757241Z  INFO candle_hf_hub: Token file not found "/Users/sauravverma/.cache/huggingface/token"
2024-12-01T07:37:59.757278Z  INFO mistralrs_core::utils::tokens: Could not load token at "/Users/sauravverma/.cache/huggingface/token", using no HF token.
2024-12-01T07:37:59.757682Z  INFO mistralrs_core::pipeline::vision: Loading `tokenizer.json` at `HuggingFaceTB/SmolVLM-Instruct`
2024-12-01T07:37:59.757935Z  INFO mistralrs_core::pipeline::vision: Loading `config.json` at `HuggingFaceTB/SmolVLM-Instruct`
2024-12-01T07:38:00.036602Z  INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model.safetensors"]
2024-12-01T07:38:00.285183Z  INFO mistralrs_core::pipeline::vision: Loading `generation_config.json` at `HuggingFaceTB/SmolVLM-Instruct`
2024-12-01T07:38:00.537227Z  INFO mistralrs_core::pipeline::vision: Loading `preprocessor_config.json` at `HuggingFaceTB/SmolVLM-Instruct`
2024-12-01T07:38:00.792704Z  INFO mistralrs_core::pipeline::vision: Loading `processor_config.json` at `HuggingFaceTB/SmolVLM-Instruct`
2024-12-01T07:38:00.793168Z  INFO mistralrs_core::pipeline::vision: Loading `tokenizer_config.json` at `HuggingFaceTB/SmolVLM-Instruct`
2024-12-01T07:38:01.141014Z  INFO mistralrs_core::pipeline::vision: Loading `chat_template.json` at `HuggingFaceTB/SmolVLM-Instruct`
2024-12-01T07:38:01.143514Z  INFO mistralrs_core::pipeline::vision: Loading model `HuggingFaceTB/SmolVLM-Instruct` on cpu.
2024-12-01T07:38:01.144704Z  INFO mistralrs_core::pipeline::vision: Model config: Idefics3Config { image_token_id: 49153, vision_config: Idefics3VisionConfig { hidden_size: 1152, intermediate_size: 4304, num_hidden_layers: 27, num_attention_heads: 16, num_channels: 3, image_size: 384, patch_size: 14, hidden_act: GeluPytorchTanh, layer_norm_eps: 1e-6 }, text_config: Config { hidden_size: 2048, intermediate_size: 8192, vocab_size: 49155, num_hidden_layers: 24, num_attention_heads: 32, num_key_value_heads: 32, use_flash_attn: false, rms_norm_eps: 1e-5, rope_theta: 273768.0, max_position_embeddings: 16384, rope_scaling: None, quantization_config: None, tie_word_embeddings: false }, scale_factor: 3 }
2024-12-01T07:38:01.146303Z  INFO mistralrs_core::utils::normal: DType selected is F16.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 657/657 [00:06<00:00, 222.06it/s]
2024-12-01T07:38:07.639998Z  INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|im_start|>", "<|endoftext|>", eos_toks = "<end_of_utterance>", "<|im_end|>", unk_tok = <|endoftext|>
2024-12-01T07:38:07.642127Z  INFO mistralrs_server: Model loaded.
2024-12-01T07:38:07.642271Z  INFO mistralrs_core: Beginning dummy run.
2024-12-01T07:38:07.790400Z  INFO mistralrs_core: Dummy run completed in 0.148121875s.
2024-12-01T07:38:07.790463Z  INFO mistralrs_server::interactive_mode: Starting interactive loop with sampling params: SamplingParams { temperature: Some(0.1), top_k: Some(32), top_p: Some(0.1), min_p: Some(0.05), top_n_logprobs: 0, frequency_penalty: Some(0.1), presence_penalty: Some(0.1), stop_toks: None, max_len: Some(4096), logits_bias: None, n_choices: 1, dry_params: Some(DrySamplingParams { sequence_breakers: ["\n", ":", "\"", "*"], multiplier: 0.0, base: 1.75, allowed_length: 2 }) }
====================
Welcome to interactive mode! Because this model is a vision model, you can enter prompts and chat with the model.

To specify a message with an image, use the `\image` command detailed below.

Commands:
- `\help`: Display this message.
- `\exit`: Quit interactive mode.
- `\system <system message here>`:
    Add a system message to the chat without running the model.
    Ex: `\system Always respond as a pirate.`
- `\image <image URL or local path here> <message here>`:
    Add a message paired with an image. The image will be fed to the model as if it were the first item in this prompt.
    You do not need to modify your prompt for specific models.
    Ex: `\image path/to/image.jpg Describe what is in this image.`
====================
> \image /Users/sauravverma/Desktop/theMeme.png what is this image
The image is a meme. It is a humorous image that uses text and an image to make a point. The image is a meme because it is funny and it is meant to be shared. The meme is about the first and second time founders. The first time founder is a skeleton and the second time founder is a man. The first time founder is lying on the floor and the second time founder is standing in front of a computer. The text in the meme says "What happened to him?" and "He believed that being good at coding was enough to succeed".

However, compiling it with metal doesn't show any error, but it fails on inference.
I could also see the GPU being used in the console.

➜  release git:(master) cargo build --release --features metal
   Compiling mistralrs-pyo3 v0.3.4 (/Users/sauravverma/programs/pyano/mistral.rs/mistralrs-pyo3)
    Finished `release` profile [optimized] target(s) in 1.93s
➜  release git:(master) ./mistralrs-server -i vision-plain -m HuggingFaceTB/SmolVLM-Instruct -a idefics3
2024-12-01T07:47:11.925774Z  INFO mistralrs_server: avx: false, neon: true, simd128: false, f16c: false
2024-12-01T07:47:11.925806Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2024-12-01T07:47:11.925843Z  INFO mistralrs_server: Model kind is: normal (no adapters)
2024-12-01T07:47:11.925904Z  INFO candle_hf_hub: Token file not found "/Users/sauravverma/.cache/huggingface/token"
2024-12-01T07:47:11.925935Z  INFO mistralrs_core::utils::tokens: Could not load token at "/Users/sauravverma/.cache/huggingface/token", using no HF token.
2024-12-01T07:47:11.926320Z  INFO mistralrs_core::pipeline::vision: Loading `tokenizer.json` at `HuggingFaceTB/SmolVLM-Instruct`
2024-12-01T07:47:11.926412Z  INFO mistralrs_core::pipeline::vision: Loading `config.json` at `HuggingFaceTB/SmolVLM-Instruct`
2024-12-01T07:47:12.452289Z  INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model.safetensors"]
2024-12-01T07:47:12.707290Z  INFO mistralrs_core::pipeline::vision: Loading `generation_config.json` at `HuggingFaceTB/SmolVLM-Instruct`
2024-12-01T07:47:12.988580Z  INFO mistralrs_core::pipeline::vision: Loading `preprocessor_config.json` at `HuggingFaceTB/SmolVLM-Instruct`
2024-12-01T07:47:13.296110Z  INFO mistralrs_core::pipeline::vision: Loading `processor_config.json` at `HuggingFaceTB/SmolVLM-Instruct`
2024-12-01T07:47:13.296516Z  INFO mistralrs_core::pipeline::vision: Loading `tokenizer_config.json` at `HuggingFaceTB/SmolVLM-Instruct`
2024-12-01T07:47:13.704873Z  INFO mistralrs_core::pipeline::vision: Loading `chat_template.json` at `HuggingFaceTB/SmolVLM-Instruct`
2024-12-01T07:47:13.706042Z  INFO mistralrs_core::pipeline::vision: Loading model `HuggingFaceTB/SmolVLM-Instruct` on metal[4294968573].
2024-12-01T07:47:13.706938Z  INFO mistralrs_core::pipeline::vision: Model config: Idefics3Config { image_token_id: 49153, vision_config: Idefics3VisionConfig { hidden_size: 1152, intermediate_size: 4304, num_hidden_layers: 27, num_attention_heads: 16, num_channels: 3, image_size: 384, patch_size: 14, hidden_act: GeluPytorchTanh, layer_norm_eps: 1e-6 }, text_config: Config { hidden_size: 2048, intermediate_size: 8192, vocab_size: 49155, num_hidden_layers: 24, num_attention_heads: 32, num_key_value_heads: 32, use_flash_attn: false, rms_norm_eps: 1e-5, rope_theta: 273768.0, max_position_embeddings: 16384, rope_scaling: None, quantization_config: None, tie_word_embeddings: false }, scale_factor: 3 }
2024-12-01T07:47:13.714774Z  INFO mistralrs_core::utils::normal: DType selected is BF16.
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 657/657 [00:00<00:00, 1217.42it/s]
2024-12-01T07:47:14.497649Z  INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|im_start|>", "<|endoftext|>", eos_toks = "<end_of_utterance>", "<|im_end|>", unk_tok = <|endoftext|>
2024-12-01T07:47:14.499939Z  INFO mistralrs_server: Model loaded.
2024-12-01T07:47:14.500067Z  INFO mistralrs_core: Beginning dummy run.
2024-12-01T07:47:14.622586Z  INFO mistralrs_core: Dummy run completed in 0.122513208s.
2024-12-01T07:47:14.622623Z  INFO mistralrs_server::interactive_mode: Starting interactive loop with sampling params: SamplingParams { temperature: Some(0.1), top_k: Some(32), top_p: Some(0.1), min_p: Some(0.05), top_n_logprobs: 0, frequency_penalty: Some(0.1), presence_penalty: Some(0.1), stop_toks: None, max_len: Some(4096), logits_bias: None, n_choices: 1, dry_params: Some(DrySamplingParams { sequence_breakers: ["\n", ":", "\"", "*"], multiplier: 0.0, base: 1.75, allowed_length: 2 }) }
====================
Welcome to interactive mode! Because this model is a vision model, you can enter prompts and chat with the model.

To specify a message with an image, use the `\image` command detailed below.

Commands:
- `\help`: Display this message.
- `\exit`: Quit interactive mode.
- `\system <system message here>`:
    Add a system message to the chat without running the model.
    Ex: `\system Always respond as a pirate.`
- `\image <image URL or local path here> <message here>`:
    Add a message paired with an image. The image will be fed to the model as if it were the first item in this prompt.
    You do not need to modify your prompt for specific models.
    Ex: `\image path/to/image.jpg Describe what is in this image.`
====================
> \image /Users/sauravverma/Desktop/theMeme.png explain this image
thread '<unnamed>' panicked at mistralrs-core/src/attention.rs:151:17:
Error:
xcrun: error: unable to find utility "metal", not a developer tool or in PATH

note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

>

@EricLBuehler (Owner)

@GraphicalDot I don't think this error originates from mistral.rs; it may instead come from some change in your system.

I would recommend checking out this issue: gfx-rs/gfx#2309.

Can you run the following:

xcode-select --install

@GraphicalDot (Author)

GraphicalDot commented Dec 2, 2024

I pulled the code two days ago, and this problem started occurring. Earlier, there was no issue.
Installing the full Xcode is not an option for us.

@GraphicalDot (Author)

I reverted to the old commit 739ea3e and everything worked fine without this error.
On the current commit, if I try to run it, it fails with this error:

cargo run --release --features metal -- --port 1234 vision-plain -m EricB/Llama-3.2-11B-Vision-Instruct-UQFF -a vllama --from-uqff llama3.2-vision-instruct-q4k.uqff
    Finished `release` profile [optimized] target(s) in 0.49s
     Running `target/release/mistralrs-server --port 1234 vision-plain -m EricB/Llama-3.2-11B-Vision-Instruct-UQFF -a vllama --from-uqff llama3.2-vision-instruct-q4k.uqff`
2024-12-02T12:05:31.350595Z  INFO mistralrs_server: avx: false, neon: true, simd128: false, f16c: false
2024-12-02T12:05:31.350734Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2024-12-02T12:05:31.350859Z  INFO mistralrs_server: Model kind is: normal (no adapters)
2024-12-02T12:05:31.351067Z  INFO candle_hf_hub: Token file not found "/Users/sauravverma/.cache/huggingface/token"
2024-12-02T12:05:31.351160Z  INFO mistralrs_core::utils::tokens: Could not load token at "/Users/sauravverma/.cache/huggingface/token", using no HF token.
2024-12-02T12:05:31.351760Z  INFO mistralrs_core::pipeline::vision: Loading `tokenizer.json` at `EricB/Llama-3.2-11B-Vision-Instruct-UQFF`
2024-12-02T12:05:31.352234Z  INFO mistralrs_core::pipeline::vision: Loading `config.json` at `EricB/Llama-3.2-11B-Vision-Instruct-UQFF`
2024-12-02T12:05:32.269373Z  INFO mistralrs_core::pipeline::paths: Found model weight filenames ["residual.safetensors"]
2024-12-02T12:05:32.513859Z  INFO mistralrs_core::pipeline::vision: Loading `generation_config.json` at `EricB/Llama-3.2-11B-Vision-Instruct-UQFF`
2024-12-02T12:05:33.184666Z  INFO mistralrs_core::pipeline::vision: Loading `preprocessor_config.json` at `EricB/Llama-3.2-11B-Vision-Instruct-UQFF`
2024-12-02T12:05:33.424477Z  INFO mistralrs_core::pipeline::vision: Loading `tokenizer_config.json` at `EricB/Llama-3.2-11B-Vision-Instruct-UQFF`
2024-12-02T12:05:33.689879Z  INFO candle_hf_hub: Token file not found "/Users/sauravverma/.cache/huggingface/token"
2024-12-02T12:05:33.691861Z  INFO mistralrs_core::pipeline::vision: Loading model `EricB/Llama-3.2-11B-Vision-Instruct-UQFF` on metal[4294968379].
2024-12-02T12:05:33.692827Z  INFO mistralrs_core::pipeline::vision: Model config: MLlamaConfig { vision_config: MLlamaVisionConfig { hidden_size: 1280, hidden_act: Gelu, num_hidden_layers: 32, num_global_layers: 8, num_attention_heads: 16, num_channels: 3, intermediate_size: 5120, vision_output_dim: 7680, image_size: 560, patch_size: 14, norm_eps: 1e-5, max_num_tiles: 4, intermediate_layers_indices: [3, 7, 15, 23, 30], supported_aspect_ratios: [(1, 1), (1, 2), (1, 3), (1, 4), (2, 1), (2, 2), (3, 1), (4, 1)] }, text_config: MLlamaTextConfig { rope_scaling: Some(MLlamaRopeScaling { rope_type: Llama3, factor: Some(8.0), original_max_position_embeddings: 8192, attention_factor: None, beta_fast: None, beta_slow: None, short_factor: None, long_factor: None, low_freq_factor: Some(1.0), high_freq_factor: Some(4.0) }), vocab_size: 128256, hidden_size: 4096, hidden_act: Silu, num_hidden_layers: 40, num_attention_heads: 32, num_key_value_heads: 8, intermediate_size: 14336, rope_theta: 500000.0, rms_norm_eps: 1e-5, max_position_embeddings: 131072, tie_word_embeddings: false, cross_attention_layers: [3, 8, 13, 18, 23, 28, 33, 38], use_flash_attn: false, quantization_config: None } }
2024-12-02T12:05:33.704609Z  INFO mistralrs_core::utils::normal: DType selected is BF16.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 362/362 [00:03<00:00, 919.11it/s]
2024-12-02T12:05:39.395677Z  INFO mistralrs_core::pipeline::isq: Loaded in-situ quantization artifacts into 464 total tensors. Took 1.69s
2024-12-02T12:05:39.524137Z  INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|begin_of_text|>", eos_toks = "<|eot_id|>", "<|end_of_text|>", "<|eom_id|>", unk_tok = `None`
2024-12-02T12:05:39.533197Z  INFO mistralrs_server: Model loaded.
2024-12-02T12:05:39.533285Z  INFO mistralrs_core: Beginning dummy run.
thread '<unnamed>' panicked at mistralrs-core/src/attention.rs:151:17:
Error:
xcrun: error: unable to find utility "metal", not a developer tool or in PATH

note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
2024-12-02T12:05:39.553481Z  WARN mistralrs_core: Dummy run failed!

I reverted to an older commit

git checkout 739ea3e5ffd497f421de508e5803e4ec9ee1f3a5

and ran the same command and everything started working again.

 cargo run --release --features metal -- --port 1234 vision-plain -m EricB/Llama-3.2-11B-Vision-Instruct-UQFF -a vllama --from-uqff llama3.2-vision-instruct-q4k.uqff
   Compiling mistralrs-quant v0.3.2 (/Users/sauravverma/programs/pyano/mistral.rs/mistralrs-quant)
   Compiling mistralrs-core v0.3.2 (/Users/sauravverma/programs/pyano/mistral.rs/mistralrs-core)
   Compiling mistralrs-server v0.3.2 (/Users/sauravverma/programs/pyano/mistral.rs/mistralrs-server)
    Finished `release` profile [optimized] target(s) in 28.98s
     Running `target/release/mistralrs-server --port 1234 vision-plain -m EricB/Llama-3.2-11B-Vision-Instruct-UQFF -a vllama --from-uqff llama3.2-vision-instruct-q4k.uqff`
2024-12-02T12:10:30.749515Z  INFO mistralrs_server: avx: false, neon: true, simd128: false, f16c: false
2024-12-02T12:10:30.749585Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2024-12-02T12:10:30.749702Z  INFO mistralrs_server: Model kind is: normal (no adapters)
2024-12-02T12:10:30.749864Z  INFO candle_hf_hub: Token file not found "/Users/sauravverma/.cache/huggingface/token"
2024-12-02T12:10:30.749906Z  INFO mistralrs_core::utils::tokens: Could not load token at "/Users/sauravverma/.cache/huggingface/token", using no HF token.
2024-12-02T12:10:30.750408Z  INFO mistralrs_core::pipeline::vision: Loading `tokenizer.json` at `EricB/Llama-3.2-11B-Vision-Instruct-UQFF`
2024-12-02T12:10:30.750568Z  INFO mistralrs_core::pipeline::vision: Loading `config.json` at `EricB/Llama-3.2-11B-Vision-Instruct-UQFF`
2024-12-02T12:10:31.617781Z  INFO mistralrs_core::pipeline::paths: Found model weight filenames ["residual.safetensors"]
2024-12-02T12:10:32.319334Z  INFO mistralrs_core::pipeline::vision: Loading `generation_config.json` at `EricB/Llama-3.2-11B-Vision-Instruct-UQFF`
2024-12-02T12:10:32.610969Z  INFO mistralrs_core::pipeline::vision: Loading `preprocessor_config.json` at `EricB/Llama-3.2-11B-Vision-Instruct-UQFF`
2024-12-02T12:10:32.864608Z  INFO mistralrs_core::pipeline::vision: Loading `tokenizer_config.json` at `EricB/Llama-3.2-11B-Vision-Instruct-UQFF`
2024-12-02T12:10:32.865074Z  INFO candle_hf_hub: Token file not found "/Users/sauravverma/.cache/huggingface/token"
2024-12-02T12:10:32.865368Z  INFO mistralrs_core::pipeline::vision: Loading model `EricB/Llama-3.2-11B-Vision-Instruct-UQFF` on metal[4294968379].
2024-12-02T12:10:32.866094Z  INFO mistralrs_core::pipeline::vision: Model config: MLlamaConfig { vision_config: MLlamaVisionConfig { hidden_size: 1280, hidden_act: Gelu, num_hidden_layers: 32, num_global_layers: 8, num_attention_heads: 16, num_channels: 3, intermediate_size: 5120, vision_output_dim: 7680, image_size: 560, patch_size: 14, norm_eps: 1e-5, max_num_tiles: 4, intermediate_layers_indices: [3, 7, 15, 23, 30], supported_aspect_ratios: [(1, 1), (1, 2), (1, 3), (1, 4), (2, 1), (2, 2), (3, 1), (4, 1)] }, text_config: MLlamaTextConfig { rope_scaling: Some(MLlamaRopeScaling { rope_type: Llama3, factor: Some(8.0), original_max_position_embeddings: 8192, attention_factor: None, beta_fast: None, beta_slow: None, short_factor: None, long_factor: None, low_freq_factor: Some(1.0), high_freq_factor: Some(4.0) }), vocab_size: 128256, hidden_size: 4096, hidden_act: Silu, num_hidden_layers: 40, num_attention_heads: 32, num_key_value_heads: 8, intermediate_size: 14336, rope_theta: 500000.0, rms_norm_eps: 1e-5, max_position_embeddings: 131072, tie_word_embeddings: false, cross_attention_layers: [3, 8, 13, 18, 23, 28, 33, 38], use_flash_attn: false, quantization_config: None } }
2024-12-02T12:10:32.877575Z  INFO mistralrs_core::utils::normal: DType selected is BF16.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 362/362 [00:03<00:00, 479.83it/s]
2024-12-02T12:10:37.336389Z  INFO mistralrs_core::pipeline::isq: Loaded in-situ quantization artifacts into 464 total tensors. Took 0.46s
2024-12-02T12:10:37.484674Z  INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|begin_of_text|>", eos_toks = "<|eot_id|>", "<|end_of_text|>", "<|eom_id|>", unk_tok = `None`
2024-12-02T12:10:37.494593Z  INFO mistralrs_server: Model loaded.
2024-12-02T12:10:37.494832Z  INFO mistralrs_core: Beginning dummy run.
2024-12-02T12:10:37.674735Z  INFO mistralrs_core: Dummy run completed in 0.17987025s.
2024-12-02T12:10:37.675587Z  INFO mistralrs_server: Serving on http://0.0.0.0:1234.

@GraphicalDot (Author)

I went further and switched to another commit to see whether the Idefics 3 implementation is working or not.

git checkout d5cc451b11003316d183bfe2b05e543b9e5baaf8

and it worked

mistral.rs git:(d5cc451b1) cargo build --release --features metal
   Compiling mistralrs-pyo3 v0.3.2 (/Users/sauravverma/programs/pyano/mistral.rs/mistralrs-pyo3)
   Compiling mistralrs v0.3.2 (/Users/sauravverma/programs/pyano/mistral.rs/mistralrs)
   Compiling mistralrs-bench v0.3.2 (/Users/sauravverma/programs/pyano/mistral.rs/mistralrs-bench)
   Compiling mistralrs-paged-attn v0.3.2 (/Users/sauravverma/programs/pyano/mistral.rs/mistralrs-paged-attn)
    Finished `release` profile [optimized] target(s) in 2.24s
➜  mistral.rs git:(d5cc451b1) ./mistralrs-server -i vision-plain -m HuggingFaceTB/SmolVLM-Instruct -a idefics3
zsh: permission denied: ./mistralrs-server
➜  mistral.rs git:(d5cc451b1) cd target/release
➜  release git:(d5cc451b1) ./mistralrs-server -i vision-plain -m HuggingFaceTB/SmolVLM-Instruct -a idefics3
2024-12-02T12:18:32.147294Z  INFO mistralrs_server: avx: false, neon: true, simd128: false, f16c: false
2024-12-02T12:18:32.147341Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2024-12-02T12:18:32.147459Z  INFO mistralrs_server: Model kind is: normal (no adapters)
2024-12-02T12:18:32.147659Z  INFO candle_hf_hub: Token file not found "/Users/sauravverma/.cache/huggingface/token"
2024-12-02T12:18:32.147721Z  INFO mistralrs_core::utils::tokens: Could not load token at "/Users/sauravverma/.cache/huggingface/token", using no HF token.
2024-12-02T12:18:32.148331Z  INFO mistralrs_core::pipeline::vision: Loading `tokenizer.json` at `HuggingFaceTB/SmolVLM-Instruct`
2024-12-02T12:18:32.148780Z  INFO mistralrs_core::pipeline::vision: Loading `config.json` at `HuggingFaceTB/SmolVLM-Instruct`
2024-12-02T12:18:32.461470Z  INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model.safetensors"]
2024-12-02T12:18:32.708808Z  INFO mistralrs_core::pipeline::vision: Loading `generation_config.json` at `HuggingFaceTB/SmolVLM-Instruct`
2024-12-02T12:18:33.075364Z  INFO mistralrs_core::pipeline::vision: Loading `preprocessor_config.json` at `HuggingFaceTB/SmolVLM-Instruct`
2024-12-02T12:18:33.382419Z  INFO mistralrs_core::pipeline::vision: Loading `processor_config.json` at `HuggingFaceTB/SmolVLM-Instruct`
2024-12-02T12:18:33.382789Z  INFO mistralrs_core::pipeline::vision: Loading `tokenizer_config.json` at `HuggingFaceTB/SmolVLM-Instruct`
2024-12-02T12:18:33.691665Z  INFO mistralrs_core::pipeline::vision: Loading `chat_template.json` at `HuggingFaceTB/SmolVLM-Instruct`
2024-12-02T12:18:33.694229Z  INFO mistralrs_core::pipeline::vision: Loading model `HuggingFaceTB/SmolVLM-Instruct` on metal[4294968379].
2024-12-02T12:18:33.695273Z  INFO mistralrs_core::pipeline::vision: Model config: Idefics3Config { image_token_id: 49153, vision_config: Idefics3VisionConfig { hidden_size: 1152, intermediate_size: 4304, num_hidden_layers: 27, num_attention_heads: 16, num_channels: 3, image_size: 384, patch_size: 14, hidden_act: GeluPytorchTanh, layer_norm_eps: 1e-6 }, text_config: Config { hidden_size: 2048, intermediate_size: 8192, vocab_size: 49155, num_hidden_layers: 24, num_attention_heads: 32, num_key_value_heads: 32, use_flash_attn: false, rms_norm_eps: 1e-5, rope_theta: 273768.0, max_position_embeddings: 16384, rope_scaling: None, quantization_config: None, tie_word_embeddings: false }, scale_factor: 3 }
2024-12-02T12:18:33.703583Z  INFO mistralrs_core::utils::normal: DType selected is BF16.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 657/657 [00:05<00:00, 141.71it/s]
2024-12-02T12:18:38.992025Z  INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|im_start|>", "<|endoftext|>", eos_toks = "<end_of_utterance>", "<|im_end|>", unk_tok = <|endoftext|>
2024-12-02T12:18:38.994208Z  INFO mistralrs_server: Model loaded.
2024-12-02T12:18:38.994330Z  INFO mistralrs_core: Beginning dummy run.
2024-12-02T12:18:39.120394Z  INFO mistralrs_core: Dummy run completed in 0.12605875s.
2024-12-02T12:18:39.120421Z  INFO mistralrs_server::interactive_mode: Starting interactive loop with sampling params: SamplingParams { temperature: Some(0.1), top_k: Some(32), top_p: Some(0.1), min_p: Some(0.05), top_n_logprobs: 0, frequency_penalty: Some(0.1), presence_penalty: Some(0.1), stop_toks: None, max_len: Some(4096), logits_bias: None, n_choices: 1, dry_params: Some(DrySamplingParams { sequence_breakers: ["\n", ":", "\"", "*"], multiplier: 0.0, base: 1.75, allowed_length: 2 }) }
====================
Welcome to interactive mode! Because this model is a vision model, you can enter prompts and chat with the model.

To specify a message with an image, use the `\image` command detailed below.

Commands:
- `\help`: Display this message.
- `\exit`: Quit interactive mode.
- `\system <system message here>`:
    Add a system message to the chat without running the model.
    Ex: `\system Always respond as a pirate.`
- `\image <image URL or local path here> <message here>`:
    Add a message paired with an image. The image will be fed to the model as if it were the first item in this prompt.
    You do not need to modify your prompt for specific models.
    Ex: `\image path/to/image.jpg Describe what is in this image.`
====================
> \image /Users/sauravverma/Desktop/theMeme.png what is this image
The image is a meme. It is a humorous image that uses text and an image to make a point. The image is of a skeleton, which is the first-time founder. The skeleton is lying on a table. The skeleton is in a state of decay, with its ribs and spine visible. The skeleton is also missing its head, which is lying on the table next to the skeleton. The skeleton is surrounded by a white background.

The text on the image reads "What happened to him?" and "He believed that being good at coding was enough to succeed". The text is in yellow and black font. The text is in all capital letters. The text is in a bold font. The text is in a sans-serif font.

The image is funny because it shows the skeleton as the first-time founder. The skeleton is in a state of decay, which suggests that it has been dead for a long time. The skeleton is missing its head, which suggests that it has been dead for even longer. The skeleton is surrounded by a white background, which suggests that it has been dead for a long time.

The text on the image reads "What happened to him?" and "He believed that being good at coding was enough to succeed". The text is in yellow and black font. The text is in all capital letters. The text is in a bold font. The text is in a sans-serif font.

The image is funny because it shows the skeleton as the first-time founder. The skeleton is in a state of decay, which suggests that it has been dead for a long time. The skeleton is missing its head, which suggests that it has been dead for even longer. The skeleton is surrounded by a white background, which suggests that it has been dead for a long time.

The question in the original caption is "What happened to him?". The answer to the question is "He died".

I also tried to get an inference over the HTTP server, and it also worked fine:

➜  release git:(d5cc451b1) ./mistralrs-server -p 1234 vision-plain -m HuggingFaceTB/SmolVLM-Instruct -a idefics3
2024-12-02T12:21:29.668950Z  INFO mistralrs_server: avx: false, neon: true, simd128: false, f16c: false
2024-12-02T12:21:29.668981Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2024-12-02T12:21:29.669045Z  INFO mistralrs_server: Model kind is: normal (no adapters)
2024-12-02T12:21:29.669311Z  INFO candle_hf_hub: Token file not found "/Users/sauravverma/.cache/huggingface/token"
2024-12-02T12:21:29.669351Z  INFO mistralrs_core::utils::tokens: Could not load token at "/Users/sauravverma/.cache/huggingface/token", using no HF token.
2024-12-02T12:21:29.669677Z  INFO mistralrs_core::pipeline::vision: Loading `tokenizer.json` at `HuggingFaceTB/SmolVLM-Instruct`
2024-12-02T12:21:29.669919Z  INFO mistralrs_core::pipeline::vision: Loading `config.json` at `HuggingFaceTB/SmolVLM-Instruct`
2024-12-02T12:21:30.024476Z  INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model.safetensors"]
2024-12-02T12:21:30.274160Z  INFO mistralrs_core::pipeline::vision: Loading `generation_config.json` at `HuggingFaceTB/SmolVLM-Instruct`
2024-12-02T12:21:30.555730Z  INFO mistralrs_core::pipeline::vision: Loading `preprocessor_config.json` at `HuggingFaceTB/SmolVLM-Instruct`
2024-12-02T12:21:30.845115Z  INFO mistralrs_core::pipeline::vision: Loading `processor_config.json` at `HuggingFaceTB/SmolVLM-Instruct`
2024-12-02T12:21:30.845440Z  INFO mistralrs_core::pipeline::vision: Loading `tokenizer_config.json` at `HuggingFaceTB/SmolVLM-Instruct`
2024-12-02T12:21:31.152353Z  INFO mistralrs_core::pipeline::vision: Loading `chat_template.json` at `HuggingFaceTB/SmolVLM-Instruct`
2024-12-02T12:21:31.154599Z  INFO mistralrs_core::pipeline::vision: Loading model `HuggingFaceTB/SmolVLM-Instruct` on metal[4294968379].
2024-12-02T12:21:31.155184Z  INFO mistralrs_core::pipeline::vision: Model config: Idefics3Config { image_token_id: 49153, vision_config: Idefics3VisionConfig { hidden_size: 1152, intermediate_size: 4304, num_hidden_layers: 27, num_attention_heads: 16, num_channels: 3, image_size: 384, patch_size: 14, hidden_act: GeluPytorchTanh, layer_norm_eps: 1e-6 }, text_config: Config { hidden_size: 2048, intermediate_size: 8192, vocab_size: 49155, num_hidden_layers: 24, num_attention_heads: 32, num_key_value_heads: 32, use_flash_attn: false, rms_norm_eps: 1e-5, rope_theta: 273768.0, max_position_embeddings: 16384, rope_scaling: None, quantization_config: None, tie_word_embeddings: false }, scale_factor: 3 }
2024-12-02T12:21:31.163821Z  INFO mistralrs_core::utils::normal: DType selected is BF16.
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 657/657 [00:03<00:00, 4181.22it/s]
2024-12-02T12:21:34.530199Z  INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|im_start|>", "<|endoftext|>", eos_toks = "<end_of_utterance>", "<|im_end|>", unk_tok = <|endoftext|>
2024-12-02T12:21:34.532342Z  INFO mistralrs_server: Model loaded.
2024-12-02T12:21:34.532380Z  INFO mistralrs_core: Beginning dummy run.
2024-12-02T12:21:34.654829Z  INFO mistralrs_core: Dummy run completed in 0.122442s.
2024-12-02T12:21:34.655406Z  INFO mistralrs_server: Serving on http://0.0.0.0:1234.

@akashicMarga

@GraphicalDot I was also getting the same error; now it's working fine. In my case, I had deleted Xcode from Applications earlier, and reinstalling it fixed it. You can check whether the Metal API is available with the command `xcrun metal`: if it's available, it will show "metal: error: no input files"; otherwise, you will get the same path issue.

@sgrebnov (Contributor)

sgrebnov commented Jan 22, 2025

This seems to be related to the supports_attn_softmax logic that relies on xcrun + metal (which seems to require a full Xcode installation). As a result, Xcode must be installed to run the app with Metal acceleration enabled.

  1. Can we fall back to supports_attn_softmax enabled if we are unable to detect support? It seems macOS 13 (Ventura) and later support Metal 3: https://support.apple.com/en-us/102894 (a sketch of this fallback is at the end of this comment).
  2. Can we introduce a feature flag like metal_attn_softmax_enabled that can be used to skip attn_softmax support detection if an application targets only modern OS versions?

xcode-select --print-path
/Library/Developer/CommandLineTools

where xcrun
/usr/bin/xcrun

xcrun -sdk macosx metal -E -x metal -P -
xcrun: error: unable to find utility "metal", not a developer tool or in PATH

https://github.com/EricLBuehler/mistral.rs/blob/master/mistralrs-core/src/attention.rs#L113-L124

// Run the `xcrun` command, taking input from the `echo` command's output
let output = Command::new("xcrun")
    .arg("-sdk")
    .arg("macosx")
    .arg("metal")
    .arg("-E")
    .arg("-x")
    .arg("metal")
    .arg("-P")
    .arg("-")
    .stdin(echo.stdout.unwrap())
    .output()
    .expect("Failed to run xcrun command");

@EricLBuehler (Owner)

@sgrebnov, do you think moving this supports_attn_softmax logic to the build.rs of mistralrs-core might be a nice solution? That way we can add this logic there (try to use xcrun, falling back to not using it by default). I'll add a PR to test this out; let me know what you think!

(Note that the logic is necessary, as the Metal attn softmax uses bfloat vectorized types, which the fallback implementations for Metal < 3.10 don't support.)
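Roughly, that build.rs approach could look like the following (purely a sketch; the cfg flag name and the fallback policy are illustrative, not the actual PR):

// build.rs (sketch only): probe for a usable Metal toolchain at build time
// and expose the result as a cfg flag so the runtime check can be skipped.
use std::process::{Command, Stdio};

fn main() {
    println!("cargo:rerun-if-changed=build.rs");

    // Same probe the runtime code uses: ask xcrun to run the Metal
    // preprocessor. If xcrun or the metal tool is missing, this exits nonzero.
    let metal_usable = Command::new("xcrun")
        .args(["-sdk", "macosx", "metal", "-E", "-x", "metal", "-P", "-"])
        .stdin(Stdio::null())
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .status()
        .map(|s| s.success())
        .unwrap_or(false);

    if metal_usable {
        // Downstream code could then gate the fast path on
        // #[cfg(metal_attn_softmax)] instead of shelling out at runtime.
        println!("cargo:rustc-cfg=metal_attn_softmax");
    }
}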

@sgrebnov (Contributor)

@EricLBuehler - yeah, I would definitely be happy to help here and move this to build.rs, but I was under the impression that this should be checked at runtime, as the binaries could be built on a device with Metal >= 3.10 but run on Metal < 3.10, and the execution would fail?
