Processing in image encoding for Florence 2 #1170

Open
ir2718 opened this issue Jan 27, 2025 · 2 comments
Labels
question Further information is requested

Comments

ir2718 commented Jan 27, 2025

Question

Hi,

While having a look at the generation code for the Florence-2 model, I noticed something odd. The original inference code uses the _encode_image method to create image features. However, looking at the encode_image function used in transformers.js, the post-processing applied after the vision model's forward pass appears to be missing. Here's a minimal reproducible example:

import onnxruntime as ort

from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image

# The vision encoder was downloaded from:
# https://huggingface.co/onnx-community/Florence-2-base-ft/resolve/main/onnx/vision_encoder.onnx
ONNX_MODEL_PATH = "models/onnx/original/vision_encoder.onnx"
MODEL_NAME = "microsoft/Florence-2-base-ft"
# Image download link:
# https://upload.wikimedia.org/wikipedia/en/7/7d/Lenna_%28test_image%29.png
IMG_PATH = "lena.png"
PROMPT = "<MORE_DETAILED_CAPTION>"

processor = AutoProcessor.from_pretrained(
    MODEL_NAME, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, trust_remote_code=True)

image = Image.open(IMG_PATH)
inputs = processor(text=PROMPT, images=image, return_tensors="pt")

# Reference image features from the original (remote) modeling code
hf_out = model._encode_image(inputs["pixel_values"])

# Image features from the exported ONNX vision encoder
ort_vision_tower = ort.InferenceSession(ONNX_MODEL_PATH)
ort_out = ort_vision_tower.run(
    None, {"pixel_values": inputs["pixel_values"].numpy()})[0]

print(hf_out.cpu().detach().numpy())
print()
print(ort_out)

The feature differences are pretty big:

[[[-0.4047455   0.51958734 -0.23121671 ...  1.0019573  -0.46846968
    0.5289913 ]
  [-0.08135182 -2.0622678  -0.50597775 ...  0.38061845 -0.7858853
   -1.247189  ]
  [ 0.69417834 -1.926735   -0.691345   ... -0.17574754 -0.98472327
   -1.2420652 ]
  ...
  [ 0.018062    1.2185848  -0.04483193 ...  0.61767036 -0.1832848
    0.9324351 ]
  [-0.13765828  0.7120823   0.12478658 ... -0.44853052 -0.6390534
    0.37095645]
  [ 0.58084226  1.6617624  -0.43527135 ... -0.92560166 -0.47037867
   -0.81996024]]]

[[[-0.52661824  0.508744   -0.24130312 ...  0.91191643 -0.39472336
    1.1632534 ]
  [-0.18091503 -2.2187433  -0.7923498  ...  0.6103708  -0.49637306
   -0.9830185 ]
  [ 0.3002218  -1.9726763  -1.1151179  ... -0.11572987 -0.6870862
   -0.96058726]
  ...
  [-0.08202907  0.8105656  -0.1748765  ...  1.0833437  -0.41167092
    1.2495995 ]
  [-0.01531404  0.6044417  -0.06392197 ... -0.30775025 -0.5735508
    0.6775356 ]
  [ 0.74322057  1.4011574  -0.5277405  ... -0.61488384 -0.40253094
   -0.8440974 ]]]
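
To quantify the gap rather than eyeball the printed tensors, the following sketch (assuming hf_out and ort_out from the script above, and that both outputs have the same shape) reports the maximum absolute difference and the mean per-token cosine similarity:

import numpy as np

hf = hf_out.cpu().detach().numpy()

# Largest elementwise deviation between the two feature tensors
print("max abs diff:", np.abs(hf - ort_out).max())

# Cosine similarity between corresponding token embeddings, averaged over tokens
hf_flat = hf.reshape(-1, hf.shape[-1])
ort_flat = ort_out.reshape(-1, ort_out.shape[-1])
cos = (hf_flat * ort_flat).sum(-1) / (
    np.linalg.norm(hf_flat, axis=-1) * np.linalg.norm(ort_flat, axis=-1))
print("mean cosine similarity:", cos.mean())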

Am I missing something here or is this a potential bug?

ir2718 added the question (Further information is requested) label on Jan 27, 2025
xenova (Collaborator) commented Feb 8, 2025

This might be due to image-reading differences between JavaScript and Python. Could you try passing the exact same data (e.g., an all-zero tensor) to see if the difference is still there? Also, remember to load the full-precision model in Transformers.js, as this could be another source of differences.
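
Note that a blank PIL image still goes through the image processor, so to rule out preprocessing entirely the raw tensor itself can be fixed. A minimal sketch building on the script above (the all-zero tensor reuses the shape the processor produced, so no assumption about the expected resolution is needed):

import torch

# Identical, fixed input for both backends; bypasses image decoding and resizing
zeros = torch.zeros_like(inputs["pixel_values"])

hf_zero = model._encode_image(zeros)
ort_zero = ort_vision_tower.run(None, {"pixel_values": zeros.numpy()})[0]

print(hf_zero.cpu().detach().numpy())
print(ort_zero)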

ir2718 (Author) commented Feb 8, 2025

I've modified the minimal example by creating a blank image as follows:

image = Image.new("RGB", (512, 512))

However, the results are still different:

[[[ 0.7425719   0.01071071 -0.06678165 ...  1.2542837  -0.2884356
   -0.69630283]
  [ 0.9451557  -0.30704248  0.69962746 ... -0.72856545  0.15360388
   -1.0232862 ]
  [ 1.2876275   0.5174419  -0.2222641  ... -0.32981807  0.44000283
   -1.1317426 ]
  ...
  [ 0.20901655 -0.39984626  0.1699695  ...  1.923425   -0.6329966
   -0.91588783]
  [ 0.20724754 -0.40770236  0.42595854 ...  1.7196184  -0.38901007
   -1.0207707 ]
  [-0.08099215 -0.3391677  -0.17075935 ...  1.9568288  -0.02066579
   -1.1172475 ]]]

[[[ 0.0817447   0.41585156 -0.03429735 ...  1.6622943  -0.43160683
    0.08325118]
  [ 0.1999208   0.37867606  0.47249985 ... -0.29732558 -0.00243429
   -0.6437535 ]
  [ 0.27014455  1.0596321   0.04975559 ...  0.2688354   0.25734758
   -0.3757942 ]
  ...
  [-0.09077676  0.39021495  0.19065166 ...  1.6975157  -0.41929
   -0.6461764 ]
  [-0.18240517  0.71244407  0.34832954 ...  1.4980354  -0.24869794
   -0.6761538 ]
  [-0.51288855  0.36046848 -0.42776367 ...  0.80509955 -0.21319357
   -0.94580245]]]

To clear up any misunderstanding: the model I used was converted in full precision. Unfortunately, using the model in transformers.js is not an option for me, as my use case requires Python.
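
One way to localize where the two paths diverge is to compare the ONNX output against intermediate stages of _encode_image. The sketch below is a diagnostic under assumptions: the method name vision_tower.forward_features_unpool is taken from the Florence-2 remote modeling code (modeling_florence2.py) and is worth double-checking against the pinned revision, and the helper max_abs_diff is hypothetical.

import numpy as np
import torch

def max_abs_diff(a, b):
    # Report shapes instead of failing when two stages differ in shape
    if a.shape != b.shape:
        return f"shape mismatch: {a.shape} vs {b.shape}"
    return float(np.abs(a - b).max())

with torch.no_grad():
    # Raw vision-tower features, before positional embedding, pooling,
    # projection, and layer norm (attribute name from the remote code)
    tower_out = model.vision_tower.forward_features_unpool(
        inputs["pixel_values"]).numpy()
    # Full reference path, including the post-processing
    full_out = model._encode_image(inputs["pixel_values"]).numpy()

print("ONNX vs. raw vision tower:", max_abs_diff(tower_out, ort_out))
print("ONNX vs. full _encode_image:", max_abs_diff(full_out, ort_out))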
