
SAM: How does the decoder handle output resolution? #8545

Closed
hashJoe opened this issue Oct 15, 2024 · 2 comments
hashJoe commented Oct 15, 2024

I am currently integrating a fine-tuned SAM model into CVAT, following the integration pattern of the existing SAM model. My integration involved a few key steps:

  1. Function Addition: I introduced a new function within the serverless/pytorch/ directory.
  2. Model Conversion: I converted the fine-tuned decoder from PyTorch to ONNX using the ONNX exporter repository (see the export sketch after this list).
  3. Plugin Creation: I developed a new plugin located at cvat-ui/plugins/ and registered it via the CLIENT_PLUGINS environment variable.
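
For reference, the decoder export in step 2 can be done with the SamOnnxModel wrapper from the segment-anything repository. Below is a minimal sketch; the input/output names follow the official exporter, but the checkpoint path and model type are placeholders for a fine-tuned model:

```python
import torch
from segment_anything import sam_model_registry
from segment_anything.utils.onnx import SamOnnxModel

# Placeholder checkpoint path and model type for a fine-tuned SAM.
sam = sam_model_registry["vit_b"](checkpoint="sam_finetuned.pth")
onnx_model = SamOnnxModel(sam, return_single_mask=True)

embed_dim = sam.prompt_encoder.embed_dim
embed_size = sam.prompt_encoder.image_embedding_size
dummy_inputs = {
    "image_embeddings": torch.randn(1, embed_dim, *embed_size, dtype=torch.float),
    "point_coords": torch.randint(0, 1024, (1, 5, 2), dtype=torch.float),
    "point_labels": torch.randint(0, 4, (1, 5), dtype=torch.float),
    "mask_input": torch.randn(1, 1, 4 * embed_size[0], 4 * embed_size[1], dtype=torch.float),
    "has_mask_input": torch.tensor([1], dtype=torch.float),
    # orig_im_size drives the final mask resize inside the exported graph.
    "orig_im_size": torch.tensor([1500, 2250], dtype=torch.float),
}

with torch.no_grad():
    torch.onnx.export(
        onnx_model,
        tuple(dummy_inputs.values()),
        "sam_decoder.onnx",
        input_names=list(dummy_inputs.keys()),
        output_names=["masks", "iou_predictions", "low_res_masks"],
        dynamic_axes={
            "point_coords": {1: "num_points"},
            "point_labels": {1: "num_points"},
        },
        opset_version=17,
    )
```

A decoder exported this way returns the "masks" output already resized to whatever orig_im_size is fed at inference time.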

During this integration, adjustments were made to the following files in my plugin:
src/ts/index.tsx
src/ts/inference.worker.ts

The implementation successfully generates masks, and no errors arise during execution. However, the rendered mask on CVAT does not align with the image, resulting in a mismatched mask visualization.

Upon troubleshooting, I identified a disparity in the mask dimensions produced by my ONNX model's decoder and the existing integrated model:

Current SAM Decoder ONNX Dimensions: [1, 1, 1221, 1233]

My Converted SAM Decoder ONNX Dimensions: [1, 1, 2048, 2048]

The dimensions [1, 1, 2048, 2048] match the image resolution (2048×2048) exactly, which leads me to suspect this is what causes the improper mask visualization.

  1. How does the current SAM decoder model manage resolution adjustment, and is this mechanism embedded within the ONNX model itself?
  2. What adjustments can I make to my model or integration approach to synchronize the mask resolution with the existing framework?
  3. Is the script used for exporting the SAM decoder to ONNX available somewhere?
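
On question 1, for context: in the official segment-anything exporter the resolution handling is embedded in the ONNX graph itself. SamOnnxModel appends a postprocessing step that upscales the low-resolution (256×256) decoder output to the size passed in via the orig_im_size input. A sketch paraphrasing that logic, assuming img_size = 1024 as in the released SAM models:

```python
import torch
import torch.nn.functional as F

def postprocess_masks(masks: torch.Tensor, orig_im_size: torch.Tensor,
                      img_size: int = 1024) -> torch.Tensor:
    """Upscale low-res decoder masks (e.g. 1x1x256x256) to the original image size.

    Paraphrases the postprocessing the official exporter bakes into the ONNX graph.
    """
    # 1. Upsample to the padded square input size the image encoder saw.
    masks = F.interpolate(masks, size=(img_size, img_size),
                          mode="bilinear", align_corners=False)

    # 2. Remove the padding that was added to make the image square.
    h, w = orig_im_size[0].item(), orig_im_size[1].item()
    scale = img_size / max(h, w)
    prepadded_h, prepadded_w = int(h * scale + 0.5), int(w * scale + 0.5)
    masks = masks[..., :prepadded_h, :prepadded_w]

    # 3. Resize to the original image resolution.
    return F.interpolate(masks, size=(int(h), int(w)),
                         mode="bilinear", align_corners=False)
```

Since the converted decoder already returns masks at the full 2048×2048 image resolution, this path appears to be working; the remaining difference from CVAT's integrated model looks like the cropping step discussed below.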

Additionally, I assume the extra x/y coordinate outputs are the bounding box of the generated mask, so I implemented this in my code as well.
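
Along those lines, the [1, 1, 1221, 1233] output of the current model suggests the mask is cropped to the object's tight bounding box before being returned. A minimal sketch of that kind of postprocessing on a full-resolution mask (squeezed to H×W); the function name and threshold are placeholders, not CVAT's actual API:

```python
import numpy as np

def crop_mask_to_bbox(mask: np.ndarray, threshold: float = 0.0):
    """Crop a full-resolution mask to its tight bounding box.

    Returns the cropped mask plus (xtl, ytl, xbr, ybr) in image coordinates,
    i.e. mask dimensions relative to the object, box coordinates relative
    to the image.
    """
    binary = mask > threshold
    ys, xs = np.nonzero(binary)
    if ys.size == 0:
        return binary, (0, 0, 0, 0)  # empty mask: nothing to crop
    ytl, ybr = ys.min(), ys.max()
    xtl, xbr = xs.min(), xs.max()
    cropped = binary[ytl : ybr + 1, xtl : xbr + 1]
    return cropped, (int(xtl), int(ytl), int(xbr), int(ybr))
```

Under that assumption, a 1221×1233 mask would simply be the object's bounding-box region cut out of the 2048×2048 image.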

Any insights or guidance on this issue would be greatly appreciated.


hashJoe commented Oct 15, 2024

Is my observation correct that the mask dimensions are relative to the segmented object, while the x/y coordinates are relative to the image?


hashJoe commented Oct 16, 2024

This is what I was looking for: issue-1666290429

hashJoe closed this as completed Oct 16, 2024