Merged
202 commits
a898595
initial comment
SangbumChoi Jul 30, 2024
02ebabe
test
SangbumChoi Jul 31, 2024
48d56f4
Merge branch 'sam2' of https://github.com/SangbumChoi/transformers in…
SangbumChoi Jul 31, 2024
a68ab5c
initial conversion for outline
SangbumChoi Jul 31, 2024
a6cd9d1
intermediate commit for configuration
SangbumChoi Aug 1, 2024
29f56e2
chore:init files for sam2
RUFFY-369 Aug 1, 2024
47324b2
adding arbitary undefined config
SangbumChoi Aug 1, 2024
f07991f
Merge pull request #1 from SangbumChoi/sam2_hf
SangbumChoi Aug 1, 2024
9f66cc9
check
SangbumChoi Aug 1, 2024
9ff3fa8
add vision
SangbumChoi Aug 2, 2024
289a0c0
make style
SangbumChoi Aug 2, 2024
241fbaf
init sam2 base model
haithamkhedr Aug 2, 2024
36b72e4
Fix imports
haithamkhedr Aug 2, 2024
e637647
Linting
haithamkhedr Aug 2, 2024
f022b0e
chore:sam to sam2 classes
RUFFY-369 Aug 2, 2024
6e1c1bf
Linting
haithamkhedr Aug 2, 2024
4df0ef3
Add sam2 to models.__init__
haithamkhedr Aug 2, 2024
dadfc27
chore:match prompt encoder with sam2 code
RUFFY-369 Aug 2, 2024
f43f41b
chore:prepare kwargs for mask decoder
RUFFY-369 Aug 2, 2024
66d6fb8
Merge pull request #3 from SangbumChoi/sam2_hf
SangbumChoi Aug 3, 2024
dc2cb88
Merge pull request #2 from SangbumChoi/sam2_config
SangbumChoi Aug 3, 2024
6b02d39
Add image/video predictors
haithamkhedr Aug 5, 2024
3f4041b
Add CUDA kernel
haithamkhedr Aug 6, 2024
bc9e3c9
Add output classes
haithamkhedr Aug 6, 2024
2eb495b
linting
haithamkhedr Aug 6, 2024
cefa0d9
Add logging info
haithamkhedr Aug 7, 2024
cedbdde
Merge branch 'main' of https://github.com/SangbumChoi/transformers in…
SangbumChoi Sep 22, 2024
5ab3053
Merge branch 'construct_sam2_base' of https://github.com/haithamkhedr…
SangbumChoi Sep 22, 2024
85dcf19
tmp commit
SangbumChoi Oct 1, 2024
54dff82
docs for sam2
SangbumChoi Oct 6, 2024
18e38a9
Merge branch 'main' of https://github.com/SangbumChoi/transformers in…
SangbumChoi Oct 19, 2024
b3d5139
enable image processing
SangbumChoi Oct 19, 2024
f6c4364
check difference of original SAM2
SangbumChoi Oct 20, 2024
e0176ef
enable promptencoder of sam2
SangbumChoi Oct 20, 2024
aceca2b
fix promprencoder
SangbumChoi Oct 21, 2024
57ca871
Confirmed that PromptEncoder is exactly same (Be aware of bfloat16 an…
SangbumChoi Oct 22, 2024
355fe4e
Confirmed that ImageEncoder is exactly same (Be aware the linting of …
SangbumChoi Oct 24, 2024
9990a8e
Confirmed that MaskDecoder is exactly same (TO DO: lint variable name)
SangbumChoi Oct 26, 2024
1749f19
SamModel is now available (Need more chore for name)
SangbumChoi Oct 26, 2024
9ed9718
make fix-copies
SangbumChoi Oct 26, 2024
7d73cbd
make style
SangbumChoi Oct 26, 2024
8c84a54
make CI happy
SangbumChoi Oct 26, 2024
37c2ab8
Merge branch 'main' of https://github.com/SangbumChoi/transformers in…
SangbumChoi Oct 27, 2024
ab46f71
Refactor VisionEncoder and PostioinEmbedding
SangbumChoi Oct 27, 2024
9182af6
TO DO : fix the image_embeddings and sparse_embeddings part
SangbumChoi Oct 28, 2024
5690eca
pure image inference done
SangbumChoi Oct 30, 2024
4c20a80
reusable features fix and make style
SangbumChoi Oct 30, 2024
3dc0058
Merge branch 'main' into sam2
SangbumChoi Oct 30, 2024
9003953
styling
SangbumChoi Nov 3, 2024
0e64e85
refactor memoryattention
SangbumChoi Nov 13, 2024
c86b3fe
tmp
SangbumChoi Nov 20, 2024
0a5cedc
tmp
SangbumChoi Dec 3, 2024
d1734f3
Merge branch 'main' of https://github.com/SangbumChoi/transformers in…
SangbumChoi Dec 3, 2024
9d5994e
refactor memoryencoder
SangbumChoi Dec 3, 2024
e1824fb
TO DO : fix the image_encoder shape
SangbumChoi Dec 3, 2024
5079e9e
conversion finish
SangbumChoi Dec 3, 2024
d62bcdf
Merge branch 'main' into sam2
SangbumChoi Jan 16, 2025
0d9e2eb
Merge branch 'main' of https://github.com/SangbumChoi/transformers in…
SangbumChoi Mar 15, 2025
89121bf
Merge branch 'main' into sam2
SangbumChoi Mar 15, 2025
b35454a
make style
Mar 15, 2025
4963c6b
remove video model
Mar 15, 2025
f68722c
lint
SangbumChoi Mar 15, 2025
68091c7
Merge branch 'main' into sam2
SangbumChoi Mar 15, 2025
1420e9a
change
SangbumChoi Mar 15, 2025
e32ab85
python utils/check_docstringspy --check_all
SangbumChoi Mar 15, 2025
234839e
python utils/check_config_attributes.py
SangbumChoi Mar 15, 2025
3284eee
remove copies for sam2promptencoder due to configuration
SangbumChoi Mar 15, 2025
301b9b4
Merge branch 'main' into sam2
SangbumChoi Apr 9, 2025
94b7c5d
change __init__.py
SangbumChoi Apr 9, 2025
5e1408d
remove tensorflow version
SangbumChoi Apr 9, 2025
61b3219
fix that to not use direct comparison
SangbumChoi Apr 9, 2025
864ba3d
make style
SangbumChoi Apr 9, 2025
48e3337
add missing import
SangbumChoi Apr 9, 2025
2afdb90
Merge branch 'main' of https://github.com/SangbumChoi/transformers in…
SangbumChoi Apr 23, 2025
568998b
Merge branch 'main' into sam2
SangbumChoi Apr 25, 2025
e31d02f
Merge branch 'main' into sam2
SangbumChoi Apr 29, 2025
bfebdaf
fix image_embedding_size
yonigozlan May 22, 2025
0b0f09b
Merge remote-tracking branch 'upstream/main' into sam2
yonigozlan May 22, 2025
ff5d788
refactor Sam2 Attention
yonigozlan May 23, 2025
5e9f23b
Merge branch 'main' into sam2
SangbumChoi May 24, 2025
3a02a89
add fully working video inference (refactoring todo)
yonigozlan May 30, 2025
dd52fce
Merge branch 'sam2' of https://github.com/SangbumChoi/transformers in…
yonigozlan May 30, 2025
485f697
clarify _prepare_memory_conditioned_features
yonigozlan Jun 3, 2025
2dfefb3
simplify modeling code, remove unused paths
yonigozlan Jun 3, 2025
0509c7d
use one model
yonigozlan Jun 3, 2025
6130231
use auto_docstring
yonigozlan Jun 3, 2025
45c7e24
refactor rope embeddings
yonigozlan Jun 6, 2025
9f1245f
nit
yonigozlan Jun 6, 2025
6a59a3e
not using multimask when several points given
yonigozlan Jun 9, 2025
2b092dc
Merge branch 'main' of https://github.com/SangbumChoi/transformers in…
SangbumChoi Jun 23, 2025
79055ad
add all sam2.1
SangbumChoi Jun 23, 2025
701748c
add video tmp
SangbumChoi Jun 24, 2025
c3330c6
add Sam2VideoSessionState + fast image proc + video proc
yonigozlan Jun 25, 2025
0aa02d6
Merge branch 'sam2' of https://github.com/SangbumChoi/transformers in…
yonigozlan Jun 25, 2025
c8e56aa
remove init_states from model
yonigozlan Jun 25, 2025
82e0a53
fix batch inference
yonigozlan Jun 26, 2025
9582278
add image integration tests
yonigozlan Jun 26, 2025
a26d854
Merge remote-tracking branch 'upstream/main' into sam2
yonigozlan Jun 27, 2025
953fc0c
Merge remote-tracking branch 'upstream/main' into sam2
yonigozlan Jun 30, 2025
9d5c7c0
uniformize modeling code with other sam models and use modular
yonigozlan Jun 30, 2025
aebcb34
pass vision tests an most model tests
yonigozlan Jul 2, 2025
978b02e
All tests passing
yonigozlan Jul 2, 2025
c145560
add offloading inference state and video to cpu
yonigozlan Jul 3, 2025
1082c02
fix inference from image embedding and existing mask
yonigozlan Jul 3, 2025
e1d689c
fix multi_boxes mask inference
yonigozlan Jul 4, 2025
ca6d2eb
Fix batch images + batch boxes inference
yonigozlan Jul 7, 2025
74e432a
improve processing for image inference
yonigozlan Jul 7, 2025
0b8476f
add support for mask generation pipeline
yonigozlan Jul 8, 2025
ca67983
add support for get_connected_components post processing in mask gene…
yonigozlan Jul 8, 2025
6fabcf1
add fast image processor sam, image processor tests and use modular f…
yonigozlan Jul 8, 2025
8d8d049
Merge remote-tracking branch 'upstream/main' into sam2
yonigozlan Jul 8, 2025
ace0b54
fix mistake in sam after #39120
yonigozlan Jul 8, 2025
633f239
fix init weights
yonigozlan Jul 8, 2025
ee5ee97
refactor convert
SangbumChoi Jul 9, 2025
37ea339
add integration tests for video + other improvements
yonigozlan Jul 9, 2025
f45e1d6
add needed missing docstrings
yonigozlan Jul 9, 2025
219215a
Merge pull request #4 from SangbumChoi/sam2_sbchoi
SangbumChoi Jul 10, 2025
3623926
Improve docstrings and
yonigozlan Jul 10, 2025
1bbcda3
Merge branch 'sam2' of https://github.com/SangbumChoi/transformers in…
yonigozlan Jul 10, 2025
5fe82fe
improve inference speed by avoiding cuda sync
yonigozlan Jul 11, 2025
201610b
Merge branch 'main' of https://github.com/SangbumChoi/transformers in…
SangbumChoi Jul 11, 2025
e89e9b4
add test
SangbumChoi Jul 11, 2025
4806f29
skip test for vision_model
SangbumChoi Jul 11, 2025
adbf963
minor fix for vision_model
SangbumChoi Jul 11, 2025
8529811
fix vision_model by adding sam2model and change the torch dependencies
SangbumChoi Jul 11, 2025
2b52dc8
remove patch_size
SangbumChoi Jul 11, 2025
5e974f0
remove image_embedding_size
SangbumChoi Jul 12, 2025
0677b7f
fix patch_size
SangbumChoi Jul 12, 2025
d72e261
fix test
SangbumChoi Jul 12, 2025
ed237d0
make style
SangbumChoi Jul 12, 2025
be8d7a6
Separate hieradet and vision encoder in sam2
yonigozlan Jul 12, 2025
b4ca616
Merge branch 'sam2' of https://github.com/SangbumChoi/transformers in…
yonigozlan Jul 12, 2025
3a1e6b0
fixup
yonigozlan Jul 12, 2025
4296e75
review changes part 1
yonigozlan Jul 14, 2025
f6ea5c6
remove MemoryEncoderConfig and MemoryAttentionConfig
yonigozlan Jul 14, 2025
5a24d7a
pass q_stride instead of q_pool module
yonigozlan Jul 14, 2025
109525e
add inference on streamed videos
yonigozlan Jul 15, 2025
e3319d5
explicitely process streamed frames
yonigozlan Jul 15, 2025
f75e04d
nit
yonigozlan Jul 15, 2025
bb107d9
Improve docstrings in Sam2Model
yonigozlan Jul 15, 2025
93bc44d
update sam2 modeling with better gestion of inference state and cache…
yonigozlan Jul 17, 2025
a8ded18
improve video inference api
yonigozlan Jul 17, 2025
589fd3b
change inference_state to inference_session
yonigozlan Jul 17, 2025
f898f69
Merge remote-tracking branch 'upstream/main' into sam2
yonigozlan Jul 17, 2025
f9f09fe
use modular for Sam2Model
yonigozlan Jul 18, 2025
236a386
fix convert sam2 hf
yonigozlan Jul 18, 2025
e76d48e
Merge branch 'main' of https://github.com/SangbumChoi/transformers in…
SangbumChoi Jul 19, 2025
2e85c00
modular
SangbumChoi Jul 19, 2025
5fc6b1c
Update src/transformers/models/sam2/video_processing_sam2.py
SangbumChoi Jul 19, 2025
13c878d
fix minor config
SangbumChoi Jul 19, 2025
2e757ba
Merge branch 'sam2' of https://github.com/SangbumChoi/transformers in…
SangbumChoi Jul 19, 2025
a5e2429
fix attention loading error
SangbumChoi Jul 21, 2025
1c74fa3
update modeling tests to use hub checkpoints
yonigozlan Jul 21, 2025
d79fdd0
Use CI A10 runner for integration tests values + higher tolerance for…
yonigozlan Jul 21, 2025
b81a6a2
PR review part 1
yonigozlan Jul 21, 2025
c3ea031
fix doc
yonigozlan Jul 23, 2025
ae98e30
nit improvements
yonigozlan Jul 23, 2025
cddfbd9
enforce one input format for points, labels and boxes
yonigozlan Jul 24, 2025
3067c7b
nit
yonigozlan Jul 24, 2025
8dbf74c
last few nits from PR review
yonigozlan Jul 24, 2025
a9e4e69
fix style
yonigozlan Jul 24, 2025
b5ff003
fix the input type
SangbumChoi Jul 27, 2025
335dd59
fix docs
SangbumChoi Jul 27, 2025
1951aea
add sam2 model as conversion script
SangbumChoi Jul 27, 2025
360fd6a
improve sam2 doc
yonigozlan Jul 28, 2025
ec5aa5b
Merge branch 'sam2' of https://github.com/SangbumChoi/transformers in…
yonigozlan Jul 28, 2025
f058630
add rough necessarry changes
yonigozlan Jul 29, 2025
9824acf
first working edgetam
yonigozlan Jul 29, 2025
5bf8ee2
fix issue with object pointers
yonigozlan Jul 30, 2025
fcdcc2a
Use modular as much as possible
yonigozlan Jul 30, 2025
90b17d2
Merge remote-tracking branch 'upstream/main' into sam2
yonigozlan Jul 30, 2025
92088f2
nit fixes + optimization
yonigozlan Jul 30, 2025
b7e8e91
Merge remote-tracking branch 'SangbumChoi-transformers/sam2' into add…
yonigozlan Jul 30, 2025
978c4e7
refactor spatial perceiver
yonigozlan Jul 31, 2025
03eac6f
Merge remote-tracking branch 'upstream/main' into add-edgetam
yonigozlan Sep 3, 2025
6d920ee
cleanup after merge
yonigozlan Sep 3, 2025
d775583
add working edgetam
yonigozlan Sep 4, 2025
c262c50
improve perceiver resampler code
yonigozlan Sep 8, 2025
7c8c935
simplify/unify rope attention logic
yonigozlan Sep 8, 2025
6116bee
Improve comments in apply_rotary_pos_emb_2d
yonigozlan Sep 8, 2025
5584a98
add working tests
yonigozlan Sep 9, 2025
d36e302
fix test timmwrapper
yonigozlan Sep 9, 2025
902b5e2
add docs
yonigozlan Sep 9, 2025
ecc5a89
make fixup
yonigozlan Sep 9, 2025
d0e7243
Merge branch 'main' into add-edgetam
yonigozlan Sep 9, 2025
e88e7d3
nits
yonigozlan Sep 9, 2025
82c834a
Merge branch 'add-edgetam' of https://github.com/yonigozlan/transform…
yonigozlan Sep 9, 2025
e7532cf
fix modular
yonigozlan Sep 9, 2025
1564656
fix modular
yonigozlan Sep 9, 2025
fa4f89f
Merge remote-tracking branch 'upstream/main' into add-edgetam
yonigozlan Sep 12, 2025
57f7cb2
PR review part 1
yonigozlan Sep 12, 2025
8898572
split apply_rotary_pos_emb_2d
yonigozlan Sep 12, 2025
c3d2f00
Merge remote-tracking branch 'upstream/main' into add-edgetam
yonigozlan Sep 25, 2025
7c203c4
add granularity to _prepare_memory_conditioned_features
yonigozlan Sep 25, 2025
429c6b9
Merge remote-tracking branch 'upstream/main' into add-edgetam
yonigozlan Sep 25, 2025
e6808d2
add dates to doc
yonigozlan Sep 25, 2025
3154c6f
add separate mlp for memory attention
yonigozlan Sep 25, 2025
509f06e
Fix memory on wrong device
yonigozlan Sep 26, 2025
4556b9f
store processed frames in dict
yonigozlan Sep 26, 2025
1af8481
update checkpoints in tests
yonigozlan Sep 29, 2025
74fd8e8
Merge branch 'main' into add-edgetam
yonigozlan Sep 29, 2025
35a7145
update dates
yonigozlan Sep 29, 2025
4 changes: 4 additions & 0 deletions docs/source/en/_toctree.yml
@@ -1033,6 +1033,10 @@
title: DePlot
- local: model_doc/donut
title: Donut
- local: model_doc/edgetam
title: EdgeTAM
- local: model_doc/edgetam_video
title: EdgeTamVideo
- local: model_doc/emu3
title: Emu3
- local: model_doc/evolla
331 changes: 331 additions & 0 deletions docs/source/en/model_doc/edgetam.md
@@ -0,0 +1,331 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->
*This model was released on 2025-01-13 and added to Hugging Face Transformers on 2025-09-29.*
<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
</div>
</div>

# EdgeTAM

## Overview

The EdgeTAM model was proposed in [EdgeTAM: On-Device Track Anything Model](https://huggingface.co/papers/2501.07256) by Chong Zhou, Chenchen Zhu, Yunyang Xiong, Saksham Suri, Fanyi Xiao, Lemeng Wu, Raghuraman Krishnamoorthi, Bo Dai, Chen Change Loy, Vikas Chandra, and Bilge Soran.

EdgeTAM is an efficient adaptation of SAM 2 that introduces a 2D Spatial Perceiver architecture to optimize memory attention mechanisms for real-time video segmentation on mobile devices.

The abstract from the paper is the following:

*On top of Segment Anything Model (SAM), SAM 2 further extends its capability from image to video inputs through a memory bank mechanism and obtains a remarkable performance compared with previous methods, making it a foundation model for video segmentation task. In this paper, we aim at making SAM 2 much more efficient so that it even runs on mobile devices while maintaining a comparable performance. Despite several works optimizing SAM for better efficiency, we find they are not sufficient for SAM 2 because they all focus on compressing the image encoder, while our benchmark shows that the newly introduced memory attention blocks are also the latency bottleneck. Given this observation, we propose EdgeTAM, which leverages a novel 2D Spatial Perceiver to reduce the computational cost. In particular, the proposed 2D Spatial Perceiver encodes the densely stored frame-level memories with a lightweight Transformer that contains a fixed set of learnable queries. Given that video segmentation is a dense prediction task, we find preserving the spatial structure of the memories is essential so that the queries are split into global-level and patch-level groups. We also propose a distillation pipeline that further improves the performance without inference overhead. As a result, EdgeTAM achieves 87.7, 70.0, 72.3, and 71.7 J&F on DAVIS 2017, MOSE, SA-V val, and SA-V test, while running at 16 FPS on iPhone 15 Pro Max.*

This model was contributed by [yonigozlan](https://huggingface.co/yonigozlan).
The original code can be found [here](https://github.com/facebookresearch/EdgeTAM).

## Usage example

### Automatic Mask Generation with Pipeline

EdgeTAM can be used for automatic mask generation to segment all objects in an image using the `mask-generation` pipeline:

```python
>>> from transformers import pipeline

>>> generator = pipeline("mask-generation", model="yonigozlan/edgetam-1", device=0)
>>> image_url = "https://huggingface.co/datasets/hf-internal-testing/sam2-fixtures/resolve/main/truck.jpg"
>>> outputs = generator(image_url, points_per_batch=64)

>>> len(outputs["masks"]) # Number of masks generated
39
```
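
As a rough illustration of what `points_per_batch` controls: automatic mask generation prompts the model with a regular grid of points and feeds them through in chunks. The sketch below is plain Python, not the pipeline's own code. The 32-points-per-side grid follows the original SAM mask generator and is an assumption here, and the helper names `build_point_grid` and `batched` are hypothetical.

```python
def build_point_grid(points_per_side):
    """Return cell-centered (x, y) points covering the unit square."""
    offset = 1.0 / (2 * points_per_side)
    step = (1.0 - 2 * offset) / (points_per_side - 1) if points_per_side > 1 else 0.0
    coords = [offset + i * step for i in range(points_per_side)]
    return [(x, y) for y in coords for x in coords]


def batched(items, batch_size):
    """Yield consecutive chunks of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


grid = build_point_grid(32)        # assumed grid density, for illustration only
batches = list(batched(grid, 64))  # points_per_batch=64, as in the example above
print(len(grid), len(batches))     # 1024 16
```

A larger `points_per_batch` means fewer forward passes at the cost of more memory per pass.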

### Basic Image Segmentation

#### Single Point Click

You can segment objects by providing a single point click on the object you want to segment:

```python
>>> from transformers import Sam2Processor, EdgeTamModel, infer_device
>>> import torch
>>> from PIL import Image
>>> import requests

>>> device = infer_device()

>>> model = EdgeTamModel.from_pretrained("yonigozlan/edgetam-1").to(device)
>>> processor = Sam2Processor.from_pretrained("yonigozlan/edgetam-1")

>>> image_url = "https://huggingface.co/datasets/hf-internal-testing/sam2-fixtures/resolve/main/truck.jpg"
>>> raw_image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")

>>> input_points = [[[[500, 375]]]] # Single point click, 4 dimensions (image_dim, object_dim, point_per_object_dim, coordinates)
>>> input_labels = [[[1]]] # 1 for positive click, 0 for negative click, 3 dimensions (image_dim, object_dim, point_label)

>>> inputs = processor(images=raw_image, input_points=input_points, input_labels=input_labels, return_tensors="pt").to(model.device)

>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> masks = processor.post_process_masks(outputs.pred_masks.cpu(), inputs["original_sizes"])[0]

>>> # The model outputs multiple candidate masks, each with a predicted quality (IoU) score
>>> print(f"Generated {masks.shape[1]} masks with shape {masks.shape}")
Generated 3 masks with shape torch.Size([1, 3, 1200, 1800])
>>> print(f"IoU scores: {outputs.iou_scores.squeeze()}")
IoU scores: tensor([0.0463, 0.4859, 0.7616], device='cuda:0')
```
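
The nesting depth of the prompts is easy to get wrong, so a quick shape check before calling the processor can help. This is a minimal sketch in plain Python; `nested_shape` is a hypothetical helper, not part of the library, and it assumes the nested lists are rectangular.

```python
def nested_shape(value):
    """Return the shape of a rectangular nested list, e.g. [[1, 2]] -> (1, 2)."""
    shape = []
    while isinstance(value, list):
        shape.append(len(value))
        value = value[0] if value else None
    return tuple(shape)


input_points = [[[[500, 375]]]]  # (image, object, point, xy)
input_labels = [[[1]]]           # (image, object, point)

points_shape = nested_shape(input_points)
labels_shape = nested_shape(input_labels)

# Points carry one extra trailing dimension of size 2 for the (x, y) pair.
assert len(points_shape) == 4 and points_shape[-1] == 2
assert points_shape[:3] == labels_shape
print(points_shape, labels_shape)  # (1, 1, 1, 2) (1, 1, 1)
```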

#### Multiple Points for Refinement

You can provide multiple points to refine the segmentation:

```python
>>> # Add a second positive point to refine the mask
>>> input_points = [[[[500, 375], [1125, 625]]]] # Multiple points for refinement
>>> input_labels = [[[1, 1]]] # Both positive clicks

>>> inputs = processor(images=raw_image, input_points=input_points, input_labels=input_labels, return_tensors="pt").to(device)

>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> masks = processor.post_process_masks(outputs.pred_masks.cpu(), inputs["original_sizes"])[0]
>>> print(f"IoU scores: {outputs.iou_scores.squeeze()}")
IoU scores: tensor([0.8362, 0.6900, 0.2120], device='cuda:0')
```
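
Each score belongs to one candidate mask, so keeping only the candidate the model rates highest comes down to taking the index of the largest score. The sketch below does this in plain Python with the scores printed above; on tensors, `torch.argmax` plays the same role.

```python
# IoU scores copied from the printed output above (one per candidate mask).
iou_scores = [0.8362, 0.6900, 0.2120]

# Index of the highest-scoring candidate.
best_index = max(range(len(iou_scores)), key=lambda i: iou_scores[i])
print(best_index, iou_scores[best_index])  # 0 0.8362
```

That index can then be used to select along the mask dimension of the post-processed masks.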

#### Bounding Box Input

EdgeTAM also supports bounding box inputs for segmentation:

```python
>>> # Define bounding box as [x_min, y_min, x_max, y_max]
>>> input_boxes = [[[75, 275, 1725, 850]]]

>>> inputs = processor(images=raw_image, input_boxes=input_boxes, return_tensors="pt").to(device)

>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> masks = processor.post_process_masks(outputs.pred_masks.cpu(), inputs["original_sizes"])[0]
>>> print(f"IoU scores: {outputs.iou_scores.squeeze()}")
IoU scores: tensor([0.9301, 0.9348, 0.6605], device='cuda:0')
```
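
Annotation tools often store boxes as `(x, y, width, height)` rather than as corner coordinates. A minimal sketch, assuming that source format, for converting to the `[x_min, y_min, x_max, y_max]` layout with the image and box nesting used above; both helper names are hypothetical.

```python
def xywh_to_xyxy(x, y, width, height):
    """Convert a top-left/size box to corner coordinates."""
    return [x, y, x + width, y + height]


def to_input_boxes(boxes_per_image):
    """Wrap per-image lists of xywh boxes into the nested (image, box, 4) layout."""
    return [[xywh_to_xyxy(*box) for box in boxes] for boxes in boxes_per_image]


# One image with one box: top-left corner (75, 275), 1650 wide, 575 tall.
input_boxes = to_input_boxes([[(75, 275, 1650, 575)]])
print(input_boxes)  # [[[75, 275, 1725, 850]]]
```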

#### Multiple Objects Segmentation

You can segment multiple objects simultaneously:

```python
>>> # Define points for two different objects
>>> input_points = [[[[500, 375]], [[650, 750]]]] # Points for two objects in same image
>>> input_labels = [[[1], [1]]] # Positive clicks for both objects

>>> inputs = processor(images=raw_image, input_points=input_points, input_labels=input_labels, return_tensors="pt").to(device)

>>> with torch.no_grad():
...     outputs = model(**inputs, multimask_output=False)

>>> # Each object gets its own mask
>>> masks = processor.post_process_masks(outputs.pred_masks.cpu(), inputs["original_sizes"])[0]
>>> print(f"Generated masks for {masks.shape[0]} objects")
Generated masks for 2 objects
>>> print(f"IoU scores: {outputs.iou_scores.squeeze()}")
IoU scores: tensor([0.7616, 0.9465], device='cuda:0')
```

### Batch Inference

#### Batched Images

Process multiple images simultaneously for improved efficiency:

```python
>>> from transformers import Sam2Processor, EdgeTamModel, infer_device
>>> import torch
>>> from PIL import Image
>>> import requests

>>> device = infer_device()

>>> model = EdgeTamModel.from_pretrained("yonigozlan/edgetam-1").to(device)
>>> processor = Sam2Processor.from_pretrained("yonigozlan/edgetam-1")

>>> # Load multiple images
>>> image_urls = [
...     "https://huggingface.co/datasets/hf-internal-testing/sam2-fixtures/resolve/main/truck.jpg",
...     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/dog-sam.png"
... ]
>>> raw_images = [Image.open(requests.get(url, stream=True).raw).convert("RGB") for url in image_urls]

>>> # Single point per image
>>> input_points = [[[[500, 375]]], [[[770, 200]]]] # One point for each image
>>> input_labels = [[[1]], [[1]]] # Positive clicks for both images

>>> inputs = processor(images=raw_images, input_points=input_points, input_labels=input_labels, return_tensors="pt").to(model.device)

>>> with torch.no_grad():
...     outputs = model(**inputs, multimask_output=False)

>>> # Post-process masks for each image
>>> all_masks = processor.post_process_masks(outputs.pred_masks.cpu(), inputs["original_sizes"])
>>> print(f"Processed {len(all_masks)} images, each with {all_masks[0].shape[0]} objects")
Processed 2 images, each with 1 objects
>>> print(f"IoU scores: {outputs.iou_scores.squeeze()}")
IoU scores: tensor([0.7618, 0.7999], device='cuda:0')
```

#### Batched Objects per Image

Segment multiple objects within each image using batch inference:

```python
>>> # Multiple objects per image - different numbers of objects per image
>>> input_points = [
...     [[[500, 375]], [[650, 750]]], # Truck image: 2 objects
...     [[[770, 200]]] # Dog image: 1 object
... ]
>>> input_labels = [
...     [[1], [1]], # Truck image: positive clicks for both objects
...     [[1]] # Dog image: positive click for the object
... ]

>>> inputs = processor(images=raw_images, input_points=input_points, input_labels=input_labels, return_tensors="pt").to(device)

>>> with torch.no_grad():
...     outputs = model(**inputs, multimask_output=False)

>>> all_masks = processor.post_process_masks(outputs.pred_masks.cpu(), inputs["original_sizes"])
```

#### Batched Images with Batched Objects and Multiple Points

Handle complex batch scenarios with multiple points per object:

```python
>>> # Add a groceries image for a more complex example
>>> groceries_url = "https://huggingface.co/datasets/hf-internal-testing/sam2-fixtures/resolve/main/groceries.jpg"
>>> groceries_image = Image.open(requests.get(groceries_url, stream=True).raw).convert("RGB")
>>> raw_images = [raw_images[0], groceries_image] # Use truck and groceries images

>>> # Complex batching: multiple images, multiple objects, multiple points per object
>>> input_points = [
...     [[[500, 375]], [[650, 750]]], # Truck image: 2 objects with 1 point each
...     [[[400, 300]], [[630, 300], [550, 300]]] # Groceries image: obj1 has 1 point, obj2 has 2 points
... ]
>>> input_labels = [
...     [[1], [1]], # Truck image: positive clicks
...     [[1], [1, 1]] # Groceries image: positive clicks for refinement
... ]

>>> inputs = processor(images=raw_images, input_points=input_points, input_labels=input_labels, return_tensors="pt").to(device)

>>> with torch.no_grad():
...     outputs = model(**inputs, multimask_output=False)

>>> all_masks = processor.post_process_masks(outputs.pred_masks.cpu(), inputs["original_sizes"])
```
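
The prompts above are ragged: objects can have different numbers of points. One common way to turn such input into a rectangular batch is to pad the shorter entries with a sentinel value. The sketch below shows the idea in plain Python; the pad value `-10` and the helper `pad_points` are assumptions made for illustration, not a description of what `Sam2Processor` does internally.

```python
PAD = -10  # hypothetical pad value, chosen only for this illustration


def pad_points(batch, pad_value=PAD):
    """Pad (image, object, point, xy) nested lists to a rectangular shape."""
    max_objects = max(len(objects) for objects in batch)
    max_points = max(len(points) for objects in batch for points in objects)
    padded = []
    for objects in batch:
        rows = []
        for points in objects:
            missing = max_points - len(points)
            extra = [[pad_value, pad_value] for _ in range(missing)]
            rows.append([list(p) for p in points] + extra)
        # Add fully padded objects so every image has max_objects entries.
        while len(rows) < max_objects:
            rows.append([[pad_value, pad_value] for _ in range(max_points)])
        padded.append(rows)
    return padded


input_points = [
    [[[500, 375]], [[650, 750]]],              # 2 objects, 1 point each
    [[[400, 300]], [[630, 300], [550, 300]]],  # 2 objects, 1 and 2 points
]
padded = pad_points(input_points)
print(padded[0][0])  # [[500, 375], [-10, -10]]
```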

#### Batched Bounding Boxes

Process multiple images with bounding box inputs:

```python
>>> # Multiple bounding boxes per image (using truck and groceries images)
>>> input_boxes = [
...     [[75, 275, 1725, 850], [425, 600, 700, 875], [1375, 550, 1650, 800], [1240, 675, 1400, 750]], # Truck image: 4 boxes
...     [[450, 170, 520, 350], [350, 190, 450, 350], [500, 170, 580, 350], [580, 170, 640, 350]] # Groceries image: 4 boxes
... ]

>>> # Update images for this example
>>> raw_images = [raw_images[0], groceries_image] # truck and groceries

>>> inputs = processor(images=raw_images, input_boxes=input_boxes, return_tensors="pt").to(device)

>>> with torch.no_grad():
...     outputs = model(**inputs, multimask_output=False)

>>> all_masks = processor.post_process_masks(outputs.pred_masks.cpu(), inputs["original_sizes"])
>>> print(f"Processed {len(input_boxes)} images with {len(input_boxes[0])} and {len(input_boxes[1])} boxes respectively")
Processed 2 images with 4 and 4 boxes respectively
>>> print(f"IoU scores: {outputs.iou_scores.squeeze()}")
IoU scores: tensor([0.9301, 0.9348, 0.6605, 0.9465], device='cuda:0')
```

### Using Previous Masks as Input

EdgeTAM can use masks from previous predictions as input to refine segmentation:

```python
>>> # Get initial segmentation
>>> input_points = [[[[500, 375]]]]
>>> input_labels = [[[1]]]
>>> inputs = processor(images=raw_image, input_points=input_points, input_labels=input_labels, return_tensors="pt").to(device)

>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> # Use the best mask as input for refinement
>>> mask_input = outputs.pred_masks[:, :, torch.argmax(outputs.iou_scores.squeeze())]

>>> # Add additional points with the mask input
>>> new_input_points = [[[[500, 375], [450, 300]]]]
>>> new_input_labels = [[[1, 1]]]
>>> inputs = processor(
...     input_points=new_input_points,
...     input_labels=new_input_labels,
...     original_sizes=inputs["original_sizes"],
...     return_tensors="pt",
... ).to(device)

>>> with torch.no_grad():
...     refined_outputs = model(
...         **inputs,
...         input_masks=mask_input,
...         image_embeddings=outputs.image_embeddings,
...         multimask_output=False,
...     )
```


## EdgeTamConfig

[[autodoc]] EdgeTamConfig

## EdgeTamVisionConfig

[[autodoc]] EdgeTamVisionConfig

## EdgeTamMaskDecoderConfig

[[autodoc]] EdgeTamMaskDecoderConfig

## EdgeTamPromptEncoderConfig

[[autodoc]] EdgeTamPromptEncoderConfig

## EdgeTamVisionModel

[[autodoc]] EdgeTamVisionModel
- forward

## EdgeTamModel

[[autodoc]] EdgeTamModel
- forward