SAM2 for segmenting a 2 hour video? #264
You would have to do it in chunks of 10s clips. You could take the mask of the last frame per chunk and use it as input for the next chunk. That would take a while but be fully automated.
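The chunked workflow described above can be sketched as a small driver loop. This is an illustrative sketch, not SAM2 API code: `run_chunk` is a hypothetical stand-in for whatever per-chunk predictor call you wire up, and the "prompt" is the last-frame mask that seeds the next chunk.

```python
def chunk_ranges(n_frames, chunk_size):
    """Split [0, n_frames) into consecutive (start, stop) ranges."""
    return [(s, min(s + chunk_size, n_frames))
            for s in range(0, n_frames, chunk_size)]

def segment_in_chunks(n_frames, chunk_size, run_chunk, first_prompt):
    """run_chunk(start, stop, prompt) segments one clip and returns the
    mask of its last frame; that mask seeds the next chunk, so the whole
    video runs without further manual clicks."""
    prompt = first_prompt
    for start, stop in chunk_ranges(n_frames, chunk_size):
        prompt = run_chunk(start, stop, prompt)
    return prompt
```

For a 2-hour 60fps video and 10s chunks, that's 720 chunks of 600 frames each.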
The largest model uses <2GB of VRAM for videos, so a 4090 should have no issues. The main problem would be the likelihood of the segmentation failing at some point, combined with the time it takes (using the large model at 60fps, I'd guess 3-4 hours on a 4090), since that's a long time to have to sit there and correct the outputs. It might make sense to first run the tiny model at 512px resolution (see issue #257), which should take <1hr, to get some idea of where the tracking struggles.

As for memory build-up in the demo, the original code is set up for interactive use and won't work as-is. You'd have to clear the cached results as the video runs (see #196) and probably also avoid loading all the frames in at the start, I guess by a combination of using the async_loading_frames option on init_state and disabling its internal frame storage.

Alternatively, there are existing code bases aimed at this, for example #90, maybe PR #46, maybe #73, and I also have a script for it.
Thanks, I'll take a look at the links you provided. Could you explain to me what async_loading_frames does?
By default, the video predictor loads & preprocesses every single frame of your video before doing any segmentation. If you run the examples, you'll see this show up as a progress bar when you begin tracking.
Only after this finishes does the SAM model actually start doing anything. The model results show up as a different progress bar.
When you set async_loading_frames=True, the frames are instead loaded while the model runs. In theory the async loading is a far more practical choice, because it avoids loading everything into memory. Weirdly, the loader runs in its own thread and actually does store everything in memory, which sort of defeats the purpose. But you can fix it by commenting out the storage line like I mentioned before, and it's also probably worth commenting out the threading lines to stop the loader from trying to get ahead of the model. Those changes should allow you to run any length of video, but the predictor still caches results as it runs (around 1MB per frame), which will eventually consume all your memory on longer videos (though that can be fixed with the other changes I mentioned).

Here's a minimal video example that prints out VRAM usage. You can try running it with a different async setting and with/without the threading/storage lines commented out to see the differences:

```python
from time import perf_counter

import torch
import numpy as np

from sam2.build_sam import build_sam2_video_predictor

video_folder_path = "notebooks/videos/bedroom"
cfg, ckpt = "sam2_hiera_t.yaml", "checkpoints/sam2_hiera_tiny.pt"
device = "cuda"  # or "cpu"

predictor = build_sam2_video_predictor(cfg, ckpt, device)
inference_state = predictor.init_state(
    video_path=video_folder_path,
    async_loading_frames=False,
)

predictor.add_new_points(
    inference_state=inference_state,
    frame_idx=0,
    obj_id=1,
    points=np.array([[210, 350]], dtype=np.float32),
    labels=np.array([1], np.int32),
)

tprev = -1
for result in predictor.propagate_in_video(inference_state):
    # Do nothing with results, just report VRAM use (at most once per second)
    if (perf_counter() > tprev + 1.0) and torch.cuda.is_available():
        free_bytes, total_bytes = torch.cuda.mem_get_info()
        print("VRAM:", (total_bytes - free_bytes) // 1_000_000, "MB")
        tprev = perf_counter()
    pass
```

When I run this, the worst case scenario is the original code with async=True, which uses >2.5GB of VRAM and keeps ballooning as it runs. The best case is also with async=True but with threading & storage commented out, which ends up needing around 1.1GB (but will still grow slowly without clearing cached results).
Any ideas for how to clear the cache in addition to doing async=True with threading & storage commented out? Thanks.
Yes, there's a link to some code for this in the post above (e.g. issue #196).
I tried offload_video_to_cpu=True and async_video_to_cpu=True, but that didn't solve the out-of-memory problem in SAMURAI, so I still can't process long videos. Do you have any better methods?
@Grpab I do have a solution; I'm surprised how unclear a lot of this has been. First, I add a lazy video loader instead of loading all frames at once. This can be done easily by modifying the code in utils/misc.py, where I add:
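The code block from the original comment isn't reproduced here. As a rough, self-contained sketch of what a lazy frame loader could look like (all names are hypothetical, not the author's actual patch to utils/misc.py):

```python
class LazyFrames:
    """Decode frames on access instead of preloading the whole video.

    Hypothetical sketch: frame_paths is a list of image files on disk and
    load_fn is whatever read/resize/normalize step the pipeline needs.
    """

    def __init__(self, frame_paths, load_fn):
        self.frame_paths = frame_paths
        self.load_fn = load_fn
        self._cache = {}  # keep only the most recently accessed frame

    def __len__(self):
        return len(self.frame_paths)

    def __getitem__(self, idx):
        if idx not in self._cache:
            self._cache.clear()  # evict the previous frame first
            self._cache[idx] = self.load_fn(self.frame_paths[idx])
        return self._cache[idx]
```

Because only one decoded frame is held at a time, host memory stays flat no matter how long the video is.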
where in init_state I have:
Then I delete older conditional and non-conditional frame embeddings after track_step in sam2_video_predictor.py.
I haven't tested the above as much, but that's the general idea. And most importantly, I modify build_sam2_video_predictor as follows:
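The modified call from the original comment wasn't captured here. For what it's worth, build_sam2_video_predictor does accept a hydra_overrides_extra argument, so memory-related config fields can be overridden at build time; the specific override below (shrinking num_maskmem, the number of memory-bank frames) is an assumption for illustration, not the author's actual change.

```python
from sam2.build_sam import build_sam2_video_predictor

# Assumed example: force a smaller memory bank via a hydra override.
# "++model.num_maskmem=4" is a guess at a useful knob, not the author's code.
predictor = build_sam2_video_predictor(
    "sam2_hiera_t.yaml",
    "checkpoints/sam2_hiera_tiny.pt",
    device="cuda",
    hydra_overrides_extra=["++model.num_maskmem=4"],
)
```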
Again, set these added parameters according to your situation, but for me this worked like a charm: no OOM and great segmentation results on a very long video.
Thank you very much for your reply, I will try your method. Do you know how to modify it in samurai to process video in real time? |
@Grpab I have no idea lol. |
> Then I delete older conditional and non-conditional frame embeddings after track_step in sam2_video_predictor.py

Can you elaborate on this?

"Add this at the end of the function (use the best_memory_frames for your setup); here below only for non_cond_frame_outputs, but you get the idea..."
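The deletion being described might look something like the sketch below. Here output_dict is a plain dict mimicking the predictor's per-frame output cache, and max_memory_frames is an assumed knob playing the role of the best_memory_frames mentioned above; this is not the author's actual code.

```python
def prune_non_cond_outputs(output_dict, current_idx, max_memory_frames):
    """Drop cached non-conditioning frame outputs older than the window.

    output_dict["non_cond_frame_outputs"] maps frame index -> cached
    features; everything more than max_memory_frames behind the current
    frame is deleted so memory use stays bounded.
    """
    non_cond = output_dict["non_cond_frame_outputs"]
    cutoff = current_idx - max_memory_frames
    for idx in [i for i in non_cond if i < cutoff]:
        del non_cond[idx]
```

Calling this at the end of each tracking step keeps the cache at a fixed size; the same idea would apply to the conditioning-frame outputs.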
Thank you very much for your help! |
As far as I am aware, init_state takes in a video path in SAM v2. You are using SAMURAI rather than SAM v2, so I am not sure how to fix that; you will need to patch up the incompatibilities, and that's the extent of my help. Best of luck. They may be using the legacy version of sam2_video_predictor, but I don't think it should be too hard; just study the code and make the necessary adjustments.
OK, thank you |
I used your method and it was very effective: the video length I can handle improved from 20 seconds to one minute. But beyond one minute, it still reports insufficient memory.
This workflow involves continuously releasing old frames to maintain constant memory and GPU memory overhead in infinite-length video processing. However, it requires timely retrieval of processed frame results for streaming output. If these frame results are released before being accessed, the inference computation will have been wasted. |
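One way to guarantee results are retrieved before their frames are released is a small consume-then-evict buffer, sketched below under assumed names (StreamingResults is not part of SAM2): every result is handed to a consumer callback before its slot is freed, so no inference work is wasted.

```python
from collections import OrderedDict

class StreamingResults:
    """Bounded result buffer: flush oldest results to a consumer
    callback *before* evicting them, so nothing is silently dropped."""

    def __init__(self, consumer, max_buffered):
        self.consumer = consumer        # e.g. writes a mask to disk/stream
        self.max_buffered = max_buffered
        self._buf = OrderedDict()       # frame_idx -> result, oldest first

    def add(self, frame_idx, result):
        self._buf[frame_idx] = result
        while len(self._buf) > self.max_buffered:
            idx, res = self._buf.popitem(last=False)
            self.consumer(idx, res)     # consume, then release

    def flush(self):
        while self._buf:
            idx, res = self._buf.popitem(last=False)
            self.consumer(idx, res)
```

Feeding each propagated frame's mask into add() (and calling flush() at the end) gives streaming output with constant buffer memory.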
In your opinion, would it be possible to use SAM2 to segment a 2-hour video (720p, 60fps) with a 4090 GPU, while of course avoiding out-of-memory errors?
What could be the best strategy to succeed in doing so?