🚀 Code and Trained Models Coming Soon! 🚀
Our approach integrates natural language knowledge and temporal cues for streaming-based Referring Video Object Segmentation (RVOS). A learnable mechanism mitigates tracking bias, where the model may overlook an identifiable object while tracking another. The model processes video efficiently in a streaming fashion, using memory from previous frames to maintain context and segment the referred object accurately.
SAMWISE for streaming-based RVOS.
SAMWISE (our model, not the hobbit) segments objects from The Lord of the Rings zero-shot: no extra training, just living up to its namesake! 🧙‍♂️✨