Search before asking
Description
https://huggingface.co/docs/transformers/en/model_doc/owlv2
Be able to fine-tune OWLv2 for grounded object detection using JSONL annotations referencing 3-channel imagery. N-channel imagery would be extra dope. Ideally with high bit depth TIFF support, since my imagery comes in .tif. I see Pillow in the requirements, so high bit depth TIFF support might not be possible today without more work to change how imagery is loaded.
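For what it's worth, here's a rough sketch of what the loading path could look like if it went through tifffile instead of Pillow. None of this reflects the repo's actual loader: tifffile and numpy as dependencies, the JSONL keys, and the 2/98 percentile stretch are all my assumptions.

```python
# Sketch only, not the repo's loader. Assumes tifffile + numpy are
# installed and that each JSONL line carries hypothetical "image",
# "label", and "bbox" keys.
import json

import numpy as np
import tifffile


def load_tiff_as_rgb8(path: str) -> np.ndarray:
    """Read a high bit depth (e.g. uint16) TIFF and rescale to 8-bit RGB."""
    arr = tifffile.imread(path)  # (H, W) or (H, W, C), any dtype
    if arr.ndim == 2:
        arr = np.stack([arr] * 3, axis=-1)  # grayscale -> 3 channels
    arr = arr[..., :3].astype(np.float32)  # keep first 3 of N channels
    # 2/98 percentile stretch: a common remote-sensing display convention,
    # chosen arbitrarily here, not something this repo prescribes.
    lo, hi = np.percentile(arr, (2, 98))
    arr = np.clip((arr - lo) / max(hi - lo, 1e-6), 0.0, 1.0)
    return (arr * 255).astype(np.uint8)


def read_jsonl(path: str):
    """Yield one annotation record per line from a JSONL file."""
    with open(path) as f:
        for line in f:
            if line.strip():
                yield json.loads(line)
```

A true N-channel path would skip the `[..., :3]` slice and the 8-bit squeeze, but that also means the model's patch-embedding layer would need to be re-initialized for the extra channels, which is a bigger change than just swapping the image reader.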
Use case
I've played around with OWLv2 a bit and compared it to GroundingDINO and Qwen 2.5, and it seems to do a better job of producing bounding boxes on hard images with small objects (satellite images), whereas the other models produce nothing. This makes me think it is potentially a better candidate for fine-tuning. But I'm definitely not certain and have more testing to do.
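For anyone who wants to reproduce that comparison, the zero-shot OWLv2 path from the linked Hugging Face docs is enough; the snippet below follows the docs example. The checkpoint name and 0.1 threshold come from the docs, while the image path and text prompts are placeholders for your own data (a high bit depth .tif would first need a custom loader like the sketch above).

```python
# Zero-shot OWLv2 detection, following the Hugging Face docs example.
import torch
from PIL import Image
from transformers import Owlv2ForObjectDetection, Owlv2Processor

processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")

image = Image.open("satellite_chip.png").convert("RGB")  # placeholder path
texts = [["a small boat", "an airplane"]]  # one prompt list per image

inputs = processor(text=texts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Rescale normalized boxes back to the original (height, width).
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(
    outputs=outputs, threshold=0.1, target_sizes=target_sizes
)
for box, score, label in zip(
    results[0]["boxes"], results[0]["scores"], results[0]["labels"]
):
    print(texts[0][label], round(score.item(), 3), box.tolist())
```

This is inference only, of course; the ask above is for a fine-tuning path, which, as far as I know, transformers does not ship out of the box for OWLv2.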
Additional
In the geospatial computer vision domain we are in the very earliest days of applying VLMs to solve actual problems on massive imagery corpora. There have been some cool experiments recently that have inspired me to try fine-tuning VLMs to test their limits on remotely sensed imagery, using modest-sized datasets for fine-tuning.
I can't commit to a PR right now (but might be able to in the future).
Are you willing to submit a PR?
Yes I'd like to help by submitting a PR!