-
We discussed Mask-RCNN style (two-pass) vs SSD (one-pass) vs YOLO (one-pass) for object detection, and are leaning towards YOLO. Interestingly, YOLO has gone through a number of iterations and improved substantially over the past few years (review). v1-v3 had the original author Joseph Redmon working on it, but he has since left CV research. Alexey Bochkovskiy worked on v4. Other teams, including ultralytics, have been improving it up to the current-day v8. Ultralytics' v5 is particularly interesting because it introduces multi-task learning, including segmentation. This continues with ultralytics v8, which takes into account improvements in efficiency made in v6, such as quantization. It's also a generally nice framework, with the previously mentioned export functionality. The v8 family of models for the task of instance segmentation is called YOLOv8-seg. By this comment it seems that the YOLOv8-seg architecture is pretty much the same as the v8 object detector with "an extra output module in the head which outputs masks coefficients and an added FCN layer", which is inspired by the YOLACT paper. Also worth reading is the Fast Segment Anything paper, which uses YOLOv8-seg and references YOLACT (the FastSAM model is also available via the ultralytics library).
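To get a feel for the ultralytics API mentioned above, here's a minimal YOLOv8-seg inference sketch (the checkpoint name is just the smallest seg variant; the image path is a placeholder):

```python
# Minimal YOLOv8-seg inference via the ultralytics library.
from ultralytics import YOLO

model = YOLO("yolov8n-seg.pt")        # downloads pretrained weights on first use
results = model("path/to/image.jpg")  # placeholder image path

for r in results:
    print(r.boxes.xyxy)            # bounding box coordinates, (N, 4)
    print(r.boxes.cls)             # class ids, (N,)
    if r.masks is not None:        # masks is None when nothing is detected
        print(r.masks.data.shape)  # per-instance masks, (N, H, W)
```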
-
@eddogola feel free to correct me on anything, and add any additional points here. Once you're ready to write some code, press "Create issue from discussion" and convert this into an issue, since it's the next immediate chunk of work we'll be tackling.
-
Questions around picking the object detector head:
(We may not get to investigating this now, but may want to eventually.)
-
Possible object detectors:
-
Over the last week or so we've been discussing going forward with composable models. The rough idea is to have individual components of an instance segmentation model that are pre-trained on some task such as ImageNet or COCO. One component might be an object detector (probably single-shot, such as YOLO or SSD) which outputs bounding box coordinates (and classes, but we may not even need those). Another component might be a semantic segmenter, which takes in crops defined by the bounding boxes from the object detector and semantically segments each crop into bg/fg.
The hypothesis: given pretrained components of the sort described above, can we add a relatively slim number of layers on top of each to get a fine-tunable model?
If so, we can have a composable pipeline that goes
images -> object detector (pretrained, inference only) -> slim layers (trainable) -> output bbox coords -> image crops -> semantic segmenter (pretrained, inference only) -> slim layers (trainable) -> instances
In the end this pipeline would be deployed to the browser.
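To make the hypothesis concrete, here's a rough tf.keras sketch of the composition. Everything below is illustrative: the "pretrained" components are toy stand-ins (a real detector/segmenter would replace them), the shapes and layer sizes are arbitrary, and one box per image is assumed for simplicity.

```python
import tensorflow as tf

# Stand-in for the pretrained object detector (frozen, inference only).
detector = tf.keras.Sequential([
    tf.keras.Input(shape=(256, 256, 3)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(4),  # pretend raw output: one box per image
])
detector.trainable = False

# Slim trainable layers refining the detector output into bbox coords.
bbox_head = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(4, activation="sigmoid"),  # normalized (y1, x1, y2, x2)
])

# Stand-in for the pretrained semantic segmenter (frozen, inference only).
segmenter = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 64, 3)),
    tf.keras.layers.Conv2D(8, 3, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(1, 1),  # pretend bg/fg logits
])
segmenter.trainable = False

# Slim trainable layer refining the segmenter output into a fg mask.
mask_head = tf.keras.layers.Conv2D(1, 1, activation="sigmoid")

def pipeline(images):
    """images -> bbox coords -> image crops -> bg/fg masks."""
    boxes = bbox_head(detector(images))  # (B, 4), normalized coords
    crops = tf.image.crop_and_resize(
        images, boxes, tf.range(tf.shape(images)[0]), (64, 64))
    return mask_head(segmenter(crops))   # (B, 64, 64, 1)

# Smoke test on random data.
masks = pipeline(tf.random.uniform((2, 256, 256, 3)))
print(masks.shape)  # (2, 64, 64, 1)
```

The point of the sketch is just the freezing pattern: during training, only bbox_head and mask_head would have their weights updated, which is what would make them viable as tfjs layers models later.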
The object detector and semantic segmenter would be pretrained in Python on a laptop or DGX, and then converted to a tfjs graph model. We have determined that this is possible with more or less arbitrary TensorFlow models using the tfjs converter, or with PyTorch models using the ultralytics Exporter class (which does pytorch -> onnx -> tensorflow saved model -> tfjs converter -> tfjs graph model). The slim layers would be tfjs layers models, and therefore trainable in-browser.
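For reference, the ultralytics side of that chain is driven from Python in one call; a minimal sketch (exact output directory naming may vary by ultralytics version):

```python
# The ultralytics Exporter drives the whole
# pytorch -> onnx -> tensorflow saved model -> tfjs graph model chain.
from ultralytics import YOLO

model = YOLO("yolov8n-seg.pt")
model.export(format="tfjs")  # writes a *_web_model/ directory with model.json
```

For arbitrary TensorFlow SavedModels, the equivalent step is the `tensorflowjs_converter` CLI from the tensorflowjs package.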
For the purposes of validating the hypothesis, however, we'll first do it all in Python as a proof of concept. We can worry about converting to tfjs and verifying its performance in the browser afterwards.