Hello jpeyre,
I have read your paper and code; introducing spatial features into visual relation detection is a nice idea.
After reading, I am confused by this sentence in Part 3 of the paper: "To detect and localize such triplets in test images, we assume that the candidate object detections for s and o are given by a detector trained with full supervision. Here we use the object detector Faster-RCNN [14] trained on the Visual Relationship Detection training set [31]." However, in the section Representing Appearance of Objects, you use Fast-RCNN with VGG16 pre-trained on ImageNet to extract the appearance features. Do you mean that you use the same CNN architecture (Fast-RCNN) trained on different datasets in these two different steps?
I found only "vgg16_fast_rcnn.caffemodel" in the code, but no model trained on the Visual Relationship Dataset. I wonder if I have misunderstood the paper. Could you share some details about the model trained on VRD that is used for extracting the candidate pairs of objects? Thank you!
Hi z-kun,
We use the same model both for extracting the candidate objects and computing their appearance features. This model is indeed "vgg16_fast_rcnn.caffemodel", a VGG16 network pre-trained on ImageNet and finetuned on the VRD training set.
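In other words, a single forward pass through the Fast R-CNN network can serve both purposes. A minimal sketch (assuming the standard Fast R-CNN Caffe test prototxt, here called "test.prototxt", and externally supplied region proposals; blob names follow the usual VGG16 Fast R-CNN definition):

```python
import numpy as np
import caffe

# Load the VGG16 Fast R-CNN model fine-tuned on the VRD training set.
net = caffe.Net('test.prototxt', 'vgg16_fast_rcnn.caffemodel', caffe.TEST)

def detect_and_describe(im_blob, rois):
    """im_blob: (1, 3, H, W) preprocessed image;
       rois: (R, 5) array of [batch_idx, x1, y1, x2, y2] proposals."""
    net.blobs['data'].reshape(*im_blob.shape)
    net.blobs['data'].data[...] = im_blob
    net.blobs['rois'].reshape(*rois.shape)
    net.blobs['rois'].data[...] = rois
    net.forward()
    # Class probabilities give the candidate object detections (s and o);
    # the fc7 activations of the same boxes are the appearance features.
    cls_prob = net.blobs['cls_prob'].data.copy()
    appearance = net.blobs['fc7'].data.copy()
    return cls_prob, appearance
```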