Questions about the code for JHMDB #7

wenzhengzeng · 2022-09-03T13:55:09Z

Thanks for the great work. I have read the code for JHMDB and have some questions:
(1) The mAP@0.5 I obtain is only 0.72, much lower than the reported 82.3.
(2) I also noticed that the provided evaluation code for JHMDB computes frame-mAP rather than video-mAP, because the AP is calculated at the frame level rather than the tubelet level.
(3) Although the number of queries is defined as 10*clip_len, only the predictions of the queries corresponding to the intermediate frame (key_pos) are extracted as the final prediction during training and testing. In other words, the pipeline looks more like video object detection, where the input is a video clip but the goal is only to predict the object and its class in the middle frame of the input clip. I could not find the part of the code that reflects the properties of the so-called tubelet transformer.
In summary, are some configurations wrong in the current code?

coocoo90 · 2022-09-11T16:10:22Z

Thanks for your interest in the work.
For (1) and (2), we haven't released the video-level evaluation for our output yet. Frame-level mAP will be lower than video-level mAP. We are working on open-sourcing TubeR-UCF, and the video-level evaluation will be released with it. The video-level eval scripts are based on ActDetector ('https://github.com/vkalogeiton/caffe/tree/act-detector/act-detector-scripts'). If you need the result urgently, you can dump the TubeR outputs and modify the ActDetector code to fit our outputs.
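
For readers comparing the two metrics: frame-mAP matches individual boxes frame by frame, while video-mAP matches whole tubes, typically using a spatio-temporal IoU. The following is a minimal sketch of such a tube IoU (temporal IoU multiplied by the mean spatial IoU over the shared frames), in the spirit of ActDetector-style evaluation. The function names and the `{frame_index: [x1, y1, x2, y2]}` tube format are assumptions made for illustration; they are not TubeR's actual output format or the released evaluation script.

```python
import numpy as np

def box_iou(a, b):
    """Spatial IoU of two boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def tube_iou(tube_a, tube_b):
    """Spatio-temporal IoU of two tubes.

    Each tube is a dict mapping frame index -> box (hypothetical format).
    """
    shared = sorted(set(tube_a) & set(tube_b))
    if not shared:
        return 0.0  # no temporal overlap at all
    all_frames = set(tube_a) | set(tube_b)
    temporal_iou = len(shared) / len(all_frames)
    # Average the spatial IoU over the frames both tubes cover.
    spatial_iou = float(np.mean([box_iou(tube_a[f], tube_b[f]) for f in shared]))
    return temporal_iou * spatial_iou

# Two tubes with identical boxes that overlap on 2 of 6 frames.
tube_a = {f: [0, 0, 10, 10] for f in range(0, 4)}
tube_b = {f: [0, 0, 10, 10] for f in range(2, 6)}
overlap = tube_iou(tube_a, tube_b)
```

A detection that is perfect on the key frame can still score poorly under this measure if the rest of the tube is misaligned or too short, which is why video-mAP needs the complete tubelet output rather than only the key-frame boxes.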

For (3), we extract the key-frame output only to evaluate frame-mAP. For video-level mAP, the complete output is necessary. During training, we randomly select the position of the key frame and backpropagate the bbox loss for the selected position. As a result, TubeR can predict the correct bounding box at every position (see the latest PR for details).
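
The training scheme described above can be sketched roughly as follows. This is only a NumPy illustration, assuming predictions laid out as `(num_queries, clip_len, 4)`, queries already matched to ground truth, and an L1 box loss. The function name `keyframe_bbox_loss`, the tensor layout, and the loss choice are all assumptions; the actual implementation is the one in the repository.

```python
import numpy as np

def keyframe_bbox_loss(pred_boxes, gt_boxes, rng):
    """L1 box loss at one randomly chosen key-frame position.

    pred_boxes, gt_boxes: float arrays of shape (num_queries, clip_len, 4)
    (assumed layout, not necessarily the one used in TubeR).
    Returns (loss, key_pos, grad), where grad is d(loss)/d(pred_boxes).
    """
    clip_len = pred_boxes.shape[1]
    key_pos = int(rng.integers(clip_len))  # random key-frame position
    diff = pred_boxes[:, key_pos] - gt_boxes[:, key_pos]
    loss = np.abs(diff).mean()
    # Gradient of the mean L1 loss: non-zero only at the selected position.
    grad = np.zeros_like(pred_boxes)
    grad[:, key_pos] = np.sign(diff) / diff.size
    return loss, key_pos, grad

rng = np.random.default_rng(0)
pred = rng.random((10, 8, 4))  # 10 queries, clip_len = 8 (illustrative sizes)
gt = rng.random((10, 8, 4))
loss, key_pos, grad = keyframe_bbox_loss(pred, gt, rng)
```

In this sketch a single iteration only updates the boxes at `key_pos`, but because the position is resampled every iteration, each position in the clip receives a box-regression signal over the course of training.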

wenzhengzeng · 2022-09-12T04:53:07Z

Thank you very much for the detailed reply! I am looking forward to the complete version of the TubeR-UCF code.

wenzhengzeng · 2022-10-31T06:54:43Z

Hi. I have one more question. Since the input of TubeR is a video clip, how are the prediction results from adjacent clips merged? Does TubeR use the same linking strategy as ActDetector and MOC?

sqiangcao99 · 2023-01-05T09:27:59Z

> For (3), we extract the key-frame output only to evaluate frame-mAP. For video-level mAP, the complete output is necessary. During training, we randomly select the position of the key frame and backpropagate the bbox loss for the selected position. As a result, TubeR can predict the correct bounding box at every position (see the latest PR for details).

@coocoo90 Hi. According to the paper, all the detections in the tubelet seem to participate in the loss calculation, so I'm a little confused.
