
Identifying vertices at inference time #11

Open
GuSangmo opened this issue Jul 18, 2023 · 4 comments

@GuSangmo

Hi Kyle, thank you for such a nice paper!
I really learned a lot from your work.

Currently, I am trying to adapt your model for inference on the ASD (active speaker detection) task
(i.e., an arbitrary video with no annotations for bboxes or entities).

As mentioned in #4, this code assumes the face bboxes and audio-visual features
are produced by other models (in this case, your model used a 2D ResNet with TSM for feature extraction).

I understand that SPELL works on pre-built graph data, where each vertex is identified by (video_id, entity_id, timestamp).
Could you give me some advice on how to obtain entity_id at inference time?
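
For reference, here is my understanding of how a node is keyed (a sketch only; the actual field names in your repo may differ):

def make_node_key(video_id, entity_id, timestamp):
    # video_id and timestamp are available at inference time, but
    # entity_id must come from somewhere, e.g., a face tracker.
    return (video_id, entity_id, timestamp)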

As you mentioned, I tried to integrate L186 ~ L223 of data_loader.py into my inference loop,
but it seems to require an entity_id for each node.
I thought adding a tracker would help, but I am curious whether modifying some of your code could solve this instead.

Thank you so much,
Sangmo

@kylemin
Collaborator

kylemin commented Jul 19, 2023

Hi Sangmo,

Yes, you can use a tracking algorithm to assign the same entity_id to all the face crops of each person.
You can refer to Ego4D's starter code to see how it tracks a face crop across frames: it uses a short-term tracking algorithm to link face crops with the same entity_id. There are many real-time algorithms for this type of short-term tracking.
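
For illustration, here is a minimal greedy IoU-based linker. This is only a sketch of the idea, not Ego4D's actual tracker; the box format (x1, y1, x2, y2) and the threshold are assumptions:

def iou(a, b):
    # Intersection-over-union of two (x1, y1, x2, y2) boxes
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def assign_entity_ids(frames, iou_thresh=0.5):
    # frames: list (one entry per frame) of lists of face bboxes.
    # Returns a parallel list of entity_id strings per frame.
    next_id, prev, all_ids = 0, [], []
    for boxes in frames:
        ids, used = [], set()
        for box in boxes:
            # Greedily link to the best-overlapping unmatched track
            # from the previous frame; otherwise start a new entity.
            best, best_iou = None, iou_thresh
            for k, (pbox, _pid) in enumerate(prev):
                score = iou(box, pbox)
                if k not in used and score > best_iou:
                    best, best_iou = k, score
            if best is not None:
                used.add(best)
                ids.append(prev[best][1])
            else:
                ids.append('entity_%d' % next_id)
                next_id += 1
        prev = list(zip(boxes, ids))
        all_ids.append(ids)
    return all_ids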

I hope this helps.

Thank you,
Kyle

@GuSangmo
Author

GuSangmo commented Jul 24, 2023

I really appreciate your tips, Kyle!
Could I ask one more question?

I tried to extract feature maps with the STE (resnet18-tsm-aug) weights you provided,
but I couldn't reproduce your reported results.

I used AudioVideoDatasetAuxLossesForwardPhase(..., clip_length=11, target_size=(144, 144)) and resnet18_two_streams_forward(rgb_stack_size=11)
from models_stage1_tsm.py and STE_forward.py.

I thought the data-loading procedure might differ slightly from your TSM inference code,
since the original STE_forward.py produces the same data for the first few samples:

# This way dataset[0] and dataset[2] return the same video & audio data,
# because ASC's loading logic requires padding
dataset = AudioVideoDatasetAuxLossesForwardPhase(...)
audio_data, video_data, ... = dataset[idx]
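
A quick way to confirm the duplication (assuming the first two returned elements are the audio and video tensors):

import numpy as np

# Hypothetical sanity check: compare the first two items of two samples.
a0, v0 = dataset[0][:2]
a2, v2 = dataset[2][:2]
print(np.array_equal(np.asarray(a0), np.asarray(a2)))  # True if audio is duplicated
print(np.array_equal(np.asarray(v0), np.asarray(v2)))  # True if video is duplicated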

Was there any slight modification to the data loading or to STE_forward.py?
(The bbox-feature concatenation part seems to live in the SPELL logic, I guess.)

It would be a big help to know whether there was some extra step involved or a mistake on my end.
Otherwise, I will keep working with the given pretrained weights.

Best regards,
Sangmo

@hugoobui

Hello @GuSangmo, I'm also working on adapting this code for real-time use. Did you manage to do it?

@GuSangmo
Author

GuSangmo commented Nov 5, 2023

@hugoobui
Sorry for the late reply.
I couldn't get it working, so I chose another model (LightASD).

In the case of GNN post-processing models (e.g., EASEE, SPELL), I think bbox tracking has to be done in advance to build the graph for post-processing, and I couldn't reproduce the feature-encoder part ;-)

My task was due on 23 August, so there may have been improvements in this domain since then.
LightASD is much faster, but I think additional engineering is still necessary (e.g., buffering 5 frames before feeding them to the model); see the sketch below.
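
A rough sketch of what I mean by that buffering (the class name and clip length are just for illustration):

from collections import deque

class ClipBuffer:
    # Accumulates streaming frames and yields a full clip once
    # `clip_len` frames are available (sliding window thereafter).
    def __init__(self, clip_len=5):
        self.clip_len = clip_len
        self.frames = deque(maxlen=clip_len)

    def push(self, frame):
        self.frames.append(frame)
        if len(self.frames) == self.clip_len:
            return list(self.frames)  # ready to feed to the model
        return None

# Usage: clip = buf.push(frame); run the detector only when clip is not None.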
I hope you can tackle this problem.
Best regards,
Sangmo
