
Identifying vertices at inference time #11

Open
GuSangmo opened this issue Jul 18, 2023 · 4 comments

@GuSangmo

Hi Kyle, thank you for such a nice paper!
I really learned a lot from your work.

Currently, I am trying to adapt your model for inference on the ASD (active speaker detection) task
(i.e., an arbitrary video with no annotations for bboxes or entities).

As mentioned in #4, this code assumes the face bboxes and audio-visual features
are produced by other models (in this case, your model used a 2D ResNet with TSM for feature extraction).

I understand that SPELL works on pre-built graph data, where each vertex is identified by (video_id, entity_id, timestamp).
Could you give me some advice on how to obtain entity_id at inference time?
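
For reference, here is my understanding of how a node is keyed (a sketch only; the actual field names in your repo may differ):

def make_node_key(video_id, entity_id, timestamp):
    # video_id and timestamp are available at inference time, but
    # entity_id must come from somewhere, e.g., a face tracker.
    return (video_id, entity_id, timestamp)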

As you mentioned, I tried to integrate L186 ~ L223 of data_loader.py into my inference loop,
but it seems to require an entity_id for each node.
I thought adding a tracker would help, but I am curious whether modifying some of your code could solve this instead.

Thank you so much,
Sangmo

@kylemin
Collaborator

kylemin commented Jul 19, 2023

Hi Sangmo,

Yes, you can use a tracking algorithm to assign the same entity_id to all the face crops of each person.
You can refer to Ego4D's starter code to see how it tracks a face crop across frames: it uses a short-term tracking algorithm to link face crops with the same entity_id. There are many real-time algorithms for this type of short-term tracking.
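
For illustration, here is a minimal greedy IoU-based linker. This is only a sketch of the idea, not Ego4D's actual tracker; the box format (x1, y1, x2, y2) and the threshold are assumptions:

def iou(a, b):
    # Intersection-over-union of two (x1, y1, x2, y2) boxes
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def assign_entity_ids(frames, iou_thresh=0.5):
    # frames: list (one entry per frame) of lists of face bboxes.
    # Returns a parallel list of entity_id strings per frame.
    next_id, prev, all_ids = 0, [], []
    for boxes in frames:
        ids, used = [], set()
        for box in boxes:
            # Greedily link to the best-overlapping unmatched track
            # from the previous frame; otherwise start a new entity.
            best, best_iou = None, iou_thresh
            for k, (pbox, _pid) in enumerate(prev):
                score = iou(box, pbox)
                if k not in used and score > best_iou:
                    best, best_iou = k, score
            if best is not None:
                used.add(best)
                ids.append(prev[best][1])
            else:
                ids.append('entity_%d' % next_id)
                next_id += 1
        prev = list(zip(boxes, ids))
        all_ids.append(ids)
    return all_ids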

I hope this helps.

Thank you,
Kyle

@GuSangmo
Author

GuSangmo commented Jul 24, 2023

I really appreciate your tips, Kyle!
Could I ask one more question?

I tried to extract feature maps with the STE (resnet18-tsm-aug) weights you provided,
but I couldn't reproduce your reported results.

I used AudioVideoDatasetAuxLossesForwardPhase(..., clip_length=11, target_size=(144, 144)) and resnet18_two_streams_forward(rgb_stack_size=11)
from models_stage1_tsm.py and STE_forward.py.

I thought the data-loading procedure might differ slightly from your TSM inference code,
since the original STE_forward.py produces the same data for the first few samples:

# This way dataset[0] and dataset[2] return the same video & audio data,
# because ASC's loading logic requires padding
dataset = AudioVideoDatasetAuxLossesForwardPhase(...)
audio_data, video_data, ... = dataset[idx]
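
A quick way to confirm the duplication (assuming the first two returned elements are the audio and video tensors):

import numpy as np

# Hypothetical sanity check: compare the first two items of two samples.
a0, v0 = dataset[0][:2]
a2, v2 = dataset[2][:2]
print(np.array_equal(np.asarray(a0), np.asarray(a2)))  # True if audio is duplicated
print(np.array_equal(np.asarray(v0), np.asarray(v2)))  # True if video is duplicated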

Was there any slight modification to the data loading or to STE_forward.py?
(The bbox-feature concatenation part seems to live in the SPELL logic, I guess.)

It would be a big help to know whether there was some extra step involved or a mistake on my end.
Otherwise, I will keep working with the given pretrained weights.

Best regards,
Sangmo

@hugoobui

Hello @GuSangmo, I'm also working on adapting this code for real-time use. Did you manage to do it?

@GuSangmo
Author

GuSangmo commented Nov 5, 2023

@hugoobui
Sorry for the late reply.
I couldn't get it working, so I chose another model (LightASD).

In the case of GNN post-processing models (e.g., EASEE, SPELL), I think bbox tracking has to be done in advance to build the graph for post-processing, and I couldn't reproduce the feature-encoder part ;-)

My task was due on 23 August, so there may have been improvements in this domain since then.
LightASD is much faster, but I think additional engineering is still necessary (e.g., buffering 5 frames before feeding them to the model); see the sketch below.
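
A rough sketch of what I mean by that buffering (the class name and clip length are just for illustration):

from collections import deque

class ClipBuffer:
    # Accumulates streaming frames and yields a full clip once
    # `clip_len` frames are available (sliding window thereafter).
    def __init__(self, clip_len=5):
        self.clip_len = clip_len
        self.frames = deque(maxlen=clip_len)

    def push(self, frame):
        self.frames.append(frame)
        if len(self.frames) == self.clip_len:
            return list(self.frames)  # ready to feed to the model
        return None

# Usage: clip = buf.push(frame); run the detector only when clip is not None.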
I hope you can tackle this problem.
Best regards,
Sangmo
