Question: Online inference #4

Closed
hsato1 opened this issue Jan 16, 2023 · 5 comments

hsato1 commented Jan 16, 2023

Thank you so much for such a wonderful paper!

I am exploring active speaker detection in real time, came across this paper and repo, and wanted to ask a question.
Is it possible to run online active speaker inference with this approach on a live video stream?

Thank you so much!

kylemin (Collaborator) commented Jan 17, 2023

Hi,

Yes, it is possible, but you would need to make the graph construction online. Specifically, you can create the graphs on the fly by integrating l.186-223 of data_loader.py into the inference loop. Note that a larger number of nodes (numv) results in higher latency, so there is a trade-off between temporal context and speed.

Thank you,
Kyle
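
To make the trade-off concrete, here is a minimal sketch of what online graph construction inside the inference loop could look like. It is not the repository's code: the window parameters, the tiny GCN stand-in, and every function name are assumptions for illustration; only `numv` and the idea of porting the l.186-223 graph-building logic come from the comment above.

```python
# Hypothetical sketch only: a rolling buffer of per-face feature nodes is
# rebuilt into a graph on every step, mimicking the offline construction
# in l.186-223 of data_loader.py. All names and parameters are illustrative.
from collections import deque

import torch
import torch.nn as nn
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv

NUMV = 500      # max nodes kept online; larger -> more context, more latency
TIME_WIN = 0.9  # seconds: how close in time two nodes must be to get an edge
FEAT_DIM = 128  # assumed dimensionality of the precomputed AV features

class TinyGraphNet(nn.Module):
    """Stand-in for the paper's GNN: two GCN layers -> per-node score."""
    def __init__(self, dim=FEAT_DIM):
        super().__init__()
        self.conv1 = GCNConv(dim, 64)
        self.conv2 = GCNConv(64, 1)

    def forward(self, data):
        h = torch.relu(self.conv1(data.x, data.edge_index))
        return self.conv2(h, data.edge_index).squeeze(-1)

model = TinyGraphNet().eval()
buffer = deque(maxlen=NUMV)  # entries: (timestamp, face_id, feature)

def build_graph(entries):
    """Connect nodes of the same face across time, and different faces
    that co-occur at (nearly) the same timestamp. The quadratic scan is
    kept deliberately simple for clarity."""
    x = torch.stack([feat for _, _, feat in entries])
    src, dst = [], []
    for i, (ti, fi, _) in enumerate(entries):
        for j, (tj, fj, _) in enumerate(entries):
            if abs(ti - tj) <= TIME_WIN and (fi == fj or abs(ti - tj) < 1e-3):
                src.append(i)
                dst.append(j)
    return Data(x=x, edge_index=torch.tensor([src, dst], dtype=torch.long))

@torch.no_grad()
def step(timestamp, face_id, feature):
    """Called once per detected face per incoming frame; returns the
    active-speaker score for the newest node."""
    buffer.append((timestamp, face_id, feature))
    scores = model(build_graph(list(buffer)))
    return scores[-1]
```

In practice the quadratic edge scan would be replaced by incremental updates, but the sketch shows where the numv-versus-latency trade-off enters: every new frame pays for inference over a graph of up to NUMV nodes.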

hsato1 (Author) commented Jan 18, 2023

That makes sense!

Thank you so much for your response!

hsato1 (Author) commented Jan 27, 2023

Hello again,

I wanted to clarify real-time inference a bit further. Consistent real-time inference is possible under the assumption that we can detect and properly crop each face for every incoming frame of the video stream (or, per the paper, for 11 consecutive frames of face crops), correct? Then we need to encode the cropped images and the corresponding audio with the 2D ResNet with TSM, right? And that encoding step has its own computational cost and latency?

Thank you so much,
Hiro

kylemin (Collaborator) commented Jan 28, 2023

Hi Hiro,

Yes, all of your assumptions are correct. Our code assumes that the face bounding boxes and their initial audio-visual features are computed by other models.
I hope this answers your questions!

Best regards,
Kyle
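
For readers following along, here is a simplified sketch of that front-end encoding stage. It is illustrative only: a real TSM inserts temporal shifts inside the ResNet's residual blocks, whereas this stand-in applies a single shift after the stem; the 112x112 crop size, the resnet18 backbone, and all names are assumptions, and the audio branch (encoded analogously and fused with the visual feature) is omitted.

```python
# Illustrative stand-in for the feature-extraction front end: 11 consecutive
# face crops are encoded with a 2D ResNet plus a temporal shift (the core
# idea of TSM). Not the repository's actual model or API.
import torch
import torchvision.models as models

CLIP_LEN = 11  # consecutive face crops per node, as discussed above

backbone = models.resnet18(weights=None).eval()

def temporal_shift(x, fold_div=8):
    """Shift a fraction of channels one step along time. x: (T, C, H, W)."""
    fold = x.size(1) // fold_div
    out = torch.zeros_like(x)
    out[:-1, :fold] = x[1:, :fold]                   # pull from the future
    out[1:, fold:2 * fold] = x[:-1, fold:2 * fold]   # pull from the past
    out[:, 2 * fold:] = x[:, 2 * fold:]              # rest left untouched
    return out

@torch.no_grad()
def encode_face_track(crops):
    """crops: (CLIP_LEN, 3, 112, 112) tensor of aligned face crops.
    Returns one visual feature vector for the corresponding graph node."""
    assert crops.shape[0] == CLIP_LEN
    b = backbone
    h = b.maxpool(b.relu(b.bn1(b.conv1(crops))))  # stem: (T, 64, 28, 28)
    h = temporal_shift(h)                         # mix information across frames
    h = b.layer4(b.layer3(b.layer2(b.layer1(h))))
    h = b.avgpool(h).flatten(1)                   # (T, 512) per-frame features
    return h.mean(dim=0)                          # (512,) node feature
```

This per-node encoding is the step with the separate computational cost mentioned above; its latency scales with the number of detected faces per frame, independently of the graph-inference cost.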

hugoobui commented

Hello @hsato1, did you manage to make it work in real time? Thank you so much!
