Question: Online inference #4
Thank you so much for such a wonderful paper!
I am working on exploring active speaker detection in real time; I came across this paper and repo and wanted to ask a question.
Is it possible to do online inference of the active speaker with this approach on a live video stream?
Thank you so much!

Comments
Hi, yes, it is possible, but you would need to make the graph construction online. Specifically, you can create graphs on the fly by integrating lines 186-223 of data_loader.py into the inference loop. In this case, a larger number of nodes (numv) results in higher latency, so there will be a trade-off. Thank you,
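For anyone landing here later, a minimal sketch of what such an online loop could look like. Everything here is hypothetical: `incoming_feature_stream`, `build_graph`, `model`, `NUMV`, and `FEAT_DIM` are stand-in names, and `build_graph` only approximates the kind of temporal-window graph construction the maintainer points to at lines 186-223 of data_loader.py, not the repo's exact logic:

```python
import collections
import torch

NUMV = 30       # nodes per graph; larger numv = more context, higher latency
FEAT_DIM = 128  # assumed per-frame audio-visual feature size

def incoming_feature_stream(n_frames=100):
    """Hypothetical stand-in for the upstream face detector + encoders."""
    for _ in range(n_frames):
        yield torch.randn(FEAT_DIM)

def build_graph(feats, time_edge=1):
    """Stand-in for an online version of data_loader.py lines 186-223:
    connect each node to its temporal neighbours within `time_edge` frames."""
    n = feats.size(0)
    src, dst = [], []
    for i in range(n):
        for j in range(max(0, i - time_edge), min(n, i + time_edge + 1)):
            src.append(i)
            dst.append(j)
    edge_index = torch.tensor([src, dst], dtype=torch.long)  # (2, num_edges)
    return feats, edge_index

def model(x, edge_index):
    """Placeholder for the trained graph model; returns per-node scores."""
    return torch.sigmoid(x.mean(dim=1))

buffer = collections.deque(maxlen=NUMV)  # sliding window of node features
for frame_feat in incoming_feature_stream():
    buffer.append(frame_feat)
    if len(buffer) == NUMV:  # wait until the first window fills
        x, edge_index = build_graph(torch.stack(list(buffer)))
        with torch.no_grad():
            scores = model(x, edge_index)
        latest = scores[-1]  # speaking score for the newest frame
```

The deque gives O(1) window updates per frame; shrinking NUMV lowers latency at the cost of temporal context, which is exactly the trade-off described above.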
That makes sense! Thank you so much for your response!
Hello again, I wanted to clarify real-time inference a bit further. Consistent real-time inference is possible under the assumption that we are able to detect faces and crop them properly for each incoming frame in a video stream, or, per the paper, for 11 consecutive frames? Then we need to encode the cropped images and the corresponding audio using the 2D ResNet with TSM, correct? And that encoding step requires additional computational power and time? Thank you so much,
Hi Hiro, Yes, all of your assumptions are correct. Our code assumes that the face bounding boxes and their initial audio-visual features are computed by other models. Best regards,
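For context on the upstream stage this thread keeps referring to, here is a rough sketch of encoding 11 consecutive face crops with a 2D ResNet plus a Temporal Shift Module. The `temporal_shift` function follows the published TSM idea (shift 1/8 of the channels one frame forward in time and 1/8 one frame backward); the resnet18 backbone, input sizes, and the place the shift is demonstrated are illustrative assumptions, not this repo's exact encoder:

```python
import torch
import torchvision

def temporal_shift(x, n_frames, fold_div=8):
    """TSM: shift 1/fold_div of channels one frame forward in time and
    another 1/fold_div one frame backward, so the following 2D convolution
    sees neighbouring frames at no extra cost.
    x: (batch * n_frames, C, H, W), the layout a 2D CNN works with."""
    bt, c, h, w = x.shape
    x = x.view(bt // n_frames, n_frames, c, h, w)
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                  # shift forward
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]  # shift backward
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]             # keep the rest
    return out.view(bt, c, h, w)

# In TSM the shift sits inside each residual block of the 2D ResNet; it is
# shown once here on a fake stage-1 feature map just to illustrate the idea.
feat_map = torch.randn(11, 64, 56, 56)           # (T, C, H, W) for 11 frames
shifted = temporal_shift(feat_map, n_frames=11)  # same shape, time-mixed

# A plain 2D backbone then encodes each face crop independently.
backbone = torchvision.models.resnet18(weights=None)
backbone.fc = torch.nn.Identity()     # keep the 512-d pooled feature
crops = torch.randn(11, 3, 224, 224)  # 11 consecutive face crops from upstream
per_frame_feats = backbone(crops)     # (11, 512) per-frame visual features
print(per_frame_feats.shape)
```

Audio would be encoded analogously and fused per frame; this encoding stage, plus face detection, typically dominates end-to-end latency relative to the graph model itself.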
Hello @hsato1, did you manage to make it work in real time? Thank you so much