<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="">
<meta name="author" content="">
<title>Prof. Saenko's Research Group</title>
<!-- Bootstrap Core CSS -->
<link href="css/bootstrap.min.css" rel="stylesheet">
<!-- Custom CSS -->
<link href="css/modern-business.css" rel="stylesheet">
<!-- Custom Fonts -->
<link href="font-awesome/css/font-awesome.min.css" rel="stylesheet" type="text/css">
<!-- HTML5 Shim and Respond.js IE8 support of HTML5 elements and media queries -->
<!-- WARNING: Respond.js doesn't work if you view the page via file:// -->
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/libs/html5shiv/3.7.0/html5shiv.js"></script>
<script src="https://oss.maxcdn.com/libs/respond.js/1.4.2/respond.min.js"></script>
<![endif]-->
</head>
<body>
<!-- Page Content -->
<div class="container">
<!--******************************************-->
<!--begin page content, edit your own page here-->
<h2>Video to text: automatic natural language description of video</h2>
<p><em>In collaboration with UT Austin and UC Berkeley</em></p>
<img style="width: 400px; float: right;" alt="video2text" src="figs/youtube2text.png" hspace="10" vspace="5">
<p>
Many core tasks in artificial intelligence require joint modeling of visual data and natural language. The past few years have seen increasing recognition of this problem, with research on connecting words and names to pictures, storytelling based on static images, and visual grounding of natural-language instructions for robotics. This project focuses on generating natural language descriptions that capture sequences of activities depicted in diverse video corpora. The major obstacles to scalable "in-the-wild" video description are limited training data, extreme diversity of visual and language content, and the lack of rich and robust representations. We tackle these obstacles by learning the underlying semantics of activities jointly from described video and from available text-only sources, and using them both to constrain visual recognition and to drive text generation.
</p>
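<p>
To make the encoder-decoder idea behind the recurrent-network papers below concrete, the following is a minimal sketch of a CNN-feature + LSTM video captioner in PyTorch. All class names, layer sizes, and dimensions here are illustrative assumptions for exposition, not the exact published architectures.
</p>
<pre><code>
# Minimal sketch of an encoder-decoder video captioner (illustrative only;
# layer sizes and structure are assumptions, not the published models).
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000):
        super().__init__()
        # Encode a sequence of per-frame CNN features into a summary state.
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        # Decode words conditioned on the encoded video state.
        self.decoder = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, captions):
        # frame_feats: (batch, n_frames, feat_dim) per-frame CNN features
        # captions:    (batch, n_words) word indices for teacher forcing
        _, state = self.encoder(frame_feats)
        dec_out, _ = self.decoder(self.embed(captions), state)
        return self.out(dec_out)  # (batch, n_words, vocab_size) word logits
</code></pre>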
<p>
<a href="http://www.eecs.berkeley.edu/~sguada/youtube2text.html">Youtube Dataset</a>
</p>
<p><strong>Papers:</strong></p>
<p>
<a href="http://arxiv.org/abs/1505.05914">A Multi-scale Multiple Instance Video Description Network.</a> Huijuan Xu, Subhashini Venugopalan, Vasili Ramanishka, Marcus Rohrbach, Kate Saenko. ICCV15 workshop on Closing the Loop Between Vision and Language
</p>
<p>
<a href="http://arxiv.org/abs/1411.4389">Long-term Recurrent Convolutional Networks for Visual Recognition and Description.</a>
Jeff Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrell. CVPR 2015.
[<a href="http://jeffdonahue.com/lrcn/">Project Website & Code</a>]
</p>
<p>
<a href="http://www.cs.utexas.edu/~ai-lab/downloadPublication.php?filename=http://www.cs.utexas.edu/users/ml/papers/venugopalan.naacl15.pdf&pubid=127495">Translating Videos to Natural Language Using Deep Recurrent Neural Networks. </a>
Subhashini Venugopalan, Huijun Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, Kate Saenko. NAACL 2015
</p>
<p>
<a href="http://anthology.aclweb.org/C/C14/C14-1115.pdf">Integrating language and vision to generate natural language descriptions of videos in the wild.</a>
J. Thomason, S. Venugopalan, S. Guadarrama, K. Saenko, and R. Mooney.
In Proceedings of the 25th International Conference on Computational Linguistics (COLING), August 2014.
</p>
<p>
<a href="http://www.eecs.berkeley.edu/~sguada/pdfs/2013-ICCV-YouTube2Text-final.pdf">Youtube2text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition.</a>
Sergio Guadarrama, Niveda Krishnamoorthy, Girish Malkarnenkar, Subhashini Venugopalan, Raymond Mooney, Trevor Darrell, and Kate Saenko; In IEEE International Conference on Computer Vision (ICCV) 2013.
[<a href="http://www.eecs.berkeley.edu/~sguada/youtube2text.html">dataset</a>]
</p>
<!--******************************************-->
<!--end of page content-->
<!-- Footer -->
<footer>
<div class="row">
<div class="col-lg-12">
<p>Copyright © Kate Saenko 2017</p>
</div>
</div>
</footer>
</div>
<!-- /.container -->
<!-- jQuery -->
<script src="js/jquery.js"></script>
<!-- Bootstrap Core JavaScript -->
<script src="js/bootstrap.min.js"></script>
</body>
</html>