Video generation from text

Published

Conference Paper

Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Generating videos from text has proven to be a significant challenge for existing generative models. We tackle this problem by training a conditional generative model to extract both static and dynamic information from text. This is manifested in a hybrid framework, employing a Variational Autoencoder (VAE) and a Generative Adversarial Network (GAN). The static features, called "gist," are used to sketch text-conditioned background color and object layout structure. Dynamic features are considered by transforming input text into an image filter. To obtain a large amount of data for training the deep-learning model, we develop a method to automatically create a matched text-video corpus from publicly available online videos. Experimental results show that the proposed framework generates plausible and diverse short-duration smooth videos, while accurately reflecting the input text information. It significantly outperforms baseline models that directly adapt text-to-image generation procedures to produce videos. Performance is evaluated both visually and by adapting the inception score used to evaluate image generation in GANs.
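The abstract describes a hybrid pipeline: a VAE path encodes the text into a static "gist" that sketches background and layout, the text is also turned into an image filter that injects dynamics, and a GAN generator expands the result into a short video. The following is a minimal PyTorch sketch of that idea only; all module names, layer sizes, the grouped-convolution realisation of the text-to-filter step, and the 3D-transposed-convolution generator are illustrative assumptions, not the authors' released implementation, and the training losses (VAE KL term, adversarial objectives) are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GistEncoder(nn.Module):
    """VAE-style static path: encodes a text embedding into a latent 'gist'
    and decodes it to a coarse spatial sketch (background colour / layout).
    Hypothetical sizes; not the paper's architecture."""
    def __init__(self, text_dim=256, latent_dim=64, channels=32, size=16):
        super().__init__()
        self.mu = nn.Linear(text_dim, latent_dim)
        self.logvar = nn.Linear(text_dim, latent_dim)
        self.to_map = nn.Linear(latent_dim, channels * size * size)
        self.channels, self.size = channels, size

    def forward(self, text_emb):
        mu, logvar = self.mu(text_emb), self.logvar(text_emb)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterisation
        gist = self.to_map(z).view(-1, self.channels, self.size, self.size)
        return gist, mu, logvar


class TextToFilter(nn.Module):
    """Dynamic path: converts the text embedding into a convolutional filter
    and applies it to the gist feature map (one hypothetical reading of
    'transforming input text into an image filter')."""
    def __init__(self, text_dim=256, channels=32, ksize=3):
        super().__init__()
        self.channels, self.ksize = channels, ksize
        self.to_filter = nn.Linear(text_dim, channels * channels * ksize * ksize)

    def forward(self, gist, text_emb):
        b, c, h, w = gist.shape
        filt = self.to_filter(text_emb).view(b * c, c, self.ksize, self.ksize)
        # Grouped convolution applies each sample's text-derived filter
        # to that sample only.
        out = F.conv2d(gist.reshape(1, b * c, h, w), filt,
                       padding=self.ksize // 2, groups=b)
        return out.view(b, c, h, w)


class VideoGenerator(nn.Module):
    """GAN generator: tiles the filtered gist over time and upsamples with
    3D transposed convolutions into a short clip (frames x height x width)."""
    def __init__(self, channels=32, base_frames=4):
        super().__init__()
        self.base_frames = base_frames
        self.net = nn.Sequential(
            nn.ConvTranspose3d(channels, 16, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose3d(16, 3, kernel_size=4, stride=2, padding=1),
            nn.Tanh(),
        )

    def forward(self, feat):
        vid = feat.unsqueeze(2).repeat(1, 1, self.base_frames, 1, 1)
        return self.net(vid)  # (B, 3, 4*base_frames, 4*H, 4*W)


# Usage sketch with a stand-in text embedding (a real system would use a
# learned text encoder):
text_emb = torch.randn(2, 256)
gist, mu, logvar = GistEncoder()(text_emb)  # static "gist" sketch
feat = TextToFilter()(gist, text_emb)       # text-conditioned dynamic features
video = VideoGenerator()(feat)              # (2, 3, 16, 64, 64) video tensor
```

The split between a deterministic text-to-filter path and a stochastic VAE latent mirrors the abstract's separation of dynamic and static information; the exact way the filter conditions the frames is an assumption made here for illustration.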

Cited Authors

  • Li, Y; Min, MR; Shen, D; Carlson, D; Carin, L

Published Date

  • January 1, 2018

Published In

  • 32nd AAAI Conference on Artificial Intelligence, AAAI 2018

Start / End Page

  • 7065 - 7072

International Standard Book Number 13 (ISBN-13)

  • 9781577358008

Citation Source

  • Scopus