- A new deep-learning-based system can automatically generate pictures from a long text description.
- Researchers demonstrated a network that takes a recipe as input and constructs a picture from scratch.
Generating pictures from a short textual description is a challenging task with numerous applications in computer vision. Recent studies have shown that Generative Adversarial Networks (GANs) can effectively synthesize realistic pictures, though typically at low resolution and with low variability.
A recent contribution from a research team at Tel Aviv University, Israel, could help accelerate research in this field. They have built a deep-learning-based model that automatically creates pictures from a text description.
In particular, they demonstrated their system by generating images of a finished meal from a written recipe alone. To do this, the system combines a state-of-the-art StackGAN with cross-modal embeddings learned jointly from cooking recipes and food images.
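StackGAN-style generators typically do not condition on the raw text embedding directly; instead they use "conditioning augmentation," sampling the conditioning vector from a Gaussian whose parameters are computed from the embedding. The sketch below illustrates that idea in plain numpy; the dimensions and projection matrices are hypothetical stand-ins, not the paper's actual layers.

```python
import numpy as np

rng = np.random.default_rng(0)

def conditioning_augmentation(text_emb, W_mu, W_logvar):
    """StackGAN-style conditioning augmentation: sample the generator's
    conditioning vector from a Gaussian parameterized by the text embedding."""
    mu = text_emb @ W_mu                    # mean of the conditioning Gaussian
    logvar = text_emb @ W_logvar            # log-variance (unconstrained)
    eps = rng.standard_normal(mu.shape)     # reparameterization trick
    c = mu + np.exp(0.5 * logvar) * eps     # sampled conditioning vector
    # KL divergence to N(0, I), added as a regularizer during training
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
    return c, kl

emb_dim, cond_dim = 1024, 128               # hypothetical sizes
text_emb = rng.standard_normal(emb_dim)
W_mu = rng.standard_normal((emb_dim, cond_dim)) * 0.01
W_logvar = rng.standard_normal((emb_dim, cond_dim)) * 0.01
c, kl = conditioning_augmentation(text_emb, W_mu, W_logvar)
print(c.shape, float(kl))
```

The Gaussian sampling smooths the conditioning manifold, which helps when the number of recipe-image pairs is small relative to the embedding dimension.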
Conditional Generative Adversarial Networks
Basically, GANs consist of two models, a generator and a discriminator, that are trained to compete with each other. The generator is designed to synthesize images that resemble the original data distribution, while the discriminator's job is to differentiate between original and synthetic images.
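This competition corresponds to two opposing loss functions. A minimal numpy sketch of the standard (non-saturating) GAN losses, using hypothetical discriminator scores rather than a real network:

```python
import numpy as np

def discriminator_loss(d_real, d_fake, eps=1e-8):
    """Discriminator wants real images scored near 1 and fakes near 0."""
    return float(-np.mean(np.log(d_real + eps) + np.log(1.0 - d_fake + eps)))

def generator_loss(d_fake, eps=1e-8):
    """Non-saturating generator loss: push the discriminator's
    score on generated images toward 1."""
    return float(-np.mean(np.log(d_fake + eps)))

# Hypothetical discriminator outputs (probability of "real") for a batch.
d_real = np.array([0.9, 0.8, 0.95])   # scores on real food photos
d_fake = np.array([0.1, 0.2, 0.05])   # scores on generated images

print(discriminator_loss(d_real, d_fake))  # low: discriminator is winning
print(generator_loss(d_fake))              # high: generator must improve
```

At equilibrium the discriminator outputs 0.5 everywhere and neither side can improve.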
In this work, the researchers used conditional GANs, in which both the generator and the discriminator are conditioned on additional information. They proposed two embedding techniques, one with semantic regularization and one without, each composed of three steps:
- Initial embedding of the ingredients and cooking instructions.
- Combined neural embedding of the whole recipe.
- Integration of a semantic regularization loss using a high-level classification objective.
The conditional GAN was trained on 52,000 text-based recipes and their corresponding pictures, using NVIDIA TITAN X GPUs with the CUDA Deep Neural Network library (cuDNN). Once trained, the system constructs a picture of what the finished dish might look like from a long description that contains no explicit visual information.
Reference: arXiv:1901.02404 | Tel Aviv University
The network takes a recipe as input and creates, from scratch, a picture that best reflects the text description of the food. What is really impressive is that the system has no access to the title of the recipe (otherwise the job would be too easy), and the text of the recipe is quite long. This makes the task difficult even for humans.
To evaluate the synthesized pictures, the team asked 30 people to rate how appealing the images were on a scale of 1 to 5. Each participant was shown 10 randomly chosen pairs of resulting images, with each pair containing one image generated by each embedding technique.
The results showed that the non-semantic regularization method outperforms the semantic regularization by producing more vivid pictures with photorealistic details. In fact, some people found it very hard to differentiate between real and synthetic images.
Moreover, both embedding techniques succeeded in producing ‘porridge-like’ food pictures (such as salads, soups, and rice) but struggled to create pictures of foods with a distinctive shape (such as chicken, hamburgers, and drinks).