Microsoft Develops AI That Can Draw Pictures Based On Your Description

  • Microsoft’s new drawing bot can create pictures from text descriptions and captions.
  • It consists of two machine learning models – a generator and a discriminator.
  • The AI can even draw details that are not described, using its own imagination.
  • In the future, this technology could be used to make animated movies based on screenplays.

In the last couple of years, there have been numerous experiments that turned vague drawings into fine clip art. But this time, Microsoft has come up with something better. Its latest bot, based on a deep learning technique, can create photorealistic images from text descriptions, using a large library of similar pictures to draw on.

The AI (which is still under development at the Microsoft Research lab) is designed to analyze each individual word when generating images from text descriptions or captions. The researchers claim that the new technique produces nearly three times better image quality than the previous state-of-the-art text-to-image generation approaches.

Let’s find out what they have actually built and how it works.

The Drawing Bot and Its Artificial Imagination

If you are asked to draw a blue bird with red wings and a short beak, you will probably start with a rough outline. Then you will go into details and reach for a blue pen to fill in the body. Chances are you will read the description again and reach for a red sketch pen to draw the wings. Finally, you will finish the drawing with a reflective glint. That is essentially what the bot does.

The drawing bot is capable of creating pictures of anything – from tools and gadgets to living species. To make a picture look more realistic, the bot can even draw details that are not described, indicating its ability to imagine things.

The picture is created pixel by pixel, from scratch, and it may or may not depict something that exists in the real world. Since this process requires machine learning algorithms to guess and imagine missing parts of the image, it is a more challenging task than captioning a picture.

Generating the Image

The drawing bot is built on what the researchers call an Attentional Generative Adversarial Network (AttnGAN), which synthesizes fine-grained details in different subregions of the image by analyzing the individual words of the description. It has two machine learning models:

  1. Generator – creates images from the text descriptions.
  2. Discriminator – judges the authenticity of the generated images against the descriptions.

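The two models are trained against each other: the generator tries to produce images that the discriminator accepts as real, while the discriminator tries to tell generated images from real ones. This competition pushes both models to improve.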
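To make the generator and discriminator roles concrete, here is a minimal sketch of the standard GAN training objectives in plain Python. It is not AttnGAN's actual implementation: the function names and the example scores are hypothetical, and the discriminator is reduced to the probabilities it might output for (image, caption) pairs.

```python
import math

def bce(prob, target):
    """Binary cross-entropy for one predicted probability and a 0/1 target."""
    eps = 1e-12  # clamp to avoid log(0)
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1 - target) * math.log(1.0 - prob))

def discriminator_loss(scores_real, scores_fake):
    """The discriminator wants real pairs scored near 1 and generated pairs near 0."""
    real_term = sum(bce(p, 1) for p in scores_real) / len(scores_real)
    fake_term = sum(bce(p, 0) for p in scores_fake) / len(scores_fake)
    return real_term + fake_term

def generator_loss(scores_fake):
    """The generator wants its own images scored near 1 (non-saturating form)."""
    return sum(bce(p, 1) for p in scores_fake) / len(scores_fake)

# Hypothetical discriminator outputs: probability that an (image, caption) pair is real.
real = [0.90, 0.95]
early_fake = [0.05, 0.10]  # crude early images, easily spotted as fake
late_fake = [0.45, 0.55]   # later images that nearly fool the discriminator

print(generator_loss(early_fake) > generator_loss(late_fake))
print(discriminator_loss(real, early_fake) < discriminator_loss(real, late_fake))
```

Both lines print True: as the generated images become more convincing, the generator's loss falls and the discriminator's loss rises, which is the tug-of-war that drives training.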

The bot is trained on datasets containing thousands of paired photos and captions, which enable the system to learn how to precisely match words with their visual representations. For instance, AttnGAN learns to create a picture of an elephant when the caption contains the word “elephant”, and likewise learns what a photo of an elephant should look like.

To understand complex sentences, the system breaks the text into separate words and matches those words to specific regions of the picture.
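The word-to-region matching can be illustrated with a simplified attention step: each image region scores every word, the scores are normalized with a softmax, and the region receives a weighted blend of the word vectors. This is only a sketch under stated assumptions. The three-dimensional vectors, the region names, and the `attend` helper are made up for illustration, and the real model learns its features rather than using hand-written ones.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def attend(region, word_vectors):
    """Return (weights over words, blended word-context vector) for one region."""
    weights = softmax([dot(region, w) for w in word_vectors])
    dims = len(word_vectors[0])
    context = [sum(wt * w[d] for wt, w in zip(weights, word_vectors))
               for d in range(dims)]
    return weights, context

# Hypothetical toy features: one vector per word and one per image region.
words = ["blue", "red", "beak"]
word_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
regions = {"body": [4.0, 0.0, 0.0], "wings": [0.0, 4.0, 0.0]}

for name, feature in regions.items():
    weights, _ = attend(feature, word_vectors)
    best = words[weights.index(max(weights))]
    print(name, "attends most to", best)
```

With these toy vectors, the body region attends most to "blue" and the wings region to "red", mirroring the blue-bird-with-red-wings example.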

During the training phase, the system also picks up what we would call common sense. It uses this ‘artificial commonsense knowledge’ to fill in details of pictures that are left to the imagination.

How AI puts the bird together

The image of the bird above was generated by the drawing bot. The description gave no specific detail about the location of the bird. However, instead of placing a stationary bird against a background that merely resembles the sky, the AI elected to place the bird on a branch, which demonstrates its artificial imagination.

Reference: arXiv | 1711.10485 | Microsoft 

The system learned from the training data where a bird belongs. The decision to put the bird on a branch reflects the fact that most of the pictures in the training data show birds perched on branches rather than flying. This ability to go beyond the given instruction is really impressive.

Some more creations from the AI

To push the system’s imagination, the researchers asked it to produce a picture of a double-decker bus floating on a lake. The best it could do was a drippy, blurry picture that resembles both a double-decker bus and a boat with two decks, on a lake surrounded by mountains. This suggests that the AI was torn between the bus description and the fact that boats float on lakes.

Results and Applications

The new AI performed much better than the previous state-of-the-art techniques, improving the best reported inception score by 170.25% on the challenging COCO dataset and 14.14% on the CUB dataset.
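For readers unfamiliar with the metric, the inception score is the exponential of the average KL divergence between the class prediction for each generated image and the average prediction over all images. It rewards images that are individually recognizable and collectively diverse. The sketch below computes it from hand-written probability lists. In practice the probabilities come from a pretrained Inception classifier, and the baseline value used in the percentage example is a made-up placeholder, not a figure from the paper.

```python
import math

def inception_score(class_probs):
    """exp of the mean KL divergence between each p(y|x) and the marginal p(y)."""
    n = len(class_probs)
    k = len(class_probs[0])
    # Marginal class distribution, averaged over all images.
    marginal = [sum(row[c] for row in class_probs) / n for c in range(k)]
    total_kl = 0.0
    for row in class_probs:
        # Terms with p == 0 contribute nothing to the KL sum.
        total_kl += sum(p * math.log(p / marginal[c])
                        for c, p in enumerate(row) if p > 0)
    return math.exp(total_kl / n)

def relative_improvement(old, new):
    """Percentage improvement of new over old."""
    return 100.0 * (new - old) / old

# Every image looks equally like both classes: the lowest possible score.
blurry = [[0.5, 0.5], [0.5, 0.5]]
# Each image is confidently one class, and both classes appear: the highest score for 2 classes.
sharp = [[1.0, 0.0], [0.0, 1.0]]

print(round(inception_score(blurry), 6))
print(round(inception_score(sharp), 6))

# A 170.25% improvement means the new score is about 2.7 times the old one.
placeholder_baseline = 10.0  # hypothetical value, for illustration only
print(round(relative_improvement(placeholder_baseline, placeholder_baseline * 2.7025), 2))
```

The last line shows why a 170.25% gain is described as a nearly threefold boost: the improved score is 2.7025 times the baseline.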

These types of technologies could be used as sketch assistants for interior designers, or as voice-activated photo refinement tools. For now, the system is not perfect, but in the future, with more computational power, it could create animated movies based on screenplays.


Of course, this isn’t the first technology to combine art and artificial intelligence. Most of the time, the intersection of the two leads to fascinating results. Google’s AI, for example, drew trippy machine-generated images that got their own art show in 2016. Google has also developed an automated drawing bot and a neural network that guesses what you are trying to draw.

Facebook, on the other hand, has worked on teaching deep neural networks to generate basic images of things like cars, ships and animals. It is also working on a system that can create a Bitmoji-like avatar from your photo. Moreover, in 2017, Nvidia developed an AI that creates computer-generated celebrities.

Written by
Varun Kumar
