DeepIM: An AI To Estimate 6D Pose From 2D Color Image

  • DeepIM is a deep learning based system that accurately estimates the 6D pose of objects using color images only. 
  • It considerably outperforms state-of-the-art color-only methods and performs on par with methods that use depth images for pose refinement. 

Several real-world applications require object localization in 3D from a standard image. For example, in virtual reality apps, the ability to recognize the 6D pose (3D location and 3D orientation) of objects allows virtual interactions between humans and objects. In robotics, it gives a robot the information it needs to identify and move objects in its vicinity.

Recently developed technologies use depth cameras to estimate the 6D pose of objects, but they are not always accurate. These cameras are limited in depth range, resolution, field of view and frame rate. As a result, they struggle with thin, transparent, small and fast-moving objects.

Another way to estimate the 6D pose of an object is to use an RGB image. But this approach is equally challenging, because the object’s appearance changes with occlusions, lighting and pose variations. The algorithm also has to handle both textured and texture-less objects.

Now, researchers at the University of Washington, Tsinghua University and NVIDIA have developed a deep learning based system, named DeepIM, that performs iterative 6D pose matching using color images only.

How It Works

Given an initial 6D pose estimate of an object, DeepIM predicts a relative pose transformation that can be applied to the initial pose to improve the estimate. The step can be repeated, with each refined pose serving as the starting point for the next prediction. During training, the deep neural network gradually learns to match the pose of the object.
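
The refinement loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' code: `mock_network` is a hypothetical stand-in that simply returns half of the remaining error (the real network predicts the relative pose from images and never sees the true pose), rotations are restricted to the z axis for brevity, and the choice of four iterations is arbitrary.

```python
import math

def rot_z(angle):
    """3x3 rotation about the z axis (angle in radians)."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def matmul(a, b):
    """Multiply two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def yaw_of(R):
    """Recover the z-axis rotation angle from a rot_z-style matrix."""
    return math.atan2(R[1][0], R[0][0])

def mock_network(R_cur, t_cur, R_obs, t_obs):
    """Hypothetical stand-in for the trained network.

    It cheats by looking at the observed pose and returns a relative
    transform that removes half of the remaining error.
    """
    d_yaw = 0.5 * (yaw_of(R_obs) - yaw_of(R_cur))
    d_t = [0.5 * (o - c) for o, c in zip(t_obs, t_cur)]
    return rot_z(d_yaw), d_t

def refine(R_init, t_init, R_obs, t_obs, iterations=4):
    """Repeatedly apply the predicted relative pose to the current estimate."""
    R, t = R_init, list(t_init)
    for _ in range(iterations):
        R_delta, t_delta = mock_network(R, t, R_obs, t_obs)
        R = matmul(R_delta, R)                   # rotate the current estimate
        t = [a + b for a, b in zip(t, t_delta)]  # shift the current estimate
    return R, t

R_true, t_true = rot_z(math.radians(40.0)), [0.10, -0.05, 0.80]
R_est, t_est = refine(rot_z(math.radians(10.0)), [0.0, 0.0, 1.0], R_true, t_true)
yaw_error_deg = abs(math.degrees(yaw_of(R_true) - yaw_of(R_est)))
```

With this stand-in, each pass halves the error, so the initial 30-degree rotation error shrinks with every iteration.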

Network architecture for pose matching | Courtesy of researchers

DeepIM’s architecture is built on FlowNetSimple, a network trained to estimate optical flow between a pair of images. According to the paper, the researchers also tried a VGG16 image classification network as the backbone, but it performed considerably worse. The pose branch takes the feature map produced after 11 convolution layers and passes it through 2 fully connected layers, followed by 2 additional fully connected layers that estimate the quaternion of the 3D rotation and the 3D translation, respectively. During training, two auxiliary branches regularize the network’s feature representation and improve training stability.
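
To make the two outputs concrete, the sketch below turns a quaternion and a translation vector into a single 4x4 pose matrix. It is an illustration only: the (w, x, y, z) ordering and the function names are assumptions, and the quaternion is normalized first because a raw network output is not guaranteed to have unit length.

```python
import math

def quaternion_to_matrix(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix.

    The ordering is an assumption made for this sketch.
    """
    norm = math.sqrt(sum(c * c for c in q))
    w, x, y, z = (c / norm for c in q)  # normalize to unit length
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]

def to_homogeneous(q, t):
    """Combine rotation and translation into a 4x4 rigid transform."""
    R = quaternion_to_matrix(q)
    return [R[0] + [t[0]], R[1] + [t[1]], R[2] + [t[2]], [0.0, 0.0, 0.0, 1.0]]

# An unnormalized quaternion for a 90-degree turn about the z axis.
pose = to_homogeneous((1.0, 0.0, 0.0, 1.0), [0.1, 0.2, 0.9])
```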

Reference: arXiv:1804.00175

Training Strategy: For each picture, they created 10 random poses near the ground truth pose, forming 2,000 training samples for every single object in the training dataset. They also created 10,000 additional synthetic pictures for individual objects where the distribution of pose is similar to the real training set. Therefore, they had a total of 12,000 samples for every object in training.
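
The sample counts above can be reproduced with a short sketch. The counts come from the text, but everything else is an assumption made for illustration: poses are stored as Euler angles plus a translation, the noise is Gaussian, and the noise levels (`angle_sigma_deg`, `trans_sigma`) are hypothetical values, not the ones used by the researchers.

```python
import random

POSES_PER_IMAGE = 10       # random poses per picture (from the article)
REAL_SAMPLES = 2000        # real training samples per object (from the article)
SYNTHETIC_IMAGES = 10000   # synthetic pictures per object (from the article)

def perturb_pose(euler_deg, translation, rng,
                 angle_sigma_deg=15.0, trans_sigma=0.01):
    """Return a random pose near the ground truth (noise levels are hypothetical)."""
    noisy_angles = [a + rng.gauss(0.0, angle_sigma_deg) for a in euler_deg]
    noisy_trans = [t + rng.gauss(0.0, trans_sigma) for t in translation]
    return noisy_angles, noisy_trans

def make_training_poses(gt_poses, rng):
    """Create POSES_PER_IMAGE perturbed poses for every ground-truth pose."""
    samples = []
    for euler_deg, translation in gt_poses:
        for _ in range(POSES_PER_IMAGE):
            samples.append(perturb_pose(euler_deg, translation, rng))
    return samples

rng = random.Random(0)
real_images = REAL_SAMPLES // POSES_PER_IMAGE            # pictures per object
gt_poses = [([0.0, 0.0, 0.0], [0.0, 0.0, 1.0])] * real_images
real_samples = make_training_poses(gt_poses, rng)
total_samples = len(real_samples) + SYNTHETIC_IMAGES
```

The figures imply 200 real pictures per object: 200 pictures times 10 poses gives the 2,000 real samples, and the 10,000 synthetic pictures bring the total to 12,000.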

The network is trained on thousands of images (taken from the LINEMOD dataset) using NVIDIA Tesla V100 GPUs with the MXNet framework.

The researchers also developed an untangled pose representation that does not depend on the coordinate frame of the object’s 3D model. This makes DeepIM more general: the neural network can match the poses of objects it has not seen during training.
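
One way to see why such independence matters is to compare two ways of writing the same relative rotation. The sketch below is an interpretation for illustration, not the paper's formulation, and it covers only the rotation part: a correction expressed along the camera's axes stays the same when the object's model axes are redefined (the hypothetical matrix `Q`), while a correction expressed along the object's own axes changes.

```python
import math

def rot(axis, deg):
    """3x3 rotation matrix about the x, y or z axis (angle in degrees)."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return {
        "x": [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]],
        "y": [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]],
        "z": [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]],
    }[axis]

def matmul(a, b):
    """Multiply two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transpose(m):
    return [list(row) for row in zip(*m)]

def max_diff(a, b):
    """Largest absolute element-wise difference between two matrices."""
    return max(abs(a[i][j] - b[i][j]) for i in range(3) for j in range(3))

R_src = rot("z", 30)                     # current pose estimate
R_tgt = matmul(rot("y", 20), R_src)      # observed pose

Q = rot("x", 90)                         # hypothetical redefinition of model axes
R_src_alt, R_tgt_alt = matmul(R_src, Q), matmul(R_tgt, Q)

# Correction along the camera axes: R_tgt = delta * R_src
cam_delta = matmul(R_tgt, transpose(R_src))
cam_delta_alt = matmul(R_tgt_alt, transpose(R_src_alt))

# Correction along the object's own axes: R_tgt = R_src * delta
obj_delta = matmul(transpose(R_src), R_tgt)
obj_delta_alt = matmul(transpose(R_src_alt), R_tgt_alt)

camera_frame_change = max_diff(cam_delta, cam_delta_alt)   # stays near zero
object_frame_change = max_diff(obj_delta, obj_delta_alt)   # clearly non-zero
```

A network that predicts the camera-axis correction therefore does not need to know how a particular model's axes were defined, which is consistent with the claim that it can handle unseen objects.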

Using only color images, this technique considerably outperforms state-of-the-art approaches to 6D pose estimation. Its performance is comparable to refinement with the iterative closest point (ICP) algorithm, a method that relies on depth images.


DeepIM can be used in a wide range of applications. Because it produces precise 6D pose estimates from color images alone, it can supply detailed and useful estimates for applications like virtual reality and robot manipulation. A stereo version of the method could further improve the accuracy of the poses.
