- Microsoft sets new records in the field of conversational artificial intelligence.
- The company has developed an enhanced version of its Multi-Task Deep Neural Network (MT-DNN) for learning text representations across a range of natural language understanding tasks.
Robust and universal language representations are essential for achieving good results on a variety of Natural Language Processing (NLP) tasks. Ensemble learning is one of the most effective approaches to improving model generalization, and developers have used it to obtain state-of-the-art results on a range of natural language understanding (NLU) tasks, from machine reading comprehension to question answering.
However, such ensemble models combine hundreds of deep neural network (DNN) models and are quite expensive to deploy. Large pretrained models, such as GPT and BERT, are also costly: GPT, for instance, consists of 48 transformer layers with 1.5 billion parameters, while BERT has 24 transformer layers with 344 million parameters.
In 2019, Microsoft introduced its own NLP model, the Multi-Task DNN (MT-DNN). The team has now updated the model and reports impressive results.
Extending Knowledge Distillation
The research team compressed several ensemble models into a single MT-DNN using knowledge distillation. They ran the ensemble offline to generate soft targets for each task in the training dataset; compared to hard one-hot targets, these soft targets carry more information per training sample.
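The idea of generating soft targets from an ensemble can be sketched as follows. This is a minimal, stdlib-only illustration, not Microsoft's implementation: the logit values, the two-member class layout, and the temperature value are all hypothetical. Each ensemble member's logits are softened with a temperature-scaled softmax and then averaged into one target distribution per example.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature yields a
    softer (more uniform) probability distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_soft_targets(ensemble_logits, temperature=2.0):
    """Average the softened class probabilities of each ensemble
    member to form the soft target for one training example."""
    probs = [softmax(logits, temperature) for logits in ensemble_logits]
    n = len(probs)
    return [sum(p[i] for p in probs) / n for i in range(len(probs[0]))]

# Hypothetical logits from three ensemble members for one sentence
# (classes: [negative, positive]):
member_logits = [[0.8, 2.1], [1.0, 1.9], [0.5, 2.4]]
soft = ensemble_soft_targets(member_logits)
```

Unlike a hard label, the resulting `soft` vector preserves how confident the ensemble is, which is the extra signal the student model learns from.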
Take the sentence “I had a good chat with John last evening”: the sentiment here is unlikely to be negative. The sentence “We had an intriguing conversation last evening”, however, could be either negative or positive, depending on the context.
The researchers used both the correct (hard) targets and the soft targets across the various tasks to train a single MT-DNN. They used the cuDNN-accelerated PyTorch deep learning framework to train and test the new model on NVIDIA Tesla V100 GPUs.
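Combining hard and soft targets typically means mixing two cross-entropy terms in one loss. The sketch below, in plain Python, shows that structure under stated assumptions: the mixing weight `alpha`, the probability values, and the equal weighting scheme are illustrative choices, not values published for MT-DNN.

```python
import math

def cross_entropy(target_probs, pred_probs):
    """H(target, prediction) = -sum_i t_i * log(p_i)."""
    return -sum(t * math.log(p) for t, p in zip(target_probs, pred_probs) if t > 0)

def distillation_loss(pred_probs, hard_label, soft_targets, alpha=0.5):
    """Weighted sum of the hard-label loss and the soft-target loss.
    alpha is an illustrative mixing weight, not a value from the paper."""
    num_classes = len(pred_probs)
    one_hot = [1.0 if i == hard_label else 0.0 for i in range(num_classes)]
    hard_loss = cross_entropy(one_hot, pred_probs)   # fit the correct label
    soft_loss = cross_entropy(soft_targets, pred_probs)  # mimic the ensemble
    return alpha * hard_loss + (1 - alpha) * soft_loss

# Hypothetical student prediction, gold label, and ensemble soft target
# for a two-class sentiment example:
pred = [0.2, 0.8]
loss = distillation_loss(pred, hard_label=1, soft_targets=[0.3, 0.7])
```

With `alpha=1.0` this reduces to ordinary supervised training; lowering it shifts weight toward matching the ensemble's softened predictions.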
They compared the distilled MT-DNN with the standard MT-DNN and with BERT. The results show that the distilled MT-DNN outperforms both by a significant margin in overall score on the General Language Understanding Evaluation (GLUE) benchmark, which tests system performance across a wide range of linguistic phenomena.
GLUE benchmark score
The benchmark comprises nine NLU tasks, including text similarity, textual entailment, sentiment analysis, and question answering. The data contain several hundred sentence pairs drawn from different sources, such as academic and encyclopedic text, news, and social media.
All experiments performed in this research show that the language representations learned by the distilled MT-DNN are more universal and robust than those of the standard MT-DNN and BERT.
In the coming years, researchers will look for better ways of combining hard and soft targets for multi-task learning. And rather than compressing a complex model into a simpler one, they will explore ways of using knowledge distillation to improve model performance regardless of its complexity.