Wu Dao 2.0: China’s Answer to GPT-3 May in fact be Better
The Chinese government-backed Beijing Academy of Artificial Intelligence (BAAI) has introduced Wu Dao 2.0, the largest language model to date, with 1.75 trillion parameters. It surpasses OpenAI’s GPT-3 and Google’s Switch Transformer in size. Hugging Face’s DistilBERT and Google’s GShard are other popular language models. Wu Dao means ‘enlightenment’ in English.
“Wu Dao 2.0 aims to enable ‘machines’ to think like ‘humans’ and achieve cognitive abilities beyond the Turing test,” said Tang Jie, the lead researcher behind Wu Dao 2.0. The Turing test checks whether a computer can exhibit behaviour indistinguishable from that of a human.
Smartphone maker Xiaomi, short-video giant Kuaishou, on-demand service provider Meituan, more than 100 scientists and multiple organisations have collaborated with BAAI on this project.
Wu Dao 2.0
Wu Dao 2.0 is a pre-trained AI model that uses 1.75 trillion parameters to simulate conversational speech, write poems, understand pictures and even generate recipes. The next-generation Wu Dao model can also predict the 3D structures of proteins, similar to DeepMind’s AlphaFold, and power virtual idols. Recently, China’s first virtual student, Hua Zhibing, was built on Wu Dao 2.0.
Wu Dao 2.0 was trained with FastMoE, a Fast Mixture-of-Experts (MoE) training system similar to Google’s Mixture of Experts. Unlike Google’s MoE, FastMoE is an open-source system based on PyTorch (Facebook’s open-source framework) and works with common accelerators. It provides a hierarchical interface for flexible model design and easy adaptation to applications such as Transformer-XL and Megatron-LM. The source code of FastMoE is publicly available.
“[FastMoE] is simple to use, high-performance, flexible, and supports large-scale parallel training,” wrote BAAI in its official WeChat blog.
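For readers curious about what expert routing actually looks like, here is a minimal, illustrative sketch of a mixture-of-experts layer in PyTorch. It shows the general idea behind systems like FastMoE and Google’s MoE (a gating network routes each token to one of several expert networks); it does not use FastMoE’s actual API, and all names and sizes are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoELayer(nn.Module):
    """Toy mixture-of-experts layer: a gating network routes each token
    to its top-1 expert feed-forward network. Illustrative sketch only,
    not the FastMoE API."""
    def __init__(self, d_model=512, d_hidden=2048, num_experts=4):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)  # router
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                       # x: (tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)
        top_expert = scores.argmax(dim=-1)      # top-1 routing
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top_expert == i
            if mask.any():
                # only the chosen expert processes these tokens
                out[mask] = expert(x[mask]) * scores[mask, i].unsqueeze(-1)
        return out

x = torch.randn(16, 512)           # 16 tokens
print(SimpleMoELayer()(x).shape)   # torch.Size([16, 512])
```

Because each token only activates one expert, a model can grow to trillions of parameters while keeping the compute per token roughly constant, which is the appeal of MoE-style training systems.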
In terms of results, Wu Dao 2.0 has surpassed SOTA levels on nine benchmark tasks, including:
- ImageNet (zero-shot) SOTA, exceeding OpenAI CLIP
- LAMA knowledge detection, surpassing AutoPrompt
- LAMBADA cloze (ability-wise), surpassing Microsoft Turing NLG
- SuperGLUE (few-shot), surpassing OpenAI GPT-3
- UC Merced Land-Use (zero-shot) SOTA, exceeding OpenAI CLIP
- MS COCO (text-to-image generation), surpassing OpenAI DALL-E
- MS COCO (English image-text retrieval), surpassing Google ALIGN and OpenAI CLIP
- MS COCO (multilingual image-text retrieval), surpassing UC2 and M3P (the current best multilingual and multimodal models)
- Multi30K (multilingual image-text retrieval), surpassing UC2 and M3P
Showcasing benchmark tasks where Wu Dao 2.0 surpasses other SOTA models (Source: BAAI)
Towards multimodal model
Currently, AI systems are moving towards GPT-like multimodal and multitasking models in pursuit of artificial general intelligence (AGI). Experts believe there will be a rise in multimodal models in the coming months. Meanwhile, some researchers are rooting for embodied AI, rejecting traditional disembodied models such as neural networks altogether.
Unlike GPT-3, Wu Dao 2.0 covers both Chinese and English, with skills acquired by studying 4.9 terabytes of texts and images, including 1.2 terabytes each of Chinese and English text.
Google has also been working towards developing a multimodal model similar to Wu Dao. At Google I/O 2021, the search giant unveiled language models such as LaMDA (trained on 2.6 billion parameters) and MUM (Multitask Unified Model), which is trained across 75 different languages and said to be 1,000 times more powerful than BERT. At the time, Google CEO Sundar Pichai said that LaMDA, currently trained only on text, will soon shift to a multimodal model integrating text, image, audio and video.
The training data of Wu Dao 2.0 include:
- 1.2 terabytes of English text data in the Pile dataset
- 1.2 terabytes of Chinese text in Wu Dao Corpora
- 2.5 terabytes of Chinese graphic data
Blake Yan, an AI researcher from Beijing, told South China Morning Post that these advanced models, trained on massive datasets, are good at transfer learning, just like humans. “Large-scale ‘pre-trained models’ are one of today’s best shortcuts to AGI,” said Yan.
“No one knows which is the right step,” OpenAI wrote in its GPT-3 demo blog post. “Even if larger ‘pre-trained models’ are the logical trend today, we may be missing the forest for the trees, and we may end up reaching a less determined ceiling ahead. The only clear aspect is that if the world has to suffer from ‘environmental damage,’ ‘harmful biases,’ or ‘high economic costs,’ not even reaching AGI would be worth it.”
Turing NLG, GPT-3 & Wu Dao 2.0: Meet The Who’s Who Of Language Models
Language modelling involves the use of statistical and probabilistic techniques to determine the probability of a given sequence of words in a sentence. To make word predictions, language models analyse preceding text data. Language modelling is typically used in applications such as machine translation and question answering. Many researchers and developers working on robust and efficient language models posit that larger models, trained with more parameters, produce better outcomes. In this article, we compare three massive language models to find out if the theory holds.
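To make the idea of next-word prediction concrete, the snippet below scores candidate next words for a prompt using the small, openly available GPT-2 model from the Hugging Face transformers library. It is a minimal sketch of how a language model assigns probabilities to the next token, not a description of how any of the models below were trained.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal sketch: score the most likely next tokens for a prompt
# using the small, openly available GPT-2 model.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The largest language model in the world is trained on"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits              # (1, seq_len, vocab_size)

next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(token_id.item()):>12}  p={prob:.3f}")
```

The models discussed below follow the same basic recipe, only at vastly larger scale in parameters, data and compute.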
Turing NLG
Microsoft introduced Turing NLG in early 2020. At the time, it held the distinction of being the largest model ever published, with 17 billion parameters. A Transformer-based generative language model, Turing NLG (T-NLG) is part of Microsoft’s Turing project.
T-NLG can generate words to complete open-ended textual tasks and unfinished sentences. Microsoft claims the model can generate direct answers to questions and summarise documents. The team behind T-NLG believes that the bigger the model, the better it performs with fewer training examples. They also argue it is more efficient to train a single large multi-task model than to train a new model for every task.
T-NLG was trained on the same type of data as NVIDIA’s Megatron-LM, with a maximum learning rate of 1.5×10^-4. Microsoft used its DeepSpeed library, running on 256 NVIDIA GPUs, to train the large model more efficiently with fewer GPUs.
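For illustration, here is a rough sketch of how an ordinary PyTorch model can be wrapped with DeepSpeed so that optimizer states and gradients are partitioned across GPUs (ZeRO). Apart from the 1.5×10^-4 peak learning rate cited above, the configuration values are placeholders, not Microsoft’s actual T-NLG settings.

```python
import deepspeed
import torch.nn as nn

# Illustrative DeepSpeed setup. The config values below are placeholders,
# except the 1.5e-4 peak learning rate cited for T-NLG.
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))

ds_config = {
    "train_batch_size": 512,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 1},       # partition optimizer states
    "optimizer": {"type": "Adam", "params": {"lr": 1.5e-4}},
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
# Training then proceeds with model_engine(...), model_engine.backward(loss)
# and model_engine.step() in place of the usual PyTorch calls.
```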
GPT-3
In July last year, OpenAI released GPT-3, an autoregressive language model trained on public datasets with 500 billion tokens and 175 billion parameters, making it at least ten times bigger than any previous non-sparse language model. To put things into perspective, its predecessor GPT-2 had just 1.5 billion parameters.
GPT-3 is applied without any gradient updates or fine-tuning. It achieves strong performance on many NLP datasets and can perform tasks such as translation, question answering, reasoning, and three-digit arithmetic.
OpenAI’s language model achieved promising results in the zero-shot and one-shot settings, and occasionally surpassed state-of-the-art models in the few-shot setting.
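In practice, the few-shot setting means packing a handful of worked examples into the prompt itself, with no gradient updates. The sketch below builds such a prompt for a toy sentiment task; the commented-out API call reflects the pre-v1 OpenAI Python client, and the engine name and task are illustrative assumptions.

```python
# Sketch of few-shot prompting: the "training" happens entirely in the
# prompt, with no gradient updates to the model itself.
examples = [
    ("I loved this film, it was wonderful.", "positive"),
    ("The plot made no sense at all.", "negative"),
    ("An instant classic.", "positive"),
]
query = "Two hours of my life I will never get back."

prompt = "Classify the sentiment of each review.\n\n"
for review, label in examples:
    prompt += f"Review: {review}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"

print(prompt)

# With access to the GPT-3 API, the prompt would be sent roughly like this
# (pre-v1 OpenAI Python client; engine name is illustrative):
#
#   import openai
#   response = openai.Completion.create(
#       engine="davinci", prompt=prompt, max_tokens=1, temperature=0
#   )
#   print(response["choices"][0]["text"])
```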
GPT-3 has many diverse applications, including:
- The Guardian published an entire article written using GPT-3, titled “A robot wrote this entire article. Are you scared yet, human?” The footnote said the model was given specific instructions on word count, language choice, and a short prompt.
- Solicitors, a short film of approximately four minutes, was written by GPT-3.
- A bot powered by GPT-3 was found to be interacting with people in a Reddit thread.
The industry’s reaction towards GPT-3 has been mixed. The language model has courted controversy over inherent biases, a tendency to go rogue when left to its own devices, and its overhyped capabilities.
Wu Dao 2.0
Wu Dao 2.0 is the latest offering from the Chinese government-backed Beijing Academy of Artificial Intelligence (BAAI). It is the largest language model to date, with 1.75 trillion parameters, surpassing previous models such as GPT-3 and Google’s Switch Transformer in size. Unlike GPT-3, Wu Dao 2.0 covers both Chinese and English, with skills acquired by studying 4.9 terabytes of texts and images, including 1.2 terabytes each of Chinese and English text.
It can perform tasks such as simulating conversational speech, writing poetry, understanding pictures, and even generating recipes. It can also predict the 3D structures of proteins, like DeepMind’s AlphaFold. China’s first virtual student, Hua Zhibing, was built on Wu Dao 2.0.
Wu Dao 2.0 was trained with FastMoE, a fast Mixture-of-Experts training system. FastMoE is a PyTorch-based open-source system akin to Google’s Mixture of Experts. It offers a hierarchical interface for flexible model design and easy adaptation to applications such as Transformer-XL and Megatron-LM.
Are bigger models better?
The size of language models keeps increasing. Bigger models are assumed to generalise better and to take us a step closer to artificial general intelligence.
Former Google AI researcher Timnit Gebru detailed the risks associated with large language models in her controversial paper “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?”. The paper argued that although these models are extraordinarily good and can produce meaningful results, they carry risks such as huge carbon footprints.
Echoing similar sentiments, Facebook’s Yann LeCun said, “It’s entertaining, and perhaps mildly useful as a creative help. But trying to build intelligent machines by scaling up language models is like building high-altitude airplanes to go to the moon. You might beat altitude records, but going to the moon will require a completely different approach.”
All three language models discussed here were introduced within a span of just one and a half years. Research communities around the world are gearing up to develop the next ‘biggest’ language model, chasing unparalleled efficiency at task execution and the AGI holy grail. However, the lingering question is whether this is the right way to achieve AGI, especially in the face of risks such as bias, discrimination, and environmental costs.