GPT-3's Economical Chinese Cousin
GPT-3, the largest neural language model of its time with 175Bn parameters, became a breakthrough innovation last year. Launched in June 2020, it writes code, blog posts, website copy & stories that read as though a human wrote them, & has been used to build apps. Its predecessor, GPT-2, had just 1.5Bn parameters. Large pre-trained language models (PLMs) such as GPT-3 have demonstrated state-of-the-art performance on natural language generation with few-shot in-context learning.
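Few-shot in-context learning means the "training" examples are placed directly in the prompt & the model is asked to continue the pattern, with no gradient updates. A minimal sketch of how such a prompt is assembled (the task, examples & labels here are invented for illustration):

```python
# Build a few-shot prompt: demonstration pairs followed by the new query.
# The model's job would be to complete the text after the final "Output:".

def build_few_shot_prompt(examples, query):
    """Format (input, output) pairs plus a new query as a single prompt."""
    lines = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    lines.append(f"Input: {query}\nOutput:")
    return "\n\n".join(lines)

examples = [
    ("The movie was wonderful", "positive"),
    ("I want my money back", "negative"),
]
prompt = build_few_shot_prompt(examples, "A delightful surprise")
print(prompt)
```

The same pattern works for translation, Q&A or summarization; only the demonstrations change, not the model.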
However, among other drawbacks, most large language models are only available in English; massive models like GPT-3 have been trained on a 45TB dataset drawn from English sources. A solution may be in sight: Chinese firm Huawei has created PanGu Alpha, a 750GB model that contains up to 200Bn parameters. Touted as the Chinese equivalent of GPT-3, it is trained on 1.1TB of Chinese news, websites, e-books, social media posts & encyclopedias.
Training GPT-3 Vs PanGu Alpha
Huge models with more than 10Bn parameters face 3 main training challenges:
Not all PLMs can be seamlessly scaled to hundreds of billions of parameters; they may diverge, or converge slowly, during training as the model size expands. To solve this, the researchers chose a transformer-based autoregressive language model as the base architecture for PanGu Alpha. In addition, a query layer was added on top of the Transformer layers, helping the model scale up to 200Bn parameters.
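The idea of the query layer is that a learned embedding for the next position acts as the attention query over the top Transformer layer's hidden states when predicting the next token. A hedged NumPy sketch of that single extra attention step (shapes, names & random weights are illustrative assumptions, not Huawei's code):

```python
import numpy as np

# Toy "query layer": one attention step whose query comes from a learned
# embedding of the *next* position, attending over the hidden states of
# the top Transformer layer to produce next-token logits.

rng = np.random.default_rng(0)
d_model, seq_len, vocab = 16, 5, 100

hidden = rng.normal(size=(seq_len, d_model))   # output of top Transformer layer
p_next = rng.normal(size=(d_model,))           # learned embedding of position seq_len + 1

Wq = rng.normal(size=(d_model, d_model))
Wk = rng.normal(size=(d_model, d_model))
Wv = rng.normal(size=(d_model, d_model))
W_out = rng.normal(size=(d_model, vocab))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

q = p_next @ Wq                                      # single query vector
scores = softmax((hidden @ Wk) @ q / np.sqrt(d_model))
context = scores @ (hidden @ Wv)                     # weighted sum of values
logits = context @ W_out                             # scores over the vocabulary
print(logits.shape)                                  # (100,)
```

In the full model this sits above the regular stack, so the rest of the network is an ordinary autoregressive Transformer.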
On the one hand, the amount of data must be enough to feed such a large PLM; on the other, the data must be of top-notch quality & diversity to ensure the model's generality. The Huawei team acquired data from encyclopedias, Common Crawl, news & e-books, then processed, filtered & cleaned it to ensure reliability & quality.
They removed documents containing less than 60 percent Chinese characters or fewer than 150 characters, as well as documents consisting only of titles, advertisements & navigation bars. The text was converted to simplified Chinese, & a list of over 700 spam, poor-quality & offensive words was used to filter out documents.
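A rough sketch of such document-level filters; the thresholds (60 percent Chinese characters, 150-character minimum, a word blocklist) follow the description above, but the function & heuristics are my own placeholders, not Huawei's pipeline:

```python
import re

# Match characters in the main CJK Unified Ideographs block.
CHINESE = re.compile(r'[\u4e00-\u9fff]')

def keep_document(text, spam_words=frozenset()):
    """Return True if a document passes the basic quality filters."""
    if len(text) < 150:                       # drop very short documents
        return False
    ratio = len(CHINESE.findall(text)) / len(text)
    if ratio < 0.6:                           # require >= 60% Chinese characters
        return False
    if any(w in text for w in spam_words):    # drop spam/offensive content
        return False
    return True

doc = "这是一个测试文档。" * 30               # long, mostly-Chinese document
print(keep_document(doc))                     # True
print(keep_document("too short"))             # False
```

In a real pipeline these checks would run over billions of crawled pages before deduplication & tokenization.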
The memory needs of PanGu Alpha, with 200Bn parameters, exceed the capacity of the latest Artificial Intelligence (AI) processors. The issue becomes even more complicated when the hardware topology is considered. For PanGu Alpha, the researchers combined five dimensions of parallelism & applied them to the model, which was trained on a cluster of 2048 Ascend 910 AI processors powered by CANN.
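One of those parallel dimensions is op-level model parallelism: a layer's weight matrix is split across devices, each device computes a partial result, & the shards are gathered back together. A toy NumPy illustration of the column-split case (this sketch is mine, not MindSpore's implementation, & real training also combines data, pipeline, optimizer & rematerialization parallelism):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 8))          # a micro-batch of activations
W = rng.normal(size=(8, 12))         # full weight, too big for one "device"

n_devices = 4
shards = np.split(W, n_devices, axis=1)    # each device holds 3 columns
partials = [x @ w for w in shards]         # computed independently per device
y = np.concatenate(partials, axis=1)       # "all-gather" of the shard outputs

assert np.allclose(y, x @ W)               # matches the unsharded computation
print(y.shape)                             # (4, 12)
```

The point of combining several such dimensions is that no single device ever has to hold the full 200Bn-parameter model or its optimizer state.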
For this research, the authors trained 3 models on a high-quality Chinese text corpus with increasing parameter sizes - 2.6Bn, 13Bn & 200Bn. The models were first evaluated on language modeling tasks, where perplexity decreased as model capacity & the amount of computation & data increased. The team also analyzed the models' text generation ability in scenarios such as Q&A, dialogue generation & summarization. The results showed that performance improves with increasing capacity.
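Perplexity is the exponential of the average negative log-likelihood the model assigns to held-out tokens, so lower is better. A minimal sketch, with probability values invented purely to show why a more confident model scores lower:

```python
import math

def perplexity(token_probs):
    """Perplexity from the model's probability for each observed token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

weak_model = [0.05, 0.10, 0.02, 0.08]     # low probability on the true tokens
strong_model = [0.40, 0.55, 0.30, 0.45]   # a larger model is more confident

print(perplexity(weak_model) > perplexity(strong_model))  # True
```

A uniform guess over two options gives a perplexity of exactly 2, which is a handy sanity check for the formula.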
Experts believe that the most vital aspect of PanGu Alpha is its availability in the Chinese language. In terms of architecture, however, the project doesn't seem to offer much that is new. The group has made no claims of solving major blockers, such as solving math problems correctly or answering questions without paraphrasing training data. Nor does it address the shortcomings of GPT-3, on which it is modeled.
The group is currently working on releasing the code through APIs to benefit both commercial firms & non-profit research institutes. Further, the group has open-sourced the parallel computing functionalities in MindSpore's Auto-parallel module. MindSpore is a deep learning training & inference framework that can be used on cloud, mobile & edge platforms.