We introduce GLM-130B, a bilingual (English and Chinese) pre-trained language model with 130 billion parameters. It is an attempt to open-source a 100B-scale model at least as good as GPT-3 and unveil how models of such a scale can be successfully pre-trained. Over the course of this effort, we face numerous unexpected technical and engineering challenges, particularly on loss spikes and disconvergence. In this paper, we introduce the training process of GLM-130B including its design choices, training strategies for both efficiency and stability, and engineering efforts. The resultant GLM-130B model offers significant outperformance over GPT-3 175B on a wide range of popular English benchmarks while the performance advantage is not observed in OPT-175B and BLOOM-176B. It also consistently and significantly outperforms ERNIE TITAN 3.0 260B – the largest Chinese language model – across related benchmarks. Finally, we leverage a unique scaling property of GLM-130B to reach INT4 quantization, without quantization aware training and with almost no performance loss, making it the first among 100B-scale models. More importantly, the property allows its effective inference on 4×RTX 3090 (24G) or 8×RTX 2080 Ti (11G) GPUs, the most ever affordable GPUs required for using 100B-scale models. The GLM-130B model weights are publicly accessible and its code, training logs, related toolkit, and lessons learned are open-sourced at https://github.com/THUDM/GLM-130B
Prompt tuning, which only tunes continuous prompts with a frozen language model, substantially reduces per-task storage and memory usage at training. However, in the context of NLU, prior work reveals that prompt tuning does not perform well for normal-sized pretrained models. We also find that existing methods of prompt tuning cannot handle hard sequence labeling tasks, indicating a lack of universality. We present a novel empirical finding that properly optimized prompt tuning can be universally effective across a wide range of model scales and NLU tasks. It matches the performance of finetuning while having only 0.1%-3% tuned parameters. Our method P-Tuning v2 is an implementation of Deep Prompt Tuning (CITATION) optimized and adapted for NLU. Given the universality and simplicity of P-Tuning v2, we believe it can serve as an alternative to finetuning and a strong baseline for future research.
There have been various types of pretraining architectures including autoencoding models (e.g., BERT), autoregressive models (e.g., GPT), and encoder-decoder models (e.g., T5). On the other hand, NLP tasks differ in nature, with three main categories being natural language understanding (NLU), unconditional generation, and conditional generation, while none of the pretraining frameworks performs the best for all tasks. We propose a General Language Model (GLM) based on autoregressive blank infilling to address this challenge. The proposed architecture has two major benefits: (1) It improves pretrain-finetune consistency via cloze-style finetuning and naturally handles variable-length blank infilling which is crucial for many downstream tasks. Empirically, GLM substantially outperforms BERT on the SuperGLUE natural language understanding benchmark with the same amount of pretraining data and steps. (2) It is flexible enough to handle various NLP tasks with a single pretrained model. GLM with 1.25x parameters of BERT-Large achieves the best performance in NLU, conditional and unconditional generation at the same time, demonstrating its generalizability to different downstream tasks.
Inferring new facts from an existing knowledge graph with explainable reasoning processes is an important problem, known as knowledge graph (KG) reasoning. The problem is often formulated as finding the specific path that represents the query relation and connects the query entity and the correct answer. However, due to the limited expressiveness of individual paths, the majority of previous works failed to capture the complex subgraph structure in the graph. We propose CogKR that traverses the knowledge graph to conduct multi-hop reasoning. More specifically, motivated by the dual process theory from cognitive science, our framework is composed of an extension module and a reasoning module. By setting up a cognitive graph through iteratively coordinating the two modules, CogKR can cope with more complex reasoning scenarios in the form of subgraphs instead of individual paths. Experiments on three knowledge graph reasoning benchmarks demonstrate that CogKR achieves significant improvements in accuracy compared with previous methods while providing the explainable capacity. Moreover, we evaluate CogKR on the challenging one-shot link prediction task, exhibiting the superiority of the framework on accuracy and scalability compared to the state-of-the-art approaches.
While implicit feedback (e.g., clicks, dwell times, etc.) is an abundant and attractive source of data for learning to rank, it can produce unfair ranking policies for both exogenous and endogenous reasons. Exogenous reasons typically manifest themselves as biases in the training data, which then get reflected in the learned ranking policy and often lead to rich-get-richer dynamics. Moreover, even after the correction of such biases, reasons endogenous to the design of the learning algorithm can still lead to ranking policies that do not allocate exposure among items in a fair way. To address both exogenous and endogenous sources of unfairness, we present the first learning-to-rank approach that addresses both presentation bias and merit-based fairness of exposure simultaneously. Specifically, we define a class of amortized fairness-of-exposure constraints that can be chosen based on the needs of an application, and we show how these fairness criteria can be enforced despite the selection biases in implicit feedback data. The key result is an efficient and flexible policy-gradient algorithm, called FULTR, which is the first to enable the use of counterfactual estimators for both utility estimation and fairness constraints. Beyond the theoretical justification of the framework, we show empirically that the proposed algorithm can learn accurate and fair ranking policies from biased and noisy feedback.
While GPTs with traditional ﬁne-tuning fail to achieve strong results on natural language understanding (NLU), we show that GPTs can be better than or comparable to similar-sized BERTs on NLU tasks with a novel method P-tuningwhich employs trainable continuous prompt embeddings. On the knowledge probing (LAMA) benchmark, the best GPT recovers 64% (P@1) of world knowledge without any additional text provided during test time, which substantially improves the previous best by 20+ percentage points. On the SuperGlue benchmark, GPTs achieve comparable and sometimes better performance to similar-sized BERTs in supervised learning. Importantly, we ﬁnd that P-tuning also improves BERTs’ performance in both few-shot and supervised settings while largely reducing the need for prompt engineering. Consequently, Ptuning outperforms the state-of-the-art approaches on the few-shot SuperGlue benchmark.
We study the problem of personalized article recommendation, in particular when the user’s preference data is missing or limited, which is knowns as the user cold-start issue in recommender systems. We propose POLAR++, an active recommendation framework that utilizes Bayesian neural networks to capture the uncertainty of user preference, actively selects articles to query the user for feedback, and adaptively learns user preference with one-shot learning. For the article recommendation, we design an attention-based CNN to quantify the similarity between user preference and recommended articles, which signiﬁcantly improves the performance with only a few articles rated by the users. We evaluate the proposed POLAR++ on datasets of different scale and sources. Experimental results demonstrate the effectiveness of the proposed model. We have successfully deployed POLAR++ into AMiner as the recommendation engine for article recommendation, which further conﬁrms the effectiveness of the proposed model.
Cold-start problems are long-standing challenges for practical recommendations. Most existing recommendation algorithms rely on extensive observed data and are brittle to recommendation scenarios with few interactions. This paper addresses such problems using few-shot learning and meta learning. Our approach is based on the insight that having a good generalization from a few examples relies on both a generic model initialization and an effective strategy for adapting this model to newly arising tasks. To accomplish this, we combine the scenario-specific learning with a model-agnostic sequential meta-learning and unify them into an integrated end-toend framework, namely Scenario-specific Sequential Meta learner (or $s^2$ Meta ). By doing so, our meta-learner produces a generic initial model through aggregating contextual information from a variety of prediction tasks while effectively adapting to specific tasks by leveraging learning-to-learn knowledge. Extensive experiments on various real-world datasets demonstrate that our proposed model can achieve significant gains over the state-of-the-arts for cold-start problems in online recommendation. Deployment is at the Guess You Like session, the front page of the Mobile Taobao.
In this paper, we propose POLAR, an attention-based CNN combined with one-shot learning for personalized article recommendation. Given a query, POLAR uses an attention-based CNN to estimate the relevance score between the query and related articles. The attention mechanism can help significantly improve the relevance estimation. For example, on AMiner, this can help achieve a +5.0% improvement in terms of NDCG@3. One more challenge in personalized article recommendation is how to collect statistically sufficient training data for a recommendation model. POLAR combines a one-shot learning function into the recommendation model, which further gains significant improvements. For example, on AMiner, with only 1.6 feedbacks on average, POLAR achieves 2.7% improvement by NDCG@3. We evaluate the proposed POLAR on three different datasets: AMiner, Patent, and RARD. Experimental results demonstrate the effectiveness of the proposed model. Recently, we have successfully deployed POLAR into AMiner as the recommendation engine for article recommendation, which further confirms the effectiveness of the proposed model.