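A distributed representation encodes language as dense, low-dimensional, continuous vectors. Researchers found early on that learned word embeddings exhibit analogy relations, for example apple − apples ≈ car − cars and man − woman ≈ king − queen. Such methods can be trained directly on large-scale unlabeled corpora. The quality of word embeddings also depends heavily on the choice of context window size: larger context windows generally yield embeddings that reflect topical information, whereas smaller context windows yield embeddings that reflect a word's function and its contextual semantics.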
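The analogy relation and the role of the context window can be illustrated with a minimal sketch. It assumes hand-picked three-dimensional toy vectors rather than trained embeddings, and the names `EMBEDDINGS`, `analogy`, and `context_pairs` are hypothetical helpers introduced only for this illustration.

```python
import math

# Hypothetical toy embeddings, hand-picked for illustration only.
# Real embeddings are learned from a corpus and have far more dimensions.
EMBEDDINGS = {
    "king":   [0.9, 0.8, 0.1],
    "queen":  [0.9, 0.2, 0.1],
    "man":    [0.1, 0.8, 0.1],
    "woman":  [0.1, 0.2, 0.1],
    "apple":  [0.0, 0.5, 0.1],
    "apples": [0.0, 0.5, 0.9],
    "car":    [0.3, 0.5, 0.1],
    "cars":   [0.3, 0.5, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c):
    """Answer 'a is to b as c is to ?' using the offset b - a + c."""
    target = [
        vb - va + vc
        for va, vb, vc in zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c])
    ]
    # Exclude the three query words, then pick the closest remaining word.
    candidates = [w for w in EMBEDDINGS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, EMBEDDINGS[w]))


def context_pairs(tokens, window):
    """List (center, context) pairs within `window` positions of each token."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


print(analogy("man", "woman", "king"))    # queen
print(analogy("apple", "apples", "car"))  # cars

sentence = ["the", "quick", "brown", "fox"]
print(len(context_pairs(sentence, 1)))  # 6 pairs with a narrow window
print(len(context_pairs(sentence, 2)))  # 10 pairs with a wider window
```

A wider window pairs each word with more distant neighbors, which is one plausible reason larger windows pick up topical co-occurrence while smaller windows stay closer to a word's immediate grammatical role.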


Title

Pre-trained Models for Natural Language Processing: A Survey

Keywords

Pre-trained language models, deep learning, natural language processing, BERT, Transformer, artificial intelligence

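Abstract

Recently, the emergence of pre-trained models (PTMs) has brought natural language processing (NLP) into a new era. In this survey, we provide a comprehensive overview of PTMs for NLP. We first briefly introduce language representation learning and its research progress. We then systematically categorize existing PTMs based on a taxonomy from four perspectives. Next, we describe how to adapt the knowledge of PTMs to downstream tasks. Finally, we outline some potential directions for future research on PTMs. This survey is intended as a practical guide for understanding, using, and developing PTMs for various NLP tasks.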

Authors

Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, Xuanjing Huang

Translator

Fan Zhiguang (範誌廣), Zhuanzhi (專知) member


Latest content

Word2vec is one of the most widely used algorithms for generating word embeddings because of its good mix of efficiency, quality of the generated representations, and cognitive grounding. However, word meaning is not static and depends on the context in which words are used. Differences in word meaning that depend on time, location, topic, and other factors can be studied by analyzing embeddings generated from different corpora in collections that are representative of these factors. For example, language evolution can be studied using a collection of news articles published in different time periods. In this paper, we present a general framework to support cross-corpora language studies with word embeddings, where embeddings generated from different corpora can be compared to find correspondences and differences in meaning across the corpora. CADE is the core component of our framework and solves the key problem of aligning the embeddings generated from different corpora. In particular, we focus on providing solid evidence about the effectiveness, generality, and robustness of CADE. To this end, we conduct quantitative and qualitative experiments in different domains, from temporal word embeddings to language localization and topical analysis. The results of our experiments suggest that CADE achieves state-of-the-art or superior performance on tasks where several competing approaches are available, while providing a general method that can be used in a variety of domains. Finally, our experiments shed light on the conditions under which the alignment is reliable, which depend substantially on the degree of cross-corpora vocabulary overlap.
