Open Information Extraction (OIE) is the task of creating structured information from text in an unsupervised manner. OIE is often used as a starting point for a number of downstream tasks, including knowledge base construction, relation extraction, and question answering. While OIE methods aim to be domain independent, they have been evaluated primarily on newspaper, encyclopedic, or general web text. In this article, we evaluate the performance of OIE on scientific texts originating from 10 different disciplines. To do so, we use two state-of-the-art OIE systems and evaluate their output with a crowd-sourcing approach. We find that OIE systems perform significantly worse on scientific text than on encyclopedic text. We also provide an error analysis and suggest areas of work to reduce errors. Our corpus of sentences and judgments is made available.
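
At its core, the crowd-sourced evaluation reduces to aggregating binary judgments per extracted triple and reporting each system's precision. A minimal sketch of that aggregation, assuming hypothetical judgment records and a simple majority-vote rule (neither is the authors' exact protocol):

```python
from collections import defaultdict

def precision_by_system(judgments):
    """judgments: iterable of dicts such as
    {"system": "OIE-A", "triple": ("aspirin", "inhibits", "COX-2"), "correct": True}.
    Each (system, triple) pair may be judged by several workers; a triple counts
    as correct if a strict majority of its judgments say so."""
    votes = defaultdict(list)
    for j in judgments:
        votes[(j["system"], tuple(j["triple"]))].append(j["correct"])

    correct, total = defaultdict(int), defaultdict(int)
    for (system, _), vs in votes.items():
        total[system] += 1
        correct[system] += sum(vs) > len(vs) / 2  # majority vote per triple
    return {s: correct[s] / total[s] for s in total}
```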

Related content

Information Extraction (IE) is the task of turning the information contained in text into a structured, table-like form. An IE system takes raw text as input and outputs information points in a fixed format. Information points are extracted from documents of many kinds and then integrated in a unified form; this is the main task of IE. Integrating information in a unified form makes it easy to inspect and compare. IE does not attempt to fully understand an entire document; it only analyzes the portions of the document that contain relevant information, and what counts as relevant is determined by the domain scope fixed when the system is designed.
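
As a toy illustration of this input/output contract (the sentence, slot names, and regular expressions below are invented for the example, not a real IE system):

```python
import re

text = "Acme Corp. acquired Beta Labs for $2.5 million on 3 March 2020."

# One "information point" in a fixed, table-like format.
record = {
    "acquirer": re.search(r"^(.*?) acquired", text).group(1),
    "acquired": re.search(r"acquired (.*?) for", text).group(1),
    "price":    re.search(r"for (\$[\d.]+ \w+)", text).group(1),
    "date":     re.search(r"on (.+)\.$", text).group(1),
}
print(record)
# {'acquirer': 'Acme Corp.', 'acquired': 'Beta Labs',
#  'price': '$2.5 million', 'date': '3 March 2020'}
```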

Recent work pre-training Transformers with self-supervised objectives on large text corpora has shown great success when fine-tuned on downstream NLP tasks including text summarization. However, pre-training objectives tailored for abstractive text summarization have not been explored. Furthermore, there is a lack of systematic evaluation across diverse domains. In this work, we propose pre-training large Transformer-based encoder-decoder models on massive text corpora with a new self-supervised objective. In PEGASUS, important sentences are removed/masked from an input document and are generated together as one output sequence from the remaining sentences, similar to an extractive summary. We evaluated our best PEGASUS model on 12 downstream summarization tasks spanning news, science, stories, instructions, emails, patents, and legislative bills. Experiments demonstrate it achieves state-of-the-art performance on all 12 downstream datasets as measured by ROUGE scores. Our model also shows surprising performance on low-resource summarization, surpassing previous state-of-the-art results on 6 datasets with only 1000 examples. Finally, we validated our results using human evaluation and show that our model summaries achieve human performance on multiple datasets.
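
A rough sketch of the gap-sentence idea: score each sentence by its word overlap with the rest of the document (a crude stand-in for the ROUGE-based selection used in the paper), mask the top-scoring ones, and use them as the target sequence. The sentence splitting, overlap score, and mask token below are simplifications, not the released PEGASUS code:

```python
MASK = "<mask_1>"

def make_gap_sentence_example(sentences, gap_ratio=0.3):
    """Return (input_text, target_text) for one pre-training example."""
    def overlap(i):
        rest = set(w for j, s in enumerate(sentences) if j != i for w in s.lower().split())
        words = sentences[i].lower().split()
        return sum(w in rest for w in words) / max(len(words), 1)

    # Mask the sentences most "central" to the document, as a proxy for importance.
    k = max(1, int(len(sentences) * gap_ratio))
    masked = set(sorted(range(len(sentences)), key=overlap, reverse=True)[:k])

    inputs = " ".join(MASK if i in masked else s for i, s in enumerate(sentences))
    target = " ".join(sentences[i] for i in sorted(masked))
    return inputs, target
```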

We consider open domain event extraction, the task of extracting unconstrained types of events from news clusters. A novel latent variable neural model is constructed, which is scalable to very large corpora. A dataset is collected and manually annotated, with task-specific evaluation metrics being designed. Results show that the proposed unsupervised model gives better performance compared to the state-of-the-art method for event schema induction.

Our interest in this paper is in meeting a rapidly growing industrial demand for information extraction from images of documents such as invoices, bills, and receipts. In practice, users are able to provide a very small number of example images labeled with the information that needs to be extracted. We adopt a novel two-level neuro-deductive approach where (a) we use pre-trained deep neural networks to populate a relational database with facts about each document image; and (b) we use a form of deductive reasoning, related to meta-interpretive learning of transition systems, to learn extraction programs: given task-specific transitions defined using the entities and relations identified by the neural detectors and a small number of instances (usually 1, sometimes 2) of images and the desired outputs, a resource-bounded meta-interpreter constructs proofs for the instance(s) via logical deduction; a set of logic programs that extract each desired entity is easily synthesized from such proofs. In most cases a single training example together with a noisy clone of itself suffices to learn a program-set that generalizes well on test documents, at which time the value of each entity is determined by a majority vote across its program-set. We demonstrate our two-level neuro-deductive approach on publicly available datasets ("Patent" and "Doctor's Bills") and also describe its use in a real-life industrial problem.
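
Roughly, the neural stage can be pictured as emitting relational facts about each document image (word text, spatial relations, value types), over which simple rules are then proved. The fact schema and single rule below are invented for illustration and are far simpler than the meta-interpretive learning the paper describes:

```python
# Facts a neural detector might emit for one document image (illustrative only):
# word(id, text), right_of(id_a, id_b), is_amount(id)
facts = {
    "word": {1: "Total", 2: "$42.00", 3: "Date"},
    "right_of": {(2, 1), (3, 2)},
    "is_amount": {2},
}

def extract_total(facts):
    """Rule: total(X) <- word(L, "Total"), right_of(X, L), is_amount(X)."""
    for label_id, text in facts["word"].items():
        if text == "Total":
            for x in facts["is_amount"]:
                if (x, label_id) in facts["right_of"]:
                    return facts["word"][x]
    return None

print(extract_total(facts))  # $42.00
```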

In this paper, we propose a span-based model combined with syntactic information for n-ary open information extraction. The advantage of a span-based model is that it can leverage span-level features, which is difficult in token-based BIO tagging methods. We also improve the previous bootstrapping method for constructing the training corpus. Experiments show that our model outperforms previous open information extraction systems. Our code and data are publicly available at https://github.com/zhanjunlang/Span_OIE
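
The core move of a span-based model is to enumerate candidate spans up to some maximum width and score them with span-level features (here just width and boundary tokens; a real model would use learned representations). This is a generic sketch, not the released Span_OIE code:

```python
def enumerate_spans(tokens, max_width=8):
    """Yield (start, end, features) for every span up to max_width tokens."""
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_width, len(tokens)) + 1):
            span = tokens[start:end]
            features = {
                "width": end - start,   # span-level feature unavailable to per-token BIO taggers
                "first": span[0],
                "last": span[-1],
                "text": " ".join(span),
            }
            yield start, end, features

tokens = "the enzyme strongly inhibits tumor growth".split()
candidates = list(enumerate_spans(tokens, max_width=3))
```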

We present a system for rapidly customizing event extraction capability to find new event types and their arguments. The system allows a user to find, expand, and filter event triggers for a new event type by exploring an unannotated corpus. The system then automatically generates mention-level event annotations and trains a neural network model for finding the corresponding events. Additionally, the system uses the ACE corpus to train an argument model for extracting Actor, Place, and Time arguments for any event type, including ones not seen in its training data. Experiments show that with less than 10 minutes of human effort per event type, the system achieves good performance for 67 novel event types. The code, documentation, and a demonstration video will be released as open source on github.com.
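
The trigger-expansion step can be pictured as nearest-neighbour search around a few seed triggers in an embedding space; the tiny vectors and threshold below are placeholders, not the system's actual expansion procedure:

```python
import numpy as np

# Placeholder embeddings; a real system would load pre-trained word vectors.
emb = {
    "protest":       np.array([0.9, 0.1, 0.0]),
    "demonstration": np.array([0.8, 0.2, 0.1]),
    "march":         np.array([0.7, 0.3, 0.1]),
    "banana":        np.array([0.0, 0.1, 0.9]),
}

def expand_triggers(seeds, vocab, threshold=0.9):
    """Return candidate triggers whose cosine similarity to any seed exceeds the threshold."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return {w for w in vocab for s in seeds if w not in seeds and cos(emb[w], emb[s]) >= threshold}

print(expand_triggers({"protest"}, emb))  # e.g. {'demonstration', 'march'}
```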

Natural Language Inference (NLI) is fundamental to many Natural Language Processing (NLP) applications including semantic search and question answering. The NLI problem has gained significant attention thanks to the release of large scale, challenging datasets. Present approaches to the problem largely focus on learning-based methods that use only textual information in order to classify whether a given premise entails, contradicts, or is neutral with respect to a given hypothesis. Surprisingly, the use of methods based on structured knowledge -- a central topic in artificial intelligence -- has not received much attention vis-a-vis the NLI problem. While there are many open knowledge bases that contain various types of reasoning information, their use for NLI has not been well explored. To address this, we present a combination of techniques that harness knowledge graphs to improve performance on the NLI problem in the science questions domain. We present the results of applying our techniques on text, graph, and text-to-graph based models, and discuss implications for the use of external knowledge in solving the NLI problem. Our model achieves the new state-of-the-art performance on the NLI problem over the SciTail science questions dataset.
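
One common way to bring structured knowledge into NLI, in the spirit of this line of work though not the paper's exact model, is to link concepts mentioned in the premise and hypothesis to a knowledge graph and feed connecting paths to the classifier as extra evidence. A toy version with a hand-built graph (a real system would query ConceptNet or a similar KB):

```python
import networkx as nx

# Toy knowledge graph.
kg = nx.Graph()
kg.add_edges_from([
    ("mammal", "animal"), ("dog", "mammal"),
    ("dog", "pet"), ("photosynthesis", "plant"),
])

def connecting_paths(premise_concepts, hypothesis_concepts, graph):
    """Shortest paths linking premise concepts to hypothesis concepts."""
    paths = []
    for p in premise_concepts:
        for h in hypothesis_concepts:
            if p in graph and h in graph and nx.has_path(graph, p, h):
                paths.append(nx.shortest_path(graph, p, h))
    return paths

print(connecting_paths({"dog"}, {"animal"}, kg))  # [['dog', 'mammal', 'animal']]
```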

We introduce a multi-task setup of identifying and classifying entities, relations, and coreference clusters in scientific articles. We create SciERC, a dataset that includes annotations for all three tasks, and develop a unified framework called Scientific Information Extractor (SciIE) with shared span representations. The multi-task setup reduces cascading errors between tasks and leverages cross-sentence relations through coreference links. Experiments show that our multi-task model outperforms previous models in scientific information extraction without using any domain-specific features. We further show that the framework supports construction of a scientific knowledge graph, which we use to analyze information in scientific literature.
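
The "shared span representations" idea is that one encoder produces a vector per candidate span, and separate heads for entity typing, relation classification, and coreference all consume those same vectors. A heavily simplified PyTorch sketch, with illustrative dimensions and head structure rather than the SciIE architecture:

```python
import torch
import torch.nn as nn

class SharedSpanModel(nn.Module):
    def __init__(self, span_dim=256, n_entity_types=7, n_relation_types=8):
        super().__init__()
        # Three task heads share the same span representations.
        self.entity_head = nn.Linear(span_dim, n_entity_types)
        self.relation_head = nn.Linear(2 * span_dim, n_relation_types)
        self.coref_head = nn.Linear(2 * span_dim, 1)

    def forward(self, span_reprs, pair_index):
        # span_reprs: (num_spans, span_dim); pair_index: (num_pairs, 2)
        pairs = torch.cat([span_reprs[pair_index[:, 0]],
                           span_reprs[pair_index[:, 1]]], dim=-1)
        return {
            "entity_logits": self.entity_head(span_reprs),
            "relation_logits": self.relation_head(pairs),
            "coref_scores": self.coref_head(pairs).squeeze(-1),
        }

model = SharedSpanModel()
spans = torch.randn(5, 256)                 # stand-in for encoder output
pairs = torch.tensor([[0, 1], [2, 3]])
out = model(spans, pairs)
```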

Information Extraction (IE) refers to automatically extracting structured relation tuples from unstructured texts. Common IE solutions, including Relation Extraction (RE) and open IE systems, can hardly handle cross-sentence tuples, and are severely restricted by limited relation types as well as informal relation specifications (e.g., free-text based relation tuples). In order to overcome these weaknesses, we propose a novel IE framework named QA4IE, which leverages the flexible question answering (QA) approaches to produce high quality relation triples across sentences. Based on the framework, we develop a large IE benchmark with high quality human evaluation. This benchmark contains 293K documents, 2M golden relation triples, and 636 relation types. We compare our system with some IE baselines on our benchmark and the results show that our system achieves great improvements.
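
The general recipe of QA-driven IE is to phrase each (subject, relation) pair as a question and read the object off the QA model's answer span. The sketch below uses an off-the-shelf extractive QA pipeline standing in for QA4IE's own architecture, and the question template and relation name are made up:

```python
from transformers import pipeline

qa = pipeline("question-answering")  # downloads a default extractive QA model

document = ("Marie Curie was born in Warsaw. She moved to Paris in 1891, "
            "where she later won two Nobel Prizes.")

def extract_triple(subject, relation, question_template, context):
    result = qa(question=question_template.format(subject=subject), context=context)
    # The answer span becomes the object of the relation triple.
    return (subject, relation, result["answer"], result["score"])

print(extract_triple("Marie Curie", "born_in", "Where was {subject} born?", document))
```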

In this paper, we propose an improved quantitative evaluation framework for Generative Adversarial Networks (GANs) on generating domain-specific images, where we improve conventional evaluation methods on two levels: the feature representation and the evaluation metric. Unlike most existing evaluation frameworks which transfer the representation of the ImageNet Inception model to map images onto the feature space, our framework uses a specialized encoder to acquire fine-grained domain-specific representation. Moreover, for datasets with multiple classes, we propose Class-Aware Frechet Distance (CAFD), which employs a Gaussian mixture model on the feature space to better fit the multi-manifold feature distribution. Experiments and analysis on both the feature level and the image level were conducted to demonstrate improvements of our proposed framework over the recently proposed state-of-the-art FID method. To the best of our knowledge, we are the first to provide counterexamples where FID gives inconsistent results with human judgments. It is shown in the experiments that our framework is able to overcome the shortcomings of FID and improves robustness. Code will be made available.
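
For reference, FID computes the Frechet distance between Gaussians fitted to real and generated Inception features, FD = ||mu_r - mu_g||^2 + Tr(Sigma_r + Sigma_g - 2*(Sigma_r*Sigma_g)^(1/2)). CAFD instead works on a domain-specific encoder's features and, per the abstract, fits a Gaussian mixture; the sketch below simplifies that to a per-class average of Frechet distances, which captures the spirit but not the exact definition, and is not the authors' code:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 (sigma1 sigma2)^(1/2))."""
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):  # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(sigma1 + sigma2 - 2 * covmean))

def class_aware_frechet_distance(real_by_class, fake_by_class):
    """Average the Frechet distance over matching classes of encoder features."""
    dists = []
    for cls in real_by_class:
        r, f = real_by_class[cls], fake_by_class[cls]
        dists.append(frechet_distance(r.mean(0), np.cov(r, rowvar=False),
                                      f.mean(0), np.cov(f, rowvar=False)))
    return float(np.mean(dists))
```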

The task of event extraction has long been investigated in a supervised learning paradigm, which is bound by the number and quality of the training instances. Existing training data must be manually generated through a combination of expert domain knowledge and extensive human involvement. However, due to the substantial effort required to annotate text, the resulting datasets are usually small, which severely affects the quality of the learned model and makes it hard to generalize. Our work develops an automatic approach for generating training data for event extraction. Our approach allows us to scale up event extraction training instances from thousands to hundreds of thousands, and it does this at a much lower cost than a manual approach. We achieve this by employing distant supervision to automatically create event annotations from unlabelled text using existing structured knowledge bases or tables. We then develop a neural network model with post inference to transfer the knowledge extracted from structured knowledge bases to automatically annotate typed events with corresponding arguments in text. We evaluate our approach by using the knowledge extracted from Freebase to label texts from Wikipedia articles. Experimental results show that our approach can generate a large number of high quality training instances. We show that this large volume of training data not only leads to a better event extractor, but also allows us to detect multiple typed events.
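
The distant-supervision step amounts to: take a structured fact (event type plus argument values), find sentences that mention enough of those arguments, and label them automatically. The fact format, matching rule, and threshold below are illustrative simplifications of the paper's pipeline:

```python
def distantly_label(sentences, facts, min_matched_args=2):
    """facts: list of dicts like
    {"event_type": "Acquisition", "args": {"buyer": "Google", "company": "YouTube", "year": "2006"}}"""
    labeled = []
    for sent in sentences:
        for fact in facts:
            matched = {role: val for role, val in fact["args"].items() if val in sent}
            if len(matched) >= min_matched_args:  # enough arguments co-occur in the sentence
                labeled.append({"sentence": sent,
                                "event_type": fact["event_type"],
                                "arguments": matched})
    return labeled

sentences = ["Google bought YouTube in 2006 for $1.65 billion.",
             "YouTube launched a new feature."]
facts = [{"event_type": "Acquisition",
          "args": {"buyer": "Google", "company": "YouTube", "year": "2006"}}]
print(distantly_label(sentences, facts))
```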

Related Papers

Jingqing Zhang, Yao Zhao, Mohammad Saleh, Peter J. Liu · Jun 2

Open Domain Event Extraction Using Neural Latent Variable Models
Xiao Liu, Heyan Huang, Yue Zhang · Jun 17, 2019

One-shot Information Extraction from Document Images using Neuro-Deductive Program Synthesis
Vishal Sunder, Ashwin Srinivasan, Lovekesh Vig, Gautam Shroff, Rohit Rahul · Jun 6, 2019

Junlang Zhan, Hai Zhao · Mar 1, 2019

Rapid Customization for Event Extraction
Yee Seng Chan, Joshua Fasching, Haoling Qiu, Bonan Min · Sep 20, 2018

Improving Natural Language Inference Using External Knowledge in the Science Questions Domain
Xiaoyan Wang, Pavan Kapanipathi, Ryan Musa, Mo Yu, Kartik Talamadupula, Ibrahim Abdelaziz, Maria Chang, Achille Fokoue, Bassem Makni, Nicholas Mattei, Michael Witbrock · Sep 15, 2018

Multi-Task Identification of Entities, Relations, and Coreference for Scientific Knowledge Graph Construction
Yi Luan, Luheng He, Mari Ostendorf, Hannaneh Hajishirzi · Aug 29, 2018

Lin Qiu, Hao Zhou, Yanru Qu, Weinan Zhang, Suoheng Li, Shu Rong, Dongyu Ru, Lihua Qian, Kewei Tu, Yong Yu · Apr 10, 2018

Shaohui Liu, Yi Wei, Jiwen Lu, Jie Zhou · Mar 27, 2018

Ying Zeng, Yansong Feng, Rong Ma, Zheng Wang, Rui Yan, Chongde Shi, Dongyan Zhao · Dec 11, 2017