A Data Bootstrapping Recipe for Low-Resource Multilingual Relation Classification
Journal
CoNLL 2021 - 25th Conference on Computational Natural Language Learning, Proceedings
Date Issued
2021-01-01
Author(s)
Nag, Arijit
Samanta, Bidisha
Mukherjee, Animesh
Ganguly, Niloy
Chakrabarti, Soumen
Abstract
Relation classification (sometimes called ‘extraction’) requires trustworthy datasets for fine-tuning large language models, as well as for evaluation. Data collection is challenging for Indian languages because they are syntactically and morphologically diverse, as well as different from resource-rich languages like English. Despite recent interest in deep generative models for Indian languages, relation classification is still not well-served by public datasets. In response, we present IndoRE, a dataset with 21K entity- and relation-tagged gold sentences in three Indian languages, plus English. We start with a multilingual BERT (mBERT) based system that captures entity span positions and type information, and provides competitive monolingual relation classification. Using this system, we explore and compare transfer mechanisms between languages. In particular, we study the accuracy-efficiency tradeoff between expensive gold instances vs. translated and aligned ‘silver’ instances. We release the dataset for future research.