VANiLLa: Verbalized Answers in Natural Language at Large scale
A large-scale dataset for verbalizing answers to simple natural language questions.
Question Answering (QA) has been an active field of research in recent years, with significant developments in Question Answering over Knowledge Graphs (KGQA). Despite these advances, current KGQA datasets provide answers only as resources or literals rather than as full sentences, so template-based verbalizations are usually employed to represent answers in natural language. This limitation stems from the scarcity of datasets for verbalizing KGQA responses. We therefore provide the VANiLLa dataset, which aims to reduce this gap. The VANiLLa dataset consists of over 100k simple questions adapted from the CSQA and SimpleQuestionsWikidata datasets, along with their answers as natural language sentences. In this paper, we describe the dataset creation process and the dataset characteristics. We also present multiple baseline models adapted from current state-of-the-art Natural Language Generation (NLG) architectures. We believe that this dataset will allow researchers to focus on finding suitable methodologies and architectures for answer verbalization.