LC-QuAD 2.0

Smart Data Analytics

LC-QuAD 2.0: A large dataset for complex question answering over Wikidata and DBpedia

Providing machines with the capability of exploring knowledge graphs and answering user questions has been an active area of research in the last decade. Question answering over knowledge graphs by translating natural language questions into formal queries has been one of the key approaches. To advance the research area, several datasets such as WebQuestions, QALD and LC-QuAD have been published in the past. The largest dataset available for complex questions over a knowledge graph (LC-QuAD) contains five thousand questions. We now provide LC-QuAD 2.0 (Large-Scale Complex Question Answering Dataset) with 30,000 questions, their paraphrases and their corresponding SPARQL queries. LC-QuAD 2.0 is compatible with both the Wikidata and DBpedia 2018 knowledge graphs.

Dataset Generation Workflow

The core of the methodology is to generate SPARQL queries from SPARQL templates, selected entities and suitable predicates. The SPARQL queries are then transformed into template questions QT, which act as an intermediate stage between natural language and formal language. A large crowdsourcing experiment on Amazon Mechanical Turk (AMT) is then conducted in which the QT are verbalised into natural language questions (the verbalised questions QV), which are later paraphrased into the paraphrased questions QP.

[Figures: LC-QuAD 2.0 Workflow, lcquad_kg]
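The first two stages of the workflow (filling a SPARQL template with an entity and a predicate, then deriving a template question QT) can be illustrated with a minimal sketch. The template strings, the function name instantiate and the wording of the template question below are assumptions made for illustration; they are not the templates used to build LC-QuAD 2.0. The example uses the Wikidata entity Q42 (Douglas Adams) and the property P19 (place of birth).

```python
# Hypothetical one-triple template; the actual LC-QuAD 2.0 templates are not reproduced here.
SPARQL_TEMPLATE = "SELECT ?answer WHERE {{ wd:{entity} wdt:{predicate} ?answer }}"

# Hypothetical wording for the intermediate template question (QT).
QUESTION_TEMPLATE = "What is the {predicate_label} of {entity_label}?"


def instantiate(entity_id, entity_label, predicate_id, predicate_label):
    """Fill the SPARQL template and derive the matching template question."""
    sparql = SPARQL_TEMPLATE.format(entity=entity_id, predicate=predicate_id)
    template_question = QUESTION_TEMPLATE.format(
        predicate_label=predicate_label, entity_label=entity_label
    )
    # The keys mirror two of the dataset's JSON fields.
    return {"sparql_wikidata": sparql, "NNQT_question": template_question}


example = instantiate("Q42", "Douglas Adams", "P19", "place of birth")
print(example["sparql_wikidata"])
print(example["NNQT_question"])
```

In the workflow described above, the template question would then be handed to crowd workers for verbalisation (QV) and paraphrasing (QP); that step is manual and is not modelled here.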

JSON Structure

Each entry in the generated dataset has the following JSON structure:

     "uid": a unique id number,
     "sparql_wikidata": a SPARQL query for the Wikidata endpoint,
     "sparql_dbpedia18": a SPARQL query for the DBpedia endpoint that contains Wikidata information,
     "NNQT_question": system-generated question,
     "question": verbalised question,
     "paraphrased_question": paraphrased version of the verbalised question,
     "template_id": id of the template,
     "template": template description
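As a rough illustration of how an entry with these fields might be read, the sketch below parses one record and checks that all eight fields are present. The sample record is made up for the example and is not taken from the dataset; the helper name missing_fields is likewise hypothetical, and the value types (for instance whether "uid" is a number or a string) are assumptions.

```python
import json

# The eight field names listed in the JSON structure above.
EXPECTED_FIELDS = (
    "uid",
    "sparql_wikidata",
    "sparql_dbpedia18",
    "NNQT_question",
    "question",
    "paraphrased_question",
    "template_id",
    "template",
)

# Made-up record for illustration only; values are placeholders, not dataset content.
SAMPLE_JSON = """
{
  "uid": 1,
  "sparql_wikidata": "SELECT ?answer WHERE { wd:Q42 wdt:P19 ?answer }",
  "sparql_dbpedia18": "placeholder DBpedia query",
  "NNQT_question": "What is the place of birth of Douglas Adams?",
  "question": "Where was Douglas Adams born?",
  "paraphrased_question": "In which place was Douglas Adams born?",
  "template_id": 1,
  "template": "placeholder template description"
}
"""


def missing_fields(record):
    """Return the expected field names that are absent from a record."""
    return [name for name in EXPECTED_FIELDS if name not in record]


record = json.loads(SAMPLE_JSON)
print(missing_fields(record))
print(record["question"], "|", record["paraphrased_question"])
```

A check like this is useful before training or evaluation, since an entry lacking a verbalised or paraphrased question would otherwise fail later in the pipeline.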

Project Team