Luís Garcia received the prize for best PhD thesis

We are very pleased to announce that Dr. Luís Paulo Faina Garcia, a researcher from SDA, received the prize for the best PhD thesis in Computer Science in 2016 from the Brazilian Government Council. His thesis, “Noise Detection in Classification Problems”, was written under the supervision of Prof. Dr. André de Carvalho from the University of São Paulo. In 2017, his thesis was also selected among the best theses by the Brazilian Computer Science Society.

The main contributions of his work improved the accuracy of a Machine Learning system, based on noise detection, that predicts non-native species in protected areas of the Brazilian state of São Paulo. The work resulted in several publications in good conferences and high-quality journals.

 

Short Abstract: Large volumes of data have been produced in many application domains. Nonetheless, when data quality is low, the performance of Machine Learning techniques is harmed. Real data are frequently affected by the presence of noise, which, when used in the training of Machine Learning techniques for predictive tasks, can result in complex models, with high induction time and low predictive performance. Identification and removal of noise can improve data quality and, as a result, the induced model. This thesis proposes new techniques for noise detection and the development of a recommendation system based on meta-learning to recommend the most suitable filter for new tasks. Experiments using artificial and real datasets show the relevance of this research.
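To make the idea of a noise filter concrete, here is a minimal sketch of one classic approach: flag every instance whose label disagrees with the majority label of its nearest neighbours. This is only an illustration and not necessarily one of the techniques proposed in the thesis. The function name `knn_noise_filter`, the toy data and the choice of k are assumptions made for the example.

```python
import math
from collections import Counter


def knn_noise_filter(points, labels, k=3):
    """Return indices of instances whose label disagrees with the majority
    label of their k nearest neighbours (Euclidean distance)."""
    suspicious = []
    for i, p in enumerate(points):
        # Distances from instance i to every other instance, closest first.
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        neighbour_labels = [labels[j] for _, j in dists[:k]]
        majority, _ = Counter(neighbour_labels).most_common(1)[0]
        if majority != labels[i]:
            suspicious.append(i)
    return suspicious


# Toy data (hypothetical): two well-separated clusters, where the point at
# index 3 sits in the first cluster but carries the label of the second.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.15, 0.15),
          (5.0, 5.0), (5.1, 5.2), (5.2, 5.1), (5.15, 5.15)]
labels = ["a", "a", "a", "b", "b", "b", "b", "b"]

noisy = knn_noise_filter(points, labels, k=3)
```

Instances returned by such a filter can be removed or relabelled before training. Different filters suit different datasets, and choosing among them is the problem the meta-learning recommendation system in the abstract addresses.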

 

Prof. Manolis Koubarakis visits SDA

Prof. Manolis Koubarakis, from the Department of Informatics and Telecommunications at the National and Kapodistrian University of Athens, visited the SDA group on the 21st of September 2017.
Together with his research group, Management of Data, Information & Knowledge (MADgIK), Manolis Koubarakis has been working for the last 7 years on managing geospatial data and has contributed to various research projects and applications in this domain. Examples of successful projects include LEO: Linked Earth Observation Data and MELODIES: Maximizing the Exploitation of Linked Open Data in Enterprise and Science. Applications from these projects that are widely used by the research community include Strabon (a spatiotemporal RDF store) and SEXTANT (a web-based platform for visualizing, exploring and interacting with time-evolving linked geospatial data).

The goal of his visit was to exchange experience and ideas on data management techniques, specifically for geospatial data. Apart from presenting various use cases where geospatial tools have helped scientists to gain useful insights from scientific data, Prof. Koubarakis shared with our group future research problems and challenges related to this research area. From our side, SDA researchers presented their work on managing Big Data (query processing, analytics, benchmarking, etc.), as well as related tools, like SANSA – Semantic Analytics Stack and Ontario – Semantic Data Lake.

 
SDA and MADgIK have been working together for a few years in the context of the EU H2020 projects Big Data Europe and WDAqua, and hope to strengthen this collaboration in new projects and joint research activities. An important outcome of this meeting was the plan to organize a joint workshop on managing scientific geospatial data in the near future.

SDA at TPDL2017 & an Honorary Mention Award

TPDL 2017: The 21st edition of the International Conference on Theory and Practice of Digital Libraries took place in Thessaloniki, Greece, from September 18 to 21, 2017.

The SDA group had four scientific papers accepted and presented.

We are also happy to have won the Honorary Mention Award for the long paper entitled ‘Exploiting Interlinked Research Metadata’, presented by Sahar Vahdati.

Paper abstract: OpenAIRE, the Open Access Infrastructure for Research in Europe, aggregates metadata about research (projects, publications, people, organizations, etc.) into a central Information Space. OpenAIRE aims at increasing interoperability and reusability of this data collection by exposing it as Linked Open Data (LOD). By following the LOD principles, it is now possible to further increase interoperability and reusability by connecting the OpenAIRE LOD to other datasets about projects, publications, people, and organizations. Doing so required us to identify link discovery tools that perform well, as well as candidate datasets that provide comprehensive scholarly communication metadata, and then to specify linking rules. We demonstrate the added value that interlinking provides for end users by implementing visual frontends for looking up publications to cite, and publication statistics, and evaluating their usability on top of interlinked vs. non-interlinked data.

This year at TPDL 2017, three very interesting keynote speeches were given: Paul Groth on “Machines are people too”, Elton Barker on “Back to the future: annotating, collaborating and linking in a digital ecosystem”, and Dimitrios Tzovaras on “Visualization in the big data era: data mining from networked information”.

Thanks to all the organizers of TPDL 2017, especially the general chairs:

  • Yannis Manolopoulos, Aristotle University of Thessaloniki, Greece
  • Lazaros Iliadis, Democritus University of Thrace, Greece

Papers accepted at K-CAP 2017

We are very pleased to announce that our group got 4 papers accepted for presentation at K-CAP 2017, which will be held on December 4th-6th, 2017, in Austin, Texas, United States.
The Ninth International Conference on Knowledge Capture attracts researchers from diverse areas of Artificial Intelligence, including knowledge representation, knowledge acquisition, intelligent user interfaces, problem-solving and reasoning, planning, agents, text extraction, and machine learning, information enrichment and visualization, as well as researchers interested in cyber-infrastructures to foster the publication, retrieval, reuse, and integration of data.

Here is the list of the accepted papers with their abstracts:

“Capturing Knowledge in Semantically-typed Relational Patterns to Enhance Relation Linking” by Kuldeep Singh, Isaiah Onando Mulang, Ioanna Lytra, Mohamad Yaser Jaradeh, Ahmad Sakor, Maria-Esther Vidal, Christoph Lange and Sören Auer.

Abstract: Transforming natural language questions into formal queries is an integral task in Question Answering (QA) systems. QA systems built on knowledge graphs like DBpedia require an extra step after Natural Language Processing (NLP) for linking words, specifically including named entities and relations, to their corresponding entities in a knowledge graph. To achieve this task, several approaches rely on background knowledge bases containing semantically-typed relations, e.g., PATTY, for an extra disambiguation step. Two major factors may affect the performance of relation linking approaches whenever background knowledge bases are accessed: a) limited availability of such semantic knowledge sources, and b) lack of a systematic approach on how to maximize the benefits of the collected knowledge. We tackle this problem and devise SIBKB, a semantic-based index able to capture knowledge encoded on background knowledge bases like PATTY. SIBKB represents a background knowledge base as a bi-partite and a dynamic index over the relation patterns included in the knowledge base. Moreover, we develop a relation linking component able to exploit SIBKB features. The benefits of SIBKB are empirically studied on existing QA benchmarks. Observed results suggest that SIBKB is able to enhance the accuracy of relation linking by up to three times.


“SimDoc: Topic Sequence Alignment based Document Similarity Framework” by Gaurav Maheshwari, Priyansh Trivedi, Harshita Sahijwani, Kunal Jha, Sourish Dasgupta and Jens Lehmann.

Abstract: Document similarity is the problem of estimating the degree to which a given pair of documents has similar semantic content. An accurate document similarity measure can improve several enterprise relevant tasks such as document clustering, text mining, and question-answering. In this paper, we show that a document’s thematic flow, which is often disregarded by bag-of-word techniques, is pivotal in estimating their similarity. To this end, we propose a novel semantic document similarity framework, called SimDoc. We model documents as topic-sequences, where topics represent latent generative clusters of related words. Then, we use a sequence alignment algorithm to estimate their semantic similarity. We further conceptualize a novel mechanism to compute topic-topic similarity to fine tune our system. In our experiments, we show that SimDoc outperforms many contemporary bag-of-words techniques in accurately computing document similarity, and on practical applications such as document clustering.
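As a rough illustration of why thematic flow matters, the sketch below scores two topic sequences with a Smith-Waterman-style local alignment. The abstract does not say which alignment algorithm or topic-topic similarity SimDoc uses, so the function `align_score`, the `exact_match` similarity, the gap penalty and the toy documents are all assumptions made for this example.

```python
def align_score(seq_a, seq_b, sim, gap=-1.0):
    """Local alignment score (Smith-Waterman style) between two topic sequences.

    sim(t1, t2) is a topic-topic similarity: positive for similar topics,
    negative for dissimilar ones. gap is the penalty for skipping a topic.
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0.0] * cols for _ in range(rows)]
    best = 0.0
    for i in range(1, rows):
        for j in range(1, cols):
            match = score[i - 1][j - 1] + sim(seq_a[i - 1], seq_b[j - 1])
            skip_a = score[i - 1][j] + gap
            skip_b = score[i][j - 1] + gap
            # Local alignment: a cell never drops below zero.
            score[i][j] = max(0.0, match, skip_a, skip_b)
            best = max(best, score[i][j])
    return best


def exact_match(t1, t2):
    # Hypothetical similarity: reward identical topics, penalize the rest.
    return 2.0 if t1 == t2 else -1.0


# Three toy documents with the same set of topics.
doc1 = ["intro", "method", "results", "discussion"]
doc2 = ["method", "results", "discussion", "intro"]   # mostly the same flow
doc3 = ["discussion", "results", "method", "intro"]   # reversed flow

same_flow = align_score(doc1, doc2, exact_match)
reversed_flow = align_score(doc1, doc3, exact_match)
```

A bag-of-words view treats all three toy documents as identical, since they contain the same topics. The alignment score is higher for `doc2` than for `doc3`, because only `doc2` shares a run of topics in the same order as `doc1`.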


“SQCFramework: SPARQL Query Containment Benchmark Generation Framework” by Muhammad Saleem, Claus Stadler, Qaiser Mehmood, Jens Lehmann and Axel-Cyrille Ngonga Ngomo.

Abstract: Query containment is a fundamental problem in data management. Its main application is in global query optimization. A number of SPARQL query containment solvers for SPARQL have been developed recently. To the best of our knowledge, the Query Containment Benchmark (QC-Bench) is the only benchmark for evaluating these containment solvers. However, this benchmark contains a fixed number of synthetic queries, which were handcrafted by its creators. We propose SQCFramework, a SPARQL query containment benchmark generation framework which is able to generate customized SPARQL containment benchmarks from real SPARQL query logs. The framework is flexible enough to generate benchmarks of varying sizes and according to the user-defined criteria on the most important SPARQL features to be considered for query containment benchmarking. The generation of benchmarks is achieved using different clustering algorithms. We compare state-of-the-art SPARQL query containment solvers by using different query containment benchmarks generated from DBpedia and Semantic Web Dog Food query logs.


“Semantic Zooming for Ontology Graph Visualizations” by Vitalis Wiens, Steffen Lohmann and Sören Auer.

Abstract: Visualizations of ontologies, in particular graph visualizations in the form of node-link diagrams, are often used to support ontology development, exploration, verification, and sensemaking. With growing size and complexity of ontology graph visualizations, their represented information tends to become hard to comprehend due to visual clutter and information overload. We present an approach that abstracts and simplifies the underlying graph structure of ontologies. The new approach of semantic zooming for ontology graph visualizations separates the comprised information of an ontology into three layers with discrete levels of detail. The visual appearance layer is defined with the support of expert interviews. The approach is applied on a force-directed layout using the VOWL notation. The mental map is preserved using smart expanding and ordering of elements in the layout. Navigation and sensemaking are supported by local and global exploration methods, halo visualization, and smooth zooming. The results of a user study confirm an increase in readability, visual clarity, and information clarity of ontology graph visualizations enhanced with our semantic zooming approach.


Acknowledgments
This work was supported by the European Union’s H2020 research and innovation program BigDataEurope (GA no. 644564), WDAqua: Marie Skłodowska-Curie Innovative Training Network (GA no. 642795) and by the European Union’s Horizon 2020 research and innovation programme GRACeFUL (GA no. 640954).


Looking forward to seeing you at K-CAP 2017.

SDA at SEMANTiCS 2017 & a Best Paper Award

SEMANTiCS 2017 is an international event on Linked Data and the Semantic Web where business users, vendors and academia meet. Our members actively participated in the 13th SEMANTiCS, which took place in Amsterdam, the Netherlands, from September 11 to 14, 2017.


We are very pleased to announce that we got 7 papers accepted at SEMANTiCS 2017 for presentation at the main conference. Additionally, we had 4 poster and 2 demo papers accepted at the same conference.

Furthermore, as a feather in our cap, our colleague Harsh Thakkar (@harsh9t) received the Best Research and Innovation Paper Award for his work “Trying Not to Die Benchmarking – Orchestrating RDF and Graph Data Management Solution Benchmarks Using LITMUS” (GitHub Org., Website, Docker, PDF).

Abstract. Knowledge graphs, usually modelled via RDF or property graphs, have gained importance over the past decade. In order to decide which Data Management Solution (DMS) performs best for specific query loads over a knowledge graph, it is required to perform benchmarks. Benchmarking is an extremely tedious task demanding repetitive manual effort, therefore it is advantageous to automate the whole process. However, there is currently no benchmarking framework which supports benchmarking and comparing diverse DMSs for both RDF and property graph DMS. To this end, we introduce the first working prototype of LITMUS, which provides this functionality as well as fine-grained environment configuration options, a comprehensive set of DMS and CPU-specific key performance indicators and a quick analytical support via custom visualization (i.e. plots) for the benchmarked DMSs.

The audience responded enthusiastically to the presentation, appreciating the work and asking questions about its future and possible synergies with industrial partners and projects.


Many interested attendees also came to the Poster & Demo session for first-hand experience with LITMUS.

Among the other presentations, our colleagues presented the following research papers: 

  • SMJoin: A Multi-way Join Operator for SPARQL Queries by Mikhail Galkin, Kemele M. Endris, Maribel Acosta, Diego Collarana, Maria-Esther Vidal, Sören Auer.
    Mikhail Galkin presented his work on introducing a concept and practical approach for multi-way joins applicable in SPARQL query engines. Multi-way joins refer to operators of n-arity, i.e., supporting more than two inputs. A set of optimizations were proposed to increase the performance of a multi-way operator. The audience was particularly interested in experimental results, the impact of query selectivity and operator optimizations.
  • IDOL: Comprehensive & Complete LOD Insights by Ciro Baron Neto, Dimitris Kontokostas, Amit Kirschenbaum, Gustavo Publio, Diego Esteves, and Sebastian Hellmann.
    The paper presents the challenges and technical barriers in identifying and linking the whole linked open data cloud using a probabilistic data structure called a Bloom filter. The audience was most interested in questions related to the problem of cross-dataset error correction as well as the generation of further analytics and heuristics metadata.
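For readers unfamiliar with the Bloom filter mentioned in the IDOL summary above, here is a minimal sketch of the data structure. It is not IDOL's implementation. The class `BloomFilter`, its default sizes, the SHA-256-based hashing scheme and the example URIs are assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Probabilistic set: no false negatives, but false positives are possible."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        # Derive num_hashes positions by salting SHA-256 with the hash index.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))


# Hypothetical use: index the subject URIs of one dataset, then check whether
# a URI used in another dataset may point into it.
dataset_subjects = BloomFilter()
for uri in ["http://example.org/a", "http://example.org/b"]:
    dataset_subjects.add(uri)

link_found = dataset_subjects.might_contain("http://example.org/a")
```

The appeal for cloud-scale link analysis is that each dataset is summarized in a fixed amount of memory, at the cost of occasional false positives that a later exact check can rule out.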

SEMANTiCS was a great venue to meet the community, create new connections, talk about current research challenges, share ideas and settle new collaborations. We look forward to the next SEMANTiCS conference.

SDA at DEXA2017 & a Best Paper Award

The International Conference on Database and Expert Systems Applications (DEXA 2017) is one of the major venues for discussing the latest scientific results and technologies around database, information, and knowledge systems. Our members actively participated in the 28th DEXA 2017 Conferences and Workshops, which took place in Lyon, France, from August 28 to 31, 2017.

We are very pleased to report that:

4 papers from our group were accepted for presentation at DEXA 2017

  • MULDER: Querying the Linked Data Web by Bridging RDF Molecule Templates by Kemele M. Endris, Mikhail Galkin, Ioanna Lytra, Mohamed Nadjib Mami, Maria-Esther Vidal and Sören Auer.
    Kemele M. Endris presented his work on Querying the Linked Data Web by Bridging RDF Molecule Templates in the main conference.

    The audience showed high interest in his presentation and appreciated such a composition into existing query processing engines.
    Kemele M. Endris secured a Best Research Paper Award for his work.

    Abstract. The increasing number of RDF data sources that allow for querying Linked Data via Web services form the basis for federated SPARQL query processing. Federated SPARQL query engines provide a unified view of a federation of RDF data sources, and rely on source descriptions for selecting the data sources over which unified queries will be executed. Albeit efficient, existing federated SPARQL query engines usually ignore the meaning of data accessible from a data source, and describe sources only in terms of the vocabularies utilized in the data source. Lack of source description may conduce to the erroneous selection of data sources for a query, thus affecting the performance of query processing over the federation. We tackle the problem of federated SPARQL query processing and devise MULDER, a query engine for federations of RDF data sources. MULDER describes data sources in terms of RDF molecule templates, i.e., abstract descriptions of entities belonging to the same RDF class. Moreover, MULDER utilizes RDF molecule templates for source selection, and query decomposition and optimization. We empirically study the performance of MULDER on existing benchmarks, and compare MULDER performance with state-of-the-art federated SPARQL query engines. Experimental results suggest that RDF molecule templates empower MULDER federated query processing, and allow for the selection of RDF data sources that not only reduce execution time, but also increase answer completeness.

DEXA2017 was a great venue to meet the community, create new connections, talk about current research challenges, share ideas and settle new collaborations. We look forward to the next DEXA conference.

Papers accepted at KESW 2017

We are very pleased to announce that our group got 2 papers accepted for presentation at KESW 2017, which will be held on 08-10 November 2017 in Szczecin, Poland.
The International Conference on Knowledge Engineering and Semantic Web (KESW) is the top international event dedicated to discussing research results and directions in the areas related to Knowledge Representation and Reasoning, Semantic Web, and Linked Data. Its aim is to bring together researchers, practitioners, and educators, in particular from ex-USSR, Eastern and Northern Europe, to present and share ideas regarding Semantic Web, and popularize the area in these regions.

Here is the list of the accepted papers with their abstracts:

“Managing Lifecycle of Big Data Applications” by Ivan Ermilov, Axel-Cyrille Ngonga Ngomo, Aad Versteden, Hajira Jabeen, Gezim Sejdiu, Giorgos Argyriou, Luigi Selmi, Jürgen Jakobitsch and Jens Lehmann.

Abstract: The growing digitization and networking process within our society has a large influence on all aspects of everyday life. Large amounts of data are being produced continuously, and when these are analyzed and interlinked they have the potential to create new knowledge and intelligent solutions for economy and society. To process this data, we developed the Big Data Integrator (BDI) Platform with various Big Data components available out-of-the-box. The integration of the components inside the BDI Platform requires components homogenization, which leads to the standardization of the development process. To support these activities we created the BDI Stack Lifecycle (SL), which consists of development, packaging, composition, enhancement, deployment and monitoring steps. In this paper, we show how we support the BDI SL with the enhancement applications developed in the BDE project. As an evaluation, we demonstrate the applicability of the BDI SL on three pilots in the domains of transport, social sciences and security.


“Ontology-based Representation of Learner Profiles for Accessible OpenCourseWare Systems” by Mirette Elias, Steffen Lohmann, Sören Auer.

Abstract: The development of accessible web applications has gained significant attention over the past couple of years due to the widespread use of the Internet and the equality laws enforced by governments. Particularly in e-learning contexts, web accessibility plays an important role, as e-learning often requires to be inclusive, addressing all types of learners, including those with disabilities. However, there is still no comprehensive formal representation of learners with disabilities and their particular accessibility needs in e-learning contexts. We propose the use of ontologies to represent accessibility needs and preferences of learners in order to structure the knowledge and to access the information for recommendations and adaptations in e-learning contexts. In particular, we reused the concepts of the ACCESSIBLE ontology and extended them with concepts defined by the IMS Global Learning Consortium. We show how OpenCourseWare systems can be adapted based on this ontology to improve accessibility.


Acknowledgments
This work was supported by the European Union’s H2020 research and innovation program BigDataEurope (GA no. 644564) and the European Union’s H2020 project SlideWiki (grant no. 688095).


Looking forward to seeing you at KESW 2017.

Demo and Poster papers accepted at ISWC 2017

We are very pleased to announce that our group got 6 demo/poster papers accepted for presentation at ISWC 2017, which will be held on 21-24 October in Vienna, Austria.
The International Semantic Web Conference (ISWC) is the premier international forum where Semantic Web / Linked Data researchers, practitioners, and industry specialists come together to discuss, advance, and shape the future of semantic technologies on the web, within enterprises and in the context of the public institution.

Here is the list of the accepted demo/poster papers with their abstracts:

“How to Revert Question Answering on Knowledge Graphs” by Gaurav Maheshwari, Mohnish Dubey, Priyansh Trivedi and Jens Lehmann.

Abstract: A large-scale question answering dataset has the potential to enable the development of robust and more accurate question answering systems. In this direction, we introduce a framework for creating such datasets which decreases the manual intervention and domain expertise traditionally needed. We describe in detail the architecture and the design decisions we took while creating the framework.


“The Tale of Sansa Spark” by Ivan Ermilov, Jens Lehmann, Gezim Sejdiu, Lorenz Bühmann, Patrick Westphal, Claus Stadler, Simon Bin, Nilesh Chakraborty, Henning Petzka, Muhammad Saleem, Axel-Cyrille Ngonga Ngomo and Hajira Jabeen.

Abstract: We demonstrate the open-source Semantic Analytics Stack (SANSA), which can perform scalable analysis of large-scale knowledge graphs to facilitate applications such as link prediction, knowledge base completion and reasoning. The motivation behind this work lies in the lack of scalability of analytics methods which exploit expressive structures underlying semantically structured knowledge bases. The demonstration is based on the BigDataEurope technical platform, which utilizes Docker technology. We present various examples of using SANSA in the form of interactive Spark notebooks, which are executed using Apache Zeppelin. The technical platform and the notebooks are available on SANSA GitHub and can be easily deployed on any Docker-enabled host, locally or in a Docker Swarm cluster.


“A Vocabulary Independent Generation Framework for DBpedia and beyond” by Ben De Meester, Anastasia Dimou, Dimitris Kontokostas, Ruben Verborgh, Jens Lehmann, Erik Mannens, Sebastian Hellmann.

Abstract: The DBpedia Extraction Framework, the generation framework behind one of the Linked Open Data cloud’s central hubs, has limitations which lead to quality issues with the DBpedia dataset. Therefore, we provide a new take on its Extraction Framework that allows for a sustainable and general-purpose Linked Data generation framework by adapting a semantic-driven approach. The proposed approach decouples, in a declarative manner, the extraction, transformation, and mapping rules execution. This way, among others, interchanging different schema annotations is supported, instead of being coupled to a certain ontology as it is now, because the DBpedia Extraction Framework allows only generating a certain dataset with a single semantic representation. In this paper, we shed more light on the added value that this aspect brings. We provide an extracted DBpedia dataset using a different vocabulary, and give users the opportunity to generate a new DBpedia dataset using a custom combination of vocabularies.


“Benchmarking RDF Storage Solutions with IGUANA” by Felix Conrads, Jens Lehmann, Muhammad Saleem and Axel-Cyrille Ngonga Ngomo.

Abstract: Choosing the right RDF storage solution is of central importance when developing any data-driven Semantic Web solution. In this demonstration paper, we present the configuration and use of the IGUANA benchmarking framework. This framework addresses a crucial drawback of state-of-the-art benchmarks: While several benchmarks have been proposed that assess the performance of triple stores, an integrated benchmark-independent execution framework for these benchmarks was yet to be made available. IGUANA addresses this research gap by providing an integrated and highly configurable environment for the execution of SPARQL benchmarks. Our framework complements benchmarks by providing an execution environment which can measure the performance of triple stores during data loading, data updates as well as under different loads and parallel requests. Moreover, it allows a uniform comparison of results on different benchmarks. During the demonstration, we will execute the DBPSB benchmark using the IGUANA framework and show how our framework measures the performance of popular triple stores under updates and parallel user requests. IGUANA is open-source and can be found at http://iguana-benchmark.eu/.


“BatWAn – A Binary and Multi-way Query Plan Analyzer” by Mikhail Galkin, Maria-Esther Vidal.

Abstract: The majority of existing SPARQL query engines generate query plans composed of binary join operators. Albeit effective, binary joins can drastically impact the performance of query processing whenever source answers need to be passed through multiple operators in a query plan. Multi-way joins have been proposed to overcome this problem; they are able to propagate and generate results in a single step during query execution. We demonstrate the benefits of query plans with multi-way operators with BatWAn, a binary and multi-way query plan analyzer. Attendees will observe the behavior of multi-way joins on queries of different selectivity, as well as the impact on total execution time, time for the first answer, and continuous results yield over time.


“QAESTRO – Semantic Composition of QA Pipelines” by Kuldeep Singh, Ioanna Lytra, Kunwar Abhinav Aditya, Maria-Esther Vidal.

Abstract: Many question answering systems and related components have been developed in recent years. Since question answering involves several tasks and subtasks, common in many systems, existing components can be combined in various ways to build tailored question answering pipelines. QAESTRO provides the tools to semantically describe question answering components and automatically generate possible pipelines given developer requirements. We demonstrate the functionality of QAESTRO for building question answering pipelines including different tasks and components. Attendees will be able to semantically describe question answering pipelines and integrate them in existing frameworks.


Acknowledgments
This work was supported by the European Union’s H2020 research and innovation action HOBBIT (GA no. 688227), the European Union’s H2020 research and innovation program BigDataEurope (GA no. 644564), a DAAD Scholarship and WDAqua: Marie Skłodowska-Curie Innovative Training Network.


Looking forward to seeing you at ISWC 2017.

Papers accepted at ISWC 2017

We are very pleased to announce that our group got 6 papers accepted for presentation at ISWC 2017, which will be held on 21-24 October in Vienna, Austria.
The International Semantic Web Conference (ISWC) is the premier international forum where Semantic Web / Linked Data researchers, practitioners, and industry specialists come together to discuss, advance, and shape the future of semantic technologies on the web, within enterprises and in the context of the public institution.

Here is the list of the accepted papers with their abstracts:

“Distributed Semantic Analytics using the SANSA Stack” by Jens Lehmann, Gezim Sejdiu, Lorenz Bühmann, Patrick Westphal, Claus Stadler, Ivan Ermilov, Simon Bin, Muhammad Saleem, Axel-Cyrille Ngonga Ngomo and Hajira Jabeen.

Abstract: A major research challenge is to perform scalable analysis of large-scale knowledge graphs to facilitate applications like link prediction, knowledge base completion and reasoning. Analytics methods which exploit expressive structures usually do not scale well to very large knowledge bases, and most analytics approaches which do scale horizontally (i.e., can be executed in a distributed environment) work on simple feature-vector-based input. This software framework paper describes the ongoing Semantic Analytics Stack (SANSA) project, which supports expressive and scalable semantic analytics by providing functionality for distributed computing on RDF data.


“A Corpus for Complex Question Answering over Knowledge Graphs” by Priyansh Trivedi, Gaurav Maheshwari, Mohnish Dubey and Jens Lehmann.

Abstract: Being able to access knowledge bases in an intuitive way has been an active area of research over the past years. In particular, several question answering (QA) approaches which allow to query RDF datasets in natural language have been developed as they allow end users to access knowledge without needing to learn the schema of a knowledge base and learn a formal query language. To foster this research area, several training datasets have been created, e.g., in the QALD (Question Answering over Linked Data) initiative. However, existing datasets are insufficient in terms of size, variety or complexity to apply and evaluate a range of machine learning based QA approaches for learning complex SPARQL queries. With the provision of the Large-Scale Complex Question Answering Dataset (LC-QuAD), we close this gap by providing a dataset with 5000 questions and their corresponding SPARQL queries over the DBpedia dataset. In this article, we describe the dataset creation process and how we ensure a high variety of questions, which should enable to assess the robustness and accuracy of the next generation of QA systems for knowledge graphs.


“Iguana: A Generic Framework for Benchmarking the Read-Write Performance of Triple Stores” by Felix Conrads, Jens Lehmann, Axel-Cyrille Ngonga Ngomo, Muhammad Saleem and Mohamed Morsey.

Abstract: The performance of triple stores is crucial for applications which rely on RDF data. Several benchmarks have been proposed that assess the performance of triple stores. However, no integrated benchmark-independent execution framework for these benchmarks has been provided so far. We propose a novel SPARQL benchmark execution framework called IGUANA. Our framework complements benchmarks by providing an execution environment which can measure the performance of triple stores during data loading, data updates as well as under different loads. Moreover, it allows a uniform comparison of results on different benchmarks. We execute the FEASIBLE and DBPSB benchmarks using the IGUANA framework and measure the performance of popular triple stores under updates and parallel user requests. We compare our results with state-of-the-art benchmarking results and show that our benchmark execution framework can unveil new insights pertaining to the performance of triple stores.


“Sustainable Linked Data generation: the case of DBpedia” by Wouter Maroy, Anastasia Dimou, Dimitris Kontokostas, Ben De Meester, Jens Lehmann, Erik Mannens and Sebastian Hellmann.

Abstract: DBpedia EF, the generation framework behind one of the Linked Open Data cloud’s central interlinking hubs, has limitations regarding the quality, coverage and sustainability of the generated dataset. Hence, DBpedia can be further improved on both the schema and the data level. Errors and inconsistencies can be addressed by amending (i) the DBpedia EF; (ii) the DBpedia mapping rules; or (iii) Wikipedia itself. However, even though the DBpedia EF is continuously evolving and several changes were applied to both the DBpedia EF and the mapping rules, there have been no significant improvements to the DBpedia dataset since the identification of its limitations. To address these shortcomings, we propose adopting a different semantic-driven approach that decouples, in a declarative manner, the extraction, transformation and mapping rules execution. In this paper, we provide details regarding the new DBpedia EF, its architecture, technical implementation and extraction results. This way, we achieve an enhanced data generation process for DBpedia, which can be broadly adopted and which improves its quality, coverage and sustainability.


“Realizing an RDF-based Information Model for a Manufacturing Company – A Case Study” by Niklas Petersen, Lavdim Halilaj, Irlán Grangel-González, Steffen Lohmann, Christoph Lange and Sören Auer.

Abstract: The digitization of the industry requires information models describing assets and information sources of companies to enable the semantic integration and interoperable exchange of data. We report on a case study in which we realized such an information model for a global manufacturing company using semantic technologies. The information model is centered around machine data and describes all relevant assets, key terms and relations in a structured way, making use of existing as well as newly developed RDF vocabularies. In addition, it comprises numerous RML mappings that link different data sources required for integrated data access and querying via SPARQL. The technical infrastructure and methodology used to develop and maintain the information model are based on a Git repository and utilize the development environment VoCol as well as the Ontop framework for Ontology Based Data Access. Two use cases demonstrate the benefits and opportunities provided by the information model. We evaluated the approach with stakeholders and report on lessons learned from the case study.


“Diefficiency Metrics: Measuring the Continuous Efficiency of Query Processing Approaches” by Maribel Acosta, Maria-Esther Vidal, York Sure-Vetter.

Abstract: During empirical evaluations of query processing techniques, metrics like execution time, time for the first answer, and throughput are usually reported. Albeit informative, these metrics are unable to quantify and evaluate the efficiency of a query engine over a certain time period (or diefficiency), thus hampering the distinction of cutting-edge engines able to exhibit high performance gradually. We tackle this issue and devise two experimental metrics named dief@t and dief@k, which allow for measuring the diefficiency during an elapsed time period or while k answers are produced, respectively. The dief@t and dief@k measurement methods rely on the computation of the area under the curve of answer traces, thus capturing the answer concentration over a time interval. We report experimental results of evaluating the behavior of a generic SPARQL query engine using both metrics. Observed results suggest that dief@t and dief@k are able to measure the performance of SPARQL query engines based on both the number of answers produced by an engine and the time required to generate these answers.
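To make the area-under-the-curve idea concrete, here is a minimal Python sketch of a dief@t-style computation. It is not the authors' implementation: the function name `dief_at_t`, the trace format (a list of `(time, cumulative_answers)` pairs), the trapezoidal rule and the sample traces are all assumptions made for illustration. The sketch integrates only over recorded points up to t, so engines should be compared at a t that all traces cover.

```python
def dief_at_t(trace, t):
    """Area under an answer trace up to time t (trapezoidal rule).

    trace: list of (time, cumulative_answers) pairs sorted by time.
    Assumption: the curve starts at (0, 0) and is linear between
    recorded points; nothing is extrapolated past the last point.
    """
    points = [(0.0, 0)] + [(ts, n) for ts, n in trace if ts <= t]
    area = 0.0
    for (t0, n0), (t1, n1) in zip(points, points[1:]):
        area += (t1 - t0) * (n0 + n1) / 2.0
    return area


# Two hypothetical engines that both deliver 3 answers after 1.5 s.
steady = [(0.5, 1), (1.0, 2), (1.5, 3)]   # answers arrive continuously
bursty = [(1.3, 1), (1.4, 2), (1.5, 3)]   # answers arrive late, in a burst

print(dief_at_t(steady, 1.5))
print(dief_at_t(bursty, 1.5))
```

Both hypothetical engines have the same total execution time and the same number of answers, yet the steady engine obtains the larger area, which is the kind of difference that execution time alone would not reveal.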


Acknowledgments
This work was supported by the European Union’s H2020 research and innovation action HOBBIT (GA no. 688227), the European Union’s H2020 research and innovation program BigDataEurope (GA no. 644564), the German Ministry BMWI under the SAKE project (Grant No. 01MD15006E), WDAqua: Marie Skłodowska-Curie Innovative Training Network, and Industrial Data Space.


Looking forward to seeing you at ISWC 2017

Papers accepted at SEMANTiCS 2017

We are very pleased to announce that our group got 7 papers accepted for presentation at SEMANTiCS 2017, which will be held on 11-14 September in Amsterdam.
SEMANTiCS 2017 is an international event on Linked Data and the Semantic Web where business users, vendors and academia meet. Widely recognized to be of pivotal importance, it is the thirteenth edition of a well-attended yearly conference that started back in 2005. It offers keynotes by world-class practitioners, presentations and field reports in diverse tracks, talks addressing a variety of topics, and panel discussions, as well as ample opportunities for networking and meeting like-minded professionals in an informal setting.

Here is the list of accepted papers with their abstracts:

“Trying Not to Die Benchmarking — Orchestrating RDF and Graph Data Management Solution Benchmarks Using LITMUS”
by Harsh Thakkar, Yashwant Keswani, Mohnish Dubey, Jens Lehmann and Sören Auer.

Abstract: Knowledge graphs, usually modelled via RDF or property graphs, have gained importance over the past decade. In order to decide which Data Management Solution (DMS) performs best for specific query loads over a knowledge graph, benchmarks must be performed. Benchmarking is an extremely tedious task demanding repetitive manual effort; it is therefore advantageous to automate the whole process. However, there is currently no benchmarking framework which supports benchmarking and comparing diverse DMSs for both RDF and property graph DMSs. To this end, we introduce the first working prototype of LITMUS, which provides this functionality as well as fine-grained environment configuration options, a comprehensive set of DMS and CPU-specific key performance indicators, and quick analytical support via custom visualization (i.e. plots) for the benchmarked DMSs.


“IDOL: Comprehensive & Complete LOD Insights”
by C. Baron Neto, D. Kontokostas, G. Publio, D. Esteves, A. Kirschenbaum and S. Hellmann.

Abstract: Over the last decade, we observed a steadily increasing number of RDF datasets made available on the web of data. The decentralized nature of the web, however, makes it hard to identify all these datasets. Even more so, when downloadable data distributions are discovered, only insufficient metadata is available to describe the datasets properly, thus posing barriers to their usefulness and reuse. In this paper, we describe an attempt to exhaustively identify the whole linked open data cloud by harvesting metadata from multiple sources, providing insights about duplicated data and the general quality of the available metadata. This was only possible by using a probabilistic data structure called a Bloom filter. Finally, we enrich existing dataset metadata with our approach and republish them through a SPARQL endpoint.
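For readers unfamiliar with Bloom filters, the following Python sketch shows how such a structure can flag probable duplicates between datasets using a fixed amount of memory. It is a generic illustration, not the code used in IDOL: the class `BloomFilter`, the helper `estimate_overlap`, the bit-array size, the number of hash functions and the sample triples are all assumptions. A Bloom filter never misses an item that was added, but it may report false positives, so the overlap count is an upper-bound estimate.

```python
import hashlib


class BloomFilter:
    """Fixed-size probabilistic set: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def estimate_overlap(bloom, items):
    """Count items that are probably present in the filter."""
    return sum(1 for item in items if item in bloom)


# Hypothetical triples serialized as strings, for illustration only.
dataset_a = ["<s1> <p> <o1>", "<s2> <p> <o2>"]
bloom = BloomFilter()
for triple in dataset_a:
    bloom.add(triple)
print(estimate_overlap(bloom, dataset_a))
```

The memory footprint depends only on `num_bits`, not on the number of triples inserted, which is what makes this kind of structure attractive when comparing very large collections of datasets.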


“Semantic Similarity based Clustering of License Excerpts for Improved End-User Interpretation”
by Najmeh Mousavi Nejad, Simon Scerri, Sören Auer.

Abstract: With the omnipresent availability and use of cloud services, software tools, Web portals or services, legal contracts in the form of license agreements or terms and conditions regulating their use are of paramount importance. Often the textual documents describing these regulations comprise many pages and cannot reasonably be assumed to be read and understood by humans. In this work, we describe a method for extracting and clustering relevant parts of such documents, including permissions, obligations, and prohibitions. The clustering is based on semantic similarity, employing a distributional semantics approach on a large word embeddings database. An evaluation shows that it can significantly improve human comprehension and that improved feature-based clustering has the potential to further reduce the time required for EULA digestion. Our implementation is available as a web service, which can directly be used to process and prepare legal usage contracts.


“Matching Natural Language Relations to Knowledge Graph Properties for Question Answering”
by Isaiah Onando Mulang’, Kuldeep Singh, Fabrizio Orlandi.

Abstract: Research has seen considerable achievements concerning translation of natural language patterns into formal queries for Question Answering (QA) based on Knowledge Graphs (KG). One of the main challenges in this research area is how to identify which property within a Knowledge Graph matches the predicate found in a Natural Language (NL) relation. Current approaches for formal query generation attempt to resolve this problem mainly by first retrieving the named entity from the KG together with a list of its predicates, then filtering out one from all the predicates of the entity. We attempt an approach to directly match an NL predicate to KG properties that can be employed within QA pipelines. In this paper, we specify a systematic approach and provide a tool that can be employed to solve this task. Our approach models KB relations with their underlying parts of speech; we then enhance this with extra attributes obtained from WordNet and dependency parsing characteristics. From a question, we model a similar representation of query relations. We then define distance measurements between the query relation and the property representations from the KG to identify which property is referred to by the relation within the query. We report substantive recall values and considerable precision from our evaluation.


“Ontology-guided Job Market Demand Analysis: A Cross-Sectional Study for the Data Science field”
by Elisa Margareth Sibarani, Simon Scerri, Camilo Morales, Sören Auer and Diego Collarana.

Abstract: The rapid changes in the job market, including a continuous year-on-year increase in new skills in sectors like information technology, have resulted in new challenges for job seekers and educators alike. The former feel less informed about which skills they should acquire to raise their competitiveness, whereas the latter are inadequately prepared to offer courses that meet the expectations of fast-evolving sectors like data science. In this paper, we describe efforts to obtain job demand data and employ an information extraction method guided by a purposely-designed vocabulary to identify skills requested by the job vacancies.
The Ontology-based Information Extraction (OBIE) method employed relies on the Skills and Recruitment Ontology (SARO), which we developed to represent job postings in the context of skills and competencies needed to fill a job role. Skill demand by employers is then abstracted using co-word analysis based on a set of skill keywords and their co-occurrences in the job posts. This method reveals the technical skills in demand together with their structure for revealing significant linkages. In an evaluation, the performance of the OBIE method for automatic skill annotation is estimated (strict F-measure) at 79%, which is satisfactory given that human inter-annotator agreement was found to reach an overall strict F-measure of 94%. In a secondary study, sample skill maps generated from the matrix of co-occurrences and correlation are presented and discussed as proof-of-concept, highlighting the potential of using the extracted OBIE data for more advanced analysis that we plan as future work, including time series analysis.
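The co-word analysis step mentioned in this abstract can be illustrated with a short Python sketch that counts how often two skill keywords appear in the same job post. This is only a generic illustration of co-occurrence counting, not the authors' pipeline: the function name `cooccurrence_counts` and the sample posts are invented, and the skills are assumed to have been extracted already.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(posts):
    """Count, for each unordered skill pair, the posts mentioning both.

    posts: iterable of skill-keyword lists, one list per job post.
    Pairs are stored as alphabetically sorted tuples.
    """
    counts = Counter()
    for skills in posts:
        # set() ensures a skill repeated within one post is counted once.
        for a, b in combinations(sorted(set(skills)), 2):
            counts[(a, b)] += 1
    return counts


# Invented sample data: skills annotated in three job posts.
posts = [
    ["python", "sql"],
    ["python", "sql", "spark"],
    ["spark", "python"],
]
counts = cooccurrence_counts(posts)
for pair, n in counts.most_common():
    print(pair, n)
```

The resulting counts correspond to the entries of a symmetric co-occurrence matrix, from which skill maps or correlation measures could then be derived.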


“SMJoin: A Multi-way Join Operator for SPARQL Queries”
by Mikhail Galkin, Kemele M. Endris, Maribel Acosta, Diego Collarana, Maria-Esther Vidal, Sören Auer.

Abstract: State-of-the-art SPARQL query engines rely on binary join operators tailored for merging results from SPARQL queries over Web access interfaces. However, in queries with a large number of triple patterns, binary joins constitute a significant burden on the query performance. Multi-way joins that handle more than two inputs are able to reduce the complexity of pre-processing stages and reduce the execution time. We devise SMJoin, a multi-way non-blocking join operator tailored for independently merging results from more than two RDF data sources. SMJoin implements intra-operator adaptivity, i.e., it is able to adjust join execution schedulers to the conditions of Web access interfaces; thus, query answers are produced as soon as they are computed and can be continuously generated even if one of the sources becomes blocked. We empirically study the behavior of SMJoin in two benchmarks with queries of different selectivity; state-of-the-art SPARQL query engines are included in the study. Experimental results suggest that SMJoin outperforms existing approaches in very selective queries, and produces first answers as fast as state-of-the-art adaptive query engines in non-selective queries.
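The general idea of a non-blocking multi-way join can be sketched in Python as a symmetric hash join over several inputs. This is explicitly not the SMJoin algorithm: it has no adaptive scheduling, it assumes that all sources share a single join key, and the class name `MultiWayHashJoin` and the sample tuples are hypothetical. It only shows how results can be emitted as soon as every source has contributed a matching tuple, regardless of arrival order.

```python
from collections import defaultdict
from itertools import product


class MultiWayHashJoin:
    """Symmetric hash join over num_sources inputs sharing one join key."""

    def __init__(self, num_sources):
        # One hash table per source, mapping join key -> list of values.
        self.tables = [defaultdict(list) for _ in range(num_sources)]

    def insert(self, source, key, value):
        """Store an arriving tuple and return the join results it completes."""
        self.tables[source][key].append(value)
        partners = []
        for i, table in enumerate(self.tables):
            if i == source:
                partners.append([value])      # only the new tuple on this side
            else:
                matches = table.get(key)
                if not matches:
                    return []                 # some source has no match yet
                partners.append(matches)
        return [tuple(combo) for combo in product(*partners)]


join = MultiWayHashJoin(num_sources=3)
print(join.insert(0, "k", "a"))   # nothing to join with yet
print(join.insert(1, "k", "b"))   # third source still missing
print(join.insert(2, "k", "c"))   # completes the first result
```

Because each arriving tuple is probed against the other tables immediately, no source has to be read to completion before results appear, which is the property the abstract refers to as non-blocking.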


“On the Benchmarking of Faceted Browsing”
by Henning Petzka, Claus Stadler, Georgios Katsimpras, Bastian Haarmann and Jens Lehmann

Abstract: The increasing availability of large amounts of linked data creates a need for software that allows for its efficient exploration. Systems enabling faceted browsing constitute a user-friendly solution that needs to combine suitable choices for front and back end. Since a generic solution must be adjustable with respect to the dataset, the underlying ontology and the knowledge graph characteristics raise several challenges and heavily influence the browsing experience. As a consequence, an understanding of these challenges becomes an important matter of study. We present a benchmark on faceted browsing, which allows systems to test their performance on specific choke points on the back end. Further, we address additional issues in faceted browsing that may be caused by problematic modelling choices within the underlying ontology.


Acknowledgments
This work was supported by the European Union’s H2020 research and innovation action HOBBIT (GA no. 688227), the European Union’s H2020 research and innovation program BigDataEurope (GA no. 644564), DAAD Scholarship, LPDP (Indonesia Endowment Fund for Education), EDSA, and WDAqua: Marie Skłodowska-Curie Innovative Training Network.


Looking forward to seeing you at SEMANTiCS 2017