Mohnish – Smart Data Analytics

We are very pleased to announce that our group had seven papers accepted for presentation at SEMANTiCS 2017, which will be held from 11 to 14 September in Amsterdam.
SEMANTiCS 2017 is an international event on Linked Data and the Semantic Web where business users, vendors and academia meet. Widely recognized to be of pivotal importance, it is the thirteenth edition of a well-attended yearly conference that started back in 2005. It offers keynotes by world-class practitioners, presentations and field reports in diverse tracks, talks addressing a variety of topics, and panel discussions, as well as ample opportunities for networking and meeting like-minded professionals in an informal setting.

Here is the list of accepted papers with their abstracts:

“Trying Not to Die Benchmarking — Orchestrating RDF and Graph Data Management Solution Benchmarks Using LITMUS”
by Harsh Thakkar, Yashwant Keswani, Mohnish Dubey, Jens Lehmann and Sören Auer.

Abstract: Knowledge graphs, usually modelled via RDF or property graphs, have gained importance over the past decade. In order to decide which Data Management Solution (DMS) performs best for specific query loads over a knowledge graph, it is required to perform benchmarks. Benchmarking is an extremely tedious task demanding repetitive manual effort; therefore, it is advantageous to automate the whole process. However, there is currently no benchmarking framework which supports benchmarking and comparing diverse DMSs for both RDF and property graph DMS. To this end, we introduce the first working prototype of LITMUS, which provides this functionality as well as fine-grained environment configuration options, a comprehensive set of DMS and CPU-specific key performance indicators, and quick analytical support via custom visualization (i.e. plots) for the benchmarked DMSs.

“IDOL: Comprehensive & Complete LOD Insights”
by C. Baron Neto, D. Kontokostas, G. Publio, D. Esteves, A. Kirschenbaum and S. Hellmann.

Abstract: Over the last decade, we observed a steadily increasing amount of RDF datasets made available on the web of data. The decentralized nature of the web, however, makes it hard to identify all these datasets. Even more so, when downloadable data distributions are discovered, only insufficient metadata is available to describe the datasets properly, thus posing barriers to their usefulness and reuse. In this paper, we describe an attempt to exhaustively identify the whole linked open data cloud by harvesting metadata from multiple sources, providing insights about duplicated data and the general quality of the available metadata. This was only possible by using a probabilistic data structure called a Bloom filter. Finally, we enrich existing dataset metadata with our approach and republish them through a SPARQL endpoint.
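For readers unfamiliar with the data structure mentioned in this abstract, here is a minimal sketch of a Bloom filter. It is not the IDOL implementation: the abstract does not say how the filter is configured or what is stored in it, so the class name, the sizes, the SHA-256-based hashing and the example URIs below are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter: approximate set membership in fixed memory.

    It can report false positives but never false negatives.
    Sizes and hashing scheme are assumptions, not taken from the paper.
    """

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly seen"; False means "definitely not seen".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


# Hypothetical usage: remember which resource URIs have already been seen.
seen = BloomFilter()
for uri in ["http://example.org/a", "http://example.org/b"]:
    seen.add(uri)
```

A membership check such as `"http://example.org/a" in seen` is then a cheap way to flag probable duplicates without keeping every identifier in memory, at the cost of occasional false positives.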

Abstract: With the omnipresent availability and use of cloud services, software tools, Web portals or services, legal contracts in the form of license agreements or terms and conditions regulating their use are of paramount importance. Often the textual documents describing these regulations comprise many pages and cannot be reasonably assumed to be read and understood by humans. In this work, we describe a method for extracting and clustering relevant parts of such documents, including permissions, obligations, and prohibitions. The clustering is based on semantic similarity employing a distributional semantics approach on a large word embeddings database. An evaluation shows that it can significantly improve human comprehension and that improved feature-based clustering has the potential to further reduce the time required for EULA digestion. Our implementation is available as a web service, which can directly be used to process and prepare legal usage contracts.

“Matching Natural Language Relations to Knowledge Graph Properties for Question Answering.”
by Isaiah Onando Mulang’, Kuldeep Singh, Fabrizio Orlandi.

Abstract: Research has seen considerable achievements concerning translation of natural language patterns into formal queries for Question Answering (QA) based on Knowledge Graphs (KG). One of the main challenges in this research area is how to identify which property within a Knowledge Graph matches the predicate found in a Natural Language (NL) relation. Current approaches for formal query generation attempt to resolve this problem mainly by first retrieving the named entity from the KG together with a list of its predicates, then filtering out one from all the predicates of the entity. We attempt an approach to directly match an NL predicate to KG properties that can be employed within QA pipelines. In this paper, we specify a systematic approach and provide a tool that can be employed to solve this task. Our approach models KB relations with their underlying parts of speech; we then enhance this with extra attributes obtained from Wordnet and Dependency parsing characteristics. From a question, we model a similar representation of query relations. We then define distance measurements between the query relation and the property representations from the KG to identify which property is referred to by the relation within the query. We report substantive recall values and considerable precision from our evaluation.

“Ontology-guided Job Market Demand Analysis: A Cross-Sectional Study for the Data Science field.”
by Elisa Margareth Sibarani, Simon Scerri, Camilo Morales, Sören Auer and Diego Collarana.

Abstract: The rapid changes in the job market, including a continuous year-on-year increase in new skills in sectors like information technology, have resulted in new challenges for job seekers and educators alike. The former feel less informed about which skills they should acquire to raise their competitiveness, whereas the latter are inadequately prepared to offer courses that meet the expectations of fast-evolving sectors like data science. In this paper, we describe efforts to obtain job demand data and employ an information extraction method guided by a purposely-designed vocabulary to identify skills requested by the job vacancies.
The Ontology-based Information Extraction (OBIE) method employed relies on the Skills and Recruitment Ontology (SARO), which we developed to represent job postings in the context of skills and competencies needed to fill a job role. Skill demand by employers is then abstracted using co-word analysis based on a set of skill keywords and their co-occurrences in the job posts. This method reveals the technical skills in demand together with their structure for revealing significant linkages. In an evaluation, the performance of the OBIE method for automatic skill annotation is estimated (strict F-measure) at 79%, which is satisfactory given that human inter-annotator agreement was found to be automatic keyword indexing with an overall strict F-measure at 94%. In a secondary study, sample skill maps generated from the matrix of co-occurrences and correlation are presented and discussed as proof-of-concept, highlighting the potential of using the extracted OBIE data for more advanced analysis that we plan as future work, including time series analysis.
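As a rough illustration of the co-word analysis step described in this abstract, the sketch below counts how often pairs of skill keywords appear together in the same job post. It is not the authors' pipeline: the function name and the toy job posts are hypothetical, and the abstract does not specify how co-occurrences are weighted or normalised.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(posts):
    """Count, for every pair of skills, the number of posts mentioning both.

    `posts` is an iterable of skill-keyword lists, one list per job post.
    Pairs are stored as alphabetically sorted tuples.
    """
    counts = Counter()
    for skills in posts:
        # Deduplicate within a post so each pair counts once per post.
        for pair in combinations(sorted(set(skills)), 2):
            counts[pair] += 1
    return counts


# Toy data (hypothetical): skills already extracted from three job posts.
posts = [
    ["python", "sql"],
    ["python", "sql", "spark"],
    ["spark", "python"],
]
counts = cooccurrence_counts(posts)
```

The resulting counts form the kind of co-occurrence matrix from which skill maps can be derived; pairs with high counts indicate skills that employers tend to request together.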

“SMJoin: A Multi-way Join Operator for SPARQL Queries”
by Mikhail Galkin, Kemele M. Endris, Maribel Acosta, Diego Collarana, Maria-Esther Vidal, Sören Auer.

Abstract: State-of-the-art SPARQL query engines rely on binary join operators tailored for merging results from SPARQL queries over Web access interfaces. However, in queries with a large number of triple patterns, binary joins constitute a significant burden on the query performance. Multi-way joins that handle more than two inputs are able to reduce the complexity of pre-processing stages and reduce the execution time. We devise SMJoin, a multi-way non-blocking join operator tailored for independently merging results from more than two RDF data sources. SMJoin implements intra-operator adaptivity, i.e., it is able to adjust join execution schedulers to the conditions of Web access interfaces; thus, query answers are produced as soon as they are computed and can be continuously generated even if one of the sources becomes blocked. We empirically study the behavior of SMJoin in two benchmarks with queries of different selectivity; state-of-the-art SPARQL query engines are included in the study. Experimental results suggest that SMJoin outperforms existing approaches in very selective queries, and produces first answers as fast as state-of-the-art adaptive query engines in non-selective queries.

“On the Benchmarking of Faceted Browsing”
by Henning Petzka, Claus Stadler, Georgios Katsimpras, Bastian Haarmann and Jens Lehmann.

Abstract: The increasing availability of large amounts of linked data creates a need for software that allows for its efficient exploration. Systems enabling faceted browsing constitute a user-friendly solution that needs to combine suitable choices for front and back end. Since a generic solution must be adjustable with respect to the dataset, the underlying ontology and the knowledge graph characteristics raise several challenges and heavily influence the browsing experience. As a consequence, an understanding of these challenges becomes an important matter of study. We present a benchmark on faceted browsing, which allows systems to test their performance on specific choke points on the back end. Further, we address additional issues in faceted browsing that may be caused by problematic modelling choices within the underlying ontology.

Acknowledgments
This work was supported by the European Union’s H2020 research and innovation action HOBBIT (GA no. 688227), the European Union’s H2020 research and innovation program BigDataEurope (GA no. 644564), DAAD Scholarship, LPDP (Indonesia Endowment Fund for Education), EDSA and WDAqua: Marie Skłodowska-Curie Innovative Training Network.

Looking forward to seeing you at SEMANTiCS 2017.

Author: Mohnish
