We are very pleased to announce that our group got 6 papers accepted for presentation at ISWC 2017, which will be held on 21-24 October in Vienna, Austria.
The International Semantic Web Conference (ISWC) is the premier international forum where Semantic Web / Linked Data researchers, practitioners, and industry specialists come together to discuss, advance, and shape the future of semantic technologies on the web, within enterprises and in the context of public institutions.
Here is the list of accepted papers with their abstracts:
“Distributed Semantic Analytics using the SANSA Stack” by Jens Lehmann, Gezim Sejdiu, Lorenz Bühmann, Patrick Westphal, Claus Stadler, Ivan Ermilov, Simon Bin, Muhammad Saleem, Axel-Cyrille Ngonga Ngomo and Hajira Jabeen.
Abstract: A major research challenge is to perform scalable analysis of large-scale knowledge graphs to facilitate applications like link prediction, knowledge base completion and reasoning. Analytics methods which exploit expressive structures usually do not scale well to very large knowledge bases, and most analytics approaches which do scale horizontally (i.e., can be executed in a distributed environment) work on simple feature-vector-based input. This software framework paper describes the ongoing Semantic Analytics Stack (SANSA) project, which supports expressive and scalable semantic analytics by providing functionality for distributed computing on RDF data.
Abstract: Being able to access knowledge bases in an intuitive way has been an active area of research over the past years. In particular, several question answering (QA) approaches that allow querying RDF datasets in natural language have been developed, as they allow end users to access knowledge without needing to learn the schema of a knowledge base or a formal query language. To foster this research area, several training datasets have been created, e.g. in the QALD (Question Answering over Linked Data) initiative. However, existing datasets are insufficient in terms of size, variety or complexity to apply and evaluate a range of machine learning based QA approaches for learning complex SPARQL queries. With the provision of the Large-Scale Complex Question Answering Dataset (LC-QuAD), we close this gap by providing a dataset with 5000 questions and their corresponding SPARQL queries over the DBpedia dataset. In this article, we describe the dataset creation process and how we ensure a high variety of questions, which should make it possible to assess the robustness and accuracy of the next generation of QA systems for knowledge graphs.
Abstract: The performance of triple stores is crucial for applications which rely on RDF data. Several benchmarks have been proposed that assess the performance of triple stores. However, no integrated benchmark-independent execution framework for these benchmarks has been provided so far. We propose a novel SPARQL benchmark execution framework called IGUANA. Our framework complements benchmarks by providing an execution environment which can measure the performance of triple stores during data loading, data updates as well as under different loads. Moreover, it allows a uniform comparison of results on different benchmarks. We execute the FEASIBLE and DBPSB benchmarks using the IGUANA framework and measure the performance of popular triple stores under updates and parallel user requests. We compare our results with state-of-the-art benchmarking results and show that our benchmark execution framework can unveil new insights pertaining to the performance of triple stores.
Abstract: DBpedia EF, the generation framework behind one of the Linked Open Data cloud’s central interlinking hubs, has limitations regarding the quality, coverage and sustainability of the generated dataset. Hence, DBpedia can be further improved on both the schema and the data level. Errors and inconsistencies can be addressed by amending (i) the DBpedia EF; (ii) the DBpedia mapping rules; or (iii) Wikipedia itself. However, even though the DBpedia EF is continuously evolving and several changes were applied to both the DBpedia EF and the mapping rules, there have been no significant improvements to the DBpedia dataset since the identification of its limitations. To address these shortcomings, we propose adapting a different semantic-driven approach that decouples, in a declarative manner, the extraction, transformation and mapping rules execution. In this paper, we provide details regarding the new DBpedia EF, its architecture, technical implementation and extraction results. This way, we achieve an enhanced data generation process for DBpedia, which can be broadly adopted and which improves its quality, coverage and sustainability.
Abstract: The digitization of the industry requires information models describing assets and information sources of companies to enable the semantic integration and interoperable exchange of data. We report on a case study in which we realized such an information model for a global manufacturing company using semantic technologies. The information model is centered around machine data and describes all relevant assets, key terms and relations in a structured way, making use of existing as well as newly developed RDF vocabularies. In addition, it comprises numerous RML mappings that link different data sources required for integrated data access and querying via SPARQL. The technical infrastructure and methodology used to develop and maintain the information model is based on a Git repository and utilizes the development environment VoCol as well as the Ontop framework for Ontology Based Data Access. Two use cases demonstrate the benefits and opportunities provided by the information model. We evaluated the approach with stakeholders and report on lessons learned from the case study.
“Diefficiency Metrics: Measuring the Continuous Efficiency of Query Processing Approaches” by Maribel Acosta, Maria-Esther Vidal, York Sure-Vetter.
Abstract: During empirical evaluations of query processing techniques, metrics like execution time, time for the first answer, and throughput are usually reported. Albeit informative, these metrics are unable to quantify and evaluate the efficiency of a query engine over a certain time period (or diefficiency), thus hampering the distinction of cutting-edge engines able to exhibit high-performance gradually. We tackle this issue and devise two experimental metrics named dief@t and dief@k, which allow for measuring the diefficiency during an elapsed time period or while k answers are produced, respectively. The dief@t and dief@k measurement methods rely on the computation of the area under the curve of answer traces and thus capture the answer concentration over a time interval. We report experimental results of evaluating the behavior of a generic SPARQL query engine using both metrics. Observed results suggest that dief@t and dief@k are able to measure the performance of SPARQL query engines based on both the number of answers produced by an engine and the time required to generate these answers.
This work was supported by the European Union’s H2020 research and innovation action HOBBIT (GA no. 688227), the European Union’s H2020 research and innovation program BigDataEurope (GA no. 644564), the German Ministry BMWI under the SAKE project (Grant No. 01MD15006E), WDAqua: Marie Skłodowska-Curie Innovative Training Network, and Industrial Data Space.
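As a rough illustration of the "area under the curve of answer traces" idea, here is a minimal Python sketch. It assumes an answer trace is simply the sorted list of timestamps at which each answer was produced, and that the area is approximated with the trapezoidal rule over the recorded points only. The function names `dief_at_t` and `dief_at_k` are chosen for this example and are not taken from the authors' implementation.

```python
from typing import List, Sequence, Tuple


def _trapezoid_area(points: List[Tuple[float, int]]) -> float:
    """Area under a piecewise-linear curve given as (time, answers) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def dief_at_t(timestamps: Sequence[float], t: float) -> float:
    """Area under the answer trace, using only answers produced up to time t.

    timestamps[i] is the time at which the (i+1)-th answer arrived (ascending).
    """
    points = [(ts, i + 1) for i, ts in enumerate(timestamps) if ts <= t]
    return _trapezoid_area(points)


def dief_at_k(timestamps: Sequence[float], k: int) -> float:
    """Area under the answer trace, restricted to the first k answers."""
    points = [(ts, i + 1) for i, ts in enumerate(timestamps[:k])]
    return _trapezoid_area(points)


# Two hypothetical engines that both finish 4 answers at t = 4.0:
steady_engine = [1.0, 2.0, 3.0, 4.0]   # answers arrive continuously
bursty_engine = [3.7, 3.8, 3.9, 4.0]   # answers arrive in a late burst
```

Both hypothetical engines have the same total execution time, yet the steady one accumulates a much larger area by t = 4.0, which is the kind of difference that execution time alone does not show.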
Looking forward to seeing you at ISWC 2017.
We are very pleased to announce that our group got 7 papers accepted for presentation at SEMANTiCS 2017, which will be held on 11-14 September in Amsterdam.
SEMANTiCS 2017 is an international event on Linked Data and the Semantic Web where business users, vendors and academia meet. Widely recognized to be of pivotal importance, it is the thirteenth edition of a well-attended yearly conference that started back in 2005. It offers keynotes by world-class practitioners, presentations and field reports in diverse tracks, talks addressing a variety of topics, and panel discussions. And, of course, ample opportunities for networking and meeting like-minded professionals in an informal setting.
Here is the list of accepted papers with their abstracts:
Abstract: Knowledge graphs, usually modelled via RDF or property graphs, have gained importance over the past decade. In order to decide which Data Management Solution (DMS) performs best for specific query loads over a knowledge graph, it is required to perform benchmarks. Benchmarking is an extremely tedious task demanding repetitive manual effort, therefore it is advantageous to automate the whole process. However, there is currently no benchmarking framework which supports benchmarking and comparing diverse DMSs for both RDF and property graph DMS. To this end, we introduce the first working prototype of LITMUS, which provides this functionality as well as fine-grained environment configuration options, a comprehensive set of DMS and CPU-specific key performance indicators, and quick analytical support via custom visualization (i.e. plots) for the benchmarked DMSs.
“IDOL: Comprehensive & Complete LOD Insights”
by C. Baron Neto, D. Kontokostas, G. Publio, D. Esteves, A. Kirschenbaum and S. Hellmann.
Abstract: Over the last decade, we observed a steadily increasing amount of RDF datasets made available on the web of data. The decentralized nature of the web, however, makes it hard to identify all these datasets. Even more so, when downloadable data distributions are discovered, only insufficient metadata is available to describe the datasets properly, thus posing barriers to their usefulness and reuse. In this paper, we describe an attempt to exhaustively identify the whole linked open data cloud by harvesting metadata from multiple sources, providing insights about duplicated data and the general quality of the available metadata. This was only possible by using a probabilistic data structure called Bloom filter. Finally, we enrich existing dataset metadata with our approach and republish them through a SPARQL endpoint.
Abstract: With the omnipresent availability and use of cloud services, software tools, Web portals or services, legal contracts in the form of license agreements or terms and conditions regulating their use are of paramount importance. Often the textual documents describing these regulations comprise many pages and cannot be reasonably assumed to be read and understood by humans. In this work, we describe a method for extracting and clustering relevant parts of such documents, including permissions, obligations, and prohibitions. The clustering is based on semantic similarity employing a distributional semantics approach on a large word embeddings database. An evaluation shows that it can significantly improve human comprehension and that improved feature-based clustering has a potential to further reduce the time required for EULA digestion. Our implementation is available as a web service, which can directly be used to process and prepare legal usage contracts.
Abstract: Research has seen considerable achievements concerning translation of natural language patterns into formal queries for Question Answering (QA) based on Knowledge Graphs (KG). One of the main challenges in this research area is how to identify which property within a Knowledge Graph matches the predicate found in a Natural Language (NL) relation. Current approaches for formal query generation attempt to resolve this problem mainly by first retrieving the named entity from the KG together with a list of its predicates, then filtering out one from all the predicates of the entity. We attempt an approach to directly match an NL predicate to KG properties that can be employed within QA pipelines. In this paper, we specify a systematic approach and provide a tool that can be employed to solve this task. Our approach models KB relations with their underlying parts of speech; we then enhance this with extra attributes obtained from Wordnet and dependency parsing characteristics. From a question, we model a similar representation of query relations. We then define distance measurements between the query relation and the property representations from the KG to identify which property is referred to by the relation within the query. We report substantive recall values and considerable precision from our evaluation.
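For readers unfamiliar with Bloom filters, the following is a minimal, self-contained Python sketch of the data structure. It is not the IDOL implementation: the class name, the default sizes and the choice of SHA-256 with a seed prefix as the hash family are all assumptions made for illustration.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Set-membership sketch: may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # all bits start at 0

    def _positions(self, item: str) -> Iterator[int]:
        # Derive num_hashes bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True only if every bit for this item is set.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


# Hypothetical usage: remember which subject IRIs a dataset contains.
seen = BloomFilter()
seen.add("http://example.org/resource/Vienna")
print("http://example.org/resource/Vienna" in seen)
```

The appeal for large-scale duplicate detection is that memory use is fixed by `num_bits` regardless of how long the stored strings are, at the cost of a tunable false-positive rate.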
“Ontology-guided Job Market Demand Analysis: A Cross-Sectional Study for the Data Science field.”
by Elisa Margareth Sibarani, Simon Scerri, Camilo Morales, Sören Auer and Diego Collarana.
Abstract: The rapid changes in the job market, including a continuous year-on-year increase in new skills in sectors like information technology, have resulted in new challenges for job seekers and educators alike. The former feel less informed about which skills they should acquire to raise their competitiveness, whereas the latter are inadequately prepared to offer courses that meet the expectations of fast-evolving sectors like data science. In this paper, we describe efforts to obtain job demand data and employ an information extraction method guided by a purposely-designed vocabulary to identify skills requested by the job vacancies.
The Ontology-based Information Extraction (OBIE) method employed relies on the Skills and Recruitment Ontology (SARO), which we developed to represent job postings in the context of skills and competencies needed to fill a job role. Skill demand by employers is then abstracted using co-word analysis based on a set of skill keywords and their co-occurrences in the job posts. This method reveals the technical skills in demand together with their structure for revealing significant linkages. In an evaluation, the performance of the OBIE method for automatic skill annotation is estimated (strict F-measure) at 79%, which is satisfactory given that human inter-annotator agreement was found to be automatic keyword indexing with an overall strict F-measure at 94%. In a secondary study, sample skill maps generated from the matrix of co-occurrences and correlation are presented and discussed as proof-of-concept, highlighting the potential of using the extracted OBIE data for more advanced analysis that we plan as future work, including time series analysis.
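The co-word analysis step can be pictured as counting how often pairs of skill keywords appear in the same job post. The Python sketch below shows only that counting step; the function name and the sample posts are invented for illustration and are not data from the study.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Tuple


def cooccurrence_counts(posts: Iterable[List[str]]) -> Counter:
    """Count, for every unordered skill pair, the number of posts mentioning both."""
    counts: Counter = Counter()
    for skills in posts:
        # Deduplicate within a post and sort so each pair has one canonical order.
        for pair in combinations(sorted(set(skills)), 2):
            counts[pair] += 1
    return counts


# Made-up annotated job posts (each list holds the skills extracted from one post).
sample_posts = [
    ["python", "sql"],
    ["python", "sql", "spark"],
    ["spark", "python"],
]

pair_counts: Counter = cooccurrence_counts(sample_posts)
for (skill_a, skill_b), n in pair_counts.most_common():
    print(f"{skill_a} + {skill_b}: {n}")
```

A matrix of such counts is the usual starting point for the skill maps mentioned in the abstract, where frequently co-occurring skills end up close together.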
“SMJoin: A Multi-way Join Operator for SPARQL Queries”
by Mikhail Galkin, Kemele M. Endris, Maribel Acosta, Diego Collarana, Maria-Esther Vidal, Sören Auer.
Abstract: State-of-the-art SPARQL query engines rely on binary join operators tailored for merging results from SPARQL queries over Web access interfaces. However, in queries with a large number of triple patterns, binary joins constitute a significant burden on the query performance. Multi-way joins that handle more than two inputs are able to reduce the complexity of pre-processing stages and reduce the execution time. We devise SMJoin, a multi-way non-blocking join operator tailored for independently merging results from more than two RDF data sources. SMJoin implements intra-operator adaptivity, i.e., it is able to adjust join execution schedulers to the conditions of Web access interfaces; thus, query answers are produced as soon as they are computed and can be continuously generated even if one of the sources becomes blocked. We empirically study the behavior of SMJoin in two benchmarks with queries of different selectivity; state-of-the-art SPARQL query engines are included in the study. Experimental results suggest that SMJoin outperforms existing approaches in very selective queries, and produces first answers as fast as state-of-the-art adaptive query engines in non-selective queries.
Abstract: The increasing availability of large amounts of linked data creates a need for software that allows for its efficient exploration. Systems enabling faceted browsing constitute a user-friendly solution that needs to combine suitable choices for front and back end. Since a generic solution must be adjustable with respect to the dataset, the underlying ontology and the knowledge graph characteristics raise several challenges and heavily influence the browsing experience. As a consequence, an understanding of these challenges becomes an important matter of study. We present a benchmark on faceted browsing, which allows systems to test their performance on specific choke points on the back end. Further, we address additional issues in faceted browsing that may be caused by problematic modelling choices within the underlying ontology.
This work was supported by the European Union’s H2020 research and innovation action HOBBIT (GA no. 688227), the European Union’s H2020 research and innovation program BigDataEurope (GA no. 644564), DAAD Scholarship, LPDP (Indonesia Endowment Fund for Education), EDSA and WDAqua: Marie Skłodowska-Curie Innovative Training Network.
Looking forward to seeing you at SEMANTiCS 2017.
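To give a feel for what a non-blocking multi-way join looks like, here is a simplified Python sketch of a symmetric hash join generalized to several inputs that share one join key. It is not the SMJoin algorithm: it omits the adaptive scheduling described in the abstract, and the class and method names are assumptions made for this example.

```python
from collections import defaultdict
from itertools import product
from typing import Any, Dict, Hashable, List, Tuple


class MultiWayHashJoin:
    """Joins tuples from several sources on one key, emitting results as tuples arrive."""

    def __init__(self, num_sources: int) -> None:
        # One hash table per source: join key -> rows seen so far from that source.
        self.tables: List[Dict[Hashable, List[Any]]] = [
            defaultdict(list) for _ in range(num_sources)
        ]

    def insert(self, source: int, key: Hashable, row: Any) -> List[Tuple[Any, ...]]:
        """Store a new row and return the join results it completes, if any."""
        self.tables[source][key].append(row)
        partners: List[List[Any]] = []
        for i, table in enumerate(self.tables):
            if i == source:
                partners.append([row])  # only the new row, to avoid duplicate results
            else:
                matches = table.get(key)
                if not matches:
                    return []  # some source has no match yet; nothing to emit
                partners.append(matches)
        return list(product(*partners))


# Hypothetical stream of arrivals from three sources joined on variable ?x.
join = MultiWayHashJoin(num_sources=3)
first = join.insert(0, "x1", "a")    # no partners yet
second = join.insert(1, "x1", "b")   # still waiting for source 2
third = join.insert(2, "x1", "c")    # completes one result
fourth = join.insert(0, "x1", "a2")  # a late arrival joins with stored rows
```

Because every arriving tuple is probed against the other tables immediately, results can be emitted without waiting for any single source to finish, which is the non-blocking property the abstract refers to.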
We are happy to announce SANSA 0.2 – the second release of the Scalable Semantic Analytics Stack. SANSA employs distributed computing for semantic technologies in order to allow scalable machine learning, inference and querying capabilities for large knowledge graphs.
- Website: http://sansa-stack.net
- GitHub: https://github.com/SANSA-Stack
- Download: http://sansa-stack.net/downloads-usage/
You can find the FAQ and usage examples at http://sansa-stack.net/faq/.
The following features are currently supported by SANSA:
- Reading and writing RDF files in N-Triples format
- Reading OWL files in various standard formats
- Querying and partitioning based on Sparqlify
- RDFS/RDFS Simple/OWL-Horst forward chaining inference
- RDF graph clustering with different algorithms
- Rule mining from RDF graphs
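To illustrate what forward chaining inference means in the feature list above, here is a small plain-Python sketch that applies two RDFS rules (subclass transitivity, rdfs11, and type propagation along subclass links, rdfs9) until no new triples appear. It does not use the SANSA API and is not distributed; the function name and the `ex:` example triples are assumptions for illustration only.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]

RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def rdfs_forward_chain(triples: Set[Triple]) -> Set[Triple]:
    """Return the input triples plus everything derivable via rdfs9 and rdfs11."""
    inferred = set(triples)
    while True:
        new: Set[Triple] = set()
        for s, p, o in inferred:
            if p != SUBCLASS_OF:
                continue
            for s2, p2, o2 in inferred:
                # rdfs11: (s subClassOf o), (o subClassOf o2) => (s subClassOf o2)
                if p2 == SUBCLASS_OF and s2 == o:
                    new.add((s, SUBCLASS_OF, o2))
                # rdfs9: (s2 type s), (s subClassOf o) => (s2 type o)
                if p2 == RDF_TYPE and o2 == s:
                    new.add((s2, RDF_TYPE, o))
        if new <= inferred:  # fixpoint reached: nothing new was derived
            return inferred
        inferred |= new


example_graph: Set[Triple] = {
    ("ex:Dog", SUBCLASS_OF, "ex:Mammal"),
    ("ex:Mammal", SUBCLASS_OF, "ex:Animal"),
    ("ex:rex", RDF_TYPE, "ex:Dog"),
}
closure = rdfs_forward_chain(example_graph)
```

A distributed engine partitions this rule application across a cluster, but the fixpoint idea of repeating the rules until nothing new is derived stays the same.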
Deployment and getting started:
- There are template projects for SBT and Maven for Apache Spark as well as for Apache Flink available to get started.
- The SANSA jar files are in Maven Central, i.e. in most IDEs you can just search for “sansa” to include the dependencies in Maven projects.
- There is example code for various tasks available.
- We provide interactive notebooks for running and testing code via Docker.
Greetings from the SANSA Development Team
We are very pleased to announce that our group got a paper accepted for presentation at the International Conference on Web Intelligence (WI), which will be held in Leipzig from the 23rd to the 26th of August. The WI is an important international forum for research advances in theories and methods usually associated with Collective Intelligence, Data Science, Human-Centric Computing, Knowledge Management, and Network Science.
Abstract: A choice of the best computational solution for a particular task is increasingly reliant on experimentation. Even though experiments are often described through text, tables, and figures, their descriptions are often incomplete or confusing. Thus, researchers often have to perform lengthy web searches for reproducing and understanding the results. In order to minimize this gap, vocabularies and ontologies have been proposed for representing data mining and machine learning (ML) experiments. However, we still lack proper tools to export these metadata properly. To this end, we present an open-source library dubbed LOG4MEX which aims at supporting the scientific community in filling this gap.
This work is supported by the European Union’s H2020 research and innovation action HOBBIT (GA no. 688227) and the European Union’s H2020 research and innovation program BigDataEurope (GA no. 644564).
We are happy to announce the 0.2 release of SML-Bench, our Structured Machine Learning benchmark framework. SML-Bench provides full benchmarking scenarios for inductive supervised machine learning covering different knowledge representation languages like OWL and Prolog. It already comes with adapters for prominent inductive learning systems like the DL-Learner, the General Inductive Logic Programming System (GILPS), and Aleph, as well as Inductive Logic Programming ‘classics’ like Golem and Progol. The framework is easily extensible, be it in terms of new benchmarking scenarios, or support for new learning systems. SML-Bench allows users to define, run and report on benchmarks combining different scenarios and learning systems, giving insight into the performance characteristics of the respective inductive learning algorithms on a wide range of learning problems.
GitHub page: https://github.com/AKSW/SML-Bench/
Change log: https://github.com/AKSW/SML-Bench/releases/tag/0.2
In the current release we extended the options to configure learning systems in the overall benchmarking configuration, and added support for running multiple instances of a learning system, as well as the nesting of instance-specific settings and settings that apply to all instances of a learning system. Besides internal refactoring to increase the overall software quality, we also extended the reporting capabilities of the benchmark results. We added a new benchmark scenario and experimental support for the Statistical Relational Learning system TreeLiker.
We want to thank everyone who helped to create this release and appreciate any feedback.
Patrick Westphal, Simon Bin, Lorenz Bühmann and Jens Lehmann
The WWW conference is an important international forum for the evolution of the web, technical standards, the impact of the web on society, and its future. Our members actively participated in the 26th International World Wide Web Conference (WWW 2017), which took place on the sunny shores of Perth, Western Australia, on 3-7 April 2017.
We are very pleased to report that:
A paper from our group was accepted for presentation as a full research paper at WWW 2017:
- “Neural Network-based Question Answering over Knowledge Graphs on Word and Character Level”
The Web is developing from a medium for publishing textual documents into a medium for sharing structured data. This trend is fuelled by the adoption of Linked Data principles by a growing number of data providers and the increasing trend to include semantic markup of content of HTML pages. LDOW2017 aims to stimulate discussion and further research into the challenges of publishing, consuming, and integrating structured data from the Web as well as mining knowledge from the global Web of Data.
The audience showed high interest in the workshop.
The discussion that followed covered further challenges on Pioneering the Linked Open Research Cloud and The Future of Linked Data.
WWW2017 was a great venue to meet the community, create new connections, talk about current research challenges, share ideas and settle new collaborations. We look forward to the next WWW conferences.
Anisa presented a talk on quality dimensions and their evolution, and on how the anatomy of data representation and quality assessment in Knowledge Bases (KBs) could lead to the improvement of existing KBs, i.e., by enriching them. The trade-off between the enrichment and the quality of KBs was raised and discussed in detail. Some use cases were mentioned as well, with the main focus on Link Discovery. In particular, enriching KBs will help in better interlinking by reducing noise and the search space.
During the talk, she also introduced ABSTAT, an ontology-driven linked data summarization framework that generates summaries of Linked Data datasets comprising a set of Abstract Knowledge patterns, statistics, and a subtype graph.
Prof. Dr. Jens Lehmann invited the speaker to the bi-weekly “SDA colloquium presentations”, so there was good representation from various students and researchers from our group.
The slides of the talk of our invited speaker Anisa Rula were inspired by “Data Quality Issues in Linked Open Data”, a chapter of the book “Data and Information Quality” by Carlo Batini and Monica Scannapieco.
With this visit, we expect to strengthen our research collaboration networks with the Department of Computer Science, Systems and Communication, University of Milan-Bicocca, mainly on combining quality assessment metrics and distributed frameworks applied to SANSA.
We are very pleased to announce that our group got a paper accepted for presentation at the 26th International World Wide Web Conference (WWW 2017), which will be held on the sunny shores of Perth, Western Australia, on 3-7 April 2017. The WWW is an important international forum for the evolution of the web, technical standards, the impact of the web on society, and its future.
Abstract: Question Answering (QA) systems over Knowledge Graphs (KG) automatically answer natural language questions using facts contained in a knowledge graph. Simple questions, which can be answered by the extraction of a single fact, constitute a large part of questions asked on the web but still pose challenges to QA systems, especially when asked against a large knowledge resource. Existing QA systems usually rely on various components each specialised in solving different sub-tasks of the problem (such as segmentation, entity recognition, disambiguation, and relation classification). In this work, we follow a quite different approach: We train a neural network for answering simple questions in an end-to-end manner, leaving all decisions to the model. It learns to rank subject-predicate pairs to enable the retrieval of relevant facts given a question. The network contains a nested word/character-level question encoder which makes it possible to handle out-of-vocabulary and rare word problems while still being able to exploit word-level semantics. Our approach achieves results competitive with state-of-the-art end-to-end approaches that rely on an attention mechanism.
This work is supported in part by the European Union under the Horizon 2020 Framework Program for the project WDAqua (GA 642795).
Looking forward to seeing you at WWW.
We are very pleased to announce that our group got a paper accepted for presentation at the 17th International Conference on Web Engineering (ICWE 2017), which will be held on 5-8 June 2017 in Rome, Italy. The ICWE is an important international forum for the Web Engineering Community.
“The BigDataEurope Platform – Supporting the Variety Dimension of Big Data” by Sören Auer, Simon Scerri, Aad Versteden, Erika Pauwels, Angelos Charalambidis, Stasinos Konstantopoulos, Jens Lehmann, Hajira Jabeen, Ivan Ermilov, Gezim Sejdiu, Andreas Ikonomopoulos, Spyros Andronopoulos, Mandy Vlachogiannis, Charalambos Pappas, Athanasios Davettas, Iraklis A. Klampanos, Efstathios Grigoropoulos, Vangelis Karkaletsis, Victor de Boer, Ronald Siebes, Mohamed Nadjib Mami, Sergio Albani, Michele Lazzarini, Paulo Nunes, Emanuele Angiuli, Nikiforos Pittaras, George Giannakopoulos, Giorgos Argyriou, George Stamoulis, George Papadakis, Manolis Koubarakis, Pythagoras Karampiperis, Axel-Cyrille Ngonga Ngomo, Maria-Esther Vidal.
Abstract: The management and analysis of large-scale datasets – described with the term Big Data – involves the three classic dimensions volume, velocity and variety. While the former two are well supported by a plethora of software components, the variety dimension is still rather neglected. We present the BDE platform – an easy-to-deploy, easy-to-use and adaptable (cluster-based and standalone) platform for the execution of big data components and tools like Hadoop, Spark, Flink, Flume and Cassandra. The BDE platform was designed based upon the requirements gathered from seven of the societal challenges put forward by the European Commission in the Horizon 2020 programme and targeted by the BigDataEurope pilots. As a result, the BDE platform makes it possible to perform a variety of Big Data flow tasks like message passing, storage, analysis or publishing. To facilitate the processing of heterogeneous data, a particular innovation of the platform is the Semantic Layer, which makes it possible to directly process RDF data and to map and transform arbitrary data into RDF. The advantages of the BDE platform are demonstrated through seven pilots, each focusing on a major societal challenge.
This work is supported by the European Union’s Horizon 2020 research and innovation program under grant agreement no. 644564 – BigDataEurope.
We are very pleased to announce that our paper “AskNow: A Framework for Natural Language Query Formalization in SPARQL” by Mohnish Dubey, Sourish Dasgupta, Ankit Sharma, Konrad Höffner, Jens Lehmann has been selected as Paper of the Month at Fraunhofer IAIS. This award is given to publications that have a high innovation impact in the research field after a committee evaluation.
Abstract: Natural Language Query Formalization involves semantically parsing queries in natural language and translating them into their corresponding formal representations. It is a key component for developing question-answering (QA) systems on RDF data. The chosen formal representation language in this case is often SPARQL. In this paper, we propose a framework, called AskNow, where users can pose queries in English to a target RDF knowledge base (e.g. DBpedia), which are first normalized into an intermediary canonical syntactic form, called Normalized Query Structure (NQS), and then translated into SPARQL queries. NQS facilitates the identification of the desire (or expected output information) and the user-provided input information, and establishing their mutual semantic relationship. At the same time, it is sufficiently adaptive to query paraphrasing. We have empirically evaluated the framework with respect to the syntactic robustness of NQS and semantic accuracy of the SPARQL translator on standard benchmark datasets.
The paper and authors were honored for this publication in a special event at Fraunhofer Schloss Birlinghoven, Sankt Augustin, Germany.