Paper Accepted at WISE 2020

We are very pleased to announce that we got a paper accepted for presentation at WISE 2020 (International Conference on Web Information Systems Engineering). WISE has established itself as a community aiming at high quality research and offering the ground for advancing efforts in topics related to Web information systems. WISE 2020 will be an international forum for researchers, professionals, and industrial practitioners to share their knowledge and insights in the rapidly growing areas of Web technologies for Big Data and Artificial Intelligence (AI), two highly important areas for the world economy.

  • Encoding Knowledge Graph Entity Aliases in Attentive Neural Network for Wikidata Entity Linking
    By Isaiah Onando Mulang, Kuldeep Singh, Akhilesh Vyas, Saeedeh Shekarpour, Maria Esther Vidal, Jens Lehmann, and Sören Auer.
    Abstract The collaborative knowledge graphs such as Wikidata excessively rely on the crowd to author the information. Since the crowd is not bound to a standard protocol for assigning entity titles, the knowledge graph is populated by non-standard, noisy, long or even sometimes awkward titles. The issue of long, implicit, and nonstandard entity representations is a challenge in Entity Linking (EL) approaches for gaining high precision and recall. Underlying KG in general is the source of target entities for EL approaches, however, it often contains other relevant information, such as aliases of entities (e.g., Obama and Barack Hussein Obama are aliases for the entity Barack Obama). EL models usually ignore such readily available entity attributes. In this paper, we examine the role of knowledge graph context on an attentive neural network approach for entity linking on Wikidata. Our approach contributes by exploiting the sufficient context from a KG as a source of background knowledge, which is then fed into the neural network. This approach demonstrates merit to address challenges associated with entity titles (multi-word, long, implicit, case-sensitive). Our experimental study shows approx 8% improvements over the baseline approach, and significantly outperforms an end to end approach for Wikidata entity linking.

Papers Accepted at IDEAL 2020

We are very pleased to announce that we got two papers accepted for presentation at IDEAL 2020 (International Conference on Intelligent Data Engineering and Automated Learning). IDEAL is an annual international conference dedicated to emerging and challenging topics in intelligent data analysis, data mining and their associated learning systems and paradigms. The conference provides a unique opportunity and stimulating forum for presenting and discussing the latest theoretical advances and real-world applications in Computational Intelligence and Intelligent Data Analysis.

Here are the pre-prints of the accepted papers with their abstracts:

  • Meta-Hyperband: Hyperparameter optimization with meta-learning and coarse-to-fine
    By Samin Payrosangari, Afshin Sadeghi, Damien Graux, and Jens Lehmann.
    Abstract Hyperparameter optimization is one of the main pillars of machine learning approaches. In this paper, we introduce Meta-Hyperband: a Hyperband based algorithm that improves the search by adding levels of exploitation. Unlike Hyperband, which is a pure exploration bandit-based approach for hyperparameter optimization, our meta approach generates a trade-off between exploration and exploitation, combining Hyperband with meta-learning and Coarse-to-Fine modules. We analyze the performance of Meta-Hyperband on various datasets to tune the hyperparameters of CNN and SVM. The experiments indicate that in many cases Meta-Hyperband can discover hyperparameter configurations with higher quality than Hyperband, using similar amounts of resources. In particular, we discovered a CNN configuration for classifying the CIFAR10 dataset which has a 3% higher performance than the configuration found by Hyperband, and which is also 0.3% more accurate than the best-reported configuration of the Bayesian optimization approach. Additionally, we release a publicly available pool of historically well-performing configurations on several datasets for CNN and SVM to ease the adoption of Meta-Hyperband.
  • International Data Spaces Information Model – An Ontology for Sovereign Exchange of Digital Content
    By Sebastian Bader, Jaroslav Pullmann, Christian Mader, Sebastian Tramp, Christoph Quix, Andreas Mueller, Haydar Akyürek, Matthias Böckmann, Benedikt Imbusch, Johannes Lipp, Sandra Geisler, and Christoph Lange.
    Abstract The International Data Spaces initiative (IDS) is building an ecosystem to facilitate data exchange in a secure, trusted, and semantically interoperable way. It aims at providing a basis for smart services and cross-company business processes, while at the same time guaranteeing data owners’ sovereignty over their content. The IDS Information Model is an RDFS/OWL ontology defining the fundamental concepts for describing actors in a data space, their interactions, the resources exchanged by them, and data usage restrictions. After introducing the conceptual model and design of the ontology, we explain its implementation on top of standard ontologies as well as the process for its continuous evolution and quality assurance involving a community driven by industry and research organisations. We demonstrate tools that support generation, validation, and usage of instances of the ontology with the focus on data control and protection in a federated ecosystem.
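
The Meta-Hyperband abstract above builds on Hyperband, which in turn runs several rounds of successive halving. As a rough illustration only, the following is a minimal sketch of one successive-halving bracket in plain Python. It is not the Meta-Hyperband algorithm (it has no meta-learning or coarse-to-fine module), and the function names, the toy loss, and the `lr` parameter are assumptions made for this example.

```python
def successive_halving(configs, evaluate, min_budget=1, eta=3):
    """Evaluate all surviving configs, keep the best 1/eta of them,
    multiply the budget by eta, and repeat until one config remains."""
    budget = min_budget
    survivors = list(configs)
    while len(survivors) > 1:
        # Lower loss is better; the sort is stable, so ties keep input order.
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget))
        keep = max(1, len(survivors) // eta)
        survivors = ranked[:keep]
        budget *= eta
    return survivors[0]


def toy_loss(config, budget):
    """Hypothetical stand-in for training a model for `budget` epochs:
    the loss shrinks with more budget and is lowest at lr = 0.1."""
    return (config["lr"] - 0.1) ** 2 + 1.0 / budget


candidates = [{"lr": lr} for lr in (0.001, 0.01, 0.05, 0.1, 0.3, 0.5, 1.0, 0.2, 0.08)]
best = successive_halving(candidates, toy_loss)
print(best)
```

In a real setting `evaluate` would train and validate a model under the given budget, so early rounds are cheap and noisy while later rounds spend more resources on fewer candidates.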

PhD Viva (Gezim Sejdiu): Efficient Distributed In-Memory Processing of RDF Datasets

Last week, on Tuesday, 29 September 2020, I successfully defended my PhD thesis entitled “Efficient Distributed In-Memory Processing of RDF Datasets”. The main objective of this thesis is to lay foundations for efficient algorithms performing analytics, i.e. exploration, quality assessment, and querying over semantic knowledge graphs at a scale that has not been possible before.


Below is the thesis abstract, with references to the main papers on which parts of the work are based (see here for the complete list of publications).


Over the past decade, vast amounts of machine-readable structured information have become available through the automation of research processes as well as the increasing popularity of knowledge graphs and semantic technologies. Today, we count more than 10,000 datasets made available online following Semantic Web standards. A major and yet unsolved challenge that research faces today is to perform scalable analysis of large-scale knowledge graphs in order to facilitate applications in various domains including life sciences, publishing, and the internet of things.
The main objective of this thesis is to lay foundations for efficient algorithms performing analytics, i.e. exploration, quality assessment, and querying over semantic knowledge graphs at a scale that has not been possible before.
First, we propose a novel approach for statistical calculations of large RDF datasets [1], which scales out to clusters of machines.
In particular, we describe the first distributed in-memory approach for computing 32 different statistical criteria for RDF datasets using Apache Spark.
Many applications, such as data integration, search, and interlinking, can take full advantage of the data when they have a priori statistical information about its internal structure and coverage. However, such applications may suffer from low data quality and be unable to exploit the data fully when its size exceeds the capacity of the available resources.
Thus, we introduce a distributed approach of quality assessment of large RDF datasets [2]. It is the first distributed, in-memory approach for computing different quality metrics for large RDF datasets using Apache Spark. We also provide a quality assessment pattern that can be used to generate new scalable metrics that can be applied to big data.
Based on the knowledge of the internal statistics of a dataset and its quality, users typically want to query and retrieve large amounts of information.
As a result, it has become difficult to efficiently process these large RDF datasets.
Indeed, these processes require both efficient storage strategies and query-processing engines to be able to scale in terms of data size.
Therefore, we propose a scalable approach [3, 4] to evaluate SPARQL queries over distributed RDF datasets by translating SPARQL queries into Spark executable code.
We conducted several empirical evaluations to assess the scalability, effectiveness, and efficiency of our proposed approaches.
More importantly, various use cases, namely Ethereum analysis, Mining Big Data Logs, and Scalable Integration of POIs, have been developed on top of our approach.
The empirical evaluations and concrete applications provide evidence that our methodology and techniques proposed during this thesis help to effectively analyze and process large-scale RDF datasets.
All the approaches proposed in this thesis are integrated into the larger SANSA framework [5].
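
To give a feel for what translating a SPARQL query into dataflow operations means, here is a minimal sketch in plain Python. It evaluates a two-pattern query as a filter step per triple pattern followed by a join on the shared variable, which is the general shape such a query takes on an engine like Spark. It is not the Sparklify or SANSA implementation; the helper names `scan` and `join_on_key`, the predicates, and the sample triples are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical query being evaluated:
#   SELECT ?person ?city WHERE { ?person :worksAt ?org . ?org :locatedIn ?city }

triples = [
    ("alice", "worksAt", "acme"),
    ("bob", "worksAt", "acme"),
    ("carol", "worksAt", "globex"),
    ("acme", "locatedIn", "Bonn"),
    ("globex", "locatedIn", "Leipzig"),
    ("alice", "knows", "bob"),
]


def scan(data, predicate):
    """Filter step: keep (subject, object) pairs of triples with this predicate."""
    return [(s, o) for s, p, o in data if p == predicate]


def join_on_key(left, right):
    """Hash join: combine left pairs (a, k) with right pairs (k, b) into (a, b)."""
    index = defaultdict(list)
    for k, b in right:
        index[k].append(b)
    return [(a, b) for a, k in left for b in index.get(k, [])]


# One scan per triple pattern, then a join on the shared variable ?org.
result = join_on_key(scan(triples, "worksAt"), scan(triples, "locatedIn"))
print(result)
```

On a cluster, the two scans would become distributed filters and the hash join a distributed join keyed on the shared variable; the list-based version here only shows the logical plan.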


[1]. Gezim Sejdiu; Ivan Ermilov; Jens Lehmann; and Mohamed Nadjib-Mami, “DistLODStats: Distributed Computation of RDF Dataset Statistics,” in Proceedings of 17th International Semantic Web Conference (ISWC), 2018.
[2]. Gezim Sejdiu; Anisa Rula; Jens Lehmann; and Hajira Jabeen, “A Scalable Framework for Quality Assessment of RDF Datasets,” in Proceedings of 18th International Semantic Web Conference (ISWC), 2019.
[3]. Claus Stadler; Gezim Sejdiu; Damien Graux; and Jens Lehmann, “Sparklify: A Scalable Software Component for Efficient evaluation of SPARQL queries over distributed RDF datasets,” in Proceedings of 18th International Semantic Web Conference (ISWC), 2019.
[4]. Gezim Sejdiu; Damien Graux; Imran Khan; Ioanna Lytra; Hajira Jabeen; and Jens Lehmann, “Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation,” 15th International Conference on Semantic Systems (SEMANTiCS), Research & Innovation, 2019.
[5]. Jens Lehmann; Gezim Sejdiu; Lorenz Bühmann; Patrick Westphal; Claus Stadler; Ivan Ermilov; Simon Bin; Nilesh Chakraborty; Muhammad Saleem; Axel-Cyrille Ngonga Ngomo; and Hajira Jabeen, “Distributed Semantic Analytics using the SANSA Stack,” in Proceedings of 16th International Semantic Web Conference – Resources Track (ISWC’2017), 2017.

Paper Accepted at EMNLP2020

We are very pleased to announce that our paper “Message Passing for Hyper-Relational Knowledge Graphs” was accepted for presentation at EMNLP2020 (Empirical Methods in Natural Language Processing).
EMNLP is a leading conference in the area of Natural Language Processing. EMNLP invites the submission of long and short papers on substantial, original, and unpublished research in empirical methods for Natural Language Processing.

Here is the pre-print of the accepted paper with its abstract:

  • Message Passing for Hyper-Relational Knowledge Graphs
    By Mikhail Galkin, Priyansh Trivedi, Gaurav Maheshwari, Ricardo Usbeck, and Jens Lehmann.
    Abstract Hyper-relational knowledge graphs (KGs) (e.g., Wikidata) enable associating additional key-value pairs along with the main triple to disambiguate, or restrict the validity of a fact. In this work, we propose a message passing based graph encoder – StarE capable of modeling such hyper-relational KGs. Unlike existing approaches, StarE can encode an arbitrary number of additional information (qualifiers) along with the main triple while keeping the semantic roles of qualifiers and triples intact. We also demonstrate that existing benchmarks for evaluating link prediction (LP) performance on hyper-relational KGs suffer from fundamental flaws and thus develop a new Wikidata-based dataset – WD50K. Our experiments demonstrate that StarE based LP model outperforms existing approaches across multiple benchmarks. We also confirm that leveraging qualifiers is vital for link prediction with gains up to 25 MRR points compared to triple-based representations.
    Further Resources: Blogpost on Medium; GitHub Repository; Report on Weights & Biases
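
As a small illustration of the data model described in the abstract, the sketch below represents a hyper-relational fact as a main triple plus a set of qualifier key-value pairs. This only shows the shape of the data, not the StarE encoder; the class name, field names, and the sample facts are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class HyperRelationalFact:
    """A main (subject, relation, object) triple with optional qualifiers."""
    subject: str
    relation: str
    obj: str
    qualifiers: Tuple[Tuple[str, str], ...] = ()

    def as_triple(self):
        """Drop the qualifiers and return only the main triple."""
        return (self.subject, self.relation, self.obj)


# Illustrative facts: the qualifiers carry the detail that tells them apart.
bachelor = HyperRelationalFact(
    "Albert Einstein", "educated at", "ETH Zurich",
    qualifiers=(("academic degree", "Bachelor"), ("academic major", "Mathematics")),
)
doctorate = HyperRelationalFact(
    "Albert Einstein", "educated at", "University of Zurich",
    qualifiers=(("academic degree", "Doctorate"),),
)

print(bachelor.as_triple(), dict(bachelor.qualifiers))
print(doctorate.as_triple(), dict(doctorate.qualifiers))
```

A triple-only representation keeps just the output of `as_triple()`, which is the information loss that qualifier-aware models aim to avoid.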

Papers Accepted at ISWC 2020

We are very pleased to announce that our group got four papers accepted for presentation at ISWC2020 (International Semantic Web Conference). ISWC is the premier international forum for the Semantic Web / Linked Data Community, and will bring together researchers, practitioners and industry specialists to discuss, advance, and shape the future of semantic technologies.

Here are the pre-prints of the accepted papers with their abstracts:

  • Fantastic Knowledge Graph Embeddings and How to Find the Right Space for Them
    By Mojtaba Nayyeri, Chengjin Xu, Sahar Vahdati, Nadezhda Vassilyeva, Emanuel Sallinger, Hamed Shariat Yazdi, and Jens Lehmann.
    Abstract During the last few years, several knowledge graph embedding models have been devised in order to handle machine learning problems for knowledge graphs. Some of the models which are proven to be capable of inferring relational patterns, such as symmetry or transitivity, show lower performance in practice than expected by looking at their theoretical power. It is often unknown what factors contribute to such performance differences among KGE models in the inference of particular patterns. We develop the concept of a solution space as a factor that has a direct influence on the practical performance of knowledge graph embedding models as well as their capability to infer relational patterns. We showcase the effect of solution space on a newly proposed model dubbed SpacE^ss. We prove the theoretical characteristics of this method and evaluate it in practice against state-of-the-art models on a set of standard benchmarks such as WordNet and FreeBase.

  • Temporal Knowledge Graph Embedding Model based on Additive Time Series Decomposition
    By Chengjin Xu, Mojtaba Nayyeri, Fouad Alkhoury, Hamed Shariat Yazdi, and Jens Lehmann.
    Abstract Knowledge Graph (KG) embedding has attracted more attention in recent years. Most KG embedding models learn from time-unaware triples. However, the inclusion of temporal information besides triples would further improve the performance of a KGE model. In this regard, we propose ATiSE, a temporal KG embedding model which incorporates time information into entity/relation representations by using additive time series decomposition. Moreover, considering the temporal uncertainty during the evolution of entity/relation representations over time, we map the representations of temporal KGs into the space of multi-dimensional Gaussian distributions. The mean of each entity/relation embedding at a time step shows the current expected position, whereas its covariance (which is temporally stationary) represents its temporal uncertainty. Experimental results show that ATiSE remarkably outperforms the state-of-the-art KGE models and the existing temporal KGE models on link prediction over four temporal KGs.

  • PNEL: Pointer Network based End-To-End Entity Linking over Knowledge Graphs
    By Debayan Banerjee, Debanjan Chaudhuri, Mohnish Dubey, and Jens Lehmann.
    Abstract Question Answering systems are generally modelled as a pipeline consisting of a sequence of steps. In such a pipeline, Entity Linking (EL) is often the first step. Several EL models first perform span detection and then entity disambiguation. In such models errors from the span detection phase cascade to later steps and result in a drop of overall accuracy. Moreover, lack of gold entity spans in training data is a limiting factor for span detector training. Hence the movement towards end-to-end EL models began where no separate span detection step is involved. In this work we present a novel approach to end-to-end EL by applying the popular Pointer Network model. It achieves competitive performance while maintaining low response times. We demonstrate this in our evaluation over three datasets on the Wikidata Knowledge Graph.

  • CASQAD: A New Dataset For Context-aware Spatial Question Answering
    By Jewgeni Rose and Jens Lehmann.
    Abstract The task of factoid question answering (QA) faces new challenges when applied in scenarios with rapidly changing context information, for example on smartphones. Instead of asking who the architect of the “Holocaust Memorial” in Berlin was, the same question could be phrased as “Who was the architect of the many stelae in front of me?” presuming the user is standing in front of it. While traditional QA systems rely on static information from knowledge bases and the analysis of named entities and predicates in the input, question answering for temporal and spatial questions imposes new challenges to the underlying methods. To tackle these challenges, we present the Context-aware Spatial QA Dataset (CASQAD) with over 5,000 annotated questions containing visual and spatial references that require information about the user’s location and moving direction to compose a suitable query. These questions were collected in a large scale user study and annotated semi-automatically, with appropriate measures to ensure the quality.

Papers Accepted at CIKM 2020

We are very pleased to announce that our group got two papers accepted for presentation at CIKM 2020 (Conference on Information and Knowledge Management). CIKM seeks to identify challenging problems facing the development of future knowledge and information systems, and to shape future directions of research by soliciting and reviewing high quality, applied and theoretical research findings. An important part of the conference is the Workshops and Tutorials program, which focuses on timely research challenges and initiatives and brings together research papers, industry speakers, and keynote speakers. The program also showcases posters, demonstrations, competitions, and other special events.

  • Evaluating the Impact of Knowledge Graph Context on Entity Disambiguation Models
    By Isaiah Onando Mulang, Kuldeep Singh, Chaitali Prabhu, Abhishek Nadgeri, Johannes Hoffart, and Jens Lehmann.
    Abstract Pretrained Transformer models have emerged as state-of-the-art approaches that learn contextual information from the text to improve the performance of several NLP tasks. These models, albeit powerful, still require specialized knowledge in specific scenarios. In this paper, we argue that context derived from a knowledge graph (in our case: Wikidata) provides enough signals to inform pretrained transformer models and improve their performance for named entity disambiguation (NED) on Wikidata KG. We further hypothesize that our proposed KG context can be standardized for Wikipedia, and we evaluate the impact of KG context on the state of the art NED model for the Wikipedia knowledge base. Our empirical results validate that the proposed KG context can be generalized (for Wikipedia), and providing KG context in transformer architectures considerably outperforms the existing baselines, including the vanilla transformer models.
  • MLM: A Benchmark Dataset for Multitask Learning with Multiple Languages and Modalities
    By Jason Armitage, Endri Kacupaj, Golsa Tahmasebzadeh, Swati, Maria Maleshkova, Ralph Ewerth, and Jens Lehmann.
    Abstract In this paper, we introduce the MLM (Multiple Languages and Modalities) dataset – a new resource to train and evaluate multitask systems on samples in multiple modalities and three languages. The generation process and inclusion of semantic data provide a resource that further tests the ability for multitask systems to learn relationships between entities. The dataset is designed for researchers and developers who build applications that perform multiple tasks on data encountered on the web and in digital archives. A second version of MLM provides a geo-representative subset of the data with weighted samples for countries of the European Union. We demonstrate the value of the resource in developing novel applications in the digital humanities with a motivating use case and specify a benchmark set of tasks to retrieve modalities and locate entities in the dataset. Evaluation of baseline multitask and single task systems on the full and geo-representative versions of MLM demonstrate the challenges of generalising on diverse data. In addition to the digital humanities, we expect the resource to contribute to research in multimodal representation learning, location estimation, and scene understanding.

Paper Published in the Journal of Web Semantics

We are very pleased to announce that our paper “No One is Perfect: Analysing the Performance of Question Answering Components over the DBpedia Knowledge Graph” has been published in the Journal of Web Semantics. The Journal of Web Semantics is an interdisciplinary journal based on research and applications of various subject areas that contribute to the development of a knowledge-intensive and intelligent service Web.

Here is the pre-print of the published paper with its abstract:

No One is Perfect: Analysing the Performance of Question Answering Components over the DBpedia Knowledge Graph
By Kuldeep Singh, Ioanna Lytra, Arun Sethupat Radhakrishna, Saeedeh Shekarpour, Maria Esther Vidal, and Jens Lehmann.
Abstract Question answering (QA) over knowledge graphs has gained significant momentum over the past five years due to the increasing availability of large knowledge graphs and the rising importance of Question Answering for user interaction. Existing QA systems have been extensively evaluated as black boxes and their performance has been characterised in terms of average results over all the questions of benchmarking datasets (i.e. macro evaluation). Albeit informative, macro evaluation studies do not provide evidence about QA components’ strengths and concrete weaknesses. Therefore, the objective of this article is to analyse and micro evaluate available QA components in order to comprehend which question characteristics impact on their performance. For this, we measure at question level and with respect to different question features the accuracy of 29 components reused in QA frameworks for the DBpedia knowledge graph using state-of-the-art benchmarks. As a result, we provide a perspective on collective failure cases, study the similarities and synergies among QA components for different component types and suggest their characteristics preventing them from effectively solving the corresponding QA tasks. Finally, based on these extensive results, we present conclusive insights for future challenges and research directions in the field of Question Answering over knowledge graphs.

Papers Accepted at DEXA 2020

We are very pleased to announce that our group got two papers accepted for presentation at DEXA2020 (International Conference on Database and Expert Systems Applications). Since 1990, DEXA has been an annual international conference which showcases state-of-the-art research activities in database, information, and knowledge systems. DEXA provides a forum to present research results and to examine advanced applications in the field. The conference and its associated workshops offer an opportunity for developers, scientists, and users to extensively discuss requirements, problems, and solutions in database, information, and knowledge systems. 

Here are the pre-prints of the accepted papers with their abstracts:

  • Unveiling Relations in the Industry 4.0 Standards Landscape based on Knowledge Graph Embeddings
    By Ariam Rivas, Irlán Grangel-González, Diego Collarana, Jens Lehmann, and Maria-Esther Vidal.
    Abstract Industry 4.0 (I4.0) standards and standardization frameworks have been proposed with the goal of empowering interoperability in smart factories. These standards enable the description and interaction of the main components, systems, and processes inside of a smart factory. Due to the growing number of frameworks and standards, there is an increasing need for approaches that automatically analyze the landscape of I4.0 standards. Standardization frameworks classify standards according to their functions into layers and dimensions. However, similar standards can be classified differently across the frameworks, producing, thus, interoperability conflicts among them. Semantic-based approaches that rely on ontologies and knowledge graphs, have been proposed to represent standards, known relations among them, as well as their classification according to existing frameworks. Albeit informative, the structured modeling of the I4.0 landscape only provides the foundations for detecting interoperability issues. Thus, graph-based analytical methods able to exploit knowledge encoded by these approaches, are required to uncover alignments among standards. We study the relatedness among standards and frameworks based on community analysis to discover knowledge that helps to cope with interoperability conflicts between standards. We use knowledge graph embeddings to automatically create these communities exploiting the meaning of the existing relationships. In particular, we focus on the identification of similar standards, i.e., communities of standards, and analyze their properties to detect unknown relations. We empirically evaluate our approach on a knowledge graph of I4.0 standards using the Trans* family of embedding models for knowledge graph entities. Our results are promising and suggest that relations among standards can be detected accurately.

  • SCODIS: Job Advert-derived Time Series for high-demand Skillset Discovery and Prediction
    By Elisa Margareth Sibarani and Simon Scerri.
    Abstract In this paper, we consider a dataset compiled from online job adverts for consecutive fixed periods, to identify whether repeated and automated observation of skills requested in the job market can be used to predict the relevance of skillsets and the predominance of skills in the near future. The data, consisting of co-occurring skills observed in job adverts, is used to generate a skills graph whose nodes are skills and whose edges denote the co-occurrence appearance. To better observe and interpret the evolution of this graph over a period of time, we investigate two clustering methods that can reduce the complexity of the graph. The best performing method, evaluated according to its modularity value (0.72 for the best method followed by 0.41), is then used as a basis for the SCODIS framework, which enables the discovery of in-demand skillsets based on the observation of skills clusters in a time series. The framework is used to conduct a time series forecasting experiment, resulting in the F-measures observed at 72%, which confirms that to an extent, and with enough previous observations, it is indeed possible to identify which skillsets will dominate demand for a specific sector in the short-term.
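
The skills graph described in the SCODIS abstract, with skills as nodes and co-occurrence as edges, can be illustrated with a short sketch. This is a minimal example under stated assumptions, not the SCODIS pipeline: the function name, the edge weighting by raw advert counts, and the sample adverts are made up for illustration, and no clustering or forecasting step is shown.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(adverts):
    """Count, for each unordered pair of skills, how many adverts list both."""
    edges = Counter()
    for skills in adverts:
        # Deduplicate and sort so each pair has one canonical (a, b) form.
        for a, b in combinations(sorted(set(skills)), 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical adverts, each reduced to the list of skills it requests.
adverts = [
    ["python", "sql", "spark"],
    ["python", "sql"],
    ["java", "sql"],
]

edges = cooccurrence_edges(adverts)
for (a, b), weight in sorted(edges.items()):
    print(a, b, weight)
```

Building one such weighted edge list per observation period yields the sequence of graphs on which clustering and time series analysis can then operate.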

Book “Knowledge Graphs and Big Data Processing” Published as Open Access

One of the core missions of the LAMBDA EU Project is to produce learning material about Big Data Analytics. We are happy to announce that the book “Knowledge Graphs and Big Data Processing” is published as open access. This was a titanic effort from SDA and Fraunhofer IAIS colleagues.
The book can be downloaded from here. We contributed to the chapters Big Data Outlook, Tools, and Architectures (chapter 3), Scalable Knowledge Graph Processing Using SANSA (chapter 7) and Context-Based Entity Matching for Big Data (chapter 8).

PyKEEN 1.0 Release

As a member of the PyKEEN community project, we are happy to announce PyKEEN 1.0 – PyKEEN is a software package to train and evaluate knowledge graph embedding models.

The following features are currently supported by PyKEEN:

  • 23 interaction models (ComplExLiteral, ComplEx, ConvE, ConvKB, DistMult, DistMultLiteral, ERMLP, ERMLPE, HolE, KG2E, NTN, ProjE, RESCAL, RGCN, RotatE, SimplE, StructuredEmbedding, TransD, TransE, TransH, TransR, TuckER, and UnstructuredModel)
  • 7 loss functions (Binary Cross Entropy, Cross Entropy, Margin Ranking Loss, Mean Square Error, Self-Adversarial Negative Sampling Loss, and Softplus Loss)
  • 3 regularizers (LP-norm based regularizer, Power Sum regularizer, and Combined regularizer, i.e., convex combination of regularizers)
  • 2 training approaches (LCWA and sLCWA)
  • 2 negative samplers (Uniform and Bernoulli)
  • Hyper-parameter optimization (using Optuna)
  • Early stopping
  • 6 evaluation metrics (adjusted mean rank, mean rank, mean reciprocal rank, hits@k, average-precision score, and ROC-AUC score)
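
For readers unfamiliar with the rank-based metrics in the list above, the following sketch computes mean rank, mean reciprocal rank, and hits@k from a list of ranks in plain Python. It is an illustration of the standard definitions only, not PyKEEN's implementation, and the function names and the sample ranks are assumptions made for this example.

```python
def mean_rank(ranks):
    """Average rank of the true entity; lower is better."""
    return sum(ranks) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank; lies in (0, 1], higher is better."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks, k):
    """Fraction of test triples whose true entity is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical 1-based ranks of the correct entity for four test triples.
ranks = [1, 2, 4, 10]

print(mean_rank(ranks))
print(mean_reciprocal_rank(ranks))
print(hits_at_k(ranks, 3))
```

Mean rank is sensitive to a few very badly ranked triples, whereas mean reciprocal rank and hits@k emphasise how often the correct entity lands near the top.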

PyKEEN was used to extensively test existing KGE models on a wide range of configurations. You can find those results in our paper. We want to thank everyone who helped to create this release. For more updates, please view our Twitter feed and consider following us.

Greetings from the PyKEEN-Team