Paper Accepted at IDA 2021

We are very pleased to announce that our group got a paper accepted for presentation at IDA 2021. Advancing Intelligent Data Analysis requires novel, potentially game-changing ideas. IDA’s mission is to promote ideas over performance: a solid motivation can be as convincing as exhaustive empirical evaluation.

Here is the abstract and the link to the paper:

HORUS-NER: A Multimodal Named Entity Recognition Framework for Noisy Data
By Diego Esteves, José Marcelino, Piyush Chawla, Asja Fischer, and Jens Lehmann.
Abstract Recent work based on Deep Learning presents state-of-the-art (SOTA) performance in the named entity recognition (NER) task. However, the performance of such models is still drastically reduced on noisy data (e.g., social media, search engines) when compared to the formal domain (e.g., newswire). Thus, designing and exploring new methods and architectures is highly necessary to overcome current challenges. In this paper, we shift the focus of existing solutions to an entirely different perspective. We investigate the potential of embedding word-level features extracted from images and news. We performed a very comprehensive study in order to validate the hypothesis that images and news (obtained from an external source) may boost the task on noisy data, revealing very interesting findings. When our proposed architecture is used: (1) we beat SOTA in precision with simple CRF models; (2) the overall performance of decision tree-based models can be drastically improved; (3) our approach outperforms off-the-shelf models for this task; (4) images and text consistently increased recall over different datasets for SOTA, but at the cost of precision. All experiment configurations, data and models are publicly available to the research community at

Paper Accepted at IJCNN 2021

We are very pleased to announce that our group got a paper accepted for presentation at IJCNN 2021. The annual International Joint Conference on Neural Networks (IJCNN) is the flagship conference of the IEEE Computational Intelligence Society and the International Neural Network Society. It covers a wide range of topics in the field of neural networks, from biological neural network modeling to artificial neural computation.

Here is the abstract and the link to the paper:

Multiple Run Ensemble Learning with Low-Dimensional Knowledge Graph Embeddings
By Chengjin Xu, Mojtaba Nayyeri, Sahar Vahdati, and Jens Lehmann.
Abstract Knowledge graphs (KGs) represent world facts in a structured form. Although knowledge graphs are quantitatively huge and consist of millions of triples, the coverage is still only a small fraction of world’s knowledge. Among the top approaches of recent years, link prediction using knowledge graph embedding (KGE) models has gained significant attention for knowledge graph completion. Various embedding models have been proposed so far, among which, some recent KGE models obtain state-of-the-art performance on link prediction tasks by using embeddings with a high dimension (e.g. 1000) which accelerate the costs of training and evaluation considering the large scale of KGs. In this paper, we propose a simple but effective performance boosting strategy for KGE models by using multiple low dimensions in different repetition rounds of the same model. For example, instead of training a model one time with a large embedding size of 1200, we repeat the training of the model 6 times in parallel with an embedding size of 200 and then combine the 6 separate models for testing while the overall numbers of adjustable parameters are same (6*200=1200) and the total memory footprint remains the same. We show that our approach enables different models to better cope with their expressiveness issues on modeling various graph patterns such as symmetric, 1-n, n-1 and n-n. In order to justify our findings, we conduct experiments on various KGE models. Experimental results on standard benchmark datasets, namely FB15K, FB15K-237 and WN18RR, show that multiple low-dimensional models of the same kind outperform the corresponding single high-dimensional models on link prediction in a certain range and have advantages in training efficiency by using parallel training while the overall numbers of adjustable parameters are same.
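
To make the idea concrete, here is a minimal sketch of scoring a triple with six low-dimensional runs instead of one high-dimensional model. It is an illustration under assumptions, not the code from the paper: the randomly initialised `make_model` stands in for a trained run, TransE with an L1 distance is used as the example model, and summing the per-run scores is only one possible way to combine the runs.

```python
import random

def make_model(num_entities, num_relations, dim, seed):
    """Randomly initialised stand-in for one trained run (hypothetical helper)."""
    rng = random.Random(seed)
    ent = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_entities)]
    rel = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_relations)]
    return ent, rel

def transe_score(model, h, r, t):
    """TransE plausibility: negative L1 distance of h + r from t (higher is better)."""
    ent, rel = model
    return -sum(abs(a + b - c) for a, b, c in zip(ent[h], rel[r], ent[t]))

def ensemble_score(models, h, r, t):
    """Combine the runs by summing their scores (assumed combination rule)."""
    return sum(transe_score(m, h, r, t) for m in models)

def rank_tails(models, h, r, num_entities):
    """Rank every candidate tail for (h, r, ?) by ensemble score, best first."""
    return sorted(range(num_entities),
                  key=lambda t: ensemble_score(models, h, r, t),
                  reverse=True)

def params_per_entity(models):
    """Number of adjustable parameters stored per entity across all runs."""
    return sum(len(ent[0]) for ent, _ in models)

# Six runs with embedding size 200 versus one run with embedding size 1200.
low_dim_runs = [make_model(5, 2, 200, seed=k) for k in range(6)]
high_dim_run = [make_model(5, 2, 1200, seed=99)]

print(params_per_entity(low_dim_runs), params_per_entity(high_dim_run))
print(rank_tails(low_dim_runs, h=0, r=1, num_entities=5))
```

Both configurations store the same number of parameters per entity (6*200=1200), but the six runs can be trained independently and in parallel.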

Paper Accepted at ICSC 2021

We are very pleased to announce that our group got a paper accepted for presentation at IEEE-ICSC 2021. The 15th IEEE International Conference on Semantic Computing (ICSC2021) addresses the derivation, description, integration, and use of semantics (“meaning”, “context”, “intention”) for all types of resource including data, document, tool, device, process and people. The scope of ICSC2021 includes, but is not limited to, analytics, semantics description languages and integration (of data and services), interfaces, and applications.

Here is the abstract and the link to the paper (we also provide a preprint):

Scalable Distributed in-Memory Semantic Similarity Estimation for RDF Knowledge Graphs with DistSim
By Carsten Draschner, Jens Lehmann, and Hajira Jabeen.
Abstract In this paper, we present DistSim, a Scalable Distributed in-Memory Semantic Similarity Estimation framework for Knowledge Graphs. DistSim provides a multitude of state-of-the-art similarity estimators. We have developed the Similarity Estimation Pipeline by combining generic software modules. For large scale RDF data, DistSim proposes MinHash with locality sensitivity hashing to achieve better scalability over all-pair similarity estimations. The modules of DistSim can be set up using a multitude of (hyper)-parameters allowing to adjust the tradeoff between information taken into account, and processing time. Furthermore, the output of the Similarity Estimation Pipeline is native RDF. DistSim is integrated into the SANSA stack, documented in scala-docs, and covered by unit tests. Additionally, the variables and provided methods follow the Apache Spark MLlib name-space conventions. The performance of DistSim was tested over a distributed cluster, for the dimensions of data set size and processing power versus processing time, which shows the scalability of DistSim w.r.t. increasing data set sizes and processing power. DistSim is already in use for solving several RDF data analytics related use cases. Additionally, DistSim is available and integrated into the open-source GitHub project SANSA.
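
The MinHash and locality-sensitive hashing step mentioned in the abstract can be sketched in a few lines of plain Python. This is not the DistSim API (DistSim is implemented in Scala on Apache Spark); the function names, the choice of `blake2b` as hash family, and the example feature strings are assumptions made for illustration only.

```python
import hashlib

def _hash(seed, item):
    """One member of a seeded hash family, built on blake2b (illustrative choice)."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash_signature(features, num_hashes=128):
    """For each hash function, keep the minimum hash over the (non-empty) feature set."""
    return [min(_hash(seed, f) for f in features) for seed in range(num_hashes)]

def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates the Jaccard similarity."""
    return sum(1 for a, b in zip(sig_a, sig_b) if a == b) / len(sig_a)

def lsh_candidates(signatures, bands=32):
    """Split signatures into bands; entities sharing any band bucket become candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = {}
    for key, sig in signatures.items():
        for b in range(bands):
            bucket = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(bucket, set()).add(key)
    pairs = set()
    for members in buckets.values():
        ordered = sorted(members)
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                pairs.add((ordered[i], ordered[j]))
    return pairs

# Hypothetical entities described by predicate/object feature strings.
entities = {
    "film_a": {"type=Film", "director=X", "year=2010", "genre=SciFi"},
    "film_b": {"type=Film", "director=X", "year=2010", "genre=SciFi"},
    "city_c": {"type=City", "country=Y", "population=large"},
}
signatures = {name: minhash_signature(feats) for name, feats in entities.items()}
print(estimate_jaccard(signatures["film_a"], signatures["film_b"]))
print(lsh_candidates(signatures))
```

Only the candidate pairs returned by the banding step need an exact comparison, which is what avoids the all-pair similarity computation on large data.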

Paper on Knowledge Graph Integration into Transformer Architectures Accepted at ACL21

We are happy to announce that we got a paper accepted for presentation at ACL 2021 (Association for Computational Linguistics). ACL is a premier Natural Language Processing conference. In the paper, we investigate the efficient integration of knowledge graphs into Transformer-based decoder architectures. The approach allows knowledge graphs to be integrated into large-scale language models like GPT-2 or GPT-3, which leads to more comprehensive and interesting dialogues with such models.

Here is the pre-print of the accepted paper with its abstract:

Space Efficient Context Encoding for Non-Task-Oriented Dialogue Generation with Graph Attention Transformer
By Fabian Galetzka, Jewgeni Rose, David Schlangen, and Jens Lehmann.
Abstract To improve the coherence and knowledge retrieval capabilities of non task-oriented dialogue systems, recent Transformer-based models aim to integrate fixed background context. This often comes in the form of knowledge graphs, and the integration is done by creating pseudo utterances through paraphrasing knowledge triples, added into the accumulated dialogue context. However, the context length is fixed in these architectures, which restricts how much background or dialogue context can be kept. In this work, we propose a more concise encoding for background context structured in form of knowledge graphs, by expressing the graph connections through restrictions on the attention weights. The results of our human evaluation show, that this encoding reduces space requirements without negative effects on the precision of reproduction of knowledge and perceived consistency. Further, models trained with our proposed context encoding generate dialogues that are judged to be more comprehensive and interesting.
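
As a rough illustration of expressing graph connections through restrictions on the attention weights, the sketch below gives every concept of a small knowledge graph a single position and lets it attend only to itself and to its direct neighbours. The exact encoding used in the paper may differ; the helper names and the example triples are our own assumptions.

```python
import math

def triples_to_graph(triples):
    """One node per distinct subject, predicate and object; edges follow each triple.
    Simplification: a predicate reused in several triples becomes a single shared node."""
    nodes, edges = [], []
    for s, p, o in triples:
        for n in (s, p, o):
            if n not in nodes:
                nodes.append(n)
        edges.append((s, p))
        edges.append((p, o))
    return nodes, edges

def graph_attention_mask(nodes, edges):
    """mask[i][j] is True when position i may attend to position j."""
    index = {n: i for i, n in enumerate(nodes)}
    mask = [[i == j for j in range(len(nodes))] for i in range(len(nodes))]
    for a, b in edges:
        i, j = index[a], index[b]
        mask[i][j] = mask[j][i] = True
    return mask

def masked_softmax(scores, allowed):
    """Softmax over the allowed positions only; masked positions get weight 0."""
    top = max(s for s, ok in zip(scores, allowed) if ok)
    exps = [math.exp(s - top) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

triples = [
    ("Inception", "directed_by", "Christopher Nolan"),
    ("Inception", "released_in", "2010"),
]
nodes, edges = triples_to_graph(triples)
mask = graph_attention_mask(nodes, edges)
row = nodes.index("Christopher Nolan")
print(nodes)
print(masked_softmax([0.0] * len(nodes), mask[row]))
```

Here the two triples occupy five positions, with "Inception" stored once, whereas writing each triple out as a separate pseudo utterance would repeat it.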

Paper Accepted at EACL21

We are happy to announce that we got a paper accepted for presentation at EACL21. The European Chapter of the ACL (EACL) is the primary professional association for computational linguistics in Europe.

Here is the pre-print of the accepted paper with its abstract:

Conversational Question Answering over Knowledge Graphs with Transformer and Graph Attention Networks
By Endri Kacupaj, Joan Plepi, Kuldeep Singh, Harsh Thakkar, Jens Lehmann, and Maria Maleshkova.
Abstract This paper addresses the task of (complex) conversational question answering over a knowledge graph. For this task, we propose LASAGNE (muLti-task semAntic parSing with trAnsformer and Graph atteNtion nEtworks). It is the first approach, which employs a transformer architecture extended with Graph Attention Networks for multi-task neural semantic parsing. LASAGNE uses a transformer model for generating the base logical forms, while the Graph Attention model is used to exploit correlations between (entity) types and predicates to produce node representations. LASAGNE also includes a novel entity recognition module which detects, links, and ranks all relevant entities in the question context. We evaluate LASAGNE on a standard dataset for complex sequential question answering, on which it outperforms existing baseline averages on all question types. Specifically, we show that LASAGNE improves the F1-score on eight out of ten question types; in some cases, the increase in F1-score is more than 20% compared to the state of the art.

Paper Accepted at PAKDD 2021

We are happy to announce that we got a paper accepted for presentation at PAKDD 2021 (Pacific-Asia Conference on Knowledge Discovery and Data Mining). PAKDD is one of the longest established and leading international conferences in the areas of data mining and knowledge discovery. It provides an international forum for researchers and industry practitioners to share their new ideas, original research results, and practical development experiences from all KDD related areas, including data mining, data warehousing, machine learning, artificial intelligence, databases, statistics, knowledge engineering, visualization, decision-making systems, and the emerging applications.

Here is the pre-print of the accepted paper with its abstract:

Loss-aware Pattern Inference: A Correction on the Wrongly Claimed Limitations of Embedding Models
By Mojtaba Nayyeri, Chengjin Xu, Yadollah Yaghoobzadeh, Sahar Vahdati, Mirza Mohtashim Alam, Hamed Shariat Yazdi, and Jens Lehmann.
Abstract Embedding knowledge graphs (KGs) into a low dimensional space has become an active research domain which is broadly utilized in many of the AI-based tasks, especially link prediction. One of the crucial aspects is the extent to which a KG embedding model (KGE) is capable to model and infer various relation patterns, such as symmetry/antisymmetry, inversion, and composition. Each embedding model is highly affected in the optimization of embedding vectors by their loss function which consequently affects the inference of relational patterns. However, most existing methods failed to consider this aspect in their inference capability. In this paper, we show that disregarding loss functions results in inaccurate or even wrong interpretation from the capability of the models. We provide deep theoretical investigations of the already existing KGE models on the example of the TransE model. To the best of our knowledge, so far, this has not been comprehensively investigated. We show that by a proper selection of the loss function for training a KGE e.g., TransE, the main inference limitations are mitigated. The provided theories together with the experimental results confirm the importance of loss functions for training KGE models and their performance.
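
The following sketch illustrates why the loss function matters, using TransE with an L1 distance. The two losses are common textbook forms chosen for illustration and are not necessarily the exact formulations analysed in the paper; the parameter names (`gamma`, `gamma_pos`, `gamma_neg`, `weight`) and the toy vectors are assumptions.

```python
def transe_distance(h, r, t):
    """L1 distance of h + r from t; TransE wants this to be small for true triples."""
    return sum(abs(a + b - c) for a, b, c in zip(h, r, t))

def margin_ranking_loss(d_pos, d_neg, gamma=1.0):
    """Only asks the negative triple to be at least gamma farther away than the positive."""
    return max(0.0, gamma + d_pos - d_neg)

def limit_based_loss(d_pos, d_neg, gamma_pos=0.3, gamma_neg=1.5, weight=1.0):
    """Additionally bounds the positive distance itself by gamma_pos."""
    return max(0.0, d_pos - gamma_pos) + weight * max(0.0, gamma_neg - d_neg)

h, r = [0.0, 0.0], [0.5, 0.0]
t_true, t_corrupt = [0.5, 0.5], [-0.5, 1.0]

d_pos = transe_distance(h, r, t_true)      # 0.5
d_neg = transe_distance(h, r, t_corrupt)   # 2.0

print(margin_ranking_loss(d_pos, d_neg))   # already zero: no further push on d_pos
print(limit_based_loss(d_pos, d_neg))      # still positive: d_pos exceeds gamma_pos
```

With the margin ranking loss, training stops improving this triple even though h + r is still 0.5 away from t, so the strict assumption h + r = t is never actually enforced. A loss that bounds the positive distance behaves differently, which is the kind of effect that changes which relation patterns a model can be said to infer.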

SANSA 0.8.0 RC1 (Semantic Analytics Stack) Released

The Smart Data Analytics group is happy to announce the release candidate (0.8.0 RC1) of SANSA, the Scalable Semantic Analytics Stack. SANSA employs distributed computing via Apache Spark in order to allow scalable machine learning, inference, and querying capabilities for large knowledge graphs.

You can find the FAQ and usage examples at

The following features are currently supported by SANSA:

  • Reading and writing RDF files in N-Triples, Turtle, RDF/XML, and N-Quads formats
  • Reading OWL files in various standard formats
  • SPARQL querying via Sparqlify, Ontop and Tensors
  • RDFS, RDFS Simple and OWL-Horst forward chaining inference

Noteworthy changes and updates since the previous release are:

  • Support for Ontop Based Query Engine over RDF.
  • Distributed Trig/Turtle record reader.
  • Support to write out RDDs of OWL axioms in a variety of formats.
  • Distributed Data Summaries with ABstraction and STATistics (ABSTAT).
  • Configurable mapping of RDD of triples dataframes.
  • Initial support for RDD of Graphs and Datasets, executing queries on each entry and aggregating over the results.
  • Sparql Transformer for ML-Pipelines.
  • Autosparql Generation for Feature Extraction.
  • Distributed Feature-based Semantic Similarity Estimations.
  • Added a common R2RML abstraction layer for Ontop, Sparqlify, and possible future query engines.
  • Consolidated SANSA layers into a single GIT repository.
  • Retired the support for Apache Flink.

We look forward to your comments on the new features to make them permanent in our upcoming release 0.8.

Kindly note that the release candidate is not available on Maven Central; please follow the readme.

We want to thank everyone who helped to create this release, in particular the projects supporting us: PLATOON, BETTER, BOOST, SPECIAL, Simple-ML, LAMBDA, ML-win, CALLISTO, OpertusMundi, and Cleopatra.

Greetings from the SANSA Development Team

SANSA 0.8.0 Release

The Smart Data Analytics group is happy to announce SANSA 0.8.0 RC, the eighth release (candidate) of the Scalable Semantic Analytics Stack. SANSA employs distributed computing using Apache Spark in order to allow scalable machine learning, inference and querying capabilities for large knowledge graphs.

We look forward to your comments on the new features to make them permanent in our upcoming release.

Kindly note that the release candidate is not available on Maven Central; please follow the readme.

In this release candidate, we have included:

  • Integrated Ontop as a new SPARQL engine
  • Improved SPARQL query API
  • Distributed Trig/Turtle record reader
  • Support to write out RDDs of OWL axioms in a variety of formats.
  • Distributed Data Summaries with ABstraction and STATistics (ABSTAT)
  • Configurable mapping of RDD of triples dataframes
  • Initial support for RDD of Graphs and Datasets, executing queries on each entry and aggregating over the results
  • Sparql Transformer for ML-Pipelines
  • Autosparql Generation for Feature Extraction
  • Distributed Feature based Semantic Similarity Estimations
  • Added a common R2RML abstraction layer for Ontop, Sparqlify and possible future query engines
  • Consolidated SANSA layers into a single GIT repository
  • Retired the support for Apache Flink


PhD Viva: “Strategies for a Semantified Uniform Access to Large and Heterogeneous Data Sources” Mohamed Nadjib Mami

On Thursday, 28 January 2021, I successfully defended my PhD thesis entitled “Strategies for a Semantified Uniform Access to Large and Heterogeneous Data Sources”. The research work focuses on enabling the uniform querying of large and heterogeneous data sources (using SPARQL). Two directions were explored: (1) Physical Big Data Integration, where all input data is converted into the RDF data model, and (2) Virtual Big Data Integration, where input data remains in its original form and is accessed in an ad hoc manner (aka Semantic Data Lake). In the latter, relevant data is internally and virtually represented under the RDF data model. State-of-the-art Big Data technologies, e.g., Apache Spark, Presto, Cassandra, and MongoDB, were incorporated.


The remarkable advances achieved in both research and development of Data Management, as well as the prevalence of high-speed Internet and technology in the last few decades, have caused an unprecedented data avalanche. Large volumes of data, manifested in a multitude of types and formats, are being generated and are becoming the new norm. In this context, it is crucial to both leverage existing approaches and propose novel ones to overcome this data size and complexity, and thus facilitate data exploitation. In this thesis, we investigate two major approaches to addressing this challenge: Physical Data Integration and Logical Data Integration. The specific problem tackled is to enable querying large and heterogeneous data sources in an ad hoc manner.
In Physical Data Integration, data is physically and wholly transformed into a canonical unique format, which can then be directly and uniformly queried. In Logical Data Integration, data remains in its original format and form, and a middleware layer is placed above the data, allowing the various schema elements to be mapped to a high-level unifying formal model. The latter enables the querying of the underlying original data in an ad hoc and uniform way, a framework which we call the Semantic Data Lake (SDL). Both approaches have their advantages and disadvantages. For example, in the former, a significant effort and cost are devoted to pre-processing and transforming the data to the unified canonical format. In the latter, the cost is shifted to the query processing phases, e.g., query analysis, relevant source detection and results reconciliation.
In this thesis, we investigate both directions and study their strengths and weaknesses. For each direction, we propose a set of approaches and demonstrate their feasibility via a proposed implementation. In both directions, we appeal to Semantic Web technologies, which provide a set of time-proven techniques and standards that are dedicated to Data Integration. For Physical Integration, we suggest an end-to-end blueprint for the semantification of large and heterogeneous data sources, i.e., physically transforming the data to the Semantic Web data standard RDF (Resource Description Framework). A unified data representation, storage and query interface over the data are suggested. For Logical Integration, we provide a description of the SDL architecture, which allows querying data sources directly in their original form and format without requiring prior transformation and centralization. For a number of reasons that we detail, we put more emphasis on the virtual approach. We present the effort behind an extensible implementation of the SDL, called Squerall, which leverages state-of-the-art Semantic and Big Data technologies, e.g., RML (RDF Mapping Language) mappings, the FnO (Function Ontology), and Apache Spark. A series of evaluations is conducted to assess the implementation along various metrics and input data scales. In particular, we describe an industrial real-world use case using our SDL implementation. In a preparation phase, we conduct a survey of Query Translation methods in order to back some of our design choices.


The thesis is available online at: <>.


In the following, we summarize the activities of the last twelve months within the CLEOPATRA (Cross-lingual Event-centric Open Analytics Research Academy) project in which we participate.

Despite the challenges of running an international research project during a period of restricted mobility and access, the CLEOPATRA project team have enjoyed some major successes over the last twelve months. Our fourteen Early Stage Researchers (ESR) have shown remarkable ability to adapt, innovate and collaborate. They have worked together online, across different countries and time zones, to build tools and design methods for studying the digital traces of major global events. In April 2020, a hackathon and Research & Development week were quickly reorganised to be delivered virtually. Working in teams, the ESRs developed demonstrators to address such questions as how to analyse online media over time and in multiple (often under-resourced) languages. There was further opportunity to work on the demonstrators, and to develop new research ideas, at a Learning week in June 2020, held in conjunction with the University of Amsterdam Digital Methods Summer School. In January 2021, the ESRs organised a second virtual hackathon and Research & Development week, which resulted in the publication of an updated Open Event Knowledge Graph (OEKG). The OEKG is one of the key resources developed by the project, and currently contains information about more than a million events in 15 European languages. This is a unique resource which will transform our understanding of how transitional social, cultural and political events play out online. All of these activities have led to the continuation and formation of new and promising research collaborations, which we hope to see bear fruit in the coming months.

The ESRs and project beneficiaries have also been busy organising and attending conferences. In June 2020, an International Workshop on Cross-lingual Event-centric Open Analytics was held online, in association with the 17th Extended Semantic Web Conference. The award for the best paper delivered at the workshop was given to a team led by CLEOPATRA ESR, Golsa Tahmasebzadeh, working with Sherzod Hakimov, Eric Müller-Budack and Ralph Ewerth. A second international workshop will be held in April 2021, this time in association with the Web Conference. The project has been presented at numerous online conferences, in such diverse fields as the semantic web, artificial intelligence, web archive studies and spoken-language technologies. Publications arising from these events and other activities include blog posts, open datasets and no fewer than 17 conference proceedings and journal articles.

Information about all of these activities, and the various open resources that have been developed by the project team, can be found on the CLEOPATRA website, and you can follow us on Twitter @Cleopatra_ITN for the latest news and updates.