PyKEEN 1.0 Release

As members of the PyKEEN community project, we are happy to announce PyKEEN 1.0. PyKEEN is a software package for training and evaluating knowledge graph embedding models.

The following features are currently supported by PyKEEN:

  • 23 interaction models (ComplExLiteral, ComplEx, ConvE, ConvKB, DistMult, DistMultLiteral, ERMLP, ERMLPE, HolE, KG2E, NTN, ProjE, RESCAL, RGCN, RotatE, SimplE, StructuredEmbedding, TransD, TransE, TransH, TransR, TuckER, and UnstructuredModel)
  • 7 loss functions (Binary Cross Entropy, Cross Entropy, Margin Ranking Loss, Mean Square Error, Self-Adversarial Negative Sampling Loss, and Softplus Loss)
  • 3 regularizers (LP-norm based regularizer, Power Sum regularizer, and Combined regularizer, i.e., convex combination of regularizers)
  • 2 training approaches (LCWA and sLCWA)
  • 2 negative samplers (Uniform and Bernoulli)
  • Hyper-parameter optimization (using Optuna)
  • Early stopping
  • 6 evaluation metrics (adjusted mean rank, mean rank, mean reciprocal rank, hits@k, average-precision score, and ROC-AUC score)
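
To make the rank-based metrics in the list above concrete, here is a minimal sketch in plain Python. It does not use PyKEEN's own API; the function names and the example ranks are assumptions made only for illustration. Each rank is the position of the true entity among all scored candidates, with 1 being the best.

```python
from typing import Sequence


def mean_rank(ranks: Sequence[int]) -> float:
    """Arithmetic mean of the ranks (lower is better)."""
    return sum(ranks) / len(ranks)


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Mean of 1/rank (higher is better, at most 1.0)."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks: Sequence[int], k: int = 10) -> float:
    """Fraction of test triples whose true entity is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical ranks for five test triples (illustrative values only).
example_ranks = [1, 3, 2, 50, 10]

print("mean rank:", mean_rank(example_ranks))
print("MRR:", mean_reciprocal_rank(example_ranks))
print("hits@10:", hits_at_k(example_ranks, k=10))
```

Mean rank is sensitive to single outliers such as the rank of 50 above, whereas mean reciprocal rank and hits@k are dominated by the well-ranked triples.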

PyKEEN was used to extensively test existing KGE models on a wide range of configurations. You can find those results in our paper. We want to thank everyone who helped to create this release. For more updates, please view our Twitter feed and consider following us.

Greetings from the PyKEEN-Team

Paper Accepted at IJCNN 2020

The International Joint Conference on Neural Networks (IJCNN) covers a wide range of topics in the field of neural networks, from biological neural networks to artificial neural computation. IJCNN 2020 will be part of the IEEE World Congress on Computational Intelligence (IEEE WCCI), the world’s largest technical event in the field of computational intelligence.

Here is the pre-print of the accepted paper with its abstract:

Let the Margin SlidE± for Knowledge Graph Embeddings via a Correntropy Objective
By Mojtaba Nayyeri, Xiaotian Zhou, Sahar Vahdati, Reza Izanloo, Hamed Shariat Yazdi and Jens Lehmann.
Abstract Embedding models based on translation and rotation have gained significant attention in link prediction tasks for knowledge graphs. Most of the earlier works have modified the score function of Knowledge Graph Embedding models in order to improve the performance of link prediction tasks. However, as proven theoretically and experimentally, the performance of such Embedding models strongly depends on the loss function. One of the prominent approaches in defining loss functions is to set a margin between positive and negative samples during the learning process. This task is particularly important because it directly affects the learning and ranking of triples and ultimately defines the final output. Approaches for setting a margin have the following challenges: a) the length of the margin has to be fixed manually, b) without a fixed point for the center of the margin, the scores of positive triples are not necessarily enforced to be sufficiently small to fulfill the translation/rotation from head to tail by using the relation vector. In this paper, we propose a family of loss functions dubbed SlidE± to address the aforementioned challenges. The formulation of the proposed loss functions enables an automated technique to adjust the length of the margin adaptive to a defined center. In our experiments on a set of standard benchmark datasets including Freebase and WordNet, the effectiveness of our approach is confirmed for training Knowledge Graph Embedding models, specifically TransE and RotatE as a case study, on link prediction tasks.
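
For readers unfamiliar with the margin-based setup the abstract refers to, the following is a minimal sketch of a TransE-style distance score combined with the standard margin ranking loss. It is not the SlidE± loss proposed in the paper; the function names, vectors, and margin value are assumptions chosen for illustration.

```python
import math


def transe_score(h, r, t):
    """TransE-style distance ||h + r - t||; lower means more plausible."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def margin_ranking_loss(pos_score, neg_score, margin=1.0):
    """Standard margin ranking loss on distance scores.

    The loss is zero once the negative triple scores at least
    `margin` higher (worse) than the positive triple.
    """
    return max(0.0, margin + pos_score - neg_score)


# Toy 2-dimensional embeddings (illustrative values only).
head = [0.0, 0.0]
relation = [1.0, 0.0]
true_tail = [1.0, 0.0]
corrupt_tail = [4.0, 4.0]

pos = transe_score(head, relation, true_tail)
neg = transe_score(head, relation, corrupt_tail)
print("loss:", margin_ranking_loss(pos, neg, margin=1.0))

# Challenge (b) from the abstract: the loss only constrains the gap,
# so it is already zero here although the positive score is far from 0.
print("loss with large scores:", margin_ranking_loss(10.0, 12.0, margin=1.0))
```

The second call shows why a fixed margin alone does not force positive triples toward a small score: only the difference between the two scores enters the loss.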

Paper Accepted at LREC 2020

We are very pleased to announce that our group got a paper accepted for presentation at LREC 2020 (International Conference on Language Resources and Evaluation). LREC has become the major event on Language Resources (LRs) and Evaluation for Language Technologies (LT). The aim of LREC is to provide an overview of the state-of-the-art, explore new R&D directions and emerging trends, exchange information regarding LRs and their applications, evaluation methodologies and tools, ongoing and planned activities, industrial uses and needs, requirements coming from the e-society, both with respect to policy issues and to technological and organisational ones.

Here is the pre-print of the accepted paper with its abstract:

Treating Dialogue Quality Evaluation as an Anomaly Detection Problem
By Rostislav Nedelchev, Jens Lehmann and Ricardo Usbeck.
Abstract Dialogue systems for interaction with humans have been enjoying increased popularity in the research and industry fields. To this day, the best way to estimate their success is through means of human evaluation and not automated approaches, despite the abundance of work done in the field. In this paper, we investigate the effectiveness of perceiving dialogue evaluation as an anomaly detection task. The paper looks into four dialogue modeling approaches and how their objective functions correlate with human annotation scores. A high-level perspective exhibits negative results. However, a more in-depth look shows limited potential for using anomaly detection for evaluating dialogues.

Papers Accepted at ESWC

We are very pleased to announce that our group got five papers (three papers in the main tracks and two in the Cleopatra Workshop) accepted for presentation at ESWC2020 (European Semantic Web Conference 2020). The ESWC is a major venue for discussing the latest scientific results and technology innovations around semantic technologies. Building on its past success, ESWC is seeking to broaden its focus to span other relevant related research areas in which Web semantics plays an important role.

Here are the pre-prints of the papers with their abstracts that have been accepted in the main tracks:

  • A Knowledge Graph for Industry 4.0
    By Sebastian R. Bader, Irlan Grangel-Gonzalez, Priyanka Nanjappa, Maria-Esther Vidal, and Maria Maleshkova.
    Abstract One of the most crucial tasks for today’s knowledge workers is to get and retain a thorough overview on the latest state of the art. Especially in dynamic and evolving domains, the amount of relevant sources is constantly increasing, updating and overruling previous methods and approaches. For instance, the digital transformation of manufacturing systems, called Industry 4.0, currently faces an overwhelming amount of standardization efforts and reference initiatives, resulting in a sophisticated information environment. We propose a structured dataset in the form of a semantically annotated knowledge graph for Industry 4.0 related standards, norms and reference frameworks. The graph provides a Linked Data-conform collection of annotated, classified reference guidelines supporting newcomers and experts alike in understanding how to implement Industry 4.0 systems. We illustrate the suitability of the graph for various use cases, its already existing applications, present the maintenance process and evaluate its quality.

  • VQuAnDa: Verbalization QUestion Answering DAtaset
    By Endri Kacupaj, Hamid Zafar, Jens Lehmann and Maria Maleshkova.
    Abstract Question Answering (QA) systems over Knowledge Graphs (KGs) aim to provide a concise answer to a given natural language question. Despite the significant evolution of QA methods over the past years, there are still some core lines of work, which are lagging behind. This is especially true for methods and datasets that support the verbalization of answers in natural language. Specifically, to the best of our knowledge, none of the existing Question Answering datasets provide any verbalization data for the question-query pairs. Hence, we aim to fill this gap by providing the first QA dataset VQuAnDa that includes the verbalization of each answer. We base VQuAnDa on a commonly used large-scale QA dataset, LC-QuAD, in order to support compatibility and continuity of previous work. We complement the dataset with baseline scores for measuring future training and evaluation work, by using a set of standard sequence to sequence models and sharing the results of the experiments. This resource empowers researchers to train and evaluate a variety of models to generate answer verbalizations.

  • Embedding-based Recommendations on Scholarly Knowledge Graphs
    By Mojtaba Nayyeri, Sahar Vahdati, Xiaotian Zhou, Hamed Shariat Yazdi, and Jens Lehmann.
    Abstract The increasing availability of scholarly metadata in the form of Knowledge Graphs (KG) offers opportunities for studying the structure of scholarly communication and evolution of science. Such KGs build the foundation for knowledge-driven tasks e.g., link discovery, prediction and entity classification which allow to provide recommendation services. Knowledge graph embedding (KGE) models have been investigated for such knowledge-driven tasks in different application domains. One of the applications of KGE models is to provide link predictions, which can also be viewed as a foundation for recommendation service, e.g., high confidence “co-author” links in a scholarly knowledge graph can be seen as suggested collaborations. In this paper, KGEs are reconciled with a specific loss function (Soft Margin) and examined with respect to their performance for co-authorship link prediction task on scholarly KGs. The results show a significant improvement in the accuracy of the experimented KGE models on the considered scholarly KGs using this specific loss. TransE with Soft Margin (TransE-SM) obtains a score of 79.5% Hits@10 for co-authorship link prediction task while the original TransE obtains 77.2%, on the same task. In terms of accuracy and Hits@10, TransE-SM also outperforms other state-of-the-art embedding models such as ComplEx, ConvE and RotatE in this setting. The predicted co-authorship links have been validated by evaluating the profile of scholars.

Here are the pre-prints of the papers with their abstracts that have been accepted in the Cleopatra Workshop:

  • Training Multimodal Systems for Classification with Multiple Objectives
    By Jason Armitage, Shramana Thakur, Rishi Tripathi, Jens Lehmann and Maria Maleshkova.
    Abstract We learn about the world from a diverse range of sensory information. Automated systems lack this ability and are confined to processing information presented in only a single format. Adapting architectures to learn from multiple modalities creates the potential to learn rich representations – but current systems only deliver marginal improvements on unimodal approaches. Neural networks learn sampling noise during training with the result that performance on unseen data is degraded. This research introduces a second objective over the multimodal fusion process learned with variational inference. Regularisation methods are implemented in the inner training loop to control variance and the modular structure stabilises performance as additional neurons are added to layers. This framework is evaluated on a multilabel classification task with textual and visual inputs to demonstrate the potential for multiple objectives and probabilistic methods to lower variance and improve generalisation.

Paper Accepted at IFIP SEC

We are very pleased to announce that our group got a paper accepted for presentation at the International Information Security and Privacy Conference (IFIP SEC). The IFIP SEC conferences aim to bring together primarily researchers, but also practitioners from academia, industry and governmental institutions to elaborate and discuss IT Security and Privacy Challenges that we are facing today and will be facing in the future.

Here is the pre-print of the accepted paper with its abstract:

“Establishing a Strong Baseline for Privacy Policy Classification” by Najmeh Mousavi Nejad, Pablo Jabat, Rostislav Nedelchev, Simon Scerri, Damien Graux.
Abstract Digital service users are routinely exposed to Privacy Policy consent forms, through which they enter contractual agreements consenting to the specifics of how their personal data is managed and used. Nevertheless, despite renewed importance following legislation such as the European GDPR, a majority of people still ignore policies due to their length and complexity. To counteract this potentially dangerous reality, in this paper we present three different models that are able to assign pre-defined categories to privacy policy paragraphs, using supervised machine learning. In order to train our neural networks, we exploit a dataset containing 115 privacy policies defined by US companies. An evaluation shows that our approach outperforms state-of-the-art by 5% over comparable and previously-reported F1 values. In addition, our method is completely reproducible since we provide open access to all resources. Given these two contributions, our approach can be considered as a strong baseline for privacy policy classification.

Paper Accepted at ACM SAC

We are very pleased to announce that our group got a paper accepted for presentation at ACM SAC. The ACM Symposium on Applied Computing (SAC) has been a primary and international forum for applied computer scientists, computer engineers and application developers to gather, interact and present their work.

Here is the pre-print of the accepted paper with its abstract:

“Towards the Semantic Formalization of Science” by Said Fathalla, Soren Auer, Christoph Lange.
Abstract: The past decades have witnessed a huge growth in scholarly information published on the Web, mostly in unstructured or semi-structured formats, which hampers scientific literature exploration and scientometric studies. Past studies on ontologies for structuring scholarly information focused on describing scholarly articles’ components, such as document structure, metadata and bibliographies, rather than the scientific work itself. Over the past four years, we have been developing the Science Knowledge Graph Ontologies (SKGO), a set of ontologies for modeling the research findings in various fields of modern science resulting in a knowledge graph. Here, we introduce this ontology suite and discuss the design considerations taken into account during its development. We deem that within the next few years, a science knowledge graph is likely to become a crucial component for organizing and exploring scientific work.

Papers Accepted at ECAI

We are very pleased to announce that our group got two papers accepted for presentation at ECAI2020 (European Conference on Artificial Intelligence), Europe’s premier AI research venue. Under the motto “Paving the way towards Human-Centric AI”, ECAI provides an opportunity for researchers to present and discuss the best AI research, developments, applications and results.

Here are the pre-prints of the accepted papers with their abstracts:

  • “Distantly Supervised Question Parsing” by Hamid Zafar, Maryam Tavakol, Jens Lehmann.
    Abstract: The emergence of structured databases for Question Answering (QA) systems has led to developing methods, in which the problem of learning the correct answer efficiently is based on a linking task between the constituents of the question and the corresponding entries in the database. As a result, parsing the questions in order to determine their main elements, which are required for answer retrieval, becomes crucial. However, most datasets for question answering systems lack gold annotations for parsing, i.e., labels are only available in the form of (question, formal-query, answer). In this paper, we propose a distantly supervised learning framework based on reinforcement learning to learn the mentions of entities and relations in questions. We leverage the provided formal queries to characterize delayed rewards for optimizing a policy gradient objective for the parsing model. An empirical evaluation of our approach shows a significant improvement in the performance of entity and relation linking compared to the state of the art. We also demonstrate that a more accurate parsing component enhances the overall performance of QA systems.
  • “MDE: Multiple Distance Embeddings for Link Prediction in Knowledge Graphs” by Afshin Sadeghi, Damien Graux, Hamed Shariat Yazdi, and Jens Lehmann.
    Abstract: Over the past decade, knowledge graphs became popular for capturing structured domain knowledge. Relational learning models enable the prediction of missing links inside knowledge graphs. More specifically, latent distance approaches model the relationships among entities via a distance between latent representations. Translating embedding models (e.g., TransE) are among the most popular latent distance approaches which use one distance function to learn multiple relation patterns. However, they are not capable of capturing symmetric relations. They also force relations with reflexive patterns to become symmetric and transitive. In order to improve distance based embedding, we propose multi-distance embeddings (MDE). Our solution is based on the idea that by learning independent embedding vectors for each entity and relation one can aggregate contrasting distance functions. Benefiting from MDE, we also develop supplementary distances resolving the above-mentioned limitations of TransE. We further propose an extended loss function for distance based embeddings and show that MDE and TransE are fully expressive using this loss function. Furthermore, we obtain a bound on the size of their embeddings for full expressivity. Our empirical results show that MDE significantly improves the translating embeddings and outperforms several state-of-the-art embedding models on benchmark datasets.
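
The MDE abstract's remark that TransE cannot capture symmetric relations can be checked numerically. The sketch below is not code from the paper; it assumes the usual TransE distance ||h + r - t|| and uses random toy vectors. Since the two residuals (h + r - t) and (t + r - h) add up to 2r, the triangle inequality gives score(h, r, t) + score(t, r, h) >= 2 * ||r||, so both directions can only score zero when r is the zero vector, which in turn forces h = t.

```python
import math
import random


def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def transe_score(h, r, t):
    """TransE-style distance ||h + r - t||; lower means more plausible."""
    return norm([hi + ri - ti for hi, ri, ti in zip(h, r, t)])


# Random toy embeddings; the seed and dimension are arbitrary choices.
random.seed(0)
dim = 4
h = [random.uniform(-1, 1) for _ in range(dim)]
r = [random.uniform(-1, 1) for _ in range(dim)]
t = [random.uniform(-1, 1) for _ in range(dim)]

forward = transe_score(h, r, t)    # score of (h, r, t)
backward = transe_score(t, r, h)   # score of the reversed triple (t, r, h)
lower_bound = 2 * norm(r)          # from the triangle inequality

print("forward + backward:", forward + backward)
print("lower bound 2*||r||:", lower_bound)
```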

SANSA 0.7.1 (Semantic Analytics Stack) Released

We are happy to announce SANSA 0.7.1 – the seventh release of the Scalable Semantic Analytics Stack. SANSA employs distributed computing via Apache Spark and Flink in order to allow scalable machine learning, inference and querying capabilities for large knowledge graphs.

You can find usage guidelines and examples at

The following features are currently supported by SANSA:

  • Reading and writing RDF files in N-Triples, Turtle, RDF/XML, N-Quads, and TRIX formats
  • Reading OWL files in various standard formats
  • Query heterogeneous sources (Data Lake) using SPARQL – CSV, Parquet, MongoDB, Cassandra, JDBC (MySQL, SQL Server, etc.) are supported
  • Support for multiple data partitioning techniques
  • SPARQL querying via Sparqlify, Ontop, and tensors
  • Graph-parallel querying of RDF using SPARQL (1.0) via GraphX traversals (experimental)
  • RDFS, RDFS Simple and OWL-Horst forward chaining inference
  • RDF graph clustering with different algorithms
  • Terminological decision trees (experimental)
  • Knowledge graph embedding approaches: TransE (beta), DistMult (beta)

Noteworthy changes or updates since the previous release are:

  • TRIX support
  • A new query engine over compressed RDF data
  • OWL/XML Support

Deployment and getting started:

  • There are template projects for SBT and Maven for Apache Spark as well as for Apache Flink available to get started.
  • The SANSA jar files are in Maven Central i.e. in most IDEs you can just search for “sansa” to include the dependencies in Maven projects.
  • Example code is available for various tasks.
  • We provide interactive notebooks for running and testing code via Docker.

We want to thank everyone who helped to create this release, in particular the projects Big Data Ocean, SLIPO, QROWD, BETTER, BOOST, MLwin, PLATOON and Simple-ML. Also check out our recent articles in which we describe how to use SANSA for tensor based querying, scalable RDB2RDF query execution, quality assessment and semantic partitioning.

Spread the word by retweeting our release announcement on Twitter. For more updates, please view our Twitter feed and consider following us.

Greetings from the SANSA Development Team

Quantum Natural Language Processing (QNLP) in Oxford

From 5 to 6 December, a conference on QNLP took place at St. Aldate’s Church in Oxford. The event was organized by the Quantum Group at the Department of Computer Science of the University of Oxford with support from the companies Cambridge Quantum Computing (CQC) and IBM. Two members of the SDA team in Dresden, Cedric Möller and Daniel Steinigen, participated in the conference. It was also the first conference ever on the combination of NLP with quantum computing.

Quantum Artificial Intelligence (QAI) has become increasingly interesting for research activities in recent years. Noisy intermediate-scale quantum (NISQ) computers already make it possible to run algorithms and to explore possible advantages for NLP. Since the mathematical foundations of quantum theory are very similar to those of compositional NLP with applied category theory, quantum computers should provide a natural setting for compositional NLP tasks [1].

[1] Zeng, Coecke – Quantum Algorithms for Compositional Natural Language Processing

Papers Accepted at IEEE-ICSC

We are very pleased to announce that our group got four papers accepted for presentation at IEEE-ICSC 2020.

The 14th IEEE International Conference on Semantic Computing (ICSC2020) addresses the derivation, description, integration, and use of semantics (“meaning”, “context”, “intention”) for all types of resources including data, document, tool, device, process and people. The scope of ICSC2020 includes, but is not limited to, analytics, semantics description languages and integration (of data and services), interfaces, and applications.

Here are the pre-prints of the accepted papers with their abstracts:

  • “DISE: A Distributed in-Memory SPARQL Processing Engine over Tensor Data” by Hajira Jabeen, Eskender Haziiev, Gezim Sejdiu, and Jens Lehmann.
    Abstract: SPARQL is a W3C standard for querying the data stored as Resource Description Framework (RDF). The SPARQL queries are represented using triple-patterns, and are tailored to search for these patterns in RDF. Most of the existing SPARQL evaluators provide centralized, DBMS inspired solutions consuming high resources and offering limited flexibility. In order to deal with the increasing RDF data, it is important to develop scalable and efficient solutions for distributed SPARQL query evaluators. In this paper we present DISE, an open source implementation of distributed in-memory SPARQL engine that can scale out to a cluster of machines. DISE represents an RDF graph as a three way distributed tensor for querying large-scale RDF datasets. This distributed tensor representation offers opportunities for novel distributed applications. DISE relies on translating SPARQL queries into Spark tensor operations by exploiting the information about the query complexity and creating a dynamic execution plan. We have tested the scalability and efficiency of DISE on different datasets and the results have been found scalable and efficient while exploiting the relatively new representation format.
  • “Let’s build Bridges, not Walls – SPARQL Querying of TinkerPop Graph Databases with sparql-gremlin” by Harsh Thakkar, Renzo Angles, Marko Rodriguez, Stephen Mallette, and Jens Lehmann.
    Abstract: This article presents sparql-gremlin, a tool to translate SPARQL queries to Gremlin pattern matching traversals. Currently, sparql-gremlin is a plugin of the Apache TinkerPop graph computing framework, thus the users can run queries expressed in the W3C SPARQL query language over a wide variety of graph data management systems, including both OLTP graph databases and OLAP graph processing frameworks. With sparql-gremlin, we perform the first step to bridge the query interoperability gap between the Semantic Web and Graph database communities. The plugin has received adoption from both academia and industry research in its short timespan.

  • “VoColReg: A Registry for Supporting Distributed Ontology Development using Version Control Systems” by Abderrahmane Khiat, Lavdim Halilaj, Ahmad Hemid and Steffen Lohmann (ICSC Resource Track).
    Abstract: The number of ontologies used for different purposes, such as data integration, information retrieval or search optimization, is constantly increasing. Therefore, it is crucial that ontologies can be developed and explored in an easy way by humans, and are accessible by intelligent agents. To this end, we created VoColReg on top of the VoCol platform. VoColReg provides an integrated registry that hosts VoCol instances, allowing the community to access, browse, reuse, and improve ontologies in a collaborative fashion. VoColReg integrates several improved features, such as RDF-Doctor which is able to simultaneously identify a comprehensive list of syntax errors and automatically correct a subset of them. Currently, the VoColReg platform hosts more than 21 ontologies from various domains, where nine of them are publicly available. We analyzed those nine ontologies to discover different facts about them such as hosting platforms used, expressivity of the ontologies, number of triples and modules.

  • “Learning a Lightweight Representation: First Step Towards Automatic Detection of Multidimensional Relationships between Ideas” by Abderrahmane Khiat (ICSC Research Track, Concise Paper).
    Abstract: Moving ideation from a closed paradigm (companies) to an open one (crowd) yields several benefits: (1) The crowd allows the generation of a large number of ideas and (2) Its heterogeneity increases the potential in obtaining creative ideas. In practice, however, the crowd often fails at generating innovative solutions, leading to duplicate or ideas that use each other’s description. Thus, it is practically and economically unfeasible to sift through this large number of ideas to select valuable ones. One promising solution to overcome this issue is finding relationships between idea texts such as duplicate, generalize, disjoint, alternative solution, etc. Existing approaches either rely on human judgment, which is expensive and requires domain experts or automatic approaches which compute similarity i.e. one dimension and do not consider other relations. The proposed solution is based on sequence-to-sequence learning, which allows the machine to learn a lightweight structural representation that is used next to establishing complex relations between ideas. This lightweight structural representation is obtained based on our investigation. We found that ideas contain the following patterns: what the idea is about (e.g. window with heat-sensitive material), how it works (e.g. it lights up) and when it works (e.g. in case of fire). Those extracted patterns are then compared with the corresponding patterns of other ideas to establish relations. Our preliminary investigation shows promising results to learn and leverage such lightweight structural representation in identifying the complex relationship between ideas.