Blog

Book “Knowledge Graphs and Big Data Processing” Published as Open Access

One of the core missions of the LAMBDA EU Project is to produce learning material about Big Data Analytics. We are happy to announce that the book “Knowledge Graphs and Big Data Processing” is published as open access. This was a titanic effort from SDA and Fraunhofer IAIS colleagues.
The book can be downloaded from here. We contributed to the chapters Big Data Outlook, Tools, and Architectures (chapter 3), Scalable Knowledge Graph Processing Using SANSA (chapter 7) and Context-Based Entity Matching for Big Data (chapter 8).

PyKEEN 1.0 Release

As a member of the PyKEEN community project, we are happy to announce PyKEEN 1.0 – PyKEEN is a software package to train and evaluate knowledge graph embedding models.

The following features are currently supported by PyKEEN:

  • 23 interaction models (ComplExLiteral, ComplEx, ConvE, ConvKB, DistMult, DistMultLiteral, ERMLP, ERMLPE, HolE, KG2E, NTN, ProjE, RESCAL, RGCN, RotatE, SimplE, StructuredEmbedding, TransD, TransE, TransH, TransR, TuckER, and UnstructuredModel)
  • 7 loss functions (Binary Cross Entropy, Cross Entropy, Margin Ranking Loss, Mean Square Error, Self-Adversarial Negative Sampling Loss, and Softplus Loss)
  • 3 regularizers (LP-norm based regularizer, Power Sum regularizer, and Combined regularizer, i.e., convex combination of regularizers)
  • 2 training approaches (LCWA and sLCWA)
  • 2 negative samplers (Uniform and Bernoulli)
  • Hyper-parameter optimization (using Optuna)
  • Early stopping
  • 6 evaluation metrics (adjusted mean rank, mean rank, mean reciprocal rank, hits@k, average-precision score, and ROC-AUC score)
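
To make the rank-based metrics in the list above concrete, here is a minimal sketch in plain Python. It does not use PyKEEN's API. The function name `rank_metrics` and the example ranks are hypothetical, chosen only for illustration, and the sketch assumes 1-based ranks of the true entity among all candidates.

```python
from statistics import mean


def rank_metrics(ranks, k=10):
    """Summarise 1-based ranks of the true entities with three rank-based metrics."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    return {
        # Average position of the true entity; lower is better.
        "mean_rank": mean(ranks),
        # Average of 1/rank; higher is better, bounded by 1.
        "mean_reciprocal_rank": mean(1.0 / r for r in ranks),
        # Fraction of test triples whose true entity is ranked in the top k.
        f"hits@{k}": sum(r <= k for r in ranks) / len(ranks),
    }


# Hypothetical ranks for four test triples (illustrative data only).
example_ranks = [1, 2, 4, 20]
metrics = rank_metrics(example_ranks, k=3)
print(metrics)
```

A single badly ranked triple (rank 20 here) dominates the mean rank but barely moves the mean reciprocal rank, which is one reason the metrics are usually reported together.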

PyKEEN was used to extensively test existing KGE models on a wide range of configurations. You can find those results in our paper. We want to thank everyone who helped to create this release. For more updates, please view our Twitter feed and consider following us.

Greetings from the PyKEEN-Team

Paper Accepted at IJCNN 2020

The International Joint Conference on Neural Networks (IJCNN) covers a wide range of topics in the field of neural networks, from biological neural networks to artificial neural computation. IJCNN 2020 is part of the IEEE World Congress on Computational Intelligence (IEEE WCCI), the world’s largest technical event in the field of computational intelligence.

Here is the pre-print of the accepted paper with its abstract:

Let the Margin SlidE± for Knowledge Graph Embeddings via a Correntropy Objective
By Mojtaba Nayyeri, Xiaotian Zhou, Sahar Vahdati, Reza Izanloo, Hamed Shariat Yazdi and Jens Lehmann.
Abstract Embedding models based on translation and rotation have gained significant attention in link prediction tasks for knowledge graphs. Most of the earlier works have modified the score function of Knowledge Graph Embedding models in order to improve the performance of link prediction tasks. However, as proven theoretically and experimentally, the performance of such Embedding models strongly depends on the loss function. One of the prominent approaches in defining loss functions is to set a margin between positive and negative samples during the learning process. This task is particularly important because it directly affects the learning and ranking of triples and ultimately defines the final output. Approaches for setting a margin have the following challenges: a) the length of the margin has to be fixed manually, b) without a fixed point for center of the margin, the scores of positive triples are not necessarily enforced to be sufficiently small to fulfill the translation/rotation from head to tail by using the relation vector. In this paper, we propose a family of loss functions dubbed SlidE± to address the aforementioned challenges. The formulation of the proposed loss functions enables an automated technique to adjust the length of the margin adaptive to a defined center. In our experiments on a set of standard benchmark datasets including Freebase and WordNet, the effectiveness of our approach is confirmed for training Knowledge Graph Embedding models, specifically TransE and RotatE as a case study, on link prediction tasks.
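
For readers unfamiliar with margin-based losses, the sketch below shows the conventional fixed-margin ranking loss on top of a TransE distance. This is the baseline setting the abstract criticises, not the SlidE± loss itself. All vectors and the margin values are made-up illustration data, and the function names are hypothetical.

```python
import math


def transe_score(head, relation, tail):
    """TransE distance ||h + r - t||_2; a lower score means a more plausible triple."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


def margin_ranking_loss(pos_score, neg_score, margin=1.0):
    """Hinge loss: zero once the negative scores at least `margin` above the positive."""
    return max(0.0, margin + pos_score - neg_score)


# Toy 2-dimensional embeddings (assumed values, for illustration only).
head = [0.0, 0.0]
relation = [1.0, 0.0]
true_tail = [1.0, 0.0]      # h + r lands exactly on this tail
corrupt_tail = [1.0, 0.5]   # corrupted tail, 0.5 away from h + r

pos = transe_score(head, relation, true_tail)     # 0.0
neg = transe_score(head, relation, corrupt_tail)  # 0.5

loss_wide = margin_ranking_loss(pos, neg, margin=1.0)     # 0.5: still penalised
loss_narrow = margin_ranking_loss(pos, neg, margin=0.25)  # 0.0: already satisfied
print(loss_wide, loss_narrow)
```

The same pair of triples is penalised or not depending only on the hand-picked margin, which illustrates challenge a) from the abstract.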

Paper Accepted at LREC 2020

We are very pleased to announce that our group got a paper accepted for presentation at LREC 2020 (International Conference on Language Resources and Evaluation). LREC has become the major event on Language Resources (LRs) and Evaluation for Language Technologies (LT). The aim of LREC is to provide an overview of the state-of-the-art, explore new R&D directions and emerging trends, exchange information regarding LRs and their applications, evaluation methodologies and tools, ongoing and planned activities, industrial uses and needs, requirements coming from the e-society, both with respect to policy issues and to technological and organisational ones.

Here is the pre-print of the accepted paper with its abstract:

Treating Dialogue Quality Evaluation as an Anomaly Detection Problem
By Rostislav Nedelchev, Jens Lehmann and Ricardo Usbeck.
Abstract Dialogue systems for interaction with humans have been enjoying increased popularity in the research and industry fields. To this day, the best way to estimate their success is through means of human evaluation and not automated approaches, despite the abundance of work done in the field. In this paper, we investigate the effectiveness of perceiving dialogue evaluation as an anomaly detection task. The paper looks into four dialogue modeling approaches and how their objective functions correlate with human annotation scores. A high-level perspective exhibits negative results. However, a more in-depth look shows limited potential for using anomaly detection for evaluating dialogues.

Papers Accepted at ESWC

We are very pleased to announce that our group got five papers (three papers in the main tracks and two in the Cleopatra Workshop) accepted for presentation at ESWC2020 (European Semantic Web Conference 2020). The ESWC is a major venue for discussing the latest scientific results and technology innovations around semantic technologies. Building on its past success, ESWC is seeking to broaden its focus to span other relevant related research areas in which Web semantics plays an important role.

Here are the pre-prints of the papers with their abstracts that have been accepted in the main tracks:

  • A Knowledge Graph for Industry 4.0
    By Sebastian R. Bader, Irlan Grangel-Gonzalez, Priyanka Nanjappa, Maria-Esther Vidal, and Maria Maleshkova.
    Abstract One of the most crucial tasks for today’s knowledge workers is to get and retain a thorough overview on the latest state of the art. Especially in dynamic and evolving domains, the amount of relevant sources is constantly increasing, updating and overruling previous methods and approaches. For instance, the digital transformation of manufacturing systems, called Industry 4.0, currently faces an overwhelming amount of standardization efforts and reference initiatives, resulting in a sophisticated information environment. We propose a structured dataset in the form of a semantically annotated knowledge graph for Industry 4.0 related standards, norms and reference frameworks. The graph provides a Linked Data-conform collection of annotated, classified reference guidelines supporting newcomers and experts alike in understanding how to implement Industry 4.0 systems. We illustrate the suitability of the graph for various use cases, its already existing applications, present the maintenance process and evaluate its quality.

  • VQuAnDa: Verbalization QUestion Answering DAtaset
    By Endri Kacupaj, Hamid Zafar, Jens Lehmann and Maria Maleshkova.
    Abstract Question Answering (QA) systems over Knowledge Graphs (KGs) aim to provide a concise answer to a given natural language question. Despite the significant evolution of QA methods over the past years, there are still some core lines of work, which are lagging behind. This is especially true for methods and datasets that support the verbalization of answers in natural language. Specifically, to the best of our knowledge, none of the existing Question Answering datasets provide any verbalization data for the question-query pairs. Hence, we aim to fill this gap by providing the first QA dataset VQuAnDa that includes the verbalization of each answer. We base VQuAnDa on a commonly used large-scale QA dataset, LC-QuAD, in order to support compatibility and continuity of previous work. We complement the dataset with baseline scores for measuring future training and evaluation work, by using a set of standard sequence to sequence models and sharing the results of the experiments. This resource empowers researchers to train and evaluate a variety of models to generate answer verbalizations.

  • Embedding-based Recommendations on Scholarly Knowledge Graphs
    By Mojtaba Nayyeri, Sahar Vahdati, Xiaotian Zhou, Hamed Shariat Yazdi, and Jens Lehmann.
    Abstract The increasing availability of scholarly metadata in the form of Knowledge Graphs (KG) offers opportunities for studying the structure of scholarly communication and evolution of science. Such KGs build the foundation for knowledge-driven tasks e.g., link discovery, prediction and entity classification which allow to provide recommendation services. Knowledge graph embedding (KGE) models have been investigated for such knowledge-driven tasks in different application domains. One of the applications of KGE models is to provide link predictions, which can also be viewed as a foundation for recommendation service, e.g., high confidence “co-author” links in a scholarly knowledge graph can be seen as suggested collaborations. In this paper, KGEs are reconciled with a specific loss function (Soft Margin) and examined with respect to their performance for co-authorship link prediction task on scholarly KGs. The results show a significant improvement in the accuracy of the experimented KGE models on the considered scholarly KGs using this specific loss. TransE with Soft Margin (TransE-SM) obtains a score of 79.5% Hits@10 for co-authorship link prediction task while the original TransE obtains 77.2%, on the same task. In terms of accuracy and Hits@10, TransE-SM also outperforms other state-of-the-art embedding models such as ComplEx, ConvE and RotatE in this setting. The predicted co-authorship links have been validated by evaluating the profile of scholars.

Here are the pre-prints of the papers with their abstracts that have been accepted in the Cleopatra Workshop:

  • Training Multimodal Systems for Classification with Multiple Objectives
    By Jason Armitage, Shramana Thakur, Rishi Tripathi, Jens Lehmann and Maria Maleshkova.
    Abstract We learn about the world from a diverse range of sensory information. Automated systems lack this ability and are confined to processing information presented in only a single format. Adapting architectures to learn from multiple modalities creates the potential to learn rich representations – but current systems only deliver marginal improvements on unimodal approaches. Neural networks learn sampling noise during training with the result that performance on unseen data is degraded. This research introduces a second objective over the multimodal fusion process learned with variational inference. Regularisation methods are implemented in the inner training loop to control variance and the modular structure stabilises performance as additional neurons are added to layers. This framework is evaluated on a multilabel classification task with textual and visual inputs to demonstrate the potential for multiple objectives and probabilistic methods to lower variance and improve generalisation.

Paper Accepted at IFIP SEC

We are very pleased to announce that our group got a paper accepted for presentation at the International Information Security and Privacy Conference (IFIP SEC). The IFIP SEC conferences aim to bring together primarily researchers, but also practitioners from academia, industry and governmental institutions to elaborate and discuss IT Security and Privacy Challenges that we are facing today and will be facing in the future.

Here is the pre-print of the accepted paper with its abstract:

“Establishing a Strong Baseline for Privacy Policy Classification” by Najmeh Mousavi Nejad, Pablo Jabat, Rostislav Nedelchev, Simon Scerri, Damien Graux.
Abstract Digital service users are routinely exposed to Privacy Policy consent forms, through which they enter contractual agreements consenting to the specifics of how their personal data is managed and used. Nevertheless, despite renewed importance following legislation such as the European GDPR, a majority of people still ignore policies due to their length and complexity. To counteract this potentially dangerous reality, in this paper we present three different models that are able to assign pre-defined categories to privacy policy paragraphs, using supervised machine learning. In order to train our neural networks, we exploit a dataset containing 115 privacy policies defined by US companies. An evaluation shows that our approach outperforms state-of-the-art by 5% over comparable and previously-reported F1 values. In addition, our method is completely reproducible since we provide open access to all resources. Given these two contributions, our approach can be considered as a strong baseline for privacy policy classification.

Paper Accepted at ACM SAC

We are very pleased to announce that our group got a paper accepted for presentation at ACM SAC. The ACM Symposium on Applied Computing (SAC) has been a primary and international forum for applied computer scientists, computer engineers and application developers to gather, interact and present their work.

Here is the pre-print of the accepted paper with its abstract:

“Towards the Semantic Formalization of Science” by Said Fathalla, Soren Auer, Christoph Lange.
Abstract: The past decades have witnessed a huge growth in scholarly information published on the Web, mostly in unstructured or semi-structured formats, which hampers scientific literature exploration and scientometric studies. Past studies on ontologies for structuring scholarly information focused on describing scholarly articles’ components, such as document structure, metadata and bibliographies, rather than the scientific work itself. Over the past four years, we have been developing the Science Knowledge Graph Ontologies (SKGO), a set of ontologies for modeling the research findings in various fields of modern science resulting in a knowledge graph. Here, we introduce this ontology suite and discuss the design considerations taken into account during its development. We deem that within the next few years, a science knowledge graph is likely to become a crucial component for organizing and exploring scientific work.

Papers Accepted at ECAI

We are very pleased to announce that our group got two papers accepted for presentation at ECAI2020 (European Conference on Artificial Intelligence), Europe’s premier AI research venue. Under the motto “Paving the way towards Human-Centric AI”, ECAI provides an opportunity for researchers to present and discuss the best AI research, developments, applications and results.

Here are the pre-prints of the accepted papers with their abstracts:

  • “Distantly Supervised Question Parsing” by Hamid Zafar, Maryam Tavakol, Jens Lehmann.
    Abstract: The emergence of structured databases for Question Answering (QA) systems has led to developing methods, in which the problem of learning the correct answer efficiently is based on a linking task between the constituents of the question and the corresponding entries in the database. As a result, parsing the questions in order to determine their main elements, which are required for answer retrieval, becomes crucial. However, most datasets for question answering systems lack gold annotations for parsing, i.e., labels are only available in the form of (question, formal-query, answer). In this paper, we propose a distantly supervised learning framework based on reinforcement learning to learn the mentions of entities and relations in questions. We leverage the provided formal queries to characterize delayed rewards for optimizing a policy gradient objective for the parsing model. An empirical evaluation of our approach shows a significant improvement in the performance of entity and relation linking compared to the state of the art. We also demonstrate that a more accurate parsing component enhances the overall performance of QA systems.
  • “MDE: Multiple Distance Embeddings for Link Prediction in Knowledge Graphs” by Afshin Sadeghi, Damien Graux, Hamed Shariat Yazdi, and Jens Lehmann.
    Abstract: Over the past decade, knowledge graphs became popular for capturing structured domain knowledge. Relational learning models enable the prediction of missing links inside knowledge graphs. More specifically, latent distance approaches model the relationships among entities via a distance between latent representations. Translating embedding models (e.g., TransE) are among the most popular latent distance approaches which use one distance function to learn multiple relation patterns. However, they are not capable of capturing symmetric relations. They also force relations with reflexive patterns to become symmetric and transitive. In order to improve distance based embedding, we propose multi-distance embeddings (MDE). Our solution is based on the idea that by learning independent embedding vectors for each entity and relation one can aggregate contrasting distance functions. Benefiting from MDE, we also develop supplementary distances resolving the above-mentioned limitations of TransE. We further propose an extended loss function for distance based embeddings and show that MDE and TransE are fully expressive using this loss function. Furthermore, we obtain a bound on the size of their embeddings for full expressivity. Our empirical results show that MDE significantly improves the translating embeddings and outperforms several state-of-the-art embedding models on benchmark datasets.
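
The TransE limitation on symmetric relations mentioned in the MDE abstract follows from the triangle inequality: since (h + r - t) + (t + r - h) = 2r, the two directional distances always sum to at least 2·||r||. Both directions can therefore only be scored perfectly when r is the zero vector, which in turn forces h = t. The sketch below checks this bound numerically on random vectors. It illustrates the standard argument only, not the MDE model, and the helper names and random data are assumptions made for the example.

```python
import math
import random


def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def transe_distance(h, r, t):
    """TransE distance ||h + r - t||_2 for the triple (h, r, t)."""
    return norm([hi + ri - ti for hi, ri, ti in zip(h, r, t)])


rng = random.Random(0)  # fixed seed so the check is repeatable
dim = 4
gaps = []
for _ in range(1000):
    h, r, t = ([rng.uniform(-1, 1) for _ in range(dim)] for _ in range(3))
    # Score the relation in both directions, as a symmetric relation requires.
    both_directions = transe_distance(h, r, t) + transe_distance(t, r, h)
    gaps.append(both_directions - 2 * norm(r))

# The gap is never negative (up to floating-point rounding).
print(min(gaps) >= -1e-9)
```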



SANSA 0.7.1 (Semantic Analytics Stack) Released

We are happy to announce SANSA 0.7.1 – the seventh release of the Scalable Semantic Analytics Stack. SANSA employs distributed computing via Apache Spark and Flink in order to allow scalable machine learning, inference and querying capabilities for large knowledge graphs.

You can find usage guidelines and examples at http://sansa-stack.net/user-guide.

The following features are currently supported by SANSA:

  • Reading and writing RDF files in N-Triples, Turtle, RDF/XML, N-Quads, and TRIX formats
  • Reading OWL files in various standard formats
  • Query heterogeneous sources (Data Lake) using SPARQL – CSV, Parquet, MongoDB, Cassandra, JDBC (MySQL, SQL Server, etc.) are supported
  • Support for multiple data partitioning techniques
  • SPARQL querying via Sparqlify, Ontop, and Tensors
  • Graph-parallel querying of RDF using SPARQL (1.0) via GraphX traversals (experimental)
  • RDFS, RDFS Simple and OWL-Horst forward chaining inference
  • RDF graph clustering with different algorithms
  • Terminological decision trees (experimental)
  • Knowledge graph embedding approaches: TransE (beta), DistMult (beta)
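
To illustrate what the forward chaining inference in the list above computes, here is a minimal single-machine sketch in plain Python. SANSA itself runs inference in a distributed fashion on Apache Spark and Flink, and this is not its API. The sketch covers only two RDFS rules (rdfs9 and rdfs11), and the function name and example triples are hypothetical.

```python
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"


def forward_chain(triples):
    """Apply two RDFS rules repeatedly until no new triple appears (naive fixpoint)."""
    inferred = set(triples)
    while True:
        new = set()
        for s, p, o in inferred:
            for s2, p2, o2 in inferred:
                # Only join against subclass axioms whose subject matches our object.
                if p2 != SUBCLASS or o != s2:
                    continue
                if p == SUBCLASS:    # rdfs11: subClassOf is transitive
                    new.add((s, SUBCLASS, o2))
                elif p == TYPE:      # rdfs9: instances inherit superclasses
                    new.add((s, TYPE, o2))
        if new <= inferred:          # fixpoint reached, nothing new derived
            return inferred
        inferred |= new


# Tiny example graph (illustrative data only).
graph = {
    (":Dog", SUBCLASS, ":Mammal"),
    (":Mammal", SUBCLASS, ":Animal"),
    (":rex", TYPE, ":Dog"),
}
closure = forward_chain(graph)
for triple in sorted(closure - graph):
    print(triple)
```

The nested loop is quadratic in the number of triples, which is fine for a toy graph. At scale, the same join is the kind of step that benefits from data partitioning and distributed execution.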

Noteworthy changes or updates since the previous release are:

  • TRIX support
  • A new query engine over compressed RDF data
  • OWL/XML Support

Deployment and getting started:

  • There are template projects for SBT and Maven for Apache Spark as well as for Apache Flink available to get started.
  • The SANSA jar files are in Maven Central, i.e., in most IDEs you can just search for “sansa” to include the dependencies in Maven projects.
  • Example code is available for various tasks.
  • We provide interactive notebooks for running and testing code via Docker.

We want to thank everyone who helped to create this release, in particular the projects Big Data Ocean, SLIPO, QROWD, BETTER, BOOST, MLwin, PLATOON and Simple-ML. Also check out our recent articles in which we describe how to use SANSA for tensor based querying, scalable RDB2RDF query execution, quality assessment and semantic partitioning.

Spread the word by retweeting our release announcement on Twitter. For more updates, please view our Twitter feed and consider following us.

Greetings from the SANSA Development Team

Quantum Natural Language Processing (QNLP) in Oxford

From 5 to 6 December, a conference on QNLP took place at St. Aldate’s Church in Oxford. This event was organized by the Quantum Group at the Department of Computer Science of the University of Oxford with support from the companies Cambridge Quantum Computing (CQC) and IBM. Two members of the SDA team in Dresden, Cedric Möller and Daniel Steinigen, participated in the conference. This was also the first conference ever on the combination of NLP with quantum computing.

Quantum Artificial Intelligence (QAI) has become increasingly interesting for research activities in recent years. Noisy intermediate-scale quantum (NISQ) computers already make it possible to run algorithms and to look for possible advantages for NLP. Since the mathematical foundations of quantum theory are very similar to those of compositional NLP with applied category theory, quantum computers should provide a natural setting for compositional NLP tasks [1].

[1] Zeng, Coecke – Quantum Algorithms for Compositional Natural Language Processing https://arxiv.org/abs/1608.01406