Blog

Papers Accepted at CIKM 2020

We are very pleased to announce that our group got two papers accepted for presentation at CIKM 2020 (the ACM International Conference on Information and Knowledge Management). CIKM seeks to identify challenging problems facing the development of future knowledge and information systems, and to shape future directions of research by soliciting and reviewing high-quality applied and theoretical research findings. An important part of the conference is the Workshops and Tutorials program, which focuses on timely research challenges and initiatives and brings together research papers, industry speakers, and keynote speakers. The program also showcases posters, demonstrations, competitions, and other special events.

  • Evaluating the Impact of Knowledge Graph Context on Entity Disambiguation Models
    By Isaiah Onando Mulang, Kuldeep Singh, Chaitali Prabhu, Abhishek Nadgeri, Johannes Hoffart, and Jens Lehmann.
    Abstract Pretrained Transformer models have emerged as state-of-the-art approaches that learn contextual information from the text to improve the performance of several NLP tasks. These models, albeit powerful, still require specialized knowledge in specific scenarios. In this paper, we argue that context derived from a knowledge graph (in our case: Wikidata) provides enough signals to inform pretrained transformer models and improve their performance for named entity disambiguation (NED) on the Wikidata KG. We further hypothesize that our proposed KG context can be standardized for Wikipedia, and we evaluate the impact of KG context on the state-of-the-art NED model for the Wikipedia knowledge base. Our empirical results validate that the proposed KG context can be generalized (for Wikipedia), and providing KG context in transformer architectures considerably outperforms the existing baselines, including the vanilla transformer models.
  • MLM: A Benchmark Dataset for Multitask Learning with Multiple Languages and Modalities
    By Jason Armitage, Endri Kacupaj, Golsa Tahmasebzadeh, Swati, Maria Maleshkova, Ralph Ewerth, and Jens Lehmann.
    Abstract In this paper, we introduce the MLM (Multiple Languages and Modalities) dataset – a new resource to train and evaluate multitask systems on samples in multiple modalities and three languages. The generation process and inclusion of semantic data provide a resource that further tests the ability for multitask systems to learn relationships between entities. The dataset is designed for researchers and developers who build applications that perform multiple tasks on data encountered on the web and in digital archives. A second version of MLM provides a geo-representative subset of the data with weighted samples for countries of the European Union. We demonstrate the value of the resource in developing novel applications in the digital humanities with a motivating use case and specify a benchmark set of tasks to retrieve modalities and locate entities in the dataset. Evaluation of baseline multitask and single task systems on the full and geo-representative versions of MLM demonstrate the challenges of generalising on diverse data. In addition to the digital humanities, we expect the resource to contribute to research in multimodal representation learning, location estimation, and scene understanding.
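
As a rough, hypothetical illustration of the KG-context idea in the first paper above: one simple way to inject knowledge graph context into a pretrained transformer is to verbalise a candidate entity's Wikidata triples and feed them as the second segment of a sentence pair. The model choice, the verbalisation, and the pairing scheme below are our own illustrative assumptions, not the paper's exact setup.

```python
# Hedged sketch: score a mention against a candidate entity whose Wikidata
# triples have been verbalised into text. Model choice and context format
# are illustrative assumptions.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

sentence = "Paris announced a new climate plan."
# Hypothetical KG context for one candidate entity (Q90), verbalised triples.
kg_context = "Paris, instance of: city, capital of: France, located in: Île-de-France"

# Encoding mention text and KG context as a sentence pair lets the
# transformer attend across both when scoring the candidate.
inputs = tokenizer(sentence, kg_context, return_tensors="pt", truncation=True)
print(model(**inputs).logits)
```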

Paper Published in the Journal of Web Semantics

We are very pleased to announce that our paper “No One is Perfect: Analysing the Performance of Question Answering Components over the DBpedia Knowledge Graph” has been published in the Journal of Web Semantics. The Journal of Web Semantics is an interdisciplinary journal covering research and applications across the subject areas that contribute to the development of a knowledge-intensive and intelligent service Web.

Here is the pre-print of the published paper with its abstract:

No One is Perfect: Analysing the Performance of Question Answering Components over the DBpedia Knowledge Graph
By Kuldeep Singh, Ioanna Lytra, Arun Sethupat Radhakrishna, Saeedeh Shekarpour, Maria Esther Vidal, and Jens Lehmann.
Abstract Question answering (QA) over knowledge graphs has gained significant momentum over the past five years due to the increasing availability of large knowledge graphs and the rising importance of question answering for user interaction. Existing QA systems have been extensively evaluated as black boxes and their performance has been characterised in terms of average results over all the questions of benchmarking datasets (i.e. macro evaluation). Albeit informative, macro evaluation studies do not provide evidence about QA components’ strengths and concrete weaknesses. Therefore, the objective of this article is to analyse and micro evaluate available QA components in order to comprehend which question characteristics impact their performance. For this, we measure the accuracy of 29 components reused in QA frameworks for the DBpedia knowledge graph at the level of individual questions and with respect to different question features, using state-of-the-art benchmarks. As a result, we provide a perspective on collective failure cases, study the similarities and synergies among QA components for different component types, and suggest the characteristics preventing them from effectively solving the corresponding QA tasks. Finally, based on these extensive results, we present conclusive insights for future challenges and research directions in the field of question answering over knowledge graphs.
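
To make the macro/micro distinction concrete, here is a toy sketch with invented column names and values: macro evaluation averages correctness over all questions, while micro evaluation breaks results down by question features.

```python
# Toy illustration of macro vs. micro evaluation of one QA component.
import pandas as pd

results = pd.DataFrame({
    "question": ["q1", "q2", "q3", "q4"],
    "answer_type": ["person", "person", "date", "date"],  # a question feature
    "correct": [1, 0, 1, 1],
})

print("macro accuracy:", results["correct"].mean())      # one averaged number
print(results.groupby("answer_type")["correct"].mean())  # per-feature accuracy
```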

Papers Accepted at DEXA 2020

We are very pleased to announce that our group got two papers accepted for presentation at DEXA 2020 (International Conference on Database and Expert Systems Applications). Since 1990, DEXA has been an annual international conference that showcases state-of-the-art research activities in database, information, and knowledge systems. DEXA provides a forum to present research results and to examine advanced applications in the field. The conference and its associated workshops offer an opportunity for developers, scientists, and users to extensively discuss requirements, problems, and solutions in database, information, and knowledge systems.

Here are the pre-prints of the accepted papers with their abstracts:

  • Unveiling Relations in the Industry 4.0 Standards Landscape based on Knowledge Graph Embeddings
    By Ariam Rivas, Irlán Grangel-González, Diego Collarana, Jens Lehmann, and Maria-Esther Vidal.
    Abstract Industry 4.0 (I4.0) standards and standardization frameworks have been proposed with the goal of empowering interoperability in smart factories. These standards enable the description and interaction of the main components, systems, and processes inside a smart factory. Due to the growing number of frameworks and standards, there is an increasing need for approaches that automatically analyze the landscape of I4.0 standards. Standardization frameworks classify standards according to their functions into layers and dimensions. However, similar standards can be classified differently across the frameworks, thus producing interoperability conflicts among them. Semantic-based approaches that rely on ontologies and knowledge graphs have been proposed to represent standards, known relations among them, as well as their classification according to existing frameworks. Albeit informative, the structured modeling of the I4.0 landscape only provides the foundations for detecting interoperability issues. Thus, graph-based analytical methods able to exploit the knowledge encoded by these approaches are required to uncover alignments among standards. We study the relatedness among standards and frameworks based on community analysis to discover knowledge that helps to cope with interoperability conflicts between standards. We use knowledge graph embeddings to automatically create these communities, exploiting the meaning of the existing relationships. In particular, we focus on the identification of similar standards, i.e., communities of standards, and analyze their properties to detect unknown relations. We empirically evaluate our approach on a knowledge graph of I4.0 standards using the Trans* family of embedding models for knowledge graph entities. Our results are promising and suggest that relations among standards can be detected accurately.
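
A rough sketch of the analysis idea, not the paper's exact pipeline: once the standards have been embedded with a Trans*-style model, similar standards can be grouped into candidate communities by clustering their vectors. The standard names are real, but the random vectors below merely stand in for trained embeddings.

```python
# Hedged sketch: cluster (stand-in) knowledge graph embeddings of I4.0
# standards to surface communities of related standards.
import numpy as np
from sklearn.cluster import KMeans

standards = ["OPC UA", "MQTT", "AutomationML", "STEP"]  # illustrative subset
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(standards), 50))  # stand-in for TransE vectors

communities = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
for name, c in zip(standards, communities):
    print(f"{name}: community {c}")
```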

  • SCODIS: Job Advert-Derived Time Series for High-Demand Skillset Discovery and Prediction
    By Elisa Margareth Sibarani and Simon Scerri.
    Abstract In this paper, we consider a dataset compiled from online job adverts for consecutive fixed periods to identify whether repeated and automated observation of the skills requested in the job market can be used to predict the relevance of skillsets and the predominance of skills in the near future. The data, consisting of co-occurring skills observed in job adverts, is used to generate a skills graph whose nodes are skills and whose edges denote co-occurrence. To better observe and interpret the evolution of this graph over a period of time, we investigate two clustering methods that can reduce the complexity of the graph. The best-performing method, evaluated according to its modularity value (0.72, against 0.41 for the second method), is then used as a basis for the SCODIS framework, which enables the discovery of in-demand skillsets based on the observation of skill clusters in a time series. The framework is used to conduct a time series forecasting experiment, resulting in F-measures of 72%, which confirms that, to an extent and with enough previous observations, it is indeed possible to identify which skillsets will dominate demand for a specific sector in the short term.
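
A minimal sketch of the skills-graph construction and modularity-based clustering described above, assuming networkx; the skill pairs and weights are invented.

```python
# Hedged sketch: build a weighted skill co-occurrence graph from job adverts
# and cluster it, reporting the modularity of the resulting partition.
import networkx as nx
from networkx.algorithms import community

G = nx.Graph()
# Each edge is a pair of skills that co-occurred in adverts, weighted by count.
G.add_weighted_edges_from([
    ("python", "sql", 12), ("python", "spark", 8), ("sql", "spark", 5),
    ("css", "html", 15), ("html", "javascript", 9),
])

clusters = community.greedy_modularity_communities(G, weight="weight")
print("modularity:", community.modularity(G, clusters, weight="weight"))
print([sorted(c) for c in clusters])
```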

Book “Knowledge Graphs and Big Data Processing” Published as Open Access

One of the core missions of the LAMBDA EU Project is to produce learning material about Big Data Analytics. We are happy to announce that the book “Knowledge Graphs and Big Data Processing” has been published as open access. This was a titanic effort by SDA and Fraunhofer IAIS colleagues.
The book can be downloaded from here. We contributed to the chapters Big Data Outlook, Tools, and Architectures (chapter 3), Scalable Knowledge Graph Processing Using SANSA (chapter 7), and Context-Based Entity Matching for Big Data (chapter 8).

PyKEEN 1.0 Release

As a member of the PyKEEN community project, we are happy to announce the release of PyKEEN 1.0. PyKEEN is a software package to train and evaluate knowledge graph embedding models.

The following features are currently supported by PyKEEN:

  • 23 interaction models (ComplExLiteral, ComplEx, ConvE, ConvKB, DistMult, DistMultLiteral, ERMLP, ERMLPE, HolE, KG2E, NTN, ProjE, RESCAL, RGCN, RotatE, SimplE, StructuredEmbedding, TransD, TransE, TransH, TransR, TuckER, and UnstructuredModel)
  • 7 loss functions (including Binary Cross Entropy, Cross Entropy, Margin Ranking Loss, Mean Square Error, Self-Adversarial Negative Sampling Loss, and Softplus Loss)
  • 3 regularizers (LP-norm based regularizer, Power Sum regularizer, and Combined regularizer, i.e., convex combination of regularizers)
  • 2 training approaches (LCWA and sLCWA)
  • 2 negative samplers (Uniform and Bernoulli)
  • Hyper-parameter optimization (using Optuna)
  • Early stopping
  • 6 evaluation metrics (adjusted mean rank, mean rank, mean reciprocal rank, hits@k, average-precision score, and ROC-AUC score)
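
As a minimal usage sketch, assuming the built-in Nations toy dataset and default hyper-parameters (see the PyKEEN documentation for the full API):

```python
# Train and evaluate a KGE model end to end with PyKEEN's pipeline.
from pykeen.pipeline import pipeline

result = pipeline(
    dataset="Nations",      # small built-in benchmark dataset
    model="TransE",         # any interaction model listed above works here
    training_loop="sLCWA",  # one of the two supported training approaches
)
print(result.metric_results.get_metric("mean_reciprocal_rank"))
result.save_to_directory("transe_nations")  # persists model and results
```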

PyKEEN was used to extensively test existing KGE models on a wide range of configurations. You can find those results in our paper. We want to thank everyone who helped to create this release. For more updates, please view our Twitter feed and consider following us.

Greetings from the PyKEEN-Team

Paper Accepted at IJCNN 2020

The International Joint Conference on Neural Networks (IJCNN) covers a wide range of topics in the field of neural networks, from biological neural networks to artificial neural computation. IJCNN 2020 will be held as part of the IEEE World Congress on Computational Intelligence (IEEE WCCI), the world’s largest technical event in the field of computational intelligence.

Here is the pre-print of the accepted paper with its abstract:

Let the Margin SlidE± for Knowledge Graph Embeddings via a Correntropy Objective
By Mojtaba Nayyeri, Xiaotian Zhou, Sahar Vahdati, Reza Izanloo, Hamed Shariat Yazdi and Jens Lehmann.
Abstract Embedding models based on translation and rotation have gained significant attention in link prediction tasks for knowledge graphs. Most of the earlier works have modified the score function of knowledge graph embedding models in order to improve the performance of link prediction tasks. However, as proven theoretically and experimentally, the performance of such embedding models strongly depends on the loss function. One of the prominent approaches in defining loss functions is to set a margin between positive and negative samples during the learning process. This task is particularly important because it directly affects the learning and ranking of triples and ultimately defines the final output. Approaches for setting a margin have the following challenges: a) the length of the margin has to be fixed manually; b) without a fixed point for the center of the margin, the scores of positive triples are not necessarily enforced to be sufficiently small to fulfill the translation/rotation from head to tail by using the relation vector. In this paper, we propose a family of loss functions dubbed SlidE± to address the aforementioned challenges. The formulation of the proposed loss functions enables an automated technique to adjust the length of the margin adaptive to a defined center. In our experiments on a set of standard benchmark datasets including Freebase and WordNet, the effectiveness of our approach is confirmed for training knowledge graph embedding models, specifically TransE and RotatE as a case study, on link prediction tasks.
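
For context, the abstract's challenges (a) and (b) refer to the standard fixed-margin ranking loss used with distance-based models such as TransE, where lower scores are better. The formulation below is textbook material, not the SlidE± loss itself, which is defined in the paper:

```latex
% Standard margin ranking loss for a distance-based KGE model.
% f(h,r,t): distance score of a triple (lower is better for true triples),
% gamma: the manually fixed margin, T+/T-: positive and corrupted triples.
\[
  \mathcal{L} = \sum_{(h,r,t) \in T^{+}} \sum_{(h',r',t') \in T^{-}}
  \max\bigl(0,\ \gamma + f(h,r,t) - f(h',r',t')\bigr)
\]
```

Here the margin length gamma must be fixed by hand, and nothing anchors the positive scores to a fixed center; these are exactly the two points SlidE± addresses.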

Paper Accepted at LREC 2020

We are very pleased to announce that our group got a paper accepted for presentation at LREC 2020 (International Conference on Language Resources and Evaluation). LREC has become the major event on Language Resources (LRs) and Evaluation for Language Technologies (LT). The aim of LREC is to provide an overview of the state of the art, explore new R&D directions and emerging trends, and exchange information regarding LRs and their applications, evaluation methodologies and tools, ongoing and planned activities, industrial uses and needs, and requirements coming from the e-society, with respect both to policy issues and to technological and organisational ones.

Here is the pre-print of the accepted paper with its abstract:

Treating Dialogue Quality Evaluation as an Anomaly Detection Problem
By Rostislav Nedelchev, Jens Lehmann and Ricardo Usbeck.
Abstract Dialogue systems for interaction with humans have been enjoying increased popularity in research and industry. To this day, the best way to estimate their success is through human evaluation rather than automated approaches, despite the abundance of work done in the field. In this paper, we investigate the effectiveness of framing dialogue evaluation as an anomaly detection task. The paper looks into four dialogue modeling approaches and how their objective functions correlate with human annotation scores. A high-level perspective exhibits negative results. However, a more in-depth look shows limited potential for using anomaly detection for evaluating dialogues.
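
As a toy illustration of the kind of analysis described, correlating a dialogue model's objective values with human annotation scores; all numbers below are invented:

```python
# Hedged sketch: rank-correlate per-dialogue model losses with human ratings.
from scipy.stats import spearmanr

model_losses = [2.3, 1.1, 3.8, 0.9, 2.7]  # hypothetical objective values
human_scores = [2, 4, 1, 5, 3]            # hypothetical quality annotations

rho, p_value = spearmanr(model_losses, human_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```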

Papers Accepted at ESWC

We are very pleased to announce that our group got five papers (three papers in the main tracks and two in the Cleopatra Workshop) accepted for presentation at ESWC 2020 (European Semantic Web Conference). ESWC is a major venue for discussing the latest scientific results and technology innovations around semantic technologies. Building on its past success, ESWC is seeking to broaden its focus to span other relevant research areas in which Web semantics plays an important role.

Here are the pre-prints, with their abstracts, of the papers accepted in the main tracks:

  • A Knowledge Graph for Industry 4.0
    By Sebastian R. Bader, Irlan Grangel-Gonzalez, Priyanka Nanjappa, Maria-Esther Vidal, and Maria Maleshkova.
    Abstract One of the most crucial tasks for today’s knowledge workers is to get and retain a thorough overview of the latest state of the art. Especially in dynamic and evolving domains, the number of relevant sources is constantly increasing, updating and overruling previous methods and approaches. For instance, the digital transformation of manufacturing systems, called Industry 4.0, currently faces an overwhelming amount of standardization efforts and reference initiatives, resulting in a sophisticated information environment. We propose a structured dataset in the form of a semantically annotated knowledge graph for Industry 4.0 related standards, norms, and reference frameworks. The graph provides a Linked Data-conform collection of annotated, classified reference guidelines, supporting newcomers and experts alike in understanding how to implement Industry 4.0 systems. We illustrate the suitability of the graph for various use cases and its already existing applications, present the maintenance process, and evaluate its quality.

  • VQuAnDa: Verbalization QUestion Answering DAtaset
    By Endri Kacupaj, Hamid Zafar, Jens Lehmann and Maria Maleshkova.
    Abstract Question Answering (QA) systems over Knowledge Graphs (KGs) aim to provide a concise answer to a given natural language question. Despite the significant evolution of QA methods over the past years, there are still some core lines of work which are lagging behind. This is especially true for methods and datasets that support the verbalization of answers in natural language. Specifically, to the best of our knowledge, none of the existing Question Answering datasets provide any verbalization data for the question-query pairs. Hence, we aim to fill this gap by providing the first QA dataset, VQuAnDa, that includes the verbalization of each answer. We base VQuAnDa on a commonly used large-scale QA dataset, LC-QuAD, in order to support compatibility and continuity with previous work. We complement the dataset with baseline scores for measuring future training and evaluation work, using a set of standard sequence-to-sequence models and sharing the results of the experiments. This resource empowers researchers to train and evaluate a variety of models to generate answer verbalizations.

  • Embedding-based Recommendations on Scholarly Knowledge Graphs
    By Mojtaba Nayyeri, Sahar Vahdati, Xiaotian Zhou, Hamed Shariat Yazdi, and Jens Lehmann.
    Abstract The increasing availability of scholarly metadata in the form of Knowledge Graphs (KGs) offers opportunities for studying the structure of scholarly communication and the evolution of science. Such KGs build the foundation for knowledge-driven tasks, e.g., link discovery, prediction and entity classification, which allow recommendation services to be provided. Knowledge graph embedding (KGE) models have been investigated for such knowledge-driven tasks in different application domains. One of the applications of KGE models is to provide link predictions, which can also be viewed as a foundation for recommendation services; e.g., high-confidence “co-author” links in a scholarly knowledge graph can be seen as suggested collaborations. In this paper, KGEs are reconciled with a specific loss function (Soft Margin) and examined with respect to their performance for the co-authorship link prediction task on scholarly KGs. The results show a significant improvement in the accuracy of the experimented KGE models on the considered scholarly KGs using this specific loss. TransE with Soft Margin (TransE-SM) obtains a score of 79.5% Hits@10 for the co-authorship link prediction task, while the original TransE obtains 77.2% on the same task. In terms of accuracy and Hits@10, TransE-SM also outperforms other state-of-the-art embedding models such as ComplEx, ConvE and RotatE in this setting. The predicted co-authorship links have been validated by evaluating the profiles of scholars.
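
For reference, the TransE scoring function that Soft Margin is applied to is the standard translation-based distance (a textbook definition; the Soft Margin objective itself is specified in the paper):

```latex
% TransE models a relation r as a translation from head to tail embedding;
% true triples should obtain a small distance.
\[
  f(h, r, t) = \lVert \mathbf{h} + \mathbf{r} - \mathbf{t} \rVert
\]
```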

Here are the pre-prints, with their abstracts, of the papers accepted in the Cleopatra Workshop:

  • Training Multimodal Systems for Classification with Multiple Objectives
    By Jason Armitage, Shramana Thakur, Rishi Tripathi, Jens Lehmann and Maria Maleshkova.
    Abstract We learn about the world from a diverse range of sensory information. Automated systems lack this ability and are confined to processing information presented in only a single format. Adapting architectures to learn from multiple modalities creates the potential to learn rich representations, but current systems only deliver marginal improvements over unimodal approaches. Neural networks learn sampling noise during training, with the result that performance on unseen data is degraded. This research introduces a second objective over the multimodal fusion process learned with variational inference. Regularisation methods are implemented in the inner training loop to control variance, and the modular structure stabilises performance as additional neurons are added to layers. This framework is evaluated on a multilabel classification task with textual and visual inputs to demonstrate the potential for multiple objectives and probabilistic methods to lower variance and improve generalisation.
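
The following is a minimal PyTorch sketch of the general idea, a classification objective plus a KL regulariser over a variational fusion of two modalities. Layer sizes, the KL weight, and all names are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: fuse text and image features through a Gaussian latent
# variable and train with classification loss + KL divergence (the second
# objective over the fusion process).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VariationalFusion(nn.Module):
    def __init__(self, text_dim=768, image_dim=2048, latent_dim=128, n_labels=10):
        super().__init__()
        fused = text_dim + image_dim
        self.mu = nn.Linear(fused, latent_dim)      # mean of q(z|x)
        self.logvar = nn.Linear(fused, latent_dim)  # log-variance of q(z|x)
        self.classifier = nn.Linear(latent_dim, n_labels)

    def forward(self, text_feat, image_feat):
        h = torch.cat([text_feat, image_feat], dim=-1)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterisation
        return self.classifier(z), mu, logvar

def loss_fn(logits, labels, mu, logvar, beta=0.1):
    # Multilabel classification loss plus KL divergence to a unit Gaussian.
    bce = F.binary_cross_entropy_with_logits(logits, labels)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return bce + beta * kl
```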

Paper Accepted at IFIP SEC

We are very pleased to announce that our group got a paper accepted for presentation at the International Information Security and Privacy Conference (IFIP SEC). The IFIP SEC conferences aim to bring together primarily researchers, but also practitioners from academia, industry and governmental institutions to elaborate and discuss IT Security and Privacy Challenges that we are facing today and will be facing in the future.

Here is the pre-print of the accepted paper with its abstract:

Establishing a Strong Baseline for Privacy Policy Classification
By Najmeh Mousavi Nejad, Pablo Jabat, Rostislav Nedelchev, Simon Scerri, and Damien Graux.
Abstract Digital service users are routinely exposed to Privacy Policy consent forms, through which they enter contractual agreements consenting to the specifics of how their personal data is managed and used. Nevertheless, despite renewed importance following legislation such as the European GDPR, a majority of people still ignore policies due to their length and complexity. To counteract this potentially dangerous reality, in this paper we present three different models that are able to assign pre-defined categories to privacy policy paragraphs using supervised machine learning. In order to train our neural networks, we exploit a dataset containing 115 privacy policies defined by US companies. An evaluation shows that our approach outperforms the state of the art by 5% over comparable and previously reported F1 values. In addition, our method is completely reproducible, since we provide open access to all resources. Given these two contributions, our approach can be considered a strong baseline for privacy policy classification.
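
To illustrate the task itself rather than the paper's neural models, here is a minimal sketch of paragraph-level policy classification with an off-the-shelf linear baseline; the paragraphs and category labels are invented:

```python
# Hedged sketch: assign pre-defined categories to privacy policy paragraphs.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

paragraphs = [
    "We share your data with third-party advertisers.",
    "You may delete your account and associated data at any time.",
]
labels = ["Third Party Sharing", "User Choice/Control"]  # illustrative categories

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(paragraphs, labels)
print(clf.predict(["Your information may be disclosed to our partners."]))
```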

Paper Accepted at ACM SAC

We are very pleased to announce that our group got a paper accepted for presentation at ACM SAC. The ACM Symposium on Applied Computing (SAC) has been a primary and international forum for applied computer scientists, computer engineers and application developers to gather, interact and present their work.

Here is the pre-print of the accepted paper with its abstract:

Towards the Semantic Formalization of Science
By Said Fathalla, Sören Auer, and Christoph Lange.
Abstract The past decades have witnessed a huge growth in scholarly information published on the Web, mostly in unstructured or semi-structured formats, which hampers scientific literature exploration and scientometric studies. Past studies on ontologies for structuring scholarly information focused on describing the components of scholarly articles, such as document structure, metadata, and bibliographies, rather than the scientific work itself. Over the past four years, we have been developing the Science Knowledge Graph Ontologies (SKGO), a set of ontologies for modeling the research findings in various fields of modern science, resulting in a knowledge graph. Here, we introduce this ontology suite and discuss the design considerations taken into account during its development. We deem that within the next few years, a science knowledge graph is likely to become a crucial component for organizing and exploring scientific work.