Thesis Announcements

The Smart Data Analytics group is always looking for good students to write theses. The topics can be in one of the following broad areas:

For Students: Thesis Registration Workflow with Prof. Lehmann as First Supervisor during the COVID-19 crisis

  • Please find a thesis topic (see below) including a mentor (please contact those indicated there directly) and a second examiner (ask other professors and/or your mentor).
  • The mentor needs to get approval from Prof. Lehmann.
  • Please contact Martina Doelp for the thesis registration form.
  • Fill in and sign the form and send it to Martina Doelp.
  • Prof. Lehmann will then sign it (on paper) and Martina Doelp will send it to the second examiner (via mail) and forward it to the exam office afterwards (on paper). You will get a copy of the registration form.


Please note that the list below is only a small sample of possible topics and ideas. Please contact us to discuss further, to find new topics, or to suggest a topic of your own.

Open Theses

Topic | Level | Contact Person
Knowledge Graph Embeddings Benchmarking
Knowledge graph embedding (KGE) methods have become the standard toolkit for analyzing and learning from data on knowledge graphs. They have been successfully applied to many domains including chemistry, physics, social sciences and bioinformatics. As the field grows, it becomes critical to identify the architectures and key mechanisms which generalize across knowledge graph sizes, enabling us to tackle larger, more complex datasets and domains. In this thesis you should study and apply the main tests, metrics and testing models.
Technology to use: Machine Learning, Knowledge Graphs
Level: M | Contact person: Afshin Sadeghi, Dr. Diego Esteves
eCl@ss Vocabulary goes Linked Data
The eCl@ss [1] vocabulary is currently the de facto standard to describe machines, components, processes and any form of assets in the upcoming Industry 4.0. However, the vocabulary catalog itself is only provided in a limited XML serialization. Smart machines and AI-enabled devices, on the other hand, require meaningful terminologies and formal descriptions. While eCl@ss contains this kind of information, it is not (yet) published in proper ways or formats.
Linked Data [2] can help to solve this challenge. The machine-readable expressions in various serializations (JSON, XML, Turtle, NTriples, …) enable the unambiguous and seamless integration of interconnected data. Combining eCl@ss with Linked Data technologies is therefore a very promising approach for Industry 4.0.
The goal of this Bachelor thesis is the prototypical implementation of a Linked Data server serving eCl@ss content of one distinct domain. The thesis shall examine the feasibility of a state-of-the-art technology stack to translate and host the vocabulary catalog in Linked Data, similar to the work of Hepp and Radinger [3]. While this work solved the task for eCl@ss version 5.1.4, this thesis will target version 11.0 and in addition regard the specific requirements of Industry 4.0 applications.
Requirements: programming skills (Java, C, Python or any other commonly used language); interest in Web servers and internet-based applications
[1] https://www.eclass.eu/standard/
[2] https://en.wikipedia.org/wiki/Linked_data
[3] http://www.heppnetz.de/projects/eclassowl/
Level: B | Contact person: Sebastian Bader
Ontology editor in a collaborative distributed and heterogeneous environment with a conflict resolution
VoCoEditor: What is needed is an ontology editor that supports different serialization formats of RDF, works in collaborative, distributed and heterogeneous environments, provides syntax highlighting, validation and auto-completion, detects multiple syntax errors, and shows hints for correcting the detected errors. VoCoEditor is supposed to provide these features in a multi-user environment where users can view, edit and delete the text or files included in their ontology. The ontology can be either version-control-based (hosted on a version control system, e.g., GitHub) or local-file-based (uploaded locally). Furthermore, VoCoEditor should authenticate version-control-based users with the authentication mechanisms activated by the version control system, e.g., token authentication. Usually, authentication is required when the repository is private, and it is required before any further Git actions. Finally, VoCoEditor will handle the synchronization of the ontology when multiple users are editing at the same time, which avoids the conflicts that arise when a user does not have the latest version of the ontology hosted in the repository.
[1] TurtleEditor: A web-based RDF editor to support distributed ontology development on repository hosting platforms.
[2] https://www.w3.org/wiki/Ontology_editors
Level: M | Contact person: Ahmad Hemid, Dr. Abderrahmane Khiat
Solving Mini Chess via Distributed Deep Reinforcement Learning and Proof Number Search

In 1956, a chess program beat a (novice) human opponent for the first time in a chess variant called Los Alamos chess. Los Alamos chess has a 6×6 board and is therefore more suitable for computer programs – such variants are called “mini chess”. While chess engines have improved dramatically over the past decades, the game-theoretic value (either a draw or a win for White can be forced) of Los Alamos chess is still unknown. In this thesis, the goal is to prove the game-theoretic value of the game using a combination of a) deep reinforcement learning techniques for determining the best move in a particular position, b) position solvers based on proof number search, c) endgame tablebases generated for Los Alamos chess and d) a distributed proof tree manager that allows the approach to be executed on a cluster. If the student is capable of implementing the approach on a small cluster, we will apply for an execution on a large-scale cluster consisting of hundreds of computing nodes. Given that substantial resources may be invested in this thesis, students applying for it should have excellent grades, excellent programming skills and enthusiasm for game solving or chess.
Level: M | Contact person: Prof. Dr. Jens Lehmann
Conversational AI & Climate Change
Climate Change is one of the major challenges humanity has to face. One of the key elements required for addressing its consequences is to provide objective information to citizens. Conversational AI methods (aka chatbots, speech assistants, dialogue systems, …) have become increasingly important over the past few years. In this master thesis, students can contribute to building a climate change chatbot by addressing one of the main challenges in deep learning and natural language processing below:
  • Improving reading comprehension techniques by transfer learning from large text corpora such as SQuAD to text documents describing climate change
  • Improving the translation of natural language questions to queries against (climate) knowledge graphs
  • Interactive question answering techniques for capturing user feedback
Level: M | Contact person: Dr. Ricardo Usbeck, Prof. Dr. Jens Lehmann
OWL Git-Diff: Towards an expressive Git-Diff tool for OWL Ontologies
Current git-diff tools use a text-based comparison to differentiate between OWL ontologies, so Git users are not well informed about the actual semantic changes to those ontologies. OWL Git-Diff will be integrated into version control systems to express the changes between OWL ontology versions in a user-friendly way. The Ecco [1] tool compares OWL 2 ontologies, but its output is XML-based text that cannot be easily integrated into such systems, and it does not support the various serialization formats of OWL. OWL Git-Diff, in contrast, is intended to integrate well with different version control systems and to show expressive messages that allow users to understand the various changes in an ontology. As an extension, OWL Git-Diff could show graphically the changes across a sequence of ontology versions.
[1] Gonçalves, Rafael S. et al. “Ecco: A Hybrid Diff Tool for OWL 2 ontologies.” OWLED (2012).
Level: M | Contact person: Ahmad Hemid, Dr. Abderrahmane Khiat
Generating Creative Ideas for Crowd Ideation Platforms
Subtopic 1: knowledge graph-based approach
Subtopic 2: machine learning-based approach
Research on creativity claims that new ideas come through a combination of ideas, also known as combinational creativity, for example Bracelet + Lifebuoy = Self-Rescue Bracelet. However, it is quite challenging for computer-based approaches to generate valuable combinations and recognize their value (Boden 2009). The goal of this thesis is to develop a solution to obtain valuable combinations of ideas.
  • For the knowledge graph-based approach, the solution consists of (1) representing ideas more formally (close to a logic representation), (2) finding relations between ideas and (3) employing operators to produce new idea compositions.
  • For the machine learning-based approach, the solution consists of structuring ideas into Purpose and Mechanism and then employing text generation techniques such as a Markov chain model to produce new idea combinations.
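To give a feel for the text generation step in the machine learning-based approach, here is a minimal sketch of a word-level Markov chain (bigram) generator. It is only an illustration under stated assumptions: the idea descriptions are invented, the function names `build_bigram_model` and `generate` are hypothetical, and the Purpose/Mechanism structuring mentioned above is not modelled.

```python
import random
from collections import defaultdict


def build_bigram_model(texts):
    """Map each word to the list of words observed directly after it."""
    model = defaultdict(list)
    for text in texts:
        words = text.lower().split()
        for current, following in zip(words, words[1:]):
            model[current].append(following)
    return dict(model)


def generate(model, start, max_words=8, rng=None):
    """Walk the chain from `start`, sampling one observed successor per step."""
    rng = rng or random.Random()
    words = [start]
    # Stop at the length limit or when the last word has no known successor.
    while len(words) < max_words and words[-1] in model:
        words.append(rng.choice(model[words[-1]]))
    return " ".join(words)


# Hypothetical idea descriptions, invented for illustration only.
ideas = [
    "a bracelet that inflates in water",
    "a lifebuoy that folds into a pocket",
    "a bracelet that tracks swimming distance",
]

model = build_bigram_model(ideas)
print(generate(model, "a", rng=random.Random(0)))
```

Because the three descriptions share the word "that", the chain can cross from one idea into another at that point, which is the simplest form of recombination. Judging whether a generated combination is valuable is the hard part and is not addressed by this sketch.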

Keywords: Combinatorial Creativity, Crowd Ideation, Creativity, Machine Learning, Description Logics, Knowledge Graphs, Information Extraction.
Level: M | Contact person: Dr. Abderrahmane Khiat
Applying Knowledge graph embeddings for Context-aware Question Answering
The task of Question Answering (QA) faces new challenges when applied in scenarios with frequently changing information, such as a driving car. Current semantic parsing approaches rely on extracting named entities and the corresponding predicates from the input and matching them against patterns in static knowledge bases. So far, there has been little to no effort to include knowledge about the environment (i.e. context) in the QA pipeline. To improve the performance of so-called context-aware QA, you will work on solutions that integrate different graph embedding approaches into the QA process. Please refer to the job description for further information.
Level: M | Contact person: Jewgeni Rose
Smart Home – Acquisition of Individual Knowledge in the Customer Service Environment
Refer to the Miele job description for further information.
Level: M | Contact person: Giulio Napolitano
Scalable graph kernels for RDF data
Develop graph kernels for RDF data and use traditional machine learning methods for classification.
Level: B, M | Contact person: Dr. Hajira Jabeen
Distributed Knowledge graph Clustering
Clustering of heterogeneous data contained in knowledge graphs.
Level: B, M | Contact person: Dr. Hajira Jabeen
Distributed Anomaly Detection in RDF
Detecting anomalies in data is a vital task, with numerous high-impact applications in areas such as security, finance, health care, and law enforcement. Numerous techniques have been developed in past years for spotting outliers and anomalies in unstructured collections of multi-dimensional points. With graph data becoming ubiquitous, techniques for structured graph data have recently come into focus. As objects in graphs have long-range correlations, a suite of novel techniques has been developed for anomaly detection in graph data.
Level: B, M | Contact person: Dr. Hajira Jabeen
PyTorch Integration in Spark
PyTorch is an open source deep learning platform that provides a seamless path from research prototyping to production deployment. This thesis integrates PyTorch into Apache Spark following the integration guidelines and runs KG embedding models as preliminary tests. See https://docs.databricks.com/applications/deep-learning/spark-integration.html
Level: B | Contact person: Dr. Hajira Jabeen
Use of Ontology information in Knowledge graph embeddings
Knowledge graph embedding represents entities as vectors in a common vector space. The objective of this work is to extend existing KG embedding models such as TransE or ConvE so that they include schema information, and to compare the performance of the extended models with the originals.
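As background, the following is a minimal sketch of how TransE scores a triple: a fact (head, relation, tail) is considered plausible when head + relation lies close to tail in the vector space. The vectors and entity names are hand-picked toy values, not trained embeddings, and the sketch shows only the plain baseline without any schema information.

```python
import math


def transe_score(head, relation, tail):
    """TransE plausibility: negative L2 distance between head + relation and tail."""
    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


# Toy 2-dimensional vectors, chosen by hand for illustration (not trained).
embeddings = {
    "Bonn": [0.0, 1.0],
    "Germany": [1.0, 1.0],
    "Paris": [3.0, -1.0],
}
located_in = [1.0, 0.0]

# A higher (less negative) score means a more plausible triple.
true_score = transe_score(embeddings["Bonn"], located_in, embeddings["Germany"])
false_score = transe_score(embeddings["Paris"], located_in, embeddings["Germany"])
print(true_score, false_score)
```

A schema-aware extension could, for example, add terms to this score that depend on the classes of head and tail; which form works best is exactly the open question of the thesis.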
Level: B, M | Contact person: Dr. Hajira Jabeen
Negative sampling in Knowledge graph embeddings
Knowledge graph embedding represents entities as vectors in a common vector space. The objective of this work is to extend existing KG embedding models such as TransE or ConvE with more efficient and effective negative sampling methods, and to compare the performance.
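For orientation, a common baseline for negative sampling is uniform corruption: replace the head or the tail of a known triple with a random entity and reject candidates that are themselves known facts. The sketch below illustrates this baseline only; the function name `corrupt_triple` and the toy triples are assumptions made for the example.

```python
import random


def corrupt_triple(triple, entities, known_triples, rng):
    """Return a negative triple by replacing the head or the tail with a random entity.

    Candidates that are already known facts are rejected and resampled.
    Assumes at least one corruption of the triple is not a known fact.
    """
    head, relation, tail = triple
    while True:
        replacement = rng.choice(entities)
        if rng.random() < 0.5:
            candidate = (replacement, relation, tail)
        else:
            candidate = (head, relation, replacement)
        if candidate not in known_triples:
            return candidate


# Toy knowledge graph, invented for illustration.
known_triples = {
    ("Bonn", "locatedIn", "Germany"),
    ("Paris", "locatedIn", "France"),
}
entities = ["Bonn", "Paris", "Germany", "France"]

negative = corrupt_triple(
    ("Bonn", "locatedIn", "Germany"), entities, known_triples, random.Random(42)
)
print(negative)
```

Uniform corruption often yields negatives that are trivially false, such as a country located in a city. Producing harder, more informative negatives at low cost is the kind of improvement this topic is after.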
Level: B, M | Contact person: Dr. Hajira Jabeen
Entity Resolution
Entity resolution is the task of identifying all mentions that represent the same real-world entity within a knowledge base or across multiple knowledge bases. We address the problem of performing entity resolution on RDF graphs containing multiple types of nodes, using the links between instances of different types to improve the accuracy.
Level: B, M | Contact person: Dr. Hajira Jabeen
Rule/Concept Learning in Knowledge Graphs
In the Semantic Web context, OWL ontologies play the key role of domain conceptualizations, while the corresponding assertional knowledge is given by the heterogeneous Web resources referring to them. However, being strongly decoupled, ontologies and assertional bases can be out of sync. In particular, an ontology may be incomplete, noisy, and sometimes inconsistent with the actual usage of its conceptual vocabulary in the assertions. Despite such problematic situations, we aim at discovering hidden knowledge patterns from ontological knowledge bases, in the form of multi-relational association rules, by exploiting the evidence coming from the (evolving) assertional data. The final goal is to make use of such patterns for (semi-)automatically enriching/completing existing ontologies.
Level: B, M | Contact person: Dr. Hajira Jabeen
Intelligent Semantic Creativity: Culinarian
Computational creativity is an emerging branch of artificial intelligence that places computers in the center of the creative process. We aim to create a computational system that creates flavorful, novel, and perhaps healthy culinary recipes by drawing on big data techniques. It brings analytics algorithms together with disparate data sources from culinary science.
In its most ambitious form, the system would employ human-computer interaction for rating different recipes and model the human cognitive ability for the cooking process.
The end result is going to be an ingredient list, proportions, and a directed acyclic graph representing a partial ordering of culinary recipe steps.
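The directed acyclic graph mentioned above can be pictured as follows. This is a small sketch with an invented recipe: each step lists the steps that must be finished before it, and the standard library's `graphlib.TopologicalSorter` (Python 3.9+) produces one cooking order that respects the partial ordering.

```python
from graphlib import TopologicalSorter

# Hypothetical recipe: each step maps to the set of steps that must finish first.
steps = {
    "chop vegetables": set(),
    "boil water": set(),
    "cook pasta": {"boil water"},
    "fry vegetables": {"chop vegetables"},
    "combine and serve": {"cook pasta", "fry vegetables"},
}

# static_order() yields the steps so that every prerequisite comes before its dependents.
order = list(TopologicalSorter(steps).static_order())
print(order)
```

Steps without a dependency between them, such as chopping vegetables and boiling water, may appear in either order. That freedom is what makes the ordering partial rather than total.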
Level: B, M | Contact person: Dr. Hajira Jabeen
IoT Data Catalogues
While platforms and tools such as Hadoop and Apache Spark allow for efficient processing of Big Data sets, it becomes increasingly challenging to organize and structure these data sets. Data sets have various forms ranging from unstructured data in files to structured data in databases. Often the data sets reside in different storage systems, ranging from traditional file systems, through Big Data file systems (HDFS), to heterogeneous storage systems (S3, RDBMS, MongoDB, Elasticsearch, …). At AGT International, we are dealing primarily with IoT data sets, i.e. data sets that have been collected from sensors and that are processed using Machine Learning-based (ML) analytic pipelines. The number of these data sets is growing rapidly, which increases the importance of generating metadata that captures both technical (e.g. storage location, size) and domain metadata and correlates the data sets with each other, e.g. by storing provenance (data set x is a processed version of data set y) and domain relationships.
Level: M | Contact person: Dr. Martin Strohbach, Prof. Dr. Jens Lehmann
(Work at AGT International in Darmstadt)

Named Entity Recognition for Short-Text
Named Entity Recognition (NER) models play an important role in the Information Extraction (IE) pipeline. However, despite the decent performance of NER models on newswire datasets, conventional approaches are to date not able to successfully identify classical named-entity types in short/noisy texts. This thesis will thoroughly investigate NER in microblogs and propose new algorithms that outperform current state-of-the-art models in this research area.
Level: M | Contact person: Dr. Diego Esteves
Multilingual Fact Validation Algorithms
DeFacto (Deep Fact Validation) is an algorithm able to validate facts by finding trustworthy sources for them on the Web. Currently, it supports 3 main languages (en, de and fr). The goal of this thesis is to explore and implement alternative information retrieval (IR) methods to minimize the dependency on external tools for verbalizing natural language patterns. As a result, we expect to enhance the algorithm's performance by expanding its coverage.
Level: M | Contact person: Dr. Diego Esteves
An Approach for (Big) Product Matching
Consider comparing the same product data from thousands of e-shops. However, there are two main challenges that make the comparison difficult. First, the completeness of the product specifications used for organizing the products differs across different e-shops. Second, the ability to represent information about product data and their taxonomy is very diverse. To improve the consumer experience, e.g., by allowing for easily comparing offers by different vendors, approaches for product integration on the Web are needed.
The main focus of this work is on data modeling and semantic enrichment of product data in order to obtain an effective and efficient product matching result.
Level: M | Contact person: Dr. Giulio Napolitano, Debanjan Chaudhuri
Learning word representations for out-of-vocabulary words using their contexts.
Natural language processing (NLP) research has recently witnessed a significant boost following the introduction of word embeddings as proposed by Mikolov et al. (2013) (Distributed Representations of Words and Phrases and their Compositionality). However, one of the biggest challenges with the vanilla neural network architecture, which takes words as input and predicts contexts as output, is the handling of out-of-vocabulary (OOV) words, as the model fails badly on unseen words. In this project we suggest an architecture that uses only the proposed word2vec model. Given an unseen word, we would predict a distributed embedding for it from the contexts in which it is used, using the matrix that has learned to predict the context given the word.
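One simple reading of this idea, which is not necessarily the architecture intended for the thesis, is to average the learned context vectors of the words surrounding the unseen word. The sketch below assumes a hypothetical dictionary `context_vectors` holding toy rows of the context (output) matrix; the function names are invented for the example.

```python
def average(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]


def embed_unseen_word(context_words, context_vectors):
    """Estimate an embedding for an unseen word from the words around it.

    Context words without a learned vector are skipped.
    """
    known = [context_vectors[w] for w in context_words if w in context_vectors]
    if not known:
        raise ValueError("no known context words to estimate from")
    return average(known)


# Toy rows of a context (output) matrix, invented for illustration.
context_vectors = {
    "drink": [1.0, 0.0],
    "hot": [0.0, 1.0],
    "cup": [1.0, 1.0],
}

# The unseen word occurred next to these words; "a" has no learned vector here.
estimate = embed_unseen_word(["drink", "a", "hot", "cup"], context_vectors)
print(estimate)
```

Averaging treats all context words equally. Weighting them, or aggregating over several occurrences of the unseen word, would be natural refinements to investigate.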
Level: M | Contact person: Dr. Giulio Napolitano, Debanjan Chaudhuri
Semantic Integration Approach for Big Data
Dimension = Volume & Variety
Current Big Data platforms do not provide a semantic integration mechanism, especially in the context of integrating semantically equivalent entities that do not share an ID.
In the context of this thesis, the student will evaluate the MINTE integration framework in a Big Data scenario and make the necessary extensions to it.
Datasets: We are going to work with biomedical data.
Programming Language: Scala
Frameworks: Ideally integrated in the SANSA platform, but this is not a must.
References:
Synthesizing Knowledge Graphs from web sources with the MINTE framework
Semantic Join Operator to Integrate Heterogeneous RDF Graphs
MINTE semantically integrating RDF graphs
Level: M | Contact person: Diego Collarana
Semantic Similarity Metric for Big Data
Dimension = Volume, Variety
Identifying when two entities, coming from different data sources, are the same is a key step in the data analysis process.
The goal of this thesis topic is to evaluate the performance of the semantic similarity metrics we have developed in a Big Data scenario.
To this end, we will build a framework/operators for the semantic similarity functions and evaluate them. We are going to work with the following metrics: GADES, GARUM, FCA (new, to be developed) (see references).
Datasets: We are going to work with biomedical data.
Programming Language: Scala, Java
Frameworks: Ideally integrated in the SANSA platform, but this is not a must.
References:
A Semantic Similarity Measure Based on Machine Learning and Entity Characteristics
A Graph-based Semantic Similarity Measure
Level: M | Contact person: Diego Collarana
Embeddings for RDF Molecules
The use of embeddings is already common practice in the NLP community. Similar efforts are now under way in the knowledge graph community.
Several approaches, such as TransE and RDF2Vec, propose models to create embeddings out of RDF molecules.
The goal of this thesis is to extend the similarity metric MateTee (see references) with state-of-the-art approaches for creating embeddings from knowledge graph entities.
Datasets: We are going to work with knowledge graphs such as DBpedia and DrugBank.
Programming Language: Python
References:
A Semantic Similarity Metric Based on Translation Embeddings for Knowledge Graphs
http://usc-isi-i2.github.io/DL4KGS/
Level: M | Contact person: Diego Collarana
Hybrid Embedding for RDF Molecules
Following on from the topic above, the goal of this thesis is to research hybrid embeddings, i.e., combining word embeddings with knowledge graph embeddings.
This is more foundational research.
Programming Language: Python
References:
No references for the moment; part of the work is to find related literature.
Level: M | Contact person: Diego Collarana
RDF Molecules Browser
Foster serendipitous discoveries by browsing RDF molecules of data, with a special focus on facets/filters that promote knowledge discovery that was not initially intended.
Programming Language: ReactJS
References:
A Faceted Reactive Browsing Interface for Multi RDF Knowledge Graph Exploration
A Serendipity-Fostering Faceted Browser for Linked Data
Fostering Serendipitous Knowledge Discovery using an Adaptive Multigraph-based Faceted Browser
Level: M | Contact person: Diego Collarana

Completed Theses