Thesis Announcements

The Smart Data Analytics group is always looking for good students to write theses. The topics can be in one of the following broad areas:

For Students: Thesis Registration Workflow with Prof. Lehmann as First Supervisor during the COVID-19 crisis

  • Please find a thesis topic (see below) including a mentor (please contact those indicated there directly).
  • The mentor needs to get approval from Prof. Lehmann.
  • Please contact Martina Doelp for the thesis registration form.
  • Fill in and sign the form and send it to Martina Doelp (via regular mail, not electronically). The date of registration needs to be added by Prof. Lehmann.
  • Prof. Lehmann will then sign it (on paper) and Martina Doelp will send it to the second examiner (via mail) and forward it to the exam office afterwards (on paper). You can then verify in BASIS that the registration was successful.

Please note that the defense and the final submission of your thesis should take place in the same semester, so that you do not have to register for another semester. Please contact Shimaa Ibrahim to schedule a defense date.

Please note that the list below is only a small sample of possible topics and ideas. Please contact us to discuss further, to find new topics, or to suggest a topic of your own.

Open Theses

Topic | Level | Contact Person
Knowledge Graph Embeddings Benchmarking
Knowledge Graph Embedding (KGE) methods have become the standard toolkit for analyzing and learning from data on knowledge graphs. They have been successfully applied to many domains including chemistry, physics, social sciences and bioinformatics. As the field grows, it becomes critical to identify the architectures and key mechanisms which generalize across knowledge graph sizes, enabling us to tackle larger, more complex datasets and domains. In this thesis, you will study and apply the main tests, metrics and testing models.
Technology to use: Machine Learning, Knowledge Graphs
Level: M. Contact Person: Afshin Sadeghi, Dr. Diego Esteves
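As a rough illustration of the kind of metrics such a benchmark typically reports, the sketch below computes mean reciprocal rank (MRR) and Hits@k, two rank-based measures commonly used for link prediction. The function names and the example ranks are hypothetical and not taken from any specific benchmark; the thesis itself may rely on different metrics.

```python
from typing import Sequence


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Average of 1/rank over all test triples (ranks start at 1)."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks: Sequence[int], k: int) -> float:
    """Fraction of test triples whose correct entity is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical ranks of the correct tail entity for five test triples.
example_ranks = [1, 3, 2, 10, 50]
mrr = mean_reciprocal_rank(example_ranks)
hits_at_3 = hits_at_k(example_ranks, 3)
hits_at_10 = hits_at_k(example_ranks, 10)
```

Both functions only need the rank of the correct entity per test triple, so they can be applied to the output of any embedding model that produces a ranking.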
Solving Mini Chess via Distributed Deep Reinforcement Learning and Proof Number Search
In 1956, a chess program beat a (novice) human opponent for the first time in a chess variant called Los Alamos chess. Los Alamos chess is played on a 6×6 board and is therefore more suitable for computer programs; such variants are called “mini chess”. While chess engines have improved dramatically over the past decades, the game-theoretic value of Los Alamos chess (either a draw or a win for White can be forced) is still unknown. In this thesis, the goal is to prove the game-theoretic value of the game using a combination of a) deep reinforcement learning techniques for determining the best move in a particular position, b) position solvers based on proof number search, c) endgame tablebases generated for Los Alamos chess and d) a distributed proof tree manager that allows the approach to be executed on a cluster. If the student is capable of implementing the approach on a small cluster, we will apply for execution on a large-scale cluster consisting of hundreds of computing nodes. Given that substantial resources may be invested in this thesis, students applying for it should have excellent grades, excellent programming skills and enthusiasm for game solving or chess.
Level: M. Contact Person: Prof. Dr. Jens Lehmann
Conversational AI & Climate Change
Climate change is one of the major challenges humanity has to face. One of the key elements required for addressing its consequences is to provide objective information to citizens. Conversational AI methods (aka chatbots, speech assistants, dialogue systems, …) have become increasingly important over the past few years. In this master's thesis, students can contribute to building a climate change chatbot by addressing one of the following main challenges in deep learning and natural language processing:
  • Improving reading comprehension techniques by transfer learning from large text corpora such as SQuAD to text documents describing climate change
  • Improving the translation of natural language questions to queries against (climate) knowledge graphs
  • Interactive question answering techniques for capturing user feedback
Level: M. Contact Person: Dr. Liubov Kovriguina
Applying Knowledge graph embeddings for Context-aware Question Answering
The task of Question Answering (QA) faces new challenges when applied in scenarios with frequently changing information, such as a moving car. Current semantic parsing approaches rely on extracting named entities and the corresponding predicates from the input and matching them against patterns in static knowledge bases. So far, there has been little to no effort to include knowledge about the environment (i.e. context) in the QA pipeline. To improve the performance of so-called context-aware QA, you will work on solutions that integrate different graph embedding approaches into the QA process. Please refer to the job description for further information.
Level: M. Contact Person: Jewgeni Rose
Smart Home – Acquisition of Individual Knowledge in the Customer Service Environment
Refer to the Miele job description for further information.
Level: M. Contact Person: Giulio Napolitano
IoT Data Catalogues
While platforms and tools such as Hadoop and Apache Spark allow for efficient processing of Big Data sets, it becomes increasingly challenging to organize and structure these data sets. Data sets come in various forms, ranging from unstructured data in files to structured data in databases. Often the data sets reside in different storage systems, ranging from traditional file systems, through Big Data file systems (HDFS), to heterogeneous storage systems (S3, RDBMS, MongoDB, Elasticsearch, …). At AGT International, we deal primarily with IoT data sets, i.e. data sets that have been collected from sensors and that are processed using Machine Learning-based (ML) analytic pipelines. The number of these data sets is growing rapidly, which increases the importance of generating metadata that captures both technical (e.g. storage location, size) and domain metadata and correlates the data sets with each other, e.g. by storing provenance (data set x is a processed version of data set y) and domain relationships.
Level: M. Contact Person: Dr. Martin Strohbach, Prof. Dr. Jens Lehmann
(Work at AGT International in Darmstadt)
Named Entity Recognition for Short-Text
Named Entity Recognition (NER) models play an important role in the Information Extraction (IE) pipeline. However, despite the decent performance of NER models on newswire datasets, conventional approaches are to date not able to successfully identify classical named-entity types in short/noisy texts. This thesis will thoroughly investigate NER in microblogs and propose new algorithms to outperform current state-of-the-art models in this research area.
Contact Person: Dr. Diego Esteves
Multilingual Fact Validation Algorithms
DeFacto (Deep Fact Validation) is an algorithm able to validate facts by finding trustworthy sources for them on the Web. Currently, it supports 3 main languages (en, de and fr). The goal of this thesis is to explore and implement alternative information retrieval (IR) methods to minimize the dependency on external tools for verbalizing natural language patterns. As a result, we expect to enhance the algorithm's performance by expanding its coverage.
Contact Person: Dr. Diego Esteves
An Approach for (Big) Product Matching
Consider comparing the same product data from thousands of e-shops. Two main challenges make the comparison difficult. First, the completeness of the product specifications used for organizing the products differs across e-shops. Second, the ways in which product data and their taxonomy are represented are very diverse. To improve the consumer experience, e.g., by allowing for easy comparison of offers by different vendors, approaches for product integration on the Web are needed.
The main focus of this work is on data modeling and semantic enrichment of product data in order to obtain an effective and efficient product matching result.
Contact Person: Dr. Giulio Napolitano, Debanjan Chaudhuri
Learning word representations for out-of-vocabulary words using their contexts.
Natural language processing (NLP) research has recently witnessed a significant boost following the introduction of word embeddings as proposed by Mikolov et al. (2013) (Distributed Representations of Words and Phrases and their Compositionality). However, one of the biggest challenges of the vanilla neural network architecture, with words as input and contexts as output, is the handling of out-of-vocabulary (OOV) words, as the model fails badly on unseen words. In this project, we suggest an architecture that uses only the proposed word2vec model. Given an unseen word, we predict a distributed embedding for it from the contexts in which it is used, using the matrix that has learned to predict the context given the word.
Contact Person: Dr. Giulio Napolitano, Debanjan Chaudhuri
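One possible reading of this idea, given here only as a minimal sketch, is to keep the trained word-to-context matrix frozen and fit a vector for the unseen word by gradient descent, so that the softmax over the frozen matrix assigns high probability to the contexts in which the word was observed. The tiny matrix `W_OUT`, the context ids and all function names below are hypothetical; the architecture actually intended for the thesis may differ.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_probs(h: Sequence[float], w_out: Sequence[Sequence[float]]) -> List[float]:
    """Distribution over context words predicted from the vector h."""
    return softmax([sum(w * x for w, x in zip(row, h)) for row in w_out])


def context_loss(h, w_out, context_ids) -> float:
    """Average negative log-likelihood of the observed context words."""
    probs = context_probs(h, w_out)
    return -sum(math.log(probs[c]) for c in context_ids) / len(context_ids)


def infer_oov_embedding(w_out, context_ids, steps=500, lr=0.05) -> List[float]:
    """Fit a vector for an unseen word; w_out stays frozen throughout."""
    dim = len(w_out[0])
    # Empirical distribution of the observed context words.
    target = [0.0] * len(w_out)
    for c in context_ids:
        target[c] += 1.0 / len(context_ids)
    h = [0.0] * dim
    for _ in range(steps):
        probs = context_probs(h, w_out)
        # Gradient of the loss w.r.t. h: sum_v (p_v - target_v) * w_out[v].
        grad = [
            sum((probs[v] - target[v]) * w_out[v][d] for v in range(len(w_out)))
            for d in range(dim)
        ]
        h = [x - lr * g for x, g in zip(h, grad)]
    return h


# Hypothetical frozen output matrix: 4 context words, 2 dimensions.
W_OUT = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [-1.0, 0.0]]
# Hypothetical context word ids seen around the unseen word.
CONTEXTS = [0, 1, 0]
oov_vector = infer_oov_embedding(W_OUT, CONTEXTS)
```

Because only the new vector is updated, the embeddings of all known words stay unchanged, which is the property that makes this kind of approach attractive for handling unseen words after training.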
Semantic Integration Approach for Big Data
Dimension = Volume & Variety
Current Big Data platforms do not provide a semantic integration mechanism, especially for integrating semantically equivalent entities that do not share an ID.
In the context of this thesis, the student will evaluate the MINTE integration framework in a Big Data scenario and make the necessary extensions to it.
Datasets: We are going to work with biomedical datasets.
Programming Language: Scala
Frameworks: Ideally integrated in SANSA platform, but this is not a must.
Synthesizing Knowledge Graphs from web sources with the MINTE framework
Semantic Join Operator to Integrate Heterogeneous RDF Graphs
MINTE semantically integrating RDF graphs
Contact Person: Diego Collarana
Semantic Similarity Metric for Big Data
Dimension = Volume, Variety
Identifying when two entities, coming from different data sources, are the same is a key step in the data analysis process.
The goal of this thesis topic is to evaluate, in a Big Data scenario, the performance of the semantic similarity metrics we have developed.
To this end, we will build a framework/operators for the semantic similarity functions and evaluate them. We are going to work with the following metrics: GADES, GARUM, FCA (new, to be developed) (see references).
Datasets: We are going to work with biomedical datasets.
Programming Language: Scala, Java
Frameworks: Ideally integrated in SANSA platform, but this is not a must.
A Semantic Similarity Measure Based on Machine Learning and Entity Characteristics
A Graph-based Semantic Similarity Measure
Level: M. Contact Person: Diego Collarana
Embeddings for RDF Molecules
The use of embeddings is already common practice in the NLP community. Similar efforts are currently under way in the Knowledge Graph community.
Several approaches, such as TransE and RDF2Vec, propose models to create embeddings out of RDF molecules.
The goal of this thesis is to extend the similarity metric MateTee (see references) with state-of-the-art approaches for creating embeddings from Knowledge Graph entities.
Datasets: We are going to work with Knowledge Graphs such as DBpedia and DrugBank.
Programming Language: Python
A Semantic Similarity Metric Based on Translation Embeddings for Knowledge Graphs
Level: M. Contact Person: Diego Collarana
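To make the notion of a translation embedding concrete, the sketch below shows the TransE scoring idea: a triple (head, relation, tail) is considered plausible when the head vector plus the relation vector lies close to the tail vector. The three-dimensional vectors and the entity and relation names are hand-picked, hypothetical values for illustration only; they are not trained embeddings, and the sketch does not reproduce MateTee itself.

```python
import math
from typing import Sequence


def transe_score(head: Sequence[float], relation: Sequence[float], tail: Sequence[float]) -> float:
    """L2 distance ||h + r - t||; a lower score means a more plausible triple."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


# Hypothetical, hand-picked embeddings (not learned from any dataset).
aspirin = [0.1, 0.4, 0.0]
treats = [0.3, -0.1, 0.2]
headache = [0.4, 0.3, 0.2]
fever = [0.9, 0.9, 0.9]

plausible = transe_score(aspirin, treats, headache)
implausible = transe_score(aspirin, treats, fever)
```

In a trained model the vectors are learned so that observed triples receive low scores, and distances between entity vectors can then serve as input for a similarity metric.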
Hybrid Embedding for RDF Molecules
Following on from the topic discussed above, the goal of this thesis is to research hybrid embeddings, i.e., combining word embeddings with Knowledge Graph embeddings.
This is more foundational research.
Programming Language: Python
No references for the moment; part of the work is to find related literature.
Level: M. Contact Person: Diego Collarana
RDF Molecules Browser
Foster serendipitous discoveries by browsing RDF molecules of data, with a special focus on facets/filters that promote knowledge discovery that was not initially intended.
Programming Language: ReactJS
A Faceted Reactive Browsing Interface for Multi RDF Knowledge Graph Exploration
A Serendipity-Fostering Faceted Browser for Linked Data
Fostering Serendipitous Knowledge Discovery using an Adaptive Multigraph-based Faceted Browser
Level: M. Contact Person: Diego Collarana

Completed Theses