Hidden Research Community Detection 2.0 (en-US)

Scientific communities are usually identified with research fields. However, researchers also communicate in hidden communities that are formed through co-authorship, shared topic interests, attended events, etc. This thesis is the second phase of an already completed master thesis. We will focus on identifying more such communities by defining similarity metrics over the objects of a research knowledge graph that we will build from several datasets.

Exploring negative sampling in Knowledge graph embeddings (en-US)

Knowledge graph embedding represents entities as vectors in a common vector space. The objective of this work is to extend existing KG embedding models, such as TransE or ConvE, with more efficient and effective negative sampling methods and to compare their performance.
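
As a point of reference for what "negative sampling" means here, the following is a minimal sketch of the common uniform baseline: a true triple is corrupted by replacing its head or tail with a random entity, and a margin loss compares the TransE distances of the true and the corrupted triple. The toy entities, triples and function names are assumptions made for illustration; the thesis would investigate more effective samplers than this one.

```python
import random


def corrupt_triple(triple, entities, known_triples, rng, max_tries=100):
    """Return a negative triple by replacing the head or the tail with a random entity."""
    head, relation, tail = triple
    for _ in range(max_tries):
        candidate = rng.choice(entities)
        if rng.random() < 0.5:
            negative = (candidate, relation, tail)
        else:
            negative = (head, relation, candidate)
        # Skip corruptions that are actually true triples (false negatives).
        if negative not in known_triples:
            return negative
    raise RuntimeError("no valid negative found; the toy graph may be too dense")


def transe_score(h_vec, r_vec, t_vec):
    """TransE distance ||h + r - t|| (L1 norm); lower means more plausible."""
    return sum(abs(h + r - t) for h, r, t in zip(h_vec, r_vec, t_vec))


def margin_loss(pos_score, neg_score, margin=1.0):
    """Margin ranking loss: the negative should score at least `margin` worse."""
    return max(0.0, margin + pos_score - neg_score)


# Toy data, made up for this sketch.
entities = ["Bonn", "Rhine", "Germany", "Berlin"]
known = {
    ("Rhine", "crosses", "Bonn"),
    ("Bonn", "locatedIn", "Germany"),
    ("Berlin", "locatedIn", "Germany"),
}
rng = random.Random(0)
negative = corrupt_triple(("Bonn", "locatedIn", "Germany"), entities, known, rng)
```

Uniform corruption tends to produce negatives that are easy to tell apart from true triples, which is one motivation for studying better sampling strategies.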

End to End Task Completion dialogue System (en-US)

Dialogue systems and chatbots (Alexa, Siri, etc.) are increasingly becoming part of our day-to-day life. Since the inception of the field, the research community has developed component-based dialogue systems with separate components for 1) natural language understanding, 2) dialogue management, and 3) natural language generation. This approach has its own weaknesses. Recently, the focus has shifted to building end-to-end dialogue systems and training all components at once. In this thesis, we aim to build an end-to-end task completion dialogue system. Microsoft started a challenge for task-based dialogue systems which includes three domains: 1) movie booking, 2) restaurant booking, and 3) taxi ordering. The expected outcomes of the master's thesis are: an end-to-end dialogue system built for the three domains given in https://github.com/xiul-msr/e2e_dialog_challenge, and an in-depth evaluation of the approach.
Requirements: strong programming skills (Python is preferred), good knowledge of using servers via Unix commands, good understanding of state-of-the-art machine learning and deep learning models and the ability to apply them, ability to learn new reinforcement learning models.
Keywords: Dialogue Systems, Chatbots, Machine Learning, Deep Learning

Semantic Integration Approach for Big Data (en-US)

Current Big Data platforms do not provide a semantic integration mechanism, especially for integrating semantically equivalent entities that do not share an ID. In this thesis, the student will evaluate the MINTE integration framework in a Big Data scenario and make the necessary extensions to it.

Provide tools for LaTeX leveraging semantic web standards (en-US)

When articles are written (and submitted to peer review), one of the biggest fears of researchers is to forget some state-of-the-art works. Indeed, articles should be positioned among the already existing ones to show that they are new. However, a specific relevant paper can sometimes be forgotten by the authors. To avoid this unpleasant situation, one could imagine a LaTeX package able to check that no citation is missing in a manuscript. To do so, several things might be implemented: (i) extending the already existing pdf2rdf tool by implementing a tex2rdf module; (ii) generating bib-code from these RDF data; (iii) extracting RDF data from the reference sections of articles; (iv) aggregating all these RDF data and loading this dataset into a store; (v) developing a LaTeX package which would be able to automatically query this endpoint to suggest possibly missing references.

Knowledge Data Containers with Access Control and Security Capabilities (en-US)

The amount of Linked Data, both open (made available on the Web) and private (exchanged across companies and organizations), has been increasing in recent years. This data can be distributed in the form of Knowledge Graphs (KGs), but maintaining these KGs is mainly the responsibility of data owners or providers. Moreover, building applications on top of KGs in order to provide, for instance, analytics, data access control, and privacy is left to the end user or data consumers. However, both data providers and consumers need substantial resources in terms of development costs and equipment, which impedes the development of real-world applications over KGs. KGs as well as data processing functionalities can be encapsulated in a client-side system called Knowledge Graph Container, intended to be used by data providers or data consumers. The goal of this thesis is to integrate access control and security capabilities into these KG containers.

Experimental Analysis of Class CS Problems (en-US)

In this thesis, we explore unsolved problems of theoretical computer science with machine learning methods, especially reinforcement learning.

An Approach for (Big) Product Matching (en-US)

Consider comparing the same product data from thousands of e-shops. Two main challenges make this comparison difficult. First, the completeness of the product specifications used for organizing the products differs across e-shops. Second, the ways in which e-shops represent product data and their taxonomy are very diverse. To improve the consumer experience, e.g., by allowing for easy comparison of offers by different vendors, approaches for product integration on the Web are needed. The main focus of this work is on data modeling and semantic enrichment of product data in order to obtain an effective and efficient product matching result.

Towards a scalable, extensible and sustainable DBpedia Extraction Framework (en-US)

DBpedia is a semantic extract of Wikipedia content. Today it feeds hundreds of applications and research prototypes. The DBpedia Extraction Framework (EF) is the software that performs the data extraction from Wikipedia. DBpedia EF consists of several Extractors, each responsible for retrieving a specific part of a Wikipedia article/wiki. The Infobox-Extractor is responsible for extracting information found inside the Infobox section. Two main steps are involved: (1) mapping schema information to the DBpedia ontology, and (2) transforming data to a compliant shape before ingestion (RDF generation). The current extraction framework faces two major issues: (1) strong coupling and (2) poor performance. The first issue manifests in the mapping and transformation processes. The mapping process is based on manually created and curated mappings expressed in a custom in-house mapping language, and the transformations are hard-coded. The second issue manifests in the long extraction time, which can be up to several days. In order to solve the strong coupling issue, the work [1] suggests amending the original extractor implementation so that it uses RML [2] for mapping declarations and FNO [3] for transformation declarations. The evaluation of this approach proved its feasibility. However, the performance of data extraction was worse than that of the original implementation. On the other hand, as per the DBpedia EF design, the original extractor cannot be replaced; any extraction features, like the ones from [1], have to be implemented as a custom sub-extractor. Certain functionalities can still only be performed by the original extractor. In light of all the above, we clearly see a gap to fill and already see potential building blocks for a solution: the design and implementation of a completely new DBpedia Extractor that uses scalable techniques and Big Data engines (e.g. Spark) for the extraction, RML for mapping declarations and FNO for transformation declarations.
Such a solution would presumably solve the two aforementioned issues. As in (certain parts of) the original extractor, the suggested solution should keep a high level of generality and extensibility, so that the community can also participate in the creation and curation of the framework itself. The theoretical and practical experience gained in [1] would be of great help in implementing the new proposal. For example, the student could choose to start by deploying [1].
[1] Maroy, Wouter, et al. "Sustainable linked data generation: The case of DBpedia." ISWC, 2017. http://jens-lehmann.org/files/2017/iswc_dbpedia_rml.pdf
[2] http://rml.io/
[3] De Meester, Ben, et al. "Declarative data transformations for Linked Data generation: the case of DBpedia." European Semantic Web Conference. Springer, Cham, 2017.

Use of Ontology information in Knowledge graph embeddings (en-US)

Knowledge graph embedding represents entities as vectors in a common vector space. The objective of this work is to extend existing KG embedding models, such as TransE or ConvE, to include schema information and to compare their performance.

RDF data anonymization (en-US)

Anonymization of knowledge graph data for security purposes.

Predicate Linking in a Sentence (en-US)

Entity recognition and linking in a sentence has been a long-standing field. For example, in the sentence “Michelle Obama is the wife of Barack Obama”, there are two entities, Michelle Obama and Barack Obama. Entity linking tools (DBpedia Spotlight, TagMe, TextRazor, etc.) in the ideal case link these two entities to their knowledge graph mentions in DBpedia and Wikidata. In this thesis, we go one step further and link the predicates of the sentence to their knowledge graph mentions. In the above example, we aim to link “wife of” to dbo:spouse (http://dbpedia.org/ontology/spouse) or wikidata:spouse (https://www.wikidata.org/wiki/Property:P26). To this end, we aim to train deep learning or machine learning models to correctly recognise the relations in the text and then link them to the knowledge graph. Expected outcomes of the master's thesis: 1) A novel approach for linking predicates in a sentence 2) Implementation of the approach 3) In-depth evaluation
Requirements: strong programming skills, good knowledge of state-of-the-art machine learning and deep learning models.
Keywords: NLP, Machine Learning, Deep Learning, Predicate Linking, Entity Linking

Evaluation of verbalization approaches for QA systems (en-US)

Since QA systems are not perfect, they might present the end-user with incorrect results. Thus, a user-friendly tool ought to be provided to enable the end-user to assess whether the underlying QA system has correctly understood the question as the end-user intended. Since QA systems over KGs usually produce a formal query (e.g. SPARQL) that is supposed to capture the intention of the input natural language question, one can simply ask the end-user to judge the accuracy of the generated SPARQL query. However, this representation might not be easy for the end-user to understand, as SPARQL is a formal query language that requires skills which cannot be expected from end-users. There are tools such as Spartiqulator [1] and SPARQL2NL [2] that, given a SPARQL query, produce a semi-natural language representation. Nevertheless, since these tools are also not perfect and might generate questions that are not grammatically correct, other researchers (such as SPARQLtoUser [3]) have suggested a more schematic representation which is language agnostic. As there is no well-defined qualitative metric to evaluate such approaches, the aims of this thesis are as follows:
  • To study related work on verbalization in QA systems over KGs
  • To implement/reuse at least one of the existing approaches in the aforementioned groups.
  • To design and carry out a user study, including related metrics to evaluate user satisfaction.

RDF Data Clustering (en-US)

Clustering of heterogeneous data contained in a knowledge graph

Rule/Concept Learning using Swarm and Evolutionary Computation (en-US)

In the Semantic Web context, OWL ontologies play the key role of domain conceptualizations, while the corresponding assertional knowledge is given by the heterogeneous Web resources referring to them. However, being strongly decoupled, ontologies and assertional bases can be out of sync. In particular, an ontology may be incomplete, noisy, and sometimes inconsistent with the actual usage of its conceptual vocabulary in the assertions. Despite such problematic situations, we aim at discovering hidden knowledge patterns from ontological knowledge bases, in the form of multi-relational association rules, by exploiting the evidence coming from the (evolving) assertional data. The final goal is to make use of such patterns for (semi-)automatically enriching/completing existing ontologies.

Product matching through Embedding Representation (en-US)

A central problem in the context of the Web of Data, as well as in data integration in general is to identify entities in different datasets that describe the same real-world object. The LiteralE approach [1] will be used to represent the entities from two datasets in a vector space. The aim of the project is to implement a neural network to learn a similarity function between entities.
[1]. https://arxiv.org/abs/1802.00934

Named Entity Recognition (NER) models play an important role in the Information Extraction (IE) pipeline. However, despite the decent performance of NER models on newswire datasets, conventional approaches to date are not able to successfully identify classical named-entity types in short/noisy texts. This thesis will thoroughly investigate NER in microblogs and propose new algorithms to outperform current state-of-the-art models in this research area.

Triple Generation for Question Answering Using Hidden Markov Model (en-US)

In knowledge graph based question answering, a QA system is expected to translate a natural language question into its associated formal query (i.e. SPARQL). To do so, the QA system needs to identify the entity and the relation in the question and link them to their knowledge graph mentions. Consider the question “Which river crosses Bonn?”: here the entity is http://dbpedia.org/page/Bonn, the predicate is http://dbpedia.org/ontology/crosses, and the ontology class is http://dbpedia.org/ontology/River. In the scope of this thesis, we aim to generate http://dbpedia.org/page/Bonn, http://dbpedia.org/ontology/crosses, http://dbpedia.org/ontology/River for the given example sentence using a Hidden Markov Model, similar to SINA [a]. Please note that we do not aim to build the SPARQL query from it. Expected outcomes of the master's thesis:
1) Reusing the SINA approach and improving on it by proposing a novel approach for triple generation using an HMM
2) In-depth evaluation of the approach using question answering datasets.
Requirements: good programming skills in Java or Python and willingness to learn hidden Markov models
a) Shekarpour, Saeedeh, Axel-Cyrille Ngonga Ngomo, and Sören Auer. "Question answering on interlinked data." In Proceedings of the 22nd international conference on World Wide Web, pp. 1145-1156. ACM, 2013.
Keywords: Question Answering, HMM, Knowledge Graph
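
To make the HMM idea concrete, here is a minimal sketch of Viterbi decoding in which question keywords are the observations and candidate knowledge graph resources are the hidden states. It is not SINA's actual model: the prefixed names (dbo:River, dbo:crosses, dbr:Bonn) are shorthand chosen for this example, and all probabilities are made up.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state sequence for the given observations."""
    # best[t][s] = (probability of the best path ending in state s at step t, previous state)
    best = [{s: (start_p[s] * emit_p[s].get(observations[0], 0.0), None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s].get(obs, 0.0), p)
                for p in states
            )
        best.append(column)
    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path


# Hypothetical candidate resources (hidden states) and keywords (observations).
states = ["dbo:River", "dbo:crosses", "dbr:Bonn"]
keywords = ["river", "crosses", "Bonn"]

uniform = 1.0 / len(states)
start_p = {s: uniform for s in states}
trans_p = {p: {s: uniform for s in states} for p in states}
# Made-up emission probabilities: how likely a resource is to "emit" a keyword.
emit_p = {
    "dbo:River": {"river": 0.8, "crosses": 0.1, "Bonn": 0.1},
    "dbo:crosses": {"river": 0.1, "crosses": 0.8, "Bonn": 0.1},
    "dbr:Bonn": {"river": 0.05, "crosses": 0.05, "Bonn": 0.9},
}

path = viterbi(keywords, states, start_p, trans_p, emit_p)
```

In a real system the emission probabilities would come from string or semantic similarity between keywords and resource labels, and the transition probabilities from the structure of the knowledge graph; both are design choices left to the thesis.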

Data Integration and Data Quality in the Big Data Lifecycle (en-US)

Mapping relational databases to RDF is a fundamental problem for the development of the Semantic Web. The problem is to define which quality metrics are preserved, and which are not, when moving from the RDB data model to the RDF data model. The goal is to understand the relation between quality metrics in RDB and quality metrics in RDF and to define quality metrics for RDB2RDF.

Multimodal Representation of Product Data (en-US)

Knowledge graph embedding represents entities as vectors in a common vector space. Their multimodal representation learning is carried out by a neural network using structural, numerical and string information. In particular, the LiteralE approach [1] merges entity embeddings with their literal information using a learnable, parametrized function, such as a simple linear or nonlinear transformation, or a multilayer neural network. The objective of this work is to extend the LiteralE algorithm to augment multimodal representation of an entity with string information.
[1]. https://arxiv.org/abs/1802.00934
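
The simplest of the combination functions mentioned above, a linear transformation, can be sketched as follows. The entity embedding and the literal feature vector are concatenated and mapped back to the entity dimension by a weight matrix. The weights here are fixed toy values, whereas in the actual approach they would be learned; the function name and all numbers are assumptions made for illustration.

```python
def combine_linear(entity_vec, literal_vec, weights):
    """Map the concatenation [entity_vec; literal_vec] back to the entity dimension."""
    joined = list(entity_vec) + list(literal_vec)
    if any(len(row) != len(joined) for row in weights):
        raise ValueError("each weight row must have len(entity_vec) + len(literal_vec) columns")
    return [sum(w * x for w, x in zip(row, joined)) for row in weights]


# Toy example: a 2-dimensional entity embedding and a single numeric literal feature.
entity = [1.0, 2.0]
literal = [0.5]
weights = [
    [1.0, 0.0, 0.5],   # first output dimension
    [0.0, 1.0, -1.0],  # second output dimension
]
enriched = combine_linear(entity, literal, weights)
```

Extending this to string information would mainly mean deciding how a string literal is encoded as a vector before it enters such a function.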

Semantic Similarity Metric for Big Data (en-US)

Identifying when two entities coming from different data sources are the same is a key step in the data analysis process. The goal of this thesis topic is to evaluate the performance of the semantic similarity metrics we have developed in a Big Data scenario. To this end, we will build a framework/operators for the semantic similarity functions and evaluate them.

Learning word representations for out-of-vocabulary words using their contexts. (en-US)

Natural language processing (NLP) research has recently witnessed a significant boost, following the introduction of word embeddings as proposed by Mikolov et al. (2013) (Distributed Representations of Words and Phrases and their Compositionality). However, one of the biggest challenges of word embeddings trained with the vanilla neural net architecture, with words as input and contexts as output, is the handling of out-of-vocabulary (OOV) words, as the model fails badly on unseen words. In this project, we suggest an architecture that uses only the proposed word2vec model. Given an unseen word, we would predict a distributed embedding for it from the contexts in which it is used, using the matrix that has learned to predict the context given the word.
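
One simple reading of this idea, given here only as an assumption since the proposal does not fix the architecture, is to estimate the vector of an unseen word by averaging the context-side vectors of the known words that surround it. The vectors and the function name below are toy values invented for the sketch.

```python
def estimate_oov_vector(context_words, context_vectors):
    """Average the context-side vectors of the known words seen around an unseen word."""
    known = [context_vectors[w] for w in context_words if w in context_vectors]
    if not known:
        raise ValueError("no known context words to estimate from")
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Toy context-side vectors (rows of the matrix that predicts contexts), made up.
context_vectors = {
    "river": [1.0, 0.0],
    "flows": [0.0, 1.0],
    "city": [1.0, 1.0],
}
# Context words observed around an unseen word; "through" is itself unknown and is skipped.
oov_vector = estimate_oov_vector(["river", "flows", "through"], context_vectors)
```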

A block-chain forecast model: Extracting & analyzing user most used smart-contract features to predict block-chain future (en-US)

In recent years, the block-chain concept [https://en.wikipedia.org/wiki/Blockchain] has become a key technology to record transactions between two parties while providing several properties. On top of the general block-chain architecture, distributed computing platforms have emerged, such as Ethereum [https://en.wikipedia.org/wiki/Ethereum], which gives the opportunity of building and deploying smart contracts [https://en.wikipedia.org/wiki/Smart_contract]: automatic actions that can be triggered by specific events in the chain. By construction, block-chain-related technologies are open-source and publicly available, which allows a user to check, for instance, the complete history of the chain or some specific events. Moreover, the structure of the chain itself can also be represented as a large knowledge graph. The goal of this study is to crawl the history of Ethereum block-chain smart contracts, leveraging the knowledge graph introduced above, in order to compute statistics and then try to predict the future of the chain. To do so, several steps are required: 1. retrieving information from the large RDF graph representing the chain using the SPARQL query language; 2. understanding the way smart contracts are scripted; 3. deploying ML algorithms on these data excerpts; 4. drawing conclusions.

Quality metric preservation from relational to RDF data (en-US)

Mapping relational databases to RDF is a fundamental problem for the development of the Semantic Web. The problem is to define which quality metrics are preserved, and which are not, when moving from the RDB data model to the RDF data model. The goal is to understand the relation between quality metrics in RDB and quality metrics in RDF and to define quality metrics for RDB2RDF.

Generating Property Graphs from RDF using a semantic preserving conversion approach (en-US)

Graph databases have been on the rise over the last decade due to their dominance in mining and analysis of complex networks. Property Graphs (PGs), one of the graph data models used by graph databases, are suitable for representing many real-life application scenarios. They allow efficient representation of complex networks (e.g. social networks, e-commerce) and interactions. In order to leverage this advantage of graph databases, conversion of other data models to property graphs is a current area of research. The aim of this thesis is to (i) propose a novel systematic conversion approach for generating PGs from RDF (another graph data model) and (ii) carry out exhaustive experiments on both RDF and PG datasets with respect to their native storage databases (i.e. graph DBs vs. triplestores). This will make it possible to identify the types of queries for which graph databases offer performance advantages and ideally to adapt the storage mechanism accordingly. The outcome of this work will be integrated into the LITMUS framework, which is an open extensible framework for benchmarking diverse Data Management Solutions.
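
For orientation, a naive RDF-to-PG mapping can be sketched as follows: triples whose object is a literal become node properties, and triples whose object is a resource become labelled edges. This is a simplistic baseline assumed for illustration (it ignores datatypes, language tags and blank nodes), not the semantics-preserving approach the thesis is meant to propose; the tuple layout and the ex: names are invented for the example.

```python
def rdf_to_property_graph(triples):
    """Convert (subject, predicate, object, object_is_literal) tuples to nodes and edges."""
    nodes = {}  # node id -> {property name: list of values}
    edges = []  # (source id, edge label, target id)
    for subject, predicate, obj, object_is_literal in triples:
        nodes.setdefault(subject, {})
        if object_is_literal:
            # Literal objects become (possibly multi-valued) node properties.
            nodes[subject].setdefault(predicate, []).append(obj)
        else:
            # Resource objects become nodes connected by a labelled edge.
            nodes.setdefault(obj, {})
            edges.append((subject, predicate, obj))
    return nodes, edges


# Toy triples, made up for this sketch.
triples = [
    ("ex:Alice", "ex:name", "Alice", True),
    ("ex:Alice", "ex:knows", "ex:Bob", False),
    ("ex:Bob", "ex:name", "Bob", True),
]
nodes, edges = rdf_to_property_graph(triples)
```

Even this small example shows where information can be lost in a conversion, which is why a systematic, semantics-preserving approach is the subject of the thesis.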

DeFacto (Deep Fact Validation) is an algorithm able to validate facts by finding trustworthy sources for them on the Web. Currently, it supports 3 main languages (en, de and fr). The goal of this thesis is to explore and implement alternative information retrieval (IR) methods to minimize the dependency on external tools for verbalizing natural language patterns. As a result, we expect to enhance the algorithm's performance by expanding its coverage.

RDF Molecules Browser (en-US)

Foster serendipitous discoveries by browsing RDF molecules of data, with a special focus on facets/filters that promote knowledge discovery that was not initially intended.

Embeddings for RDF Molecules (en-US)

The use of embeddings is already common practice in the NLP community. Currently, there are similar efforts in the Knowledge Graph community. Several approaches, such as TransE, RDF2Vec, etc., propose models to create embeddings from RDF molecules. The goal of this thesis is to extend the similarity metric MateTee with state-of-the-art approaches for creating embeddings from Knowledge Graph entities.

While platforms and tools such as Hadoop and Apache Spark allow for efficient processing of Big Data sets, it becomes increasingly challenging to organize and structure these data sets. Data sets have various forms, ranging from unstructured data in files to structured data in databases. Often the data sets reside in different storage systems, ranging from traditional file systems and Big Data file systems (HDFS) to heterogeneous storage systems (S3, RDBMS, MongoDB, Elasticsearch, …). At AGT International, we deal primarily with IoT data sets, i.e. data sets that have been collected from sensors and that are processed using Machine Learning-based (ML) analytic pipelines. The number of these data sets is growing rapidly, which increases the importance of generating metadata that captures both technical (e.g. storage location, size) and domain metadata and correlates the data sets with each other, e.g. by storing provenance (data set x is a processed version of data set y) and domain relationships.

Development and implementation of a semantic Configuration- and Change-Management (en-US)

This thesis is offered in cooperation with Schaeffler Technologies AG & Co. KG. Solid knowledge of OWL and RDF and a general interest in configuration and change management are required. The thesis can be written in English, and an English-speaking work environment is possible. A more detailed description in German is available here (pdf).

Relation Linking for Question Answering in German (en-US)

The task of relation linking in question answering is the identification of the relation (predicate) in a given question and its linking to the corresponding entity in a knowledge base. It is an important step in question answering, which afterwards allows us to build formal queries against, e.g., a knowledge graph. Most of the existing question answering systems focus on the English language, and very few question answering components support other languages like German. The goal of this thesis is to identify relation extraction tools from the literature, and to develop new ones, that can be adapted to work for German questions.

Reflecting on the User Experience Challenges of CEUR Make GUI and Harnessing the Experience: CEUR Make GUI (en-US)

CEUR Make GUI is a graphical user interface supporting the workflow of publishing open access proceedings of scientific workshops via CEUR-WS.org, one of the largest open access repositories. For more details on the topic please go through the publications mentioned below. In this thesis we aim to produce a collaborative workspace for editing workshop proceedings and to enhance the user experience of the software. Based on the development of the collaborative workspace, we would also like to address the challenges of user experience and of collaborative and cooperative workspaces through a structured protocol.
Email: Muhammad.rohan.ali.asmat@iais.fraunhofer.de
Related links: Current Repository, Thesis, Publication, Task Board

Complex Factoid Question Answering with Paraphrase Clusters (en-US)

Question answering has gained momentum in recent years, and researchers have built several QA systems targeting various knowledge sources. The recently released dataset ComQA (http://qa.mpi-inf.mpg.de/comqa/) provides a rich source of paraphrased questions and their answers extracted from WikiAnswers. The dataset contains questions with various challenging phenomena such as the need for temporal reasoning, comparison (e.g., comparatives, superlatives, ordinals), compositionality (multiple, possibly nested, sub-questions with multiple entities), and unanswerable questions (e.g., Who was the first human being on Mars?). Through a large crowdsourcing effort, questions in ComQA are grouped into 4,834 paraphrase clusters that express the same information need. Each cluster is annotated with its answer(s). In this thesis, we aim to build an end-to-end QA system over this dataset that is capable of advancing the state of the art by proposing a new approach for targeting such complex questions. Expected outcomes of the master's thesis are:
1) A novel approach for complex factoid question answering systems
2) Implementation of prototype QA system
3) In depth evaluation of QA system over dataset
Requirements: strong programming skills (Java/Python), good understanding of state-of-the-art machine learning and deep learning models, and experience in working with them in university coursework or lab work.
Keywords: Question Answering, Machine Learning, Deep Learning

RDF data rule mining (en-US)

Mining rules from RDF data for knowledge base completion

Entity resolution is the task of identifying all mentions that represent the same real-world entity within a knowledge base or across multiple knowledge bases. We address the problem of performing entity resolution on RDF graphs containing multiple types of nodes, using the links between instances of different types to improve the accuracy.

Movement of Research Results and Education through OpenCourseWare (en-US)

This thesis is research-based work in which we will build a knowledge graph covering OpenCourseWare (OCW, i.e. online courses) and the development of the research topics considered in this KG. We will use an analytics tool to define interesting queries that give us insights into the research question of how well research is aligned with teaching material.

Intelligent Semantic Creativity : Culinarian (en-US)

Computational creativity is an emerging branch of artificial intelligence that places computers in the center of the creative process. We aim to create a computational system that creates flavorful, novel, and perhaps healthy culinary recipes by drawing on big data techniques. It brings analytics algorithms together with disparate data sources from culinary science. In its most ambitious form, the system would employ human-computer interaction for rating different recipes and model the human cognitive ability for the cooking process. The end result is going to be an ingredient list, proportions, and a directed acyclic graph representing a partial ordering of culinary recipe steps.
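
To illustrate the last point, a recipe's partial ordering can be stored as a directed acyclic graph and turned into one valid cooking sequence with a topological sort. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later); the recipe steps are invented for the example.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps that must be finished before it can start.
recipe = {
    "chop onions": set(),
    "boil water": set(),
    "fry onions": {"chop onions"},
    "cook pasta": {"boil water"},
    "combine and serve": {"fry onions", "cook pasta"},
}

# static_order() yields the steps so that every prerequisite comes before its dependents.
order = list(TopologicalSorter(recipe).static_order())
```

Because the ordering is only partial, several sequences are valid; independent steps such as chopping onions and boiling water may appear in either order or be carried out in parallel.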

Distributed Anomaly Detection in RDF (en-US)

Detecting anomalies in data is a vital task, with numerous high-impact applications in areas such as security, finance, health care, and law enforcement. Numerous techniques have been developed in past years for spotting outliers and anomalies in unstructured collections of multi-dimensional points. With graph data becoming ubiquitous, techniques for structured graph data have recently come into focus. As objects in graphs have long-range correlations, a suite of novel techniques has been developed for anomaly detection in graph data.