PhD Viva (Gezim Sejdiu): Efficient Distributed In-Memory Processing of RDF Datasets

Last week, on Tuesday 29th of September 2020 successfully defended my PhD thesis entitled “Efficient Distributed In-Memory Processing of RDF Datasets”. The main objective of this thesis is to lay foundations for efficient algorithms performing analytics, i.e. exploration, quality assessment, and querying over semantic knowledge graphs at a scale that has not been possible before.

Slides

See below the thesis abstract with references to the main papers, part of the work is based on (see here: https://gezimsejdiu.github.io//publications/ for the complete list of publications).

Abstract

Over the past decade, vast amounts of machine-readable structured information have become available through the automation of research processes as well as the increasing popularity of knowledge graphs and semantic technologies. Today, we count more than 10,000 datasets made available online following Semantic Web standards. A major and yet unsolved challenge that research faces today is to perform scalable analysis of large-scale knowledge graphs in order to facilitate applications in various domains including life sciences, publishing, and the internet of things.
The main objective of this thesis is to lay foundations for efficient algorithms performing analytics, i.e. exploration, quality assessment, and querying over semantic knowledge graphs at a scale that has not been possible before.
First, we propose a novel approach for statistical calculations of large RDF datasets [1], which scales out to clusters of machines.
In particular, we describe the first distributed in-memory approach for computing 32 different statistical criteria for RDF datasets using Apache Spark.
Many applications such as data integration, search, and interlinking, may take full advantage of the data when having a priori statistical information about its internal structure and coverage. However, such applications may suffer from low quality and not being able to leverage the full advantage of the data when the size of data goes beyond the capacity of the resources available.
Thus, we introduce a distributed approach of quality assessment of large RDF datasets [2]. It is the first distributed, in-memory approach for computing different quality metrics for large RDF datasets using Apache Spark. We also provide a quality assessment pattern that can be used to generate new scalable metrics that can be applied to big data.
Based on the knowledge of the internal statistics of a dataset and its quality, users typically want to query and retrieve large amounts of information.
As a result, it has become difficult to efficiently process these large RDF datasets.
Indeed, these processes require, both efficient storage strategies and query-processing engines, to be able to scale in terms of data size.
Therefore, we propose a scalable approach [3, 4] to evaluate SPARQL queries over distributed RDF datasets by translating SPARQL queries into Spark executable code.
We conducted several empirical evaluations to assess the scalability, effectiveness, and efficiency of our proposed approaches.
More importantly, various use cases i.e. Ethereum analysis, Mining Big Data Logs, and Scalable Integration of POIs, have been developed and leverages by our approach.
The empirical evaluations and concrete applications provide evidence that our methodology and techniques proposed during this thesis help to effectively analyze and process large-scale RDF datasets.
All the proposed approaches during this thesis are integrated into the larger SANSA framework [5].

References

[1]. Gezim Sejdiu; Ivan Ermilov; Jens Lehmann; and Mohamed Nadjib-Mami, “DistLODStats: Distributed Computation of RDF Dataset Statistics,” in Proceedings of 17th International Semantic Web Conference (ISWC), 2018.
[2]. Gezim Sejdiu; Anisa Rula; Jens Lehmann; and Hajira Jabeen, “A Scalable Framework for Quality Assessment of RDF Datasets,” in Proceedings of 18th International Semantic Web Conference (ISWC), 2019.
[3]. Claus Stadler; Gezim Sejdiu; Damien Graux; and Jens Lehmann, “Sparklify: A Scalable Software Component for Efficient evaluation of SPARQL queries over distributed RDF datasets,” in Proceedings of 18th International Semantic Web Conference (ISWC), 2019.
[4]. Gezim Sejdiu; Damien Graux; Imran Khan; Ioanna Lytra; Hajira Jabeen; and Jens Lehmann, “Towards A Scalable Semantic-based Distributed Approach for SPARQL query evaluation,” 15th International Conference on Semantic Systems (SEMANTiCS), Research & Innovation, 2019.
[5]. Jens Lehmann; Gezim Sejdiu; Lorenz Bühmann; Patrick Westphal; Claus Stadler; Ivan Ermilov; Simon Bin; Nilesh Chakraborty; Muhammad Saleem; Axel-Cyrille Ngomo Ngonga; and Hajira Jabeen, “Distributed Semantic Analytics using the SANSA Stack,”; in Proceedings of 16th International Semantic Web Conference – Resources Track (ISWC’2017), 2017.