
Lecturers: Dr. Hajira Jabeen, Gezim Sejdiu, Prof. Dr. Jens Lehmann
Time: Thursday, 10:00–13:00, Endenicher Allee 19A – Seminarraum 1.047, Informatik III
GitHub: We will use GitHub for disseminating project information.

The lab will not be offered during the WiSe2019/20 semester.

Course Description

The goal of this lab is to give students experience and technical skills with Big Data processing tools such as Flink and Spark, and to acquaint them with the functional programming style prevalent in concurrent and parallel programming for Big Data. The module teaches students to develop data mining and machine learning solutions for massive amounts of data.

Schedule

The following schedule lists the topics we plan to cover and what the assignments will focus on. More details will be added as the course progresses.

One goal of this class is to make you comfortable with a wide variety of tools (most of which are listed below). You are NOT expected to learn these tools on your own; we will provide step-by-step guidance on getting started with them, and the actual assignments will be simple.

Date	Lecture Topics and Materials	Assignments
April 04	Introduction: What is Big Data. Major tools used by data scientists. Class overview Readings: ◎ https://github.com/SANSA-Stack ◎ https://github.com/SmartDataAnalytics ◎ https://github.com/big-data-europe ◎ http://www.artima.com/scalazine/articles/steps.html ◎ http://www.scala-lang.org/ ◎ http://twitter.github.io/scala_school/	Lab 1: Setting up the environment and getting started with Scala
April 11	Spark Fundamentals I Readings: ◎ Spark Programming Guide ◎ RDD and DataFrame API Examples ◎ Spark SQL, DataFrames and Datasets Guide ◎ Spark Cluster Overview ◎ Spark Configuration, Monitoring, and tuning References: Spark: Cluster Computing with Working Sets Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics Spark SQL: Relational Data Processing in Spark	Lab 2: Getting started with Spark + Spark GraphX and Spark SQL operations
April 18	No Class: Public Holiday
April 25	Spark Fundamentals II (Spark ML), BigDL Readings: ◎ Spark Machine Learning Library (MLlib) Guide ◎ BigDL Scala Guide References: MLlib: Machine Learning in Apache Spark BigDL: A Distributed Deep Learning Framework for Big Data	Lab 3: Spark ML and BigDL
May 2	SANSA – Semantic Analytics Stack, Project Allocation Readings: ◎ SANSA Overview and SANSA FAQ. References: MLlib: Machine Learning in Apache Spark GraphX: Graph Processing in a Distributed Dataflow Framework Distributed Semantic Analytics using the SANSA Stack The Tale of Sansa Spark	Lab 4: SANSA
May 2	Project Assignment
May 16	First presentation for the Project
May 23	Meeting I
	Lab work
	(cntd)
	(cntd)
June 27	Meeting II
July 05	Project report and source code submission
July 11	Project Presentations

Date	Lecture Topics and Materials	Assignments
October 11	Introduction: What is Big Data. Major tools used by data scientists. Class overview Readings: ◎ https://github.com/SANSA-Stack ◎ https://github.com/SmartDataAnalytics ◎ https://github.com/big-data-europe ◎ http://www.artima.com/scalazine/articles/steps.html ◎ http://www.scala-lang.org/ ◎ http://twitter.github.io/scala_school/	Lab 1: Setting up the environment and getting started with Scala
October 18	Spark Fundamentals I Readings: ◎ Spark Programming Guide ◎ RDD and DataFrame API Examples ◎ Spark Cluster Overview ◎ Spark Configuration, Monitoring, and tuning References: Spark: Cluster Computing with Working Sets Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics	Lab 2: Getting started with Spark
October 25	Spark Fundamentals II (Spark GraphX + Spark SQL) Readings: ◎ Spark SQL, DataFrames and Datasets Guide References: Spark SQL: Relational Data Processing in Spark	Lab 3: Spark GraphX and Spark SQL operations
November 1	No Class: Public Holiday
November 8	Spark Fundamentals II (Spark ML), SANSA – Semantic Analytics Stack, Project Allocation Readings: ◎ Spark Machine Learning Library (MLlib) Guide ◎ GraphX Programming Guide ◎ SANSA Overview and SANSA FAQ. References: MLlib: Machine Learning in Apache Spark GraphX: Graph Processing in a Distributed Dataflow Framework Distributed Semantic Analytics using the SANSA Stack The Tale of Sansa Spark	Lab 4: Spark ML and SANSA
November 8	Project Assignment
November 22	First presentation for the Project
December 13	Meeting I
	Lab work
	(cntd)
	(cntd)
January 24	Meeting II
February 22	Project report and source code
February 27	Project Presentations

Date	Lecture Topics and Materials	Assignments
April 17	Introduction: What is Big Data. Major tools used by data scientists. Class overview Readings: ◎ https://github.com/SANSA-Stack ◎ https://github.com/SmartDataAnalytics ◎ https://github.com/big-data-europe ◎ http://www.artima.com/scalazine/articles/steps.html ◎ http://www.scala-lang.org/ ◎ http://twitter.github.io/scala_school/	Lab 1: Setting up the environment and getting started with Scala
April 24	Spark Fundamentals I Readings: ◎ Spark Programming Guide ◎ RDD and DataFrame API Examples ◎ Spark Cluster Overview ◎ Spark Configuration, Monitoring, and tuning References: Spark: Cluster Computing with Working Sets Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics	Lab 2: Getting started with Spark
May 1	No Class: Public Holiday
May 8	Spark Fundamentals II (Spark GraphX + Spark SQL) Readings: ◎ Spark SQL, DataFrames and Datasets Guide References: Spark SQL: Relational Data Processing in Spark	Lab 3: Spark GraphX and Spark SQL operations
May 15	Spark Fundamentals II (Spark ML), SANSA – Semantic Analytics Stack, Project Allocation Readings: ◎ Spark Machine Learning Library (MLlib) Guide ◎ GraphX Programming Guide ◎ SANSA Overview and SANSA FAQ. References: MLlib: Machine Learning in Apache Spark GraphX: Graph Processing in a Distributed Dataflow Framework Distributed Semantic Analytics using the SANSA Stack The Tale of Sansa Spark	Lab 4: Spark ML and SANSA
May 15	Project Assignment
June 5	First presentation for the Project
June 12	Meetings
	Lab work
	(cntd)
	(cntd)
August 24	Project report and source code
August 29	Project Presentations

Date	Lecture Topics and Materials	Assignments
October 17	Introduction: What is Big Data. Major tools used by data scientists. Class overview Readings: ◎ https://github.com/SANSA-Stack ◎ https://github.com/SmartDataAnalytics ◎ https://github.com/big-data-europe ◎ http://www.artima.com/scalazine/articles/steps.html ◎ http://www.scala-lang.org/ ◎ http://twitter.github.io/scala_school/	Lab 1: Setting up the environment and getting started with Scala.
October 24	No Class
October 31	No Class: Public Holiday
November 7	Spark Fundamentals I Readings: ◎ Spark Programming Guide ◎ RDD and DataFrame API Examples ◎ Spark Cluster Overview ◎ Spark Configuration, Monitoring and tuning References: Spark: Cluster Computing with Working Sets Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics	Lab 2: Getting started with Spark
November 09	Spark Fundamentals II (Spark SQL) Readings: ◎ Spark SQL, DataFrames and Datasets Guide References: Spark SQL: Relational Data Processing in Spark	Lab 3: Spark GraphX and Spark SQL operations
November 14	Spark Fundamentals II (Spark ML – GraphX), SANSA – Semantic Analytics Stack, Project Allocation Readings: ◎ Spark Machine Learning Library (MLlib) Guide ◎ GraphX Programming Guide ◎ SANSA Overview and SANSA FAQ. References: MLlib: Machine Learning in Apache Spark GraphX: Graph Processing in a Distributed Dataflow Framework Distributed Semantic Analytics using the SANSA Stack The Tale of Sansa Spark	Lab 4: Spark ML and SANSA
November 21	Project Assignment
December 05	Presentation for the Project
December 19	Meetings
	Lab work
	(cntd)
	(cntd)
February 20	Project report
February 27	Project Presentations

Date	Lecture Topics and Materials	Assignments
April 18	Introduction: What is Big Data. Major tools used by data scientists. Class overview Readings: ◎ https://github.com/SANSA-Stack ◎ https://github.com/SmartDataAnalytics ◎ https://github.com/big-data-europe ◎ http://www.artima.com/scalazine/articles/steps.html ◎ http://www.scala-lang.org/ ◎ http://twitter.github.io/scala_school/	Lab 1: Setting up the environment and getting started with Scala.
May 2	Spark Fundamentals I Readings: ◎ Spark Programming Guide ◎ RDD and DataFrame API Examples ◎ Spark Cluster Overview ◎ Spark Configuration, Monitoring and tuning References: Spark: Cluster Computing with Working Sets Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics	Lab 2: Getting started with Spark
May 9	Spark Fundamentals II Readings: ◎ Spark SQL, DataFrames and Datasets Guide ◎ GraphX Programming Guide References: Spark SQL: Relational Data Processing in Spark GraphX: Graph Processing in a Distributed Dataflow Framework	Lab 3: Spark GraphX and Spark SQL operations
May 23	Spark Fundamentals II (Spark ML), SANSA – Semantic Analytics Stack Readings: ◎ Spark Machine Learning Library (MLlib) Guide ◎ SANSA Overview and SANSA FAQ. References: MLlib: Machine Learning in Apache Spark	Lab 4: Spark ML and SANSA


	(cntd)
	(cntd)
	(cntd)
	(cntd)
July 25	Project Presentations

Projects

Besides the tutorials and worksheets, which will be posted publicly at the end of the semester, students will be assigned projects. The best way to learn is by doing, so these will largely be applied assignments that provide hands-on experience with the basic skills taught during the lab.

You will have to choose one of the proposed projects. Each project is designed for group work, and we recommend groups of 3-4 students. Working in a team is part of the project.


#	Description	Submission Due Date	Presentation Date (Time)
1	TBA	July 5	July 11 (10:00)
2	TBA	July 5	July 11 (10:30)
3	TBA	July 5	July 11 (11:00)
4	TBA	July 5	July 11 (11:30)
5	TBA	July 5	July 11 (12:00)
6	TBA	July 5	July 11 (12:30)


#	Description	Submission Due Date	Presentation Date (Time)
1	A Load-Balanced MapReduce Algorithm for Blocking-based Entity-resolution with Multiple Keys using Spark	February 22	February 27 (10:00)
2	Distributed Data Deduplication using Spark	February 22	February 27 (10:30)
3	DisBLAST: a Loosely Schema-aware Meta-blocking Approach for Entity Resolution using Spark	February 22	February 27 (11:00)
4	A Distributed Active Blocking Scheme Learning for Entity Resolution	February 22	February 27 (11:30)
5	DistTransH: A Scalable Knowledge Graph Embedding by Translating on Hyperplanes	February 22	February 27 (12:00)
6	Evolutionary Discovery of Multi relational Association Rules from Ontological Knowledge Bases using Spark	February 22	February 27 (12:30)


#	Description	Submission Due Date	Presentation Date (Time)
1	DistSAKey: Scalable Almost Key discovery in RDF data using Spark	August 24	August 29 (10:00)
2	Efficient completeness aware rule learning from Knowledge Graphs using Spark	August 24	August 29 (10:30)
3	Finding Association Rules from Semantic web data using Spark framework	August 24	August 29 (11:00)
4	Mining Semantic Association Rules from RDF data using Spark	August 24	August 29 (11:30)
5	Evolutionary Discovery of Multi relational Association Rules from Ontological Knowledge Bases using Spark	August 24	August 29 (12:00)


#	Description	Submission Due Date	Presentation Date (Time)
1	Efficient semantic subgroup discovery using Spark	February 20	February 27 (TBA)
2	Kernels for RDF data using Spark	February 20	February 27 (TBA)
3	Ranking RDF properties using Spark framework	February 20	February 27 (TBA)
4	Distributed Entity Resolution using Spark	February 20	February 27 (TBA)
5	Substructure Kernels for RDF data using Spark	February 20	February 27 (TBA)


#	Description	Submission Due Date	Presentation Date (Time)
1	Efficient First Order Inductive Learner on Spark	July 23	July 25 (10:00)
2	Efficient semantic subgroup discovery using Spark	July 23	July 25 (10:30)
3	SANSA-RDF: Reading more types of RDF data	July 23	July 25 (10:30)
4	Efficient Graph Kernels for RDF data using Spark	July 23	July 25 (11:00)
5	RDF2Rules using Spark framework	July 23	July 25 (11:30)
6	Distributed TensorLog: An Efficient Differentiable Deductive Database using Spark	July 23	July 25 (12:00)
7	Entity Resolution using Spark	July 23	July 25 (12:30)

Grading

Grades for all projects will be assessed as follows:

- project and team selection, problem understanding, implementation concept, and pre-presentation (15%)
- project submission (implementation, documentation, project report) (80%)
  - implementation (40%)
  - project report (40%)
    - motivation, documentation (20%)
    - results and discussion (20%)
  - submit report and code via Git repository commit
- Q&A session (5%)
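
The weighting above can be checked with a few lines of arithmetic. The following is only an illustrative sketch: the percentages come from the breakdown above, but the component names, the 0-100 scale for each component score, and the example scores are assumptions made for the illustration, not part of the official grading procedure.

```python
# Weights in percent, taken from the grading breakdown above.
# The dictionary keys are hypothetical names chosen for this sketch.
WEIGHTS = {
    "pre_presentation": 15,         # selection, understanding, concept, pre-presentation
    "implementation": 40,           # part of the 80% project submission
    "report_motivation_docs": 20,   # part of the 40% project report
    "report_results_discussion": 20,  # part of the 40% project report
    "qa_session": 5,
}


def final_score(scores):
    """Combine per-component scores into one weighted total.

    Assumes each component is scored on a 0-100 scale (an assumption of
    this sketch; the course page does not specify a scale).
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Made-up example scores, for illustration only.
example = {
    "pre_presentation": 80,
    "implementation": 90,
    "report_motivation_docs": 70,
    "report_results_discussion": 60,
    "qa_session": 100,
}

if __name__ == "__main__":
    print(final_score(example))
```

The weights sum to 100, so a team that scores full marks in every component reaches the full total. Implementation and report each carry 40 of the 80 points for the project submission.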