MA-INF 4223 - Distributed Big Data Analytics



Lecturers: Dr. Hajira Jabeen, Gezim Sejdiu, Prof. Dr. Jens Lehmann
Time: Thursday, 10:00–13:00, Endenicher Allee 19A, Seminar Room 1.047, Informatik III
GitHub: We will use GitHub to disseminate project information.

Course Description

The goal of this module is to give students hands-on experience and technical skills with big data processing tools such as Flink and Spark, and to acquaint them with the functional programming style prevalent in concurrent and parallel programming for big data. Students will learn to develop data mining and machine learning solutions for massive amounts of data.


Schedule

The following schedule lists the topics we plan to cover and what the assignments will focus on. More details will be added as the course progresses.

One goal of this class is to make you comfortable with a wide variety of tools (most of which are listed below). You are NOT expected to learn these tools on your own; we will provide step-by-step guidance on getting started with them, and the actual assignments will be simple.

[restabs alignment=”osc-tabs-left” responsive=”false” tabheadcolor=”#000000″ seltabheadcolor=”#1e73be”]
[restab title=”WiSe 2018/19″ active=”active”]

Date Lecture Topics and Materials Assignments
October 11 Introduction: What is Big Data? Major tools used by data scientists. Class overview

Readings:
         ◎ https://github.com/SANSA-Stack
         ◎ http://www.scala-lang.org/
Lab 1: Setting up the environment and getting started with Scala
October 18 Spark Fundamentals I

Readings:
         ◎ Spark Programming Guide
         ◎ RDD and DataFrame API Examples
         ◎ Spark Cluster Overview
         ◎ Spark Configuration, Monitoring, and Tuning

References:

Lab 2: Getting started with Spark
October 25

Spark Fundamentals II (Spark GraphX + Spark SQL)

Readings:
         ◎ Spark SQL, DataFrames and Datasets Guide

 References:

 

Lab 3: Spark GraphX and Spark SQL operations
November 1 No Class: Public Holiday
November 8

Spark Fundamentals II (Spark ML), SANSA – Semantic Analytics Stack, Project Allocation

Readings:
         ◎ Spark Machine Learning Library (MLlib) Guide
         ◎ GraphX Programming Guide
         ◎ SANSA Overview and SANSA FAQ

References:

 

Lab 4: Spark ML and SANSA
November 8 Project Assignment
November 22 First presentation for the Project
December 13 Meeting I
Lab work
(contd.)
(contd.)
January 24 Meeting II
February 22 Project report and source code
February 28 Project Presentations

[/restab]

[restab title=”SoSe 2018″ ]

Date Lecture Topics and Materials Assignments
April 17 Introduction: What is Big Data? Major tools used by data scientists. Class overview

Readings:
         ◎ https://github.com/SANSA-Stack
         ◎ http://www.scala-lang.org/
Lab 1: Setting up the environment and getting started with Scala
April 24 Spark Fundamentals I

Readings:
         ◎ Spark Programming Guide
         ◎ RDD and DataFrame API Examples
         ◎ Spark Cluster Overview
         ◎ Spark Configuration, Monitoring, and Tuning

References:

Lab 2: Getting started with Spark
May 1 No Class: Public Holiday
May 8

Spark Fundamentals II (Spark GraphX + Spark SQL)

Readings:
         ◎ Spark SQL, DataFrames and Datasets Guide

 References:

 

Lab 3: Spark GraphX and Spark SQL operations
May 15

Spark Fundamentals II (Spark ML), SANSA – Semantic Analytics Stack, Project Allocation

Readings:
         ◎ Spark Machine Learning Library (MLlib) Guide
         ◎ GraphX Programming Guide
         ◎ SANSA Overview and SANSA FAQ

References:

 

Lab 4: Spark ML and SANSA
May 15 Project Assignment
June 5 First presentation for the Project
June 12 Meetings
Lab work
(contd.)
(contd.)
August 24 Project report and source code
August 29 Project Presentations

[/restab]
[restab title=”WiSe 2017/18″ ]

Date Lecture Topics and Materials Assignments
October 17 Introduction: What is Big Data? Major tools used by data scientists. Class overview

Readings:
         ◎ https://github.com/SANSA-Stack
         ◎ http://www.scala-lang.org/
Lab 1: Setting up the environment and getting started with Scala.
October 24 No Class
October 31 No Class: Public Holiday
November 7 Spark Fundamentals I

Readings:
         ◎ Spark Programming Guide
         ◎ RDD and DataFrame API Examples
         ◎ Spark Cluster Overview
         ◎ Spark Configuration, Monitoring, and Tuning

References:

Lab 2: Getting started with Spark
November 9 Spark Fundamentals II (Spark SQL)

Readings:
         ◎ Spark SQL, DataFrames and Datasets Guide

 References:
Lab 3: Spark GraphX and Spark SQL operations
November 14 Spark Fundamentals II (Spark ML – GraphX), SANSA – Semantic Analytics Stack, Project Allocation

Readings:
         ◎ Spark Machine Learning Library (MLlib) Guide
         ◎ GraphX Programming Guide
         ◎ SANSA Overview and SANSA FAQ

References:

Lab 4: Spark ML and SANSA
November 21 Project Assignment
December 5 Presentation for the Project
December 19 Meetings
Lab work
(contd.)
(contd.)
February 20 Project report
February 27 Project Presentations

[/restab]

[restab title=”SoSe 2017″]

Date Lecture Topics and Materials Assignments
April 18 Introduction: What is Big Data? Major tools used by data scientists. Class overview

Readings:
         ◎ https://github.com/SANSA-Stack
         ◎ http://www.scala-lang.org/
Lab 1: Setting up the environment and getting started with Scala.
May 2 Spark Fundamentals I

Readings:
         ◎ Spark Programming Guide
         ◎ RDD and DataFrame API Examples
         ◎ Spark Cluster Overview
         ◎ Spark Configuration, Monitoring, and Tuning

References:

Lab 2: Getting started with Spark
May 9 Spark Fundamentals II

Readings:
         ◎ Spark SQL, DataFrames and Datasets Guide
         ◎ GraphX Programming Guide

 References:
Lab 3: Spark GraphX and Spark SQL operations
May 23 Spark Fundamentals II (Spark ML), SANSA – Semantic Analytics Stack

Readings:
         ◎ Spark Machine Learning Library (MLlib) Guide
         ◎ SANSA Overview and SANSA FAQ

References:

Lab 4: Spark ML and SANSA
(contd.)
(contd.)
(contd.)
(contd.)
July 25 Project Presentations
 [/restab][/restabs]

Projects

In addition to the tutorials and worksheets, which will be posted publicly at the end of the semester, students will be assigned projects. The best way to learn is by doing, so these will largely be applied assignments that provide hands-on experience with the basic skills taught during the labs.

You will choose one of the proposed projects. Each project is designed for group work, and we recommend groups of 3-4 students. Working in a team is part of the project.

[restabs alignment=”osc-tabs-left” responsive=”false” tabheadcolor=”#000000″ seltabheadcolor=”#1e73be”]

[restab title=”WiSe 2018/19″ active=”active”]

 
# | Description | Submission Due Date | Presentation Date (Time)
1 | TBA | February 22 | February 28 (10:00)
2 | TBA | February 22 | February 28 (10:30)
3 | TBA | February 22 | February 28 (11:00)
4 | TBA | February 22 | February 28 (11:30)
5 | TBA | February 22 | February 28 (12:00)

[/restab]

[restab title=”SoSe 2018″]

 
# | Description | Submission Due Date | Presentation Date (Time)
1 | DistSAKey: Scalable Almost Key discovery in RDF data using Spark | August 24 | August 29 (10:00)
2 | Efficient completeness aware rule learning from Knowledge Graphs using Spark | August 24 | August 29 (10:30)
3 | Finding Association Rules from Semantic web data using Spark framework | August 24 | August 29 (11:00)
4 | Mining Semantic Association Rules from RDF data using Spark | August 24 | August 29 (11:30)
5 | Evolutionary Discovery of Multi relational Association Rules from Ontological Knowledge Bases using Spark | August 24 | August 29 (12:00)

[/restab]

[restab title=”WiSe 2017/18″]

 
# | Description | Submission Due Date | Presentation Date (Time)
1 | Efficient semantic subgroup discovery using Spark | February 20 | February 27 (TBA)
2 | Kernels for RDF data using Spark | February 20 | February 27 (TBA)
3 | Ranking RDF properties using Spark framework | February 20 | February 27 (TBA)
4 | Distributed Entity Resolution using Spark | February 20 | February 27 (TBA)
5 | Substructure Kernels for RDF data using Spark | February 20 | February 27 (TBA)

[/restab]

[restab title=”SoSe 2017″]

 
# | Description | Submission Due Date | Presentation Date (Time)
1 | Efficient First Order Inductive Learner on Spark | July 23 | July 25 (10:00)
2 | Efficient semantic subgroup discovery using Spark | July 23 | July 25 (10:30)
3 | SANSA-RDF: Reading more types of RDF data | July 23 | July 25 (10:30)
4 | Efficient Graph Kernels for RDF data using Spark | July 23 | July 25 (11:00)
5 | RDF2Rules using Spark framework | July 23 | July 25 (11:30)
6 | Distributed TensorLog: An Efficient Differentiable Deductive Database using Spark | July 23 | July 25 (12:00)
7 | Entity Resolution using Spark | July 23 | July 25 (12:30)

[/restab][/restabs]

Grading

All projects are graded as follows:

  • project and team selection, problem understanding, implementation concept, and pre-presentation (15%)
  • project submission (implementation, documentation, project report) (80%)
    • implementation (40%)
    • project report (40%)
      • motivation, documentation (20%)
      • results and discussion (20%)
    • submit report and code via Git repository commit
  • Q&A session (5%)
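
As a rough illustration of how the weights above combine, the following minimal sketch computes a weighted project score. The function name `project_score`, the component keys, and the 0-100 scale for each component are assumptions made for this example; the course page does not specify how component scores are recorded or how the result maps to a final grade.

```python
from typing import Mapping

# Weights in percent, taken from the grading breakdown above.
# The key names are hypothetical labels chosen for this sketch.
WEIGHTS_PERCENT = {
    "pre_presentation": 15,          # selection, understanding, concept, pre-presentation
    "implementation": 40,            # part of project submission (80%)
    "report_motivation_docs": 20,    # project report: motivation, documentation
    "report_results_discussion": 20, # project report: results and discussion
    "qa_session": 5,
}


def project_score(scores: Mapping[str, float]) -> float:
    """Return the weighted score, assuming each component is scored 0-100."""
    missing = WEIGHTS_PERCENT.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    weighted_sum = sum(WEIGHTS_PERCENT[name] * scores[name] for name in WEIGHTS_PERCENT)
    return weighted_sum / 100


if __name__ == "__main__":
    example = {
        "pre_presentation": 80,
        "implementation": 90,
        "report_motivation_docs": 70,
        "report_results_discussion": 60,
        "qa_session": 100,
    }
    print(project_score(example))  # prints 79.0
```

The weights sum to 100, with the two report parts (20% each) and the implementation (40%) together making up the 80% project submission share.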