Lecturers: Dr. Hajira Jabeen, Gezim Sejdiu, Prof. Dr. Jens Lehmann
Time: Thursday, 10:00–13:00, Endenicher Allee 19A – Seminarraum 1.047, Informatik III
GitHub: We will use GitHub to distribute project information.
The lab will not be offered during the WiSe2019/20 semester.
Course Description
The goal of this lab is to give students hands-on experience and technical skills with big data processing tools such as Flink and Spark, and to acquaint them with the functional programming style prevalent in concurrent and parallel programming for big data. Students will learn to develop data mining and machine learning solutions for massive amounts of data.
Schedule
The following is a schedule of the topics we plan to cover and what the assignments will focus on. More details will be added as the course progresses.
One goal of this class is to make you comfortable with a wide variety of tools (most of which are listed below). You are NOT expected to learn these tools on your own; we will provide step-by-step guidance on getting started with them, and the actual assignments will be simple.
Date | Lecture Topics and Materials | Assignments |
---|---|---|
April 04 | Introduction: What is Big Data. Major tools used by data scientists. Class overview. Readings: ◎ https://github.com/SANSA-Stack | Lab 1: Setting up the environment and getting started with Scala |
April 11 | Spark Fundamentals I. Readings: ◎ Spark Programming Guide ◎ RDD and DataFrame API Examples ◎ Spark SQL, DataFrames and Datasets Guide ◎ Spark Cluster Overview ◎ Spark Configuration, Monitoring, and Tuning | Lab 2: Getting started with Spark + Spark GraphX and Spark SQL operations |
April 18 | No Class: Public Holiday | |
April 25 | Spark Fundamentals II (Spark ML), BigDL | Lab 3: Spark ML and BigDL |
May 2 | SANSA – Semantic Analytics Stack, Project Allocation | Lab 4: SANSA |
May 2 | Project Assignment | |
May 16 | First presentation for the Project | |
May 23 | Meeting I | |
Lab work | | |
(cont.) | | |
(cont.) | | |
June 27 | Meeting II | |
July 05 | Project report and source code submission | |
July 11 | Project Presentations | |
Date | Lecture Topics and Materials | Assignments |
---|---|---|
October 11 | Introduction: What is Big Data. Major tools used by data scientists. Class overview. Readings: ◎ https://github.com/SANSA-Stack | Lab 1: Setting up the environment and getting started with Scala |
October 18 | Spark Fundamentals I. Readings: ◎ Spark Programming Guide ◎ RDD and DataFrame API Examples ◎ Spark Cluster Overview ◎ Spark Configuration, Monitoring, and Tuning | Lab 2: Getting started with Spark |
October 25 | Spark Fundamentals II (Spark GraphX + Spark SQL) | Lab 3: Spark GraphX and Spark SQL operations |
November 1 | No Class: Public Holiday | |
November 8 | Spark Fundamentals II (Spark ML), SANSA – Semantic Analytics Stack, Project Allocation | Lab 4: Spark ML and SANSA |
November 8 | Project Assignment | |
November 22 | First presentation for the Project | |
December 13 | Meeting I | |
Lab work | | |
(cont.) | | |
(cont.) | | |
January 24 | Meeting II | |
February 22 | Project report and source code | |
February 27 | Project Presentations | |
Date | Lecture Topics and Materials | Assignments |
---|---|---|
April 17 | Introduction: What is Big Data. Major tools used by data scientists. Class overview. Readings: ◎ https://github.com/SANSA-Stack | Lab 1: Setting up the environment and getting started with Scala |
April 24 | Spark Fundamentals I. Readings: ◎ Spark Programming Guide ◎ RDD and DataFrame API Examples ◎ Spark Cluster Overview ◎ Spark Configuration, Monitoring, and Tuning | Lab 2: Getting started with Spark |
May 1 | No Class: Public Holiday | |
May 8 | Spark Fundamentals II (Spark GraphX + Spark SQL) | Lab 3: Spark GraphX and Spark SQL operations |
May 15 | Spark Fundamentals II (Spark ML), SANSA – Semantic Analytics Stack, Project Allocation | Lab 4: Spark ML and SANSA |
May 15 | Project Assignment | |
June 5 | First presentation for the Project | |
June 12 | Meetings | |
Lab work | | |
(cont.) | | |
(cont.) | | |
August 24 | Project report and source code | |
August 29 | Project Presentations | |
Date | Lecture Topics and Materials | Assignments |
---|---|---|
October 17 | Introduction: What is Big Data. Major tools used by data scientists. Class overview. Readings: ◎ https://github.com/SANSA-Stack | Lab 1: Setting up the environment and getting started with Scala |
October 24 | No Class | |
October 31 | No Class: Public Holiday | |
November 7 | Spark Fundamentals I. Readings: ◎ Spark Programming Guide ◎ RDD and DataFrame API Examples ◎ Spark Cluster Overview ◎ Spark Configuration, Monitoring, and Tuning | Lab 2: Getting started with Spark |
November 09 | Spark Fundamentals II (Spark SQL) | Lab 3: Spark GraphX and Spark SQL operations |
November 14 | Spark Fundamentals II (Spark ML – GraphX), SANSA – Semantic Analytics Stack, Project Allocation | Lab 4: Spark ML and SANSA |
November 21 | Project Assignment | |
December 05 | Presentation for the Project | |
December 19 | Meetings | |
Lab work | | |
(cont.) | | |
(cont.) | | |
February 20 | Project report | |
February 27 | Project Presentations | |
Date | Lecture Topics and Materials | Assignments |
---|---|---|
April 18 | Introduction: What is Big Data. Major tools used by data scientists. Class overview. Readings: ◎ https://github.com/SANSA-Stack | Lab 1: Setting up the environment and getting started with Scala |
May 2 | Spark Fundamentals I. Readings: ◎ Spark Programming Guide ◎ RDD and DataFrame API Examples ◎ Spark Cluster Overview ◎ Spark Configuration, Monitoring, and Tuning | Lab 2: Getting started with Spark |
May 9 | Spark Fundamentals II | Lab 3: Spark GraphX and Spark SQL operations |
May 23 | Spark Fundamentals II (Spark ML), SANSA – Semantic Analytics Stack | Lab 4: Spark ML and SANSA |
(cont.) | | |
(cont.) | | |
(cont.) | | |
(cont.) | | |
July 25 | Project Presentations | |
Projects
In addition to the tutorials and worksheets, which will be posted publicly at the end of the semester, students will be assigned projects. The best way to learn is by doing, so these will largely be applied assignments that provide hands-on experience with the basic skills taught during the lab.
You will choose one of the proposed projects. Each project is designed for group work, and we recommend groups of 3–4 students. Working in a team is part of the project.
# | Description | Submission Due Date | Presentation Date (Time) |
---|---|---|---|
1 | TBA | July 5 | July 11 (10:00) |
2 | TBA | July 5 | July 11 (10:30) |
3 | TBA | July 5 | July 11 (11:00) |
4 | TBA | July 5 | July 11 (11:30) |
5 | TBA | July 5 | July 11 (12:00) |
6 | TBA | July 5 | July 11 (12:30) |
# | Description | Submission Due Date | Presentation Date (Time) |
---|---|---|---|
1 | A Load-Balanced MapReduce Algorithm for Blocking-based Entity-resolution with Multiple Keys using Spark | February 22 | February 27 (10:00) |
2 | Distributed Data Deduplication using Spark | February 22 | February 27 (10:30) |
3 | DisBLAST: a Loosely Schema-aware Meta-blocking Approach for Entity Resolution using Spark | February 22 | February 27 (11:00) |
4 | A Distributed Active Blocking Scheme Learning for Entity Resolution | February 22 | February 27 (11:30) |
5 | DistTransH: A Scalable Knowledge Graph Embedding by Translating on Hyperplanes | February 22 | February 27 (12:00) |
6 | Evolutionary Discovery of Multi relational Association Rules from Ontological Knowledge Bases using Spark | February 22 | February 27 (12:30) |
# | Description | Submission Due Date | Presentation Date (Time) |
---|---|---|---|
1 | DistSAKey: Scalable Almost Key discovery in RDF data using Spark | August 24 | August 29 (10:00) |
2 | Efficient completeness aware rule learning from Knowledge Graphs using Spark | August 24 | August 29 (10:30) |
3 | Finding Association Rules from Semantic web data using Spark framework | August 24 | August 29 (11:00) |
4 | Mining Semantic Association Rules from RDF data using Spark | August 24 | August 29 (11:30) |
5 | Evolutionary Discovery of Multi relational Association Rules from Ontological Knowledge Bases using Spark | August 24 | August 29 (12:00) |
# | Description | Submission Due Date | Presentation Date (Time) |
---|---|---|---|
1 | Efficient semantic subgroup discovery using Spark | February 20 | February 27 (TBA) |
2 | Kernels for RDF data using Spark | February 20 | February 27 (TBA) |
3 | Ranking RDF properties using Spark framework | February 20 | February 27 (TBA) |
4 | Distributed Entity Resolution using Spark | February 20 | February 27 (TBA) |
5 | Substructure Kernels for RDF data using Spark | February 20 | February 27 (TBA) |
# | Description | Submission Due Date | Presentation Date (Time) |
---|---|---|---|
1 | Efficient First Order Inductive Learner on Spark | July 23 | July 25 (10:00) |
2 | Efficient semantic subgroup discovery using Spark | July 23 | July 25 (10:30) |
3 | SANSA-RDF : Reading more types of RDF data | July 23 | July 25 (10:30) |
4 | Efficient Graph Kernels for RDF data using Spark | July 23 | July 25 (11:00) |
5 | RDF2Rules using Spark framework | July 23 | July 25 (11:30) |
6 | Distributed TensorLog: An Efficient Differentiable Deductive Database using Spark | July 23 | July 25 (12:00) |
7 | Entity Resolution using Spark | July 23 | July 25 (12:30) |
Grading
All projects will be graded as follows:
- project and team selection, problem understanding, implementation concept, and pre-presentation (15%)
- project submission (implementation, documentation, project report) (80%)
  - implementation (40%)
  - project report (40%)
    - motivation, documentation (20%)
    - results and discussion (20%)
  - submit the report and code by committing them to the Git repository
- Q&A session (5%)
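To make the weighting concrete, here is a minimal sketch of how the percentages above could combine into one overall score. It rests on several assumptions that this page does not state: that each component is scored on a 0–100 scale, that the two 20% items are the parts of the 40% project report, and that the components are simply added. The names `WEIGHTS` and `overall_score` are hypothetical and used only for illustration.

```python
# Weights copied from the grading list. The nesting is an assumption:
# implementation (40%) + report (40%) = submission (80%), and the report
# splits into motivation/documentation (20%) and results/discussion (20%).
WEIGHTS = {
    "pre_presentation": 0.15,
    "implementation": 0.40,
    "report_motivation_documentation": 0.20,
    "report_results_discussion": 0.20,
    "qa_session": 0.05,
}


def overall_score(scores):
    """Combine per-component scores (assumed 0-100 each) into a weighted total."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Sanity check: the leaf weights account for the whole grade.
assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9

# Made-up scores for a hypothetical team, purely for illustration.
example = {
    "pre_presentation": 80,
    "implementation": 90,
    "report_motivation_documentation": 70,
    "report_results_discussion": 85,
    "qa_session": 100,
}
print(round(overall_score(example), 2))
```

Under these assumptions, the implementation and the report together dominate the result, so a strong submission matters far more than the pre-presentation or the Q&A session.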