The goal of this module is to provide experience and technical skills with Big Data processing tools such as Flink and Spark, and to familiarize students with the functional programming style prevalent in concurrent and parallel programming for Big Data. Students will learn to develop data mining and machine learning solutions for massive amounts of data.
The following is a tentative schedule of the topics we plan to cover and what the assignments will focus on. More details will be added as the course progresses.
One goal of this class is to make you comfortable with a wide variety of tools (most of which are listed below). You are NOT expected to learn these tools on your own; we will provide step-by-step guidance on getting started with them, and the actual assignments will be simple.
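As a first taste of the functional style mentioned above, here is a minimal plain-Scala sketch (the object and method names are illustrative, not part of any lab): it chains `filter` and `map` over an immutable collection, the same pattern you will later use with Spark's RDD API.

```scala
// Illustrative example of the functional collection style used throughout the course.
object FunctionalStyle {
  // Keep the even numbers, then square each of them.
  // No mutation, no loops: just a pipeline of pure transformations.
  def squaresOfEvens(xs: List[Int]): List[Int] =
    xs.filter(_ % 2 == 0).map(n => n * n)

  def main(args: Array[String]): Unit = {
    println(squaresOfEvens(List(1, 2, 3, 4, 5))) // prints List(4, 16)
  }
}
```

The same `filter`/`map` chain works unchanged on a parallel collection or, with minor syntactic differences, on a Spark RDD.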
| Date | Lecture Topics and Materials | Assignments |
|---|---|---|
| April 18 | Introduction: What is Big Data? Major tools used by data scientists. Class overview | Lab 1: Setting up the environment and getting started with Scala |
| May 2 | Spark Fundamentals I: Spark Programming Guide; RDD and DataFrame API examples; Spark cluster overview; Spark configuration, monitoring, and tuning | Lab 2: Getting started with Spark |
| May 9 | Spark Fundamentals II | Lab 3: Spark GraphX and Spark SQL operations |
| May 23 | Spark Fundamentals II (Spark ML), SANSA – Semantic Analytics Stack | Lab 4: Spark ML and SANSA |
| July 25 | Project Presentations | |
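To preview the Spark Fundamentals material, the classic word-count example can be sketched with plain Scala collections (the object and function names here are illustrative). In actual Spark code, the input would come from `sc.textFile(...)` and the counting step would be written as `map(w => (w, 1)).reduceByKey(_ + _)`, but the collection pipeline below translates almost line-for-line.

```scala
// Word count over an in-memory collection. With Spark, `lines` would be an RDD
// built from sc.textFile(...), and groupBy/map would become map + reduceByKey.
object WordCountSketch {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.toLowerCase.split("\\s+")) // split each line into words
      .filter(_.nonEmpty)                   // drop empty tokens
      .groupBy(identity)                    // group equal words together
      .map { case (word, ws) => (word, ws.size) } // count each group

  def main(args: Array[String]): Unit = {
    println(wordCount(Seq("to be or not to be")))
  }
}
```

Running `wordCount(Seq("to be or not to be"))` yields `Map(to -> 2, be -> 2, or -> 1, not -> 1)` (in some order); Lab 2 covers the equivalent RDD version.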
Besides the tutorials and worksheets, which will be posted publicly at the end of the semester, there will be several projects assigned to students. The best way to learn is by doing, so these will largely be applied assignments that provide hands-on experience with the basic skills taught during the lab.
You will have to choose one of the proposed projects below. Each project is designed for group work and is recommended to be tackled in groups of 3–4 students; working in a team is part of the project.
| # | Description | Submission Due Date | Presentation Date (Time) |
|---|---|---|---|
| 1 | Efficient First Order Inductive Learner on Spark | July 23 | July 25 (10:00) |
| 2 | Efficient semantic subgroup discovery using Spark | July 23 | July 25 (10:30) |
| 3 | SANSA-RDF: Reading more types of RDF data | July 23 | July 25 (10:30) |
| 4 | Efficient Graph Kernels for RDF data using Spark | July 23 | July 25 (11:00) |
| 5 | RDF2Rules using the Spark framework | July 23 | July 25 (11:30) |
| 6 | Distributed TensorLog: An Efficient Differentiable Deductive Database using Spark | July 23 | July 25 (12:00) |
| 7 | Entity Resolution using Spark | July 23 | July 25 (12:30) |
Grades for all projects will be assessed as follows:
- Project and team selection, problem understanding, implementation concept, and pre-presentation (15%)
- Project submission (implementation, documentation, project report) (80%)
  - implementation (40%)
  - project report (40%)
    - motivation and documentation (20%)
    - results and discussion (20%)
  - due 23/07/2017 (11:59 pm – no extension!)
  - submit report and code via Git repository commit
- Q&A session (5%)