Date: 16 - 17 November 2017

Description

As the volume of data used in data analysis tasks grows rapidly, processing it with standard methods becomes increasingly challenging. Enter Spark, a high-performance distributed computing framework that lets us tackle big-data problems by distributing the workload across a cluster of machines.

This two-day course covers the technical architecture and use cases of Spark, how to set it up for your work, best practices and programming aspects. The first day includes an overview, architectural concepts and programming with Spark's fundamental data structure, the RDD (Resilient Distributed Dataset). The second day focuses on the SQL module of Spark, which allows the user to analyse data held in Spark's distributed collections (DataFrames) using traditional SQL queries.

Learning outcome

After this course you should be able to write simple to intermediate programs in Spark using RDDs and DataFrames/SQL.

Prerequisites

Basic knowledge of programming in general is recommended (ideally in Python).

Please NOTE: This is not a regular programming course. Participants will be expected to learn emerging concepts in the field of big data and distributed processing, which might be completely different from the concepts of a general-purpose programming language.

Agenda

Day 1, Thursday 16.11

   09.00 – 09.30 Overview and architecture of Spark
   09.30 – 10.15 Basics of RDDs + Demo
   10.15 – 10.30 Coffee break
   10.30 – 11.00 RDD: Transformations and Actions
   11.00 – 12.00 Exercises
   12.00 – 13.00 Lunch
   13.00 – 13.30 Word Count Example
   13.30 – 14.00 Exercises
   14.00 – 14.15 Short overview of Spark's machine learning library
   14.15 – 14.30 Coffee break
   14.30 – 15.30 Exercises
   15.30 – 16.00 Summary of the first day & exercises walk-through

Day 2, Friday 17.11

   09.00 – 09.30 Spark Dataframes and SQL overview
   09.30 – 10.15 Exercises
   10.15 – 10.30 Coffee break
   10.30 – 10.45 Dataframes and SQL contd.
   10.45 – 12.00 Exercises
   12.00 – 13.00 Lunch
   13.00 – 13.30 Best practices and other useful stuff
   13.30 – 14.30 Exercises
   14.30 – 14.45 Coffee break
   14.45 – 15.00 Brief overview of Spark Streaming
   15.00 – 15.15 Demo: Processing live Twitter stream data
   15.15 – 16.00 Summary of the course & exercises walk-through

Lecturers:

Apurva Nandan (CSC), Teaching Assistant: Tommi Jalkanen (CSC)

Language: English

Price: Free of charge

https://events.prace-ri.eu/event/668/

Event types:

  • Workshops and courses

