Course Information

Big Data Platforms is a 5 ECTS Master's level advanced course. It focuses on big data platforms and on the key algorithmic ideas and methods used to implement them. After completing this course, you will be able to list many of the key technologies used in big data processing and to select suitable methods for solving challenging big data processing tasks using cloud computing technologies. You will also be able to compare the scalability and fault tolerance implications of the selected methods.

Main topics are:

  • distributed computing,
  • warehouse-scale computers,
  • fault tolerance in distributed systems,
  • distributed file systems,
  • distributed batch processing with the MapReduce and the Apache Spark (PySpark) computing frameworks, and
  • distributed cloud-based databases.

The course material will consist of lecture materials and exercises provided by the lecturer.

Course Target Audience

The course is suitable for those who are interested in big data platforms employed in cloud computing and have previous knowledge of programming, database systems and command line tools. It is an optional course in the Data Science Master's Program and is also suitable for Computer Science Master's Program students and for University of Helsinki exchange students.

Course Prerequisites

To attend this course, you must have:

  • basic programming skills (Python),
  • skills to work with command line tools in Linux, and
  • basic knowledge in database systems (SQL).

Lecture Schedule

The lectures of the course will be held over Zoom. Slides and a video recording of each lecture will typically be made available within 24 hours of the live lecture session. The link to the Zoom lectures is:

https://helsinki.zoom.us/j/65695344424?pwd=TGxVUEFkY2RxQm9FYXBVN2NGWWxmQT09

Lecture               Lecture date      Lecture time (EEST)
Lecture 1             Tue 3.9.2024      10:15-11:45
Lecture 2             Thu 5.9.2024      12:15-13:45
Lecture 3             Tue 10.9.2024     10:15-11:45
Lecture 4             Thu 12.9.2024     12:15-13:45
Lecture 5             Tue 17.9.2024     10:15-11:45
Lecture 6             Thu 19.9.2024     12:15-13:45
Lecture 7             Tue 24.9.2024     10:15-11:45
Lecture 8             Thu 26.9.2024     12:15-13:45
Lecture 9             Tue 1.10.2024     10:15-11:45
Lecture 10            Thu 3.10.2024     12:15-13:45
Lecture 11            Tue 8.10.2024     10:15-11:45
Backup Lecture Slot   Thu 10.10.2024    12:15-13:45

Lecture Slides and Videos

The lecture slides contain all the material needed to pass the course. The videos go through this material and contain no additional information needed for the quizzes.

Home Exercise Schedule

The course will contain programming exercises where you will use the Spark framework to solve big data processing tasks. We will use PySpark, the Python interface to Spark, and will write several database-style analytics queries. Therefore, basic Python programming skills and knowledge of database programming, especially the SQL query language, will be very helpful for completing the home exercises.

The schedule for the home exercises was announced in the first lecture:

Exercise                                   Release Date   Due Date (23:59 EEST)
Introduction to Spark + RDD Programming    10.9           24.9
Dataframe Programming                      17.9           1.10
Machine Learning (MLlib)                   24.9           8.10
Graphframe Programming                     1.10           15.10
Structured Streaming                       8.10           22.10
Extras (Optional for extra points)         15.10          31.10

The Home Exercise System

To complete the home exercises, you will need to utilize the container-based Jupyter Notebook system. Detailed instructions can be found via the following link:

Home Exercise Instructions

The home exercises will be released according to the schedule outlined above, accessible through the following link:

Home Exercise Release

To submit your assignments, please use the submission box located at the bottom of this page. Please note that the submission box will only become visible once you are logged in.

Course Discord Channel

The course has a Discord channel where students can help each other. The lecturer and the course assistant will also periodically join the conversation. You can join the channel through the link:

https://study.cs.helsinki.fi/discord/join/bdp

Passing the Course

To pass the course, you need to complete the lecture quizzes by 31 October 2024 and the home exercises by their respective deadlines listed above. You need at least 50% of the home exercise points and at least 50% of the quiz points. Points from the Extras round count towards your home exercise points, but the home exercise contribution cannot exceed 100%. The percentages obtained from the home exercises and the quizzes are then combined, each with a 50% weight. The resulting total percentage determines the grade as follows: 90%-100%: grade 5, 80%-89%: grade 4, 70%-79%: grade 3, 60%-69%: grade 2, 50%-59%: grade 1.
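
As a rough illustration of the grading rules described above, the calculation could be sketched in Python as follows. This is not an official grade calculator. The function name course_grade and its parameters are hypothetical, and the sketch makes two assumptions that the text does not state: extra points are added before the 50% minimum is checked, and the total is compared against the grade boundaries without rounding.

```python
def course_grade(exercise_pct, quiz_pct, extra_pct=0.0):
    """Return a grade from 1 to 5, or 0 if the course is not passed.

    All arguments are percentages in the range 0-100 (hypothetical inputs).
    """
    # Extras round points count towards the home exercises,
    # but the home exercise contribution is capped at 100%.
    # Assumption: the extras are added before the minimum check.
    exercises = min(exercise_pct + extra_pct, 100.0)

    # At least 50% is required from both components.
    if exercises < 50.0 or quiz_pct < 50.0:
        return 0

    # Both components are weighted equally (50% each).
    total = 0.5 * exercises + 0.5 * quiz_pct

    # Assumption: no rounding; e.g. 89.5% is treated as grade 4.
    for threshold, grade in [(90, 5), (80, 4), (70, 3), (60, 2), (50, 1)]:
        if total >= threshold:
            return grade
    return 0


if __name__ == "__main__":
    # 95% exercises and 90% quizzes give a total of 92.5%.
    print(course_grade(95, 90))
    # Extras lift 90% to the 100% cap; the total is then 80%.
    print(course_grade(90, 60, extra_pct=20))
```

Under these assumptions, a student with 100% from the home exercises but only 40% from the quizzes would not pass, even though the weighted total would be 70%.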

Use of Large Language Models in the Course

Large language models (LLMs) have recently developed rapidly as versatile tools. While they have useful applications, they can also conflict with learning objectives. Permitted use cases are always dependent on the course.

General language models can produce incorrect, misleading, or irrelevant information. Therefore, it is the student's responsibility to ensure the accuracy and relevance of the information. It is also worth noting that specialized tools generally yield better results than language models.

Presenting generated text or code as one's own can be interpreted as plagiarism. More information can be found at the following link: https://studies.helsinki.fi/instructions/article/what-cheating-and-plagiarism

In this course, the use of language models is completely prohibited.