Data Engineering

Spring 2023

  • Code: AI5308 / AI4005 (Data Engineering)
  • Schedule: Tue/Thu 1:00pm-2:30pm
  • Location: GIST College Building A, Room 227 (N4)
  • Instructor: Sundong Kim
  • TAs: Sanha Hwang, Sungkyu Yang, Hongyiel Suh
  • Contact: Students are encouraged to ask all course-related questions on Discord, where announcements are also posted. Office hours are as follows.
    • Sundong: Tue 2:30pm-3:30pm, Discord or GIST AI Graduate School (S7) Room 204
    • Sungkyu: TBD, Discord or GIST AI Graduate School (S7) Room 202
    • Hongyiel: TBD, Discord or GIST AI Graduate School (S7) Room 202
    • Sanha: TBD, Discord or GIST AI Graduate School (S7) Room 204
  • Virtual Classrooms:
  • Class Logistics: See this page

Course Overview

Machine learning systems are both complex and unique. They are complex because they consist of many different components and involve many different stakeholders, and they are unique because they're data dependent, with data varying wildly from one use case to the next. In this course, you'll learn how to conduct data engineering and how to take a holistic approach to designing ML systems that are reliable, scalable, maintainable, and adaptive to changing environments and business requirements.

Throughout the course, we will consider each design decision (how to process and create training data, which features to use, how often to retrain models, and what to monitor) in the context of how it can help your system as a whole achieve its objectives. The iterative framework will be explained through actual case studies.

Overall, this course will give you insight into how to tackle scenarios such as:

  • Engineering data and choosing the right metrics to solve a business problem
  • Automating the process for continually developing, evaluating, deploying, and updating models
  • Developing responsible ML systems

Textbook & References


Notice

  • I have invited several speakers (from Naver, MakinaRocks, etc.) to the AI Graduate School colloquium to give talks related to our course. Please mark the sessions below on your calendar and attend them. For those who cannot attend, we will record the talks and share them with you.
  • For convenience, consider taking AI 5001 (AI Colloquium) together with this class. The colloquium is held every Thursday, 16:00-17:30, in the AI Studio (1F), AI Graduate School (S7).

Tentative Schedule

Below is the tentative schedule of the course. Overall, the course will follow the DMLS book, written by Chip Huyen, which is an up-to-date version of her online lecture notes.

Date | Description | Materials
Feb 28 | Introduction; Understanding machine learning production | LN.1, DMLS Ch.1
Mar 2 | Understanding machine learning production (cont’d) | LN.1, DMLS Ch.2, Booking.com
Mar 7 | Project announcement | Reference reports
Mar 9 | Data engineering and fundamentals | LN.2, DMLS Ch.3
Mar 14 | Data engineering and fundamentals | LN.2, DMLS Ch.3
Mar 16 | Training data | LN.3, DMLS Ch.4
Mar 21 | Feature engineering | LN.4, DMLS Ch.5
Mar 23 | Feature engineering | LN.4, DMLS Ch.5
Mar 28 | Model selection, development and training | LN.5, DMLS Ch.6
Mar 30 | Invited talk - e-CLIP model at Naver Shopping (16:00, AI Studio, S7) | Video
Apr 4 | Model selection, development and training | LN.5, DMLS Ch.6
Apr 6 | Invited talk - MLOps at Naver Shopping (16:00, AI Studio, S7) | Video
Apr 11 | Model development and offline evaluation | LN.6, DMLS Ch.6
Apr 13 | Model development and offline evaluation | LN.6, DMLS Ch.6
Apr 18 | No Lecture (Midterm Period) |
Apr 20 | No Lecture (Midterm Period) |
Apr 25 | Deployment | LN.8, DMLS Ch.7
Apr 27 | Deployment | LN.8, DMLS Ch.7
May 2 | Project mid-review (5 min presentation) |
May 4 | Data distribution shifts and monitoring | LN.10, DMLS Ch.8
May 9 | Data distribution shifts and monitoring | LN.10, DMLS Ch.8
May 11 | Continual learning and test in production | DMLS Ch.9
May 16 | Continual learning and test in production | DMLS Ch.9
May 18 | Invited talk - Trustworthy federated learning (16:00, AI Studio, S7) | Video, DMLS Ch.11
May 23 | Human side of machine learning | DMLS Ch.11
May 25 | Invited talk - MLOps at MakinaRocks (16:00, AI Studio, S7) | Video
May 30 | Infrastructure and tooling for MLOps | DMLS Ch.10
Jun 1 | Infrastructure and tooling for MLOps | DMLS Ch.10
Jun 6 | No Lecture (National Holiday) |
Jun 8 | Project demo day (Poster & demo booth) | Reference presentation
Jun 14 | No Lecture (Finals week) |
Jun 16 | No Lecture (Finals week) | Team report (due: Jun 16), Distill-style sample