Data Engineering
Spring 2023
- Code: AI5308 / AI4005 (Data Engineering)
- Schedule: Tue/Thu 1:00pm-2:30pm
- Location: GIST College Building A, Room 227 (N4)
- Instructor: Sundong Kim
- TAs: Sanha Hwang, Sungkyu Yang, Hongyiel Suh
- Contact: Please ask all course-related questions on Discord, where announcements are also posted. Office hours are as follows.
- Sundong: Tue 2:30pm-3:30pm, Discord or GIST AI Graduate School (S7) Room 204
- Sungkyu: TBD, Discord or GIST AI Graduate School (S7) Room 202
- Hongyiel: TBD, Discord or GIST AI Graduate School (S7) Room 202
- Sanha: TBD, Discord or GIST AI Graduate School (S7) Room 204
- Virtual Classrooms:
- Google Classroom (Homework)
- Discord (Discussion and Q&A, Team Collaboration)
- Class Logistics: See this page
Course Overview
Machine learning systems are both complex and unique. They are complex because they consist of many different components and involve many different stakeholders, and they are unique because they’re data dependent, with data varying wildly from one use case to the next. In this course, you’ll learn how to conduct data engineering and take a holistic approach to designing ML systems that are reliable, scalable, maintainable, and adaptive to changing environments and business requirements.
Throughout the course, we will consider each design decision (how to process and create training data, which features to use, how often to retrain models, and what to monitor) in the context of how it can help your system as a whole achieve its objectives. The iterative framework will be explained through actual case studies.
Overall, this course will give you insight into how to tackle scenarios such as:
- Engineering data and choosing the right metrics to solve a business problem
- Automating the process for continually developing, evaluating, deploying, and updating models
- Developing responsible ML systems
Textbook & References
- [Main Book: DMLS]: Designing Machine Learning Systems (O’Reilly 2022)
- Book: Machine Learning Interviews
- Homepage: CS329S: Machine Learning Systems Design
- MLOps Discord Channel by Chip Huyen
Notice
- I have invited several speakers (from Naver, MakinaRocks, etc.) to the graduate school's AI Colloquium to give talks related to our course. Please mark the sessions below on your calendar and attend them. For those who cannot attend, we will record these lectures and share them with you.
- March 30: (Multimodal representation learning at Naver Shopping, Wonyoung Shin)
- April 6: (MLOps at Naver Shopping, Byeongjo Kim and Shengzhe Li)
- May 18: (Trustworthy federated learning at IBS, Sungwon Han)
- May 25: (MLOps and use cases at MakinaRocks, Youngsub Lim)
- For convenience, you may consider taking AI 5001 (AI Colloquium) together with this class. The colloquium is held every Thursday, 16:00-17:30, in the AI Studio (1F), AI Graduate School (S7).
Tentative Schedule
Below is the tentative schedule of the course. Overall, the course will follow the DMLS book, an up-to-date version of the online lecture notes, written by Chip Huyen.
Date | Description | Materials | |
---|---|---|---|
Feb 28 | Introduction; Understanding machine learning production | LN.1, DMLS Ch.1 | |
Mar 2 | Understanding machine learning production (cont’d) | LN.1, DMLS Ch.2, Booking.com | |
Mar 7 | Project announcement | Reference reports | |
Mar 9 | Data engineering and fundamentals | LN.2, DMLS Ch.3 | |
Mar 14 | Data engineering and fundamentals | LN.2, DMLS Ch.3 | |
Mar 16 | Training data | LN.3, DMLS Ch.4 | |
Mar 21 | Feature engineering | LN.4, DMLS Ch.5 | |
Mar 23 | Feature engineering | LN.4, DMLS Ch.5 | |
Mar 28 | Model selection, development and training | LN.5, DMLS Ch.6 | |
Mar 30 | Invited talk - e-CLIP model at Naver Shopping (16:00, AI Studio, S7) | Video | |
Apr 4 | Model selection, development and training | LN.5, DMLS Ch.6 | |
Apr 6 | Invited talk - MLOps at Naver Shopping (16:00, AI Studio, S7) | Video | |
Apr 11 | Model development and offline evaluation | LN.6, DMLS Ch.6 | |
Apr 13 | Model development and offline evaluation | LN.6, DMLS Ch.6 | |
Apr 18 | No Lecture (Midterm Period) | ||
Apr 20 | No Lecture (Midterm Period) | ||
Apr 25 | Deployment | LN.8, DMLS Ch.7 | |
Apr 27 | Deployment | LN.8, DMLS Ch.7 | |
May 2 | Project mid-review (5 min presentation) | ||
May 4 | Data distribution shifts and monitoring | LN.10, DMLS Ch.8 | |
May 9 | Data distribution shifts and monitoring | LN.10, DMLS Ch.8 | |
May 11 | Continual learning and test in production | DMLS Ch.9 | |
May 16 | Continual learning and test in production | DMLS Ch.9 | |
May 18 | Invited talk - Trustworthy federated learning (16:00, AI Studio, S7) | Video, DMLS Ch.11 | |
May 23 | Human side of machine learning | DMLS Ch.11 | |
May 25 | Invited talk - MLOps at MakinaRocks (16:00, AI Studio, S7) | Video | |
May 30 | Infrastructure and tooling for MLOps | DMLS Ch.10 | |
Jun 1 | Infrastructure and tooling for MLOps | DMLS Ch.10 | |
Jun 6 | No Lecture (National Holiday) | ||
Jun 8 | Project demo day (Poster & demo booth) | Reference presentation | |
Jun 14 | No Lecture (Finals week) | ||
Jun 16 | No Lecture (Finals week) | Team report (due: Jun 16), Distill-style sample |