AI5308 / AI4005 (Data Engineering)

Spring 2024




Notice


Course Outline

  • Code: AI5308 / AI4005 (Data Engineering)
  • Schedule: Tue/Thu 1:00pm-2:30pm
  • Location: GIST College Building A, Room 115 (N4)

Machine learning systems are both complex and unique. They are complex because they consist of many different components and involve many different stakeholders, and they are unique because they are data dependent, with data varying wildly from one use case to the next. In this course, students will learn how to conduct data engineering and a holistic approach to designing ML systems that are reliable, scalable, maintainable, and adaptive to changing environments and business requirements.

Prerequisites: There are no official course prerequisites. However, the final project requires building machine learning applications, so basic knowledge of machine learning and some fluency in programming are recommended. This is not a course about learning fancy algorithms; instead, you will learn how to apply your machine learning knowledge in a systematic way. Web programming skills are a plus, but not required. Taking this course alongside a project-based AI course such as AI4028 offers some synergy.


Team

  • Instructor:
    • Sundong Kim - After class (Tue/Thu 14:15-14:45, N4 #115)
  • TAs:
    • Mintaek Lim - Office Hour (Tue 16:00-17:00, AI Graduate School (S7) 1st Floor)
    • Yoonjae Kim - Before class (Tue/Thu 12:30-13:00, N4 #115)
    • Sungho Bae - Office Hour (Tue 17:00-18:00, Renewable Energy Research Building (C10), Infinite Challenge Space, #215)

The course schedule and all resources (e.g., lecture slides, discussion worksheets) will be posted on the course website. All class discussions, announcements, and other communication will take place via Ed Discussion. If you need to contact the course staff privately, please post a private question on Ed. Please avoid contacting the professor or TAs by email.


LMS

  • Instead of the GIST LMS, we will use the Ed Discussion forum and Gradescope.
  • Register on both with your name and GIST e-mail address (e.g., Sundong Kim, sundong@gist.ac.kr or sundong@gm.gist.ac.kr).
  • Access the AI5308 Ed Discussion forum. If you haven’t already been added to the class, use this invitation link.
  • Submit your homework on the AI5308 Gradescope. The entry code is EJGBD4.

Books


Syllabus

Below is the course syllabus. (The schedule is tentative and subject to change.)

Date | Description | Readings | Homeworks/Projects
2/27 | Course Overview | |
2/29 | Overview of ML Systems | DMLS Ch.1 |
3/5 | Introduction to ML Systems Design | DMLS Ch.2 | Survey (Due: Mar 7)
3/7 | Experience from Naver Shopping (Video) | Blog, Paper |
3/12 | Data Engineering 101 | DMLS Ch.3 |
3/14 | Training Data | DMLS Ch.4 |
3/19 | Lab1: MongoDB Tutorial (TA: Mintaek) | Colab |
3/21 | Training Data + Project Description | | HW1 (Due: Mar 24)
3/26 | Pop Quiz 1 | |
3/28 | Lab2: Prototyping Tutorial (TA: Yoonjae) | | Proposal (Due: Mar 31)
4/2 | Feature Engineering | DMLS Ch.5 |
4/4 | Feature Engineering | |
4/9 | Model Development and Offline Evaluation | DMLS Ch.6 |
4/11 | Lab3: Monitoring Tutorial (TA: Sungho) | | HW2 (Due: Apr 14)
4/16 | Midterm week (No lecture) | |
4/18 | Midterm week (No lecture) | |
4/23 | Invited Talk (Doyoon Song, Toggle) | |
4/25 | Data Distribution Shifts and Monitoring | DMLS Ch.8 | Demo Video (Due: Apr 28)
4/30 | MVP Demo (Watch Video + Discussion) | |
5/2 | Continual Learning and Test in Production | DMLS Ch.9 |
5/7 | Invited Talk (Sewon Kim, NCSoft) | |
5/9 | Human Side of ML | DMLS Ch.11 | HW3 (Due: May 12)
5/14 | Invited Talk (Geonwoo Cho, GIST EECS) | |
5/16 | Infrastructure and Tooling for MLOps (with TAs) | DMLS Ch.10 |
5/21 | Project Review (Bring your demo and discuss with each other) | |
5/23 | TBD: LLM Training Experience (Video) | | 50% Draft (Due: May 26)
5/28 | TBD: Responsible AI (Video) | |
5/30 | TBD: MLOps from Industry (Video) | |
6/4 | Rehearse & Commentary (AI Building 1F, 1:00-2:30pm) | |
6/6 | National Holiday (No lecture) | | Final report (Due: Jun 9)
6/11 | Demo day (AI Building 1F, 1:00-4:00pm) | |

Homeworks

You will submit three homeworks during the course. Submit your work on the AI5308 Gradescope. To facilitate grading, you must mark which page corresponds to each question when you submit your homework. See this video for a Gradescope tutorial.


Project

  • Details can be found here
    • The focus is on real-world applications (demos) of any machine learning topic.
    • There will be three sessions to review your progress.
    • At the end of the semester, the class will host a convention-style demo day with an interactive booth experience. Each team will present their application and prepare a report.

Grading

You will earn an A if (but not only if) your score is at least \(80\times(1-\epsilon_1)\), a B if your score is at least \(60\times(1-\epsilon_2)\), and a C if your score is at least \(40\times(1-\epsilon_3)\), for some \(\epsilon_i \geq 0\) to be determined later. All participants in the course (undergraduate and graduate students) are evaluated equally.

  • Homeworks (45%; 15% each)
  • In-class activity (10%; pop quizzes, invited talks, etc.)
  • Project (45%; project proposal, MVP video, demo day, report, peer review)
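To make the weighting concrete, here is a minimal Python sketch of how a total score and a guaranteed letter grade could be computed from the rules above. It assumes every component is scored on a 0-100 scale, which the syllabus does not state. The function names and the default \(\epsilon_i = 0\) are hypothetical, since the actual values are to be determined later. Because the thresholds are sufficient but not necessary conditions, the sketch only reports the grade a score guarantees.

```python
# Weights taken from the list above; the 0-100 scale per component is an assumption.
HOMEWORK_WEIGHT_EACH = 0.15   # three homeworks, 45% in total
ACTIVITY_WEIGHT = 0.10
PROJECT_WEIGHT = 0.45

def total_score(homeworks, activity, project):
    """Combine component scores (each assumed to be 0-100) into a 0-100 total."""
    if len(homeworks) != 3:
        raise ValueError("expected exactly three homework scores")
    homework_part = sum(HOMEWORK_WEIGHT_EACH * h for h in homeworks)
    return homework_part + ACTIVITY_WEIGHT * activity + PROJECT_WEIGHT * project

def guaranteed_letter(score, eps=(0.0, 0.0, 0.0)):
    """Return the letter grade guaranteed by the thresholds, or None.

    eps holds hypothetical (eps_1, eps_2, eps_3); the real values are set later.
    None means no grade is guaranteed by these rules, not that the grade is an F.
    """
    for letter, base, e in zip("ABC", (80, 60, 40), eps):
        if score >= base * (1 - e):
            return letter
    return None

score = total_score([90, 80, 70], activity=100, project=85)
print(round(score, 2), guaranteed_letter(score))
```

With \(\epsilon_i = 0\) the thresholds are exactly 80, 60, and 40; any positive \(\epsilon_i\) only lowers them.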

Late Policy

The policy is simple: there are no slip days. Late homework submissions are penalized increasingly as follows: within 24 hours, you lose 10%; within 48 hours, you lose 20%; within 72 hours, you lose 40%. Homework more than three days late is no longer accepted. Note that the same penalty scheme applies to project deadlines.
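The penalty tiers can be written as a small Python function. This is only a sketch under two assumptions not stated in the policy: the percentage is deducted from the score you earned on that submission, and each tier boundary (exactly 24, 48, or 72 hours) is inclusive. The function names are hypothetical.

```python
def late_penalty(hours_late):
    """Return the fraction of the score lost for a submission hours_late hours late.

    A return value of 1.0 means the submission is no longer accepted.
    Inclusive tier boundaries are an assumption, not part of the stated policy.
    """
    if hours_late <= 0:
        return 0.0   # on time
    if hours_late <= 24:
        return 0.10
    if hours_late <= 48:
        return 0.20
    if hours_late <= 72:
        return 0.40
    return 1.0       # more than three days late

def penalized_score(raw_score, hours_late):
    """Apply the late penalty to the score earned (assumed interpretation)."""
    return raw_score * (1.0 - late_penalty(hours_late))

# Example: a submission 30 hours late falls into the 20% tier.
print(penalized_score(90, 30))
```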


Honor Code

Study groups are allowed. It is also OK to get clarification (but not solutions) from books or online resources, but only after you have thought about the problems on your own. However, we expect students to understand and complete their own homeworks. Each student must write down the solution independently and hand in one homework per student, which means you write your solution after closing the book and all your notes, without help from your colleagues. If you studied together as a group, please cite your collaborators fully and completely (e.g., “Junho explained to me what is asked in Question 2.1”). When in doubt about collaboration details, please ask us on Ed Discussion.

If elements of two homeworks are determined to be clearly very similar (i.e., we believe that they were done together or that one was copied from the other), then the course grade for all students involved in the incident will be reduced by one letter grade for the first offense, and to an F for the second offense. (“All” means both those who copied and those who were copied from.) The grade for that homework will also be reduced to 0. More serious cases of cheating (e.g., cheating on exams) will lead to severe consequences ranging from a grade of “F” in the class to suspension from the University.


Frequently Asked Questions

  • How difficult is the course? The materials are not difficult to understand, but the homeworks and final projects are fairly involved. We wouldn’t recommend taking the course unless you’re ready to build things and learn from hands-on experience!

  • Is attendance mandatory? We won’t be taking attendance, but we expect to see you often in class. We love talking to students to understand how you are doing, make sure you get the most out of the class, and get your feedback to improve the materials. We may have one or two unannounced pop quizzes.

  • What is the format of the class? It will consist of lectures, tutorials, and discussions. I will invite industry experts for special lectures.

  • I don’t have a team for the final project, can I still enroll? Yes. Most students don’t have a team when they join the course. We’ll help you find teammates.

  • How mature is the course? This is the second time the course is offered at GIST. Most of the materials are from Chip Huyen’s CS329S course. I still have a long way to go in teaching this course. We’re trying our best to ensure the quality of the lectures, but they might not be as polished as other courses. Your feedback will be greatly appreciated.

  • Do I need to know Python for the course? Since Python has become the most popular language for machine learning, we expect most tutorials will be in Python. Python fluency isn’t required, but will make your life so much easier during the course.

  • I have a question about the class. What is the best way to reach the course staff? Please post your question on the Ed Discussion forum so that other students can benefit from your questions. If you have a personal matter or an emergency, please send a private message to me and the TAs.

