Homework 2 ✍🏻
Back to the Data Engineering Course - 2024 Spring (AI5308/AI4005)
The second homework consists of a paper critique and a design problem with a programming component. It is due on Sunday, Apr 14, at 11:59 PM.
- Deliverables:
- Submit a PDF of your homework, including all your code, to the Gradescope assignment entitled “HW2”. You may typeset your homework in LaTeX or Word, or submit neatly handwritten and scanned solutions. Please start each question on a new page. Include any graphs within the relevant sections. Each solution should be self-contained on its own page.
- Please list the names of students who helped you or those whom you helped with the homework. Note that exchanging code is not allowed.
- To facilitate grading, you must assign the corresponding pages to each question when you submit your homework. See this video for a Gradescope tutorial.
- Contents:
- Paper Critique: Choose one paper from the list below and write a critique following these guidelines.
- Actions Speak Louder than Goals: Valuing Player Actions in Soccer
- Optimizing Airbnb Search Journey with Multi-task Learning
- Thinking out of the Package: Recommending Package Types for E-Commerce Shipments
- e-CLIP: Large-Scale Vision-Language Representation Learning in E-commerce
- A Unified Approach to Interpreting Model Predictions
- Design Problem: Designing a Model to Predict Taxi Trip Duration
- Objective: Imagine you are tasked with developing a predictive model to estimate the duration of taxi trips based on the pickup and drop-off locations. This assignment invites you to explore and identify the features that would be most influential in predicting trip times accurately.
- Background: Real-world routing is complex, and the straightforward distance between two points rarely reflects the true time a trip might take. Factors such as road networks, traffic conditions, and various environmental variables can significantly alter trip durations.
- Guidelines: Consider the following aspects to guide your feature selection and model design:
- Understanding Distance: Recognize that the geometric distance between two points (as the crow flies) is different from the travel distance between two locations on a map. The latter is affected by the available routes, road types, and traffic patterns.
- Spatial Division with Hexagonal Grids: Platforms like Kakao T divide their maps into numerous hexagons to manage and analyze geographic data more effectively. Consider how this approach could influence feature engineering and the granularity of data analysis for trip duration prediction.
- Influence of Events and Environment: Identify events and environmental factors that could impact trip durations. This may include weather conditions, time of day, local events, road closures, and construction activities. Discuss how these factors can be quantified and incorporated into your model.
- Database Features: Reflect on the types of features that can be extracted from a database. This might include historical trip data, traffic patterns, driver performance metrics, and more. Specify which features would be most valuable for your predictive model and why.
- Real-time Computation: Determine which features need to be computed in (near) real-time to ensure the accuracy of your model’s predictions. Consider the technical and computational challenges associated with real-time data analysis and how you would address them.
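To make the guidelines above more concrete, here is a minimal sketch of a few candidate features: the straight-line (haversine) distance, a coarse spatial cell, and a cyclical encoding of the hour of day. It is only an illustration under stated assumptions. The function names, the `cell_deg` parameter, and the square grid are all hypothetical; in particular, the square grid is a simple stand-in for a hexagonal grid, not a hexagonal index.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (approximation)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle ("as the crow flies") distance between two points, in km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def grid_cell(lat, lon, cell_deg=0.01):
    """Map a coordinate to a square grid cell (stand-in for a hexagonal cell)."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def cyclical_hour(hour):
    """Encode hour of day (0-23) on a circle so 23:00 and 00:00 end up close."""
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)


def trip_features(pickup, dropoff, hour, is_raining=False):
    """Build a feature dict from (lat, lon) pairs, hour of day, and a weather flag."""
    hour_sin, hour_cos = cyclical_hour(hour)
    pickup_cell = grid_cell(*pickup)
    dropoff_cell = grid_cell(*dropoff)
    return {
        "straight_line_km": haversine_km(*pickup, *dropoff),
        "pickup_cell": pickup_cell,
        "dropoff_cell": dropoff_cell,
        "same_cell": pickup_cell == dropoff_cell,
        "hour_sin": hour_sin,
        "hour_cos": hour_cos,
        "is_raining": int(is_raining),
    }
```

The straight-line distance is only a lower bound on the travel distance, so in practice it would likely be combined with route-based or historical features keyed by the pickup and drop-off cells.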
- Submission: You are required to write a 1-2 page document discussing the design problem. Your write-up should cover the following:
- Your approach to feature selection and why you consider the selected features critical for predicting taxi trip durations.
- How you plan to tackle the challenges associated with factors like spatial data management, event and environmental impacts, and real-time data analysis.
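For the real-time aspect of the design problem, one possible building block is a sliding-window aggregate, for example the mean observed speed in a map cell over the last few minutes. The sketch below is an assumption-laden illustration, not a prescribed solution: the class name `RollingMean` and its parameters are hypothetical, and it assumes observations arrive in non-decreasing timestamp order.

```python
from collections import deque


class RollingMean:
    """Mean of the values observed during the last `window_s` seconds."""

    def __init__(self, window_s):
        self.window_s = window_s
        self._events = deque()  # (timestamp_s, value), oldest first
        self._total = 0.0       # running sum of the values currently in the window

    def _evict(self, now_s):
        # Drop observations that have fallen out of the window.
        while self._events and self._events[0][0] <= now_s - self.window_s:
            _, old_value = self._events.popleft()
            self._total -= old_value

    def add(self, timestamp_s, value):
        """Record a new observation (timestamps assumed non-decreasing)."""
        self._events.append((timestamp_s, value))
        self._total += value
        self._evict(timestamp_s)

    def mean(self, now_s):
        """Return the windowed mean at time `now_s`, or None if the window is empty."""
        self._evict(now_s)
        if not self._events:
            return None
        return self._total / len(self._events)


# Example: a 10-minute window of observed speeds (km/h) for one map cell.
speeds = RollingMean(window_s=600)
speeds.add(0, 30.0)
speeds.add(300, 50.0)
current_mean = speeds.mean(300)  # both observations are still inside the window
```

Each update and query runs in amortized constant time, which is one reason this pattern is often considered for near-real-time features. A production system would also need to handle out-of-order events and keep one window per cell.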
- Programming: Your First Application
- Objective: Create a toy application and host it. The frontend should enable users to input queries, and the backend will use this information to make a prediction using a model. The predicted value should then be displayed on the frontend. The primary focus of this assignment is on model development and the integration of various system components, rather than on web development itself.
- Hosting Options: You are encouraged to use Streamlit for hosting your application. Check this 7-min video. Alternatively, you can build upon the Colab code from Lab2: Prototyping Tutorial, which covered the basics of React, FastAPI, and ngrok. You may also utilize the GPU server I provided.
- Submission: After hosting your app, create a short video clip (up to 30 seconds) demonstrating its functionality and submit the shareable link of your video.
- Datasets: You are free to use any datasets.
- Examples include the MLB dataset from HW1, the apartment price dataset, or the Titanic dataset from Kaggle.
- For a spatio-temporal recommender, you can use the LocEmb dataset.
- Application Example: If continuing from the previous question (taxi trip), your application might look like the following:
- MVP that I made (Video)
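As a rough starting point for the programming part, the sketch below wires a toy prediction function to a small Streamlit frontend. It is a hedged illustration only: `predict_duration_min` and its coefficients are hypothetical placeholders that are not fitted to any data, and you would replace them with your own trained model. Streamlit is a third-party package and must be installed separately.

```python
def predict_duration_min(distance_km, hour):
    """Toy stand-in for a trained model: estimated trip duration in minutes."""
    # Hypothetical coefficients for illustration only (not fitted to data).
    base_min = 3.0
    min_per_km = 2.5
    rush_hour = 7 <= hour <= 9 or 17 <= hour <= 19
    rush_factor = 1.4 if rush_hour else 1.0
    return (base_min + min_per_km * distance_km) * rush_factor


def main():
    try:
        import streamlit as st  # third-party: pip install streamlit
    except ImportError:
        print("Streamlit is not installed; run `pip install streamlit` first.")
        return

    # Frontend: collect the user's query.
    st.title("Taxi trip duration (toy demo)")
    distance_km = st.number_input("Trip distance (km)", min_value=0.0, value=5.0)
    hour = st.slider("Hour of day", 0, 23, 18)

    # Backend: run the model and display the prediction.
    if st.button("Predict"):
        minutes = predict_duration_min(distance_km, hour)
        st.write(f"Estimated duration: {minutes:.1f} minutes")


if __name__ == "__main__":
    main()
```

If this were saved as `app.py` (a hypothetical filename), it could be launched locally with `streamlit run app.py`. Keeping the prediction logic in its own function makes it easier to swap in a real model later or to call the same function from a FastAPI backend.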
- Grading Criteria:
- Paper Critique (5%):
- Check Plus (5%) - The critique is very well written and very insightful.
- Check (3-4%) - Adequate. Most critiques are expected to fall into this category.
- Check Minus (2%) - The critique lacks depth. Summaries may be vague, strengths/weaknesses trivial, questions superficial, and discussions shallow.
- No submission (0%)
- Design Problem (5%):
- Check Plus (5%) - The idea is very well-developed and insightful.
- Check (3-4%) - Adequate. Most submissions are expected to fall into this category.
- Check Minus (2%) - The idea lacks depth.
- No submission (0%)
- Programming (5%):
- Check Plus (5%) - The demo is well-developed and interesting.
- Check (3-4%) - Adequate. Most submissions are expected to fall into this category.
- Check Minus (2%) - Unable to produce a demo.
- No submission (0%)
- Late submissions will be graded according to the late policy.