DoorDash Delivery Duration Prediction

Author

Divraj Singh

Executive Summary

This project builds an end-to-end machine learning pipeline to predict delivery duration using structured SQL-based feature engineering and gradient boosting.

The workflow includes:

  • SQL data cleaning and transformation using DuckDB
  • Feature engineering focused on marketplace congestion
  • A time-based train/test split to prevent leakage
  • A linear regression baseline
  • Elastic Net regularization
  • XGBoost gradient boosting
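As a sketch of the time-based split step, the idea is to train only on orders placed before a time cutoff. The column names (`created_at`, `duration`) and the simulated data below are illustrative, not the project's actual schema:

```r
# Illustrative leakage-free split: train on the earliest 80% of orders,
# test on the most recent 20% (simulated timestamps)
set.seed(42)
deliveries <- data.frame(
  created_at = as.POSIXct("2015-01-01", tz = "UTC") + sort(runif(100, 0, 30 * 86400)),
  duration   = rnorm(100, mean = 2800, sd = 600)
)

cutoff <- quantile(as.numeric(deliveries$created_at), 0.8)  # 80th time percentile
train  <- subset(deliveries, as.numeric(created_at) <  cutoff)
test   <- subset(deliveries, as.numeric(created_at) >= cutoff)
```

Because every test observation occurs after every training observation, no future information leaks into training.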

XGBoost achieved a ~7% MAE improvement over the linear baseline, consistent with nonlinear congestion effects in delivery dynamics.
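The headline figure can be verified directly from the model comparison metrics reported below:

```r
# Relative MAE improvement of XGBoost over the linear baseline
mae_linear <- 675.6293
mae_xgb    <- 628.3854

improvement <- (mae_linear - mae_xgb) / mae_linear
round(100 * improvement, 1)  # 7.0
```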


Model Performance Comparison

Code
library(ggplot2)

# Load saved performance metrics
model_results <- read.csv("outputs/model_results.csv")

# Display table
model_results
        Model      MAE     RMSE
1      Linear 675.6293 902.6515
2 Elastic Net 675.6039 902.7823
3     XGBoost 628.3854 853.6275
Code
# MAE comparison
ggplot(model_results, aes(x = Model, y = MAE, fill = Model)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(title = "MAE Comparison Across Models",
       y = "Mean Absolute Error (seconds)") +
  theme(legend.position = "none")

Code
# RMSE comparison
ggplot(model_results, aes(x = Model, y = RMSE, fill = Model)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(title = "RMSE Comparison Across Models",
       y = "Root Mean Squared Error (seconds)") +
  theme(legend.position = "none")

Observations

  • XGBoost achieves the lowest MAE and RMSE among the three models.
  • The ~7% MAE reduction over the linear baseline indicates nonlinear structure in the data.
  • Elastic Net performs almost identically to Linear Regression, suggesting multicollinearity was not a major source of instability.
  • The relatively modest gain from boosting suggests the engineered features already capture much of the signal in linear form.

XGBoost Feature Importance

Code
# Load saved importance
importance <- read.csv("outputs/feature_importance.csv")

head(importance, 10)
                                        Feature       Gain      Cover  Frequency
1                             orders_per_dasher 0.25437878 0.09919388 0.08346923
2  estimated_store_to_consumer_driving_duration 0.13364868 0.07146625 0.08047392
3                                    order_hour 0.10572304 0.03868333 0.04626086
4                                      subtotal 0.08093556 0.06174953 0.07436683
5                                      store_id 0.06205899 0.16415671 0.11803175
6                         total_onshift_dashers 0.03654983 0.03189322 0.04973874
7                            total_busy_dashers 0.03315585 0.04148257 0.04571172
8                                    busy_ratio 0.03002530 0.02011756 0.03507838
9                                     order_dow 0.02913948 0.02102457 0.03343096
10                     total_outstanding_orders 0.02405057 0.03840133 0.04596133
Code
# Plot top 15 features
top15 <- importance[1:15, ]

ggplot(top15, aes(x = reorder(Feature, Gain), y = Gain)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Top 15 Feature Importance (XGBoost)",
       x = "Feature",
       y = "Gain")

Observations

  • orders_per_dasher is the dominant predictor, confirming that marketplace congestion (supply-demand imbalance) drives delivery delays more than raw order volume.
  • estimated_store_to_consumer_driving_duration remains a structural driver, as expected in any last-mile delivery problem.
  • Time-based features such as order_hour indicate meaningful time-of-day effects.
  • The presence of store_id suggests consistent store-level preparation time differences, which tree models can capture but may introduce generalization risk.
  • Overall, importance rankings align strongly with domain intuition, increasing confidence in model validity.
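The congestion features ranked above can be derived in DuckDB with a query along these lines. The inline VALUES rows are toy data standing in for the raw marketplace columns, and `NULLIF` guards the zero-dasher case; the real pipeline's table and exact SQL may differ:

```r
library(duckdb)  # attaches DBI

con <- dbConnect(duckdb())

# Toy rows: (total_onshift_dashers, total_busy_dashers, total_outstanding_orders)
dbExecute(con, "
  CREATE TABLE deliveries AS
  SELECT * FROM (VALUES (10, 8, 25), (4, 4, 12), (0, 0, 3))
    AS t(total_onshift_dashers, total_busy_dashers, total_outstanding_orders)
")

features <- dbGetQuery(con, "
  SELECT
    total_outstanding_orders * 1.0
      / NULLIF(total_onshift_dashers, 0) AS orders_per_dasher,
    total_busy_dashers * 1.0
      / NULLIF(total_onshift_dashers, 0) AS busy_ratio
  FROM deliveries
  ORDER BY total_onshift_dashers DESC
")

dbDisconnect(con, shutdown = TRUE)
```

The `NULLIF` idiom returns NULL (NA in R) rather than dividing by zero when no dashers are on shift.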

Predicted vs Actual (XGBoost)

Code
library(ggplot2)

predictions <- read.csv("outputs/test_predictions.csv")

ggplot(predictions, aes(x = actual, y = predicted)) +
  geom_point(alpha = 0.2) +
  geom_abline(slope = 1, intercept = 0, color = "red", linewidth = 1) +
  theme_minimal() +
  labs(
    title = "Predicted vs Actual Delivery Time (XGBoost)",
    x = "Actual Delivery Duration (seconds)",
    y = "Predicted Delivery Duration (seconds)"
  )

Observations

  • Predictions cluster tightly around the diagonal at lower delivery durations, indicating strong performance for typical orders.
  • Dispersion widens as actual delivery time increases, so prediction errors grow for longer deliveries.
  • The model systematically underestimates very long deliveries, likely because outliers were trimmed and extreme cases carry little training signal.
  • Overall, the model captures the central tendency well but is less precise in the upper tail.
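One way to quantify this tail behaviour is to compute MAE within actual-duration quartiles. The sketch below simulates predictions whose error grows with duration; with the saved predictions, `actual` and `predicted` would come from the `test_predictions.csv` columns instead:

```r
# MAE by actual-duration quartile (simulated data for illustration)
set.seed(1)
actual    <- rexp(1000, rate = 1 / 2800)                   # delivery durations (s)
predicted <- actual + rnorm(1000, 0, 200 + 0.15 * actual)  # error widens with duration

bucket   <- cut(actual, quantile(actual, seq(0, 1, 0.25)), include.lowest = TRUE)
mae_by_q <- tapply(abs(actual - predicted), bucket, mean)
mae_by_q  # fourth-quartile MAE clearly exceeds first-quartile MAE
```

A flat profile across quartiles would indicate homoscedastic errors; a rising profile, as seen in the scatter plot, pinpoints where the model loses precision.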