DoorDash Delivery Duration Prediction

Author

Divraj Singh

Executive Summary

This project builds an end-to-end machine learning pipeline to predict delivery duration using structured SQL-based feature engineering and gradient boosting.

The workflow includes:

  • SQL data cleaning and transformation using DuckDB
  • Feature engineering focused on marketplace congestion
  • A time-based train/test split to prevent leakage
  • A linear regression baseline
  • Elastic Net regularization
  • XGBoost gradient boosting
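As a sketch of the time-based split step, the idea is to train only on orders placed before a time cutoff. The column names (`created_at`, `duration`) and the simulated data below are illustrative, not the project's actual schema:

```r
# Illustrative leakage-free split: train on the earliest 80% of orders,
# test on the most recent 20% (simulated timestamps)
set.seed(42)
deliveries <- data.frame(
  created_at = as.POSIXct("2015-01-01", tz = "UTC") + sort(runif(100, 0, 30 * 86400)),
  duration   = rnorm(100, mean = 2800, sd = 600)
)

cutoff <- quantile(as.numeric(deliveries$created_at), 0.8)  # 80th time percentile
train  <- subset(deliveries, as.numeric(created_at) <  cutoff)
test   <- subset(deliveries, as.numeric(created_at) >= cutoff)
```

Because every test observation occurs after every training observation, no future information leaks into training.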

XGBoost achieved a ~7% MAE improvement over the linear baseline, consistent with nonlinear congestion effects in delivery dynamics.
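The headline figure can be verified directly from the model comparison metrics reported below:

```r
# Relative MAE improvement of XGBoost over the linear baseline
mae_linear <- 675.6293
mae_xgb    <- 628.3854

improvement <- (mae_linear - mae_xgb) / mae_linear
round(100 * improvement, 1)  # 7.0
```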


Model Performance Comparison

Code
library(ggplot2)

# Load saved performance metrics
model_results <- read.csv("outputs/model_results.csv")

# Display table
model_results
        Model      MAE     RMSE
1      Linear 675.6293 902.6515
2 Elastic Net 675.6039 902.7823
3     XGBoost 628.3854 853.6275
Code
# MAE comparison
ggplot(model_results, aes(x = Model, y = MAE, fill = Model)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(title = "MAE Comparison Across Models",
       y = "Mean Absolute Error (seconds)") +
  theme(legend.position = "none")

Code
# RMSE comparison
ggplot(model_results, aes(x = Model, y = RMSE, fill = Model)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(title = "RMSE Comparison Across Models",
       y = "Root Mean Squared Error (seconds)") +
  theme(legend.position = "none")

Observations

  • XGBoost achieves the lowest MAE and RMSE among the three models.
  • The ~7% MAE reduction over the linear baseline indicates nonlinear structure in the data.
  • Elastic Net performs almost identically to Linear Regression, suggesting multicollinearity was not a major source of instability.
  • The relatively modest gain from boosting suggests the engineered features already capture much of the signal in linear form.

XGBoost Feature Importance

Code
# Load saved importance
importance <- read.csv("outputs/feature_importance.csv")

head(importance, 10)
                                        Feature       Gain      Cover  Frequency
1                             orders_per_dasher 0.25437878 0.09919388 0.08346923
2  estimated_store_to_consumer_driving_duration 0.13364868 0.07146625 0.08047392
3                                    order_hour 0.10572304 0.03868333 0.04626086
4                                      subtotal 0.08093556 0.06174953 0.07436683
5                                      store_id 0.06205899 0.16415671 0.11803175
6                         total_onshift_dashers 0.03654983 0.03189322 0.04973874
7                            total_busy_dashers 0.03315585 0.04148257 0.04571172
8                                    busy_ratio 0.03002530 0.02011756 0.03507838
9                                     order_dow 0.02913948 0.02102457 0.03343096
10                     total_outstanding_orders 0.02405057 0.03840133 0.04596133
Code
# Plot top 15 features
top15 <- importance[1:15, ]

ggplot(top15, aes(x = reorder(Feature, Gain), y = Gain)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Top 15 Feature Importance (XGBoost)",
       x = "Feature",
       y = "Gain")

Observations

  • orders_per_dasher is the dominant predictor, confirming that marketplace congestion (supply-demand imbalance) drives delivery delays more than raw order volume.
  • estimated_store_to_consumer_driving_duration remains a structural driver, as expected in any last-mile delivery problem.
  • Time-based features such as order_hour indicate meaningful time-of-day effects.
  • The presence of store_id suggests consistent store-level preparation time differences, which tree models can capture but may introduce generalization risk.
  • Overall, importance rankings align strongly with domain intuition, increasing confidence in model validity.
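The congestion features ranked above can be derived in DuckDB with a query along these lines. The inline VALUES rows are toy data standing in for the raw marketplace columns, and `NULLIF` guards the zero-dasher case; the real pipeline's table and exact SQL may differ:

```r
library(duckdb)  # attaches DBI

con <- dbConnect(duckdb())

# Toy rows: (total_onshift_dashers, total_busy_dashers, total_outstanding_orders)
dbExecute(con, "
  CREATE TABLE deliveries AS
  SELECT * FROM (VALUES (10, 8, 25), (4, 4, 12), (0, 0, 3))
    AS t(total_onshift_dashers, total_busy_dashers, total_outstanding_orders)
")

features <- dbGetQuery(con, "
  SELECT
    total_outstanding_orders * 1.0
      / NULLIF(total_onshift_dashers, 0) AS orders_per_dasher,
    total_busy_dashers * 1.0
      / NULLIF(total_onshift_dashers, 0) AS busy_ratio
  FROM deliveries
  ORDER BY total_onshift_dashers DESC
")

dbDisconnect(con, shutdown = TRUE)
```

The `NULLIF` idiom returns NULL (NA in R) rather than dividing by zero when no dashers are on shift.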

Predicted vs Actual (XGBoost)

Code
library(ggplot2)

predictions <- read.csv("outputs/test_predictions.csv")

ggplot(predictions, aes(x = actual, y = predicted)) +
  geom_point(alpha = 0.2) +
  geom_abline(slope = 1, intercept = 0, color = "red", linewidth = 1) +
  theme_minimal() +
  labs(
    title = "Predicted vs Actual Delivery Time (XGBoost)",
    x = "Actual Delivery Duration (seconds)",
    y = "Predicted Delivery Duration (seconds)"
  )

Observations

  • Predictions cluster tightly around the diagonal at lower delivery durations, indicating strong performance for typical orders.
  • Dispersion widens as actual delivery time increases, so prediction errors grow for longer deliveries.
  • The model systematically underestimates very long deliveries, likely because outliers were trimmed and extreme cases carry little training signal.
  • Overall, the model captures the central tendency well but is less precise in the upper tail.
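One way to quantify this tail behaviour is to compute MAE within actual-duration quartiles. The sketch below simulates predictions whose error grows with duration; with the saved predictions, `actual` and `predicted` would come from the `test_predictions.csv` columns instead:

```r
# MAE by actual-duration quartile (simulated data for illustration)
set.seed(1)
actual    <- rexp(1000, rate = 1 / 2800)                   # delivery durations (s)
predicted <- actual + rnorm(1000, 0, 200 + 0.15 * actual)  # error widens with duration

bucket   <- cut(actual, quantile(actual, seq(0, 1, 0.25)), include.lowest = TRUE)
mae_by_q <- tapply(abs(actual - predicted), bucket, mean)
mae_by_q  # fourth-quartile MAE clearly exceeds first-quartile MAE
```

A flat profile across quartiles would indicate homoscedastic errors; a rising profile, as seen in the scatter plot, pinpoints where the model loses precision.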