The problem
Departure delays ripple through the entire aviation system — missed connections, crew scheduling chaos, and a measurable hit to passenger trust. The goal was to predict both whether a flight will be delayed and by how much, early enough to be operationally useful.
The data is the U.S. Department of Transportation Bureau of Transportation Statistics flight records from January 2018 through August 2023 — millions of rows spanning every domestic carrier, enriched with weather, airport congestion, and geospatial context.
Approach
The pipeline starts with exploratory data analysis examining how airport busyness, airline size, and airport altitude relate to delays, followed by geospatial analysis to surface regional delay hotspots.
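As a rough illustration of that exploratory step, the sketch below aggregates delays per airport and then snaps airports to a coarse latitude/longitude grid to rank regions by average delay. The column names, the 10-degree grid, and the handful of delay values are all hypothetical and stand in for the real BTS table.

```python
import pandas as pd

# Toy sample: column names and delay values are made up for illustration.
flights = pd.DataFrame({
    "origin": ["JFK", "JFK", "EWR", "EWR", "DEN", "DEN", "SEA", "SEA"],
    "origin_lat": [40.64, 40.64, 40.69, 40.69, 39.86, 39.86, 47.45, 47.45],
    "origin_lon": [-73.78, -73.78, -74.17, -74.17, -104.67, -104.67, -122.31, -122.31],
    "dep_delay_min": [30, 10, 25, 35, 5, 15, 0, 10],
})

# Per-airport summary: flight count as a crude busyness proxy, plus mean delay.
per_airport = flights.groupby("origin").agg(
    n_flights=("dep_delay_min", "size"),
    mean_delay=("dep_delay_min", "mean"),
    lat=("origin_lat", "first"),
    lon=("origin_lon", "first"),
)

# Coarse geospatial binning: floor each coordinate to a 10-degree grid cell.
per_airport["lat_bin"] = (per_airport["lat"] // 10 * 10).astype(int)
per_airport["lon_bin"] = (per_airport["lon"] // 10 * 10).astype(int)

# Rank grid cells by the average of their airports' mean delays.
hotspots = (
    per_airport.groupby(["lat_bin", "lon_bin"])["mean_delay"]
    .mean()
    .sort_values(ascending=False)
)
print(hotspots)
```

A real analysis would likely use a finer grid or a proper geospatial library; the grid here only shows the shape of the aggregation.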
Preprocessing and feature selection
- Standardization of numeric features and categorical encoding.
- Outlier removal to stop extreme delays from skewing the regressors.
- Feature selection via regression coefficients and Recursive Feature Elimination (RFE).
Tech deep-dive
Two problems were modeled in parallel: regression for delay magnitude in minutes, and classification for the binary delayed / on-time outcome.
- Regression: Linear Regression, Lasso, Ridge, ElasticNet, Neural Networks, and LightGBM — the LightGBM Regressor was the top performer.
- Classification: Gaussian Naive Bayes, LightGBM Classifier, Decision Tree, and Random Forest — the Random Forest Classifier won on F1 and accuracy.
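To make the two-target setup concrete, here is a minimal sketch of how one delay column can feed both problems, followed by a Random Forest classifier scored on accuracy and F1. The features are synthetic, and the 15-minute cutoff is an assumption: it mirrors the threshold commonly used in DOT on-time reporting, but the write-up does not state which cutoff was applied.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 2000

# Synthetic features standing in for congestion and schedule signals.
X = rng.normal(size=(n, 4))
# Synthetic delay in minutes, driven mostly by the first feature plus noise.
delay_minutes = 12 + 10 * X[:, 0] + rng.normal(scale=5, size=n)

# Regression target: the delay itself. Classification target: delayed or not.
DELAY_THRESHOLD_MIN = 15  # assumption: the source does not state its cutoff
y_reg = delay_minutes
y_clf = (delay_minutes >= DELAY_THRESHOLD_MIN).astype(int)

# Stratify so both splits keep the same delayed / on-time ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y_clf, test_size=0.25, random_state=42, stratify=y_clf
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, pred)
f1 = f1_score(y_test, pred)
print(f"accuracy={accuracy:.3f} f1={f1:.3f}")
```

Reporting F1 alongside accuracy is useful when delayed flights are the minority class, since accuracy alone can look good for a model that rarely predicts a delay.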
from lightgbm import LGBMRegressor
model = LGBMRegressor()
model.fit(X_train, y_train)
preds = model.predict(X_test)
Outcomes & learnings
- The Random Forest Classifier reached the highest F1 score and accuracy of the suite.
- Larger airlines show consistently lower delays than smaller carriers.
- East Coast airports are persistent delay hotspots.
- Airport busyness correlates positively with departure delay: busier airports tend to see longer delays.
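One way to check a busyness finding like the last one is to compute a correlation coefficient and a fitted slope on a per-airport summary. The sketch below runs on synthetic numbers generated purely for illustration, with hypothetical column names; it shows the mechanics of the check and says nothing about the project's actual values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Synthetic per-airport summary; values are generated, not measured.
airports = pd.DataFrame({"daily_departures": rng.integers(20, 1500, 60)})
airports["mean_dep_delay_min"] = (
    5 + 0.01 * airports["daily_departures"] + rng.normal(scale=2, size=60)
)

# Pearson r: strength and direction of the linear association.
r = airports["daily_departures"].corr(airports["mean_dep_delay_min"])

# Least-squares line: extra minutes of delay per additional daily departure.
slope, intercept = np.polyfit(
    airports["daily_departures"], airports["mean_dep_delay_min"], deg=1
)
print(f"r={r:.2f} slope={slope:.4f} min/departure intercept={intercept:.2f} min")
```

The two numbers answer different questions: r says how tightly delay tracks busyness, while the slope says how much delay grows as traffic grows.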
The biggest lesson was that geospatial and congestion features carried more signal than raw schedule data — a reminder that feature engineering, not model choice, is usually where prediction quality is won.