SkyFlow: AI-Powered Flight Delay Predictions

Dipti Aswath | LinkedIn | Email | Early SkyFlow Prototype

License

This project is licensed under the Apache License 2.0. You may use, modify, and distribute this code under the terms of the license. See the LICENSE file for more details.

Attribution: Please give proper credit to the original author listed above when reusing or redistributing the code.

Executive Summary
Deep Dives
Data Sources
Methodology Used for Data Preparation and Modeling
Project Structure
Project Infrastructure
Key Insights from Phase 1 to Phase 2 of Project
Future Work
Appendix
References

Executive Summary

Problem Statement:

Airlines and airports face significant operational challenges due to flight delays, which can be caused by a variety of factors including flight status, weather conditions, air traffic congestion, aircraft specifics, and inefficiencies in ground and passenger handling. The objective is to predict flight delays by developing a multi-class classification model that considers both departure and arrival delays, helping improve operational planning and customer satisfaction.

Rationale:

Flight delays can have widespread consequences for airlines, from passenger dissatisfaction to operational disruptions. Developing a predictive model for flight delays not only addresses the core issue of minimizing delays but also enhances decision-making processes across various facets of airline operations.

Business Case 1: Enhancing Operational Efficiency

Predicting flight delays enables airlines to optimize their operations, routing, and resource management.

Route Optimization and Scheduling Adjustments: Airlines can reroute flights to avoid congested airspace or adverse weather, minimizing delays. Predictions also allow real-time adjustments to schedules, gates, and crew to manage disruptions efficiently.
Resource Allocation: By anticipating delays, airlines can proactively allocate ground crew, gates, and equipment, reducing the cascading effects on other flights.
Operational Resilience: Dynamic rerouting and resource realignment minimize the operational impacts of weather or high-traffic delays, enhancing resilience in crisis situations.
Cost Management: Avoiding delays lowers costs linked to operational disruptions, improving resource utilization and overall profitability.

Business Case 2: Improving Customer Experience

Accurate delay predictions lead to better customer service and proactive communication, enhancing the passenger experience.

Proactive Passenger Communication: Accurate predictions allow airlines to update passengers promptly, manage expectations, and offer rebooking or compensation options.
Improved Customer Service: Delay forecasts support better service recovery, leading to a smoother passenger experience and increased loyalty.
Competitive Advantage: Effective rerouting and communication give airlines an edge in maintaining on-time performance and customer satisfaction.

By addressing these areas, airlines can significantly improve operational efficiency, enhance passenger experience with better customer satisfaction scores, and better manage resources and disruptions. Predictive modeling for flight delays is not just about minimizing delays but also about fostering a more responsive and resilient airline operation.

Example Usage: An AI system that predicts flight delays could also:

Suggest alternate flight paths that are less likely to experience delays.
Provide passengers with timely updates and rebooking options.
Dynamically adjust flight schedules to manage disruptions effectively.
Allocate resources efficiently to minimize the impact on subsequent flights.

Research Question:

How can we develop an AI and machine learning-powered smart system to accurately predict flight delays by assessing multiple factors, including departure and arrival times, flight status, weather conditions, air traffic, aircraft specifics, and ground operations?

Flight Delay Predictions - Key Metrics:

SkyFlow is an advanced tool that helps predict how flights might perform. It looks at many factors like weather, how busy the airport is, and how well the airline usually does. Then, it puts each flight into one of three groups:

On Time: These flights are expected to leave and arrive as scheduled.
Partial Delay: These flights might be delayed either leaving or arriving, but not both.
Full Delay: These flights are likely to be delayed both leaving and arriving.
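
The three groups can be written as a small labeling rule. The sketch below is only an illustration: it assumes each flight carries two boolean flags (`dep_delayed` and `arr_delayed`, both hypothetical names), and it does not define what threshold counts as "delayed".

```python
def delay_class(dep_delayed: bool, arr_delayed: bool) -> int:
    """Map two delay flags to a class: 0 = on time, 1 = partial, 2 = full."""
    # The flags sum to the number of delayed legs (0, 1 or 2),
    # which lines up with the three groups described above.
    return int(dep_delayed) + int(arr_delayed)


LABELS = {0: "On Time", 1: "Partial Delay", 2: "Full Delay"}

# Hypothetical (departure delayed, arrival delayed) flags for four flights.
flights = [(False, False), (True, False), (False, True), (True, True)]
classes = [delay_class(dep, arr) for dep, arr in flights]
print([LABELS[c] for c in classes])
```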

Why is this important?

It helps everyone plan better:

Airlines: Can manage their schedules more effectively.
Airports: Can prepare for busy times.
Passengers: Can adjust their plans if needed.

How do we know if SkyFlow is doing a good job?

We look at five main things to evaluate SkyFlow’s performance:

Precision: How often SkyFlow correctly identifies delay groups when it predicts a delay.
Recall: How often SkyFlow correctly identifies actual delays out of all delayed flights.
F1 Score: How well SkyFlow balances precision and recall.
Precision-Recall Area Under the Curve (PR AUC): How well SkyFlow performs across different thresholds for classifying delays.
Receiver Operating Characteristic Area Under the Curve (ROC AUC): How well SkyFlow distinguishes between delayed and on-time flights.

Our goal is to make SkyFlow as accurate as possible, so everyone can rely on its predictions to make their travel smoother and more predictable.

To monitor overall performance, we use the Precision-Recall Area Under the Curve (PR AUC) and Receiver Operating Characteristic Area Under the Curve (ROC AUC).

To evaluate the balance between correctly identifying delays and avoiding false alarms, we rely on the F1 Score as the primary metric, which combines precision and recall into a single value. In addition, the F1 score is optimized to give more weight to the full-delay class.
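
As a concrete illustration of the primary metric, the sketch below computes a support-weighted F1 score for three classes in plain Python. It follows the usual definition of "weighted" averaging (each class's F1 weighted by its number of true instances). The labels are made up, and any extra weighting toward the full-delay class used in SkyFlow is not reproduced here.

```python
from collections import Counter


def per_class_f1(y_true, y_pred, label):
    """F1 for one class, treating that class as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighted by class support in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(per_class_f1(y_true, y_pred, c) * n / total for c, n in support.items())


# Hypothetical labels: 0 = on time, 1 = partial delay, 2 = full delay.
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 0, 2, 0, 1, 2, 2, 2, 0]
print(round(weighted_f1(y_true, y_pred), 4))
```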

Approach

CRISP-DM Framework:

For the Flight Delay Prediction problem, the CRISP-DM (Cross Industry Standard Process for Data Mining) framework was applied to provide a structured solution. The process was as follows:

Business Understanding: The goal was to predict flight delays to improve airline operational efficiency and enhance customer satisfaction by reducing unexpected delays.
Data Understanding: A detailed analysis of the dataset was performed, identifying key patterns and relationships, such as flight times, delays, and distances, that could significantly influence prediction outcomes.
Data Preparation: The raw data was preprocessed, and relevant features were engineered. This included detailed delay metrics such as departure and arrival times, distances, and other flight-specific attributes to ensure high-quality inputs for model training.
Modeling: Various machine learning models were trained and evaluated, focusing on performance metrics like Precision-Recall AUC, ROC AUC, and F1 score. These models were iteratively tuned to optimize predictive performance.
Deployment: The best-performing model was integrated into SkyFlow’s prototype application, enabling real-time flight delay predictions. Future iterations aim to further enhance the operational decision-making.

Feature Engineering:

During the data preparation phase, significant feature engineering was conducted as outlined in a later Methodology section. Initially, features that captured the relationship between departure and arrival delays were found to introduce data leakage, leading to overly optimistic predictions. As a result, these features were excluded in Phase 2.

To improve delay predictions in Phase 2, new features were engineered by tracking flight segment sequences for each tail number on a given day (SEGMENT_NUMBER).

Historical flight information, such as previous airports (PREVIOUS_AIRPORT), prior delays (PREVIOUS_ARR_DELAY), and flight durations (PREVIOUS_DURATION), was incorporated. This was done by merging each current flight record, along with its own FLIGHT_DURATION, with the data from the corresponding previous segment, providing a richer dataset for predicting delays.

Please refer to the Enhanced Feature Engineering Algorithm section under Deep Dives for details on the algorithm.

Key Findings from Exploratory Data Analysis:

Highest Departure and Arrival delays by Carriers (2019): Identifying the carriers with the highest delays directly relates to improved customer experience and financial impact. By pinpointing these carriers, airlines can better manage customer expectations, offer targeted support, and address issues that could lead to costly disruptions and compensation claims.

Top 30 Congested Airports with Flight Delays (2019): This finding supports enhanced operational efficiency and operational resilience. By focusing on the most congested airports, airlines can optimize resource allocation and improve scheduling to alleviate delays at these critical points, leading to smoother operations and better crisis management.

SMOTE Resampling on Training Data: Demonstrates the importance of data-driven decision making. By improving model performance through resampling, airlines can make more accurate predictions about delays, leading to better strategic planning and performance monitoring.
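
For readers unfamiliar with SMOTE, the core idea is to create synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority neighbours. The following is a simplified sketch of that idea in plain Python, not the resampling code used in the project. The function name, the value of `k`, and the toy points are all assumptions.

```python
import math
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        lam = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + lam * (o - b) for b, o in zip(base, other)))
    return synthetic


# Toy 2-D minority class; the values are illustrative only.
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_like(minority, n_new=6)
print(len(minority) + len(new_points))
```

Because each synthetic point is a convex combination of two real minority points, it always lies on the segment between them.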

Delay Trends Across Distance Groups and Flight Segments (2019): This finding helps provide valuable insights into how aircraft operational schedules and the number of daily flights contributed to 2019 delays, effectively addressing operational efficiency and contingency planning. Understanding how delay patterns vary with flight distance and segment numbers helps airlines plan better turnaround times and manage operational schedules more effectively to prevent delays.

Segment Number Decreases with Distance: As flight distance increases, the number of segments (flights) decreases. Aircraft flying longer routes complete fewer flights in a day due to time constraints.
Delays Correlate with Higher Segment Numbers: Flights scheduled for more segments in a day are more prone to delays, regardless of distance. These delays are likely due to operational factors, such as shorter turnaround times, leading to delayed departures and arrivals.

Median Departure and Arrival Delays per Carrier (2019): Identified the top 20 carriers with the highest median delays. For each carrier, the top 20 airports with the most significant contribution to delays were also identified. By examining median delays, airlines can gain insights into typical delay experiences and ensure compliance with regulations. Focusing on specific carriers and airports with high delays can enhance overall safety and customer satisfaction.

Comprehensive Delay Analysis: By considering both departure and arrival delays, we provide a more holistic view of 2019 airline performance and airport efficiency. Endeavor Air Inc shows the highest delay, at Miami International Airport. Comair Inc follows with the next highest delay, at Portland International Airport.
Focus on median delays: The use of median delays helped identify typical delay experiences, filtering out the effect of extreme delays that skewed averages.
Unique Operational Factors: The variation in delay trends suggests that delays may be influenced by distinct factors specific to each carrier and airport, rather than being caused by common issues across multiple locations. For instance, both Endeavor Air Inc and Comair Inc experienced higher-than-usual precipitation at the airports on their flight day, which could have contributed to their delays.
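
A small numeric example shows why the median is less sensitive to extreme delays than the mean. The delay values below are invented for illustration.

```python
from statistics import mean, median

# Hypothetical departure delays in minutes, with one extreme outlier (300).
delays = [5, 8, 10, 12, 15, 300]

avg_delay = mean(delays)       # pulled upward by the single 300-minute delay
typical_delay = median(delays)  # stays close to what most flights experienced

print(avg_delay, typical_delay)
```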

Analyzing Trends in Flight Delays by Distance Groups (2019): This analysis examines how flight delays vary across distance categories, which helps airlines optimize their operations and informs strategies to mitigate delays.

Flights traveling short and moderate distances tend to have higher delays than flights in the other distance categories.

Analyzing Trends in Flight Delays by Season, Time of Day and Day of Week (2019): This trend analysis aims to assist airlines in optimizing their operations by informing strategies to mitigate delays.

Seasonal Trends: Summer months generally experience the highest rates of both arrival and departure delays, while winter months also show significant arrival delays.
Time of Day: Early morning and late-night flights are associated with the highest arrival delays, whereas afternoon and evening flights tend to experience more departure delays.
Weekly Patterns: There is a noticeable dip in arrival delays on Saturdays, while other days of the week exhibit a relatively even distribution of both arrival and departure delays.

Analyzing Historical Average Delays (2019): This analysis visualizes two historical averages: DEP_BLOCK_HIST, the historical average delay for different departure time blocks aggregated by month, and DEP_AIRPORT_HIST, the historical average delay rate for flights departing from specific airports per month. It examines how these metrics fluctuate with time-related and seasonal factors, aiming to provide insights into delay patterns across different times of day, days of the week, and seasons.

Seasonal Trends: Historical average delays are generally higher during the summer months, followed by winter and spring.
Weekly Trends: Historical delays are evenly distributed throughout the week.
Time of Day: Average delays for different departure time blocks are notably higher in the afternoons and evenings.
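
Group-level historical averages such as DEP_BLOCK_HIST can be sketched as a grouped mean. The example below assumes each record has a month, a departure time block, and a 0/1 delay indicator. The record layout, the `delayed` field, and the helper name are hypothetical, not the project's actual schema.

```python
from collections import defaultdict


def historical_rate(records, key_fields, value_field="delayed"):
    """Return {group key: mean of value_field} for the given grouping fields."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        sums[key] += rec[value_field]
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


records = [
    {"month": 7, "dep_block": "1700-1759", "delayed": 1},
    {"month": 7, "dep_block": "1700-1759", "delayed": 0},
    {"month": 7, "dep_block": "0600-0659", "delayed": 0},
    {"month": 1, "dep_block": "1700-1759", "delayed": 1},
]

dep_block_hist = historical_rate(records, key_fields=("month", "dep_block"))
for rec in records:
    # Attach the group average back onto each flight as a feature.
    rec["DEP_BLOCK_HIST"] = dep_block_hist[(rec["month"], rec["dep_block"])]
print(records[0]["DEP_BLOCK_HIST"])
```

The same helper could produce a per-airport average by grouping on a month and airport field instead.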

Analyzing Average Weather Features by Airlines and Airports (2019): This analysis examines how selected weather features (PRCP, TMAX, AWND, SNOW, SNWD) vary across carriers, departing airports, and previous airports, to observe any patterns in how weather conditions impact flight operations.

There was no significant trend observed in the average values of the selected weather features, when grouped by the specified columns (CARRIER_NAME, DEPARTING_AIRPORT, PREVIOUS_AIRPORT).

Actionable Insights - Recommendations from Exploratory Data Analysis:

Finding	Recommendation
Highest Departure and Arrival Delays by Carriers	- Implement targeted training and support programs for high-delay carriers to improve operational efficiency. - Use delay data to manage customer communications proactively.
Top 30 Congested Airports with Flight Delays	- Allocate more resources and staff during peak times at congested airports to minimize delays. - Develop contingency plans for high-traffic airports to handle surges in passenger volume effectively.
Delay Trends Across Distance Groups and Flight Segments	- Analyze operational schedules to optimize turnaround times for flights, especially those with multiple segments. - Review scheduling for short and moderate-distance flights to reduce potential delays.
Seasonal Trends	- Increase staffing and operational resources during summer months to manage higher delay rates effectively. - Monitor weather patterns and adjust scheduling in advance to minimize disruptions during winter months.
Time of Day	- Consider adjusting flight schedules to reduce the number of early morning and late-night flights that experience high arrival delays. - Increase capacity and resources during afternoon and evening hours to mitigate departure delays.
Weekly Patterns	- Evaluate operational strategies to understand the factors contributing to increased delays on specific days. - Promote Saturday travel incentives to balance the load and improve operational efficiency.

Model Evaluation and Performance Summary:

The following machine learning models were evaluated for predicting flight delays, listed in order:

Dummy Classifier (Baseline)
Multinomial Logistic Regression Classifier
Decision Tree Classifier with hyperparameter tuning
Ensemble Models – Bagging with Bagging Classifier with Decision Trees, Random Forest; Boosting with XGBoost, CatBoost and Light Gradient Boosting Machine (LGBM)
Hybrid Ensemble Models – Voting Classifier as an ensemble of XGBoost and Random Forest; Stacking Classifiers (with and without hyperparameter tuning) using XGBoost and Random Forest as base estimators and a one-vs-rest Logistic Regression as the meta-classifier; and a custom Hybrid Ensemble that combines the Voting Classifier and the tuned Stacking Classifier

The ensemble and hybrid ensemble models outperformed the baseline, Logistic Regression, and Decision Tree models. The Deep Dives section summarizes and compares the key metrics across these model groups; the final recommendation for production deployment follows below.
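
To illustrate how a voting ensemble combines its members, the sketch below averages class probabilities from two models and picks the most probable class (soft voting). Whether SkyFlow's Voting Classifier uses soft or hard voting is not stated here, and the probabilities are made up.

```python
def soft_vote(prob_sets, weights=None):
    """Average per-class probabilities across models; return (class, averaged probs)."""
    n_models = len(prob_sets)
    weights = weights or [1.0] * n_models
    total = sum(weights)
    n_classes = len(prob_sets[0])
    avg = [
        sum(w * probs[c] for w, probs in zip(weights, prob_sets)) / total
        for c in range(n_classes)
    ]
    return max(range(n_classes), key=avg.__getitem__), avg


# Hypothetical probabilities [on time, partial delay, full delay] for one flight.
xgboost_probs = [0.50, 0.10, 0.40]
random_forest_probs = [0.30, 0.10, 0.60]

predicted_class, averaged = soft_vote([xgboost_probs, random_forest_probs])
print(predicted_class, [round(p, 2) for p in averaged])
```

In this toy case the two models disagree on the most likely class, and the averaged probabilities decide the outcome.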

Actionable Insights - Recommendations for Model Selection and Deployment for Flight Delay Predictions:

Best Model: Voting Classifier

The Voting Classifier emerges as the best overall model for flight delay predictions due to its performance across multiple metrics:

Highest weighted F1 score (0.7944)
Highest accuracy (0.8290)
Best weighted PR AUC (0.82)
Best weighted ROC AUC (0.80)

Key strengths:

Strong performance in identifying on-time flights (class 0)
Good balance between precision and recall across delay classes

Deployment considerations:

Implement as the primary model for flight delay predictions
Use for real-time predictions and operational decision-making
Integrate into airline and airport management systems

Alternate Model: Hybrid Ensemble Classifier

The Hybrid Ensemble Classifier is an alternate choice:

High weighted F1 score (0.7935)
Good accuracy (0.8234)
High weighted PR AUC (0.82)
High weighted ROC AUC (0.80)

Key strengths:

Performance comparable to the Voting Classifier
Good balance between precision and recall for delay classes 0 and 2

Deployment considerations:

Use as a complementary model to the Voting Classifier
Use where compute resources and infrastructure allow for multiple model deployments

Actionable Insights - Recommendations based on influential Features in Flight Delay Predictions:

Feature	Recommendation
PREVIOUS_ARR_DELAY	- Implement robust systems to track and analyze previous flight delays. - Develop strategies to mitigate the cascading effect of delays (e.g., buffer time between connected flights).
SEGMENT_NUMBER	- Optimize flight schedules, especially for aircraft making multiple trips per day. - Consider maintenance and crew scheduling to minimize delays in later segments.
PREVIOUS_DURATION	- Analyze routes with consistently longer durations and consider adjustments. - Improve accuracy of flight duration estimates for better scheduling.
DEP_PART_OF_DAY	- Adjust departure times to less congested periods of the day. - Allocate more resources during peak departure times.
PREVIOUS_AIRPORT	- Identify problematic connections or airports. - Optimize route networks to minimize impact of delay-prone airports.
DISTANCE	- Allocate appropriate aircraft to routes based on distance. - Consider fuel stops or direct flights for very long distances.
DEP_BLOCK_HIST	- Use historical data to predict and prepare for delays during specific time blocks. - Adjust staffing and resources based on historically problematic time periods.
CARRIER_NAME	- Benchmark airline performance against industry standards. - Share best practices within the organization to improve overall efficiency.
PRCP (Precipitation)	- Enhance weather forecasting capabilities. - Develop contingency plans for various weather scenarios. - Invest in equipment and training for efficient operations during adverse weather.
DAY_OF_WEEK	- Adjust resources and schedules based on weekly patterns. - Implement dynamic pricing strategies to manage demand across different days.

Deep Dives

Enhanced Feature Engineering Algorithm

    Input:
    - Raw flight data
    - Aircraft data
    - Weather data
    - Airport data
    - Airline data

    Output:
    - Enriched dataset with engineered features for flight delay prediction

    Algorithm:

    1. Initialize empty dataset D for engineered features

    2. For each flight record F in raw flight data:

        2.1. Extract basic flight information (date, origin, destination, etc.)

        2.2. Compute SEGMENT_NUMBER:

            a. Group flights by TAIL_NUM and DAY_OF_MONTH

            b. Sort by DEP_TIME within each group

            c. Assign sequential numbers starting from 1

        2.3. Add SEGMENT_NUMBER to D

    3. For each flight record F in D:

        3.1. Identify previous flight P with same TAIL_NUM

        3.2. If P exists:

            a. Set PREVIOUS_AIRPORT = P.DESTINATION

            b. Set PREVIOUS_ARR_DELAY = P.ARR_DELAY

            c. Set PREVIOUS_DEP_DELAY = P.DEP_DELAY

            d. Set PREVIOUS_DURATION = P.ACTUAL_ELAPSED_TIME

        3.3. Else:

            Set all PREVIOUS_* features to null or appropriate default values

        3.4. Add PREVIOUS_* features to D

    4. Compute FLIGHT_DURATION:

        4.1. FLIGHT_DURATION = CRS_ARR_TIME - CRS_DEP_TIME

        4.2. Add FLIGHT_DURATION to D

    5. Merge weather data with D based on date and airport

    6. Compute temporal features:

        6.1. Extract MONTH, DAY_OF_WEEK from date

        6.2. Compute SEASON based on MONTH

        6.3. Compute DEP_PART_OF_DAY based on CRS_DEP_TIME

        6.4. Add temporal features to D

    7. Merge airport and airline data with D

    8. Compute flight statistics, passenger statistics, and employee statistics:

        8.1. Add all statistics features to D 

    9. Compute historical performance metrics:

        9.1. Calculate CARRIER_HISTORICAL (average delay by carrier and month)

        9.2. Calculate DEP_AIRPORT_HIST (average delay by departure airport and month)

        9.3. Calculate DEP_BLOCK_HIST (average delay by departure time block and month)

        9.4. Add historical metrics to D

    10. Handle missing values and perform necessary data type conversions

    11. Return enriched dataset D
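
Steps 2 and 3 of the algorithm above can be sketched in Python as follows. This is a minimal illustration under assumptions: flights are plain dictionaries, and lower-case field names such as `date`, `dep_time`, `dest`, and `elapsed` are hypothetical stand-ins for the real columns.

```python
from collections import defaultdict


def add_segment_features(flights):
    """Add SEGMENT_NUMBER and PREVIOUS_* fields to each flight dict (in place)."""
    ordered = sorted(flights, key=lambda f: (f["tail_num"], f["date"], f["dep_time"]))
    segment_counter = defaultdict(int)  # (tail_num, date) -> flights seen so far
    last_flight = {}                    # tail_num -> most recent earlier flight
    for f in ordered:
        day_key = (f["tail_num"], f["date"])
        segment_counter[day_key] += 1
        f["SEGMENT_NUMBER"] = segment_counter[day_key]

        prev = last_flight.get(f["tail_num"])
        # Step 3.3: no earlier flight for this aircraft, so PREVIOUS_* stay None.
        f["PREVIOUS_AIRPORT"] = prev["dest"] if prev else None
        f["PREVIOUS_ARR_DELAY"] = prev["arr_delay"] if prev else None
        f["PREVIOUS_DURATION"] = prev["elapsed"] if prev else None
        last_flight[f["tail_num"]] = f
    return ordered


flights = [
    {"tail_num": "N101", "date": "2019-07-01", "dep_time": 1300, "dest": "ORD", "arr_delay": 25, "elapsed": 130},
    {"tail_num": "N101", "date": "2019-07-01", "dep_time": 700, "dest": "ATL", "arr_delay": 0, "elapsed": 95},
    {"tail_num": "N202", "date": "2019-07-01", "dep_time": 900, "dest": "DEN", "arr_delay": 10, "elapsed": 150},
]
enriched = add_segment_features(flights)
for f in enriched:
    print(f["tail_num"], f["SEGMENT_NUMBER"], f["PREVIOUS_AIRPORT"], f["PREVIOUS_ARR_DELAY"])
```

On a large dataset the same logic would more naturally be expressed with grouped, sorted DataFrame operations; the loop form is used here only to keep each step visible.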

Performance comparison across Baseline, Logistic Regression and Decision Tree

Model	Strengths	Weaknesses	Key Observations	Important Features
Baseline	- Simple and fast	- Very poor weighted F1 score (0.0373) - Low weighted PR AUC (0.63) - Poor weighted ROC AUC (0.50) - Low accuracy (0.1461) - Unable to distinguish between classes effectively	- Performs poorly across all metrics - Not suitable for this classification task	N/A
Multinomial Logistic Regression	- Best overall performance - Highest weighted F1 score (0.7329) - Highest weighted PR AUC (0.77) - Best weighted ROC AUC (0.74) - Best accuracy (0.7051) - Good balance between precision and recall	- Still struggles with minority class (class 1) - Slightly lower interpretability compared to Decision Tree	- Shows the best overall performance - Outperforms other models in most weighted metrics - Provides a good balance across different metrics and classes	Positive influence on class 2: - DAY_OF_WEEK - CARRIER_NAME - PREVIOUS_ARR_DELAY - MONTH - ARR_PART_OF_DAY - DEP_PART_OF_DAY - SEASON Negative influence on class 2: - PREVIOUS_DURATION_CATEGORY - FLIGHT_DURATION_CATEGORY - DISTANCE_GROUP_DESC
Hyperparameter-tuned Decision Tree	- Competitive weighted F1 score (0.7422) - Good weighted PR AUC (0.74) - Decent weighted ROC AUC (0.70) - Highest accuracy (0.7359) - Better interpretability than Logistic Regression	- Slightly lower weighted F1 score than Logistic Regression - Lower weighted PR AUC and ROC AUC compared to Log

Performance comparison across Ensemble Bagging and Boosting Classifiers

Model	Strengths	Weaknesses	Key Observations	Important Features
BaggingClassifier (Decision Tree)	- High weighted F1 score (0.7888) - High weighted PR AUC (0.81) - Good weighted ROC AUC (0.78)	- Slightly lower weighted ROC AUC compared to some other models	- Balanced performance across weighted metrics - Good overall predictive power	Top 5 (Permutation Importance): 1. PREVIOUS_ARR_DELAY: 0.1340 2. PREVIOUS_DURATION: 0.0802 3. SEGMENT_NUMBER: 0.0766 4. DEP_PART_OF_DAY: 0.0597 5. ARR_PART_OF_DAY: 0.0199
Random Forest Classifier	- High weighted F1 score (0.7887) - High weighted PR AUC (0.81) - Best weighted ROC AUC (0.79)	- Marginally lower weighted F1 score than BaggingClassifier	- Very similar performance to BaggingClassifier - Slightly better at handling class imbalance	Top 5 (Built-in Importance): 1. PREVIOUS_ARR_DELAY: 0.1370 2. DISTANCE: -0.0006 3. TMAX: -0.0000 4. FLIGHT_DURATION: -0.0005 5. AWND: 0.0001
XGBoost Classifier	- High weighted PR AUC (0.81) - High weighted ROC AUC (0.79)	- Lower weighted F1 score (0.7682) compared to BaggingClassifier and Random Forest	- Good balance between precision and recall - Strong performance in AUC metrics	Top 5 (Built-in Importance): 1. PREVIOUS_ARR_DELAY: 0.1718 2. DEP_PART_OF_DAY: 0.0503 3. PREVIOUS_DURATION_CATEGORY: -0.0040 4. PRCP: 4.3230 5. ARR_PART_OF_DAY: 4.4493
LightGBM	- High weighted PR AUC (0.81) - High weighted ROC AUC (0.79)	- Lower weighted F1 score (0.7182)	- Underperforms in F1 score compared to other models - Maintains strong AUC performance	Top 5 (Built-in Importance): 1. AIRLINE_AIRPORT_FLIGHTS_MONTH: 1207.0000 2. AIRLINE_FLIGHTS_MONTH: 996.0000 3. PREVIOUS_ARR_DELAY: 1031.0000 4. DISTANCE: 915.0000 5. DEP_AIRPORT_HIST: 856
CatBoost	- Relatively high weighted PR AUC (0.78)	- Lowest weighted F1 score (0.5134) - Lowest weighted ROC AUC (0.75)	- Significantly underperforms compared to other models - Struggles with overall predictive power	Top 5 (Built-in Importance): 1. PREVIOUS_ARR_DELAY: 64.3610 2. DEP_PART_OF_DAY: 11.9855 3. ARR_PART_OF_DAY: 4.4493 4. PRCP: 4.3230 5. SEGMENT_NUMBER: 2.8930

Performance comparison across Hybrid Ensemble Classifiers

Model	Strengths	Weaknesses	Key Observations	Important Features
Voting Classifier	- Highest weighted F1 score (0.7944) - Highest accuracy (0.8290) - Best weighted PR AUC (0.82) - Best weighted ROC AUC (0.80)	- Low F1 score for class 1 (0.0677)	- Best overall performance - Strong in identifying on-time flights (class 0) - Good balance between precision and recall	Top 5 (Permutation Importance): 1. PREVIOUS_ARR_DELAY: 0.1311 2. SEGMENT_NUMBER: 0.0552 3. PREVIOUS_AIRPORT: 0.0476 4. PREVIOUS_DURATION: 0.0429 5. DEP_PART_OF_DAY: 0.0184
Stacking Classifier	- Good weighted F1 score (0.7896) - Good accuracy (0.8118) - High weighted PR AUC (0.81) - High weighted ROC AUC (0.79)	- Lower performance on class 1 (F1 score: 0.0936) compared to other classes	- Slightly lower performance than Voting Classifier - Better performance on class 1 compared to Voting Classifier	Top 5 (Permutation Importance): 1. PREVIOUS_ARR_DELAY: 0.1059 2. PREVIOUS_AIRPORT: 0.0247 3. SEGMENT_NUMBER: 0.0200 4. PREVIOUS_DURATION: 0.0192 5. DEP_PART_OF_DAY: 0.0173
Tuned Stacking Classifier	- Improved weighted F1 score (0.7921) - Improved accuracy (0.8180) - High weighted PR AUC (0.81) - High weighted ROC AUC (0.79)	- Still struggles with class 1 (F1 score: 0.0901)	- Performance improvement over base Stacking Classifier - Better balance across all classes	Top 5 (Permutation Importance): 1. PREVIOUS_ARR_DELAY: 0.1320 2. PREVIOUS_AIRPORT: 0.0734 3. SEGMENT_NUMBER: 0.0530 4. PREVIOUS_DURATION: 0.0472 5. DEP_PART_OF_DAY: 0.0256
Hybrid Ensemble Classifier	- High weighted F1 score (0.7935) - Good accuracy (0.8234) - High weighted PR AUC (0.82) - High weighted ROC AUC (0.80)	- Struggles with class 1 (F1 score: 0.0813)	- Performance comparable to other ensemble methods - Good balance between precision and recall for class 0 and 2	Top 5 (Permutation Importance): 1. PREVIOUS_ARR_DELAY: 0.1324 2. PREVIOUS_AIRPORT: 0.0495 3. PREVIOUS_DURATION: 0.0458 4. SEGMENT_NUMBER: 0.0431 5. DEP_PART_OF_DAY: 0.0216

Features influencing Flight Delay Predictions

Based on the feature importance results across these models, the following features are consistently influential in flight delay predictions (see the feature descriptions under Methodology):

PREVIOUS_ARR_DELAY: This is the top-ranked feature in every importance ranking reported above except LightGBM's, where it is still in the top three. It represents the arrival delay of the previous flight for the same aircraft.
SEGMENT_NUMBER: This feature, which represents the order of flights for an aircraft on a given day, is highly influential in several models.
PREVIOUS_DURATION: The duration of the previous flight is an important factor in predicting delays.
DEP_PART_OF_DAY: The time of day when the current flight departs is a significant predictor of delays.
PREVIOUS_AIRPORT: The airport from which the aircraft departed on its previous segment has a notable impact on delay predictions.
DISTANCE: The flight distance appears to be moderately important in several models.
DEP_BLOCK_HIST: Historical average delay for different departure time blocks is influential.
CARRIER_NAME: The airline operating the flight is a relevant factor in some models.
PRCP: Precipitation at the airport on the day of the flight is a notable weather-related feature.
DAY_OF_WEEK: The day of the week when the flight occurs has some influence on delay predictions.

These features consistently appear among the top influential factors across different models (Bagging Classifier, Random Forest, XGBoost, LightGBM, and ensemble methods like Voting and Stacking Classifiers). While the exact order and magnitude of importance varies between the models, these features represent a mix of temporal factors (previous delays and time of day), operational aspects (segment number and carrier), geographical elements (distance and previous airport), and weather conditions (precipitation).
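
Several of the rankings above are labeled as permutation importance. As a rough sketch of that idea (not the project's code), the example below shuffles one feature column at a time and records the resulting drop in accuracy for a toy rule-based model. The model, the data, and the helper names are all hypothetical.

```python
import random


def accuracy(model, rows, labels):
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, n_repeats=20, seed=0):
    """Mean accuracy drop when each feature column is shuffled independently."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = {}
    for feature in rows[0]:
        drops = []
        for _ in range(n_repeats):
            column = [r[feature] for r in rows]
            rng.shuffle(column)
            shuffled = [{**r, feature: v} for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(model, shuffled, labels))
        importances[feature] = sum(drops) / n_repeats
    return importances


def toy_model(row):
    # Toy stand-in for a trained classifier: predicts "delayed" (1) only
    # from the previous arrival delay and ignores the other feature.
    return int(row["PREVIOUS_ARR_DELAY"] > 15)


rows = [{"PREVIOUS_ARR_DELAY": d, "DAY_OF_WEEK": i % 7 + 1}
        for i, d in enumerate([0, 5, 40, 60, 3, 25, 10, 90])]
labels = [toy_model(r) for r in rows]

scores = permutation_importance(toy_model, rows, labels)
print(sorted(scores, key=scores.get, reverse=True))
```

A feature the model never uses scores zero, because shuffling it cannot change any prediction.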

Partial Dependence Plots - Visualize Feature Impact on Flight Delay Predictions for each Delay Class

[Figure: partial dependence plots showing feature impact for each delay class]

Data Sources

The data comes from a Kaggle dataset (linked here) comprising the CSV files listed below.

Air Carrier Summary
Aircraft Inventory
Air Carrier employee support (Ground Crew, Flight Attendants)
Flight On Time Reporting Status with Air Carrier info for 2019-2020
Airport Weather
Airport and Carrier look-up codes

Methodology Used for Data Preparation and Modeling

Data Preparation: Multiple raw CSV files were cleaned and merged into a unified dataset with ~4M entries for training and ~2M entries for testing, with 34 predictor variables and 1 target variable. The raw dataset description is here.

Feature Engineering:

Delay Categories: Classified delays into three distinct categories for more granular analysis of flight performance:

Class 0: On-time Departure and Arrival - Flights that depart and arrive within their scheduled times.

Class 1: Either Departure or Arrival Delayed - Flights that are delayed at departure or at arrival, but not both.

Class 2: Delayed Departure and Arrival - Flights that are delayed at both departure and arrival.
Aggregation Features: Developed historical delay averages, to identify patterns and trends in airline operations.

    CARRIER_HISTORICAL = captures the historical average delay rate of each carrier per month

    DEP_AIRPORT_HIST = captures historical average delay rates for flights departing from specific airports per month

    PREV_AIRPORT_HIST = captures historical average delay rate for the airport from which the aircraft arrived before the current departure

    DAY_HISTORICAL = captures historical average delays associated with each day of the week, adjusted monthly

    DEP_BLOCK_HIST = captures historical average delay rate for different departure time blocks, aggregated by month
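
A historical aggregate of this kind is a grouped mean of a delay indicator. The sketch below is illustrative only: the `DELAYED` flag, the row layout, and the helper name are assumptions, and computing the rates from training rows is a choice made here rather than a documented step.

```python
from collections import defaultdict


def historical_delay_rate(rows, key_fields, delayed_field="DELAYED"):
    """Average delay rate per group, e.g. per (carrier, month)."""
    totals = defaultdict(lambda: [0, 0])  # key -> [delayed count, flight count]
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        totals[key][0] += row[delayed_field]
        totals[key][1] += 1
    return {key: delayed / count for key, (delayed, count) in totals.items()}


# Hypothetical training rows; carrier codes and values are made up.
train = [
    {"CARRIER": "C1", "MONTH": 1, "DELAYED": 1},
    {"CARRIER": "C1", "MONTH": 1, "DELAYED": 0},
    {"CARRIER": "C1", "MONTH": 2, "DELAYED": 0},
    {"CARRIER": "C2", "MONTH": 1, "DELAYED": 1},
]
carrier_hist = historical_delay_rate(train, ["CARRIER", "MONTH"])

# Attach the aggregate back to each row as a feature.
for row in train:
    row["CARRIER_HISTORICAL"] = carrier_hist[(row["CARRIER"], row["MONTH"])]
```

The same helper would cover the other aggregates by changing `key_fields`, for example a departure airport and month pair for DEP_AIRPORT_HIST.
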

Time-Based Features: Extracted seasonal information from the month and categorized parts of the day using departure and arrival time blocks to enhance temporal analysis of flight data.
Distance-Based Features: Mapped distance groups to descriptive labels, providing clearer insights into flight range categories for more intuitive analysis.
Delay-Based Features: Created new features by combining actual departure and arrival times with scheduled times, generating detailed delay metrics to enhance analysis of flight performance and punctuality. However, in Phase 2 of model evaluation, these features were removed due to data leakage, as they resulted in nearly 100% prediction accuracy.

    ELAPSED_TIME_DIFF, DEP_DELAY, ARR_DELAY

Flight Duration, Previous Flight Duration and Arrival Delay: Phase2 also introduced new delay-based features. Flight duration was the total duration of the current flight calculated from the planned departure and arrival times. This feature helps in assessing how longer flight durations may correlate with increased delays. Previous Flight Duration and Previous Arrival Delay were introduced as historical features and the approach to engineering these new features is outlined in the executive summary.

[Figure: chart of flight duration]

    FLIGHT_DURATION, FLIGHT_DURATION_CATEGORY, PREVIOUS_DURATION, 
    PREVIOUS_DURATION_CATEGORY, PREVIOUS_ARR_DELAY
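
One way such previous-leg features could be derived is to order each aircraft's flights by scheduled departure and carry values forward from the preceding leg. This is a rough sketch under stated assumptions, not the project's actual algorithm: the `TAIL_NUM`, `SCHED_DEP_MIN`, `SCHED_ARR_MIN` and `ARR_DELAY_MIN` fields are hypothetical, times are minutes since midnight on a single day, time zones are ignored, and filling the first leg with 0 is a placeholder choice.

```python
from collections import defaultdict

MINUTES_PER_DAY = 24 * 60


def add_previous_leg_features(flights):
    """Add duration and previous-leg features to each flight dict, in place."""
    by_aircraft = defaultdict(list)
    for flight in flights:
        by_aircraft[flight["TAIL_NUM"]].append(flight)

    for legs in by_aircraft.values():
        legs.sort(key=lambda f: f["SCHED_DEP_MIN"])  # order legs within the day
        previous = None
        for flight in legs:
            # Planned duration; the modulo handles arrivals after midnight.
            flight["FLIGHT_DURATION"] = (
                flight["SCHED_ARR_MIN"] - flight["SCHED_DEP_MIN"]
            ) % MINUTES_PER_DAY
            if previous is None:
                # First leg of the day: no previous leg to draw from.
                flight["PREVIOUS_DURATION"] = 0
                flight["PREVIOUS_ARR_DELAY"] = 0
            else:
                flight["PREVIOUS_DURATION"] = previous["FLIGHT_DURATION"]
                flight["PREVIOUS_ARR_DELAY"] = previous["ARR_DELAY_MIN"]
            previous = flight
    return flights


# Hypothetical flights; values are made up for illustration.
flights = add_previous_leg_features([
    {"TAIL_NUM": "N1", "SCHED_DEP_MIN": 600, "SCHED_ARR_MIN": 690, "ARR_DELAY_MIN": 20},
    {"TAIL_NUM": "N1", "SCHED_DEP_MIN": 480, "SCHED_ARR_MIN": 560, "ARR_DELAY_MIN": 5},
    {"TAIL_NUM": "N2", "SCHED_DEP_MIN": 1400, "SCHED_ARR_MIN": 30, "ARR_DELAY_MIN": 0},
])
```
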

Employee Statistics Features: Developed features to analyze staffing and resourcing in airline and carrier operations, providing insights into workforce allocation, scheduling efficiency, and resource optimization.

    FLT_ATTENDANTS_PER_PASS, PASSENGER_HANDLING

Removed highly correlated features with VIF – see before and after removal:

Data Pre-Processing: Missing values and detected outliers were removed. SMOTETomek was applied to the training dataset only, combining SMOTE’s oversampling of the minority classes (Class1 and Class2) with Tomek links’ under-sampling. Categorical features were target encoded and numerical features were scaled.
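
To make the two resampling ideas concrete, the sketch below shows a single SMOTE interpolation step and a brute-force Tomek link search on toy points. It is a pure-Python illustration of the concepts, not a library implementation and not the project's pipeline; all names and data are hypothetical, and the quadratic neighbor search would not scale to millions of rows.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(i, points):
    """Index of the point closest to points[i], excluding itself."""
    others = (j for j in range(len(points)) if j != i)
    return min(others, key=lambda j: sq_dist(points[i], points[j]))


def smote_sample(x, neighbor, rng):
    """SMOTE step: a synthetic point on the segment from x to a same-class neighbor."""
    u = rng.random()
    return [xi + u * (ni - xi) for xi, ni in zip(x, neighbor)]


def tomek_links(points, labels):
    """Pairs of mutual nearest neighbors that carry different labels."""
    links = []
    for i in range(len(points)):
        j = nearest(i, points)
        if i < j and nearest(j, points) == i and labels[i] != labels[j]:
            links.append((i, j))
    return links


# Toy training data: two features per point, labels are delay classes.
X = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0], [0.55, 0.5], [0.6, 0.5]]
y = [0, 0, 2, 2, 0, 2]

rng = random.Random(0)
synthetic = smote_sample(X[2], X[3], rng)  # new Class2 point between two Class2 points
links = tomek_links(X, y)  # candidate pairs for under-sampling
```

Points in a Tomek link sit on the class boundary, so dropping one or both members is what cleans up the overlap that oversampling can introduce. Resampling only the training split keeps the validation and test sets at their natural class balance.
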

Model Evaluation with Training, Validation and Test dataset:

The dataset was initially split into Training (70%, 4.542M entries) and Test (30%, 1.946M entries) sets. The training set was further divided, with 20% retained for validation. From the remaining training data, a sample of up to 500,000 entries was extracted for model training, ensuring that the sample size did not exceed the available data.

All splits were performed using stratified sampling to maintain class distribution. This approach was adopted to manage the large dataset by creating a more manageable training set size while still preserving a substantial validation set.
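
The split sequence described above can be sketched with a small stratified splitter. This is a scaled-down illustration under stated assumptions: the helper names are hypothetical, the labels are synthetic, and a cap of 500 stands in for the 500,000-entry cap.

```python
import random
from collections import defaultdict


def group_by_class(indices, labels):
    """Collect row indices under their class label."""
    groups = defaultdict(list)
    for i in indices:
        groups[labels[i]].append(i)
    return groups


def stratified_split(indices, labels, held_out_fraction, rng):
    """Split indices into (rest, held_out) with matching class proportions."""
    rest, held_out = [], []
    for members in group_by_class(indices, labels).values():
        rng.shuffle(members)
        cut = round(len(members) * held_out_fraction)
        held_out.extend(members[:cut])
        rest.extend(members[cut:])
    return rest, held_out


def stratified_sample(indices, labels, max_size, rng):
    """Draw at most max_size indices, proportionally per class."""
    if len(indices) <= max_size:
        return list(indices)  # never ask for more than is available
    sample = []
    for members in group_by_class(indices, labels).values():
        quota = len(members) * max_size // len(indices)  # floor keeps total <= max_size
        sample.extend(rng.sample(members, quota))
    return sample


# Synthetic labels with an imbalanced class mix (made-up proportions).
labels = [0] * 700 + [1] * 200 + [2] * 100
rng = random.Random(42)

train_full, test = stratified_split(range(len(labels)), labels, 0.30, rng)
train, validation = stratified_split(train_full, labels, 0.20, rng)
train_sample = stratified_sample(train, labels, 500, rng)
```
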

Project Structure

Data:

Engineered Features Documentation
Merged Datasets with new features: Train | Test
Raw Data
Raw Data Documentation

Analysis and Visualization:

AutoViz Plots (Credit: AutoViML/AutoViz)
README Images

Notebooks:

Links to the latest set of Notebooks from this folder are noted below. Please note, earlier revisions continue to be available in the same folder to track iterations.

Model Artifacts:

Folder here contains:

Recommended Model for production deploys
Performance Metrics for model evaluations in csv

Streamlit and FastAPI interface:

FastAPI as backend API deployed to AWS EC2 here
Streamlit application deployed to AWS EC2 here
Model deployed to AWS EC2 is this

Repository with GitLFS:

This project uses Git Large File Storage (LFS) to handle large files efficiently. Git LFS replaces large files with text pointers inside Git, while storing the file contents on a remote server.

To work with this repository:

Ensure you have Git LFS installed. If not, install it from git-lfs.com.
After cloning the repository, run:

   git lfs install 
   git lfs pull 

When adding new large files, track them with:

    git lfs track "path/to/large/file"

Commit and push as usual. Git LFS will handle the large files automatically. For more information on Git LFS, refer to the official documentation.

Project Infrastructure

This project utilized Google Colab Pro to handle computationally intensive notebook operations for data exploration and modeling. Key components include:

Notebooks:

Data exploration and modeling results from Colab Pro are captured in notebooks available in this GitHub repository.
Direct links to key external notebooks for results: Exploration Notebook, Modeling Notebook

AutoViz Visualizations:

Comprehensive AutoViz plots generated during data exploration are externally stored here due to size constraints on GitHub.

Decision Tree and Random Forest Artifacts

Decision tree and Random Forest tree structures are available externally - view here

MLOps with SkyFlow

SkyFlow is a Streamlit application deployed on an Amazon EC2 instance, which serves as the hosting environment. The application is accessible via a registered domain name (skyflow-kvgrowth.com), managed through AWS Route 53.
Route 53 is configured with an A record that points the domain (skyflow-kvgrowth.com) to the EC2 instance’s Elastic IP address, ensuring a stable connection even if the instance is restarted. For secure access, an SSL/TLS certificate is provisioned through AWS Certificate Manager (ACM), with an Application Load Balancer (ALB) handling SSL termination.
The EC2 instance’s security group is configured to allow inbound traffic on the necessary ports (8000, 443, and 8501 for Streamlit). This setup provides a secure, scalable, and easily manageable environment for hosting the Streamlit application, with the flexibility to handle increased traffic and maintain high availability.

Key Insights from Phase1 to Phase2 of Project

Switched from predicting four classes to three, dropping the distinction between arrival-only and departure-only delays, to see whether performance on the minority delay classes would improve
Experimented with F2 scores as an evaluation metric
Switched back to F1 score to: a) decrease false positives for delayed flights, especially Class2, b) improve accuracy of on-time flight predictions (Class0), c) increase precision for Class2 and Class1
Tuned models by adjusting thresholds for Class2 to optimize F1 score, and added class weights where needed
Revisited the SMOTETomek sampling strategy to improve prediction performance for the minority classes: Class1 (either departure or arrival delayed) and Class2 (both arrival and departure delayed)
Added a Stacking Classifier and a Hybrid Ensemble to improve F1 scores by combining the strengths of multiple models, allowing them to capture diverse patterns in the data. This approach helped achieve a better balance between precision and recall, improving overall F1 performance
Enhanced feature engineering, as outlined in the Executive Summary, to further improve model performance on minority class predictions
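
The trade-off between F2 and F1, and the effect of moving the Class2 threshold, can be illustrated with a few lines of Python. This is a minimal sketch on made-up probabilities, assuming a one-vs-rest view of Class2; the helper names and candidate thresholds are hypothetical and do not reflect the project's tuned values.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta > 1 weights recall more heavily, beta = 1 gives F1."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def class2_scores(y_true, class2_proba, threshold):
    """Precision and recall for Class2 when predicting it at or above threshold."""
    predicted = [p >= threshold for p in class2_proba]
    tp = sum(1 for t, hit in zip(y_true, predicted) if hit and t == 2)
    fp = sum(1 for t, hit in zip(y_true, predicted) if hit and t != 2)
    fn = sum(1 for t, hit in zip(y_true, predicted) if not hit and t == 2)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def best_threshold(y_true, class2_proba, candidates, beta=1.0):
    """Candidate threshold with the highest F-beta for Class2."""
    return max(
        candidates,
        key=lambda t: f_beta(*class2_scores(y_true, class2_proba, t), beta=beta),
    )


# Made-up validation labels and predicted Class2 probabilities.
y_true = [2, 2, 2, 2, 2, 0, 0, 0, 1, 1]
proba = [0.9, 0.8, 0.7, 0.4, 0.4, 0.5, 0.45, 0.4, 0.35, 0.3]
candidates = [0.3, 0.6]

f1_choice = best_threshold(y_true, proba, candidates, beta=1.0)
f2_choice = best_threshold(y_true, proba, candidates, beta=2.0)
```

On this toy data F2 prefers the lower threshold, which catches every Class2 flight at the cost of more false positives, while F1 prefers the higher one. That is the same direction as the motivation given above for returning to F1.
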

Future Work

Feature Engineering: Improve flight prediction performance of the minority classes (Class1 and Class2) with engineered features.

Use of Principal Component Analysis (PCA): Apply PCA with 2D visualization to explore patterns within the current delay classes. If the analysis reveals significant overlap between classes or a lack of distinct patterns, it may be beneficial to consider a more granular classification, such as separating arrival delays and departure delays into their own distinct classes.

Extend Forecast Horizon and Implement Multi-Step Forecasting: Increase the prediction timeframe beyond the current 24-hour forecast, implementing a multi-step forecasting approach that provides:

Short-term predictions (24 hours)
Medium-term predictions (48-72 hours)
Long-term predictions (up to 7 days)

This multi-horizon approach allows for both immediate operational adjustments and longer-term strategic planning.

Explore use of Deep Learning Architectures: Investigate if performance can be improved further by:

Implementing LSTM (Long Short-Term Memory) networks to capture long-term dependencies in flight data
Exploring Transformer models for their ability to handle sequential data and long-range dependencies
Experimenting with hybrid models that combine CNN-LSTM architectures to capture both spatial and temporal patterns in flight and weather data

Expand SkyFlow: Refine its Streamlit interface beyond the initial prototype to include dashboards and to work with a reduced number of inputs.

Real-time Updates: Incorporate real-time data to provide predictions as the departure time approaches.

Appendix

Baseline Dummy Classifier

Multinomial Logistic Regression Classifier

Decision Tree – HyperParameter tuned Decision Tree

PlotTree

Ensemble and Hybrid Ensemble model evaluation metrics

Similar metrics for the ensemble and hybrid classifiers can be found in the notebook here

References

How are airlines using AI to minimize disruptions

Case Study with JetBlue’s use of Tomorrow.io

KDD2018: Predicting Estimated Time of Arrival for Commercial Flights

Mamdouh, M., Ezzat, M. & A.Hefny, H. A novel intelligent approach for flight delay prediction. J Big Data 10, 179 (2023). https://doi.org/10.1186/s40537-023-00854-w

Yuemin Tang. 2021. Airline Flight Delay Prediction Using Machine Learning Models. In 2021 5th International Conference on E-Business and Internet (ICEBI 2021), October 15-17, 2021, Singapore, Singapore. ACM, New York, NY, USA, 7 Pages. https://doi.org/10.1145/3497701.3497725
