Flight Price Prediction Machine Learning Model

by | Jan 16, 2023 | Coding, Machine Learning


Introduction

A Flight Price Prediction Machine Learning Model is a predictive model that uses historical flight price data to predict the future prices of flights. Such a model can be trained with algorithms such as linear regression, decision trees, or neural networks; the code in this post uses a random forest regressor. The input features include factors such as the departure and arrival locations, the date of travel, the airline, and the class of service. The output of the model is a predicted flight price. Airlines, travel agencies, and other businesses can use the model to predict prices and make pricing decisions.

 

Objectives

The main objectives of creating a Flight Price Prediction Machine Learning Model include the following:

  • Price forecasting: The model can predict the future prices of flights, which can help airlines and travel agencies to adjust their prices accordingly and remain competitive.
  • Inventory management: The model can be used to predict flight demand, which can help airlines and travel agencies optimize their inventory and avoid overbooking or underbooking.
  • Revenue optimization: The model can help maximize revenue by predicting the prices at which flights are most likely to sell, so that airlines and travel agencies can adjust their prices accordingly.
  • Personalized pricing: The model can be used to personalize pricing for different customers by considering factors such as their past purchase history, location, and demographics.
  • Anomaly detection: The model can detect abnormal prices, which can help airlines and travel agencies identify pricing errors or fraud.

Overall, the goal of creating a flight price prediction model is to improve pricing decisions, optimize inventory and revenue, and improve customer experience.

Requirements

Source Code

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
train_data = pd.read_excel(r"E:\MachineLearning\EDA\Flight_Price\Data_Train.xlsx")
pd.set_option('display.max_columns', None)
train_data.head()
train_data.info()
train_data["Duration"].value_counts()
train_data.dropna(inplace = True)
train_data["Journey_day"] = pd.to_datetime(train_data.Date_of_Journey, format="%d/%m/%Y").dt.day
train_data["Journey_month"] = pd.to_datetime(train_data["Date_of_Journey"], format = "%d/%m/%Y").dt.month
# As we have converted Date_of_Journey column into integers. We can drop it now as it is of no use.
train_data.drop(["Date_of_Journey"], axis = 1, inplace = True)
# Departure time is at which a plane leaves the gate.
# Similar to Date_of_Journey we will also extract values from Dep_Time
# Extracting Hours
train_data["Dep_hour"] = pd.to_datetime(train_data["Dep_Time"]).dt.hour
# Extracting Minutes
train_data["Dep_min"] = pd.to_datetime(train_data["Dep_Time"]).dt.minute
# Now we can drop Dep_Time as it is of no use
train_data.drop(["Dep_Time"], axis = 1, inplace = True)
train_data.head()
# Arrival time is when the plane pulls up to the gate.
# Similar to Date_of_Journey we can extract values from Arrival_Time
# Extracting Hours
train_data["Arrival_hour"] = pd.to_datetime(train_data.Arrival_Time).dt.hour
# Extracting Minutes
train_data["Arrival_min"] = pd.to_datetime(train_data.Arrival_Time).dt.minute
# Now we can drop Arrival_Time as it is of no use
train_data.drop(["Arrival_Time"], axis = 1, inplace = True)
train_data.head()
# Time taken by plane to reach destination is called Duration
# It is the difference between Departure Time and Arrival Time
# Assigning and converting Duration column into list
duration = list(train_data["Duration"])
for i in range(len(duration)):
    if len(duration[i].split()) != 2: # To Check if duration contains only hour or mins
        if "h" in duration[i]:
            duration[i] = duration[i].strip() + " 0m" # Add 0 minute
        else:
            duration[i] = "0h " + duration[i] # Add zero hours
duration_hours = []
duration_mins = []
for i in range(len(duration)):
    duration_hours.append(int(duration[i].split(sep = "h")[0])) # Extracting hours from duration
    duration_mins.append(int(duration[i].split(sep = "m")[0].split()[-1])) # Extracts only minutes from duration
# Addition of duration_hours and duration_mins list to train_data dataframe
train_data["Duration_hours"] = duration_hours
train_data["Duration_mins"] = duration_mins
train_data.drop(["Duration"], axis = 1, inplace = True)
train_data.head()
train_data["Airline"].value_counts()
# Airline vs Price
sns.catplot(y = "Price", x = "Airline", data = train_data.sort_values("Price", ascending = False), kind="boxen", height = 6, aspect = 3)
plt.show()
# From the graph we can see that Jet Airways Business has the highest prices.
# Apart from that airline, almost all the others have a similar median price.
# Since Airline is Nominal Categorical data we will perform OneHotEncoding
Airline = train_data[["Airline"]]
Airline = pd.get_dummies(Airline, drop_first= True)
Airline.head()
train_data["Source"].value_counts()
# Source vs Price
sns.catplot(y = "Price", x = "Source", data = train_data.sort_values("Price", ascending = False), kind="boxen", height = 4, aspect = 3)
plt.show()
# As Source is Nominal Categorical data we will perform OneHotEncoding
Source = train_data[["Source"]]
Source = pd.get_dummies(Source, drop_first= True)
Source.head()
train_data["Destination"].value_counts()
# As Destination is Nominal Categorical data we will perform OneHotEncoding
Destination = train_data[["Destination"]]
Destination = pd.get_dummies(Destination, drop_first = True)
Destination.head()
train_data["Route"]
# Additional_Info contains almost 80% no_info
# Route and Total_Stops are related to each other
train_data.drop(["Route", "Additional_Info"], axis = 1, inplace = True)
train_data["Total_Stops"].value_counts()
# Since Total_Stops is Ordinal Categorical data, we encode it as ordered integers (a manual form of label encoding).
# Here each category (key) is replaced with its corresponding number of stops (value)
train_data.replace({"non-stop": 0, "1 stop": 1, "2 stops": 2, "3 stops": 3, "4 stops": 4}, inplace = True)
train_data.head()
# Concatenate dataframe --> train_data + Airline + Source + Destination
data_train = pd.concat([train_data, Airline, Source, Destination], axis = 1)
data_train.head()
data_train.drop(["Airline", "Source", "Destination"], axis = 1, inplace = True)
data_train.head()
test_data = pd.read_excel(r"E:\MachineLearning\EDA\Flight_Price\Test_set.xlsx")
test_data.head()
# Preprocessing
print("Test data Info")
print("-"*75)
print(test_data.info())
print()
print()
print("Null values :")
print("-"*75)
test_data.dropna(inplace = True)
print(test_data.isnull().sum())
# EDA
# Date_of_Journey
test_data["Journey_day"] = pd.to_datetime(test_data.Date_of_Journey, format="%d/%m/%Y").dt.day
test_data["Journey_month"] = pd.to_datetime(test_data["Date_of_Journey"], format = "%d/%m/%Y").dt.month
test_data.drop(["Date_of_Journey"], axis = 1, inplace = True)
# Dep_Time
test_data["Dep_hour"] = pd.to_datetime(test_data["Dep_Time"]).dt.hour
test_data["Dep_min"] = pd.to_datetime(test_data["Dep_Time"]).dt.minute
test_data.drop(["Dep_Time"], axis = 1, inplace = True)
# Arrival_Time
test_data["Arrival_hour"] = pd.to_datetime(test_data.Arrival_Time).dt.hour
test_data["Arrival_min"] = pd.to_datetime(test_data.Arrival_Time).dt.minute
test_data.drop(["Arrival_Time"], axis = 1, inplace = True)
# Duration
duration = list(test_data["Duration"])
for i in range(len(duration)):
    if len(duration[i].split()) != 2: # To Check if duration contains only hour or mins
        if "h" in duration[i]:
            duration[i] = duration[i].strip() + " 0m" # Adds 0 minute
        else:
            duration[i] = "0h " + duration[i] # Adding zero hour
duration_hours = []
duration_mins = []
for i in range(len(duration)):
    duration_hours.append(int(duration[i].split(sep = "h")[0])) # Extracting hours from duration
    duration_mins.append(int(duration[i].split(sep = "m")[0].split()[-1])) # Extracts only minutes from duration
# Adding Duration column to test set
test_data["Duration_hours"] = duration_hours
test_data["Duration_mins"] = duration_mins
test_data.drop(["Duration"], axis = 1, inplace = True)
# Categorical data
print("Airline")
print("-"*75)
print(test_data["Airline"].value_counts())
# Pass a one-column DataFrame so the dummy columns get the "Airline_" prefix, as in the training data
Airline = pd.get_dummies(test_data[["Airline"]], drop_first= True)
print()
print("Source")
print("-"*75)
print(test_data["Source"].value_counts())
# Same for Source; the prefix also keeps Source and Destination columns from sharing names
Source = pd.get_dummies(test_data[["Source"]], drop_first= True)
print()
print("Destination")
print("-"*75)
print(test_data["Destination"].value_counts())
Destination = pd.get_dummies(test_data[["Destination"]], drop_first = True)
# Additional_Info contains almost 80% no_info
# Route and Total_Stops are related to each other
test_data.drop(["Route", "Additional_Info"], axis = 1, inplace = True)
# Replacing Total_Stops
test_data.replace({"non-stop": 0, "1 stop": 1, "2 stops": 2, "3 stops": 3, "4 stops": 4}, inplace = True)
# Concatenate dataframe --> test_data + Airline + Source + Destination
data_test = pd.concat([test_data, Airline, Source, Destination], axis = 1)
data_test.drop(["Airline", "Source", "Destination"], axis = 1, inplace = True)
print()
print()
print("Shape of test data : ", data_test.shape)
data_train.shape
data_train.columns
X = data_train.loc[:, ['Total_Stops', 'Journey_day', 'Journey_month', 'Dep_hour',
                       'Dep_min', 'Arrival_hour', 'Arrival_min', 'Duration_hours',
                       'Duration_mins', 'Airline_Air India', 'Airline_GoAir', 'Airline_IndiGo',
                       'Airline_Jet Airways', 'Airline_Jet Airways Business',
                       'Airline_Multiple carriers',
                       'Airline_Multiple carriers Premium economy', 'Airline_SpiceJet',
                       'Airline_Trujet', 'Airline_Vistara', 'Airline_Vistara Premium economy',
                       'Source_Chennai', 'Source_Delhi', 'Source_Kolkata', 'Source_Mumbai',
                       'Destination_Cochin', 'Destination_Delhi', 'Destination_Hyderabad',
                       'Destination_Kolkata', 'Destination_New Delhi']]
X.head()
# Select the target by name instead of by position, so it does not depend on column order
y = data_train["Price"]
y.head()
# Finds correlation between Independent and dependent attributes
plt.figure(figsize = (18,18))
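# train_data still holds the text columns Airline, Source and Destination, so restrict corr() to numeric columns
sns.heatmap(train_data.select_dtypes("number").corr(), annot = True, cmap = "RdYlGn")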
plt.show()
# Important feature using ExtraTreesRegressor
from sklearn.ensemble import ExtraTreesRegressor
selection = ExtraTreesRegressor()
selection.fit(X, y)
print(selection.feature_importances_)
# plot graph of feature importances to better visualize
plt.figure(figsize = (12,8))
feat_importances = pd.Series(selection.feature_importances_, index=X.columns)
feat_importances.nlargest(20).plot(kind='barh')
plt.show()
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
from sklearn.ensemble import RandomForestRegressor
reg_rf = RandomForestRegressor()
reg_rf.fit(X_train, y_train)
y_pred = reg_rf.predict(X_test)
reg_rf.score(X_train, y_train)
reg_rf.score(X_test, y_test)
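# Distribution of the residuals (distplot is deprecated in recent seaborn versions, so histplot is used)
sns.histplot(y_test-y_pred, kde = True)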
plt.show()
plt.scatter(y_test, y_pred, alpha = 0.5)
plt.xlabel("y_test")
plt.ylabel("y_pred")
plt.show()
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
print('MSE:', metrics.mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
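# Normalized RMSE = RMSE/(max(DV)-min(DV)), where DV is the dependent variable (Price)
# 2090.5509 is presumably the RMSE printed above in the original run; substitute the value from your own run
2090.5509/(max(y)-min(y))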
metrics.r2_score(y_test, y_pred)
from sklearn.model_selection import RandomizedSearchCV
#Randomized Search CV
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 100, stop = 1200, num = 12)]
# Number of features to consider at every split
max_features = [1.0, 'sqrt'] # 1.0 = all features; the older 'auto' option was removed in scikit-learn 1.3
# Maximum no. of levels in tree
max_depth = [int(x) for x in np.linspace(5, 30, num = 6)]
# Minimum no. of samples required to split a node
min_samples_split = [2, 5, 10, 15, 100]
# Minimum no. of samples required at each leaf node
min_samples_leaf = [1, 2, 5, 10]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf}
# Random search for parameters using 5 fold cross validation.
# Sampling 10 different combinations (n_iter = 10), i.e. 50 model fits in total
rf_random = RandomizedSearchCV(estimator = reg_rf, param_distributions = random_grid,scoring='neg_mean_squared_error', n_iter = 10, cv = 5, verbose=2, random_state=42, n_jobs = 1)
rf_random.fit(X_train,y_train)
rf_random.best_params_
prediction = rf_random.predict(X_test)
plt.figure(figsize = (8,8))
sns.histplot(y_test-prediction, kde = True)
plt.show()
plt.figure(figsize = (8,8))
plt.scatter(y_test, prediction, alpha = 0.5)
plt.xlabel("y_test")
plt.ylabel("y_pred")
plt.show()
print('MAE:', metrics.mean_absolute_error(y_test, prediction))
print('MSE:', metrics.mean_squared_error(y_test, prediction))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, prediction)))
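
The two duration loops in the code above (one for the training data, one for the test data) apply the same parsing rules. As a minimal sketch, those rules could be collected into one small helper. The name parse_duration and the regular expression are assumptions made for this illustration and are not part of the original code; the sketch also assumes that durations always look like "2h 50m", "19h", or "45m".

```python
import re

# Matches strings such as "2h 50m", "19h", or "45m"; both parts are optional.
_DURATION_RE = re.compile(r"^\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*m)?\s*$")


def parse_duration(text):
    """Return (hours, minutes) for a duration string such as '2h 50m'.

    Hypothetical helper: a missing hour or minute part is treated as 0.
    """
    match = _DURATION_RE.match(text)
    # Reject strings that do not fit the pattern or that contain neither part.
    if match is None or not any(match.groups()):
        raise ValueError(f"Unrecognized duration: {text!r}")
    hours, minutes = match.groups()
    return int(hours or 0), int(minutes or 0)


samples = ["2h 50m", "19h", "45m"]
parsed = [parse_duration(s) for s in samples]
print(parsed)  # [(2, 50), (19, 0), (0, 45)]
```

With a helper like this, the same function can be applied to the Duration column of both the training and the test data, so the two datasets cannot drift apart in how they are parsed.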

Explanation of the Code

The outline of the steps in writing the code for a Flight Price Prediction Machine Learning Model is as follows.

1. First, we imported the libraries needed to build the model and loaded the training dataset into the notebook.

2. Historical flight price data can be obtained from sources such as airlines, travel agencies, or online ticket booking platforms. Here the data is read from two Excel files, Data_Train.xlsx and Test_set.xlsx, and includes the airline, the date of the journey, the source and destination, the route, the departure and arrival times, the duration, the number of stops, and the price.

3. Once the data was loaded, we cleaned and preprocessed it so that the model can use it. In this project, cleaning consists of dropping rows with missing values through the dropna() function; other common steps, such as removing outliers or normalizing values, are not applied here.

4. Next, we extracted features from the raw columns: the day and month from Date_of_Journey, the hour and minute from Dep_Time and Arrival_Time, and the hours and minutes from Duration. We dropped Route, because it is closely related to Total_Stops, and Additional_Info, because most of its values carry no information.

5. We one-hot encoded the nominal features (Airline, Source, and Destination) and mapped the ordinal feature Total_Stops to the integers 0 to 4.

6. Next, we inspected feature importances with an ExtraTreesRegressor and split the data into training and test sets (80% and 20%). We then trained a RandomForestRegressor to learn the relationship between the input features and the prices.

7. We then tuned the hyperparameters of the random forest regressor using RandomizedSearchCV with 5-fold cross-validation.

8. Finally, we evaluated the model by comparing the predicted prices with the actual prices in the test set, using residual plots and metrics such as mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), and R-squared.
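
The steps above can be sketched end to end on a tiny synthetic table. Everything in this sketch is an assumption made for illustration: the data and the price rule are made up, and build_features is a hypothetical helper. It also swaps in one technique that differs from the code above: the list of airline categories is fixed once from the training data, so that new data always produces the same dummy columns even when a category is missing from it. Encoding each dataset separately with get_dummies can otherwise yield different columns.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Tiny synthetic stand-in for the flight data (all values are made up).
rng = np.random.default_rng(42)
n = 200
raw = pd.DataFrame({
    "Airline": rng.choice(["IndiGo", "Air India", "SpiceJet"], size=n),
    "Date_of_Journey": rng.choice(["24/03/2019", "01/05/2019", "12/06/2019"], size=n),
    "Total_Stops": rng.choice(["non-stop", "1 stop", "2 stops"], size=n),
})
stops_map = {"non-stop": 0, "1 stop": 1, "2 stops": 2}
# Made-up price rule so that the model has something to learn.
raw["Price"] = 3000 + 2500 * raw["Total_Stops"].map(stops_map) + rng.normal(0, 300, size=n)

# Fix the category list once, from the training data.
airline_categories = sorted(raw["Airline"].unique())


def build_features(frame, categories):
    """Hypothetical helper: turn raw columns into model inputs."""
    journey = pd.to_datetime(frame["Date_of_Journey"], format="%d/%m/%Y")
    features = pd.DataFrame({
        "Total_Stops": frame["Total_Stops"].map(stops_map),  # ordinal encoding
        "Journey_day": journey.dt.day,
        "Journey_month": journey.dt.month,
    }, index=frame.index)
    # A categorical dtype with fixed categories yields the same dummy columns every time.
    airline = frame["Airline"].astype(pd.CategoricalDtype(categories=categories))
    dummies = pd.get_dummies(airline, prefix="Airline", drop_first=True)
    return pd.concat([features, dummies], axis=1)


X = build_features(raw, airline_categories)
y = raw["Price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print(f"MAE={mae:.1f}  MSE={mse:.1f}  RMSE={rmse:.1f}  R2={r2:.3f}")

# A single new flight: only one airline is present, but the columns still match.
new = pd.DataFrame({
    "Airline": ["IndiGo"],
    "Date_of_Journey": ["15/04/2019"],
    "Total_Stops": ["1 stop"],
})
X_new = build_features(new, airline_categories)
print(model.predict(X_new))
```

The same idea would apply to Source and Destination: as long as the category lists come from the training data, the feature columns for any later dataset line up with what the model was trained on.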

Conclusion

We have built a Flight Price Prediction Machine Learning Model that predicts the price of a flight from its journey details. Such predictions can help travelers compare options and choose the route that best fits their budget and needs.

 
