Introduction
A Flight Price Prediction Machine Learning Model is a predictive model that uses historical flight price data to predict future flight prices. Such a model can be trained with algorithms such as linear regression, decision trees, or neural networks; the code in this article uses a random forest regressor. The input features include the departure and arrival locations, the date and time of travel, the airline, the flight duration, and the number of stops. The output of the model is a predicted flight price. Airlines, travel agencies, and other businesses can use the model to predict prices and make pricing decisions.
Objectives
The main objectives of creating a Flight Price Prediction Machine Learning Model include the following:
- Price forecasting: The model can predict the future prices of flights, which can help airlines and travel agencies to adjust their prices accordingly and remain competitive.
- Inventory management: Predicted prices can be used as a signal of flight demand, which can help airlines and travel agencies optimize their inventory and avoid overbooking or underbooking.
- Revenue optimization: Price predictions can help airlines and travel agencies identify the price points at which flights sell best and adjust their fares to maximize revenue.
- Personalized pricing: The model can be used to personalize pricing for different customers by considering factors such as their past purchase history, location, and demographics.
- Anomaly detection: The model can flag prices that differ sharply from its predictions, which can help airlines and travel agencies identify pricing errors or fraud.
Overall, the goal of creating a flight price prediction model is to improve pricing decisions, optimize inventory and revenue, and improve customer experience.
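The anomaly detection objective can be illustrated with a small sketch. It assumes we already have actual and predicted prices and flags fares whose residual is a robust outlier, using the median absolute deviation. The function name `flag_price_anomalies`, the threshold of 3.5, and the sample prices are all illustrative assumptions and do not come from the project code.

```python
from statistics import median


def flag_price_anomalies(actual, predicted, threshold=3.5):
    """Return the indices of fares whose residual is a robust outlier.

    Hypothetical helper: not part of the original project.
    """
    residuals = [a - p for a, p in zip(actual, predicted)]
    center = median(residuals)
    # Median absolute deviation (MAD): a spread estimate that a few
    # extreme values barely move, unlike the standard deviation.
    mad = median(abs(r - center) for r in residuals)
    if mad == 0:
        return []
    # 0.6745 rescales the MAD so the score is comparable to a z-score
    # when the residuals are roughly normally distributed.
    return [
        i for i, r in enumerate(residuals)
        if abs(0.6745 * (r - center) / mad) > threshold
    ]


# Made-up prices for illustration; the fifth fare is far above its prediction.
actual = [4200, 3900, 5100, 4800, 19500, 4400]
predicted = [4100, 4000, 5000, 4900, 5200, 4300]
flagged = flag_price_anomalies(actual, predicted)
print(flagged)
```

A flagged index only marks a fare for review. Whether it is a pricing error, fraud, or a legitimate fare is a decision for a person or a downstream rule.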
Requirements
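- Python
- Jupyter Notebook
- NumPy, pandas, Matplotlib, seaborn, and scikit-learn (imported by the source code)
- An Excel reader for pandas such as openpyxl (needed by pd.read_excel for .xlsx files)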
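Before opening the notebook, it may help to confirm that the packages the source code imports are installed. The sketch below uses only the standard library. The helper name `missing_packages` is hypothetical, and the package list is read off the import statements in the source code, plus openpyxl as an assumed Excel reader.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level package names that cannot be imported."""
    return [name for name in names if find_spec(name) is None]


# Import names used by the source code; scikit-learn is imported as "sklearn".
required = ["numpy", "pandas", "matplotlib", "seaborn", "sklearn", "openpyxl"]
missing = missing_packages(required)
if missing:
    print("Please install:", ", ".join(missing))
else:
    print("All required packages are available.")
```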
Source Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
train_data = pd.read_excel(r"E:\MachineLearning\EDA\Flight_Price\Data_Train.xlsx")
pd.set_option('display.max_columns', None)
train_data.head()
train_data.info()
train_data["Duration"].value_counts()
train_data.dropna(inplace = True)
train_data["Journey_day"] = pd.to_datetime(train_data.Date_of_Journey, format="%d/%m/%Y").dt.day
train_data["Journey_month"] = pd.to_datetime(train_data["Date_of_Journey"], format = "%d/%m/%Y").dt.month
# As we have converted Date_of_Journey column into integers, we can drop it now as it is of no use.
train_data.drop(["Date_of_Journey"], axis = 1, inplace = True)
# Departure time is the time at which a plane leaves the gate.
# Similar to Date_of_Journey we will also extract values from Dep_Time
# Extracting Hours
train_data["Dep_hour"] = pd.to_datetime(train_data["Dep_Time"]).dt.hour
# Extracting Minutes
train_data["Dep_min"] = pd.to_datetime(train_data["Dep_Time"]).dt.minute
# Now we can drop Dep_Time as it is of no use
train_data.drop(["Dep_Time"], axis = 1, inplace = True)
train_data.head()
# Arrival time is when the plane pulls up to the gate.
# Similar to Date_of_Journey we can extract values from Arrival_Time
# Extracting Hours
train_data["Arrival_hour"] = pd.to_datetime(train_data.Arrival_Time).dt.hour
# Extracting Minutes
train_data["Arrival_min"] = pd.to_datetime(train_data.Arrival_Time).dt.minute
# Now we can drop Arrival_Time as it is of no use
train_data.drop(["Arrival_Time"], axis = 1, inplace = True)
train_data.head()
# Time taken by plane to reach destination is called Duration
# It is the difference between Departure Time and Arrival Time
# Assigning and converting Duration column into list
duration = list(train_data["Duration"])
for i in range(len(duration)):
    if len(duration[i].split()) != 2:    # Check if duration contains only hours or only minutes
        if "h" in duration[i]:
            duration[i] = duration[i].strip() + " 0m"   # Add 0 minutes
        else:
            duration[i] = "0h " + duration[i]           # Add 0 hours
duration_hours = []
duration_mins = []
for i in range(len(duration)):
    duration_hours.append(int(duration[i].split(sep = "h")[0]))    # Extract hours from duration
    duration_mins.append(int(duration[i].split(sep = "m")[0].split()[-1]))   # Extract only minutes from duration
# Add the duration_hours and duration_mins lists to the train_data dataframe
train_data["Duration_hours"] = duration_hours
train_data["Duration_mins"] = duration_mins
train_data.drop(["Duration"], axis = 1, inplace = True)
train_data.head()
train_data["Airline"].value_counts()
# From graph we can see that Jet Airways Business has the highest Price.
# Apart from the first Airline almost all have a similar median
# Airline vs Price
sns.catplot(y = "Price", x = "Airline", data = train_data.sort_values("Price", ascending = False), kind="boxen", height = 6, aspect = 3)
plt.show()
# Since Airline is Nominal Categorical data we will perform OneHotEncoding
Airline = train_data[["Airline"]]
Airline = pd.get_dummies(Airline, drop_first= True)
Airline.head()
train_data["Source"].value_counts()
# Source vs Price
sns.catplot(y = "Price", x = "Source", data = train_data.sort_values("Price", ascending = False), kind="boxen", height = 4, aspect = 3)
plt.show()
# As Source is Nominal Categorical data we will perform OneHotEncoding
Source = train_data[["Source"]]
Source = pd.get_dummies(Source, drop_first= True)
Source.head()
train_data["Destination"].value_counts()
# As Destination is Nominal Categorical data we will perform OneHotEncoding
Destination = train_data[["Destination"]]
Destination = pd.get_dummies(Destination, drop_first = True)
Destination.head()
train_data["Route"]
# Additional_Info contains almost 80% no_info
# Route and Total_Stops are related to each other
train_data.drop(["Route", "Additional_Info"], axis = 1, inplace = True)
train_data["Total_Stops"].value_counts()
# Since this is a case of Ordinal Categorical type we perform LabelEncoder.
# Here Values are assigned with corresponding keys
train_data.replace({"non-stop": 0, "1 stop": 1, "2 stops": 2, "3 stops": 3, "4 stops": 4}, inplace = True)
train_data.head()
# Concatenate dataframe --> train_data + Airline + Source + Destination
data_train = pd.concat([train_data, Airline, Source, Destination], axis = 1)
data_train.head()
data_train.drop(["Airline", "Source", "Destination"], axis = 1, inplace = True)
data_train.head()
test_data = pd.read_excel(r"E:\MachineLearning\EDA\Flight_Price\Test_set.xlsx")
test_data.head()
# Preprocessing
print("Test data Info")
print("-"*75)
print(test_data.info())
print()
print()
print("Null values :")
print("-"*75)
test_data.dropna(inplace = True)
print(test_data.isnull().sum())
# EDA
# Date_of_Journey
test_data["Journey_day"] = pd.to_datetime(test_data.Date_of_Journey, format="%d/%m/%Y").dt.day
test_data["Journey_month"] = pd.to_datetime(test_data["Date_of_Journey"], format = "%d/%m/%Y").dt.month
test_data.drop(["Date_of_Journey"], axis = 1, inplace = True)
# Dep_Time
test_data["Dep_hour"] = pd.to_datetime(test_data["Dep_Time"]).dt.hour
test_data["Dep_min"] = pd.to_datetime(test_data["Dep_Time"]).dt.minute
test_data.drop(["Dep_Time"], axis = 1, inplace = True)
# Arrival_Time
test_data["Arrival_hour"] = pd.to_datetime(test_data.Arrival_Time).dt.hour
test_data["Arrival_min"] = pd.to_datetime(test_data.Arrival_Time).dt.minute
test_data.drop(["Arrival_Time"], axis = 1, inplace = True)
# Duration
duration = list(test_data["Duration"])
for i in range(len(duration)):
    if len(duration[i].split()) != 2:    # Check if duration contains only hours or only minutes
        if "h" in duration[i]:
            duration[i] = duration[i].strip() + " 0m"   # Add 0 minutes
        else:
            duration[i] = "0h " + duration[i]           # Add 0 hours
duration_hours = []
duration_mins = []
for i in range(len(duration)):
    duration_hours.append(int(duration[i].split(sep = "h")[0]))    # Extract hours from duration
    duration_mins.append(int(duration[i].split(sep = "m")[0].split()[-1]))   # Extract only minutes from duration
# Adding Duration columns to test set
test_data["Duration_hours"] = duration_hours
test_data["Duration_mins"] = duration_mins
test_data.drop(["Duration"], axis = 1, inplace = True)
# Categorical data
# Double brackets pass a DataFrame to get_dummies, so the dummy columns get the
# same "Airline_", "Source_" and "Destination_" prefixes as in the training data
print("Airline")
print("-"*75)
print(test_data["Airline"].value_counts())
Airline = pd.get_dummies(test_data[["Airline"]], drop_first= True)
print()
print("Source")
print("-"*75)
print(test_data["Source"].value_counts())
Source = pd.get_dummies(test_data[["Source"]], drop_first= True)
print()
print("Destination")
print("-"*75)
print(test_data["Destination"].value_counts())
Destination = pd.get_dummies(test_data[["Destination"]], drop_first = True)
# Additional_Info contains almost 80% no_info
# Route and Total_Stops are related to each other
test_data.drop(["Route", "Additional_Info"], axis = 1, inplace = True)
# Replacing Total_Stops
test_data.replace({"non-stop": 0, "1 stop": 1, "2 stops": 2, "3 stops": 3, "4 stops": 4}, inplace = True)
# Concatenate dataframe --> test_data + Airline + Source + Destination
data_test = pd.concat([test_data, Airline, Source, Destination], axis = 1)
data_test.drop(["Airline", "Source", "Destination"], axis = 1, inplace = True)
print()
print()
print("Shape of test data : ", data_test.shape)
data_train.shape
data_train.columns
X = data_train.loc[:, ['Total_Stops', 'Journey_day', 'Journey_month', 'Dep_hour',
       'Dep_min', 'Arrival_hour', 'Arrival_min', 'Duration_hours',
       'Duration_mins', 'Airline_Air India', 'Airline_GoAir', 'Airline_IndiGo',
       'Airline_Jet Airways', 'Airline_Jet Airways Business',
       'Airline_Multiple carriers',
       'Airline_Multiple carriers Premium economy', 'Airline_SpiceJet',
       'Airline_Trujet', 'Airline_Vistara', 'Airline_Vistara Premium economy',
       'Source_Chennai', 'Source_Delhi', 'Source_Kolkata', 'Source_Mumbai',
       'Destination_Cochin', 'Destination_Delhi', 'Destination_Hyderabad',
       'Destination_Kolkata', 'Destination_New Delhi']]
X.head()
# Select the target column by name rather than by position
y = data_train["Price"]
y.head()
# Finds correlation between Independent and dependent attributes
# numeric_only = True skips the remaining text columns (required on recent pandas versions)
plt.figure(figsize = (18,18))
sns.heatmap(train_data.corr(numeric_only = True), annot = True, cmap = "RdYlGn")
plt.show()
# Important feature using ExtraTreesRegressor
from sklearn.ensemble import ExtraTreesRegressor
selection = ExtraTreesRegressor()
selection.fit(X, y)
print(selection.feature_importances_)
# plot graph of feature importances to better visualize
plt.figure(figsize = (12,8))
feat_importances = pd.Series(selection.feature_importances_, index=X.columns)
feat_importances.nlargest(20).plot(kind='barh')
plt.show()
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
from sklearn.ensemble import RandomForestRegressor
reg_rf = RandomForestRegressor()
reg_rf.fit(X_train, y_train)
y_pred = reg_rf.predict(X_test)
reg_rf.score(X_train, y_train)
reg_rf.score(X_test, y_test)
# histplot with kde replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(y_test-y_pred, kde = True)
plt.show()
plt.scatter(y_test, y_pred, alpha = 0.5)
plt.xlabel("y_test")
plt.ylabel("y_pred")
plt.show()
from sklearn import metrics
rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
print('MSE:', metrics.mean_squared_error(y_test, y_pred))
print('RMSE:', rmse)
# Normalized RMSE: RMSE/(max(DV)-min(DV)); the original run used RMSE = 2090.5509 here
rmse/(max(y)-min(y))
metrics.r2_score(y_test, y_pred)
from sklearn.model_selection import RandomizedSearchCV
# Randomized Search CV
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 100, stop = 1200, num = 12)]
# Number of features to consider at every split
# 1.0 (all features) replaces 'auto', which recent scikit-learn versions no longer accept
max_features = [1.0, 'sqrt']
# Maximum no. of levels in tree
max_depth = [int(x) for x in np.linspace(5, 30, num = 6)]
# Minimum no. of samples required to split a node
min_samples_split = [2, 5, 10, 15, 100]
# Minimum no. of samples required at each leaf node
min_samples_leaf = [1, 2, 5, 10]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf}
# Random search for parameters using 5 fold cross validation.
# n_iter = 10 samples 10 different parameter combinations
rf_random = RandomizedSearchCV(estimator = reg_rf, param_distributions = random_grid, scoring='neg_mean_squared_error', n_iter = 10, cv = 5, verbose=2, random_state=42, n_jobs = 1)
rf_random.fit(X_train,y_train)
rf_random.best_params_
prediction = rf_random.predict(X_test)
plt.figure(figsize = (8,8))
sns.histplot(y_test-prediction, kde = True)
plt.show()
plt.figure(figsize = (8,8))
plt.scatter(y_test, prediction, alpha = 0.5)
plt.xlabel("y_test")
plt.ylabel("y_pred")
plt.show()
print('MAE:', metrics.mean_absolute_error(y_test, prediction))
print('MSE:', metrics.mean_squared_error(y_test, prediction))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, prediction)))
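The source code parses the Duration column twice, once for the training data and once for the test data, with the same pair of loops. A small helper could do this in one place. The sketch below is an alternative, not the author's method: it uses a regular expression, assumes durations look like "2h 50m", "19h", or "5m", and the name `parse_duration` is hypothetical.

```python
import re

# Optional hours part, optional minutes part, e.g. "2h 50m", "19h", "5m".
_DURATION_RE = re.compile(r"^\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*m)?\s*$")


def parse_duration(text):
    """Split a duration string into an (hours, minutes) tuple of integers."""
    match = _DURATION_RE.match(text)
    # Reject strings that match neither an hours nor a minutes part.
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognised duration: {text!r}")
    hours, minutes = match.groups()
    return int(hours or 0), int(minutes or 0)


print(parse_duration("2h 50m"))
print(parse_duration("19h"))
print(parse_duration("5m"))
```

With pandas, the helper could be applied to the whole column with `train_data["Duration"].map(parse_duration)` and the resulting tuples unpacked into the Duration_hours and Duration_mins columns. Unlike the loops above, it raises an error on a malformed value instead of failing inside `int()`.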
Explanation of the Code
The outline of the steps in writing the code for a Flight Price Prediction Machine Learning Model is as follows.
1. First, we imported all the necessary libraries to build our model and loaded our dataset into the notebook.
2. The dataset consists of historical flight price data, which can be obtained from sources such as airlines, travel agencies, or online ticket booking platforms. It includes information such as the source and destination, the date of journey, the airline, the route, the number of stops, and the corresponding prices.
3. Once the data was loaded, we cleaned and preprocessed it to remove missing values and to put the data in a format the model can use. In general, this step can also involve removing outliers or normalizing the data; here we dropped the rows with null values using the dropna() function.
4. Next, we extracted the features that are used as input to the model. We split Date_of_Journey into day and month, split Dep_Time, Arrival_Time, and Duration into hours and minutes, and dropped the Route and Additional_Info columns.
5. We applied one-hot encoding to the nominal features (Airline, Source, and Destination) and ordinal label encoding to Total_Stops.
6. Next, we split the preprocessed, feature-engineered data into training and test sets and trained a model on the training set to learn the relationship between the input features and the prices.
7. The algorithm we used is a random forest regressor (RandomForestRegressor), and we tuned its hyperparameters with RandomizedSearchCV.
8. Once the model was trained, we evaluated its performance by comparing the predicted prices to the actual prices in the test set and calculating metrics such as mean absolute error, mean squared error, root mean squared error, and R-squared.
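To make the evaluation step concrete, the sketch below computes the same metrics by hand on a tiny made-up example, without scikit-learn. The function name `regression_report` and the sample prices are illustrative assumptions. The normalized RMSE here divides by the range of the values passed in, whereas the source code divides by the range of the full target column.

```python
from math import sqrt


def regression_report(y_true, y_pred):
    """Compute MAE, MSE, RMSE, range-normalized RMSE and R-squared."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = sqrt(mse)
    mean_true = sum(y_true) / n
    # Total sum of squares: variation of the actual prices around their mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # R-squared = 1 - (residual sum of squares / total sum of squares).
    r2 = 1 - (mse * n) / ss_tot
    return {
        "mae": mae,
        "mse": mse,
        "rmse": rmse,
        "nrmse": rmse / (max(y_true) - min(y_true)),
        "r2": r2,
    }


# Made-up prices: every prediction is off by exactly 500.
report = regression_report([3000, 5000, 7000, 9000], [3500, 4500, 7500, 8500])
print(report)
```

MAE and RMSE are in the same unit as the price, so they read as a typical error per ticket. The normalized RMSE and R-squared are unitless, which makes them easier to compare across datasets.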
Conclusion
Hence, we have successfully built a Flight Price Prediction Machine Learning Model that predicts the price of flights, which can help us choose the most suitable travel route for our budget and needs.
Cisco Ramon is an American software engineer with experience in several popular and commercially successful programming languages and development tools. He has been writing content for the last 5 years. He is a Senior Manager at Rude Labs Pvt. Ltd.