Introduction
A car price prediction machine learning model is an algorithm that uses historical data on car sales to estimate the price of a car. The model is trained on a dataset of car attributes, such as make, model, year, mileage, and condition, along with the corresponding sale price. Once trained, the model can predict the sale price of a car from its features. Common techniques for building such a model include linear regression, decision trees, and random forests.
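As a minimal, illustrative sketch of the idea (the column names and numbers here are made up for demonstration and are not the dataset used later in this article), a linear regression can be fit on a few numeric features to estimate price:

import pandas as pd
from sklearn.linear_model import LinearRegression

# toy data: price as a function of age and mileage (made-up values)
cars = pd.DataFrame({
    'age_years': [1, 3, 5, 7, 9],
    'mileage_km': [10000, 40000, 70000, 110000, 150000],
    'price': [18000, 14000, 11000, 8000, 6000],
})

model = LinearRegression()
model.fit(cars[['age_years', 'mileage_km']], cars['price'])

# estimate the price of an unseen 4-year-old car with 55,000 km on it
print(model.predict(pd.DataFrame({'age_years': [4], 'mileage_km': [55000]})))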
Objectives
The objectives behind building this car price prediction machine learning model are:
- To predict the price of a car so that buyers can choose a car that fits their needs and budget.
- To help businesses in the automobile industry set prices that meet user expectations and grow their businesses accordingly.
- To use historical data to train the model and make predictions on new, unseen data.
Requirements
To build a car price prediction model using Python, you will need the following:
- A dataset of car information: This dataset should include features such as make, model, year, mileage, and condition, as well as the corresponding sale price.
- Python programming language: You must install Python on your computer to build the model.
- Required Libraries: You will need to install libraries such as numpy, pandas, scikit-learn, matplotlib, and seaborn. These libraries are used in Python for data manipulation, visualization, and machine learning (an install command is shown after this list).
- Jupyter Notebook/IDE: You will need a development environment such as Jupyter Notebook or an IDE to write and run the code for the model.
- A basic understanding of machine learning concepts and Python programming.
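The third-party libraries above can typically be installed with pip; the command below is a suggestion, and package names may differ in some environments:

pip install numpy pandas scikit-learn matplotlib seaborn notebook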
Source Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Jupyter magic to display plots inline
%matplotlib inline

df = pd.read_csv('car data.csv')
df.shape

print(df['Seller_Type'].unique())
print(df['Fuel_Type'].unique())
print(df['Transmission'].unique())
print(df['Owner'].unique())

# check for missing values
df.isnull().sum()
df.describe()

final_dataset = df[['Year', 'Selling_Price', 'Present_Price', 'Kms_Driven',
                    'Fuel_Type', 'Seller_Type', 'Transmission', 'Owner']].copy()
final_dataset.head()

# derive each car's age from a fixed reference year, then drop the helper columns
final_dataset['Current Year'] = 2022
final_dataset['no_year'] = final_dataset['Current Year'] - final_dataset['Year']
final_dataset.drop(['Year'], axis=1, inplace=True)
final_dataset = final_dataset.drop(['Current Year'], axis=1)
final_dataset.head()

# one-hot encode the categorical columns (Fuel_Type, Seller_Type, Transmission)
# so that the correlation functions and models below receive numeric data
final_dataset = pd.get_dummies(final_dataset, drop_first=True)

final_dataset.corr()
sns.pairplot(final_dataset)

# get correlations of each feature in the (encoded) dataset
corrmat = final_dataset.corr()
top_corr_features = corrmat.index
plt.figure(figsize=(20, 20))
# plot heat map
g = sns.heatmap(final_dataset[top_corr_features].corr(), annot=True, cmap="RdYlGn")

final_dataset.head()
X = final_dataset.iloc[:, 1:]  # independent features
y = final_dataset.iloc[:, 0]   # dependent feature (selling price)
X.head()
y.head()

# feature importance
from sklearn.ensemble import ExtraTreesRegressor
model = ExtraTreesRegressor()
model.fit(X, y)
print(model.feature_importances_)  # higher values indicate more important features

# plot a graph to better visualize feature importances
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(5).plot(kind='barh')
plt.show()

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
X_train.shape

from sklearn.ensemble import RandomForestRegressor

# hyperparameter ranges for the random forest
# number of trees in the random forest
n_estimators = [int(x) for x in np.linspace(start=100, stop=1200, num=12)]
print(n_estimators)
# number of features to consider at every split
# (recent scikit-learn versions no longer accept 'auto' here)
max_features = ['sqrt', 'log2']
# maximum number of levels in each tree
max_depth = [int(x) for x in np.linspace(5, 30, num=6)]
# max_depth.append(None)
# minimum number of samples required to split a node
min_samples_split = [2, 5, 10, 15, 100]
# minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 5, 10]

from sklearn.model_selection import RandomizedSearchCV
# create the random grid for Randomized Search CV
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf}
print(random_grid)

# use the random grid to search for the best hyperparameters
# first create the base model to tune
rf = RandomForestRegressor()
# random search over 10 parameter combinations using 5-fold cross validation
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
                               scoring='neg_mean_squared_error', n_iter=10,
                               cv=5, verbose=2, random_state=42, n_jobs=1)
rf_random.fit(X_train, y_train)
rf_random.best_params_
rf_random.best_score_

predictions = rf_random.predict(X_test)

from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))

import pickle
# open a file to store the trained model
file = open('random_forest_regression_model.pkl', 'wb')
# dump the model to that file
pickle.dump(rf_random, file)
file.close()
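As a brief usage sketch (assuming the variables from the code above, such as X_test and y_test, are still in scope), the pickled model can be loaded back and used to score new rows, as long as their columns match the dummy-encoded training features:

import pickle

# load the trained model back from disk
with open('random_forest_regression_model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)

# predict on a few held-out rows and compare with the actual selling prices
print(loaded_model.predict(X_test.iloc[:3]))
print(y_test.iloc[:3].values)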
Explanation of the Code
1. Initially, we imported all the necessary libraries and loaded the dataset needed to build our model.
2. Then, we checked the dataset for missing values so that any, if present, could be removed before training.
3. Next, we cleaned the dataset according to our features: we derived each car's age from the Year column, one-hot encoded the categorical columns, and dropped columns that are not useful in the model-building process.
4. In the next section, we performed the train-test split, trained the model with a Random Forest Regressor, and then used Randomized Search CV to select the best hyperparameters for our model.
5. Along the way, we created plots and visualizations, such as a correlation heatmap and a feature importance chart, to gain insights from the dataset.
6. Finally, once the training phase was done, we predicted selling prices on the test set and evaluated the model using MAE, MSE, and RMSE.
Output
Conclusion
Hence, we have successfully built the car price prediction machine learning model. This model predicts the price of a car based on the features in our dataset, which helps individuals select the car best suited to their needs and budget. It can also help businesses in the automobile industry grow and increase revenue.

Cisco Ramon is an American software engineer with experience in several popular and commercially successful programming languages and development tools. He has been writing content for the last 5 years. He is a Senior Manager at Rude Labs Pvt. Ltd.