Introduction
We are going to build a fake news classifier: a machine learning model trained to label news articles or statements as genuine or fake. The model is trained on a dataset of labeled examples of real and fake news and can then classify new, unseen articles or statements automatically. There are several ways to build such a model, ranging from classical machine learning on natural language processing features to deep learning. The model's performance is evaluated by measuring accuracy, precision, recall, and other metrics on a separate test dataset.
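The metrics named above can all be derived from counts of correct and incorrect predictions. The following is a minimal sketch, assuming "fake" is treated as the positive class (label 1); the function name binary_metrics and the label lists are hypothetical and not taken from the dataset.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, and recall with one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical labels: 1 = fake, 0 = real (illustrative only).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

accuracy, precision, recall = binary_metrics(y_true, y_pred)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Precision answers "of the items flagged as fake, how many really were?", while recall answers "of the fake items, how many were caught?". Reporting both alongside accuracy gives a fuller picture than accuracy alone.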
This model classifies a news item as fake or real according to the words in its text (here, the article title). We use scikit-learn's CountVectorizer to turn the text into word counts and NLTK's Porter stemmer to reduce each word to its stem.
Objectives
The main objectives of creating a fake news classification machine learning model are:
- Identifying fake news by automatically classifying news articles or statements as genuine or fake based on patterns and characteristics learned from a labeled training dataset.
- Improving the accuracy and performance of the classifier by experimenting with different machine learning algorithms, feature engineering techniques, and hyperparameter tuning.
- Making the classifier more robust to different types of text and to issues such as imbalanced classes, missing data, and noisy data.
- Incorporating additional information sources, such as social media data, to improve the classifier’s ability to identify fake news.
- Improving the interpretability of the classifier by providing insights into the features and decision rules used by the model.
- Continuously monitoring the classifier’s performance and updating it as new fake news detection techniques and data become available.
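One of the objectives above concerns imbalanced classes. A quick way to check for imbalance is to compare the class counts and compute the accuracy of always predicting the most common label. The sketch below assumes a hypothetical label list; majority_baseline is an illustrative helper, not part of any library.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


# Hypothetical label column: 1 = fake, 0 = real (not taken from the dataset).
labels = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

label, baseline_accuracy = majority_baseline(labels)
print(f"majority label={label}, baseline accuracy={baseline_accuracy:.2f}")
```

A classifier is only useful if it clearly beats this baseline, which is why accuracy alone can be misleading on imbalanced data.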
Requirements
To build a fake news classification model in Python, the following are typically needed:
- A labeled dataset of real and fake news articles or statements to train and evaluate the classifier.
- Python and commonly used libraries such as NumPy, pandas, scikit-learn, and NLTK for data pre-processing, feature extraction, and machine learning.
- A machine learning algorithm for building the classifier, such as logistic regression, Naive Bayes, decision trees, random forests, or deep learning models.
- Knowledge of natural language processing techniques for text processing, such as tokenization, stemming, and lemmatization.
- A development environment for coding and testing the classifier, such as Jupyter Notebook or PyCharm. This tutorial uses Jupyter Notebook.
- Access to a computing platform with sufficient resources to train and test the classifier, such as a local machine or a cloud-based platform.
- Familiarity with machine learning and data analysis fundamentals, such as feature engineering, model evaluation, and hyperparameter tuning.
- Experience with visualization libraries such as Matplotlib and Seaborn to visualize the results and insights of the model.
- Familiarity with web scraping and web crawling to extract data from different sources.
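The text-processing steps listed above (tokenization, stop-word removal, stemming) can be sketched as follows. This assumes NLTK is installed; the small STOP_WORDS set and the preprocess helper are hypothetical stand-ins so the example runs without downloading NLTK's stop-word corpus.

```python
import re

from nltk.stem.porter import PorterStemmer

# A tiny hand-picked stop-word set so the sketch runs without downloading
# NLTK corpora; the full list comes from nltk.corpus.stopwords.
STOP_WORDS = {"the", "is", "are", "a", "an", "of", "in", "on", "to", "and"}

stemmer = PorterStemmer()


def preprocess(text):
    """Keep letters only, lowercase, drop stop words, and stem each token."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return [stemmer.stem(tok) for tok in tokens if tok not in STOP_WORDS]


tokens = preprocess("The senators are voting on 3 new budgets!")
print(tokens)
```

Stemming maps inflected forms such as "voting" and "votes" to a shared stem, which shrinks the vocabulary the classifier has to learn. Lemmatization is a dictionary-based alternative that returns real words rather than truncated stems.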
Source Code
    import itertools
    import re

    import matplotlib.pyplot as plt
    import nltk
    import numpy as np
    import pandas as pd
    from nltk.corpus import stopwords
    from nltk.stem.porter import PorterStemmer
    from sklearn import metrics
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer
    from sklearn.linear_model import PassiveAggressiveClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB

    nltk.download('stopwords')  # needed once to fetch the stop-word list

    df = pd.read_csv('fake-news/train.csv')
    df.head()

    ## Get the Independent Features
    X = df.drop('label', axis=1)
    X.head()

    ## Get the Dependent features
    y = df['label']
    y.head()

    df.shape

    # Drop rows with missing values, then reset the index so that
    # positional lookups such as messages['title'][i] work.
    df = df.dropna()
    df.head(10)

    messages = df.copy()
    messages.reset_index(inplace=True)
    messages.head(10)
    messages['title'][6]

    ps = PorterStemmer()
    stop_words = set(stopwords.words('english'))  # build once, not per word
    corpus = []
    for i in range(0, len(messages)):
        # Keep letters only, lowercase, drop stop words, and stem each word.
        review = re.sub('[^a-zA-Z]', ' ', messages['title'][i])
        review = review.lower()
        review = review.split()
        review = [ps.stem(word) for word in review if word not in stop_words]
        review = ' '.join(review)
        corpus.append(review)

    corpus[3]

    ## Applying Countvectorizer
    # Creating the Bag of Words model
    cv = CountVectorizer(max_features=5000, ngram_range=(1, 3))
    X = cv.fit_transform(corpus).toarray()
    X.shape

    y = messages['label']

    ## Divide the dataset into Train and Test
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

    # get_feature_names() was replaced by get_feature_names_out() in newer scikit-learn.
    cv.get_feature_names_out()[:20]
    cv.get_params()

    count_df = pd.DataFrame(X_train, columns=cv.get_feature_names_out())
    count_df.head()


    def plot_confusion_matrix(cm, classes, normalize=False,
                              title='Confusion matrix', cmap=plt.cm.Blues):
        """
        See full source and example:
        http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html

        This function prints and plots the confusion matrix.
        Normalization can be applied by setting `normalize=True`.
        """
        # Normalize before drawing so the colors match the printed values.
        if normalize:
            cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
            print("Normalized confusion matrix")
        else:
            print('Confusion matrix, without normalization')

        plt.imshow(cm, interpolation='nearest', cmap=cmap)
        plt.title(title)
        plt.colorbar()
        tick_marks = np.arange(len(classes))
        plt.xticks(tick_marks, classes, rotation=45)
        plt.yticks(tick_marks, classes)

        thresh = cm.max() / 2.
        for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
            plt.text(j, i, cm[i, j],
                     horizontalalignment="center",
                     color="white" if cm[i, j] > thresh else "black")

        plt.tight_layout()
        plt.ylabel('True label')
        plt.xlabel('Predicted label')


    ## Multinomial Naive Bayes
    classifier = MultinomialNB()
    classifier.fit(X_train, y_train)
    pred = classifier.predict(X_test)
    score = metrics.accuracy_score(y_test, pred)
    print("accuracy: %0.3f" % score)
    cm = metrics.confusion_matrix(y_test, pred)
    # The class names assume label 0 = FAKE and label 1 = REAL;
    # check the label encoding of your dataset.
    plot_confusion_matrix(cm, classes=['FAKE', 'REAL'])
    score

    y_train.shape

    ## Passive Aggressive Classifier
    # The old n_iter argument was renamed max_iter in scikit-learn.
    linear_clf = PassiveAggressiveClassifier(max_iter=50)
    linear_clf.fit(X_train, y_train)
    pred = linear_clf.predict(X_test)
    score = metrics.accuracy_score(y_test, pred)
    print("accuracy: %0.3f" % score)
    cm = metrics.confusion_matrix(y_test, pred)
    plot_confusion_matrix(cm, classes=['FAKE Data', 'REAL Data'])

    ## Try several smoothing values and keep the best classifier
    # Start at 0.1: alpha=0 disables smoothing.
    # Note: selecting alpha on the test set makes the reported score optimistic.
    classifier = MultinomialNB(alpha=0.1)
    previous_score = 0
    for alpha in np.arange(0.1, 1.0, 0.1):
        sub_classifier = MultinomialNB(alpha=alpha)
        sub_classifier.fit(X_train, y_train)
        y_pred = sub_classifier.predict(X_test)
        score = metrics.accuracy_score(y_test, y_pred)
        if score > previous_score:
            classifier = sub_classifier
            previous_score = score  # remember the best score so far
        print("Alpha: {}, Score : {}".format(alpha, score))

    ## Get Features names
    feature_names = cv.get_feature_names_out()

    # MultinomialNB.coef_ is gone from recent scikit-learn releases; for a
    # binary problem, feature_log_prob_[1] holds the same values
    # (log-probability of each feature given class 1).
    classifier.feature_log_prob_[1]

    ### Most real
    sorted(zip(classifier.feature_log_prob_[1], feature_names), reverse=True)[:20]

    ### Most fake
    sorted(zip(classifier.feature_log_prob_[1], feature_names))[:5000]
Output
Explanation of the Code
1. First, we import the libraries needed to build the machine learning model.
2. We load the dataset and drop the rows with null values using the dropna() function.
3. We inspect the dataset with the head() function.
4. We replace every non-letter character in each title with a space, lowercase the text, and remove English stop words so that analysis becomes easier.
5. Using the Natural Language Toolkit (NLTK), we reduce each word to its stem with the Porter stemmer, convert the cleaned titles into count features with CountVectorizer, and train the classifiers with their fit() function.
6. Tools used: CountVectorizer for feature extraction (TfidfVectorizer and HashingVectorizer are imported but not used), with MultinomialNB and PassiveAggressiveClassifier as the classifiers.
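The vectorize-then-fit steps described above can also be bundled into a single scikit-learn Pipeline, with the smoothing parameter alpha chosen by cross-validation on the training data rather than on the test set. This is a sketch under stated assumptions, not the tutorial's own method: the headlines and labels are made up, and the step names "vectorizer" and "classifier" are arbitrary.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up, already-cleaned headlines; 1 = fake, 0 = real (illustrative only).
train_texts = [
    "alien land white hous", "miracl pill cure all diseas",
    "celebr clone secret lab", "moon made chees scientist admit",
    "senat pass budget bill", "court rule tax case",
    "city council approv road plan", "bank rais interest rate",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ("vectorizer", CountVectorizer(ngram_range=(1, 2))),
    ("classifier", MultinomialNB()),
])

# Cross-validate alpha on the training data only, so the test set stays unseen.
search = GridSearchCV(
    pipeline,
    param_grid={"classifier__alpha": [0.1, 0.5, 1.0]},
    cv=2,
    scoring="accuracy",
)
search.fit(train_texts, train_labels)

prediction = search.predict(["senat approv tax bill"])
print("best alpha:", search.best_params_["classifier__alpha"])
print("prediction:", prediction)
```

Keeping the vectorizer inside the pipeline means its vocabulary is learned only from the training folds, which avoids leaking information from held-out text into the features.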
Conclusion
Hence, we have built a machine learning model that predicts whether a news item is fake or real, which can help flag disinformation for closer review.

Cisco Ramon is an American software engineer who has experience in several popular and commercially successful programming languages and development tools. He has been writing content for the last 5 years. He is a Senior Manager at Rude Labs Pvt. Ltd.