COVID Fake News Detection with a Very Simple Logistic Regression
Natural Language Processing, NLP, Scikit Learn
This time, we are going to create a simple logistic regression model to classify COVID news as either true or fake, using the data I collected a while ago.
The process is surprisingly simple and easy. We will clean and pre-process the text data, perform tokenization and stemming with the NLTK library, extract TF-IDF features and train a logistic regression classifier with the Scikit-Learn library, and evaluate the model’s accuracy at the end.
The Data
The data set contains 586 true news articles and 578 fake ones, an almost 50/50 split. Because of bias in the data collection, I decided not to use “source” as a feature; instead, I will combine “title” and “text” into a single feature, “title_text”.
Pre-processing
Let’s have a look at an example of the combined title and text:
df['title_text'][50]
Looking at the above example, the title and text are already pretty clean, so simple text pre-processing will do the job: we strip off any HTML tags and punctuation, and convert everything to lower case.
The following code combines tokenization and stemming; we will apply it to “title_text” later.
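The cleaning step described above can be sketched as follows. This is a minimal illustration, not the article’s exact code; the regular expressions are assumptions about how the tags and punctuation are stripped.

```python
import re

def preprocessor(text):
    # Remove HTML tags (anything between angle brackets)
    text = re.sub(r'<[^>]*>', '', text)
    # Remove punctuation, keeping letters, digits, and whitespace
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # Lowercase and trim surrounding whitespace
    return text.lower().strip()

print(preprocessor('<p>COVID-19 Vaccine: FALSE claim!</p>'))
# covid19 vaccine false claim
```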
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]
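To see what the tokenizer actually produces, here is a self-contained sketch (repeating the definition with its NLTK import) applied to a sample sentence; the sentence itself is just an illustration.

```python
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer_porter(text):
    # Split on whitespace, then reduce each word to its Porter stem
    return [porter.stem(word) for word in text.split()]

# Inflected forms such as "runners" and "running" collapse toward a shared stem
print(tokenizer_porter('runners like running and thus they run'))
# ['runner', 'like', 'run', 'and', 'thu', 'they', 'run']
```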
TF-IDF
Here we transform “title_text” feature into TF-IDF vectors.
- Because we have already converted “title_text” to lowercase earlier, we set lowercase=False.
- Because we have already cleaned and preprocessed “title_text”, we set preprocessor=None.
- We override the default string tokenization step with the combined tokenizer and stemmer we defined earlier.
- We set use_idf=True to enable inverse-document-frequency reweighting.
- We set smooth_idf=True to avoid zero divisions.
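Putting those settings together, the vectorizer can be sketched like this. The two-document corpus is purely illustrative, and the tokenizer is repeated so the snippet runs on its own.

```python
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

porter = PorterStemmer()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

tfidf = TfidfVectorizer(
    lowercase=False,             # "title_text" is already lowercased
    preprocessor=None,           # cleaning was already applied
    tokenizer=tokenizer_porter,  # our tokenizer + Porter stemmer
    use_idf=True,                # inverse-document-frequency reweighting
    smooth_idf=True,             # add-one smoothing avoids zero divisions
)

docs = ['covid vaccine is safe', 'covid vaccine causes harm']
X = tfidf.fit_transform(docs)
print(X.shape)  # rows = documents, columns = stemmed vocabulary terms
```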
Logistic Regression for Document Classification
- Instead of tuning the C parameter manually, we can use the LogisticRegressionCV estimator.
- We specify cv=5 cross-validation folds to tune this hyperparameter.
- The model is evaluated by its classification accuracy.
- By setting n_jobs=-1, we dedicate all the CPU cores to the search.
- We raise the maximum number of iterations so the optimization algorithm can converge.
- We use pickle to save the fitted model.
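The steps above can be sketched as follows. The tiny feature matrix stands in for the real TF-IDF vectors, and the file name is an assumption, not the article’s.

```python
import pickle
from sklearn.linear_model import LogisticRegressionCV

# Illustrative stand-in for the TF-IDF training data (not the article's data)
X = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]] * 5
y = [1, 1, 0, 0] * 5

clf = LogisticRegressionCV(
    cv=5,                 # 5-fold cross-validation to tune C
    scoring='accuracy',   # select C by classification accuracy
    n_jobs=-1,            # use all CPU cores
    max_iter=1000,        # generous iteration budget for the solver
    random_state=0,
).fit(X, y)

# Persist the fitted model with pickle
with open('saved_model.sav', 'wb') as f:
    pickle.dump(clf, f)
```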
Model Evaluation
- Use pickle to load our saved model.
- Use the model to compute the accuracy score on data it has never seen before.
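Loading and scoring can be sketched like this. The snippet first fits and saves a small stand-in model so it runs on its own; the file name and the toy held-out data are illustrative assumptions.

```python
import pickle
from sklearn.linear_model import LogisticRegression

# Fit and save a small stand-in model (illustrative, not the article's model)
X_train = [[0.0], [1.0], [0.1], [0.9]]
y_train = [0, 1, 0, 1]
model = LogisticRegression().fit(X_train, y_train)
with open('saved_model.sav', 'wb') as f:
    pickle.dump(model, f)

# Load the saved model back from disk
with open('saved_model.sav', 'rb') as f:
    loaded = pickle.load(f)

# Score it on data the model has never seen
X_test, y_test = [[0.05], [0.95]], [0, 1]
print(loaded.score(X_test, y_test))
```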
The Jupyter notebook can be found on GitHub. Enjoy the rest of the week.