
COVID Fake News Detection with a Very Simple Logistic Regression

Natural Language Processing, NLP, Scikit Learn

--

This time, we are going to build a simple logistic regression model to classify COVID news as either true or fake, using the data I collected a while ago.

The process is surprisingly simple and easy. We will clean and pre-process the text data, tokenize and stem it with the NLTK library, extract TF-IDF features and train a logistic regression classifier with the Scikit-learn library, and evaluate the model’s accuracy at the end.

The Data

The data set contains 586 true news articles and 578 fake ones, an almost 50/50 split. Because of bias in how the data was collected, I decided not to use “source” as one of the features; instead, I will combine “title” and “text” into a single feature, “title_text”.

fake_news_logreg_start.py
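A minimal sketch of that loading step, assuming the collected articles have “title”, “text”, “source” and “label” columns (the two-row CSV below is purely illustrative, not the real data set):

```python
import io
import pandas as pd

# Tiny stand-in for the collected data set
csv = io.StringIO(
    "title,text,source,label\n"
    "Vaccine works,Trials show efficacy,WHO,1\n"
    "Garlic cures COVID,Viral post claims garlic cures COVID,blog,0\n"
)
df = pd.read_csv(csv)

# Ignore 'source' (collection bias) and merge title + text into one feature
df['title_text'] = df['title'] + ' ' + df['text']
print(df['title_text'][0])  # Vaccine works Trials show efficacy
```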

Pre-processing

Let’s have a look at an example of the title and text combination:

df['title_text'][50]

Looking at the example above, the title and text are pretty clean, so simple text pre-processing will do the job: we will strip off any HTML tags and punctuation, and convert everything to lower case.

fake_news_logreg_preprocessing.py
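A sketch of what that pre-processing step can look like (the exact regular expressions are an assumption, not the gist itself):

```python
import re

def preprocessor(text):
    """Strip HTML tags, drop punctuation, and lowercase the text."""
    text = re.sub(r'<[^>]*>', '', text)          # remove HTML tags
    text = re.sub(r'[\W]+', ' ', text.lower())   # non-word chars -> spaces, lowercase
    return text.strip()

print(preprocessor('<p>COVID-19 Update!</p>'))  # covid 19 update
```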

The following code combines tokenization and stemming into one step, which we will later apply to “title_text”:

from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

TF-IDF

Here we transform “title_text” feature into TF-IDF vectors.

  • Because we already converted “title_text” to lowercase earlier, we set lowercase=False.
  • Because pre-processing has already been applied to “title_text”, we set preprocessor=None.
  • We override the string tokenization step with our combination of tokenization and stemming we defined earlier.
  • Set use_idf=True to enable inverse-document-frequency reweighting.
  • Set smooth_idf=True to avoid zero divisions.
fake_news_logreg_tfidf.py
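A sketch of the vectorizer with the settings listed above; the two sample documents are just for illustration:

```python
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

porter = PorterStemmer()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

tfidf = TfidfVectorizer(lowercase=False,             # text is already lowercased
                        preprocessor=None,           # pre-processing already applied
                        tokenizer=tokenizer_porter,  # tokenize + stem in one step
                        use_idf=True,                # enable IDF reweighting
                        smooth_idf=True)             # avoid zero divisions

docs = ['covid vaccine approved in trials', 'garlic cures covid claim']
X = tfidf.fit_transform(docs)
print(X.shape)  # (2 documents, number of stemmed terms)
```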

Logistic Regression for Document Classification

  • Instead of tuning the C parameter manually, we can use the built-in cross-validated estimator LogisticRegressionCV.
  • We specify cv=5 cross-validation folds to tune this hyperparameter.
  • The model is scored on classification accuracy.
  • By setting n_jobs=-1, we use all available CPU cores.
  • We raise the maximum number of iterations so the optimization algorithm can converge.
  • We use pickle to save the model.
fake_news_logreg_model.py
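Putting those settings together, a self-contained sketch of the training step; the synthetic features below stand in for the TF-IDF matrix, and the file name saved_model.sav is an assumption:

```python
import pickle
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the TF-IDF features of "title_text"
X, y = make_classification(n_samples=200, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

clf = LogisticRegressionCV(cv=5,               # 5-fold CV to tune C
                           scoring='accuracy',  # score on classification accuracy
                           n_jobs=-1,           # use all CPU cores
                           max_iter=300,        # higher iteration cap for convergence
                           random_state=0).fit(X_train, y_train)

# Save the trained model with pickle
with open('saved_model.sav', 'wb') as f:
    pickle.dump(clf, f)
```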

Model Evaluation

  • Use pickle to load our saved model.
  • Use the model to check the accuracy score on data it has never seen before.
fake_news_logreg_eva.py
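A sketch of the evaluation step. Since this snippet has to run on its own, it first writes a small stand-in model to saved_model.sav; in the notebook that file would come from the training step:

```python
import pickle
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in model and held-out split so the snippet is self-contained
X, y = make_classification(n_samples=200, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)
with open('saved_model.sav', 'wb') as f:
    pickle.dump(LogisticRegression(max_iter=300).fit(X_train, y_train), f)

# Load the saved model and score it on data it has never seen
with open('saved_model.sav', 'rb') as f:
    saved_clf = pickle.load(f)
print('test accuracy:', saved_clf.score(X_test, y_test))
```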

The Jupyter notebook can be found on GitHub. Enjoy the rest of the week.
