COVID Fake News Detection with a Very Simple Logistic Regression
Natural Language Processing, NLP, Scikit Learn
This time, we are going to create a simple logistic regression model to classify COVID news as either true or fake, using the data I collected a while ago.
The process is surprisingly simple and easy. We will clean and pre-process the text data, perform tokenization and stemming with the NLTK library, extract TF-IDF features and train a logistic regression classifier with the Scikit-Learn library, and evaluate the model’s accuracy at the end.
The Data
The data set contains 586 true news articles and 578 fake ones, an almost 50/50 split. Because of bias in the data collection, I decided not to use “source” as a feature; instead, I will combine “title” and “text” into a single feature, “title_text”.
Pre-processing
Let’s have a look at an example of the combined title and text:
df['title_text'][50]
Looking at the above example, the title and text are already pretty clean, so simple text pre-processing will do the job: we strip off any HTML tags and punctuation, and convert everything to lower case.
The following code combines tokenization and stemming; we will apply it to “title_text” later.
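The cleaning step described above can be sketched as follows. This is a minimal illustration, not the article’s exact code; the regular expressions are assumptions about how the tags and punctuation are stripped.

```python
import re

def preprocessor(text):
    # Remove HTML tags (anything between angle brackets)
    text = re.sub(r'<[^>]*>', '', text)
    # Remove punctuation, keeping letters, digits, and whitespace
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # Lowercase and trim surrounding whitespace
    return text.lower().strip()

print(preprocessor('<p>COVID-19 Vaccine: FALSE claim!</p>'))
# covid19 vaccine false claim
```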
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]
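To see what the tokenizer actually produces, here is a self-contained sketch (repeating the definition with its NLTK import) applied to a sample sentence; the sentence itself is just an illustration.

```python
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer_porter(text):
    # Split on whitespace, then reduce each word to its Porter stem
    return [porter.stem(word) for word in text.split()]

# Inflected forms such as "runners" and "running" collapse toward a shared stem
print(tokenizer_porter('runners like running and thus they run'))
# ['runner', 'like', 'run', 'and', 'thu', 'they', 'run']
```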
TF-IDF
Here we transform “title_text” feature into TF-IDF vectors.
- Because we have already converted “title_text” to lowercase earlier, we set lowercase=False.
- Because we have already cleaned and preprocessed “title_text”, we set preprocessor=None.
- We override the default string tokenization step with the combined tokenizer and stemmer we defined earlier.
- We set use_idf=True to enable inverse-document-frequency reweighting.
- We set smooth_idf=True to avoid zero divisions.
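Putting those settings together, the vectorizer can be sketched like this. The two-document corpus is purely illustrative, and the tokenizer is repeated so the snippet runs on its own.

```python
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

porter = PorterStemmer()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

tfidf = TfidfVectorizer(
    lowercase=False,             # "title_text" is already lowercased
    preprocessor=None,           # cleaning was already applied
    tokenizer=tokenizer_porter,  # our tokenizer + Porter stemmer
    use_idf=True,                # inverse-document-frequency reweighting
    smooth_idf=True,             # add-one smoothing avoids zero divisions
)

docs = ['covid vaccine is safe', 'covid vaccine causes harm']
X = tfidf.fit_transform(docs)
print(X.shape)  # rows = documents, columns = stemmed vocabulary terms
```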
Logistic Regression for Document Classification
- Instead of tuning the C parameter manually, we can use the LogisticRegressionCV estimator.
- We specify cv=5 cross-validation folds to tune this hyperparameter.
- The model is evaluated by its classification accuracy.
- By setting n_jobs=-1, we dedicate all the CPU cores to the search.
- We raise the maximum number of iterations so the optimization algorithm can converge.
- We use pickle to save the fitted model.
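The steps above can be sketched as follows. The tiny feature matrix stands in for the real TF-IDF vectors, and the file name is an assumption, not the article’s.

```python
import pickle
from sklearn.linear_model import LogisticRegressionCV

# Illustrative stand-in for the TF-IDF training data (not the article's data)
X = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]] * 5
y = [1, 1, 0, 0] * 5

clf = LogisticRegressionCV(
    cv=5,                 # 5-fold cross-validation to tune C
    scoring='accuracy',   # select C by classification accuracy
    n_jobs=-1,            # use all CPU cores
    max_iter=1000,        # generous iteration budget for the solver
    random_state=0,
).fit(X, y)

# Persist the fitted model with pickle
with open('saved_model.sav', 'wb') as f:
    pickle.dump(clf, f)
```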
Model Evaluation
- Use pickle to load our saved model.
- Use the model to compute the accuracy score on data it has never seen before.
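Loading and scoring can be sketched like this. The snippet first fits and saves a small stand-in model so it runs on its own; the file name and the toy held-out data are illustrative assumptions.

```python
import pickle
from sklearn.linear_model import LogisticRegression

# Fit and save a small stand-in model (illustrative, not the article's model)
X_train = [[0.0], [1.0], [0.1], [0.9]]
y_train = [0, 1, 0, 1]
model = LogisticRegression().fit(X_train, y_train)
with open('saved_model.sav', 'wb') as f:
    pickle.dump(model, f)

# Load the saved model back from disk
with open('saved_model.sav', 'rb') as f:
    loaded = pickle.load(f)

# Score it on data the model has never seen
X_test, y_test = [[0.05], [0.95]], [0, 1]
print(loaded.score(X_test, y_test))
```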
The Jupyter notebook can be found on GitHub. Enjoy the rest of the week.