Photo Credit: Pixabay

Augmented Dickey-Fuller (ADF) test, Kwiatkowski–Phillips–Schmidt–Shin (KPSS) test, Vector Autoregression (VAR), Durbin–Watson statistic, Cointegration test

The Granger causality test is a statistical hypothesis test for determining whether one time series offers useful information for forecasting another time series.

For example, take the question: could we use today’s Apple stock price to predict tomorrow’s Tesla stock price? If so, we say that Apple’s stock price Granger-causes Tesla’s stock price. If not, we say that Apple’s stock price does not Granger-cause Tesla’s stock price.
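To make the idea concrete, here is a minimal sketch of the comparison behind the test: fit one regression that predicts a series from its own past, fit a second that also uses the other series' past, and compare the residuals with an F-statistic. The function name `granger_f_stat` and the synthetic series are assumptions made for illustration; no real stock prices are used.

```python
import numpy as np

def granger_f_stat(cause, effect, lag=1):
    """F-statistic for H0: lagged `cause` does not help predict `effect`."""
    cause = np.asarray(cause, dtype=float)
    effect = np.asarray(effect, dtype=float)
    n = len(effect) - lag
    y = effect[lag:]
    # Column k holds the series shifted back by k steps (k = 1..lag).
    own_lags = np.column_stack(
        [effect[lag - k: len(effect) - k] for k in range(1, lag + 1)]
    )
    cause_lags = np.column_stack(
        [cause[lag - k: len(cause) - k] for k in range(1, lag + 1)]
    )
    const = np.ones((n, 1))
    restricted = np.hstack([const, own_lags])                 # own past only
    unrestricted = np.hstack([const, own_lags, cause_lags])   # plus the other series

    def rss(design):
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ beta
        return float(resid @ resid)

    rss_r, rss_u = rss(restricted), rss(unrestricted)
    dof = n - unrestricted.shape[1]
    return ((rss_r - rss_u) / lag) / (rss_u / dof)

# Synthetic data (an assumption): y follows yesterday's x, x is pure noise.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
noise = rng.normal(scale=0.5, size=500)
y = np.zeros(500)
y[1:] = 0.8 * x[:-1] + noise[1:]

f_xy = granger_f_stat(cause=x, effect=y)  # should be large
f_yx = granger_f_stat(cause=y, effect=x)  # should be small
print(f"x -> y: F = {f_xy:.1f}, y -> x: F = {f_yx:.1f}")
```

A large F-statistic is evidence against the null hypothesis of no Granger causality. Turning it into a p-value requires the F distribution (for example `scipy.stats.f.sf`), and `statsmodels.tsa.stattools.grangercausalitytests` packages the whole procedure.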

The Data


Photo Credit: Pixabay

PYMC3, Bernoulli Distribution, Pfizer, Moderna, AstraZeneca

Because the trials are still ongoing, researchers caution against making head-to-head comparisons of vaccines based on incomplete data. But for the sake of learning, we will do it anyway, without drawing any meaningful conclusions.

Recent announcements put the potential effectiveness of the SARS-CoV-2 vaccine candidates developed by Pfizer-BioNTech, Moderna, AstraZeneca (regimen 1) and AstraZeneca (regimen 2) at 95%, 94.5%, 90% and 62%, respectively. We know that some of the data are incomplete. The following analysis will be based on whatever information we currently have.
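As a rough sketch of how a Bernoulli model turns case counts into an efficacy estimate, the snippet below uses a conjugate Beta prior and the standard library instead of PyMC3. The trial counts are hypothetical placeholders, not figures from any of the trials above, and `efficacy_posterior` is a name invented for this illustration.

```python
import random

def efficacy_posterior(cases_vax, n_vax, cases_placebo, n_placebo,
                       draws=20000, seed=42):
    """Posterior mean and 95% interval of efficacy = 1 - p_vax / p_placebo."""
    rng = random.Random(seed)
    samples = []
    for _ in range(draws):
        # Beta(1, 1) prior + Bernoulli outcomes -> Beta posterior per arm.
        p_vax = rng.betavariate(1 + cases_vax, 1 + n_vax - cases_vax)
        p_placebo = rng.betavariate(1 + cases_placebo,
                                    1 + n_placebo - cases_placebo)
        samples.append(1 - p_vax / p_placebo)
    samples.sort()
    mean = sum(samples) / draws
    lower = samples[int(0.025 * draws)]
    upper = samples[int(0.975 * draws)]
    return mean, lower, upper

# Hypothetical counts (an assumption): 10 cases among 20,000 vaccinated
# participants and 100 cases among 20,000 placebo participants.
mean, lower, upper = efficacy_posterior(10, 20000, 100, 20000)
print(f"efficacy ~ {mean:.1%} (95% interval {lower:.1%} to {upper:.1%})")
```

The width of the interval is the useful part: with few observed cases, two candidates with different headline numbers can have heavily overlapping posteriors.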

The Data


Johnson and Johnson, JNJ, Keras, Autoencoder, Tensorflow

Autoencoders are an unsupervised learning technique, although they are trained using supervised learning methods. The goal is to minimize reconstruction error based on a loss function, such as the mean squared error.
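Here is a small sketch of what "reconstruction error" means in practice, assuming mean squared error as the loss. The arrays stand in for input windows and a model's output and were written by hand; the threshold is a hypothetical cut-off, and flagging windows above it is one common way to use the error, not necessarily the rule used in the post.

```python
import numpy as np

def reconstruction_mse(original, reconstructed):
    """Mean squared error per sample, averaged over all remaining axes."""
    original = np.asarray(original, dtype=float)
    reconstructed = np.asarray(reconstructed, dtype=float)
    axes = tuple(range(1, original.ndim))
    return np.mean((original - reconstructed) ** 2, axis=axes)

# Hand-written stand-ins (assumptions): three windows of three time steps,
# plus what a trained autoencoder might return for each of them.
windows = np.array([[1.0, 1.1, 1.2],
                    [1.2, 1.3, 1.4],
                    [1.4, 5.0, 1.6]])
reconstructions = np.array([[1.0, 1.1, 1.2],
                            [1.2, 1.2, 1.4],
                            [1.4, 1.5, 1.6]])

errors = reconstruction_mse(windows, reconstructions)
threshold = 0.5  # hypothetical cut-off, normally chosen from training errors
anomalies = errors > threshold
print(errors, anomalies)
```

The third window contains a spike the reconstruction does not reproduce, so its error is far above the other two and it is the only one flagged.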

In this post, we will try to detect anomalies in Johnson & Johnson’s historical stock price time series data with an LSTM autoencoder.

The data can be downloaded from Yahoo Finance. The time period I selected was from 1985-09-04 to 2020-09-03.

The steps we will follow to detect anomalies in Johnson & Johnson stock price data using an LSTM autoencoder:

  1. Train an LSTM autoencoder on the Johnson…


Photo credit: Pexels

Natural Language Processing, NLP, Hugging Face

Most researchers submit their papers to academic conferences because it is a faster way of making the results available. Finding and selecting a suitable conference has always been challenging, especially for young researchers.

However, by drawing on proceedings data from previous conferences, researchers can increase their chances of paper acceptance and publication. We will try to solve this text classification problem with deep learning using BERT.

Almost all the code was taken from this tutorial; the only difference is the data.

The Data


Photo credit: Unsplash

Natural Language Processing, NLP, Scikit Learn

This time, we are going to create a simple logistic regression model to classify COVID news as either true or fake, using the data I collected a while ago.

The process is surprisingly simple. We will clean and pre-process the text data, perform feature extraction using the NLTK library, build and deploy a logistic regression classifier using the Scikit-Learn library, and evaluate the model’s accuracy at the end.
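The overall shape of such a pipeline can be sketched in a few lines. This is an illustration under stated assumptions, not the post's code: the headlines and labels are invented toy examples, `clean_text` is a hypothetical helper, and feature extraction uses Scikit-Learn's `TfidfVectorizer` in place of NLTK to keep the example short.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def clean_text(text):
    """Lowercase, replace non-letters with spaces, collapse whitespace."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

# Invented toy examples (an assumption), far too few for a real model.
texts = [
    "Health agency publishes updated guidance on masks",
    "Hospital reports results of a clinical vaccine trial",
    "Miracle tea cures the virus overnight, doctors stunned",
    "Secret plot: the virus is spread by phone towers",
]
labels = ["true", "true", "fake", "fake"]

# Cleaning runs inside the vectorizer, so new text is handled the same way.
model = make_pipeline(
    TfidfVectorizer(preprocessor=clean_text),
    LogisticRegression(),
)
model.fit(texts, labels)

predictions = model.predict(["This miracle tea cures everything overnight"])
print(predictions)
```

With real data, accuracy should be measured on a held-out test split (for example via `sklearn.model_selection.train_test_split`) and not on the training examples.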

The Data


Photo Credit: Unsplash

NLP, Machine learning, COVID-19

It’s not easy for ordinary citizens to identify fake news. And fake coronavirus news is no exception.

As part of an effort to combat misinformation about coronavirus, I collected training data and trained an ML model to detect fake news on coronavirus.

My training data is not perfect, but I hope it will be useful to help us understand whether fake news differs systematically from real news in style and language use. So, let’s find out.

The Data


Photo Credit: Unsplash

NLP, Natural Language Processing, Visualization

It is heartbreaking to learn that half of Canadians have been fooled by COVID-19 conspiracy theories.

According to the WHO, the COVID-19 related infodemic is just as dangerous as the virus itself. Similarly, conspiracy theories, myths and exaggerated facts could have consequences that go way beyond public health.

Thanks go to projects such as Lead Stories, Poynter, FactCheck.org, Snopes and EUvsDisinfo, which monitor, identify and fact-check disinformation spreading across the world.

To explore the content of COVID-19 fake news, I use strict definitions of what true and fake news stories are. Specifically, true news articles are articles that are known to…


Photo by Dave Michuda on Unsplash

Economy, public health, coronavirus, NLP

I was intrigued by one of Thomas L. Friedman’s op-eds in The New York Times last week: “A Plan to Get America Back to Work”. It advocated a data-driven approach to the COVID-19 pandemic: that is, limiting the number of infections and deaths from the coronavirus while at the same time maximizing the speed at which we can safely fold workers back into the workplace, based on the best data and expert advice.

In another op-ed, he offered a detailed plan on how to accomplish this step by step.

I was interested to understand what his readers think of this, so I decided to do a…


Photo credit: Pixabay

Travel intent, personalization in travel, K-Modes Clustering

In the travel and tourism industry, segmentation is an important strategy for developing itineraries and marketing materials targeted towards different groups with varying travel intents and motivations. It helps businesses understand the subgroups that make up their audience so that they can better tailor products and messages.

One caveat in the travel industry is that, unlike online shopping, leisure travel is an infrequent purchase. Most leisure travelers only travel once or twice a year, so active customers are either uncommon or very slow in making their bookings.

But the good thing is that people love to travel and…


Photo credit: Pixabay

Promotional analysis, Time series forecasting, Intervention analysis

Many businesses are very seasonal, and some of them make most of their money around events and holidays like the Super Bowl, Labor Day, Thanksgiving, and Christmas. In addition, they use sales promotions to increase the demand for or visibility of a product or service for several weeks during the year.

In this post, we are going to analyze historical data using time series analysis techniques, taking promotion effects into account.

The data we are going to use are weekly sales and price data for 9 stores and 3 products. At the end, we are going to forecast the sales for the next 50 weeks for one…

Susan Li

Changing the world, one post at a time. Sr Data Scientist, Toronto Canada. https://www.linkedin.com/in/susanli/
