Emile Timothy

Correlating COVID-19 outbreaks to COVID-19 Vaccine Efficacy

During the spring quarter of my sophomore year at Caltech, I took CS 156b (Learning from Data), a graduate-level, project-based machine learning class taught by Professor Yaser Abu-Mostafa. Since this was during the peak of the COVID-19 pandemic, Prof. Abu-Mostafa asked us to pursue an ML project of our choice, using any available data to produce novel and meaningful insights about COVID-19. So my friends (Julen, Andrew, and Basel, who are really smart guys) and I decided to use existing data to determine how effective the COVID-19 vaccine was against the dominant variants at the time (B.1.1.7, P.1, B.1.351, B.1.427, and B.1.429).

We used data scraped from GISAID, along with data from Outbreak.info, OWID, and the CDC, which tracked regularly updated COVID-19 variant outbreaks across different states in the United States. We used this data to train several models: LSTM (Long Short-Term Memory), VAR (Vector Autoregression), SIR (Susceptible, Infectious, Recovered), ARIMA (Autoregressive Integrated Moving Average), and GPR (Gaussian Process Regression). Specifically, we compared vaccine-delivery data against variant-outbreak data to see whether we could find a negative correlation between the two. My contribution to this project was the development of the ARIMA and GPR models, which are often used to model highly volatile time series, like stock prices. I also looked at XGBoost, but decided against using it due to its lack of scalability, since the data changed so sporadically. Our final machine-learning model ensembled the ARIMA, LSTM, and VAR models to yield a pretty accurate estimate of COVID-19 vaccine efficacy.
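
To give a rough sense of what an ARIMA-style forecast and a negative-correlation check involve, here is a minimal sketch. It is a simplified illustration and not the project's actual code: the function names (`forecast_arima_110`, `pearson`) are hypothetical, the numbers are synthetic, and it fits only an ARIMA(1,1,0) model by least squares, with no moving-average term and no model selection.

```python
from statistics import mean


def difference(series):
    """First-order differencing: the 'I' (d=1) step of ARIMA."""
    return [b - a for a, b in zip(series, series[1:])]


def fit_ar1(diffs):
    """Least-squares fit of diffs[t] = c + phi * diffs[t-1] + noise."""
    x, y = diffs[:-1], diffs[1:]
    x_bar, y_bar = mean(x), mean(y)
    var_x = sum((xi - x_bar) ** 2 for xi in x)
    if var_x == 0:
        # Constant differences: fall back to a pure drift model.
        return y_bar, 0.0
    phi = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / var_x
    return y_bar - phi * x_bar, phi


def forecast_arima_110(series, steps):
    """Forecast `steps` values ahead with an ARIMA(1,1,0) model."""
    if len(series) < 3:
        raise ValueError("need at least 3 observations")
    diffs = difference(series)
    c, phi = fit_ar1(diffs)
    level, last_diff = series[-1], diffs[-1]
    forecasts = []
    for _ in range(steps):
        last_diff = c + phi * last_diff  # AR(1) step on the differences
        level += last_diff               # undo the differencing
        forecasts.append(level)
    return forecasts


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    x_bar, y_bar = mean(xs), mean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var_x = sum((x - x_bar) ** 2 for x in xs)
    var_y = sum((y - y_bar) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (var_x * var_y) ** 0.5


# Synthetic, made-up numbers purely for illustration (not project data).
doses_delivered = [100, 180, 270, 350, 460, 540]
variant_cases = [900, 860, 790, 700, 640, 560]

print(forecast_arima_110(variant_cases, steps=3))
print(pearson(doses_delivered, variant_cases))
```

A correlation close to -1 on data like this would only suggest that outbreaks shrink as vaccine deliveries grow; on its own it says nothing about causation, which is part of why the project combined several models.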

This page includes my raw source code for developing the ARIMA model, as well as the mostly self-contained slides for the three virtual presentations that we gave over the course of the quarter.



A quick digression: While other groups in the class had elaborate names like 'Super Unsupervised' or 'Indecision Trees', we decided to name ourselves 'Group 23' for an odd, though funny, reason. The story is that we were given a Google Sheets form to register our group and its name. However, we decided that instead of going to the trouble of choosing a witty (and potentially awkward) name, we would be happier messing with another group. So, we registered our group as 'Group 23' on row 24 of the spreadsheet, in the hopes that the group that signed up on row 23 would be thrown into confusion ... Alas, it turned out that the group on row 23 never looked at the spreadsheet again, which was quite unfortunate. Moral of the story: our collective prank game needs to get stronger.

This was a really enjoyable class, and even though it was entirely virtual, Professor Abu-Mostafa made it very memorable. I'd also like to acknowledge TAs Chris Wang, Dominic Yurk, and Alexander Zlokapa for their feedback throughout the class.