Data-310-Public-Raposo

Data 310 project summaries

View the Project on GitHub aeraposo/Data-310-Public-Raposo

Final Project: time to hop off the struggle bus

Reflecting on this summer semester, I am filled with a mixture of gratitude, relief, and an overwhelming desire to take a nap. In all seriousness, although this program was challenging at times, I learned SO much. I don’t think I fully knew what I signed up for when the program started, but after getting familiar with the data science lingo, things began to feel less intimidating.

One of the practical skills I refined while completing this project was the art of just winging it. Seriously! I had no idea where to start. In fact, I spent 2 full weeks attempting to clean up a dataset (by COVID-net), only to realize that it was so poorly organized that I couldn’t separate distinct patient records. It was a mess. After also struggling to use a George Washington University dataset, I started looking up prestigious institutions with COVID data and, by some miracle, found one by the University of Oxford that seemed just right. Although it was missing values for many of the 37 parameters for most countries, I was able to identify 6 that were available for every country (only in the most recently added rows).
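To illustrate the kind of check described above, here is a minimal sketch (not the project's actual code) of finding the parameters that have a value for every country in the most recent rows. The function name `complete_columns`, the column names, and the country labels are all made up for the example, and it assumes missing values are stored as `None`.

```python
def complete_columns(rows, key="country"):
    """Return the column names that have a value in every row.

    `rows` is a list of dicts, one per country (hypothetical layout).
    A value of None is treated as missing.
    """
    if not rows:
        return []
    # Candidate columns come from the first row, excluding the identifier.
    columns = [name for name in rows[0] if name != key]
    return [
        name
        for name in columns
        if all(row.get(name) is not None for row in rows)
    ]


# Made-up "latest row per country" data, for illustration only.
latest_rows = [
    {"country": "A", "cases": 10, "tests": None, "deaths": 1},
    {"country": "B", "cases": 20, "tests": 5, "deaths": 2},
    {"country": "C", "cases": 30, "tests": 7, "deaths": None},
]

usable = complete_columns(latest_rows)
print(usable)
```

In this toy data only `cases` survives, which mirrors how a 37-parameter dataset can shrink to a handful of usable columns.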

After a lot of trial and error with what type of model to use and with adjusting the different layers and arguments, it started to run. At first, my results were pretty abysmal, but as I began making small changes, I got a better understanding of what each adjustment did. I eventually settled on the model that I detail on my poster. Considering my MAE started at over 10,000, I’m pretty happy with the decrease to ~540 by the end. With the US data, I was also able to make a correlation matrix heat map, which is available in my slides for the showcase. In my in-class presentation, I shared some additional info about how the optimizer and loss functions work, which I included below (more info on MAE is included on the poster):

If I had more time, I would have liked to generate more data to better train the model and to build a model that specifically predicted values for the United States or another single country, rather than trying to generalize from data about the whole world.
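For readers unfamiliar with the MAE figures mentioned earlier: mean absolute error is the average of the absolute differences between the true values and the model's predictions, so a lower number means predictions land closer to the truth. The sketch below is only an illustration with made-up numbers, not the model or data from this project, and the function name `mean_absolute_error` is chosen just for the example.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("need at least one value")
    total = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    return total / len(y_true)


# Made-up values, for illustration only.
actual = [1200.0, 950.0, 400.0, 3100.0]
predicted = [1000.0, 1100.0, 450.0, 2500.0]

# Absolute errors are 200, 150, 50, and 600, which average to 250.
mae = mean_absolute_error(actual, predicted)
print(mae)
```

Because MAE is in the same units as the target, an MAE of ~540 can be read directly as "off by about 540 on average" in whatever quantity the model predicts.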

Infographic