Posted by Oscar Wahltinez, Developer Programs Engineer

It's been over two years since the COVID-19 Open Data repository launched. The world has changed a lot since then, and the purpose of the repository has changed, too.

The repository's major features and associated use cases were:

  1. Real-time COVID-19 data updates for monitoring purposes
  2. Comprehensive location coverage for forecasting and research purposes
  3. Comprehensive covariate coverage for forecasting and research purposes
The COVID-19 Open Data repository was launched over two years ago.

A major global shift in policy and in the resources allocated to COVID-19 monitoring has drastically reduced our ability to maintain high-quality, up-to-date data. As society's focus has moved away from the pandemic, many health authorities have reduced how often they publish updates and the types of data they report, and some have stopped updating data altogether.

Changing our focus

With that in mind, we have decided to shift our focus to the research use case: the data remains available for retrospective analysis, but we will no longer provide real-time updates.

We have accumulated a wealth of data, covering over 20,000 distinct locations and hundreds of covariates from many different data sources. Users who wish to receive further updates to the datasets have two options: because our code and infrastructure are fully open source, they can run the code themselves, or they can inspect our data sources and go directly to the canonical source of information.
Snapshot of existing data sources for the COVID-19 Open Data repository.

The project site, GitHub repository, and data will continue to be accessible to users, including the associated BigQuery Datasets entry.

Related tools and further research

Although COVID-19 was the focus of the repository, the data is broad enough that a number of generalizable tools and research projects were also built on it. For example, the agent-based epidemic simulator can be seeded with real data from any chosen location and date, and the clinical trial site selection tool can be used to plan future large-scale, diverse vaccine trials.

Beyond research focused on COVID-19, other types of data can be used to analyze large-scale population health and related metrics. For a fun little example, here is a simple plot showing the correlation between visits to parks (from Google's Community Mobility Reports dataset) and searches for skin rash and podalgia (from the Google Search Trends dataset) over a 16-month period in the state of California:
Plot of visits to parks compared to Google search trends for skin rash and podalgia.

As evidenced by the plot, the correlation between the different variables is quite remarkable! The Pearson correlation coefficient is greater than 0.8 for any two of the three variables. Here's the SQL query that you can use to replicate the chart above, using the dataset hosted on BigQuery:

SELECT
  date,
  mobility_parks,
  search_trends_skin_rash,
  search_trends_podalgia
FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE date >= '2020-06-01' AND date <= '2021-12-31' AND location_key = 'US_CA'
ORDER BY date ASC
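
To check the pairwise Pearson coefficients yourself, one option is a short script run over the query results. The sketch below is only an illustration and not the method used to produce the chart. It assumes the rows have already been exported from BigQuery into a list of dictionaries keyed by the column names in the query; the export step is left out, and the function names `pearson` and `pairwise_correlations` are hypothetical helpers introduced here.

```python
from itertools import combinations
from math import sqrt

# Column names taken from the SQL query above.
COLUMNS = ("mobility_parks", "search_trends_skin_rash", "search_trends_podalgia")


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def pairwise_correlations(rows, columns=COLUMNS):
    """Compute the coefficient for every pair of columns.

    `rows` is assumed to be an iterable of dicts, one per date. A row is
    skipped for a given pair when either of the two values is missing (None).
    """
    rows = list(rows)
    result = {}
    for a, b in combinations(columns, 2):
        pairs = [
            (r[a], r[b])
            for r in rows
            if r.get(a) is not None and r.get(b) is not None
        ]
        xs = [p[0] for p in pairs]
        ys = [p[1] for p in pairs]
        result[(a, b)] = pearson(xs, ys)
    return result
```

Calling `pairwise_correlations(rows)` returns a dictionary keyed by column pairs, so each of the three coefficients can be compared against the 0.8 threshold mentioned above. Dropping rows with missing values per pair is a design choice of this sketch; other treatments of gaps, such as interpolation, would give somewhat different numbers.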

Feel free to reach out to us via the COVID-19 Open Data site if you use this work for interesting and impactful research, or if you have any questions for us!