Posted 2 Jan 2021


Predicting Near Term COVID Death Rates

Using previous case and fatality numbers, predict future death rates based on current case counts
There is an offset between when a COVID case is reported and an eventual death. Using Python, SKLearn and a Jupyter notebook, find the offset in days between the two that results in the best correlation. The resulting linear equation can then accurately predict the number of future deaths based on already reported cases.


The COVID Tracking Project publishes a daily, curated data set of global COVID cases, hospitalizations, deaths and other data points. Reporting in the US seems to focus on lagging indicators like cases reported and fatalities-to-date. So, using the Tracking Project's data, let's look at the correlation between cases and future fatalities. As you might expect, there is a reliable relationship between the two.

Using the code

Python and Jupyter notebooks are great tools for experimentation and visualization of this sort of thing. Pandas, numpy, matplotlib and SKLearn provide more than enough functionality for a straightforward analysis.

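import datetime as dt
from collections import namedtuple
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LinearRegression

startDate = np.datetime64('2020-06-24') # the earliest date in the data to use
country = 'USA' # the country code to analyse

# load and filter the dataset to just the locale and period of interest
# (the data set's URL is missing from the source text and is left empty here)
raw_data = pd.read_csv('',
           usecols=['date', 'iso_code', 'new_deaths', 'new_cases', 'new_deaths_smoothed',
           'new_cases_smoothed'], parse_dates=['date'])
# keep rows with all four measures present (the last condition is cut off
# in the source text; it is inferred here from the usecols list)
df = raw_data[(raw_data.iso_code == country) & (raw_data.date >= startDate) &
    (~raw_data.new_deaths.isnull()) &
    (~raw_data.new_cases.isnull()) &
    (~raw_data.new_deaths_smoothed.isnull()) &
    (~raw_data.new_cases_smoothed.isnull())]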

After some initial exploration of the US data, it is pretty clear that there is a linear relationship between cases and deaths. From there, it's relatively straightforward to shift the case data relative to the fatalities by a number of days and find the offset with the best R² (coefficient of determination).

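Model = namedtuple('Model', 'linearRegression r2 offset data')

def BestFitModel(new_cases, new_deaths, max_offset) -> Model:
    best = Model(None, 0.0, 0, None)

    # scikit-learn expects a 2-D feature matrix and a 1-D target
    all_cases = np.asarray(new_cases, dtype=float).reshape(-1, 1)
    all_deaths = np.asarray(new_deaths, dtype=float).ravel()

    # find an offset, in days, where cases best correlate to future deaths
    # (offsets 1 .. max_offset - 1 are tried)
    for offset in range(1, max_offset):
        # pair each day's cases with the deaths reported [offset] days later
        cases = all_cases[:-offset]
        deaths = all_deaths[offset:]

        model = LinearRegression().fit(cases, deaths)
        predictions = model.predict(cases)
        r2 = metrics.r2_score(deaths, predictions)
        if r2 > best.r2:
            best = Model(model, r2, offset,
                         pd.DataFrame({'predicted': predictions, 'actual': deaths}))

    return best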
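
To see the offset search in isolation, here is a minimal sketch on synthetic data. Everything in it is an assumption made for illustration: the helper name best_offset, the sine-shaped case curve, the 1.5% fatality ratio and the 14-day lag are invented, not taken from the real data set. It only demonstrates that scanning offsets and keeping the highest R² can recover a lag that was built into the data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def best_offset(cases, deaths, max_offset):
    """Hypothetical helper: return (offset, r2) for the lag with the highest R^2."""
    x = np.asarray(cases, dtype=float).reshape(-1, 1)
    y = np.asarray(deaths, dtype=float)
    best = (0, -np.inf)
    for offset in range(1, max_offset):
        # pair cases on day t with deaths on day t + offset
        lagged_x, lagged_y = x[:-offset], y[offset:]
        r2 = LinearRegression().fit(lagged_x, lagged_y).score(lagged_x, lagged_y)
        if r2 > best[1]:
            best = (offset, r2)
    return best

# synthetic series: deaths follow cases by true_lag days, plus noise
rng = np.random.default_rng(0)
days = np.arange(200)
cases = 50_000 + 20_000 * np.sin(days / 15.0) + rng.normal(0, 500, days.size)
true_lag = 14
deaths = np.empty(days.size)
deaths[true_lag:] = 0.015 * cases[:-true_lag] + 100
deaths[:true_lag] = 0.015 * cases[0] + 100
deaths += rng.normal(0, 5, days.size)

offset, r2 = best_offset(cases, deaths, 30)
print(offset, round(r2, 3))
```

Note that R² here is measured on the same points used for the fit, as in the function above, so it describes goodness of fit rather than out-of-sample accuracy.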


The raw data is noisy (weekends and holidays are under-reported) and very spiky. Even with that noise, the data shows a strong correlation at an offset somewhere between 14 and 21 days. The best offset is almost always a multiple of 7 because of the weekend pattern.

The smoothed data points included in the Tracking Project's data set remove the weekend pattern, but the under-reporting over Thanksgiving and Christmas remains. Despite that, the correlation is greater than 0.94.
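
As a rough illustration of why smoothing removes the weekend pattern, the sketch below applies a 7-day rolling mean to an invented series. The assumption that the data set's smoothed columns are 7-day averages is mine, and the weekly reporting factors and the level of 1000 per day are made up. Because every 7-day window contains each weekday exactly once, a pattern that repeats weekly averages out.

```python
import numpy as np
import pandas as pd

# hypothetical reporting factors, Monday to Sunday; they sum to 7 so the
# weekly total is unchanged, only its distribution across days
weekly_pattern = np.array([1.2, 1.1, 1.1, 1.1, 1.1, 0.7, 0.7])

dates = pd.date_range("2020-11-02", periods=28, freq="D")  # starts on a Monday
raw = pd.Series(1000.0 * np.tile(weekly_pattern, 4), index=dates)

# trailing 7-day mean; the first six values are NaN (incomplete window)
smoothed = raw.rolling(window=7).mean()

print(raw.head(7).tolist())
print(smoothed.dropna().head(3).tolist())
```

A holiday dip is different: it is a one-off shortfall rather than a weekly cycle, so a 7-day window spreads it out but does not cancel it.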

Once the best-fitting linear equation is found, it's straightforward to predict the number of deaths for the coming days. The forecast horizon, in days, equals the offset used to find that equation.

def Predict(model: Model, dates, cases, deaths) -> pd.DataFrame:
    # create a new date series for the range over which we will predict
    # (it is wider than the source date range by [offset] days.
    #  That is how far in the future we can predict)
    minDate = np.amin(dates)
    maxDate = np.amax(dates) + np.timedelta64(model.offset + 1, 'D')

    projected_dates = [date for date in np.arange(minDate, maxDate, dt.timedelta(days=1))]

    # padding so actuals and predictions can be graphed together within dates
    padding = pd.Series(np.full(model.offset, np.nan))

    # Series.append was removed in pandas 2.0, so use pd.concat instead
    actual_deaths = pd.concat([pd.Series(np.asarray(deaths, dtype=float)), padding])
    # scikit-learn expects a 2-D feature matrix
    projected = model.linearRegression.predict(np.asarray(cases, dtype=float).reshape(-1, 1))
    projected_deaths = pd.concat([padding, pd.Series(projected)])

    frame = pd.DataFrame({"dates": projected_dates,
        "actual": actual_deaths.values,
        "projected": projected_deaths.values}, index=projected_dates)

    # unpivot the data set for easy graphing
    return frame.melt(id_vars=['dates'], var_name='series', value_name='deaths')
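
The forecasting step can be reduced to a small worked example. This is a sketch with invented numbers, not the article's data: a hypothetical 3-day offset, ten days of cases, and deaths that are exactly 2% of the cases reported three days earlier. The last three case counts have no matching deaths yet, so feeding them to the fitted line yields the forecast for the next three days.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

offset = 3  # hypothetical lag in days
dates = pd.date_range("2020-12-01", periods=10, freq="D")
cases = np.arange(100.0, 200.0, 10.0)  # 100, 110, ..., 190

# deaths on day t are 2% of the cases reported on day t - offset;
# the first `offset` days have no matching cases and stay NaN
deaths = np.full(cases.size, np.nan)
deaths[offset:] = 0.02 * cases[:-offset]

# fit on the pairs (cases[t], deaths[t + offset])
fit = LinearRegression().fit(cases[:-offset].reshape(-1, 1), deaths[offset:])

# the most recent `offset` case counts drive the forecast
future_dates = pd.date_range(dates[-1] + pd.Timedelta(days=1), periods=offset, freq="D")
forecast = pd.Series(fit.predict(cases[-offset:].reshape(-1, 1)), index=future_dates)
print(forecast)
```

With real data the fit is not exact, so the forecast carries the same scatter that the R² value summarizes.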

Predicted COVID fatalities

Points of Interest

For the curious, the results are updated and published to GitHub each day.


  • 2nd January, 2021 - Initial release


This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


About the Author

Don Kackman
Team Leader Starkey Laboratories
United States
The first computer program I ever wrote was in BASIC on a TRS-80 Model I and it looked something like:
10 PRINT "Don is cool"
20 GOTO 10

It only went downhill from there.

Hey look, I've got a blog
