Originally published in the TIBCO Community Blog, by Adam Faskowitz
As COVID-19 continues to impact people’s lives, we are interested in predicting case trends of the near future. Trying to predict an epidemic is certainly no easy task. While challenging, we explore a variety of modeling approaches and compare their relative performance in predicting case trends. In our methodology, we focus on using data of the past few weeks to predict the data of next week. In this blog, we first talk about the data, how it is formatted and managed, and then describe the various models that we investigated.
The data we use records the number of new cases reported in each county of the U.S everyday. Even though the dataset that we use has much more information, like the number of recovered deaths, etc, the columns that we focus on are “Cases”, “Date”, “State”, and “County”. We combine the “State” and “County” columns together into a single column named “geo.” After that, we decided to use the 2 weeks from 05/21/2020 to 06/03/2020 as training data, to try to predict the median number of cases from 06/04/2020 to 06/10/2020.
We obtain the following table for training set:
To trim down the data, we remove all counties that have less than 10 cases in the 2 training weeks. The final dataset has 1521 counties in total, which is around half of 3,141 total counties in the US.
The first method that we look into is the Friedman’s Supersmoother method. This is a nonparametric estimator based on local linear regression. Using a series of these regressions, the Projection method is able to generate a smoothed line for our time series data. Below is an example of the smoother on COVID case data from King county in Washington State:
As part of our methods for prediction, we use the last 2 points fitted by the smoother to compute a slope, and then use this slope to predict the number of cases for next week. We find that Friedman’s Supersmoother method is consistent and easy to use because it does not require any parameters. However, we have found that outliers can cause the method to sometimes have erratic behavior.
Generalized Linear Model
In this approach, we will use R’s built-in generalized linear model function, glm. GLMs generalize the linear model paradigm by introducing a link function to accommodate data which cannot be fit with a normal distribution. The link function transforms a linear predictor to enable the fit. The type of link function used is specified by the “family” parameter in R’s GLM function. As is usual with count data, we use family=”poisson”. A good introduction can be found at The General Linear Model (GLM): A gentle introduction. One drawback of this approach is that our model could be too sensitive to outliers. To combat against this, we experiment two approaches: Cook’s Distance and Forward Search.
This method is quite straightforward and can be summarized in 3 steps:
- Fit a GLM model.
- Calculate Cook’s distance, which measures the influence of each data point, for all 14 points. Remove high influence points where data is far away from the fitted line.
- Fit a GLM model again based on the remaining data.
One caveat of this method is that the model might not converge in the first step. Though, such cases are rare if we only use 2 weeks of training data. A longer training period may cause the linear predictor structure to prove too limited and require other methods.
The Forward Search method is adapted from the second chapter of the text “Robust Diagnostic Regression Analysis,” written by Anthony Atkinson and Marco Riani. In Forward Search, we start with a model fit to a subset of the data. The goal is to start with a model that is very unlikely to be built on data that includes outliers. In our case, there are few enough points that we can build a set of models based on every pair of points; or select a random sample to speed up the process. Out of these, we choose the model that best fits the data. Then, the method will iteratively and greedily select data points to add into the model. In each step, the deviance of the data points from the fitted line is recorded. A steep jump in deviance implies that the newly added data is an outlier. Let’s look at this method in further detail:
1. Find initial model using the following steps:
a. Build models on any combination of 2 data points. Since we have 14 data points in total, we will have (14 choose 2) = 91 candidate models.
b. Compute the trimmed sum squared error of the 14 data points based on each fitted model. (Trimmed here means that we only use the 11 data points with least squared error. The intention is to ignore outliers when fitting)
c. The model with least trimmed squared error is selected as the initial model.
For explanation, let’s assume that this red line below was chosen as the initial model. This means that out of all the pairings of two points, this model, more or less, fit the data the best.
2. Next, we walk through and add the data points to our initial model. The process is as follows:
a. Record the deviance of all 14 data points to the existing model
b. Using the points with the lowest deviations from the current model, select the subset with one additional point for fitting the next model in the sequence
c. Using the newly fit model, repeat this process iteratively on the rest of the data
3. We want to evaluate the results of step 2 by looking at the recorded deviance from each substep. Once there seems to be a steep jump in the recorded deviance (above 1.5 SDs), this indicates that we’ve reached an outlier. The steep jump indicates this because, compared to the model before that does not include the outlier, the newly created model with the outlier shifted the model & the recorded deviance significantly—suggesting that this data point is unlike the rest of the data. Additionally, we can presume that the remaining points after the steep jump are more aligned to the skewed data and could also be treated as outliers.
4. Ignoring the outliers identified in step 3, use the remaining data set as training data for the GLM and fit the final model.
Using this method, we will always be able to get a converged model. However, the first step of selecting the best initial model can be very time consuming and the time complexity is O(N^2), where N is the number of data points in the training set. One way to reduce the runtime is to use a sample of possible combinations. In our example, we may try 10 combinations out of the potential 91 combinations.
Our next approach is a simplified version of Moving Average. For this, we first compute the average of the first training week, and then compute the average of the second training week. Here, we assume that the change in number of cases reported each day has a linear relationship. While simple, using a moving average can obtain decent results with strong performance. Below is a visual representation of this method. The first red point represents the average of the first week and the second represents the average of the second week. The slope of the two points is then used to project the following week.
To evaluate these approaches, we used each method to project the median number of cases for the next week based on the case data from the previous two weeks. In addition, we also analyzed the model in terms of a classification problem—taking a look at whether each model was able to correctly identify whether the case trend was increasing or decreasing. Doing this over all of the counties in our dataset, each method now has a list of 1521 projected medians. Comparing the projections to actual data, we can calculate the observed median error for each county across the methods. The table below displays the percentiles of each method’s list of errors.
Note that it is quite common for the Moving Average and Projection methods to predict a negative number of cases. In those situations, we will force them to predict 0. It is common for both GLM models to produce an extremely large number of cases.
Overall, the GLM model, utilizing Cook’s Distance to find outliers, seems to perform best. This method rarely makes negative predictions and predicts reasonably in most cases. The Moving Average method produced the lowest 100th Percentile, or in other terms, achieved the lowest maximum error. The traditional model-based Cooks Distance method improves on the simple Moving Average approach in most cases. All methods, however, suffer from a number of very unrealistic estimates in some cases. Although the Forward Search method is interesting for its innovative approach, in practice it underperforms and is more costly in terms of compute time.
Now, let’s take a look at the results of our classification problem:
Interestingly, the GLM models seemed to not perform as well when looking at the problem in terms of correctly classifying increasing or decreasing trends universally across the counties. There are two metrics in the table above. The “ROC AUC (>5)” calculates the metric when applied to counties with their previous week’s median case count above 5, whereas the “ROC AUC (>25)” refers to above 25 cases (ROC AUC, which you can read more about here, is a metric for measuring the success of a binary classification model; values closer to 1 indicate better performance). What you can infer from this is that the more simple Moving Average and Projection methods can do better than the GLMs as a blanket approach. However, when looking at counties with more cases, and likely more significant trends, the GLMs prove better. This supports the finding that GLMs can often have erroneous results on insufficient datasets, but good results on datasets with enough quality data. Additionally, we can say that this is a good example to demonstrate that one-size does not fit all when it comes to modelling. Each method has its benefits and it is important to explore those pros and cons when making a decision on what model to use, and when to use it.
For a more visual look at the results, we can examine some specific cases. Here, we plot results of the methods on three different scenarios: where the number of cases is less than 50, between 50 and 150, and greater than 150.
In general, it can be seen that the more cases there are in the training set, the more accurate and reasonable are the GLM methods. These perform particularly well when there is a clear trend of increasing or decreasing data. However, the GLM does a poor job when the cases are reported on an inconsistent basis (data on some days, but 0’s on others). In such cases, the fitted curve is “dragged” by the few days of reported data. An example of this is illustrated by the Texas Pecos data in the second figure above.
The Projection method seems to be too subjective to the case counts on the last few days. When there is a sharp decrease on those days, the supersmoother may make negative predictions.
The Moving Average method can be interpreted as a simplified version of the supersmoother. The main difference is that it weights the data of the first and second week equally when making predictions. Therefore, it actually does a slightly better job than the supersmoother.
Effect of Training Period:
To further evaluate these approaches, we can extend the length of the training weeks to see how that might affect the performance of each model. The metric used here is similar to the table from the “Results” section: the median error of the model prediction from the observed data. The results across different training lengths are below:
It is interesting to see that the performance of the GLM-CD model first increases as the length of training data increases (deviances decrease), but later the performance deteriorates once the length of training data is too large.
The following examples illustrate why the performance may deteriorate when the length of training data is too long:
We can see that the GLM model assumes that the trend must be monotone. Once it assumes that the number of cases are increasing, it fails to detect the decreasing number of cases after the outbreak. Therefore, the GLM model is particularly useful when making predictions based solely on the most recent trend. On the contrary, the Projection method is much better at automatically emphasizing the most recent trend, without having to worry about whether the data is monotonic or not, and increasing the length of training data increases its performance in general.
The GLM approach could also be improved by taking into account the presence of a maximum and only using the monotonic portion of the data. For example, the gamlss package and function have a feature that can detect a changepoint and fit a piecewise linear function appropriately. (See Flexible Regression and Smoothing using GAMLSS in R pp 250-253). This would enable us to use a longer time frame when possible in an automated way.
Overall, if we want to use the most recent data for nearcasting based on a GLM model, a 6 week training set seems to be the optimal length. If we were to use a longer period of training data, we might prefer using the Projection method.
While each model has its advantages and disadvantages, using these approaches can help establish reasonable predictions about future trends in COVID data. Not only can these methods be applied in this specific case, but they can also be used for a number of different use cases involving time series data.
The methodologies used in this analysis were created in R and Spotfire. To run these yourself, simply utilize Spotfire’s data function, which allows you to run R (or python) scripts within the application. For more information on data functions, check out our community, and if you are interested in learning more about our COVID work and what happens under the hood in Spotfire, read here.
A special thanks to Zongyuan Chen, David Katz, and the rest of the team for their contributions.
- Aswani, A. IEOR 265 – Lecture 6 Local Linear Regression. 2015.
- Brown, C. Generalized Linear Models: understanding the link function. 2018.
- Carey, G. The General Linear Model (GLM): A gentle introduction. 2013.
- Glen, S. Cook’s Distance / Cook’s D: Definition, Interpretation. 2016.
- Stasinopoulos et al. Flexible Regression and Smoothing using GAMLSS in R. 2015.
- Wikipedia. Least trimmed squares
- Wikipedia. Moving average
- Wikipedia. Receiver operating characteristic