Originally published in the TIBCO Community Blog, by Adam Faskowitz
R0, the basic reproduction number, and Rt, the time-varying reproduction number, are quantities used when trying to understand the reproduction rate of the virus. However, both of them are quite difficult to estimate. One glaring issue is the largely unknown prevalence of asymptomatic and presymptomatic cases, which directly impacts our estimates. Another issue is the extent of local versus imported cases (read here for more information on R0/Rt).
A complementary approach to estimating R0 is to consider the reported cases as a random variable, and seek to characterize it directly. This is often done with some form of moving average. Moving average has its advantages—it is familiar and easy to understand. However, it has several disadvantages:
- We must choose a bandwidth – the number of days to include in the average. How should we do this?
- We do not get an estimate of the stability of the predictions we obtain.
- Averages are quite sensitive to outliers, resulting in swings that are difficult to interpret.
At TIBCO, we have been using Friedman’s SuperSmoother (as implemented in R as supsmu) to estimate the epidemic curve of counties, states/provinces, and countries across the world. This has the default option to choose the bandwidth automatically. In most cases, this method appears to match or outperform the moving average method.
One important output that we’d like to get from this exercise is an estimate of the current trend, but this proves challenging for many methods. We have performed a number of experiments to compare various bandwidths and methods for extrapolating into the near future, including the SuperSmoother and GAM/Poisson models (we delve into those results in a separate study). Today, we focus on what we have found to be the most effective method for estimating the current trend: GAMLSS.
In this blog, we describe our experience with a method called GAMLSS, Generalized Additive Models for Location, Scale and Shape, available in R as package gamlss. Our intention is to provide insight into the overall history, variation, and current trend in COVID data; prediction is not necessarily our focus. This method is described by Stasinopoulos et. al. as a distributional approach—that is, we make an informed guess as to the appropriate distribution to use, and then fit a linear or smooth model via a link function to the data. We use penalized b-splines to fit the time series of new cases with a smooth curve (this method creates a smooth curve adapted to the data). The gamlss package additionally provides easy to use diagnostics to help validate our hypothesis and the fit.
To begin, we apply this method to the daily counts of New Cases from Johns Hopkins as well as the websites of various countries. Initially we use the data by State and Province from around the world. For each geography, defined as ‘country’+’State/Province’, we use the gamlss::fitDist function to find the distribution that best fits the data. We found that “SHAHSo” was a promising choice–“sinh-arcsinh original”. This is a flexible distribution with 4 parameters that control its location, scale and shape.
We fit the data as follows:
- Model <- gamlss(New.Cases ~ pb(as.numeric(Date)),
- sigma.formula= ~ pb(as.numeric(Date)),
- data = train.data,
GAMLSS allows a separate model formula for each of its parameters. On line 1, we model the mu parameter as a smooth curve pb(), which uses a penalized b-spline, with the degree of smoothing selected by cross-validation. The sigma parameter on line 2 is modelled with the same formula. This is a particular advantage of this method, since we do not have to assume constant values for the parameters. Here we have specified a model of the variance changing over time. We see this clearly in the plots below.
Two more parameters, nu and tau, are omitted from this specification, which results in the default, a constant for each of these.
We check the fit by inspecting plot(Model):
These plots show the quantile residuals in various ways. Normalized quantile residuals are transformations of the observed residuals that use the quantiles for the chosen distribution. The effect is that the transformed residuals are normally distributed if the model is correct. You see that this is true when looking at the normal-shaped bell curve in the Density Estimate above, and in the Normal Q-Q plot, where a roughly straight line indicates that the residuals are normally distributed. We find that in many cases, we are able to create adequate models with healthy diagnostics. These models are then used to predict the distribution of values we can expect to see on each date.
Because we have such a model, we use the centiles of the distribution to plot contour lines for any centile. This gives us not only the center of the distribution, but also graphically shows the distribution of confidence intervals. Here is one example:
In this plot, the dots represent the observed values of new cases across the state of Virginia. The output of the GAMLSS model provides a range of outcomes at a certain point in time. Each line in the visualization represents the plot of the nth percentile and the value here is that these can be used to create confidence intervals about the actual trend of the data.
We see an interesting increase in the modelled variance, followed by a relative decrease. We speculate that these high daily counts may indicate highly localized outbreaks in an area, building or event, or timing issues caused by new sets of tests being administered. With sufficient data, the model captures these events as part of its error distribution.
As a means of validating our approach, we use 153 geographic areas for which we developed adequate models. We use 7 days of available data as a hold-out sample, and project our results forward for comparison with the percentiles we predicted. We tag each new observation into the band that it falls into and compare the percentiles of this test data set with the expected percentiles:
Each point (dot) represents a group of observations whose predictions fall into a range of counts. The y-axis is the actual result expressed as a percentile. The line is y=x, perfect agreement between expected and actual.
The observed percentiles agree well with a random set of draws from the predicted distribution. For example, 44% of the actual counts fall below the median predicted for its day and geo. We conclude that extrapolating from such models gives a reasonable expectation of the range of results we are likely to see in the coming week.
We decide to use these percentile plots in TIBCO’s COVID Live Report whenever we are able to fit a good model. In most cases, failures are due to insufficient data, which could be seen as a drawback for the approach. We recognize that the GAMLSS, despite its clear advantages, is not adept at dealing with all types of data. Like any method, it is important to understand its advantages and disadvantages, and when to know how to utilize it.
GAMLSS in TIBCO’s COVID-19 Live Report
Using GAMLSS to fit epidemic curves and predict future cases/deaths over the next week is a new feature in TIBCO’s COVID-19 Live Report. Because GAMLSS does really well at capturing the current trend, a side effect is that the method is able to produce promising estimates about future data. By clicking the “View Forecasts” button on the Live Report’s home page, you are directed to a tab where you can explore the results of our GAMLSS method. On this page, you can choose one or more states/provinces, counties, or countries, and a GAMLSS model will fit to the selected data and provide predictions for the next 7 days. The following is the GAMLSS plot on California’s case counts:
The GAMLSS plot has three lines, one for the 10th, 50th, and 90th centiles. This is a benefit of the GAMLSS method as you are not only able to view the center of distribution, represented by the solid line, but you can also see the distribution of confidence values with the 10th and 90th centiles, represented by the dotted lines. Instead of being limited to just one prediction forecast, the GAMLSS model gives you a range of possibilities of what might happen over the next 7 days, as seen in red.
After some experimentation, we found that the GAMLSS method is an effective approach for understanding trends in COVID data. Due to the method’s unique distributional approach and solid statistical foundations, the GAMLSS improves on other more popular methods like Moving Average for fitting epidemic curves and making predictions. GAMLSS provides a credible estimate of the variance at each point in time and the distribution of expected values in the near future, something lacking in other methods. For more information on GAMLSS, be sure to check out the references below and explore the method on TIBCO’s dashboard.
- Ford, C. Understanding QQ Plots. August 26, 2015.
- Stasinopoulos, Dm & Rigby, Robert & Heller, Gillian & Voudouris, Vlasios & De Bastiani, Fernanda. Flexible regression and smoothing: using GAMLSS in R. 2017.
- Wikipedia. Moving average
- Wikipedia. P-spline