From R Hub – Counting and Visualizing CRAN Downloads with packageRank (with Caveats!)

By Blog

Originally posted on the R Hub blog

This post was contributed by Peter Li. Thank you, Peter!

packageRank is an R package that helps put package download counts into context. It does so via two functions. The first, cranDownloads(), extends cranlogs::cran_downloads() by adding a plot() method and a more user-friendly interface. The second, packageRank(), uses rank percentiles, a nonparametric statistic that tells you the percentage of packages with fewer downloads, to help you see how your package is doing compared to all other CRAN packages.

In this post, I’ll do two things. First, I’ll give an overview of the package’s core features and functions – a more detailed description of the package can be found in the README in the project’s GitHub repository. Second, I’ll discuss a systematic positive bias that inflates download counts.

Two notes. First, in this post I’ll be referring to active and inactive packages. The former are packages that are still being developed and appear in the CRAN repository. The latter are “retired” packages that are stored in the CRAN Archive along with past versions of active packages. Second, if you want to follow along (i.e., copy and paste code), you’ll need to install packageRank (ver. 0.3.5) from CRAN or GitHub:


# CRAN
install.packages("packageRank")

# GitHub


cranDownloads() uses all the same arguments as cranlogs::cran_downloads():

cranlogs::cran_downloads(packages = "HistData")
        date count  package
1 2020-05-01   338 HistData
cranDownloads(packages = "HistData")
        date count  package
1 2020-05-01   338 HistData

The only difference is that `cranDownloads()` adds four features:

Check package names

cranDownloads(packages = "GGplot2")
## Error in cranDownloads(packages = "GGplot2") :
##   GGplot2: misspelled or not on CRAN.
cranDownloads(packages = "ggplot2")
        date count package
1 2020-05-01 56357 ggplot2

This also works for inactive packages in the Archive:

cranDownloads(packages = "vr")
## Error in cranDownloads(packages = "vr") :
##  vr: misspelled or not on CRAN/Archive.
cranDownloads(packages = "VR")
        date count package
1 2020-05-01    11      VR

Two additional date formats

With cranlogs::cran_downloads(), you can specify a time frame using the from and to arguments. The downside is that you must use the “yyyy-mm-dd” format. For convenience’s sake, cranDownloads() also allows you to use “yyyy-mm” or “yyyy” (an unquoted numeric year, e.g., 2020, also works).


Let’s say you want the download counts for HistData for February 2020. With cranlogs::cran_downloads(), you’d have to type out the whole date and remember that 2020 was a leap year:

cranlogs::cran_downloads(packages = "HistData", from = "2020-02-01",
  to = "2020-02-29")

With cranDownloads(), you can just specify the year and month:

cranDownloads(packages = "HistData", from = "2020-02", to = "2020-02")


Let’s say you want the year-to-date counts for rstan. With cranlogs::cran_downloads(), you’d type something like:

cranlogs::cran_downloads(packages = "rstan", from = "2020-01-01",
  to = Sys.Date() - 1)

With cranDownloads(), you can just type:

cranDownloads(packages = "rstan", from = "2020")

Check dates

cranDownloads() tries to validate dates:

cranDownloads(packages = "HistData", from = "2019-01-15",
  to = "2019-01-35")
## Error in resolveDate(to, type = "to") : Not a valid date.
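
As a rough illustration of what accepting the shorter formats and validating dates involves, here is a minimal base R sketch. Note that resolve_date() is a hypothetical helper written for this example, not packageRank's own resolveDate(). One simplifying assumption: a bare year passed as `to` expands to December 31 rather than to the last day for which logs exist.

```r
# Hypothetical helper (not part of packageRank): expand "yyyy" or "yyyy-mm"
# to a full Date and reject impossible dates.
resolve_date <- function(x, type = c("from", "to")) {
  type <- match.arg(type)
  x <- as.character(x)  # lets a numeric year such as 2020 through

  if (grepl("^[0-9]{4}$", x)) {
    # "yyyy": first or last day of the year
    x <- paste0(x, if (type == "from") "-01-01" else "-12-31")
  } else if (grepl("^[0-9]{4}-[0-9]{2}$", x)) {
    # "yyyy-mm": first or last day of the month
    first <- as.Date(paste0(x, "-01"), format = "%Y-%m-%d")
    if (is.na(first)) stop("Not a valid date.")
    if (type == "from") return(first)
    # last day = first day of the following month, minus one day
    return(seq(first, by = "month", length.out = 2)[2] - 1)
  }

  # "yyyy-mm-dd" (or an expanded year): as.Date() yields NA for bad dates
  out <- as.Date(x, format = "%Y-%m-%d")
  if (is.na(out)) stop("Not a valid date.")
  out
}

resolve_date("2020-02", type = "to")   # the leap day is found for you
resolve_date(2020, type = "from")
```

The month-end trick (step to the next month, subtract one day) is what spares you from remembering that 2020 was a leap year.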


Visualization

cranDownloads() makes visualization easy. Just use plot():

plot(cranDownloads(packages = "HistData", from = "2019", to = "2019"))
A time series lineplot illustrating package downloads for a single package for 2019.
Figure 1 Visualize cranDownloads() for A Single Package

If you pass a vector of package names, plot() will use ggplot2 facets:

plot(cranDownloads(packages = c("ggplot2", "data.table", "Rcpp"),
  from = "2020", to = "2020-03-20"))
A time series lineplot with multiple window frames illustrating package downloads for multiple packages for 2020
Figure 2 Visualize cranDownloads() for Multiple Packages

If you want to plot those data in a single frame, use `multi.plot = TRUE`:

plot(cranDownloads(packages = c("ggplot2", "data.table", "Rcpp"),
  from = "2020", to = "2020-03-20"), multi.plot = TRUE)

For more plotting options, see the README on GitHub and the plot.cranDownloads() documentation.


packageRank began as a collection of functions I wrote to gauge interest in my cholera package. After looking at the data for this and other packages, the “compared to what?” question quickly came to mind.

Consider the data for the first week of March 2020:

plot(cranDownloads(packages = "cholera", from = "2020-03-01",
  to = "2020-03-07"))
A time series lineplot illustrating the package downloads counts for the cholera R package for the first week of March. There are download peaks on Wednesday and Saturday.
Figure 3 Package Downloads for ‘cholera’ March 1-7, 2020

Do Wednesday and Saturday reflect surges of interest in the package or surges of traffic to CRAN? To put it differently, how can we know if a given download count is typical or unusual?

One way to answer these questions is to locate your package in the frequency distribution of download counts. Below are the distributions for Wednesday and Saturday with the location of cholera highlighted:

A scatterplot that plots package downloads v. frequency count of package downloads (i.e., frequency distribution plot) for Wednesday, March 4, 2020. The plot has highly right-skewed and long-tailed shape.
Figure 4 Frequency Distribution of Package Downloads for Wednesday, March 4, 2020
A scatterplot that plots package downloads v. frequency count of package downloads (i.e., frequency distribution plot) for Saturday, March 7, 2020. The plot has highly right-skewed and long-tailed shape.
Figure 5 Frequency Distribution of Package Downloads for Saturday, March 7, 2020

As you can see, the frequency distribution of package downloads typically has a heavily skewed, exponential shape. On the Wednesday, the most “popular” package had 177,745 downloads while the least “popular” package(s) had just one. This is why the left side of the distribution, where packages with fewer downloads are located, looks like a vertical line.

To see what’s going on, I take the log of download counts (x-axis) and redraw the graph. In these plots, the location of a vertical segment along the x-axis represents a download count and the height of a vertical segment represents the frequency of a download count:

plot(packageDistribution(package = "cholera", date = "2020-03-04"))
A histogram plot that plots the base 10 logarithm of package downloads v. frequency count of package downloads (i.e., frequency distribution plot) for Wednesday, March 4, 2020.
Figure 6 Frequency Distribution of Package Downloads for Wednesday, March 4, 2020 with Logarithm of Download Counts
plot(packageDistribution(package = "cholera", date = "2020-03-07",
  memoization = FALSE))
A histogram plot that plots the base 10 logarithm of package downloads v. frequency count of package downloads (i.e., frequency distribution plot) for Saturday, March 7, 2020.
Figure 7 Frequency Distribution of Package Downloads for Saturday, March 7, 2020 with Logarithm of Download Counts
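
If you want to experiment with this kind of plot without downloading a log, the base R sketch below draws the same type of graph from simulated counts. The data are made up (a log-normal draw chosen only because it is heavily right-skewed) and are not CRAN data. The code is not how packageDistribution() is implemented.

```r
set.seed(1)

# Simulated, heavily right-skewed "download counts" (NOT real CRAN data)
counts <- ceiling(rlnorm(5000, meanlog = 2, sdlog = 2))

# Frequency of each distinct download count
freq <- table(counts)
x <- as.numeric(names(freq))  # distinct download counts
y <- as.vector(freq)          # number of packages with that count

# One vertical segment per distinct count, on a log10 x-axis
plot(log10(x), y, type = "h",
     xlab = "log10 of download count", ylab = "Frequency")
```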

While these plots give us a better picture of where cholera is located, comparisons between Wednesday and Saturday are impressionistic at best: all we can confidently say is that the download counts for both days were greater than their respective modes.

To make interpretation and comparison easier, I use the rank percentile of a download count in place of the nominal download count. The rank percentile is a nonparametric statistic that tells you the percentage of packages with fewer downloads. In other words, it gives you the location of your package relative to the location of all other packages in the distribution. Moreover, by rescaling download counts to lie on the bounded interval between 0 and 100, rank percentiles make it easier to compare packages both within and across distributions.

For example, we can compare Wednesday (“2020-03-04”) to Saturday (“2020-03-07”):

packageRank(package = "cholera", date = "2020-03-04", size.filter = FALSE)
        date packages downloads            rank percentile
1 2020-03-04  cholera        38 5,556 of 18,038       67.9

On Wednesday, we can see that cholera had 38 downloads, came in 5,556th place out of the 18,038 unique packages downloaded, and earned a spot in the 68th percentile.

packageRank(package = "cholera", date = "2020-03-07", size.filter = FALSE)
        date packages downloads            rank percentile
1 2020-03-07  cholera        29 3,061 of 15,950         80

On Saturday, we can see that cholera had 29 downloads, came in 3,061st place out of the 15,950 unique packages downloaded, and earned a spot in the 80th percentile.

So contrary to what the nominal counts tell us, one could say that the interest in cholera was actually greater on Saturday than on Wednesday.

Computing rank percentiles

To compute rank percentiles, I do the following. For each package, I tabulate the number of downloads and then compute the percentage of packages with fewer downloads. Here are the details using cholera from Wednesday as an example:

pkg.rank <- packageRank(packages = "cholera", date = "2020-03-04",
  size.filter = FALSE)

downloads <- pkg.rank$crosstab

round(100 * mean(downloads < downloads["cholera"]), 1)
[1] 67.9

To put it differently:

(pkgs.with.fewer.downloads <- sum(downloads < downloads["cholera"]))
[1] 12250
(tot.pkgs <- length(downloads))
[1] 18038
round(100 * pkgs.with.fewer.downloads / tot.pkgs, 1)
[1] 67.9
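
The same calculation can be tried offline. The sketch below wraps it in a small function and applies it to a toy named vector; rank_percentile() and the toy counts are hypothetical and only mirror the logic shown above.

```r
# Hypothetical helper: percentage of packages with strictly fewer downloads
rank_percentile <- function(counts, pkg) {
  round(100 * mean(counts < counts[pkg]), 1)
}

# Toy download counts for five made-up packages
toy <- c(a = 1, b = 1, c = 3, d = 10, e = 250)

rank_percentile(toy, "d")  # 3 of 5 packages have fewer downloads than "d"
rank_percentile(toy, "a")  # nothing has fewer downloads than "a"

# The arithmetic reported above for cholera on Wednesday
round(100 * 12250 / 18038, 1)
```

Because the comparison is strict, packages tied at the lowest count all land at the 0th percentile.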

Visualizing rank percentiles

To visualize packageRank(), use plot():

plot(packageRank(packages = "cholera", date = "2020-03-04"))
A plot of packageRank() for the cholera package for Wednesday, March 4, 2020. It plots rank order of downloads against the base 10 logarithm of downloads, highlights a package's rank percentile and nominal download counts, and the location of the 75th, 50th, and 25th percentiles.
Figure 8 Rank Frequency Distribution of Package Downloads for Wednesday, March 4, 2020
plot(packageRank(packages = "cholera", date = "2020-03-07"))
A plot of packageRank() for the cholera package for Saturday, March 7, 2020. It plots rank order of downloads against the base 10 logarithm of downloads, highlights a package's rank percentile and nominal download counts, and the location of the 75th, 50th, and 25th percentiles.
Figure 9 Rank Frequency Distribution of Package Downloads for Saturday, March 7, 2020

These graphs, customized to be on the same scale, plot the rank order of packages’ download counts (x-axis) against the logarithm of those counts (y-axis). They highlight a package’s position in the distribution along with its rank percentile and download count (in red). In the background, they also show the 75th, 50th and 25th percentiles as dotted vertical lines; the package with the most downloads, which in both cases is magrittr (in blue, top left); and the total number of downloads, 5,561,681 and 3,403,969 respectively (in blue, top right).

Computational limitations

Unlike cranlogs::cran_downloads(), which benefits from server-side support (i.e., download counts are “pre-computed”), packageRank() must first download the log file (upwards of 50 MB) from the internet and then compute the rank percentiles of download counts for all observed packages (typically 15,000+ unique packages and 6 million log entries). The downloading is the real bottleneck (the computation of rank percentiles takes less than a second). This, however, is somewhat mitigated by caching the file using the memoise package.
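
To show why caching helps, here is a minimal base R sketch of the memoization idea: the expensive fetch runs once per date and later calls reuse the stored result. It deliberately uses a plain closure rather than the memoise package's API, and make_cached(), slow_fetch() and the counter are all hypothetical stand-ins.

```r
# Wrap a slow function so that each key is fetched only once
make_cached <- function(fetch) {
  cache <- new.env()
  function(key) {
    if (!exists(key, envir = cache, inherits = FALSE)) {
      assign(key, fetch(key), envir = cache)
    }
    get(key, envir = cache, inherits = FALSE)
  }
}

# Stand-in for downloading a ~50 MB log; counts how often it is called
counter <- new.env()
counter$n <- 0
slow_fetch <- function(date) {
  counter$n <- counter$n + 1
  paste("log for", date)
}

get_log <- make_cached(slow_fetch)
first_call <- get_log("2020-03-04")   # triggers the "download"
second_call <- get_log("2020-03-04")  # served from the cache
counter$n                             # number of actual fetches
```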

Analytical limitations

Because of the computational limitations, anything beyond a one-day, cross-sectional comparison is “expensive”. You need to download all the desired log files (each ~50 MB). If you want to compare ranks for a week, you have to download 7 log files. If you want to compare ranks for a month, you have to download 30-odd log files.
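
To make the cost concrete, the sketch below lists the log files a given time frame would require, without downloading anything. It assumes the daily logs follow the URL pattern http://cran-logs.rstudio.com/<year>/<yyyy-mm-dd>.csv.gz; log_urls() is a hypothetical helper, not a packageRank function.

```r
# Hypothetical helper: one log URL per day in the requested window
# (assumes the cran-logs.rstudio.com/<year>/<date>.csv.gz layout)
log_urls <- function(from, to) {
  days <- seq(as.Date(from), as.Date(to), by = "day")
  paste0("http://cran-logs.rstudio.com/", format(days, "%Y"), "/",
         format(days, "%Y-%m-%d"), ".csv.gz")
}

week <- log_urls("2020-03-01", "2020-03-07")
month <- log_urls("2020-03-01", "2020-03-31")

length(week)   # files needed for the first week of March
length(month)  # files needed for all of March
week[1]
```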

Nevertheless, as a proof-of-concept of the potential value of computing rank percentiles over multiple time frames, the plot below compares nominal download counts with rank percentiles of cholera for the first week in March. Note that, to the chagrin of some, two independently scaled y-variables are plotted on the same graph (black for counts on the left axis, red for rank percentiles on the right).

A time series lineplot with dual axes that compares download counts to rank percentiles of download counts for the first week of March 2020. It shows that, contrary to what nominal download counts tell us, the peak of interest on Saturday was greater than that on Wednesday.
Figure 10 Comparison of Package Download Counts and Rank Percentiles

Note that while the correlation between counts and rank percentiles is high in this example (r = 0.7), it’s not necessarily representative of the general relationship between counts and rank percentiles.

Conceptual limitations

Above, I argued that one of the virtues of the rank percentile is that it allows you to locate your package’s position relative to that of all other packages. However, one might wonder whether we are comparing apples to oranges: just how fair or meaningful is it to compare a package like curl, an important infrastructure tool, to a package like cholera, an applied, niche application? While I believe that comparing fruit against fruit (packages against packages) can be interesting and insightful (e.g., the numerical and visual comparisons of Wednesday and Saturday), I do acknowledge that not all fruit are created equal.

This is, in fact, one of the tasks I had in mind for packageRank. I wanted to create indices (e.g., Dow Jones, NASDAQ) that use download activity as a way to assess the state and health of R and its ecosystem(s). By that I mean I’d not only look at packages as a single collective entity but also as individual communities or components (i.e., the various CRAN Task Views, tidyverse, developers, end-users, etc.). To do the latter, my hope was to segment or classify packages into separate groups based on size and domain, each with its own individual index (just like various stock market indices). This effort, along with another to control for the effect of package dependencies (see below), is now on the back burner. The reason is that I’d argue we first need to address an inflationary bias that affects these data.

Inflationary Bias of Download Counts

Download counts are a popular way for developers to signal a package’s importance or quality, witness the frequent use of badges that advertise those numbers on repositories. To get those counts, cranlogs, which both adjustedcranlogs and packageRank among others rely on, computes the number of entries in RStudio’s download logs for a given package.

Putting aside the possibility that the logs themselves may not be representative of R users in general, this strategy would be perfectly sensible. Unfortunately, three objections can be made against the assumed equivalence of download counts and the number of log entries.

The first is that package updates inflate download counts. Based on my reading of the source code and documentation, the removal of downloads due to these updates is what motivates the adjustedcranlogs package. However, why updates require removal (the “adjustment” is either downward or zero) is not obvious. Both package updates (existing users) and new installations (new users) would be of interest to developers (arguably both reflect interest in a package). For this reason, I’m not entirely convinced that package updates are a source of “inflation” for download counts.

The second is that package dependencies inflate download counts. The problem, in a nutshell, is that when a user chooses to download a package, they do not choose to download all the supporting, upstream packages (i.e., package dependencies) that are downloaded along with the chosen package. To me, this is the elephant-in-the-room of download count inflation (and one reason why cranlogs::cran_top_downloads() returns the usual suspects). This was one of the problems I was hoping to tackle with packageRank. What stopped me was the discovery of the next objection, which will be the focus of the rest of this post.

The third is that “invalid” log entries inflate download counts. I’ve found two “invalid” types: 1) downloads that are “too small” and 2) an overrepresentation of past versions. Downloads that are “too small” are, apparently, a software artifact. The overrepresentation of prior versions is a consequence of what appear to be efforts to mirror or download CRAN in its entirety. These efforts make both types of “invalid” log entries particularly problematic. Numerically, they undermine our strategy of computing package downloads by counting log entries. Conceptually, they lead us to overestimate the amount of interest in a package.

The inflationary effect of “invalid” log entries is variable. First, the greater a package’s “true” popularity (i.e., the number of “real” downloads), the lower the bias: essentially, the bias gets diluted as “real” downloads increase. Second, the greater the number of prior versions, the greater the bias: when all of CRAN is being downloaded, more versions mean more package downloads. Fortunately, we can minimize the bias by filtering out “small” downloads, and by filtering out or discounting prior versions.

Download logs

To understand this bias, you should look at actual download logs. You can access RStudio’s logs directly or by using packageRank::packageLog(). Below is the log for cholera for February 2, 2020:

packageLog(package = "cholera", date = "2020-02-02")
         date     time    size package version country ip_id
1  2020-02-02 03:25:16 4156216 cholera   0.7.0      US 10411
2  2020-02-02 04:24:41 4165122 cholera   0.7.0      CO  4144
3  2020-02-02 06:28:18 4165122 cholera   0.7.0      US   758
4  2020-02-02 07:57:22 4292917 cholera   0.7.0      ET  3242
5  2020-02-02 10:19:17 4147305 cholera   0.7.0      US  1047
6  2020-02-02 10:19:17   34821 cholera   0.7.0      US  1047
7  2020-02-02 10:19:17     539 cholera   0.7.0      US  1047
8  2020-02-02 10:55:22     539 cholera   0.2.1      US  1047
9  2020-02-02 10:55:22 3510325 cholera   0.2.1      US  1047
10 2020-02-02 10:55:22   65571 cholera   0.2.1      US  1047
11 2020-02-02 11:25:30 4151442 cholera   0.7.0      US  1047
12 2020-02-02 11:25:30     539 cholera   0.7.0      US  1047
13 2020-02-02 11:25:30   14701 cholera   0.7.0      US  1047
14 2020-02-02 14:23:57 4165122 cholera   0.7.0    <NA>     6
15 2020-02-02 14:51:10 4298412 cholera   0.7.0      US     2
16 2020-02-02 17:27:40 4297845 cholera   0.7.0      US     2
17 2020-02-02 18:44:10 4298744 cholera   0.7.0      US     2
18 2020-02-02 23:32:13   13247 cholera   0.6.0      GB    20

“Small” downloads

Entries 5 through 7 from the log above illustrate “small” downloads:

        date     time    size package version country ip_id
5 2020-02-02 10:19:17 4147305 cholera   0.7.0      US  1047
6 2020-02-02 10:19:17   34821 cholera   0.7.0      US  1047
7 2020-02-02 10:19:17     539 cholera   0.7.0      US  1047

Notice the differences in size: 4.1 MB, 35 kB and 539 B. On CRAN, the source and binary files of cholera are 4.0 and 4.1 MB *.tar.gz files. With “small” downloads, I’d argue that we end up over-counting the number of actual downloads.

While I’m unsure about the kB-sized entries (they seem to be increasing in frequency, so insights are welcome!), my current understanding is that ~500 B downloads are HTTP HEAD requests from lftp. The earliest example I’ve found goes back to “2012-10-17” (RStudio’s download logs only go back to “2012-10-01”). I’ve also noticed that, unlike the above example, “small” downloads aren’t always paired with “complete” downloads.
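
To make the three size classes concrete, the sketch below sorts entries 5 through 13 of the February 2 log (all from ip_id 1047) into ~500 B, kB-sized and full downloads. The cut-offs of 1,000 bytes and 1 MB are assumptions chosen for illustration; they are not thresholds used by packageRank.

```r
# Sizes (in bytes) of entries 5 through 13 from the log above
size <- c(4147305, 34821, 539, 539, 3510325, 65571, 4151442, 539, 14701)

# Assumed cut-offs: under 1,000 B, under 1 MB, and everything larger
size_class <- cut(size, breaks = c(0, 1000, 1e6, Inf),
                  labels = c("~500 B", "kB-sized", "full"), right = FALSE)

table(size_class)
```

Each of the three sessions from this address pairs one full download with one kB-sized and one ~500 B entry, so counting log entries would triple the tally.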

To get a sense of their frequency, I look back to October 2019 and focus on ~500 B downloads. In aggregate, these downloads account for approximately 2% of the total. While this seems modest (if 2.5 million downloads could be modest), I’d argue that there’s actually something lurking underneath. A closer look reveals that the difference between the total and filtered (without ~500 B entries) counts is greatest on the five Wednesdays.

A time series lineplot comparing total package downloads with and without ~500 byte log entries. The plot shows that the difference is greatest on Wednesdays.
Figure 11 Total Package Downloads from CRAN With and Without ~500 B Downloads: October 2019

To see what’s going on, I switch the unit of observation from download counts to the number of unique packages:

A time series lineplot showing how the exclusion of ~500 byte log entries affects the number of observed unique packages that are downloaded. From 15,000+ packages on most days to 17,000+ packages on Wednesdays plus 3 additional days.
Figure 12 Total Number of Unique Packages Downloaded from CRAN With and Without ~500 B Downloads: October 2019

Doing so, we see that on Wednesdays (+3 additional days) the total number of unique packages downloaded tops 17,000. This is significant because it exceeds the 15,000+ active packages on CRAN (see CRAN for the latest count). The only way to hit 17,000+ would be to include some, if not all, of the 2,000+ inactive packages. Based on this, I’d say that on those peak days virtually, if not literally, all CRAN packages (both active and inactive) were downloaded.

Past versions

This actually understates what’s going on. It’s not just that all packages are being downloaded but that all versions of all packages are being regularly and repeatedly downloaded. It’s these efforts, rather than downloads done for reasons of compatibility, research, or reproducibility (including the use of Docker), that lead me to argue that there’s an overrepresentation of prior versions.

As an example, see the first eight entries for cholera from the October 22, 2019 log:

packageLog(packages = "cholera", date = "2019-10-22")[1:8, ]
        date     time    size package version country  ip_id
1 2019-10-22 04:17:09 4158481 cholera   0.7.0      US 110912
2 2019-10-22 08:00:56 3797773 cholera   0.2.1      CH  24085
3 2019-10-22 08:01:06 4109048 cholera   0.3.0      UA  10526
4 2019-10-22 08:01:28 3764845 cholera   0.5.1      RU   7828
5 2019-10-22 08:01:33 4284606 cholera   0.6.5      RU  27794
6 2019-10-22 08:01:39 4275828 cholera   0.6.0      DE   6214
7 2019-10-22 08:01:43 4285678 cholera   0.4.0      RU   5721
8 2019-10-22 08:01:46 3766511 cholera   0.5.0      RU  15119

These eight entries record the download of eight different versions of cholera. A little digging with packageRank::packageHistory() reveals that the eight observed versions represent all the versions available on that day:

  Package Version       Date Repository
1 cholera   0.2.1 2017-08-10    Archive
2 cholera   0.3.0 2018-01-26    Archive
3 cholera   0.4.0 2018-04-01    Archive
4 cholera   0.5.0 2018-07-16    Archive
5 cholera   0.5.1 2018-08-15    Archive
6 cholera   0.6.0 2019-03-08    Archive
7 cholera   0.6.5 2019-06-11    Archive
8 cholera   0.7.0 2019-08-28       CRAN
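
The check itself is a simple set comparison. The sketch below uses the versions from the two listings above to compute the percentage of available versions that appear in the log; the variable names are my own.

```r
# Versions seen in the first eight log entries for October 22, 2019
observed <- c("0.7.0", "0.2.1", "0.3.0", "0.5.1",
              "0.6.5", "0.6.0", "0.4.0", "0.5.0")

# Versions available on that day (Archive plus CRAN)
available <- c("0.2.1", "0.3.0", "0.4.0", "0.5.0",
               "0.5.1", "0.6.0", "0.6.5", "0.7.0")

# Percent of available versions that were downloaded
pct_downloaded <- 100 * mean(available %in% observed)
pct_downloaded

# Any version that was available but never downloaded
setdiff(available, observed)
```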

Showing that all versions of all packages are being downloaded is not as easy as showing the effect of “small” downloads. For this post, I’ll rely on a random sample of 100 active and 100 inactive packages.

The graph below plots the percent of versions downloaded for each day in October 2019 (IDs 1-100 are active packages; IDs 101-200 are inactive packages). On the five Wednesdays (+ 3 additional days), there’s a horizontal line at 100% that indicates that all versions of the packages in the sample were downloaded.

A time series lineplot with multiple window frames, one for each of the 31 days in October 2019, showing that on Wednesdays plus 3 additional days, all versions of all packages are downloaded.
Figure 13 Percent of Package-Versions Downloaded for 100 Active & 100 Inactive Packages: October 2019


To minimize this bias, we could filter out “small” downloads and past versions. Filtering out 500 B downloads is simple and straightforward (packageRank() and packageLog() already include this functionality). My understanding is that there may be plans to do this in cranlogs as well. Filtering out the other “small” downloads is a bit more involved because you’d need the size of a “valid” download. Filtering out previous versions is more complicated. You’d not only need to know the current version, you’d probably also want a way to discount rather than to simply exclude previous version(s). This is especially true when a package update occurs.
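
As a rough sketch of both filters, the code below applies them to the 18 entries of the February 2, 2020 log for cholera shown earlier. It makes two simplifying assumptions: “small” means under 1,000 bytes (so the kB-sized entries are kept), and past versions are excluded outright rather than discounted. This is not how packageRank() or packageLog() implement their filters.

```r
# Size (bytes) and version of the 18 entries in the February 2, 2020 log
cholera_log <- data.frame(
  size = c(4156216, 4165122, 4165122, 4292917, 4147305, 34821, 539, 539,
           3510325, 65571, 4151442, 539, 14701, 4165122, 4298412, 4297845,
           4298744, 13247),
  version = c(rep("0.7.0", 7), rep("0.2.1", 3), rep("0.7.0", 7), "0.6.0"),
  stringsAsFactors = FALSE
)

current_version <- "0.7.0"  # the version on CRAN that day

not_small <- cholera_log$size >= 1000              # assumed ~500 B cut-off
is_current <- cholera_log$version == current_version

n_all <- nrow(cholera_log)
n_no_small <- sum(not_small)
n_filtered <- sum(not_small & is_current)

c(all = n_all, no_small = n_no_small, filtered = n_filtered)

# Percent by which the unfiltered count exceeds the filtered count
100 * (n_all - n_filtered) / n_filtered
```

Note that the two kB-sized entries for version 0.7.0 survive both filters, which is why a size threshold based on the size of a “valid” download would be needed to catch them.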


Should you be worried about this inflationary bias? In general, I think the answer is yes. For most users, the goal is to estimate interest in R packages, not to estimate traffic to CRAN. To that end, “cleaner” data, which adjust download counts to exclude “invalid” log entries, should be welcome.

That said, how much you should worry depends on what you’re trying to do and which package you’re interested in. The bias works in variable, unequal fashion. It’s a function of a package’s “popularity” (i.e., the number of “valid” downloads) and the number of prior versions. A package with more “real” downloads will be less affected than one with fewer “real” downloads because the bias gets diluted (typically, “real” interest is greater than “artificial” interest). A package with more versions will be more affected because, if CRAN in its entirety is being downloaded, a package with more versions will record more downloads than one with fewer versions.
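
A toy model makes the two effects easy to see. It assumes that each full sweep of CRAN downloads every version of a package exactly once, so “artificial” downloads equal versions times sweeps. The function inflation_pct() and all of the input numbers are hypothetical; only the 8 sweeps echo the eight peak days observed in October 2019.

```r
# Toy model: percent inflation = artificial downloads / real downloads
# (assumes one download of every version per full sweep of CRAN)
inflation_pct <- function(real, versions, sweeps) {
  artificial <- versions * sweeps
  100 * artificial / real
}

niche <- inflation_pct(real = 300, versions = 8, sweeps = 8)
popular <- inflation_pct(real = 1e6, versions = 8, sweeps = 8)
many_versions <- inflation_pct(real = 300, versions = 92, sweeps = 8)

# Same versions, more real downloads: the bias is diluted
popular < niche

# Same real downloads, more versions: the bias grows
many_versions > niche
```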


Popularity

To illustrate the effect of popularity, I compare ggplot2 and cholera for October 2019. With one million plus downloads, ~500 B entries inflate the download count for ggplot2 by 2%:

A time series lineplot comparing downloads with and without ~500 byte log entries for a popular package, ggplot2. The plot shows that the inflation is 2%.
Figure 14 Effect of ~500 B Downloads on Download Counts on a Popular Package: October 2019

With under 400 downloads, ~500 B entries inflate the download count for cholera by 25%:

A time series lineplot comparing downloads with and without ~500 byte log entries for a less popular package, cholera. The plot shows that the inflation is 25%.
Figure 15 Effect of ~500 B Downloads on Download Counts on a Less Popular Package: October 2019

Number of versions

To illustrate the effect of the number of versions, I compare cholera, an active package with 8 versions, and ‘VR’, an inactive package last updated in 2009, with 92 versions. In both cases, I filter out all downloads except for those of the most recent version.

With cholera, past versions inflate the download count by 27%:

A time series lineplot comparing downloads with all versions versus downloads with just the most recent version for a package, cholera, with few versions. The plot shows that the inflation is 27%.
Figure 16 Effect of the Number of Prior Versions on Download Counts for a Package with Few Versions: October 2019

With ‘VR’, past versions inflate the download count by 7,500%:

A time series lineplot comparing downloads with all versions versus downloads with just the most recent version for a package, VR, with many versions. The plot shows that the inflation is 7,500%.
Figure 17 Effect of the Number of Past Versions on Download Counts for a Package with Many Versions: October 2019

Popularity & number of versions

To illustrate the joint effect of both ~500 B downloads and previous versions, I again use cholera. Here, we see that the joint effect of both biases inflates the download count by 31%:

A time series lineplot comparing all downloads versus downloads without ~500 byte log entries and previous versions for the cholera package. The plot shows that the inflation is 31%.
Figure 18 Effect of ~500 B Downloads and Number of Past Versions on Download Counts: October 2019

OLS estimate

Even though the bias is pretty mechanical and deterministic, to show that the examples above are not idiosyncratic, I conclude with a back-of-the-envelope estimate of the joint, simultaneous effect of popularity (unfiltered downloads) and version count (total number of versions) on total bias (the percent change in download counts after filtering out ~500 B downloads and prior versions).

I use the above sample of 100 active and 100 inactive packages as the data. I fit an ordinary least squares (OLS) linear model using the base 10 logarithm for the three variables. To control for interaction between popularity and number of versions (i.e., popular packages tend to have many versions; packages with many versions tend to attract more downloads), I include a multiplicative term between the two variables. The results are below:

lm(formula = bias ~ popularity + versions + popularity * versions,
    data =

Residuals:
     Min       1Q   Median       3Q      Max
-0.50028 -0.12810 -0.03428  0.08074  1.09940

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)          2.99344    0.04769  62.768   <2e-16 ***
popularity          -0.92101    0.02471 -37.280   <2e-16 ***
versions             0.98727    0.07625  12.948   <2e-16 ***
popularity:versions  0.05918    0.03356   1.763   0.0794 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2188 on 195 degrees of freedom
Multiple R-squared:  0.9567,	Adjusted R-squared:  0.956
F-statistic:  1435 on 3 and 195 DF,  p-value: < 2.2e-16

The hope is that I at least get the signs right. That is, the signs of the coefficients in the fitted model (the “Estimate” column in the table above) should match the effects described above: 1) a negative sign for “popularity”, implying that greater popularity is associated with lower bias, and 2) a positive sign for “versions”, implying that a greater number of versions is associated with higher bias. For what it’s worth, the coefficients and the model itself are statistically significant at conventional levels (large t-values and small p-values for the former; a large F-statistic with a small p-value for the latter).
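
For readers who want to try the model specification, here is a sketch that fits the same formula to simulated data. The data are made up: the variables stand in for log10 values, and the generating coefficients (3, -0.9 and 1.0, with no interaction) were chosen by me to loosely echo the estimates above. It is not a replication of the analysis.

```r
set.seed(42)
n <- 200

# Simulated stand-ins for the log10 variables (NOT the post's sample)
sim <- data.frame(
  popularity = runif(n, 0, 5),
  versions = runif(n, 0, 2)
)
sim$bias <- 3 - 0.9 * sim$popularity + 1.0 * sim$versions +
  rnorm(n, sd = 0.2)

# The formula as written in the post ...
fit_long <- lm(bias ~ popularity + versions + popularity * versions,
               data = sim)
# ... and the shorthand R expands to the same terms
fit_short <- lm(bias ~ popularity * versions, data = sim)

names(coef(fit_long))
all.equal(coef(fit_long), coef(fit_short))

# Signs of the two main effects
sign(coef(fit_long)[c("popularity", "versions")])
```

In an R formula, `popularity * versions` already expands to both main effects plus their interaction, so the two specifications yield the same fit.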


This post introduces some of the functions and features of packageRank. The aim of the package is to put package download counts into context using visualization and rank percentiles. The post also describes a systematic, positive bias that affects download counts and offers some ideas about how to minimize its effect.

The package is a work-in-progress. Please submit questions, suggestions, feature requests and problems to the comments section below or to the package’s GitHub Issues. Insights about “small” downloads are particularly welcome.

From R Epidemics Consortium – RECON COVID-19 challenge – Two Paid Consultant Positions Available


Bringing the R community together against COVID-19

Originally posted on the R Epidemics Consortium blog

We are proud to announce that the R Consortium has awarded RECON a grant for $23,300 to develop the RECON COVID-19 challenge, a project aiming to centralise, organise and manage needs for analytics resources in R to support the response to COVID-19 worldwide.

These resources will allow us to expand our initial collection of GitHub issues into a user-friendly web platform that gathers R tasks reflecting the needs of different projects and groups, and that facilitates contributions from the wider R programming community.

We are looking for two consultants (see ‘We need you!’ section below), including:

  • project manager to drive the project forward
  • web developer to create the website

The RECON COVID-19 challenge in a nutshell

The RECON COVID-19 challenge aims to bring together the infectious disease modelling, epidemiology and R communities to improve analytics resources for the COVID-19 response via a website which will provide a platform to centralise, curate and update R development tasks relevant to the COVID-19 response. Similar to the Open Street Map Tasking Manager, this platform will allow potential contributors to quickly identify outstanding tasks submitted by groups involved in the response to COVID-19 and ensure that developments follow the highest scientific and technical standards.

While this project is aimed at leveraging R tools for helping to respond to COVID-19, we expect that it will lead to long-lasting developments of partnerships between the R and epidemiological communities, and that the resources developed will become key assets for supporting outbreak responses well beyond this pandemic.

For more details about the project, check out our initial project proposal.

We need you!

We are urgently looking for two consultants to get this project started. These two positions, funded by the R Consortium, are a project manager and a web developer.

You can apply to either position by sending us an email, attaching a recent CV, and a cover letter stating why you are interested in this project.

  • closing date for applications: Monday 22nd June 2020, 12:00 UTC
  • interviews: interviews will be held by members of the RECON board on Wednesday 24th June.

COVID-19 Data Forum – Co-Sponsored with Stanford Data Science Institute – Heavily Attended

By Blog

The first COVID-19 Data Forum, co-sponsored by the Stanford Data Science Institute and the R Consortium, was held May 14, 2020. The forum used Zoom as a way to connect remote specialists, present information, and conduct a Q&A session so that participants could ask questions and give opinions.

UPDATED – Full video recording here

Close to 200 people attended, watching a range of experts cover a level of detail around COVID-19 that is not available through newspapers, and asking questions covering science and policy.

From the COVID-19 Data Forum site:

The COVID-19 pandemic has challenged science and society to an unprecedented degree. Human lives and the future of our society depend on the response. That response, in turn, depends critically on data. This data must be as complete and accurate as possible; easily and flexibly accessible, and equipped to communicate effectively with decision-makers and the public.

The COVID-19 Data Forum is a project to bring together those involved with relevant data in a series of multidisciplinary online meetings discussing current resources, needed enhancements, and the potential for co-operative efforts.

Speakers (full slides for each presentation available soon)

  • Orhun Aydin, Researcher and Product Engineer, ESRI
  • Ryan Hafen, data scientist consultant with Preva Group, and adjunct assistant professor, Purdue University
  • Alison L. Hill, Research Fellow and independent principal investigator at Harvard’s Program for Evolutionary Dynamics.
  • Noam Ross, Senior Research Scientist, EcoHealth Alliance

See the COVID-19 Data Forum site for information on upcoming Data Forum series virtual events!

Stanford Data Science Institute Joins R Consortium in Sponsoring COVID-19 Data Forum – Join Us!

By Announcement, Blog

The Stanford Data Science Institute, which aims to give Stanford faculty and students the tools, skills and understanding they need to do cutting-edge research, is joining with the R Consortium to build the COVID-19 Data Forum series.

The first meeting is May 14, 12:00 pm PT, California time.

UPDATED: Full video recording available here:

The COVID-19 Data Forum series is an ongoing set of online meetings that connect multidisciplinary topic experts to focus on data-related aspects of the COVID-19 pandemic modeling process such as data access and sharing, essential data resources for modeling and how we can best support decision making. 

The first half of the meeting is a public webinar and all are welcome to attend.

Connect with R Consortium via email or on Twitter @RConsortium

COVID-19 Data Forum

Helping R Community Events Go Virtual

By Blog

The R Consortium helps provide all sorts of resources to projects, companies, and events to help build R infrastructure and expand the R community. We have given out more than $1 million in grants to developers (and it’s a good time to prep for the next Fall Grant Cycle). We give funding to events and meetups through our R User Group (RUG) program. We help fund the popular R-Ladies, which promotes diversity in the R community through meetups, mentorship and global collaboration and has 170+ groups worldwide. And there is much more.

At the Linux Foundation, we have been studying robust, scalable virtual events platforms that we can not only use for our own R Consortium events, but that we could extend as a resource to the R community. 

Here is the current state of our evaluation. We’ve covered 86 virtual event platforms, and come up with a list of 4 finalists. Since specific circumstances and goals for events will always vary, we expect that there will never be a one-size-fits-all solution.

The four finalists are: 

inXpo Intrado

Best for large events with high budgets requiring a virtual conference experience with few compromises


Best for medium to large events with smaller budgets that want to offer a 3D environment/booth experience


Best for any size event where attendee networking tools are a priority and sponsor ‘booths’ aren’t required

QiQo Chat

QiQo is best for smaller technical gatherings that don’t need all the bells and whistles of an industry event focus, a great option for developer meetings and hackathons

From our blog on the selection process (“Virtual event suggestions for open source communities”).

The good news is that for those events that can no longer safely take place in person, virtual events still offer the opportunity to connect within our communities to share valuable information and collaborate. While not as powerful as a face-to-face gathering, a variety of virtual event platforms available today offer a plethora of features that can get us as close as possible to those invaluable in-person experiences. Thanks to our community members, we’ve received suggestions for platforms and services that the events team has spent the past several weeks evaluating. 

After researching a large number of possibilities over the last few weeks, the Linux Foundation has identified four virtual event platforms (and a small-scale developer meeting tool) that could serve the variety of needs within our diverse project communities. Our goal was to determine the best options that capture as much of the real-world experience as we can in a virtual environment for virtual gatherings ranging from large to small. 

If you are considering a virtual alternative for your R community meetup or event, please contact us. We may be able to help!

Hosting a Virtual useR Meetup

By Blog

By Rachael Dempsey, Senior Enterprise Advocate at RStudio / Greater Boston useR Organizer

Last month, the Boston useR Group held our very first virtual meetup and opened this up to anyone that was interested in joining. While I wasn’t sure what to expect at first, I was so happy with the turnout and reminded again of just how great the R community is. Everyone was so friendly and appreciative of the opportunity to meet together during this time. It was awesome to see that people joined from all over the world – not just from the Boston area. We had attendees from the Netherlands, Spain, Mexico, Chile, Canada, Ireland, and I’m sure many other places!

Our event was a virtual TidyTuesday Meetup held over Zoom, which can hold up to 100 people without having to purchase the large meeting add-on. (If you’re worried about going over this limit, keep in mind that often only about half of the people who register will attend.)

This was our agenda:

  • 5:30: Introductions to useR Meetup & TidyTuesday (Rachael Dempsey & Tom Mock)
  • 5:35: Presentation #1 – Meghan Hall: “Good to Great: Making Custom Themes in ggplot2”
  • 5:50: Presentation #2 – Kevin Kent: “The science of (data science) teaching and learning”
  • 6:00: Introduction to R for Data Science Slack Channel – Jon Harmon
  • 6:05: Breakout into groups to work on TidyTuesday dataset – groups will be open for two hours but you can come and go as you want!
  • 7:30: Come back together to the Main Room for an opportunity to see a few of the examples that people would like to share

If you’re thinking of keeping your monthly event and want to host it virtually, I’ve included a few tips below:

Find someone (or multiple people) to co-host with you!

Thank you, Kevin Kent and Asmae Toumi! Kevin, a member of the Boston useR Group was originally going to be the lead for our in-person TidyTuesday meetup and posted about the meetup on Twitter, where we both met our other co-host, Asmae Toumi. Asmae then introduced us to one of our presenters, Meghan Hall. Having co-hosts not only made me feel more comfortable, but gave me a chance to bounce ideas off of someone and made it much easier to market the event to different groups of people. While I often share events on LinkedIn, Kevin and Asmae have a much bigger presence on Twitter. Aside from your own meetup group and social media, another helpful place to find potential co-hosts may be on the events thread of Instead of co-hosting, you could also just ask people if they would be willing to volunteer to help at the meetup. Thank you to Carl Howe, Jon Harmon, Josiah Parry, Meghan Hall, Priyanka Gagneja, and Tom Mock for your support. If I can help you with finding volunteers, please don’t hesitate to reach out on LinkedIn.

Have a practice session on Zoom!

The day before the event we held a practice session on Zoom to work out a few of the kinks. As we were hosting a TidyTuesday meetup, we wanted to be able to meet in smaller groups too, as we would if we were in person. I had never used Zoom breakout rooms before and wanted to test this out first. During the test, we confirmed that you can move people between breakout groups if needed. At the meetup itself, after the initial presentations, we broke out into 7 smaller groups. These groups worked well to help facilitate conversation among attendees, and being able to move people was helpful for keeping the groups even as some attendees had to leave before the end of the event.

Have a Slack Channel or a way for people to chat if they have questions

During the meetup, we used the R for Data Science Online Learning Community Slack Channel as a venue to ask questions and share examples of what people were working on. You can join this Slack channel by going to We used the channel, #chat-tidytuesday which you can find by using the search bar within Slack.

Accept that it won’t be perfect

You can practice and plan how you want things to go, but I think it’s helpful to recognize that this is the first time doing this and it’s okay if things aren’t perfect. For example, we were going to create separate breakout groups based on people’s interests and have everyone use a Google doc to indicate this at the start. While it was good in theory, we determined this would be a bit too hard to manage and complicate things so I just automatically split people up into the 7 different groups. It wasn’t perfect, but it worked!

Think about Zoom best practices

This came up in discussion during our practice call, and I think we’ve all seen recently that there can be a few bad actors out there trying to ruin open meetings. @alexlmiller shared a few tips on Twitter that I’d like to cross-post here as well.

You can start with the Main Settings on your Zoom account and do the following:

1) Disable “Join Before Host”

2) Give yourself some moderation help by enabling “Co-Host” – this lets you assign the same host controls to another person in the call

3) Change “Screen sharing” to “Host Only”

4) Disable “File Transfer”

5) Disable “Allow Removed Participants to Rejoin”

And also to make the overall experience a little nicer:

1) Disable “Play sounds when participants join or leave”

2) Enable “Mute participants upon entry”

3) Turn on “Host Video” and “Participants Video” (if you want that)

One more thing, if you want to split meeting participants into separate, smaller rooms you have to enable “Breakout Rooms”.

Market your event on social media

Once your event is posted to Meetup, share it with others through multiple channels. Maybe that’s a mix of your internal Slack channel, Twitter, your LinkedIn page and/or the “R Project Group” on LinkedIn, or wherever you prefer to connect with people online. Keep in mind that this could be a different audience than your usual meetups because it’s now accessible to people all over the world. Ask a few people to share your post too so that you can leverage their networks.

Have fun!

Reflecting back on our meetup, some of us found that with the use of Zoom breakout groups and a Slack channel our event was surprisingly more interactive than our actual in-person meetups. It was also an awesome opportunity to do something social and get together with others from the community during this crazy time. If you have any tips from your own experiences, please let me know and don’t hesitate to reach out if I can assist in any way. Hope this helps! 

R Consortium Member Esri Empowers Informed Decision-Making Around COVID-19

By Blog

Esri, international supplier of geographic information system software, web GIS and geodatabase management applications, is providing a comprehensive set of resources for researchers and others mapping the spread of the coronavirus 2019 (COVID-19) pandemic.

Esri COVID-19 Overview

From Esri: “As the situation surrounding coronavirus disease 2019 (COVID-19) continues to evolve, Esri is supporting our users and the community at large with location intelligence, geographic information system (GIS) and mapping software, services, and materials that people are using to help monitor, manage, and communicate the impact of the outbreak. Use and share these resources to help your community or organization respond effectively.”

The site provides 

  • GIS Help
  • Access GIS Resources: COVID-19 GIS Hub
  • View global maps and dashboards
  • Get insights – View reliable, up-to-date content related to COVID-19 from trusted sources.

From Esri: As global communities and businesses seek to respond to the COVID-19 pandemic, you can take these five proactive steps to create an instant picture of your organization’s risk areas and response capacity.

Step 1

Map the cases

Map confirmed and active cases, fatalities, and recoveries to identify where COVID-19 infections exist and have occurred.

Step 2

Map the spread

Time-enabled maps can reveal how infections spread over time and where you may want to target interventions.

Step 3

Map vulnerable populations

COVID-19 disproportionally impacts certain demographics such as the elderly and those with underlying health conditions. Mapping social vulnerability, age, and other factors helps you monitor the most at-risk groups and regions.

Step 4

Map your capacity

Map facilities, employees or citizens, medical resources, equipment, goods, and services to understand and respond to current and potential impacts of COVID-19.

Step 5

Communicate with maps

Use interactive web maps, dashboard apps, and story maps to help rapidly communicate your situation.

Community of Bioinformatics Software Developers (CDSB): The story of a diversity and outreach hotspot in Mexico that hopes to empower local R developers

By Blog

By Leonardo Collado Torres, Ph. D., Research Scientist, Lieber Institute for Brain Development, Brain genomics #rstats coder working w/ @andrewejaffe @LieberInstitute. @lcgunam @jhubiostat @jtleek alumni. @LIBDrstats @CDSBMexico co-founder

I have been attending R conferences since 2008, and while I’ve seen the R community grow rapidly, I generally don’t encounter as many Latin Americans (LatAm) among communities of R developers. Traditionally, a lab lead investigator invested in R or Bioconductor would teach their trainees and students these skills, becoming a local R hotspot. However, that scenario is uncommon in Mexico for several reasons. Recognizing some of these challenges and driven to promote R in our home country and LatAm, in 2017 Alejandro Reyes and I teamed up with Alejandra Medina Rivera and Heladia Salgado to eventually launch the Community of Bioinformatics Software Developers CDSB (in Spanish) in 2018. One of our goals is to facilitate and encourage the transition from R user to R/Bioconductor developer. We have organized yearly one-week long workshops together with NNB-UNAM and RMB and just announced our 2020 workshop (August 3-7 2020 Cuernavaca, Mexico). 

Now onto our third workshop, I feel like we’ve had several success stories.

We have greatly benefited from the logistics and organization support by NNB-UNAM and RMB local teams, allowing us to focus on designing the workshop curriculum and inviting a diverse set of instructors, including Maria Teresa Ortiz who is an RLadiesCDMX co-founder and has been supporting us from the beginning. However, we face economic challenges as the budget for the national science foundation (CONACyT) has decreased in recent years. The support by the small R conference fund by R Consortium and other sponsors has been instrumental, as well as diversity and travel scholarships some of our instructors have secured at R conferences. We just recently revamped our sponsor page and answered the question: why should you support us?

However, while we are just getting started, one of our highlights was born from rOpenSci’s icebreaker exercise at CDSB2019. We were able to really build a sense of community and a desire to perform outreach activities in our local communities. In particular, a CDSB2018 and CDSB2019 alumna, Joselyn Chávez, volunteered to join the CDSB board. At CDSB2019 we also created an #rladies channel in our Slack, where at the time we had members of 3 of Mexico’s 4 RLadies chapters (Qro, Xalapa, CDMX) and now have 5 of 6 (adding Cuerna and Monterrey), as CDSB2018 and CDSB2019 alumni have been co-founders of two chapters: Ana Beatriz Villaseñor-Altamirano for Qro and Joselyn Chávez for Cuerna.

I am proud of and excited about what we have achieved with our one-week long CDSB workshops, but also about how we have used the tools we’ve learnt from other communities to keep interacting and communicating throughout the rest of the year. Time will tell if our efforts created a ripple that grew into a wave or if we’ll end up burning out. Sustainability is a challenge, but we are greatly motivated by the impact we’ve had and can only imagine a brighter future.

March 2020 ISC call for proposals – Now Open!

By Announcement, Blog

The March 2020 ISC Call for Proposals is now open. Once again, we are looking for ambitious projects that will contribute to the infrastructure of the R ecosystem and benefit large sections of the R community.  Our goal is to stimulate creativity and help you turn good ideas into tangible benefits. 

It is very likely that everyone who reads this post will be reorganizing aspects of their everyday lives to cope with the challenge of the Covid-19 virus. Accordingly, we are suggesting a theme for this call for proposals: What can we do to improve the R infrastructure for locating, accessing, cleaning and reporting on data related to the epidemic that will be useful now and in the future?

In the recently published post COVID-19 epidemiology with R, researcher Tim Churches highlights some of the challenges of acquiring accurate “real time” data. These include locating sources, writing code to scrape Wikipedia (a site whose structure may change every time it is updated), digging out data embedded in multiple different languages, and providing mechanisms for researchers to store data, share code and exchange ideas.

But don’t be constrained by the theme. There is other work that needs to be done and we want to hear about ideas that we may be able to facilitate.

As always, “Think Big” but structure your proposal with intermediate milestones. The ISC is not likely to fund proposals that ask for large initial cash grants. We tend to be conservative with initial grants, preferring projects structured in such a way that significant initial milestones can be achieved with modest amounts of cash.

As with any proposed project, the more detailed and credible the project plan, and the better the track record of the project team, the higher the likelihood of receiving funding. Please be sure that your proposal includes measurable objectives, intermediate milestones, a list of all team members who will be contributing work and a detailed accounting of how the grant money will be spent.

To submit a proposal for ISC funding, read the Call for Proposals page and submit a self-contained pdf using the online form. You should receive confirmation within 24 hours.

The deadline for submitting a proposal is midnight, April 2, 2020.

R Consortium Welcomes New Member ThinkR, R Language and Data Science Engineering Company

By Announcement, Blog

Services include R consulting, development, and training; contributes to multiple R open source projects including golem, a framework for building robust Shiny apps

SAN FRANCISCO, March 3, 2020 – The R Consortium, a Linux Foundation project supporting the R Foundation and R community, today announced that ThinkR has joined the R Consortium as a Silver Member. ThinkR provides R engineering, training, and consulting, and is based in France. 

“We provide R Language infrastructure, engineering and training to our clients, and at the same time we believe it is important to give back to the R community by participating in open source projects, holding meetups and training, and promoting R in many ways. Joining the R Consortium will help us to expand our support for R even more, and allow us to work toward building better R infrastructure that helps R developers and our customers,” said Diane Beldame, CEO, ThinkR. “Joining the R Consortium will allow us to better support and promote the R community and that is a big benefit for our clients.”

ThinkR developers devote a part of their time to R and Data Science communities. This includes supporting various R packages on GitHub, holding meetups and other conferences connected to R, posting development tips on the ThinkR blog, and responding on Stack Overflow and in Slack communities.

“We are excited to welcome ThinkR to the R Consortium. ThinkR is on the front lines of providing R to industries in ways that immediately contribute to their customers’ success,” said Joseph Rickert, RStudio’s R Community Ambassador and R Consortium Board Chair. “At the same time, ThinkR contributes to the R community with open source projects and much more, and we’re very pleased they will be involved in moving the R Consortium forward.”

ThinkR has clients in a wide range of industries including public institutions, Pharmaceutical, Energy, Banking, Electronics Manufacturing, Research, and more. 

ThinkR Resources

About The R Consortium 

The R Consortium is a 501(c)6 nonprofit organization and Linux Foundation project dedicated to the support and growth of the R user community. The R Consortium provides support to the R Foundation and to the greater R Community for projects that assist R package developers, provide documentation and training, facilitate the growth of the R Community and promote the use of the R language. For more information about R Consortium, please visit:

About Linux Foundation 

Founded in 2000, the Linux Foundation is supported by more than 1,000 members and is the world’s leading home for collaboration on open source software, open standards, open data, and open hardware. Linux Foundation projects like Linux, Kubernetes, Node.js and more are considered critical to the development of the world’s most important infrastructure. Its development methodology leverages established best practices and addresses the needs of contributors, users and solution providers to create sustainable models for open collaboration. For more information, please visit us at

# # #