Mark Hornick, Author at R Consortium

Mar 27

CII Best Practices – R Package Leaderboard

By Mark Hornick Blog

Since my last post on the Core Infrastructure Initiative CII Best Practices Badge for R Packages – responding to concerns, there have been many R language projects started – and completed – on the CII Best Practices site. In this post, we recognize the R projects that have achieved the CII Best Practices – Passing level, and note that several are well on their way to achieving silver level. In all, there are more than 50 CII projects related to R packages, with the popular ggplot2 package on the cusp of joining the group below with 97% completion as of this post.

Please congratulate these package owners for their achievement. If you’re a package developer, consider adding your package to the CII Best Practices ranks, and work your way through the levels of passing, silver, and gold.

Id	Name	Description	Owner
265	madrid.air	Parse air quality data published by http://datos.madrid.es/	Ramón Novoa
1882	DBI	A database interface (DBI) definition for communication between R and RDBMSs	Kirill Müller
2011	Delaporte	Provides the probability mass, distribution, quantile, random variate generation, and method of moments parameter estimation	Avraham Adler
2022	lamw	Calculates the real-valued branches of the Lambert-W function	Avraham Adler
2033	pade	Returns the numerator and denominator when given a vector of Taylor series coefficients of sufficient length as input	Avraham Adler
2041	fixedWidth	Save fixed width files	Jeston
2053	DataExplorer	Simplified Exploratory Data Analysis	Boxuan Cui
2054	PKNCA	Perform all noncompartmental analysis (NCA) calculations for pharmacokinetic (PK) data	Bill Denney
2055	BAS	Bayesian Variable Selection and Model Averaging using Bayesian Adaptive Sampling	Merlise Clyde
2083	MortalityLaws	Fit and compare the most popular human mortality laws	Marius D. Pascariu
2135	drake	A general-purpose workflow manager for data-driven tasks in R that rebuilds intermediate data objects when their dependencies	Will Landau
2136	httptest	A Test Environment for HTTP Requests in R	Neal Richardson
468	busdater	Business dates for R	Mick Mioduszewski
2527	jtools	Summarize and visualize regressions with other helpful tools	Jacob Long

Mar 25

Package licensing and enterprise use

By Mark Hornick Blog

For enterprise users of R, licensing terms of open source software can occupy a significant share of Legal and Corporate Architecture departments’ time. In Should R Consortium Recommend CII Best Practices Badge for R Packages: Latest Survey Results, one survey topic touched on the licensing of R packages. In talking with various enterprise users of R, there were a few suggestions about how the R community could make leveraging R packages easier within enterprises, while allowing Legal and Corporate Architecture departments to get more sleep.

Getting approvals to use packages

Some of you may be familiar with the process that enterprise users of R packages go through for approvals to use R in their products. Third-party software often needs to go through legal reviews, corporate architectural reviews, security reviews, and line-of-business approvals before it can find its way into use within an enterprise or into the products that the enterprise produces.

One area of concern is the use of GPL licenses, and the potential impact they may have on proprietary software. See Why GPL still gives enterprises the jitters for more discussion. While the true impact of a given license designation (for example, GPL-2 versus GPL-3) is debated, many large organizations apply a more conservative interpretation. (Comparing license options.)

Perhaps less well known is that it’s not just the license of the package in question that matters, but also the licenses of all of its dependent packages, recursively. For example, is a GPL-3 licensed package that uses a GPL-2 licensed package validly designated?
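To make the recursive nature of this review concrete, here is a minimal sketch that walks a dependency graph and lists every license a reviewer would encounter. The package names, their licenses, and the "approved" set are all hypothetical and chosen only for illustration; the sketch does not judge whether any two licenses are legally compatible, which remains a question for your Legal department.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical metadata: package name -> (license, direct dependencies).
# None of these entries describe real packages.
PACKAGES: Dict[str, Tuple[str, List[str]]] = {
    "mypkg": ("GPL-3", ["helperA", "helperB"]),
    "helperA": ("MIT", ["core"]),
    "helperB": ("GPL-2", []),
    "core": ("LGPL-3", []),
}


def collect_licenses(root: str,
                     packages: Dict[str, Tuple[str, List[str]]]) -> Dict[str, str]:
    """Return {package: license} for root and all of its transitive dependencies."""
    seen: Dict[str, str] = {}
    stack = [root]
    while stack:
        name = stack.pop()
        if name in seen:  # already visited; also guards against dependency cycles
            continue
        license_name, deps = packages[name]
        seen[name] = license_name
        stack.extend(deps)
    return seen


def licenses_needing_review(root: str,
                            packages: Dict[str, Tuple[str, List[str]]],
                            approved: Set[str]) -> Dict[str, str]:
    """Return the packages whose license is not in the (assumed) approved set."""
    found = collect_licenses(root, packages)
    return {pkg: lic for pkg, lic in found.items() if lic not in approved}


if __name__ == "__main__":
    # Assumed enterprise policy, for illustration only.
    approved = {"MIT", "LGPL-3", "GPL-3"}
    print(licenses_needing_review("mypkg", PACKAGES, approved))
```

The point of the sketch is that a single direct dependency can pull in further packages, each of which adds its own entry to the review list.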

What can we do?

When we ask representatives of enterprises who are responsible for approving the use of third-party software what would make their job easier, a few suggestions for package authors and maintainers arise concerning licensing:

Packages should not depend on other packages that have incompatibly licensed materials
Use the most permissive license possible for your package, for example, LGPL, GPL-3 or GPL>=2, as opposed to just GPL-2
Minimize the number of dependent packages whenever possible, since each one requires its own approval process which affects adoption
Avoid using packages with more restrictive licensing terms than you intend for your package

We encourage package authors and maintainers to review their dependent packages and look for opportunities to address the suggestions above. Where possible, encourage dependent package authors and maintainers to adopt more permissive licenses as well. Where not possible, ask whether the functionality provided by the dependent package is essential.

For enterprise users of open source software, ask your Legal departments to share their concerns with developers so more informed choices can be made in the future.

Aug 16

CII Best Practices Badge for R Packages – responding to concerns

By Mark Hornick Blog, R Consortium Project

Our last post Should R Consortium Recommend CII Best Practices Badge for R Packages: Latest Survey Results summarized results from the CII Best Practices survey conducted this summer. A goal of the CII Best Practices program is to help improve open source software quality. Respondents shared several concerns to which David Wheeler, project lead for the Core Infrastructure Initiative (CII) with the Linux Foundation, and I wanted to respond.

Let’s dive in…

Concern #1: Does the CII Badge have the “correct” or “best” set of criteria?

The CII Badge criteria are the best general-purpose OSS project criteria that we, the OSS community, have developed to date. The CII Badge criteria were developed based on the experience and recommendations of many experts, previous criteria developed by various organizations, and the examination of real-world successful OSS projects. No doubt the criteria could be improved further, but the badge criteria are themselves open source and can be improved using the same process as any OSS code: simply propose changes for review!

Concern #2: Achieving a badge does not necessarily mean a given package is well-designed or well-implemented.

Many of the CII criteria can help push projects towards creating better-designed and better-implemented packages. In the “passing” level, the CII criteria include these requirements:

[warnings] requires enabling compiler warning flags or similar
[static_analysis] requires the use of at least one static code analysis tool (assuming one exists)
[test] requires a test suite, which often nudges people towards better design and implementation
[test_policy] requires that you keep adding tests, especially as new functionality comes online
[know_secure_design] requires that at least one primary developer know how to design secure software. The best practices site explains what this means under the “details” of this criterion. In summary, this criterion requires that at least one primary developer understand the 8 principles of Saltzer and Schroeder (as explained by the CII Best Practices site) and also know to (1) limit the attack surface and (2) perform input validation with whitelists. Software can be badly designed by knowledgeable people, but software is much more likely to be designed and implemented well if developers know the basics.
[know_common_errors] requires that at least one of the project’s primary developers must know of common kinds of errors that lead to vulnerabilities and at least one method to counter or mitigate each of them

Higher badge levels (“silver” and “gold”) offer even more.

It’s true that a badge doesn’t guarantee that a package is well-designed or implemented by some measure, but part of the problem is that it’s difficult to unambiguously determine if something is well-designed or well-implemented. Much depends on the purpose of the package! So instead, many criteria focus on enabling mass peer review and managing improvements, so that problems are more likely to be detected and corrected.

In short, software normally undergoes change over time. Instead of requiring that a project be perfect at one point in time, we focus on criteria that will help projects continuously improve over the long run.

Concern #3: How does the CII help to ensure the validity of self-certification, e.g., through automated tools?

We use automated tools and reject some answers that are clearly false. We require that replies be public and that there be URLs for some answers; that makes it easy for anyone to check answers. In the worst case, we can override false answers, though in practice we’ve almost never found that necessary.

Concern #4: Even if every R package had a badge, the issue of finding a needed package among over 12K packages remains.

Finding a desired package or the “best” one for a given task is largely orthogonal to improving package quality, though the two can be related. The badging process can help, because one of the criteria is “The project website MUST succinctly describe what the software does (what problem does it solve?)”. Search engines are much more effective at finding relevant packages once that kind of information is available. In addition, if using packages that state adherence to the CII criteria is important to you or your organization, the search space may be significantly reduced – at least as a starting point.

Concern #5: Can the CII criteria be streamlined to reflect only the needs of R packages, including those that are more data and documentation than code?

Our current primary approach for streamlining is to automate criteria. That said, if you have a specific idea for streamlining things further, please file an issue on GitHub here.

Concern #6: Will automated tools be available for performing at least parts of the assessment, e.g., as found in R’s devtools?

We already use automated tools to assist in completing the form. We’d rather not require people to install tools to fill in information, because that would be a barrier for some. If there are tools we aren’t using and should use, let us know!

Concern #7: A badge program could penalize developers who do not have time, money, or skills to meet the criteria, making their packages less desirable if they do not achieve a badge.

We’ve worked hard to make the badge “passing” criteria doable for single-person projects. Daniel Stenberg is the author and maintainer of cURL and libcurl, and he’s been especially influential in ensuring that the “passing” badge is doable for single-person projects. If you have no tests, cannot automatically build your software (even though it requires building), or have never run a static analysis tool of any kind, then there is some work… but it’s better for users if these are addressed.

The top “gold” level requires multiple people in a project, e.g., because the project MUST have a “bus factor” of 2 or more. That can be a challenge for developers, but it’s a big advantage for users – users would much rather depend on software where a single death doesn’t suddenly mean that there’s no one to update the software. No one is required to get the gold level, however, and there are many ways to resolve this.

Concern #8: Introducing more process comes with additional burdens for package developers, perhaps reducing overall ecosystem participation.

We’ve done our best to minimize the risk from additional burdens. We automate some answers, and that helps. We reduce the risk of duplicated evaluation processes by having a single set of criteria for all OSS. Perhaps most importantly: the criteria were developed by examining real-world successful projects, so they require actions that other projects are already doing and finding helpful.

Finally, keep in mind that getting a CII best practices badge is optional – a package author can decide if the benefits of adhering to the CII criteria outweigh the costs.

Concern #9: Is there a way to distinguish tests for validating statistical software numerical computations and statistical properties?

Sure. Naming conventions for tests are a common way to distinguish types of tests; you can also put different kinds of tests in different directories. From the badge perspective, we don’t focus on that distinction. For “passing” the key is that your project must have a general policy that as major new functionality is added to the software produced by the project, tests of that functionality should be added to an automated test suite. Passing doesn’t require a perfect test suite; instead, we require that you have a test suite and that you’re committed to improving it. Since OSS is visible to the user community, a potential user may want to examine the type and quality of tests performed. The higher-level badges do require better test suites, as you might expect.
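As a rough illustration of the naming-convention approach described above, the sketch below groups test files by a file-name prefix. The prefixes, the category labels, and the file names are hypothetical; any convention that your project applies consistently would serve the same purpose, and the CII criteria do not prescribe one.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Hypothetical convention: file-name prefix -> kind of test.
PREFIX_TO_KIND: Dict[str, str] = {
    "test-numeric-": "numerical accuracy",
    "test-stat-": "statistical properties",
}


def group_tests_by_kind(file_names: Iterable[str]) -> Dict[str, List[str]]:
    """Group test files by the first matching prefix; unmatched files are 'general'."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in sorted(file_names):
        kind = next(
            (k for prefix, k in PREFIX_TO_KIND.items() if name.startswith(prefix)),
            "general",
        )
        groups[kind].append(name)
    return dict(groups)


if __name__ == "__main__":
    files = ["test-stat-coverage.R", "test-numeric-quantile.R", "test-io.R"]
    for kind, names in group_tests_by_kind(files).items():
        print(f"{kind}: {names}")
```

A potential user browsing the repository could apply the same grouping by eye, which is the practical benefit of keeping the convention simple.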

We continue to receive valuable comments through the survey and are pleased to report that more R package authors are choosing to participate in the CII as evidenced by the surge in new R CII project entries.

Jul 26

Should R Consortium Recommend CII Best Practices Badge for R Packages: Latest Survey Results

By Mark Hornick Blog, R Consortium Project

Based on our Fall 2017 survey, where the R Consortium asked about opportunities, concerns, and issues facing the R community, the R Consortium conducted a new survey this past month to solicit feedback on using the Linux Foundation (LF) Core Infrastructure Initiative (CII) Best Practices Badge Program for R packages. The R Consortium will base its recommendation for using the CII on your feedback. Your feedback will also help us and the Linux Foundation evolve the CII with the needs of the R Community, and FLOSS projects in general, in mind.

Introduction

With over 12,000 R packages on CRAN alone, the choice of which package to use for a given task is challenging. While summary descriptions, documentation, download counts and word-of-mouth may help direct selection, a standard assessment of package quality can greatly help identify the suitability of a package for a given need – commercial, academic, or otherwise. Providing the R Community of package users an easily recognized badge indicating the level of quality achievement would make it easier for users to know the quality of a package along several dimensions. In addition, providing R package authors and maintainers a checklist of “best practices” can help guide package development and evolution, as well as help package users know what to look for in a package.

The R Consortium has been exploring the pros and cons of recommending that R package authors, contributors, and maintainers adopt the Linux Foundation (LF) Core Infrastructure Initiative (CII) “best practices” badge. This badge provides a means for Free/Libre and Open Source Software (FLOSS) projects to highlight to what extent package authors follow best software practices, while enabling individuals and enterprises to assess quickly a package’s strengths and weaknesses across a range of dimensions. The CII Best Practices Badge Program is a voluntary self-certification program, and there is no cost to submit a questionnaire and earn a badge. An easy-to-use web application guides users through the process, even automating some of the steps.

More information on the CII Best Practices Badging Program, including the criteria, is available on GitHub. Project statistics, criteria statistics, and videos are also available. The projects page shows participating projects and supports queries (e.g., you can see projects that have a passing badge).

What did we learn?

Will the CII Best Practices Badge Program provide value to the R Community’s package developers or package users? 90% of survey respondents say ‘yes’, with 77% saying it has benefit for both developers and users. Perhaps not surprisingly, 95% of respondents had never heard of the CII before, but 74% would be willing to try it. This is according to 41 respondents, 56% of whom have been developing R packages for 4 years or more, and over 60% of whom have developed two or more packages.

Of the six categories covered by the CII – licensing, documentation, change control, software quality, security, code analysis – over 55% of respondents found all criteria to be somewhat or highly beneficial. Over 80% found documentation and software quality criteria to be somewhat or highly beneficial. The details are provided in the table below.

Table: Expected degree of benefit for each CII criteria category

Using an open-ended question, we asked respondents why the CII is good for the R Community. Here is a summary of the responses. The CII…

helps users discover and select R packages that adhere to software development best practices.
shows R developers through the badge criteria what is possible or desirable for FLOSS, especially if developers do not have a software engineering background.
provides an additional degree of assurance to the user community around package quality and provides a way for developers to assert more formally that they follow such best practices.
gathers and presents lessons learned from other FLOSS projects so developers don’t need to re-discover them.
creates an incentive to adopt a consistent set of practices throughout the R ecosystem.

While respondents were generally very positive about the use of the CII, concerns did arise:

Does the CII Badge have the “correct” or “best” set of criteria?
Achieving a badge does not necessarily mean a given package is well designed or implemented.
How does the CII help to ensure the validity of self-certification, e.g., through automated tools?
Even if every R package had a badge, the issue of finding a needed package among over 12K packages remains.
Can the CII criteria be streamlined to reflect only the needs of R packages, including those that are more data and documentation than code?
Will automated tools be available for performing at least parts of the assessment, e.g., as found in R’s devtools?
A badge program could penalize developers who do not have time, money, or skills to meet the criteria, making their packages less desirable if they do not achieve a badge.
Introducing more process comes with additional burdens for package developers, perhaps reducing overall ecosystem participation.
Is there a way to distinguish tests for validating statistical software numerical computations and statistical properties?

Suggestions from the respondents on how best to take advantage of the CII Badge Program include:

The CII should be sure to reflect the existing quality criteria provided through CRAN.
Integrate the CII with CRAN or Bioconductor, e.g., display badges on respective package CRAN pages to give CII more visibility and so that users can identify more easily which package to use.
Use the CII to encourage package developers to train themselves in best practices.
Develop an automatic framework that will create/enforce all the criteria whenever possible.
Make the security criterion conditional based on what the package does. If a package never goes outside the R session, does it need a dedicated security expert?
Require packages implementing a statistical method be backed up by a peer-reviewed article.
Make it easier to recognize which criteria categories passed and by what percentage in a high level visual representation, perhaps incorporated into the badge itself.
“Not use it at all, it creates false impressions and discriminates against good domain packages in disciplines that simply use software rather than seek rewards.”
“Encourage R-Core to adopt these practices for R itself. Also, loosen the approach to LICENSE files on CRAN so as to make compliance easier.”
Keep it simple.

As you can see, there is quite a range of sentiment expressed regarding introducing such a badging program. Some concerns seem to be based on misunderstandings, for example, the badging process does not require a “dedicated security expert,” and there already is some degree of automation in the process. The R Consortium is grateful to the respondents for taking the time to provide their insightful and thoughtful responses. We will continue to work with the CII team to explore addressing the issues raised above, including clarifying misunderstandings where we can do so.

Since initiating this survey, however, multiple packages have already taken the plunge to try the CII badge program:

foghorn	R package to summarize CRAN Check Results in the Terminal	https://github.com/fmichonneau/foghorn
osrm	Shortest Paths and Travel Time from OpenStreetMap with R	https://rgeomatic.hypotheses.org/category/osrm
R_Matrix	R package for Sparse and Dense Matrix Classes and Methods A rich hierarchy of matrix classes, including triangular, symmetric, and diagonal matrices, …	http://matrix.r-forge.r-project.org
base64enc	R tools for base64 encoding	https://github.com/s-u/base64enc
ggplot2	An implementation of the Grammar of Graphics in R	https://ggplot2.tidyverse.org
covr	Test coverage reports for R	https://github.com/r-lib/covr
datastructures	Implementation of core data structures for R.	https://dirmeier.github.io/datastructures
madrid.air	R package to parse air quality data published by http://datos.madrid.es/.	https://github.com/nramon/madrid.air
pandas	Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical…	http://pandas.pydata.org
An R Package for Quick Uncertainty Intervals	ciTools is an R package that makes working with model uncertainty as easy as possible. It gives the user easy access to confidence or prediction intervals…	https://github.com/jthaman/ciTools
dodgr	Distances on Directed Graphs in R	https://ATFutures.github.io/dodgr
netReg	Network-penalized generalized linear models in R and C++.	https://dirmeier.github.io/netReg
DBI	A database interface (DBI) definition for communication between R and RDBMSs	http://dbi.r-dbi.org

If you’re a package developer, we hope you’ll join the package developers above and start your own CII Best Practices Badge. The survey will remain open to collect your feedback on the experience.