Enhancing R: The Vision and Impact of Jan Vitek's MaintainR Initiative

The R Consortium recently interviewed Jan Vitek, a professor at Northeastern University’s Khoury College of Computer Sciences. He specializes in programming languages, compilers, and systems. Notably, he developed one of the first real-time Java virtual machines in collaboration with Boeing, which involved writing the navigation software of a ScanEagle UAV in Java and demonstrating that it out-performed the legacy version of the system written in C++. Vitek is actively involved in the programming language community and has held multiple leadership roles, including chairing SIGPLAN. In his spare time Vitek is a cinephile with a presence on Letterboxd and is the human of a dog named Olaf.

Vitek has been working on R for a decade. He is currently working on the MaintainR 2021 project, which aims to support and update the key components of the R ecosystem. The R Consortium is funding this project.

Can you provide an overview of the MaintainR 2021 project and its main objectives?

“When does a programming language die?” is the wrong question. Languages do not die, they slowly fade into irrelevance. A language fades away when no longer deemed useful enough for people to learn it and convince their colleagues to adopt it in their work and to maintain software projects written in it. Why does this happen? It comes about when newer languages that are better or appear cooler, start to emerge. The rise of Python has shifted many machine learning users from R to Python. The success of Julia has pushed performance-sensitive users to develop new mathematical libraries in this new language. Is R fading?

The programming landscape is evolving, and R, which has been around since 1995, isn’t the newest option available. To remain relevant any complex language depends on a large ecosystem of software elements that must be maintained and fixed regularly. R is certainly complex and it has many dependencies. It relies on a core group of developers who are allowed to make changes in the key parts of the language. These developers, while, on average, being significantly younger than Joe Biden, are not getting younger.

My work focuses on trying to modernize R. I’ve been examining R from a computer science perspective for about a decade, focusing on software components such as just-in-time compilers. My group is currently in the midst of writing our third attempt at writing a compiler for R. This effort led me to bring my collaborator, Tomas Kalibera, into the R community through our projects, which sparked his desire to assist the community. This was all part of a natural extension of our research. The goal of the MaintainR project is to maintain key parts of the R environment, which are challenging for volunteers to sustain. We have not found companies willing to contribute top-notch software engineers for this maintenance effort for their own reasons—perhaps they don’t have the resources, or they’re occupied with other tasks. Thus, our effort is focused on providing the necessary maintenance to prolong R’s usefulness.

The R ecosystem is dependent on the R interpreter, the core libraries, and CRAN. Which takes the most effort to maintain and why?

Everything in this project is challenging because the components vary greatly in size and heterogeneity. The interpreter is the smallest part, which everyone relies on. Then, there’s the core library, which is about ten times larger than the interpreter and is a mix of R, C, and Fortran. Fortran isn’t as popular as it used to be, and we encounter issues when compiling it with modern compilers like LLVM. Ensuring Fortran compiles across all desired architectures and operating systems has been a persistent challenge.

We’ve also had difficulties integrating patches into LLVM and GCC for this purpose. Changes in these compilers can lead to breakages in our environment. The crown packages contain vast code—potentially 100 times more than the core library. This creates an inverted pyramid scenario where the amount of code increases as you move up the structure.

Maintaining these packages is not our direct responsibility, but we can’t ignore them. Some are crucial for the users’ satisfaction. Some packages inevitably break as the language evolves and new versions are released. Tomas often has to approach maintainers to inform them of these issues. Sometimes, they respond and agree to implement fixes, but not always. Even when a technical fix might take just half a day, it can require a full week of negotiation with a developer to accept the patch.

This social aspect of software maintenance is significant and often the most challenging part. Developers have their own priorities, and a patch that doesn’t align with their goals can be seen as disruptive. Sometimes, the delay is simply because they are slow to respond. This complex interplay of technical and social challenges is a constant part of our efforts to keep the project moving forward.

Tomas Kalibera, a member of the core R team and supported as part of the MaintainR project, has implemented CheckR, a software tool for verification of the C code linked against the R interpreter. Can you explain how CheckR improves the R ecosystem and the overall quality of packages available to R users?

My team developed a tool called CheckR, which addresses issues arising from libraries written with a substantial amount of C code. The aim is to identify potentially misbehaving C that could cause unpredictable crashes, leading end users to mistakenly believe that R itself is faulty when, in fact, the issue may stem from a poorly written library or careless usage.

CheckR processes the C code, transforms it, and an analyzer identifies points where things might go wrong. A common issue it detects involves what we call “Protect bugs.” This happens when the R code sends a value down to C, and C must “protect” this value to prevent it from being reclaimed by the garbage collector. Sometimes, developers handle this in a hasty and imprecise manner. If they make an error, the value given to C can be reclaimed and reused, leading to memory corruption—this could result in security flaws or crashes.

CheckR is a static analyzer that flags potential issues but is not always definitive. It identifies possible problems, and we return to the developers to discuss whether these should be fixed. Often, developers are skeptical about the identified issues, which can lead to extended discussions. Sometimes, these issues might never occur, but often they do, and since CheckR is used daily across our entire codebase, it automatically generates reports that help us address these vulnerabilities.

The next steps with this tool aren’t always clear-cut because we can’t predict all potential issues. For example, one persistent challenge has been how the Windows operating system encodes Unicode characters, requiring months of troubleshooting. Could we have foreseen this particular issue? Not really. It’s part of the unpredictable nature of software development, where new problems can emerge at any time.

Have you been successful in extending the life of R by having CheckR run daily and helping with the interpreter, libraries, and CRAN?

Thousands of changes have been made to the R environment over time, and while I’d like to say that these changes have definitely improved things, as a scientist, I feel the need to provide concrete evidence, which I can’t always do. However, I can confidently say that each time we identify and fix a bug, the system has one less problem. The challenge, though, is that the potential for bugs can be virtually unlimited because new code is continually being added. It’s an ongoing process, and realistically, there might never be a point where we can declare it completely done.

How has it been working with the R Consortium? Would you recommend applying for an ISC grant to other R developers?

In our case, a lot of the work we do isn’t glamorous, and most volunteers are drawn to projects where they can attach their names to something flashy. Yet, there’s a continuous stream of necessary tasks that aren’t as appealing but are essential. Without a steady source of funding, sustaining efforts like ours would be impossible. The process we follow is streamlined, the community is welcoming, and your contributions can significantly impact a large user base.

The key message here is that funding is incredibly beneficial, especially for supporting those who contribute more gradually; the return on this investment is significant. For instance, in our project, without the funding, we couldn’t have supported the work of Tomas Kalibera, and nothing would have progressed. No company was willing to employ someone full-time for this, despite it being a crucial component of the ecosystem. Being able to provide funding allows us to engage someone who might otherwise have to spend their time on other activities and only contribute to this project in their spare time. Having someone fully dedicated for even a limited period is a tremendous advantage.

About ISC Funded Projects

A major goal of the R Consortium is to strengthen and improve the infrastructure supporting the R Ecosystem. We seek to accomplish this by funding projects that will improve both technical infrastructure and social infrastructure.

Learn more

Enhancing R: The Vision and Impact of Jan Vitek’s MaintainR Initiative

About ISC Funded Projects