R's extensible community and the open-source revolution in data science


June 2024


"There was no real intention to build anything other than a toy to play around with ideas." — Robert Gentleman


Gentleman and Ihaka frequently crossed paths, sharing a penchant for "playing academic fun and games with statistical programming languages." Finally, Gentleman stopped Ihaka in the corridor with an invitation to write software together. The two huddled over the same computer — "one person typing, the other person looking over their shoulder at what they were doing, and criticizing, making suggestions." According to Ihaka, this collaboration led to a "kind of mind meld where we could pretty much complete each other's sentences."

R and R, as they came to be known, had grown frustrated with the lack of data analysis programming tools for their Macintosh computer lab. The language Scheme had become unwieldy to write because of its syntax. Meanwhile, the language S had the syntax they wanted but lacked a Scheme-like interpreter. They set out to develop a new language that used the S syntax but offered improved memory management and the ability to create variables locally inside functions rather than globally.

The Tidyverse

As the R community has grown, its needs have evolved. Individual users originally flocked to R because of its interoperability and open library of packages. However, cleaning and preparing data came to make up an increasingly large part of data scientists' jobs. Answering this call, Hadley Wickham of Posit rose to data science fame in the late 2000s and 2010s as he developed the set of packages now called the tidyverse. One such package, dplyr, which makes data munging simpler, has grown to over 1 million monthly downloads.

Perhaps not coincidentally, Wickham completed his master's degree in statistics at the University of Auckland, where Ihaka and Gentleman first met. He is, figuratively, a descendant of Ihaka's whakapapa, or genealogy. Although not Māori himself, Wickham has remained true to those statistical roots, renaming R7, R's newest object-oriented programming system, to "S7," hearkening back to the original S language developed at Bell Labs.

Arrow and Beyond

While staying true to R's roots, Posit continues to push the boundaries of R's original principle of interoperability. Wickham has launched a project at Posit with Wes McKinney, the creator of the Python data science library pandas, to help R and Python work together through Apache Arrow, a language-independent format for in-memory columnar data. As Wickham recently said, "With R, because you can combine things from different packages, that leads to fairly big impacts on the user experience and almost even how the community has to work together and form."

The journey of R from a simple tool for academic exploration to a cornerstone of data science is a testament to the power of collaboration and open-source development. Its continued popularity and flexibility prove that small but diverse and global teams can make scientific analysis with data easier, even fun. As R-Core passes the torch to the next generation of developers, users, and commercial outfits like Posit, the spirit of R continues. Today, the hallmark of R remains its extensibility. The answer to what comes next remains: anything you like.


Metaphor to Nature

In the gardening fields of Japan, cucumber mosaic virus (CMV) joins with a Y-satellite RNA (Y-sat). The leaves first develop a mosaic pattern and then a yellow tinge. The mustard color attracts a swarm of aphids. In small numbers, aphids enjoy the snack thanks to the virus and move on. When the aphids swarm in large numbers, a minuscule miracle occurs: the Y-sat that accompanies CMV promotes wing formation. The virus gives the aphids the power to fly, enabling them to survive the wild beyond.
