Adding code to an R package

version 1.1
Authors: Joshua Campbell, Rui Hong, Aaron Chevalier, Salam Alabdullatif, Christopher Husted, Yusuke Koga, Yuan Yin, Kelly Geyer
Date: May 4, 2022

1. Introduction

When developing an R package, adding code that is error free, does what you want it to do, and can be maintained in the long run by other developers can be quite cumbersome. Several online tutorials and handbooks cover various aspects of the code development process. However, a developer first needs to know what the major steps are before they can find the right resource. For example, someone who does not know about unit tests and their importance would not know to Google “unit test examples”. This article provides a broad overview of the major steps that our group uses to add code to an existing R package. It also points to other articles or tutorials that cover many of the steps in more detail. Some topics that are worth understanding as an R developer are not included here. Additionally, this article uses GitHub for the code repository but only covers some of the basic commands, so reviewing a dedicated Git tutorial is also recommended. Lastly, some of the steps listed here are optional or may vary depending on the needs of the package or development group. Overall, we hope this article will help new R developers get up to speed more quickly. If you would like to make suggestions or offer additional tips for this article, please email Joshua Campbell or tweet us @camplab1.

2. Install prerequisites

Several tools and R packages are used at different steps along the way. It is easiest to install all of these programs and dependencies at once.
Latest version of R
Install RStudio - RStudio is an integrated development environment (IDE) which helps developers write R code more quickly and efficiently. Most of the screenshots in this article are taken from RStudio.
Once you have both R and RStudio, you can install several packages that will be utilized in this tutorial:
devtools - A popular tool for development in R
roxygen2 - For easy documentation of functions in R packages
testthat - Performs unit testing in R
renv - For package version management in R
styler - Applies style to code
lintr - Analyzes source code for formatting and stylistic errors
usethis - Package for setting up a new package/project
pkgdown - Makes a website for your R package

All packages can be installed with the following command:
install.packages(c("devtools", "roxygen2", "testthat", "renv", "styler", "lintr", "usethis", "pkgdown"))
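
After installing, it can be useful to confirm that every package is actually available. The snippet below is a minimal sketch of one way to do that; the variable names (pkgs, installed, missing_pkgs) are illustrative and not part of the original workflow.

```r
# Packages used in this tutorial (same list as the install command above)
pkgs <- c("devtools", "roxygen2", "testthat", "renv",
          "styler", "lintr", "usethis", "pkgdown")

# requireNamespace() returns TRUE if a package can be loaded, without attaching it
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Report anything that still needs to be installed
missing_pkgs <- pkgs[!installed]
if (length(missing_pkgs) > 0) {
    message("Still missing: ", paste(missing_pkgs, collapse = ", "))
} else {
    message("All prerequisite packages are installed.")
}
```

Any package reported as missing can be passed to install.packages() again.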

3. Environment setup

A. Git repository

I. Fork the repo from GitHub

For this walkthrough, we will use the DevelExample repository in the campbio organization as an example. Navigate to the repository page on GitHub and click on “Fork” in the top right corner to create a copy of the repo on your own GitHub account.
Click “Create” and a fork will be created under your own GitHub account.

II. Clone the fork to your local machine


Copy the link for the repo from your GitHub page by clicking the green “Code” button in the top right and copying the URL that is shown.
Open your command line on your local system and navigate to the directory where you wish to store the code. Then clone the package by pasting the link after git clone :
git clone https://github.com/joshua-d-campbell/DevelExample.git
Note: If you are using a Windows system, you can use Git Bash to complete git clone and other Git related operations. Git Bash is installed when Git is installed to your Windows system.

III. Set up the remote repositories

You may eventually want to get new changes that others have made to the upstream repo from the original GitHub user or organization. To do this, you need to use the “git remote” command to make a link. Go back to the GitHub page of the original repo that you forked from (“campbio” in this example). Click on the green “Code” button and copy the text in the same way you did when you were cloning your own fork to your local machine. Go into the directory of the clone on your local machine and run git remote add [name_of_upstream] <url> to set up the link to the upstream repo (the remote is named “campbio” in this example):
cd DevelExample
git remote add campbio <url>
Some people just like to call the original repo “upstream” while others like to name it according to the name of the upstream group/user. You can then run git remote -v to view the list of remotes and double check that everything is set up correctly. You should see origin set to your own personal GitHub account and the new remote set to the original upstream repo, which is campbio in this example.
If your R package is in Bioconductor, you may also want to set up the Bioconductor remote:
git remote add bioc git@git.bioconductor.org:packages/celda.git
Note: You will not be able to push to the Bioconductor repo unless you have permission as a maintainer. For more information on getting permission and working with Bioconductor packages, see the Bioconductor documentation.

B. RStudio

There are many reasons to do your coding inside RStudio. It has integrated features for package development such as buttons for building and checking, an editor that highlights R code syntax, and enhanced abilities for debugging such as adding breakpoints. The steps to create an RStudio project to work with your package are outlined below:
1. Open RStudio and click New Project in the project tab on the top right part of the window.
2. If you are developing a package and have already initiated a git directory for the package, choose Existing Directory.
3. Then pick the git directory of the package.
4. After clicking “Create Project”, RStudio will create a new R session. Since you are in an R package folder, new options such as Build and Git will appear in the top right.
5. You have now created a new project in RStudio which is linked to the git directory of your package. Every time you make changes, you can use the Install and Restart button under the Build tab to install the current version to your library. You can also select Load All under the More button to load the current version into memory without actually installing it into your library.
6. To use the roxygen2 package to help with function documentation, click the Build item in the menu at the top of the window and then click “Configure Build Tools...”.
7. Click Generate documentation with Roxygen in the new window. Click “Configure...” and make sure the options are checked as in the screenshot.
[Screenshot: Roxygen options window showing the options to check]

C. Manage dependencies

One challenge of contributing code to multiple packages or performing analysis on different datasets at the same time is managing different versions of R package dependencies. If you rely on one particular version of a dependency for one project, but need to use newer or older versions in another project, it can be frustrating to continually install or re-install these different versions. One option to manage dependencies is the renv package, which allows packages to be installed in a local folder. A brief introduction on how to set up and use local renv libraries is shown below. Although it is useful for creating snapshots of package versions that work, this step is largely optional.

I. Set up new renv library

1. Install renv and set up a new local library:
install.packages("renv")
renv::init(bare = TRUE) # Need to restart R after this command
install.packages("BiocManager")
2. Set the global repos option to include packages from Bioconductor as well as CRAN.
options("repos" = BiocManager::repositories(version = "3.10"))
Use “version = 3.10” for R 3.6, “version = 3.11” for R 4.0, etc., according to the Bioconductor release that matches your version of R.
3. Install all package dependencies listed in the DESCRIPTION file. If you have any of these already installed in your global R library, they will be linked. Otherwise, renv will install the packages into the local library folder ‘renv/library’ within your package:
renv::install()
4. Create “renv.lock” file which contains package versions. If any new dependencies are added during development, then the “snapshot” command should be run again to create a new “renv.lock” file.
renv::snapshot()
5. Add these lines to the .gitignore file so Git does not keep track of the local libraries:
.Rprofile
renv/
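6. Add these lines to the .Rbuildignore file so R does not treat the renv files as part of the package. Entries in .Rbuildignore are regular expressions, so the patterns are anchored to avoid excluding unrelated files:
^renv$
^renv\.lock$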
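
Because .Rbuildignore entries are interpreted as regular expressions rather than shell-style wildcards, it can be worth checking what a pattern actually matches. The sketch below uses grepl() on a few hypothetical file paths. The helper name matches() and the paths are made up for illustration, and the Perl-style, case-insensitive matching is an assumption about how R CMD build applies the patterns.

```r
# Hypothetical file paths, relative to the package root
paths <- c("renv", "renv.lock", "R/render.R", "R/distance.R")

# Return the paths that a given .Rbuildignore-style pattern would match
# (assumption: Perl-style, case-insensitive regular expression matching)
matches <- function(pattern) {
    paths[grepl(pattern, paths, perl = TRUE, ignore.case = TRUE)]
}

matches("renv*")          # also matches "R/render.R": "v*" means zero or more "v"
matches("^renv$")         # only the renv folder
matches("^renv\\.lock$")  # only the lock file
```

An unanchored pattern such as renv* behaves like "contains ren", which is why anchoring with ^ and $ is the safer choice.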

II. Install from existing renv

If a lock file has already been generated and is present in the repo, then you can run these commands to install versions of packages known to already work with the current version of the package:
renv::init(bare = TRUE)
renv::restore(lockfile = "renv.lock")
where “renv.lock” is the name of the appropriate lock file.

4. Adding Code

The software lifecycle, or software development life cycle, refers to the complete process of designing, implementing, and maintaining code for a software package. Similarly, DevOps refers to a set of practices that combines software development (Dev) and IT operations (Ops). DevOps aims to shorten the development life cycle and provide continuous delivery with high software quality.
[Figure: summary of the steps that can be involved in DevOps]
In this section, we will only focus on the “Dev” side related to coding, building, and testing R packages with GitHub.

A. Before adding code

I. Choose the appropriate branch

Understanding how the particular repo you are working with uses different branches for different purposes is important so that you know where and when to add your code. Different organizations and repos may have different practices, so you will need to look at their documentation or ask their developers. For example, one group might just use “master” or “main” as the primary branch, and everyone pushes to and pulls from it directly all the time. In this tutorial, we make use of both a “devel” and a “master” branch. Developers push changes to the “devel” branch when making updates. Only stable versions are pushed to the “master” branch along with a version bump and a corresponding new release. For our DevelExample repo, run this command to switch to the “devel” branch if you are not already there:
git checkout devel
Or you can do this within RStudio by clicking on the Git tab in the top right and selecting devel from the drop-down box.
Note: You may also want to create your own separate branch from devel if you are making substantial changes. This can be done with the command git checkout -b <new_branch_name>. See other articles for more information on using Git branches.

II. Merge/Pull changes from upstream repos

If you just cloned the repo, then chances are that you have the latest version of the code and do not need to worry about syncing with the upstream repo. If you have been working on a package for a while along with other developers, then your local version of the package may be behind the upstream repo in the original organization (“campbio” in this example). It is often a good idea to incorporate changes that other developers have made before starting to add or change code in your local repo, as this may help reduce potential merge conflicts later on. Here are the commands to fetch and merge the latest code:
git fetch campbio
git merge campbio/devel
Similar functionality can be achieved with git pull. See other articles for more information on these Git commands and comparisons between them. If you have already started making changes to your local repo and then try to merge from an upstream repo, you may run into merge conflicts. These can be resolved with git mergetool or with a text editor. For a brief introduction to merge conflicts in Git, Google “Git merge conflicts” to find any number of tutorials/examples.

B. Adding code to the package

If you are starting a package from scratch, you can read through other tutorials about how to set up the package structure. However, this only needs to be done once, so most of the time you will be adding code to an existing package.

I. Function code

Most of the code in your R package will be enclosed within an R function. In this tutorial, we will add a function to the DevelExample package that calculates the Euclidean distance between two vectors. Below is an example of an R function that calculates this distance and another one that checks for NAs in our vectors. This code can be copied into a new file called “distance.R” in the “R” subdirectory of the package.
euclideanDist <- function(a, b, verbose = FALSE) {
    if (isTRUE(verbose)) {
        message("Calculating distance ...")
    }
    # Check validity of data
    .check_data(a)
    .check_data(b)
    # Perform calculation
    res <- sqrt(sum((a - b)^2))
    return(res)
}

.check_data <- function(input) {
    if (any(is.na(input))) {
        stop("'input' must not contain NAs")
    }
}
To test your function, you can run the command devtools::load_all(), which will load the current version of your code into the R environment without actually reinstalling the package. You can also run this by selecting Load All after clicking More under the Build tab in RStudio.
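
Once the code is loaded, a quick manual check might look like the sketch below. The two function definitions are repeated from distance.R only so that the snippet runs by itself; the input vectors and the variable names d and err are illustrative.

```r
# Definitions repeated from distance.R so this snippet is self-contained
.check_data <- function(input) {
    if (any(is.na(input))) {
        stop("'input' must not contain NAs")
    }
}

euclideanDist <- function(a, b, verbose = FALSE) {
    if (isTRUE(verbose)) {
        message("Calculating distance ...")
    }
    .check_data(a)
    .check_data(b)
    sqrt(sum((a - b)^2))
}

# A 3-4-5 right triangle: the distance between (0, 0) and (3, 4) is 5
d <- euclideanDist(c(0, 0), c(3, 4))
d

# An NA in either vector should raise an error; tryCatch() captures its message
err <- tryCatch(
    euclideanDist(c(1, NA), c(2, 3)),
    error = function(e) conditionMessage(e)
)
err
```

Manual checks like this are a starting point and do not replace unit tests.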
In this example, we actually put some of the code inside of a “dot” utility function (i.e. the function starting with a “.”). There are two reasons to split up code for your function into smaller functions. The first reason is if that chunk of code will be used or called in multiple places or from multiple functions across your package. It is generally a bad idea to have redundant code in multiple places as it is harder to maintain when changes to the code will inevitably be needed in the future. The second reason is if your function is long because it contains multiple, complex parts. If your function can be split up into smaller steps with each step coded in its own small function, this will improve the design, maintainability, and readability of the overall function. Note that these functions do not actually have to start with a period, but it is a convention that some groups like to use to help developers distinguish between the functions that they want users to see versus internal functions they use to better organize the code (i.e. exported vs non-exported functions).
Here is a brief list of some additional best practices to make your code consistent and maintainable:
Standard function naming. If you are adding to an existing repo, make sure to understand and follow the conventions for function and parameter names in that repo. Common naming conventions include camelCase (e.g. euclideanDist), snake_case (e.g. euclidean_dist), or PascalCase (e.g. EuclideanDist). Understand if there are preferences for abbreviated words or full words in function names (eucl_dist vs. euclidean_distance). It is also a good idea to document these preferences in a developer wiki.
File organization and naming. If you have a lot of functions, it is generally a good idea to split them up into multiple files. Multiple functions should only be in the same file if they are functionally related. Many packages keep separate files for the accessors and utility functions. It is also a good idea to think through the convention for the names of these files so other developers can quickly find the code they want to understand or modify.
Namespace. Always specify the namespace of each function that comes from another package in case a different package has a function with the same name (e.g. stats::anova instead of anova). You can also use import directives (e.g. the roxygen2 @importFrom tag) to import several functions at a time. When you build and check the package later, you will get warnings if you do not call functions properly. See other resources on namespaces for more information.
Accessor functions. These types of functions are a staple of object-oriented programming (OOP). Do not directly use "@" to access slots in an S4 object. Use the package-specified accessor functions. For example, do not use obj@metadata to access the metadata from an object, but use metadata(obj). This is because the locations of data may change within the object in new releases, but the accessor functions should be more stable and always return the same underlying data. See other resources on S4 accessor functions for more info.
Boolean flags. Use TRUE and FALSE and not T or F. T and F are just variables set to TRUE and FALSE that can be changed. When checking Boolean flags in “if” statements, use the isTRUE() function like this: if (isTRUE(flag)) {...}. For example, if (1) will be evaluated as TRUE and run, whereas if (isTRUE(1)) will be evaluated as FALSE and not run.
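
The Boolean flag advice above can be illustrated with a short sketch. The variable names (flag, ran_plain, ran_strict, t_was_overwritten) are made up for this example.

```r
# T is an ordinary variable that defaults to TRUE, so it can be overwritten
T <- FALSE
t_was_overwritten <- identical(T, FALSE)  # TRUE: T no longer means TRUE here
rm(T)                                     # remove our copy; T resolves to base's TRUE again

# A numeric "flag" behaves differently with and without isTRUE()
flag <- 1

ran_plain <- FALSE
if (flag) {
    ran_plain <- TRUE    # runs: 1 is coerced to TRUE
}

ran_strict <- FALSE
if (isTRUE(flag)) {
    ran_strict <- TRUE   # skipped: isTRUE(1) is FALSE because 1 is not a logical
}
```

Here ran_plain ends up TRUE while ran_strict stays FALSE, which is the stricter behavior you usually want for a flag parameter.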

II. Function documentation

Each function should be fully documented, including title, description, parameters, return value, and examples. The documentation may also include additional sections for details and “see also”. Other tutorials give a more complete description of the different documentation elements with examples. Here are our functions that calculate Euclidean distance with documentation added:
#' @title Euclidean distance
#' @description Calculates Euclidean distance between two vectors. An error will be
#' given if NAs are present in either vector.
#'
#' @param a The first vector to use in the distance calculation.
#' @param b The second vector to use in the distance calculation.
#' @param verbose Boolean. If \code{TRUE}, a message will be printed. Default \code{FALSE}.
#' @return A numeric value of a distance
#' @examples
#' euclideanDist(c(1, 2), c(2, 3), verbose = FALSE)
#' @export
euclideanDist <- function(a, b, verbose = FALSE) {
    if (isTRUE(verbose)) {
        message("Calculating distance ...")
    }
    # Check validity of data
    .check_data(a)
    .check_data(b)
    # Perform calculation
    res <- sqrt(sum((a - b)^2))
    return(res)
}

.check_data <- function(input) {
    if (any(is.na(input))) {
        stop("'input' must not contain NAs")
    }
}
Once the function documentation has been written, you can run devtools::document() to write/update the man .Rd files and the NAMESPACE file. You can also use the shortcut Shift + Ctrl/Cmd + D or select Document after clicking More under the Build tab in RStudio.
Remember that only functions with the @export tag will be visible to the user. You can preview documentation with ?functionName (?euclideanDist in this example) and then make modifications as needed.

III. Example data

Many function examples will need to run on some sort of data. Sometimes it is more efficient to save a small dataset within the R package, especially if it can be used as the input for several examples. R packages can include data in a few different ways, which are described in the data chapter of the R Packages book. We will use the first way, which stores example data in the data/ folder. First we run the following code to set up the data-raw folder (if it has not been set up already):
usethis::use_data_raw(name = "example_data")
This will also create a file in the data-raw folder called example_data.R, which we can use to store the code that creates the example dataset. Note that this folder is added to the .Rbuildignore file, so it (and all of the files within it) will be included in our GitHub repo but not in the bundled version of the package.
Here is the code we put in the file and then run to create and save an example dataset with two vectors:
## code to prepare `example_data` dataset goes here
set.seed(123)
a <- rnorm(100)
b <- rnorm(100)
example_data <- cbind(a, b)
usethis::use_data(example_data, overwrite = TRUE)
Remember to actually run the file so that the data is generated. You can do this by sourcing the file as follows:
source("path/to/example_data.R")
All data objects must also be documented. In order to document this example dataset, we can create a file called data.R in the R subdirectory with the following code:
#' Example dataset
#'
#' A dataset containing a matrix with two columns that were generated
#' with a random normal distribution with a mean of 0 and stdev of 1.
#'
#' @format A matrix with 100 rows and 2 columns
#' @keywords datasets
#' @usage data("example_data")
#' @examples
#' data("example_data")
"example_data"

Also include a @source tag if you retrieved the data from an outside database or website.

We can now update the original @examples code in our function euclideanDist to use this example dataset instead:
#' @examples
#' data(example_data)
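
As a standalone illustration of how the example dataset might be passed to euclideanDist, here is a minimal sketch. It is not the original example code. It assumes the two columns of the matrix are the two input vectors, rebuilds the matrix in memory with the same code as data-raw/example_data.R instead of loading it with data(), and repeats the function definitions so that it runs by itself.

```r
# Definitions repeated from distance.R so this snippet is self-contained
.check_data <- function(input) {
    if (any(is.na(input))) {
        stop("'input' must not contain NAs")
    }
}

euclideanDist <- function(a, b, verbose = FALSE) {
    if (isTRUE(verbose)) {
        message("Calculating distance ...")
    }
    .check_data(a)
    .check_data(b)
    sqrt(sum((a - b)^2))
}

# Rebuild the example matrix exactly as in data-raw/example_data.R
set.seed(123)
a <- rnorm(100)
b <- rnorm(100)
example_data <- cbind(a, b)

# Assumption: column "a" and column "b" are the two vectors to compare
d <- euclideanDist(example_data[, "a"], example_data[, "b"])
d
```

Inside the package, the data(example_data) call would replace the block that rebuilds the matrix.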