5 Optimize (To Go Fast)

Staying organized means keeping project complexity in check as it expands. Each of the checks below helps to avoid common, unnecessary frustrations. They address how to optimize the organization of your files, the speed and focus of your reports, the reliability of data access, and the readability of your code.

5.1 Optimal File-Folder Structure (Explicit)

We discussed how to optimally name data files in Section 2.1. We discussed how to write a Quarto report (a .qmd file) in Chapter 1. Should these sit in the same folder? For a small project, that can be adequate. However, it’s best to organize different kinds of files into different folders. This will help prevent confusion if and when our projects become large. It can also dramatically speed up the rendering of our reports, which is covered in the next section.

Generally, project files should be organized into the folders below:

data-raw (raw data files)
data (.R files that process raw data and create clean subsets)
R_analysis (.R files that create analytic results)

If reporting, you should also have the following:

output (outputs that will be shared or used in reports; the latter can be the final objects in R_analysis saved as .rds)
qmd (.qmd files which are Quarto reports)
Rmd (.Rmd files which are RMarkdown reports)

Optionally:

prose (references like Word documents, text files, etc.)
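
As a minimal sketch, the folders above could be created from R in one go. The project_root location is a hypothetical stand-in (a temporary folder here); in practice it would be your own project folder.

```r
# Hypothetical project root in a temporary location (an assumption for this sketch)
project_root <- file.path(tempdir(), "my-project")

# Folder names taken from the lists above
folders <- c("data-raw", "data", "R_analysis", "output", "qmd", "Rmd", "prose")

# recursive = TRUE also creates project_root itself if it is missing
for (folder in folders) {
  dir.create(file.path(project_root, folder), recursive = TRUE, showWarnings = FALSE)
}

# List the folders that now exist directly under the project root
list.dirs(project_root, full.names = FALSE, recursive = FALSE)
```

Folders you do not need (for example Rmd if you only write Quarto reports) can simply be left out of the folders vector.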

5.2 Optimal Reports (Fast Renderability)

Motivation

In Chapter 1 we processed raw data in a Quarto document, which would go inside the qmd folder. Above we are saying that raw data should be processed in .R files inside of the data folder. Why the change?

To be fair, many analysts, including me, start their R journey using Quarto (or RMarkdown). Likewise, Python users often start with Jupyter. The ability to combine prose (written text), code, and outputs is beginner-friendly. It’s not, however, always computer-friendly. If code that processes data takes a moderate time to run, the report render will take at least the same amount of time. If, instead, the code takes a short amount of time to run, because it does not process data beyond simply importing it, then the report render will be much, much faster.

Method

If we want fast reports, we want to simply import data and plots. We take out the data processing parts of our .qmd file and place them into an .R file (or multiple .R files) inside the data folder, with analysis code going into the R_analysis folder. We name these .R files with the same philosophy as in Section 2.1: phrases separated by _, words in phrases separated by -. With several .R files, we add 0-padded numbers to the front of the file name so that the files sort in the order they should be run. Here is an example:

01_merge-data.R
02_filter-data.R
03_plot-data.R
04_stat-analysis.R

Any objects that we need for our report are saved using readr::write_rds(). The first two arguments are the object to write, and the file path (where to write it). The object can, for example, be a data file we want to display as a table in our report. The file path is going to refer to the new output folder. The best way to refer to files in this folder is to use the here package. here::here() is the function, and its name comes from the idea: “Here, I’m in this project, so let’s start at the file path to our project folder, i.e. the root folder.” Knowing this, here::here() simply takes strings that point to the location of any of our project’s files. For example, here::here("output", "data_for_table.rds") will create a file path to the output folder ending with data_for_table.rds.

For example, if we have a data file we want to display as a table, it will be saved somewhere in the above files with a line like write_rds(data_for_table, here("output", "data_for_table.rds")).

Then, in our .qmd file, we can simply add data_for_table <- read_rds(here("output", "data_for_table.rds")). Either load the readr library or prepend read_rds() with readr::.

Finally, place the .qmd file into the qmd folder. Now we not only have a faster report, but our project is more organized.
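
The save-then-import steps above can be sketched end to end as follows. The contents of data_for_table are made up for illustration (a few rows of the built-in mtcars data), and the sketch assumes it is fine to create the output folder if it does not exist yet.

```r
library(readr)
library(here)

# --- In an .R script: process the data once and save the result ---
# Stand-in for real processing: the five most fuel-efficient cars in mtcars
data_for_table <- head(mtcars[order(-mtcars$mpg), c("mpg", "cyl")], 5)

# Make sure the output folder exists before writing to it
dir.create(here("output"), showWarnings = FALSE)
write_rds(data_for_table, here("output", "data_for_table.rds"))

# --- In the .qmd report: only import the saved object ---
data_for_table <- read_rds(here("output", "data_for_table.rds"))
data_for_table
```

The report now does nothing heavier than reading one .rds file, so its render time no longer depends on how long the processing takes.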

5.3 Optimal Data Access (Reliable)

It’s a great idea to review how you’re getting raw data. If they are emailed .csv or .xlsx files, and especially if they have been manually edited or downloaded, there might be a better way to access the data from the source. Accessing raw data from the source is more reliable and automatic. There’s less chance of a middle person between you and the data making a mistake, which would leave you re-running scripts with the corrections (if the mistake even gets discovered!). Also, you can get data without the limitations that come from waiting for a person to manage and deliver it.

Reviewing your strategy and improving it to allow for direct access may take more up-front work, but it will usually reduce future work. Another benefit is that learning to directly access data will up-skill you and further your career. You can learn software (data) engineering concepts like databases, APIs (Application Programming Interfaces), and more. These topics are beyond the scope of this book, but this section is here to encourage you to ask questions and learn more. A good book or large language model can help introduce you to these topics. Once you become familiar with the lingo, you can start to incorporate database concepts with the following packages:

dbplyr for lazy loading and computation (i.e. using the database server instead of your computer to hold data in memory and perform work)
DBI for reading and writing to databases
dm for data models (tables in your database and their relationships to each other)
httr2 for sending data to and receiving data from APIs
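
To give a flavor of lazy computation, here is a minimal sketch using DBI and dbplyr. It assumes the RSQLite and dbplyr packages are installed and uses an in-memory SQLite database as a stand-in for a real database server; the table name cars and the filter are made up for illustration.

```r
library(DBI)
library(dplyr)

# In-memory SQLite database standing in for a real database server (an assumption)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "cars", mtcars)

# tbl() creates a lazy reference; no rows are pulled into R yet
cars_db <- tbl(con, "cars")

# The filter and summary are translated to SQL and run by the database
mpg_by_cyl <- cars_db |>
  filter(hp > 100) |>
  group_by(cyl) |>
  summarise(mean_mpg = mean(mpg, na.rm = TRUE)) |>
  collect() # collect() brings only the small summarized result into R

dbDisconnect(con)
mpg_by_cyl
```

With a real server, only the connection line would change; the dplyr code stays the same.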

5.4 Optimal Code (Maintainable)

If you find yourself copying and pasting code, then it is time to use a function. In R, writing functions is simpler than in many other languages. Functions do not require type declarations since R is a dynamically typed language. We will go over types at the very end of the section.

Making Your Own Functions

If, for example, you need to read another set of data with a similar naming structure to Section 2.1, then a function will communicate your code more clearly. To recap, the code below imported and cleaned our multiple data files into a single data frame:

file_names <- directory_path |> list.files(pattern = "csv")
file_paths <- directory_path |> dir_ls()

split_matrix <- file_names |> str_split_fixed("[._]", n_phrases)
colnames(split_matrix) <- names_phrases

df <- split_matrix |>
  as_tibble() |>
  mutate(data = map(file_paths, read_csv))

For starters, a function is created by assigning function() to an object, and entering the inputs inside the (). The inputs of a function should be the parts of the function that change over repeated use.

sum <- function(x, y) x + y
sum(2, 2)

[1] 4

Which parts of our code change now that we want to repeat the task for a different set of files?

For one, we can imagine that we are pointing to a different folder (directory_path). So the first input can be file_path (a folder itself is a file). A new set of files might have a different number of phrases in the file names (n_phrases). So another input of our function can be n_phrases. We can also imagine that the phrases represent something different (names_phrases). Let us call another input of our function labels.

Lastly, we need to name our function. The function name should reflect its main behavior and be a verb (since functions do something). If it is hard to identify a main behavior, then it is probably best to split the function into multiple functions. Naming functions appropriately is important for readability.

Let us start by listing the behaviors of our function or functions:

Take a file path to a folder and get the names of its csv files
Get a matrix from splitting the phrases in each name
Name the columns and turn the matrix into a tibble
Where each row now represents a file, read its respective data into the tibble
The result is a nested data frame of imported data

That is a lot for one name to represent. Hence it is more prudent to separate these behaviors into multiple functions.

Let us start with the behaviors dealing with file names. The first 3 behaviors do so. A good name for the combination of these behaviors is describe_files_as_df().

describe_files_as_df <- function(file_path, n_phrases, labels) {
  file_names <- file_path |> list.files(pattern = "csv")
  split_matrix <- file_names |> str_split_fixed("[._]", n_phrases)
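  colnames(split_matrix) <- labels
  # Return a tibble with the full paths that read_csv_file_paths() expects
  split_matrix |>
    as_tibble() |>
    mutate(file_paths = file.path(file_path, file_names))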
}

The last two behaviors concern reading in nested data. Hence read_csv_file_paths():

read_csv_file_paths <- function(df) {
  df |>
    as_tibble() |>
    mutate(data = map(file_paths, read_csv))
}

Now our processing code can look like so:

gapminder_nested <- here("data-raw", "gapminder") |> 
  describe_files_as_df(n_phrases = 4, labels = c("source", "country", "date", "file_type")) |> 
  read_csv_file_paths()

other_data_nested <- here("data-raw", "other_data") |> 
  describe_files_as_df(n_phrases = 2, labels = c("source", "country")) |> 
  read_csv_file_paths()

Code like this is clear and concise. Each line acts as a single action (even if it is composed of multiple, smaller actions) and communicates that action to the reader.
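
To try the pipeline without real data, here is a self-contained sketch. It assumes describe_files_as_df() returns a tibble that includes a file_paths column, and it writes two made-up files (demo_canada.csv and demo_mexico.csv) to a temporary folder; those names and their contents are purely illustrative.

```r
library(dplyr)
library(purrr)
library(readr)
library(stringr)
library(tibble)

describe_files_as_df <- function(file_path, n_phrases, labels) {
  file_names <- list.files(file_path, pattern = "csv")
  split_matrix <- str_split_fixed(file_names, "[._]", n_phrases)
  colnames(split_matrix) <- labels
  # One row per file, plus the full path needed to read it
  split_matrix |>
    as_tibble() |>
    mutate(file_paths = file.path(file_path, file_names))
}

read_csv_file_paths <- function(df) {
  # Read each file into a list-column named data
  df |>
    mutate(data = map(file_paths, \(path) read_csv(path, show_col_types = FALSE)))
}

# Hypothetical demo files in a temporary folder
demo_dir <- file.path(tempdir(), "demo_data")
dir.create(demo_dir, showWarnings = FALSE)
write_csv(tibble(year = 2001:2002, pop = c(10, 11)), file.path(demo_dir, "demo_canada.csv"))
write_csv(tibble(year = 2001:2002, pop = c(20, 21)), file.path(demo_dir, "demo_mexico.csv"))

demo_nested <- demo_dir |>
  describe_files_as_df(n_phrases = 3, labels = c("source", "country", "file_type")) |>
  read_csv_file_paths()

demo_nested
```

Each row of demo_nested describes one file, and its data column holds that file's contents as a tibble.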

Adding Function Checks

Did you notice that n_phrases and the length of labels are the same (and should be)? What if they aren’t? We will get an error. Even if the error is clear in this situation, the chance of an incomprehensible error increases as functions become more complicated. Hence it is best to be explicit about the requirements of any functions we create. We can improve describe_files_as_df() by checking for the above. If the check fails, we alert the user of the exact problem. For the alert, we use the cli package.

describe_files_as_df <- function(file_path, n_phrases, labels) {
  if (n_phrases != length(labels)) {
    cli_abort(
      "n_phrases must be the same length as labels."
    )
  }
  
  file_names <- file_path |> list.files(pattern = "csv")
  split_matrix <- file_names |> str_split_fixed("[._]", n_phrases)
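  colnames(split_matrix) <- labels
  # Return a tibble with the full paths that read_csv_file_paths() expects
  split_matrix |>
    as_tibble() |>
    mutate(file_paths = file.path(file_path, file_names))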
}

The error will then look like this:

if (TRUE) {
  cli_abort(
    "n_phrases must be the same length as labels."
  )
}

Error:
! n_phrases must be the same length as labels.

We can improve it further with cli tricks:

describe_files_as_df <- function(file_path, n_phrases, labels) {
  if (n_phrases != length(labels)) {
    cli_abort(
      c(
        "{.var n_phrases} must be the same length as {.var labels}.", 
        "x" = "You've supplied {.var n_phrases} as {n_phrases} but {.var labels} has length {length(labels)}."
      )
    )
  }
  
  file_names <- file_path |> list.files(pattern = "csv")
  split_matrix <- file_names |> str_split_fixed("[._]", n_phrases)
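  colnames(split_matrix) <- labels
  # Return a tibble with the full paths that read_csv_file_paths() expects
  split_matrix |>
    as_tibble() |>
    mutate(file_paths = file.path(file_path, file_names))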
}

describe_files_as_df(here(), "5", c("test"))

Error in `describe_files_as_df()`:
! `n_phrases` must be the same length as `labels`.
✖ You've supplied `n_phrases` as 5 but `labels` has length 1.

Furthermore, we know file_path should be a single string, n_phrases should be a number, and labels should be a vector of strings. R is relaxed about numbers being portrayed as, for example, "5" or 5. On the other hand, we can check the types of file_path and labels using cli again, together with rlang.

describe_files_as_df <- function(file_path, n_phrases, labels) {
  if (!is_string(file_path)) {
    cli_abort(
      c(
        "{.var file_path} must be a single string.",
        "x" = "You've supplied a {.cls {class(file_path)}} vector with length {length(file_path)}."
      )
    )
  }
  
  if (!is.character(labels)) {
    cli_abort(
      c(
        "{.var labels} must be a character vector.",
        "x" = "You've supplied a {.cls {class(labels)}} vector."
      )
    )
  }
  
  if (n_phrases != length(labels)) {
    cli_abort(
      c(
        "{.var n_phrases} must be the same length as {.var labels}.", 
        "x" = "You've supplied {.var n_phrases} as {n_phrases} but {.var labels} has length {length(labels)}."
      )
    )
  }
  
  file_names <- file_path |> list.files(pattern = "csv")
  split_matrix <- file_names |> str_split_fixed("[._]", n_phrases)
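  colnames(split_matrix) <- labels
  # Return a tibble with the full paths that read_csv_file_paths() expects
  split_matrix |>
    as_tibble() |>
    mutate(file_paths = file.path(file_path, file_names))
}

describe_files_as_df(c(here(), here()), 5, 0)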

Error in `describe_files_as_df()`:
! `file_path` must be a single string.
✖ You've supplied a <character> vector with length 2.

describe_files_as_df(here(), 5, 0)

Error in `describe_files_as_df()`:
! `labels` must be a character vector.
✖ You've supplied a <numeric> vector.
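
As a quick sketch of what rlang::is_string() accepts, shown next to base R's is.character() for vectors of any length (the inputs here are made up for illustration):

```r
library(rlang)

# is_string() is TRUE only for a character vector of length one
is_string("data-raw")            # a single string
is_string(c("a", "b"))           # a character vector, but too long
is_string(5)                     # not a character vector at all

# is.character() only looks at the type, not the length
is.character(c("source", "country"))
is.character(0)
```

This is why is_string() suits file_path (exactly one path), while a type-only check suits labels (one or more names).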

We can apply the same concepts to read_csv_file_paths().

read_csv_file_paths <- function(df) {
  if (!is.data.frame(df)) {
    cli_abort(
      c(
        "{.var df} must be a data frame.",
        "x" = "You've supplied something of class {.cls {class(df)}}."
      )
    )
  }
  
  if (!"file_paths" %in% names(df)) {
    cli_abort(
      c(
        "{.var df} must contain a column named {.var file_paths}.",
        "x" = "You've supplied a data frame with columns {.var {colnames(df)}}."
      )
    )
  }
  
  df |>
    as_tibble() |>
    mutate(data = map(file_paths, read_csv))
}

read_csv_file_paths(iris)

Error in `read_csv_file_paths()`:
! `df` must contain a column named `file_paths`.
✖ You've supplied a data frame with columns `Sepal.Length`, `Sepal.Width`,
  `Petal.Length`, `Petal.Width`, and `Species`.
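
One caveat when checking classes, sketched below with the made-up objects df_plain and df_tibble: class() may return several values (a tibble has three), and a condition of length greater than one is an error inside if () in R 4.2 and later. is.data.frame() and inherits() always return a single TRUE or FALSE, so they are usually the safer choice for checks like the ones above.

```r
library(tibble)

df_plain <- data.frame(x = 1)
df_tibble <- tibble(x = 1)

class(df_plain)   # a single class
class(df_tibble)  # a tibble carries three classes

# Each of these returns exactly one logical value for both objects
is.data.frame(df_plain)
is.data.frame(df_tibble)
inherits(df_tibble, "data.frame")
```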

Comparing Your Functions to Others

When writing functions, it is important to be aware of other, already existing functions and objects.

For example, labels() is actually a base R function. If we do not overwrite labels, it prints as:

labels

function (object, ...) 
UseMethod("labels")
<bytecode: 0x00000155ed08bc40>
<environment: namespace:base>

Inside describe_files_as_df(), the name labels can refer to either our argument or the base function, depending on how it is used:

describe_files_as_df <- function(file_path, n_phrases, labels) {
  print(labels)
  print(labels(labels))
}

describe_files_as_df(labels = "test")

[1] "test"
[1] "1"

Here labels is both the string "test" and the function name (the labels part) of labels(). There’s no trouble in using both, unless it causes confusion. If you believe something like this will cause confusion, you can rename the labels argument in describe_files_as_df() as labels_vector.

Where we need to be more careful is when a function with the same name as our function already exists, for example, if we were to create our own labels() function. Another example would be if describe_files_as_df() were already a function in some library we loaded. R will not, by default, warn or error on these conflicts (attaching a package only prints a message listing the objects it masks). In this situation, it’s best to directly refer to the source of the functions coming from packages or from base R:

x <- "test"
base::labels(x)

[1] "1"

labels <- function(x) colnames(x)
labels(x)

NULL

names(x) <- "an example"
base::labels(x)

[1] "an example"

labels(x)

NULL