bulkreadr: Fast, Reproducible Bulk Data Import for R-Based Research Workflows
R/Medicine 2026 Talk
About the Package
bulkreadr is a comprehensive R package designed to simplify and accelerate reading and processing large volumes of data. It loads data from multiple sources, such as Microsoft Excel, Google Sheets, CSV files, SPSS, and Stata. It efficiently reads multiple sheets from workbooks and handles bulk file imports, combining the data into organized data frames for further analysis and manipulation. Whether you are working with a single large dataset or many files in different formats, bulkreadr streamlines the data import process, making it faster, more reproducible, and less error-prone.
For installation and repository details, visit the bulkreadr GitHub repository: https://github.com/gbganalyst/bulkreadr.
How to Load the Package
Before using bulkreadr, ensure the package is installed. Then, load it into your R session:
library(bulkreadr)
library(tidyverse)
Function Overview
This section provides a concise overview of the functions available in the bulkreadr package. They serve various purposes, all designed to handle importing data in bulk.
Reading Excel Workbooks
The function read_excel_workbook() reads all sheets from an Excel workbook and appends them into a single data frame for analysis.
# Load Excel workbook and read all sheets
path <- "data/Diamonds.xlsx"
diamond_data <- read_excel_workbook(path = path)
diamond_data |> glimpse()
#> Rows: 260
#> Columns: 9
#> $ carat <dbl> 2.00, 0.70, 1.51, 0.70, 0.35, 0.50, 1.00, 1.09, 0.98, 0.70, 2.…
#> $ color <chr> "I", "H", "E", "D", "F", "E", "E", "J", "H", "F", "F", "E", "F…
#> $ clarity <chr> "SI1", "SI1", "SI1", "SI2", "VVS1", "VS2", "SI1", "VS2", "SI2"…
#> $ depth <dbl> 65.9, 65.2, 58.4, 65.5, 54.6, 64.9, 65.1, 64.6, 67.9, 65.3, 66…
#> $ table <dbl> 60, 58, 70, 57, 59, 56, 61, 58, 60, 54, 57, 62, 59, 68, 53, 58…
#> $ price <dbl> 13764, 2048, 11102, 1806, 1011, 1397, 4435, 3443, 2777, 1974, …
#> $ x <dbl> 7.80, 5.49, 7.55, 5.56, 4.85, 5.01, 6.15, 6.48, 6.05, 5.58, 7.…
#> $ y <dbl> 7.73, 5.55, 7.39, 5.43, 4.79, 4.95, 6.08, 6.41, 5.97, 5.54, 7.…
#> $ z <dbl> 5.12, 3.60, 4.36, 3.60, 2.63, 3.23, 3.98, 4.16, 4.08, 3.63, 5.…
# Load Excel workbook and read all sheets with sheet names as an identifier
diamond_data <- read_excel_workbook(path = path, .id = "sheet_name")
diamond_data |> view()
Importing Multiple Excel Files from a Directory
For directories containing multiple Excel files, read_excel_files_from_dir() reads every workbook in the given directory and appends them into a single data frame.
# Import Excel files from the directory
diamond_combined <- read_excel_files_from_dir("data/xlsxfolder")
diamond_combined |> view()
Handling CSV Files in Bulk
The function read_csv_files_from_dir() loads all CSV files from a directory, making it easy to work with large amounts of data stored across multiple files.
# Import CSV files from directory
read_csv_files_from_dir("data/csvfolder", .id = "location")
Google Sheets Integration
The read_gsheets() function enables users to read and combine data from multiple Google Sheets, simplifying the process of importing distributed data from cloud sources.
# Specify the Google Sheets URL
url <- "https://docs.google.com/spreadsheets/d/1izO0mHu3L9AMySQUXGDn9GPs1n-VwGFSEoAKGhqVQh0"
# Browse to see how the data in Google Sheets looks
googlesheets4::gs4_browse(url)
# Import all sheets from the Google Sheet
diamond_gsheet <- read_gsheets(ss = url, .id = "data_sheet")
diamond_gsheet |> glimpse()
#> Rows: 260
#> Columns: 10
#> $ data_sheet <chr> "Fair", "Fair", "Fair", "Fair", "Fair", "Fair", "Fair", "Fa…
#> $ carat <dbl> 2.00, 0.70, 1.51, 0.70, 0.35, 0.50, 1.00, 1.09, 0.98, 0.70,…
#> $ color <chr> "I", "H", "E", "D", "F", "E", "E", "J", "H", "F", "F", "E",…
#> $ clarity <chr> "SI1", "SI1", "SI1", "SI2", "VVS1", "VS2", "SI1", "VS2", "S…
#> $ depth <dbl> 65.9, 65.2, 58.4, 65.5, 54.6, 64.9, 65.1, 64.6, 67.9, 65.3,…
#> $ table <dbl> 60, 58, 70, 57, 59, 56, 61, 58, 60, 54, 57, 62, 59, 68, 53,…
#> $ price <dbl> 13764, 2048, 11102, 1806, 1011, 1397, 4435, 3443, 2777, 197…
#> $ x <dbl> 7.80, 5.49, 7.55, 5.56, 4.85, 5.01, 6.15, 6.48, 6.05, 5.58,…
#> $ y <dbl> 7.73, 5.55, 7.39, 5.43, 4.79, 4.95, 6.08, 6.41, 5.97, 5.54,…
#> $ z          <dbl> 5.12, 3.60, 4.36, 3.60, 2.63, 3.23, 3.98, 4.16, 4.08, 3.63,…
Labelled Data Import
read_spss_data(): Imports SPSS data files
read_stata_data(): Imports Stata data files
Both functions import labelled data and convert labelled variables into factors, which is essential for effective data manipulation and analysis in R.
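To illustrate what that conversion means, here is a minimal base-R sketch of turning coded values plus value labels into a factor (the codes and labels below are hypothetical, not bulkreadr's internals):

```r
# A coded variable as it might arrive from SPSS or Stata: 0/1 plus value labels
sex_codes <- c(0, 0, 1, 0, 1)

# Converting the codes to a factor attaches the human-readable labels
sex <- factor(sex_codes, levels = c(0, 1), labels = c("Male", "Female"))
levels(sex)
#> [1] "Male"   "Female"
```

read_spss_data() and read_stata_data() perform this step for you on every labelled variable in the file.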
# Using haven to read SPSS data
spss_file_path <- "data/wages.sav"
wages_data <- haven::read_sav(file = spss_file_path)
wages_data |> glimpse()
#> Rows: 400
#> Columns: 9
#> $ id <dbl> 3, 4, 5, 12, 13, 14, 17, 20, 21, 23, 25, 28, 30, 31, 32, 36, 38,…
#> $ educ <dbl> 12, 13, 10, 9, 9, 12, 11, 12, 11, 6, 10, 8, 12, 12, 12, 14, 13, …
#> $ south <dbl+lbl> 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, …
#> $ sex <dbl+lbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ exper <dbl> 17, 9, 27, 30, 29, 37, 16, 9, 14, 45, 30, 19, 36, 20, 35, 14, 8,…
#> $ wage <dbl> 7.50, 13.07, 4.45, 6.25, 19.98, 7.30, 3.65, 3.75, 4.50, 5.75, 6.…
#> $ occup <dbl+lbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, …
#> $ marr <dbl+lbl> 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, …
#> $ ed    <dbl+lbl> 2, 3, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 3, 1, …
read_spss_data()
# Read SPSS data without converting variable labels to column names
wages_data <- read_spss_data(file = spss_file_path)
wages_data |> glimpse()
#> Rows: 400
#> Columns: 9
#> $ id <dbl> 3, 4, 5, 12, 13, 14, 17, 20, 21, 23, 25, 28, 30, 31, 32, 36, 38,…
#> $ educ <dbl> 12, 13, 10, 9, 9, 12, 11, 12, 11, 6, 10, 8, 12, 12, 12, 14, 13, …
#> $ south <fct> does not live in South, does not live in South, lives in South, …
#> $ sex <fct> Male, Male, Male, Male, Male, Male, Male, Male, Male, Male, Male…
#> $ exper <dbl> 17, 9, 27, 30, 29, 37, 16, 9, 14, 45, 30, 19, 36, 20, 35, 14, 8,…
#> $ wage <dbl> 7.50, 13.07, 4.45, 6.25, 19.98, 7.30, 3.65, 3.75, 4.50, 5.75, 6.…
#> $ occup <fct> Other, Other, Other, Other, Other, Other, Other, Other, Other, O…
#> $ marr <fct> Married, Not married, Not married, Not married, Married, Married…
#> $ ed    <fct> High school degree, Some college, Less than h.s. degree, Less th…
# Read SPSS data and convert variable labels to column names
wages_data <- read_spss_data(file = spss_file_path, label = TRUE)
wages_data
read_stata_data()
# Read Stata data without converting labels
stata_file_path <- "data/Wages.dta"
data <- read_stata_data(file = stata_file_path)
data
# Read Stata data and convert variable labels to column names
data <- read_stata_data(file = stata_file_path, label = TRUE)
data
Data Quality and Missing Values
Generating Data Dictionaries
The generate_dictionary() function creates a comprehensive data dictionary from an imported labelled dataframe, which is essential for documenting and understanding your dataset’s structure.
generate_dictionary(wages_data) |>
view()
The look_for() function mimics the Stata lookfor command in R, allowing users to quickly search large datasets for variable names, labels, factor levels, and value labels.
# Look for the keyword "south" in the dataset
look_for(wages_data, "south") |>
view()
# Look for the keyword "e" in the dataset
look_for(wages_data, "e") |>
view()
Handling Missing Values
inspect_na(): Quickly checks for missing data across a data frame.
fill_missing_values(): Offers multiple imputation strategies for filling missing values.
# Inspect missing data in the 'airquality' dataset
inspect_na(airquality)
Imputing Missing Values
Use fill_missing_values() to fill missing values with the mean, median, or other methods:
# Create a sample dataset with missing values
sample_iris <- tibble::tibble(
Sepal_Length = c(5.2, 5, 5.7, NA, 6.2, 6.7, 5.5),
Petal_Length = c(1.5, 1.4, 4.2, 1.4, NA, 5.8, 3.7),
Petal_Width = c(0.3, 0.2, 1.2, 0.2, 1.3, 1.8, NA),
Species = c("setosa", "setosa", "versicolor", "setosa", "virginica", "virginica", "setosa")
)
sample_iris
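Before calling the package function, it may help to see what mean imputation does on a single column; here is a base-R sketch using the Sepal_Length values above:

```r
# Base-R sketch: replace each NA with the mean of the observed values
x <- c(5.2, 5, 5.7, NA, 6.2, 6.7, 5.5)
x[is.na(x)] <- mean(x, na.rm = TRUE)
x
```

fill_missing_values() applies this kind of column-wise strategy across the whole data frame for you.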
# Fill missing values using mean for numeric variables
result_df_mean <- fill_missing_values(sample_iris, method = "mean")
You can impute missing values for selected variables in the data frame by using the selected_variables argument. For example, to impute missing values in the Petal_Length and Petal_Width variables using the median method:
result_df_median <- fill_missing_values(
sample_iris,
selected_variables = c("Petal_Length", "Petal_Width"),
method = "median"
)
Other Utility Functions
pull_out()
pull_out() extracts or replaces parts of vectors, matrices, arrays, or lists. It works seamlessly with the magrittr (%>%) and base R (|>) pipe operators.
top_10_richest_nig <- c("Aliko Dangote", "Mike Adenuga", "Femi Otedola", "Arthur Eze", "Abdulsamad Rabiu", "Cletus Ibeto", "Orji Uzor Kalu", "ABC Orjiakor", "Jimoh Ibrahim", "Tony Elumelu")
# Extract specific elements from the list
top_10_richest_nig |>
pull_out(c(1, 5, 2))
#> [1] "Aliko Dangote" "Abdulsamad Rabiu" "Mike Adenuga"
# Exclude specific elements from the list
top_10_richest_nig |>
pull_out(-c(1, 5, 2))
#> [1] "Femi Otedola" "Arthur Eze" "Cletus Ibeto" "Orji Uzor Kalu"
#> [5] "ABC Orjiakor" "Jimoh Ibrahim" "Tony Elumelu"convert_to_date()
convert_to_date() efficiently parses dates stored in various formats into proper date objects, enabling smooth date handling and analysis.
# Heterogeneous dates
dates <- c(
44869, "22.09.2022", NA, "02/27/92", "01-19-2022",
"13-01-2022", "2023", "2023-2", 41750.2, 41751.99,
"11 07 2023", "2023-4"
)
# Convert to POSIXct or Date object
convert_to_date(dates)
#> [1] "2022-11-04" "2022-09-22" NA "1992-02-27" "2022-01-19"
#> [6] "2022-01-13" "2023-01-01" "2023-02-01" "2014-04-21" "2014-04-22"
#> [11] "2023-07-11" "2023-04-01"
# Convert date-time object to date object
convert_to_date(lubridate::now())
#> [1] "2026-04-27"Let’s look at an example using data frame
ogd_data <- read_excel_workbook("data/OGD.xlsx")
ogd_data %>%
head()
# Convert to POSIXct or Date object
modified_ogd_data <- ogd_data %>%
mutate(Date_format = convert_to_date(Date))
modified_ogd_data %>%
glimpse()
#> Rows: 760
#> Columns: 3
#> $ PID <chr> "NIG-CON-002", "NIG-CON-004", "NIG-CON-007", "NIG-CON-009"…
#> $ Date <chr> "22.09.2022", "44569", "44569", "44569", "44569", "44569",…
#> $ Date_format <date> 2022-09-22, 2022-01-08, 2022-01-08, 2022-01-08, 2022-01-0…
Conclusion
In this session, we’ve explored how to use bulkreadr to handle and import data from multiple sources, including Excel, CSV, Google Sheets, SPSS, and Stata. The package helps streamline the data import process, saving time and ensuring that your workflows are reproducible. We also covered essential data quality checks and strategies for handling missing values.
Feel free to explore further, ask questions, or refer to the bulkreadr GitHub repository (https://github.com/gbganalyst/bulkreadr) for more details.
Thank you for attending the R/Medicine 2026 talk!