
Other Utility Functions in bulkreadr
Ezekiel Ogundepo and Ernest Fokoué
Source: vignettes/other-functions.Rmd

The bulkreadr package includes specialized functions
beyond bulk data reading, aimed at enhancing data analysis efficiency.
These functions are designed to operate on individual vectors, except
for inspect_na() and fill_missing_values(),
which work on data frames.
pull_out()
pull_out() extracts or replaces parts of vectors,
matrices, arrays, or lists. It works with both the magrittr
(%>%) and base R (|>) pipe operators.
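For readers used to base R subsetting, the sketch below shows the `[` calls that the pull_out() examples in this section appear to correspond to. It assumes pull_out() follows base R's positive and negative index semantics, which is what the printed results suggest; the vector `richest` is a shortened stand-in used only for illustration.

```r
# Shortened stand-in vector (illustrative only, not the full example data)
richest <- c("Aliko Dangote", "Mike Adenuga", "Femi Otedola",
             "Arthur Eze", "Abdulsamad Rabiu")

# Positive indices select elements in the order the indices are given
picked <- richest[c(1, 5, 2)]

# Negative indices drop those positions and keep the rest in original order
dropped <- richest[-c(1, 5, 2)]

print(picked)
print(dropped)
```

The practical difference is ergonomic: a named function slots into a pipe chain more readably than a bare bracket call.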
library(bulkreadr)
library(dplyr)
top_10_richest_nig <- c("Aliko Dangote", "Mike Adenuga", "Femi Otedola", "Arthur Eze", "Abdulsamad Rabiu", "Cletus Ibeto", "Orji Uzor Kalu", "ABC Orjiakor", "Jimoh Ibrahim", "Tony Elumelu")
# Extract specific elements from the vector
top_10_richest_nig |>
  pull_out(c(1, 5, 2))
#> [1] "Aliko Dangote"    "Abdulsamad Rabiu" "Mike Adenuga"
# Exclude specific elements from the vector
top_10_richest_nig |>
  pull_out(-c(1, 5, 2))
#> [1] "Femi Otedola"   "Arthur Eze"     "Cletus Ibeto"   "Orji Uzor Kalu"
#> [5] "ABC Orjiakor"   "Jimoh Ibrahim"  "Tony Elumelu"

convert_to_date()
convert_to_date() efficiently parses dates from various
formats into POSIXct date objects, enabling smooth date
handling and analysis.
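Two of the input styles that appear in this section's example can be reproduced with base R alone, which may help when checking results by hand. This is only a sketch: it assumes numeric inputs are Excel serial day counts with origin 1899-12-30, and it does not describe how convert_to_date() is implemented.

```r
# Excel serial numbers count days from 1899-12-30 (Windows date system);
# the fractional part is time of day and does not change the printed day
serial <- c(44869, 41750.2, 41751.99)
from_serial <- format(as.Date(serial, origin = "1899-12-30"))

# Text dates need an explicit format string for each style
from_dotted <- format(as.Date("22.09.2022", format = "%d.%m.%Y"))
from_slash  <- format(as.Date("02/27/92", format = "%m/%d/%y"))

print(from_serial)
print(from_dotted)
print(from_slash)
```

Base R needs one format string per style, so a mixed vector like the one in this section would require a separate branch for each pattern.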
# heterogeneous dates
dates <- c(
  44869, "22.09.2022", NA, "02/27/92", "01-19-2022",
  "13-01- 2022", "2023", "2023-2", 41750.2, 41751.99,
  "11 07 2023", "2023-4"
)
# Convert to POSIXct or Date object
convert_to_date(dates)
#>  [1] "2022-11-04" "2022-09-22" NA           "1992-02-27" "2022-01-19"
#>  [6] "2022-01-13" "2023-01-01" "2023-02-01" "2014-04-21" "2014-04-22"
#> [11] "2023-07-11" "2023-04-01"
# Convert date-time object to date object
convert_to_date(lubridate::now())
#> [1] "2026-04-25"

Handling Missing Values with inspect_na() and fill_missing_values()
inspect_na(): Quickly checks for missing data across a data frame.
fill_missing_values(): Offers multiple imputation strategies for filling missing values.
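As a cross-check on what a per-column missing-data summary contains, the following base R sketch computes the count and percentage of NAs for each column of `airquality`. The column names `col_name`, `cnt`, and `pcnt` mirror the printed output in this section; the object name `na_summary` is an illustrative choice, and the code is not the package's implementation.

```r
# airquality ships with base R (datasets package)
na_count <- colSums(is.na(airquality))          # NAs per column
na_pct   <- 100 * na_count / nrow(airquality)   # share of rows that are NA

na_summary <- data.frame(
  col_name = names(na_count),
  cnt      = as.integer(na_count),
  pcnt     = round(na_pct, 2),
  row.names = NULL
)
print(na_summary)
```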
# Inspect missing data in the 'airquality' dataset
inspect_na(airquality)
#> # A tibble: 6 × 3
#>   col_name   cnt  pcnt
#>   <chr>    <int> <dbl>
#> 1 Ozone       37 24.2
#> 2 Solar.R      7  4.58
#> 3 Wind         0  0
#> 4 Temp         0  0
#> 5 Month        0  0
#> # ℹ 1 more row

inspect_na() also works with grouped data frames,
allowing you to inspect missing values within each group. For example,
to check for missing values in the airquality dataset
grouped by Month, you can use:
airquality %>%
  group_by(Month) %>%
  inspect_na()
#> # A tibble: 25 × 4
#>   Month col_name   cnt  pcnt
#>   <int> <chr>    <int> <dbl>
#> 1     5 Ozone        5  16.1
#> 2     5 Solar.R      4  12.9
#> 3     5 Wind         0   0
#> 4     5 Temp         0   0
#> 5     5 Day          0   0
#> # ℹ 20 more rows

Imputing Missing Values
fill_missing_values() fills missing values in a data
frame using imputation by function, also known as column-based
imputation: each missing entry is replaced with a summary statistic
computed from the observed values of its own column. For continuous
variables, the supported methods are minimum, maximum, mean, median,
harmonic mean, and geometric mean. For categorical variables, missing
values are replaced with the mode of the column. Because every
replacement is derived from the column it fills, imputed values stay on
that column's scale and the result is a complete data frame.
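The idea of column-based imputation can be sketched in a few lines of base R. The helper `impute_column()` and the data frame `toy` below are hypothetical names introduced for illustration; they are not part of bulkreadr and only approximate the behaviour described above (a summary function for numeric columns, the mode for everything else).

```r
# Hypothetical helper, not part of bulkreadr: fill the NAs of one column
impute_column <- function(x, fun = mean) {
  if (is.numeric(x)) {
    # Summary statistic of the observed (non-missing) values
    fill <- fun(x, na.rm = TRUE)
  } else {
    # Mode: most frequent observed value; table() drops NA by default,
    # and ties resolve to the first value in sorted order
    counts <- table(x)
    fill <- names(counts)[which.max(counts)]
  }
  x[is.na(x)] <- fill
  x
}

toy <- data.frame(
  length  = c(5.2, 5, 5.7, NA, 6.2, 6.7, 5.5),
  species = c("setosa", NA, "versicolor", "setosa", NA, "virginica", "setosa")
)

# Apply the helper column by column and reassemble the data frame
filled <- as.data.frame(lapply(toy, impute_column, fun = mean))
print(filled)
```

Passing `fun = median`, `min`, or `max` switches the statistic without changing the helper, which is the main appeal of treating imputation as "a function applied per column".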
df <- tibble::tibble(
  Sepal_Length = c(5.2, 5, 5.7, NA, 6.2, 6.7, 5.5),
  Sepal.Width = c(4.1, 3.6, 3, 3, 2.9, 2.5, 2.4),
  Petal_Length = c(1.5, 1.4, 4.2, 1.4, NA, 5.8, 3.7),
  Petal_Width = c(NA, 0.2, 1.2, 0.2, 1.3, 1.8, NA),
  Species = c(
    "setosa", NA, "versicolor", "setosa",
    NA, "virginica", "setosa"
  )
)
df
#> # A tibble: 7 × 5
#>   Sepal_Length Sepal.Width Petal_Length Petal_Width Species
#>          <dbl>       <dbl>        <dbl>       <dbl> <chr>
#> 1          5.2         4.1          1.5        NA   setosa
#> 2          5           3.6          1.4         0.2 NA
#> 3          5.7         3            4.2         1.2 versicolor
#> 4         NA           3            1.4         0.2 setosa
#> 5          6.2         2.9         NA           1.3 NA
#> # ℹ 2 more rows

Impute using the mean method for continuous variables
result_df_mean <- fill_missing_values(df, method = "mean")
result_df_mean
#> # A tibble: 7 × 5
#>   Sepal_Length Sepal.Width Petal_Length Petal_Width Species
#>          <dbl>       <dbl>        <dbl>       <dbl> <chr>
#> 1         5.2          4.1          1.5        0.94 setosa
#> 2         5            3.6          1.4        0.2  setosa
#> 3         5.7          3            4.2        1.2  versicolor
#> 4         5.72         3            1.4        0.2  setosa
#> 5         6.2          2.9          3          1.3  setosa
#> # ℹ 2 more rows

Impute using the geometric mean for continuous variables and specify variables Petal_Length and Petal_Width
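When checking this section's printed replacements by hand (2.22 for Petal_Length, 0.732 for Petal_Width), note that the textbook geometric mean of the observed values is larger for Petal_Length (about 2.53) and smaller for Petal_Width (about 0.65). The printed numbers coincide with a variant that divides the sum of logs by the full column length, missing entries included. The sketch below computes both; the function names are hypothetical, and this is an inference from the arithmetic rather than a statement about bulkreadr's source code.

```r
petal_length <- c(1.5, 1.4, 4.2, 1.4, NA, 5.8, 3.7)
petal_width  <- c(NA, 0.2, 1.2, 0.2, 1.3, 1.8, NA)

# Textbook geometric mean: exp of the mean log over observed values only
geo_observed <- function(x) exp(mean(log(x), na.rm = TRUE))

# Variant: sum of logs over observed values, divided by the full length
# of the column (NAs counted in the denominator)
geo_full_length <- function(x) exp(sum(log(x), na.rm = TRUE) / length(x))

print(c(geo_observed(petal_length), geo_observed(petal_width)))
print(c(geo_full_length(petal_length), geo_full_length(petal_width)))
```

If the textbook definition is what an analysis needs, it may be worth comparing the imputed values against `geo_observed()` before relying on them.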
result_df_geomean <- fill_missing_values(
  df,
  selected_variables = c("Petal_Length", "Petal_Width"),
  method = "geometric"
)
result_df_geomean
#> # A tibble: 7 × 5
#>   Sepal_Length Sepal.Width Petal_Length Petal_Width Species
#>          <dbl>       <dbl>        <dbl>       <dbl> <chr>
#> 1          5.2         4.1         1.5        0.732 setosa
#> 2          5           3.6         1.4        0.2   NA
#> 3          5.7         3           4.2        1.2   versicolor
#> 4         NA           3           1.4        0.2   setosa
#> 5          6.2         2.9         2.22       1.3   NA
#> # ℹ 2 more rows

Impute missing values (NAs) in a grouped data frame
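fill_missing_values() can also be applied group by group: split the
grouped data frame with group_split(), then map the function over the
pieces and row-bind the results. The output is ordered by group rather
than by original row position. Here is an example: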
sample_iris <- tibble::tibble(
  Sepal_Length = c(5.2, 5, 5.7, NA, 6.2, 6.7, 5.5),
  Petal_Length = c(1.5, 1.4, 4.2, 1.4, NA, 5.8, 3.7),
  Petal_Width = c(0.3, 0.2, 1.2, 0.2, 1.3, 1.8, NA),
  Species = c(
    "setosa", "setosa", "versicolor", "setosa",
    "virginica", "virginica", "setosa"
  )
)
sample_iris
#> # A tibble: 7 × 4
#>   Sepal_Length Petal_Length Petal_Width Species
#>          <dbl>        <dbl>       <dbl> <chr>
#> 1          5.2          1.5         0.3 setosa
#> 2          5            1.4         0.2 setosa
#> 3          5.7          4.2         1.2 versicolor
#> 4         NA            1.4         0.2 setosa
#> 5          6.2         NA           1.3 virginica
#> # ℹ 2 more rows
# map_df() comes from purrr, which is not attached above, so it is namespaced here
sample_iris %>%
  group_by(Species) %>%
  group_split() %>%
  purrr::map_df(fill_missing_values, method = "median")
#> # A tibble: 7 × 4
#>   Sepal_Length Petal_Length Petal_Width Species
#>          <dbl>        <dbl>       <dbl> <chr>
#> 1          5.2          1.5         0.3 setosa
#> 2          5            1.4         0.2 setosa
#> 3          5.2          1.4         0.2 setosa
#> 4          5.5          3.7         0.2 setosa
#> 5          5.7          4.2         1.2 versicolor
#> # ℹ 2 more rows
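For comparison, the same per-group median fill can be sketched with dplyr alone, without splitting the data frame. This is an alternative technique, not a bulkreadr feature: it uses group_by() with mutate(across(...)) and coalesce(), lists the numeric columns explicitly, and keeps rows in their original order. The object name `filled_by_group` is illustrative.

```r
library(dplyr)

sample_iris <- tibble(
  Sepal_Length = c(5.2, 5, 5.7, NA, 6.2, 6.7, 5.5),
  Petal_Length = c(1.5, 1.4, 4.2, 1.4, NA, 5.8, 3.7),
  Petal_Width = c(0.3, 0.2, 1.2, 0.2, 1.3, 1.8, NA),
  Species = c(
    "setosa", "setosa", "versicolor", "setosa",
    "virginica", "virginica", "setosa"
  )
)

filled_by_group <- sample_iris %>%
  group_by(Species) %>%
  # Within each species, replace NA with that group's median of the column
  mutate(across(
    c(Sepal_Length, Petal_Length, Petal_Width),
    ~ coalesce(.x, median(.x, na.rm = TRUE))
  )) %>%
  ungroup()

print(filled_by_group)
```

This version only covers the median of numeric columns; the split-and-map approach shown earlier is the one that reuses fill_missing_values() and its other methods.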