Introduction to labelled data
Ezekiel Ogundepo and Ernest Fokoué
Source: vignettes/labelled-data.Rmd
What is labelled data in R?
Labelled data in SPSS and Stata refers to datasets where each variable (or column) and its values are assigned meaningful labels. These labels provide context, such as descriptions or categories, making the data easier to understand and analyze. For instance, a variable representing gender might have numerical codes (1, 2) with labels (“Male”, “Female”). This feature enhances data analysis by allowing researchers to work with descriptive labels instead of deciphering codes or numeric values, facilitating clearer interpretation and communication of statistical results.
R packages such as foreign and haven import labelled data from software like SPSS and Stata. The bulkreadr package builds on haven to streamline this step further: it automatically converts labelled variables into R's factor data type, which removes the need for manual recoding and shortens the data analysis workflow in R.
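To make the idea of value labels concrete, the following base R sketch performs by hand the recoding that automatic conversion to factors is meant to replace. The vector `gender_codes` and its code-to-label mapping are made up for illustration and are not taken from the package's example data.

```r
# Hypothetical coded variable, as it might arrive from SPSS or Stata:
# 1 stands for "Male", 2 stands for "Female".
gender_codes <- c(1, 2, 2, 1)

# Manual recoding: map each numeric code to its descriptive label.
gender <- factor(
  gender_codes,
  levels = c(1, 2),
  labels = c("Male", "Female")
)

# Inspect the result: the codes are now descriptive factor levels.
levels(gender)
table(gender)
```

Doing this for every coded column in a large survey file is tedious and error-prone, which is the motivation for converting labelled variables to factors at import time.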
Note
Most examples in this vignette use data files that ship with the bulkreadr package, which can be located with the system.file() function. To use your own data from a local directory, make sure you supply the correct file path before calling any function provided by the bulkreadr package.
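As a minimal sketch of the path handling described in the note, the base R snippet below locates the bundled example file and builds a path to a local file. The helper `check_data_path()` and the names `my_project` and `survey.sav` are hypothetical placeholders, not part of bulkreadr.

```r
# Hypothetical helper: return a normalised path, or stop with a clear message.
check_data_path <- function(path) {
  if (!nzchar(path) || !file.exists(path)) {
    stop("Data file not found: '", path, "'", call. = FALSE)
  }
  normalizePath(path)
}

# Bundled example file. system.file() returns "" when the file
# (or the package) cannot be found, so the result is worth checking.
bundled <- system.file("extdata", "Wages.sav", package = "bulkreadr")

# A file of your own: build the path explicitly rather than relying on
# the current working directory. Both names below are placeholders.
own <- file.path(path.expand("~"), "my_project", "survey.sav")

# Validate a path only when the file is actually present.
if (file.exists(own)) {
  own <- check_data_path(own)
}
```

Checking the path up front gives a clearer error than letting a reader function fail on a missing file.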
read_spss_data()
read_spss_data() imports data from SPSS data files (.sav or .zsav). It converts labelled variables into factors, which makes the data easier to manipulate and analyse in R.
Read the SPSS data file without converting variable labels to column names
library(bulkreadr)
file_path <- system.file("extdata", "Wages.sav", package = "bulkreadr")
data <- read_spss_data(file = file_path)
data
#> # A tibble: 400 × 9
#>      id  educ south                  sex   exper  wage occup marr        ed
#>   <dbl> <dbl> <fct>                  <fct> <dbl> <dbl> <fct> <fct>       <fct>
#> 1     3    12 does not live in South Male     17  7.5  Other Married     High s…
#> 2     4    13 does not live in South Male      9 13.1  Other Not married Some c…
#> 3     5    10 lives in South         Male     27  4.45 Other Not married Less t…
#> 4    12     9 lives in South         Male     30  6.25 Other Not married Less t…
#> 5    13     9 lives in South         Male     29 20.0  Other Married     Less t…
#> # ℹ 395 more rows
Read the SPSS data file and convert variable labels to column names
data <- read_spss_data(file = file_path, label = TRUE)
data
#> # A tibble: 400 × 9
#>   `Worker ID` `Number of years of education` `Live in south`        Gender
#>         <dbl>                          <dbl> <fct>                  <fct>
#> 1           3                             12 does not live in South Male
#> 2           4                             13 does not live in South Male
#> 3           5                             10 lives in South         Male
#> 4          12                              9 lives in South         Male
#> 5          13                              9 lives in South         Male
#> # ℹ 395 more rows
#> # ℹ 5 more variables: `Number of years of work experience` <dbl>,
#> #   `Wage (dollars per hour)` <dbl>, Occupation <fct>, `Marital status` <fct>,
#> #   `Highest education level` <fct>
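The effect of label = TRUE can be pictured with a small base R sketch: each column carries its variable label in a "label" attribute, and the labels replace the short column names. This is an illustration under that assumption only, not bulkreadr's actual implementation; the data frame `wages` and the helper `labels_as_names()` are hypothetical.

```r
# Toy data frame with variable labels stored in the "label" attribute,
# which is assumed here to be where imported labels are kept.
wages <- data.frame(id = c(3, 4), educ = c(12, 13), wage = c(7.5, 13.1))
attr(wages$id, "label") <- "Worker ID"
attr(wages$educ, "label") <- "Number of years of education"
# wage is deliberately left without a label.

# Hypothetical helper: use each column's label as its name,
# falling back to the existing name when no label is present.
labels_as_names <- function(df) {
  new_names <- vapply(names(df), function(nm) {
    lab <- attr(df[[nm]], "label", exact = TRUE)
    if (is.null(lab)) nm else as.character(lab)
  }, character(1))
  names(df) <- unname(new_names)
  df
}

relabelled <- labels_as_names(wages)
names(relabelled)
```

Descriptive names are convenient for reports and tables, while the short names are usually easier to type in analysis code, so the choice depends on what the data will be used for next.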
read_stata_data()
read_stata_data() reads a Stata data file (.dta) into an R data frame, converting labelled variables into factors.
Read the Stata data file without converting variable labels to column names
file_path <- system.file("extdata", "Wages.dta", package = "bulkreadr")
data <- read_stata_data(file = file_path)
data
#> # A tibble: 400 × 9
#>      id  educ south                  sex   exper  wage occup marr        ed
#>   <dbl> <dbl> <fct>                  <fct> <dbl> <dbl> <fct> <fct>       <fct>
#> 1     3    12 does not live in South Male     17  7.5  Other Married     High s…
#> 2     4    13 does not live in South Male      9 13.1  Other Not married Some c…
#> 3     5    10 lives in South         Male     27  4.45 Other Not married Less t…
#> 4    12     9 lives in South         Male     30  6.25 Other Not married Less t…
#> 5    13     9 lives in South         Male     29 20.0  Other Married     Less t…
#> # ℹ 395 more rows
Read the Stata data file and convert variable labels to column names
data <- read_stata_data(file = file_path, label = TRUE)
data
#> # A tibble: 400 × 9
#>   `Worker ID` `Number of years of education` `Live in south`        Gender
#>         <dbl>                          <dbl> <fct>                  <fct>
#> 1           3                             12 does not live in South Male
#> 2           4                             13 does not live in South Male
#> 3           5                             10 lives in South         Male
#> 4          12                              9 lives in South         Male
#> 5          13                              9 lives in South         Male
#> # ℹ 395 more rows
#> # ℹ 5 more variables: `Number of years of work experience` <dbl>,
#> #   `Wage (dollars per hour)` <dbl>, Occupation <fct>, `Marital status` <fct>,
#> #   `Highest education level` <fct>
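Since bulkreadr is described as building on haven, the labelled-to-factor step can be sketched with haven directly. The snippet below assumes haven is installed and round-trips a made-up data frame through a temporary .dta file. It illustrates the general idea only and should not be read as the code bulkreadr runs internally.

```r
library(haven)

# Toy data with a labelled integer column: codes 1 and 2 carry value labels,
# and the column as a whole carries the variable label "Gender".
toy <- data.frame(id = c(3, 4, 5))
toy$sex <- labelled(
  c(1L, 1L, 2L),
  labels = c(Male = 1L, Female = 2L),
  label = "Gender"
)

# Write to a temporary Stata file and read it back.
path <- tempfile(fileext = ".dta")
write_dta(toy, path)
raw <- read_dta(path)

# After import, sex is a labelled vector holding the numeric codes.
class(raw$sex)

# as_factor() converts labelled columns to factors and leaves
# unlabelled columns such as id unchanged.
converted <- as_factor(raw)
levels(converted$sex)
```

Comparing `raw` with `converted` shows the difference between keeping the numeric codes and working with descriptive factor levels.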
generate_dictionary()
generate_dictionary() creates a data dictionary from a specified data frame. This function is particularly useful for understanding and documenting the structure of your dataset, similar to data dictionaries in Stata or SPSS.
# Creating a data dictionary from an SPSS file
file_path <- system.file("extdata", "Wages.sav", package = "bulkreadr")
wage_data <- read_spss_data(file = file_path)
generate_dictionary(wage_data)
#> # A tibble: 9 × 6
#>   position variable description                     `column type` missing levels
#>      <int> <chr>    <chr>                           <chr>           <int> <name>
#> 1        1 id       Worker ID                       dbl                 0 <NULL>
#> 2        2 educ     Number of years of education    dbl                 0 <NULL>
#> 3        3 south    Live in south                   fct                 0 <chr>
#> 4        4 sex      Gender                          fct                 0 <chr>
#> 5        5 exper    Number of years of work experi… dbl                 0 <NULL>
#> # ℹ 4 more rows
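To show what such a dictionary contains, here is a simplified base R sketch that collects the position, name, label, class, and missing-value count of each column. The helper `simple_dictionary()` and the toy data are hypothetical, and variable labels are assumed to live in a "label" attribute. It is not the implementation behind generate_dictionary() and it omits the levels column.

```r
# Hypothetical, simplified data dictionary builder.
simple_dictionary <- function(df) {
  data.frame(
    position = seq_along(df),
    variable = names(df),
    description = vapply(df, function(col) {
      lab <- attr(col, "label", exact = TRUE)
      if (is.null(lab)) NA_character_ else as.character(lab)
    }, character(1)),
    column_type = vapply(df, function(col) class(col)[1], character(1)),
    missing = vapply(df, function(col) sum(is.na(col)), integer(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Toy data with labels and one missing value.
toy <- data.frame(
  id = c(3, 4, 5),
  sex = factor(c("Male", "Male", NA), levels = c("Male", "Female"))
)
attr(toy$id, "label") <- "Worker ID"
attr(toy$sex, "label") <- "Gender"

dict <- simple_dictionary(toy)
dict
```

Each row of `dict` describes one column of the input, which mirrors the one-row-per-variable layout of the output shown above.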
look_for()
The look_for() function emulates the Stata lookfor command in R. It searches a dataset's variable names, variable label descriptions, factor levels, and value labels for a keyword, which helps users of large and complex datasets quickly locate the variables of interest.
# Look for a single keyword.
look_for(wage_data, "south")
#>  pos variable label         col_type missing values
#>  3   south    Live in south fct      0       does not live in South
#>                                              lives in South
# Look for variables using a regular expression.
look_for(wage_data, "^s")
#>  pos variable label         col_type missing values
#>  3   south    Live in south fct      0       does not live in South
#>                                              lives in South
#>  4   sex      Gender        fct      0       Male
#>                                              Female
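The search idea can be sketched in base R with grepl(): test the pattern against each variable's name, label, and factor levels, and keep the variables where any of them match. The helper `find_vars()` and the toy data are hypothetical, and case-insensitive matching is an assumption of this sketch. It is not the look_for() implementation.

```r
# Hypothetical keyword search over variable names, labels, and factor levels.
find_vars <- function(df, pattern) {
  hit <- vapply(names(df), function(nm) {
    col <- df[[nm]]
    lab <- attr(col, "label", exact = TRUE)
    # Candidates to search: the name, the label (if any), and the levels (if any).
    candidates <- c(nm, if (!is.null(lab)) as.character(lab), levels(col))
    any(grepl(pattern, candidates, ignore.case = TRUE))
  }, logical(1))
  names(df)[hit]
}

# Toy data loosely modelled on the columns shown above.
toy <- data.frame(
  id = c(3, 4),
  south = factor(c("does not live in South", "lives in South")),
  sex = factor(c("Male", "Female"), levels = c("Male", "Female"))
)
attr(toy$id, "label") <- "Worker ID"
attr(toy$south, "label") <- "Live in south"
attr(toy$sex, "label") <- "Gender"

by_keyword <- find_vars(toy, "south")  # matches a name, a label, and levels
by_regex <- find_vars(toy, "^s")       # matches names starting with "s"
by_label <- find_vars(toy, "gender")   # matches the label only
```

Because the pattern is passed to grepl(), it is treated as a regular expression, so an anchor such as `^` restricts matches to the start of a name, label, or level.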