Importing and Cleaning Data • blackstone

blackstone contains many functions for importing and cleaning data. This vignette shows you:

How to create a codebook for SurveyMonkey data.
How edit the codebook and use it to read in SurveyMonkey data.
How to merge data from different time points (pre-post data).
How to clean a dataset and make sure all variables are the correct type for analysis.

We’ll start by loading blackstone:

library(blackstone)

Data Cleaning and Manipulation

Importing data from SurveyMonkey

Our current survey provider is SurveyMonkey, blackstone contains several functions that makes the process of reading SurveyMonkey data into R a more manageable process and creates a codebook for the data along the way.

SurveyMonkey exports data with two header rows, which does not work with R where tibbles and dataframes can only have one row of names.

Here is how to import data from SurveyMonkey using example data provided with blackstone, this is a fake dataset of a pre (baseline) survey. There are three steps to this process:

Create a codebook.
Edit and save the codebook to create meaningful variable names.
Read in the data and rename the variables with the codebook.

Pre Survey Data

1. Creating the codebook

# File path for pre example data:
pre_data_fp <- blackstone::blackstoneExample("sm_data_pre.csv")
# 1. Create the codebook:
codebook_pre <- blackstone::createCodebook(pre_data_fp)
codebook_pre
#> # A tibble: 33 × 5
#>   header_1      header_2 combined_header position variable_name
#>   <chr>         <chr>    <chr>              <int> <chr>        
#> 1 Respondent ID NA       Respondent ID          1 respondent_id
#> 2 Collector ID  NA       Collector ID           2 collector_id 
#> 3 Start Date    NA       Start Date             3 start_date   
#> 4 End Date      NA       End Date               4 end_date     
#> 5 IP Address    NA       IP Address             5 ip_address   
#> # ℹ 28 more rows

For this codebook, the first column header_1 is the first header from the SurveyMonkey data, the second column header_2 is the second header, the third column combined_header is the combination of the two headers, position is the column number position for each combined_header, and variable_name is a cleaned up version for combined_header and will be the column to edit to change the column names later on to shorter and more meaningful names.

variable_name will be the column that renames all the variables in the SurveyMonkey data.

2. Editing and Saving the Codebook

# Step 2. Edit the codebook: 
# Set up sequential naming convections for matrix-style questions with shared likert scale response options:
# 8 items that are matrix-style likert scales- turned into a scale called `research`- here is how to easily name them all at once:
# Rows 11 to 18 belong to the "research" matrix question (you will have to look at the codebook and match the header_1 and header_2 to variable_name to change)
research_items <- codebook_pre[["variable_name"]][11:18]
research_names <- paste0("research_", seq_along(research_items)) %>% purrr::set_names(., research_items) # Create a new named vector of names for these columns
# 6 items that are matrix-style likert scales- turned into a scale called `ability`- Rows 19 to 24 named `variable_name`:
ability_items <- codebook_pre[["variable_name"]][19:24]
ability_names <- paste0("ability_", seq_along(ability_items)) %>% purrr::set_names(., ability_items) # Create a new named vector of names for these columns
# 6 items that are matrix-style likert scales- turned into a scale called `ethics`- Rows 19 to 24 named `variable_name`:
ethics_items <- codebook_pre[["variable_name"]][25:29]
ethics_names <- paste0("ethics_", seq_along(ethics_items)) %>% purrr::set_names(., ethics_items) # Create a new named vector of names for these columns
# Edit the `variable_names` column: Use dplyr::mutate() and dplyr::case_match() to change the column `variable_name`:
codebook_pre <- codebook_pre %>% dplyr::mutate(
    variable_name = dplyr::case_match(
        variable_name, # column to match
        'custom_data_1' ~ "unique_id", # changes 'custom_data_1' to "unique_id"
        'to_what_extent_are_you_knowledgeable_in_conducting_research_in_your_field_of_study' ~ "knowledge",
        'with_which_gender_do_you_most_closely_identify' ~ "gender",
        'what_is_your_current_role_in_the_program' ~ "role",
        'which_race_ethnicity_best_describes_you_please_choose_only_one' ~ "ethnicity",
        'are_you_a_first_generation_college_student' ~ "first_gen",
        names(research_names) ~ research_names[variable_name], # takes the above named vector and when the name matches, applies new value in that position as replacement.
        names(ability_names) ~ ability_names[variable_name],   # Same for `ability_names`
        names(ethics_names) ~ ethics_names[variable_name],   # Same for `ability_names`
        .default = variable_name # returns default value from original `variable_name` if not changed.
        )
    )
codebook_pre
#> # A tibble: 33 × 5
#>   header_1      header_2 combined_header position variable_name
#>   <chr>         <chr>    <chr>              <int> <chr>        
#> 1 Respondent ID NA       Respondent ID          1 respondent_id
#> 2 Collector ID  NA       Collector ID           2 collector_id 
#> 3 Start Date    NA       Start Date             3 start_date   
#> 4 End Date      NA       End Date               4 end_date     
#> 5 IP Address    NA       IP Address             5 ip_address   
#> # ℹ 28 more rows

# Write out the edited codebook to save for future use-
# Be sure to double check questions match new names before writing out:
# readr::write_csv(codebook_pre, file = "{filepath-to-codebok}")

3. Import the Data and Rename the Variables with the Codebook

# 3. Read in the data and rename the vars using readRenameData(), passing the file path and the edited codebook:
pre_data <- blackstone::readRenameData(pre_data_fp, codebook = codebook_pre)
pre_data
#> # A tibble: 100 × 33
#>   respondent_id collector_id start_date end_date   ip_address      email_address
#>           <dbl>        <dbl> <date>     <date>     <chr>           <chr>        
#> 1  114628000001    431822954 2024-06-11 2024-06-12 227.224.138.113 coraima59@me…
#> 2  114628000002    431822954 2024-06-08 2024-06-09 110.241.132.50  mstamm@hermi…
#> 3  114628000003    431822954 2024-06-21 2024-06-22 165.58.112.64   precious.fei…
#> 4  114628000004    431822954 2024-06-08 2024-06-09 49.34.121.147   ines52@gmail…
#> 5  114628000005    431822954 2024-06-02 2024-06-03 115.233.66.80   franz44@hotm…
#> # ℹ 95 more rows
#> # ℹ 27 more variables: first_name <chr>, last_name <chr>, unique_id <dbl>,
#> #   knowledge <chr>, research_1 <chr>, research_2 <chr>, research_3 <chr>,
#> #   research_4 <chr>, research_5 <chr>, research_6 <chr>, research_7 <chr>,
#> #   research_8 <chr>, ability_1 <chr>, ability_2 <chr>, ability_3 <chr>,
#> #   ability_4 <chr>, ability_5 <chr>, ability_6 <chr>, ethics_1 <chr>,
#> #   ethics_2 <chr>, ethics_3 <chr>, ethics_4 <chr>, ethics_5 <chr>, …

The SurveyMonkey example data is now imported with names taken from the codebook column variable_name:

names(pre_data)
#>  [1] "respondent_id" "collector_id"  "start_date"    "end_date"     
#>  [5] "ip_address"    "email_address" "first_name"    "last_name"    
#>  [9] "unique_id"     "knowledge"     "research_1"    "research_2"   
#> [13] "research_3"    "research_4"    "research_5"    "research_6"   
#> [17] "research_7"    "research_8"    "ability_1"     "ability_2"    
#> [21] "ability_3"     "ability_4"     "ability_5"     "ability_6"    
#> [25] "ethics_1"      "ethics_2"      "ethics_3"      "ethics_4"     
#> [29] "ethics_5"      "gender"        "role"          "ethnicity"    
#> [33] "first_gen"

Post Survey Data

Do that same process over again with the post data, if the variables are all the same you can use the same codebook.
For this example there are additional post variables so a new codebook will need to be created to rename the variables when reading in the data with readRenameData().

# File path for pre example data:
post_data_fp <- blackstone::blackstoneExample("sm_data_post.csv")
# 1. Create the codebook using the filepath:
codebook_post <- blackstone::createCodebook(post_data_fp)
codebook_post
#> # A tibble: 38 × 5
#>   header_1      header_2 combined_header position variable_name
#>   <chr>         <chr>    <chr>              <int> <chr>        
#> 1 Respondent ID NA       Respondent ID          1 respondent_id
#> 2 Collector ID  NA       Collector ID           2 collector_id 
#> 3 Start Date    NA       Start Date             3 start_date   
#> 4 End Date      NA       End Date               4 end_date     
#> 5 IP Address    NA       IP Address             5 ip_address   
#> # ℹ 33 more rows

# Step 2. Edit the codebook: 
# Set up sequential naming convections for matrix-style questions with shared likert scale response options:
# 8 items that are matrix-style likert scales- turned into a scale called `research`- here is how to easily name them all at once:
# Rows 11 to 18 belong to the "research" matrix question (you will have to look at the codebook and match the header_1 and header_2 to variable_name to change)
research_items <- codebook_post[["variable_name"]][11:18]
research_names <- paste0("research_", seq_along(research_items)) %>% purrr::set_names(., research_items) # Create a new named vector of names for these columns
# 6 items that are matrix-style likert scales- turned into a scale called `ability`- Rows 19 to 24 named `variable_name`:
ability_items <- codebook_post[["variable_name"]][19:24]
ability_names <- paste0("ability_", seq_along(ability_items)) %>% purrr::set_names(., ability_items) # Create a new named vector of names for these columns
# 6 items that are matrix-style likert scales- turned into a scale called `ethics`- Rows 19 to 24 named `variable_name`:
ethics_items <- codebook_post[["variable_name"]][25:29]
ethics_names <- paste0("ethics_", seq_along(ethics_items)) %>% purrr::set_names(., ethics_items) # Create a new named vector of names for these columns
# 5 items that are Open-ended follow up when corresponeding ethics items were answered "Strongly disagree"or "Disagree"- Rows 30 to 34 named `variable_name`:
ethics_items_oe <- codebook_post[["variable_name"]][30:34]
ethics_names_oe <- paste0("ethics_", seq_along(ethics_items), "_oe") %>% purrr::set_names(., ethics_items_oe) # Create a new named vector of names for these columns
# Edit the `variable_names` column: Use dplyr::mutate() and dplyr::case_match() to change the column `variable_name`:
codebook_post <- codebook_post %>% dplyr::mutate(
    variable_name = dplyr::case_match(
        variable_name, # column to match
        'custom_data_1' ~ "unique_id", # changes 'custom_data_1' to "unique_id"
        'to_what_extent_are_you_knowledgeable_in_conducting_research_in_your_field_of_study' ~ "knowledge",
        'with_which_gender_do_you_most_closely_identify' ~ "gender",
        'what_is_your_current_role_in_the_program' ~ "role",
        'which_race_ethnicity_best_describes_you_please_choose_only_one' ~ "ethnicity",
        'are_you_a_first_generation_college_student' ~ "first_gen",
        names(research_names) ~ research_names[variable_name], # takes the above named vector and when the name matches, applies new value in that position as replacement.
        names(ability_names) ~ ability_names[variable_name],   # Same for `ability_names`
        names(ethics_names) ~ ethics_names[variable_name],   # Same for `ability_names`
        names(ethics_names_oe) ~ ethics_names_oe[variable_name],   # Same for `ethics_names_oe`
        .default = variable_name # returns default value from original `variable_name` if not changed.
        )
    )
codebook_post
#> # A tibble: 38 × 5
#>   header_1      header_2 combined_header position variable_name
#>   <chr>         <chr>    <chr>              <int> <chr>        
#> 1 Respondent ID NA       Respondent ID          1 respondent_id
#> 2 Collector ID  NA       Collector ID           2 collector_id 
#> 3 Start Date    NA       Start Date             3 start_date   
#> 4 End Date      NA       End Date               4 end_date     
#> 5 IP Address    NA       IP Address             5 ip_address   
#> # ℹ 33 more rows
# Write out the edited codebook to save for future use-
# Be sure to double check questions match new names before writing out:
# readr::write_csv(codebook_post, file = "{filepath-to-codebok}")
# 3. Read in the data and rename the vars using readRenameData(), passing the file path and the edited codebook:
post_data <- blackstone::readRenameData(post_data_fp, codebook = codebook_post)
post_data
#> # A tibble: 100 × 38
#>   respondent_id collector_id start_date end_date   ip_address      email_address
#>           <dbl>        <dbl> <date>     <date>     <chr>           <chr>        
#> 1  114628000001    431822954 2024-06-11 2024-06-12 227.224.138.113 coraima59@me…
#> 2  114628000002    431822954 2024-06-08 2024-06-09 110.241.132.50  mstamm@hermi…
#> 3  114628000003    431822954 2024-06-21 2024-06-22 165.58.112.64   precious.fei…
#> 4  114628000004    431822954 2024-06-08 2024-06-09 49.34.121.147   ines52@gmail…
#> 5  114628000005    431822954 2024-06-02 2024-06-03 115.233.66.80   franz44@hotm…
#> # ℹ 95 more rows
#> # ℹ 32 more variables: first_name <chr>, last_name <chr>, unique_id <dbl>,
#> #   knowledge <chr>, research_1 <chr>, research_2 <chr>, research_3 <chr>,
#> #   research_4 <chr>, research_5 <chr>, research_6 <chr>, research_7 <chr>,
#> #   research_8 <chr>, ability_1 <chr>, ability_2 <chr>, ability_3 <chr>,
#> #   ability_4 <chr>, ability_5 <chr>, ability_6 <chr>, ethics_1 <chr>,
#> #   ethics_2 <chr>, ethics_3 <chr>, ethics_4 <chr>, ethics_5 <chr>, …

Finally, it is important to add pre_ and post_ prefixes to all unique variables before merging the datasets, (i.e. the survey items that differ pre-post- the SM items and demos are identical):

# Pre data:
pre_data <- pre_data %>% dplyr::rename_with(~ paste0("pre_", .), .cols = c(knowledge:ethics_5))
# Pre data:
post_data <- post_data %>% dplyr::rename_with(~ paste0("post_", .), .cols = c(knowledge:ethics_5_oe))

Merging Data

Merge pre-post data by joining on all the variables that are shared in common.

The dplyr package has many joining functions, the most commonly use is dplyr::left_join() which keeps all the observations from the first table provided and merges all observations from the second that match.

For most data analysis, we will want to use the post data as the primary table and merge all the pre data since most post surveys drop some participants, so we can run our analysis on complete data.

# left_join() will automatically join by all the shared columns, be sure to include all shared variables that should be identical pre-post to the 'by = join_by()' as an arg
# (otherwise you will get a message about additional variables to be joined by):
sm_data <- post_data %>% dplyr::left_join(pre_data, by = dplyr::join_by(respondent_id, collector_id, start_date, end_date, ip_address, email_address, 
                                                                        first_name, last_name, unique_id, gender, role, ethnicity, first_gen))
sm_data
#> # A tibble: 100 × 58
#>   respondent_id collector_id start_date end_date   ip_address      email_address
#>           <dbl>        <dbl> <date>     <date>     <chr>           <chr>        
#> 1  114628000001    431822954 2024-06-11 2024-06-12 227.224.138.113 coraima59@me…
#> 2  114628000002    431822954 2024-06-08 2024-06-09 110.241.132.50  mstamm@hermi…
#> 3  114628000003    431822954 2024-06-21 2024-06-22 165.58.112.64   precious.fei…
#> 4  114628000004    431822954 2024-06-08 2024-06-09 49.34.121.147   ines52@gmail…
#> 5  114628000005    431822954 2024-06-02 2024-06-03 115.233.66.80   franz44@hotm…
#> # ℹ 95 more rows
#> # ℹ 52 more variables: first_name <chr>, last_name <chr>, unique_id <dbl>,
#> #   post_knowledge <chr>, post_research_1 <chr>, post_research_2 <chr>,
#> #   post_research_3 <chr>, post_research_4 <chr>, post_research_5 <chr>,
#> #   post_research_6 <chr>, post_research_7 <chr>, post_research_8 <chr>,
#> #   post_ability_1 <chr>, post_ability_2 <chr>, post_ability_3 <chr>,
#> #   post_ability_4 <chr>, post_ability_5 <chr>, post_ability_6 <chr>, …

Data Cleaning

Convert all likert scales to factors (for ordering) and all demographics.
Create numeric variables from the factor variables for use with statistical tests later on.
If applicable, drop all “Missing”/NA observations.

## Knowledge scale
levels_knowledge <- c("Not knowledgeable at all", "A little knowledgeable", "Somewhat knowledgeable", "Very knowledgeable", "Extremely knowledgeable")
## Research Items scale:
levels_confidence <- c("Not at all confident", "Slightly confident", "Somewhat confident", "Very confident", "Extremely confident")
## Ability Items scale: 
levels_min_ext <- c("Minimal", "Slight", "Moderate", "Good", "Extensive")
## Ethics Items scale:
levels_agree5 <- c("Strongly disagree", "Disagree", "Neither agree nor disagree", "Agree", "Strongly agree")

# Demographic levels:
gender_levels <- c("Female","Male","Non-binary", "Do not wish to specify")
role_levels <- c("Undergraduate student", "Graduate student", "Postdoc",  "Faculty")
ethnicity_levels <- c("White (Non-Hispanic/Latino)", "Asian", "Black",  "Hispanic or Latino", "American Indian or Alaskan Native",
                      "Native Hawaiian or other Pacific Islander", "Do not wish to specify")
first_gen_levels <- c("Yes", "No", "I'm not sure")

# Use mutate() for convert each item in each scale to a factor with vectors above, 
# across() will perform a function for items selected using contains() or can be selected 
# by variables names individually using a character vector: _knowledge or use c("pre_knowledge","post_knowledge")
# Also create new numeric variables for all the likert scale items and use the suffix '_num' to denote numeric:
sm_data <- sm_data %>% dplyr::mutate(dplyr::across(tidyselect::contains("_knowledge"), ~ factor(., levels = levels_knowledge)), # match each name pattern to select to each factor level
                                     dplyr::across(tidyselect::contains("_knowledge"), as.numeric, .names = "{.col}_num"), # create new numeric items for all knowledge items
                                     dplyr::across(tidyselect::contains("research_"), ~ factor(., levels = levels_confidence)), 
                                     dplyr::across(tidyselect::contains("research_"), as.numeric, .names = "{.col}_num"), # create new numeric items for all research items
                                     dplyr::across(tidyselect::contains("ability_"), ~ factor(., levels = levels_min_ext)),
                                     dplyr::across(tidyselect::contains("ability_"), as.numeric, .names = "{.col}_num"), # create new numeric items for all ability items
                                     # select ethics items but not the open_ended responses:
                                     dplyr::across(tidyselect::contains("ethics_") & !tidyselect::contains("_oe"), ~ factor(., levels = levels_agree5)),
                                     dplyr::across(tidyselect::contains("ethics_") & !tidyselect::contains("_oe"), as.numeric, .names = "{.col}_num"), # new numeric items for all ethics items
                                     # individually convert all demographics to factor variables:
                                     gender = factor(gender, levels = gender_levels),
                                     role = factor(role, levels = role_levels),
                                     ethnicity = factor(ethnicity, levels = ethnicity_levels),
                                     first_gen = factor(first_gen, levels = first_gen_levels),
                                     )
sm_data
#> # A tibble: 100 × 98
#>   respondent_id collector_id start_date end_date   ip_address      email_address
#>           <dbl>        <dbl> <date>     <date>     <chr>           <chr>        
#> 1  114628000001    431822954 2024-06-11 2024-06-12 227.224.138.113 coraima59@me…
#> 2  114628000002    431822954 2024-06-08 2024-06-09 110.241.132.50  mstamm@hermi…
#> 3  114628000003    431822954 2024-06-21 2024-06-22 165.58.112.64   precious.fei…
#> 4  114628000004    431822954 2024-06-08 2024-06-09 49.34.121.147   ines52@gmail…
#> 5  114628000005    431822954 2024-06-02 2024-06-03 115.233.66.80   franz44@hotm…
#> # ℹ 95 more rows
#> # ℹ 92 more variables: first_name <chr>, last_name <chr>, unique_id <dbl>,
#> #   post_knowledge <fct>, post_research_1 <fct>, post_research_2 <fct>,
#> #   post_research_3 <fct>, post_research_4 <fct>, post_research_5 <fct>,
#> #   post_research_6 <fct>, post_research_7 <fct>, post_research_8 <fct>,
#> #   post_ability_1 <fct>, post_ability_2 <fct>, post_ability_3 <fct>,
#> #   post_ability_4 <fct>, post_ability_5 <fct>, post_ability_6 <fct>, …

The data cleaned in this vignette will be used as the example data in the vignettes Data analysis and Statistical Inference and Data Visualization to further showcase all the functions contained in blackstone.