blackstone contains many functions for importing and
cleaning data. This vignette shows you:
How to create a codebook for SurveyMonkey data.
How edit the codebook and use it to read in SurveyMonkey data.
How to merge data from different time points (pre-post data).
How to clean a dataset and make sure all variables are the correct type for analysis.
We’ll start by loading blackstone:
Data Cleaning and Manipulation
Importing data from SurveyMonkey
Our current survey provider is SurveyMonkey,
blackstone contains several functions that makes the
process of reading SurveyMonkey data into R a more
manageable process and creates a codebook for the data along the
way.
SurveyMonkey exports data with two header rows, which does not work
with R where tibbles and dataframes can only have one row
of names.
Here is how to import data from SurveyMonkey using example data
provided with blackstone, this is a fake dataset of a pre
(baseline) survey. There are three steps to this process:
- Create a codebook.
- Edit and save the codebook to create meaningful variable names.
- Read in the data and rename the variables with the codebook.
Pre Survey Data
1. Creating the codebook
# File path for pre example data:
pre_data_fp <- blackstone::blackstoneExample("sm_data_pre.csv")
# 1. Create the codebook:
codebook_pre <- blackstone::createCodebook(pre_data_fp)
codebook_pre
#> # A tibble: 33 × 5
#> header_1 header_2 combined_header position variable_name
#> <chr> <chr> <chr> <int> <chr>
#> 1 Respondent ID NA Respondent ID 1 respondent_id
#> 2 Collector ID NA Collector ID 2 collector_id
#> 3 Start Date NA Start Date 3 start_date
#> 4 End Date NA End Date 4 end_date
#> 5 IP Address NA IP Address 5 ip_address
#> # ℹ 28 more rowsFor this codebook, the first column header_1 is the
first header from the SurveyMonkey data, the second column
header_2 is the second header, the third column
combined_header is the combination of the two headers,
position is the column number position for each
combined_header, and variable_name is a
cleaned up version for combined_header and will be the
column to edit to change the column names later on to shorter and more
meaningful names.
variable_name will be the column that renames all the
variables in the SurveyMonkey data.
2. Editing and Saving the Codebook
# Step 2. Edit the codebook:
# Set up sequential naming convections for matrix-style questions with shared likert scale response options:
# 8 items that are matrix-style likert scales- turned into a scale called `research`- here is how to easily name them all at once:
# Rows 11 to 18 belong to the "research" matrix question (you will have to look at the codebook and match the header_1 and header_2 to variable_name to change)
research_items <- codebook_pre[["variable_name"]][11:18]
research_names <- paste0("research_", seq_along(research_items)) %>% purrr::set_names(., research_items) # Create a new named vector of names for these columns
# 6 items that are matrix-style likert scales- turned into a scale called `ability`- Rows 19 to 24 named `variable_name`:
ability_items <- codebook_pre[["variable_name"]][19:24]
ability_names <- paste0("ability_", seq_along(ability_items)) %>% purrr::set_names(., ability_items) # Create a new named vector of names for these columns
# 6 items that are matrix-style likert scales- turned into a scale called `ethics`- Rows 19 to 24 named `variable_name`:
ethics_items <- codebook_pre[["variable_name"]][25:29]
ethics_names <- paste0("ethics_", seq_along(ethics_items)) %>% purrr::set_names(., ethics_items) # Create a new named vector of names for these columns
# Edit the `variable_names` column: Use dplyr::mutate() and dplyr::case_match() to change the column `variable_name`:
codebook_pre <- codebook_pre %>% dplyr::mutate(
variable_name = dplyr::case_match(
variable_name, # column to match
'custom_data_1' ~ "unique_id", # changes 'custom_data_1' to "unique_id"
'to_what_extent_are_you_knowledgeable_in_conducting_research_in_your_field_of_study' ~ "knowledge",
'with_which_gender_do_you_most_closely_identify' ~ "gender",
'what_is_your_current_role_in_the_program' ~ "role",
'which_race_ethnicity_best_describes_you_please_choose_only_one' ~ "ethnicity",
'are_you_a_first_generation_college_student' ~ "first_gen",
names(research_names) ~ research_names[variable_name], # takes the above named vector and when the name matches, applies new value in that position as replacement.
names(ability_names) ~ ability_names[variable_name], # Same for `ability_names`
names(ethics_names) ~ ethics_names[variable_name], # Same for `ability_names`
.default = variable_name # returns default value from original `variable_name` if not changed.
)
)
codebook_pre
#> # A tibble: 33 × 5
#> header_1 header_2 combined_header position variable_name
#> <chr> <chr> <chr> <int> <chr>
#> 1 Respondent ID NA Respondent ID 1 respondent_id
#> 2 Collector ID NA Collector ID 2 collector_id
#> 3 Start Date NA Start Date 3 start_date
#> 4 End Date NA End Date 4 end_date
#> 5 IP Address NA IP Address 5 ip_address
#> # ℹ 28 more rows
# Write out the edited codebook to save for future use-
# Be sure to double check questions match new names before writing out:
# readr::write_csv(codebook_pre, file = "{filepath-to-codebok}")3. Import the Data and Rename the Variables with the Codebook
# 3. Read in the data and rename the vars using readRenameData(), passing the file path and the edited codebook:
pre_data <- blackstone::readRenameData(pre_data_fp, codebook = codebook_pre)
pre_data
#> # A tibble: 100 × 33
#> respondent_id collector_id start_date end_date ip_address email_address
#> <dbl> <dbl> <date> <date> <chr> <chr>
#> 1 114628000001 431822954 2024-06-11 2024-06-12 227.224.138.113 coraima59@me…
#> 2 114628000002 431822954 2024-06-08 2024-06-09 110.241.132.50 mstamm@hermi…
#> 3 114628000003 431822954 2024-06-21 2024-06-22 165.58.112.64 precious.fei…
#> 4 114628000004 431822954 2024-06-08 2024-06-09 49.34.121.147 ines52@gmail…
#> 5 114628000005 431822954 2024-06-02 2024-06-03 115.233.66.80 franz44@hotm…
#> # ℹ 95 more rows
#> # ℹ 27 more variables: first_name <chr>, last_name <chr>, unique_id <dbl>,
#> # knowledge <chr>, research_1 <chr>, research_2 <chr>, research_3 <chr>,
#> # research_4 <chr>, research_5 <chr>, research_6 <chr>, research_7 <chr>,
#> # research_8 <chr>, ability_1 <chr>, ability_2 <chr>, ability_3 <chr>,
#> # ability_4 <chr>, ability_5 <chr>, ability_6 <chr>, ethics_1 <chr>,
#> # ethics_2 <chr>, ethics_3 <chr>, ethics_4 <chr>, ethics_5 <chr>, …The SurveyMonkey example data is now imported with names taken from
the codebook column variable_name:
names(pre_data)
#> [1] "respondent_id" "collector_id" "start_date" "end_date"
#> [5] "ip_address" "email_address" "first_name" "last_name"
#> [9] "unique_id" "knowledge" "research_1" "research_2"
#> [13] "research_3" "research_4" "research_5" "research_6"
#> [17] "research_7" "research_8" "ability_1" "ability_2"
#> [21] "ability_3" "ability_4" "ability_5" "ability_6"
#> [25] "ethics_1" "ethics_2" "ethics_3" "ethics_4"
#> [29] "ethics_5" "gender" "role" "ethnicity"
#> [33] "first_gen"Post Survey Data
Do that same process over again with the post data, if the variables are all the same you can use the same codebook.
For this example there are additional post variables so a new codebook will need to be created to rename the variables when reading in the data with
readRenameData().
# File path for pre example data:
post_data_fp <- blackstone::blackstoneExample("sm_data_post.csv")
# 1. Create the codebook using the filepath:
codebook_post <- blackstone::createCodebook(post_data_fp)
codebook_post
#> # A tibble: 38 × 5
#> header_1 header_2 combined_header position variable_name
#> <chr> <chr> <chr> <int> <chr>
#> 1 Respondent ID NA Respondent ID 1 respondent_id
#> 2 Collector ID NA Collector ID 2 collector_id
#> 3 Start Date NA Start Date 3 start_date
#> 4 End Date NA End Date 4 end_date
#> 5 IP Address NA IP Address 5 ip_address
#> # ℹ 33 more rows
# Step 2. Edit the codebook:
# Set up sequential naming convections for matrix-style questions with shared likert scale response options:
# 8 items that are matrix-style likert scales- turned into a scale called `research`- here is how to easily name them all at once:
# Rows 11 to 18 belong to the "research" matrix question (you will have to look at the codebook and match the header_1 and header_2 to variable_name to change)
research_items <- codebook_post[["variable_name"]][11:18]
research_names <- paste0("research_", seq_along(research_items)) %>% purrr::set_names(., research_items) # Create a new named vector of names for these columns
# 6 items that are matrix-style likert scales- turned into a scale called `ability`- Rows 19 to 24 named `variable_name`:
ability_items <- codebook_post[["variable_name"]][19:24]
ability_names <- paste0("ability_", seq_along(ability_items)) %>% purrr::set_names(., ability_items) # Create a new named vector of names for these columns
# 6 items that are matrix-style likert scales- turned into a scale called `ethics`- Rows 19 to 24 named `variable_name`:
ethics_items <- codebook_post[["variable_name"]][25:29]
ethics_names <- paste0("ethics_", seq_along(ethics_items)) %>% purrr::set_names(., ethics_items) # Create a new named vector of names for these columns
# 5 items that are Open-ended follow up when corresponeding ethics items were answered "Strongly disagree"or "Disagree"- Rows 30 to 34 named `variable_name`:
ethics_items_oe <- codebook_post[["variable_name"]][30:34]
ethics_names_oe <- paste0("ethics_", seq_along(ethics_items), "_oe") %>% purrr::set_names(., ethics_items_oe) # Create a new named vector of names for these columns
# Edit the `variable_names` column: Use dplyr::mutate() and dplyr::case_match() to change the column `variable_name`:
codebook_post <- codebook_post %>% dplyr::mutate(
variable_name = dplyr::case_match(
variable_name, # column to match
'custom_data_1' ~ "unique_id", # changes 'custom_data_1' to "unique_id"
'to_what_extent_are_you_knowledgeable_in_conducting_research_in_your_field_of_study' ~ "knowledge",
'with_which_gender_do_you_most_closely_identify' ~ "gender",
'what_is_your_current_role_in_the_program' ~ "role",
'which_race_ethnicity_best_describes_you_please_choose_only_one' ~ "ethnicity",
'are_you_a_first_generation_college_student' ~ "first_gen",
names(research_names) ~ research_names[variable_name], # takes the above named vector and when the name matches, applies new value in that position as replacement.
names(ability_names) ~ ability_names[variable_name], # Same for `ability_names`
names(ethics_names) ~ ethics_names[variable_name], # Same for `ability_names`
names(ethics_names_oe) ~ ethics_names_oe[variable_name], # Same for `ethics_names_oe`
.default = variable_name # returns default value from original `variable_name` if not changed.
)
)
codebook_post
#> # A tibble: 38 × 5
#> header_1 header_2 combined_header position variable_name
#> <chr> <chr> <chr> <int> <chr>
#> 1 Respondent ID NA Respondent ID 1 respondent_id
#> 2 Collector ID NA Collector ID 2 collector_id
#> 3 Start Date NA Start Date 3 start_date
#> 4 End Date NA End Date 4 end_date
#> 5 IP Address NA IP Address 5 ip_address
#> # ℹ 33 more rows
# Write out the edited codebook to save for future use-
# Be sure to double check questions match new names before writing out:
# readr::write_csv(codebook_post, file = "{filepath-to-codebok}")
# 3. Read in the data and rename the vars using readRenameData(), passing the file path and the edited codebook:
post_data <- blackstone::readRenameData(post_data_fp, codebook = codebook_post)
post_data
#> # A tibble: 100 × 38
#> respondent_id collector_id start_date end_date ip_address email_address
#> <dbl> <dbl> <date> <date> <chr> <chr>
#> 1 114628000001 431822954 2024-06-11 2024-06-12 227.224.138.113 coraima59@me…
#> 2 114628000002 431822954 2024-06-08 2024-06-09 110.241.132.50 mstamm@hermi…
#> 3 114628000003 431822954 2024-06-21 2024-06-22 165.58.112.64 precious.fei…
#> 4 114628000004 431822954 2024-06-08 2024-06-09 49.34.121.147 ines52@gmail…
#> 5 114628000005 431822954 2024-06-02 2024-06-03 115.233.66.80 franz44@hotm…
#> # ℹ 95 more rows
#> # ℹ 32 more variables: first_name <chr>, last_name <chr>, unique_id <dbl>,
#> # knowledge <chr>, research_1 <chr>, research_2 <chr>, research_3 <chr>,
#> # research_4 <chr>, research_5 <chr>, research_6 <chr>, research_7 <chr>,
#> # research_8 <chr>, ability_1 <chr>, ability_2 <chr>, ability_3 <chr>,
#> # ability_4 <chr>, ability_5 <chr>, ability_6 <chr>, ethics_1 <chr>,
#> # ethics_2 <chr>, ethics_3 <chr>, ethics_4 <chr>, ethics_5 <chr>, …Finally, it is important to add pre_ and
post_ prefixes to all unique variables before merging the
datasets, (i.e. the survey items that differ pre-post- the SM items and
demos are identical):
# Pre data:
pre_data <- pre_data %>% dplyr::rename_with(~ paste0("pre_", .), .cols = c(knowledge:ethics_5))
# Pre data:
post_data <- post_data %>% dplyr::rename_with(~ paste0("post_", .), .cols = c(knowledge:ethics_5_oe))Merging Data
Merge pre-post data by joining on all the variables that are shared in common.
The dplyr package has many joining
functions, the most commonly use is dplyr::left_join()
which keeps all the observations from the first table provided and
merges all observations from the second that match.
For most data analysis, we will want to use the post data as the primary table and merge all the pre data since most post surveys drop some participants, so we can run our analysis on complete data.
# left_join() will automatically join by all the shared columns, be sure to include all shared variables that should be identical pre-post to the 'by = join_by()' as an arg
# (otherwise you will get a message about additional variables to be joined by):
sm_data <- post_data %>% dplyr::left_join(pre_data, by = dplyr::join_by(respondent_id, collector_id, start_date, end_date, ip_address, email_address,
first_name, last_name, unique_id, gender, role, ethnicity, first_gen))
sm_data
#> # A tibble: 100 × 58
#> respondent_id collector_id start_date end_date ip_address email_address
#> <dbl> <dbl> <date> <date> <chr> <chr>
#> 1 114628000001 431822954 2024-06-11 2024-06-12 227.224.138.113 coraima59@me…
#> 2 114628000002 431822954 2024-06-08 2024-06-09 110.241.132.50 mstamm@hermi…
#> 3 114628000003 431822954 2024-06-21 2024-06-22 165.58.112.64 precious.fei…
#> 4 114628000004 431822954 2024-06-08 2024-06-09 49.34.121.147 ines52@gmail…
#> 5 114628000005 431822954 2024-06-02 2024-06-03 115.233.66.80 franz44@hotm…
#> # ℹ 95 more rows
#> # ℹ 52 more variables: first_name <chr>, last_name <chr>, unique_id <dbl>,
#> # post_knowledge <chr>, post_research_1 <chr>, post_research_2 <chr>,
#> # post_research_3 <chr>, post_research_4 <chr>, post_research_5 <chr>,
#> # post_research_6 <chr>, post_research_7 <chr>, post_research_8 <chr>,
#> # post_ability_1 <chr>, post_ability_2 <chr>, post_ability_3 <chr>,
#> # post_ability_4 <chr>, post_ability_5 <chr>, post_ability_6 <chr>, …Data Cleaning
Convert all likert scales to factors (for ordering) and all demographics.
Create numeric variables from the factor variables for use with statistical tests later on.
If applicable, drop all “Missing”/
NAobservations.
## Knowledge scale
levels_knowledge <- c("Not knowledgeable at all", "A little knowledgeable", "Somewhat knowledgeable", "Very knowledgeable", "Extremely knowledgeable")
## Research Items scale:
levels_confidence <- c("Not at all confident", "Slightly confident", "Somewhat confident", "Very confident", "Extremely confident")
## Ability Items scale:
levels_min_ext <- c("Minimal", "Slight", "Moderate", "Good", "Extensive")
## Ethics Items scale:
levels_agree5 <- c("Strongly disagree", "Disagree", "Neither agree nor disagree", "Agree", "Strongly agree")
# Demographic levels:
gender_levels <- c("Female","Male","Non-binary", "Do not wish to specify")
role_levels <- c("Undergraduate student", "Graduate student", "Postdoc", "Faculty")
ethnicity_levels <- c("White (Non-Hispanic/Latino)", "Asian", "Black", "Hispanic or Latino", "American Indian or Alaskan Native",
"Native Hawaiian or other Pacific Islander", "Do not wish to specify")
first_gen_levels <- c("Yes", "No", "I'm not sure")
# Use mutate() for convert each item in each scale to a factor with vectors above,
# across() will perform a function for items selected using contains() or can be selected
# by variables names individually using a character vector: _knowledge or use c("pre_knowledge","post_knowledge")
# Also create new numeric variables for all the likert scale items and use the suffix '_num' to denote numeric:
sm_data <- sm_data %>% dplyr::mutate(dplyr::across(tidyselect::contains("_knowledge"), ~ factor(., levels = levels_knowledge)), # match each name pattern to select to each factor level
dplyr::across(tidyselect::contains("_knowledge"), as.numeric, .names = "{.col}_num"), # create new numeric items for all knowledge items
dplyr::across(tidyselect::contains("research_"), ~ factor(., levels = levels_confidence)),
dplyr::across(tidyselect::contains("research_"), as.numeric, .names = "{.col}_num"), # create new numeric items for all research items
dplyr::across(tidyselect::contains("ability_"), ~ factor(., levels = levels_min_ext)),
dplyr::across(tidyselect::contains("ability_"), as.numeric, .names = "{.col}_num"), # create new numeric items for all ability items
# select ethics items but not the open_ended responses:
dplyr::across(tidyselect::contains("ethics_") & !tidyselect::contains("_oe"), ~ factor(., levels = levels_agree5)),
dplyr::across(tidyselect::contains("ethics_") & !tidyselect::contains("_oe"), as.numeric, .names = "{.col}_num"), # new numeric items for all ethics items
# individually convert all demographics to factor variables:
gender = factor(gender, levels = gender_levels),
role = factor(role, levels = role_levels),
ethnicity = factor(ethnicity, levels = ethnicity_levels),
first_gen = factor(first_gen, levels = first_gen_levels),
)
sm_data
#> # A tibble: 100 × 98
#> respondent_id collector_id start_date end_date ip_address email_address
#> <dbl> <dbl> <date> <date> <chr> <chr>
#> 1 114628000001 431822954 2024-06-11 2024-06-12 227.224.138.113 coraima59@me…
#> 2 114628000002 431822954 2024-06-08 2024-06-09 110.241.132.50 mstamm@hermi…
#> 3 114628000003 431822954 2024-06-21 2024-06-22 165.58.112.64 precious.fei…
#> 4 114628000004 431822954 2024-06-08 2024-06-09 49.34.121.147 ines52@gmail…
#> 5 114628000005 431822954 2024-06-02 2024-06-03 115.233.66.80 franz44@hotm…
#> # ℹ 95 more rows
#> # ℹ 92 more variables: first_name <chr>, last_name <chr>, unique_id <dbl>,
#> # post_knowledge <fct>, post_research_1 <fct>, post_research_2 <fct>,
#> # post_research_3 <fct>, post_research_4 <fct>, post_research_5 <fct>,
#> # post_research_6 <fct>, post_research_7 <fct>, post_research_8 <fct>,
#> # post_ability_1 <fct>, post_ability_2 <fct>, post_ability_3 <fct>,
#> # post_ability_4 <fct>, post_ability_5 <fct>, post_ability_6 <fct>, …The data cleaned in this vignette will be used as the example data in
the vignettes Data analysis and Statistical Inference
and Data Visualization to
further showcase all the functions contained in
blackstone.
