--- title: "Making cheese" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Making cheese} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- Structuring code that is robust to updates in the data, changes in methodological specifications, etc., and can get to presentation-ready results in an automated way can mitigate errors caused by painful, tedious tasks--it can also be a huge time saver. The `cheese` package was designed with this philosophy in mind, heavily influenced by (and explicit dependencies on) [`tidy`](https://www.tidyverse.org/) concepts. Its intention is to provide general tools for working with data and results during statistical analysis, in addition to a collection of functions designated for specific statistical tasks. Let's take a closer look: ```{r} #Load the package require(cheese) ``` ### Heart disease dataset This data comes from the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/Heart+Disease), containing a collection of demographic and clinical characteristics from `r nrow(heart_disease)` patients. It was subsequently processed and cleaned into a format suitable for analysis--details of which can be found [here](https://github.com/zajichek/cheese/blob/master/data-raw/heart_disease.R). ```{r} #Look at the data heart_disease ``` The functions that will be introduced are roughly in order of their development, as many build off of one another. Selection of columns is done with non-standard evaluation (NSE) or with `tidyselect::select_helpers`, where applicable. # Splitting and binding data `divide()` is used to split up a data frame into a list of subsets. Suppose you want to split the example data set by sex and heart disease status: ```{r} div_dat <- heart_disease %>% divide( Sex, HeartDisease ) div_dat ``` The default behavior is to continually split the data in a hierarchical structure based on the order of the input columns, and to remove them from the result. The `keep` argument can be used to retain the column(s) in the data frames. `fasten()` allows you to take a list structure that contains data frames at an arbitrary depth, and roll them back up into a single data frame. This is useful when a statistical process needs to be mapped to many subsets, and you want to be able to easily collect the results without repeatedly binding data. You can call the function with the divided data from above: ```{r} div_dat %>% fasten() ``` It was binded back to the original number of rows, but it lost the columns it was split by. The `into` argument can be used to handle this: ```{r} div_dat %>% fasten( into = c("Sex", "HeartDisease") ) ``` The positions of the `into` values always lineup with the level of the list hierarchy. So, for example, if you don't care about retaining the heart disease status, you can do this: ```{r} div_dat %>% fasten( into = "Sex" ) ``` In contrast, if you want to forgo the sex indicator, empty strings will need to be used as placeholders so the names are applied at the correct levels. ```{r} div_dat %>% fasten( into = c("", "HeartDisease") ) ``` Obviously, the classes of the split columns are not retained from the original data since the splitting and binding processes are independent. ## Adjusting the depth As shown above, the default behavior for these functions is to split or bind "as much as possible". However, it can be useful to have control over the shape of the split. This is where the `depth` argument comes in. Suppose you want the same data frames at the leaves of the list, but only one level deep: ```{r} heart_disease %>% divide( Sex, HeartDisease, depth = 1 ) ``` You now have list with four elements containing the subsets--the names of which are the concatenated levels of the split columns (see the `sep` argument). This argument also works when binding data. Going back to the original divided data frame, you may only want to bind the internal lists. ```{r} div_dat %>% fasten( into = "HeartDisease", depth = 1 ) ``` Note that the positions of the `into` values are directly related to how shallow/deep you want the result to be. The depth can also be controlled relative to the leaves by using negative integers. This can be useful if it is unclear how deep the list will be (or is). In the example above, suppose you didn't know how deep the list was, but you knew the heart disease status was the most internal split, and thus wanted to bind it. ```{r} div_dat %>% fasten( into = "HeartDisease", depth = -1 ) ``` The new depth is one less than the input. In this case, the same result was achieved because of the symmetry. # Applying functions The next set of functions are rooted in functional programming (i.e. [`purrr`](https://purrr.tidyverse.org/)). In each instance, given a pre-defined function, you can easily control the scope of the data in which it is evaluated on. This is an incredibly useful and efficient philosophy for generating reusable code that is non-repetitive. ## Pairwise combinations of column sets Often times a statistical analysis is concerned with multiple outcomes/targets, though each one potentially uses the same set of explanatory variables. You could of course hard code the specific analysis steps for each outcome (e.g. formulas, etc.), but then the code becomes repetitive. You could also make separate data sets for each outcome with the associated predictors and then apply the same analysis process to each data set, but then you waste resources by storing multiple copies of the same data in memory. `dish()` allows you to distribute a function to various pairwise sets of columns from the same data set in a single call. The motivation described above is merely that--this is a general function that can be applied to any context. As a simple first example, lets compute the Pearson correlation of blood pressure and cholesterol with all numeric columns: ```{r} heart_disease %>% dish( f = cor, left = c(BP, Cholesterol), right = where(is.numeric) ) ``` The `left` argument controls which columns are entered in the first argument of `f`, and the `right` argument controls which columns are entered into the second. In the example, you'll notice that the `left` columns were also included in the set of `right` columns, because they are, in fact, `numeric()`. If you didn't want this, you can omit the `right` argument, which will then include everything not in `left` as the second argument of `f`: ```{r} heart_disease %>% dplyr::select(where(is.numeric)) %>% #Need to do this so only numeric columns are evaluated dish( f = cor, left = c(BP, Cholesterol) ) ``` Now lets suppose you want to regress both blood pressure and cholesterol on all other variables in the data set. The `each_` arguments allow you to control whether the column sets are entered into the function individually as vectors, or together in a single data frame. The former is the default, so you'll need to use this argument here: ```{r} heart_disease %>% dish( f = function(y, x) lm(y ~ ., data = x), left = c(BP, Cholesterol), each_right = FALSE ) ``` ## Subsets of data `stratiply()` streamlines the familiar `split()` then `purrr::map()` approach by making it simple to select the stratification columns, and apply a function to each subset. Let's run a multiple logistic regression model using heart disease as the outcome, stratified by sex: ```{r} heart_disease %>% stratiply( f = glm, by = Sex, formula = HeartDisease ~ . -ChestPain, family = "binomial" ) ``` Adding multiple stratification columns will produce a deeper list: ```{r} heart_disease %>% stratiply( f = glm, by = c(Sex, ExerciseInducedAngina), formula = HeartDisease ~ . -ChestPain, family = "binomial" ) ``` ## Distributing to specific classes `typly()` allows you to apply a function to columns of a data frame whose type inherits at least one (or none) of the specified candidates. It is similar to `purrr::map_if()`, but uses `methods::is()` for determining types (see `some_type()`), and only returns the results for the selected columns: ```{r} heart_disease %>% typly( f = mean, types = c("numeric", "logical") ) ``` You can use the `negated` argument to apply the function to columns that are not any of the listed types: ```{r} heart_disease %>% typly( f = table, types = c("numeric", "logical"), negated = TRUE, useNA = "always" ) ``` # Populating key-value pairs into custom string templates `absorb()` allows you to create collections of strings that use keys as placeholders, which are then populated by the values. This is useful for setting up results of your analysis to be presented in a specific format that is independent of the data set at hand. Here's a simple example: ```{r} absorb( key = c("mean", "sd", "var"), value = c("10", "2", "4"), text = c( "MEAN: mean, SD: sd", "VAR: var = sd^2", MEAN = "mean" ) ) ``` The input `text` is scanned to look for the keys (interpreted as regular expressions) and are replaced with the corresponding values. Let's look at a more useful example. Suppose we have a collection summary statistics for patient age by angina and heart disease status: ```{r} age_stats <- heart_disease %>% dplyr::group_by( ExerciseInducedAngina, HeartDisease ) %>% dplyr::summarise( mean = round(mean(Age), 2), sd = round(sd(Age), 2), median = median(Age), iqr = IQR(Age), .groups = "drop" ) %>% tidyr::gather( key = "key", value = "value", -ExerciseInducedAngina, -HeartDisease ) age_stats ``` Next, you can specify the string format you want to nicely summarize the information in each group: ```{r} age_stats %>% dplyr::group_by( ExerciseInducedAngina, HeartDisease ) %>% dplyr::summarise( Summary = absorb( key = key, value = value, text = c( "mean (sd) | median (iqr)" ) ), .groups = "drop" ) ``` Of course, this is much more useful when the summary data is already in the format of `age_stats`. ## Concatenating repetitive keys In the case where the key pattern found in the string template is repeated, its values are collapsed into a concatenated string: ```{r} absorb( key = age_stats$key, value = age_stats$value, text = c("(mean) / (sd)", "median") ) ``` The `sep` argument can be used to customize how values are separated. ## Evaluating strings as expressions It may be useful in some cases to evaluate the resulting string as an expression. You can setup the prior example so that the result contains valid expressions: ```{r} absorb( key = age_stats$key, value = age_stats$value, text = c("(mean) / (sd)", "median"), sep = "+", evaluate = TRUE, trace = TRUE ) ``` These statistics are not entirely useful here, but you get the point. # Traversing lists to identify objects `depths()` and `depths_string()` are used for traversing a list structure to find the elements that satisfy a predicate. The former produces a vector of the unique depths that contain at least one element where `predicate` is true, and the latter creates a string representation of the traversal with the actual positions. ```{r} #Make a divided data frame div_dat1 <- heart_disease %>% divide( Sex, HeartDisease ) #Find the depths of the data frames div_dat1 %>% depths( predicate = is.data.frame ) div_dat1 %>% depths_string( predicate = is.data.frame ) ``` In the string result, the brackets represent the level of the list, and the integers represent the index at that level, which are negative if `predicate` is true for that element. By default, the algorithm continues when `rlang::is_bare_list()` so that the traversal can end with list-like objects. However, the `bare` argument can be used to traverse deeper into the list: ```{r} div_dat1 %>% depths_string( predicate = is.data.frame, bare = FALSE ) ``` Now it also evaluated the columns of each data frame in the list, though it still found the correct positions where the predicate held. # Distorting relationships between columns `muddle()` can be used to randomly shuffle columns in a data frame. This is a convenient way to remove confounding when exploring the effects of a variable on an outcome. ```{r} set.seed(123) heart_disease %>% muddle() ``` All columns are permuted by default. Use the `at` argument to only shuffle certain ones: ```{r} heart_disease %>% muddle( at = Age ) ``` Additional arguments can be passed to `sample()` as well. For example, you might want five random values from each column: ```{r} heart_disease %>% muddle( size = 5 ) ``` # Spanning keys and values across the columns `stretch()` is for spanning keys and values across the columns. It is similar to `tidyr::spread()`, but allows for any number of keys and values to be selected. It's functionality is more so targeted at setting up your results to be presented in a table, rather than for general data manipulation. Thus, the resulting column names and orderings are setup to make it a seamless transition. To illustrate, lets first collect the parameter estimates and standard errors from multiple regression models using blood pressure and cholesterol as outcomes, stratified by sex: ```{r} mod_results <- heart_disease %>% #For each sex stratiply( by = Sex, f = ~.x %>% #Collect coefficients and standard errors for model estimates on multiple outcomes dish( f = function(y, x) { model <- lm(y ~ ., data = x) tibble::tibble( Parameter = names(model$coefficients), Estimate = model$coefficients, SE = sqrt(diag(vcov(model))) ) }, left = c(BP, Cholesterol), each_right = FALSE ) ) %>% #Gather results into a single data frame fasten( into = c("Sex", "Outcome") ) mod_results ``` Now lets span the estimates and standard errors across the columns for both outcomes: ```{r} straught1 <- mod_results %>% stretch( key = Outcome, value = c(Estimate, SE) ) straught1 ``` The `sep` argument controls how the new column names are concatenated. Now suppose you wanted to also span sex across the columns. It can just be added to the keys: ```{r} straught2 <- mod_results %>% stretch( key = c(Sex, Outcome), value = c(Estimate, SE) ) straught2 ``` Notice how the resulting columns are positioned. They are sequentially sorted starting with the outer-most key in a hierarchical manner. Since there are multiple `value` columns in this example, their names are also appended to the result in the order in which they were requested. ## Stacking headers in a table `grable()` stands for "graded table". It is used to stack headers into a `knitr::kable()` in an automated fashion by iteratively parsing the column names by the specified delimiter. The result of `stretch()` was purposely made to flow directly into this function. ```{r} straught1 %>% grable( at = -c(Sex, Parameter) ) ``` The `at` argument specifies which columns should be spanned in the header. Let's try it with the two-key example from above: ```{r} straught2 %>% grable( at = -Parameter ) ``` The tables can then be piped through subsequent customized processing with [`kableExtra`](https://CRAN.R-project.org/package=kableExtra) tools. # Univariate analysis `descriptives()` produces a data frame in long format with a collection of descriptive statistics. ```{r} heart_disease %>% descriptives() ``` `univariate_associations()` is a special case of `dish()` specifically for computing association metrics of one or more `responses` with a collection of `predictors`. ```{r} heart_disease %>% univariate_associations( f = list(AIC = function(y, x) AIC(lm(y ~ x))), responses = c(BP, Cholesterol) ) ``` `univariate_table()` allows you to create a custom descriptive table for a data set. It uses almost every function in the package across its span of capabilities. ```{r} heart_disease %>% univariate_table() ``` See `vignette("describe")` for a more in-depth introduction to these functions.