# Tutorial: tbl_summary

## Introduction

The tbl_summary() function calculates descriptive statistics for continuous, categorical, and dichotomous variables in R, and presents the results in a beautiful, customizable summary table ready for publication (for example, Table 1 or demographic tables).

This vignette will walk a reader through the tbl_summary() function, and the various functions available to modify and make additions to an existing table summary object.

## Setup

Before going through the tutorial, install and load {gtsummary}.

```r
# install.packages("gtsummary")
library(gtsummary)
```

## Example data set

We’ll be using the trial data set throughout this example.

• This set contains data from 200 patients who received one of two types of chemotherapy (Drug A or Drug B). The outcomes are tumor response and death.

• Each variable in the data frame has been assigned an attribute label (i.e. attr(trial$trt, "label") == "Chemotherapy Treatment") with the labelled package. These labels are displayed in the {gtsummary} output table by default. Using {gtsummary} on a data frame without labels will simply print variable names in place of variable labels; there is also an option to add labels later.

| Variable | Class | Label |
|---|---|---|
| trt | character | Chemotherapy Treatment |
| age | numeric | Age |
| marker | numeric | Marker Level (ng/mL) |
| stage | factor | T Stage |
| grade | factor | Grade |
| response | integer | Tumor Response |
| death | integer | Patient Died |
| ttdeath | numeric | Months to Death/Censor |

Includes mix of continuous, dichotomous, and categorical variables

```r
head(trial)
#> # A tibble: 6 × 8
#>   trt      age marker stage grade response death ttdeath
#>   <chr>  <dbl>  <dbl> <fct> <fct>    <int> <int>   <dbl>
#> 1 Drug A    23  0.16  T1    II           0     0    24
#> 2 Drug B     9  1.11  T2    I            1     0    24
#> 3 Drug A    31  0.277 T1    II           0     0    24
#> 4 Drug A    NA  2.07  T3    III          1     1    17.6
#> 5 Drug A    51  2.77  T4    III          1     1    16.4
#> 6 Drug B    39  0.613 T4    I            0     1    15.6
```

For brevity, in this tutorial we’ll use a subset of the variables from the trial data set.

```r
trial2 <- trial %>% select(trt, age, grade)
```

## Basic Usage

The default output from tbl_summary() is meant to be publication ready.

Let’s start by creating a table of summary statistics from the trial data set. The tbl_summary() function can take, at minimum, a data frame as the only input, and returns descriptive statistics for each column in the data frame.

```r
trial2 %>% tbl_summary()
```

| Characteristic | N = 200¹ |
|---|---|
| Chemotherapy Treatment | |
| Drug A | 98 (49%) |
| Drug B | 102 (51%) |
| Age | 47 (38, 57) |
| Unknown | 11 |
| Grade | |
| I | 68 (34%) |
| II | 68 (34%) |
| III | 64 (32%) |

¹ n (%); Median (IQR)

Note the sensible defaults with this basic usage; each of the defaults may be customized.

• Variable types are automatically detected so that appropriate descriptive statistics are calculated.
• Label attributes from the data set are automatically printed.
• Missing values are listed as “Unknown” in the table.
• Variable levels are indented and footnotes are added.

For this study data the summary statistics should be split by treatment group, which can be done by using the by= argument. To compare two or more groups, include add_p() with the function call, which detects variable type and uses an appropriate statistical test.

```r
trial2 %>%
  tbl_summary(by = trt) %>%
  add_p()
```

| Characteristic | Drug A, N = 98¹ | Drug B, N = 102¹ | p-value² |
|---|---|---|---|
| Age | 46 (37, 59) | 48 (39, 56) | 0.7 |
| Unknown | 7 | 4 | |
| Grade | | | 0.9 |
| I | 35 (36%) | 33 (32%) | |
| II | 32 (33%) | 36 (35%) | |
| III | 31 (32%) | 33 (32%) | |

¹ Median (IQR); n (%)

² Wilcoxon rank sum test; Pearson's Chi-squared test

## Customize Output

There are four primary ways to customize the output of the summary table.

1. Use tbl_summary() function arguments
2. Add additional data/information to a summary table with add_*() functions
3. Modify summary table appearance with the {gtsummary} functions
4. Modify table appearance with {gt} package functions

### Modifying tbl_summary() function arguments

The tbl_summary() function includes many input options for modifying the appearance.

| Argument | Description |
|---|---|
| label= | specify the variable labels printed in table |
| type= | specify the variable type (e.g. continuous, categorical, etc.) |
| statistic= | change the summary statistics presented |
| digits= | number of digits the summary statistics will be rounded to |
| missing= | whether to display a row with the number of missing observations |
| missing_text= | text label for the missing number row |
| sort= | change the sorting of categorical levels by frequency |
| percent= | print column, row, or cell percentages |
| include= | list of variables to include in summary table |

Example modifying tbl_summary() arguments.
```r
trial2 %>%
  tbl_summary(
    by = trt,
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n} / {N} ({p}%)"
    ),
    digits = all_continuous() ~ 2,
    label = grade ~ "Tumor Grade",
    missing_text = "(Missing)"
  )
```

| Characteristic | Drug A, N = 98¹ | Drug B, N = 102¹ |
|---|---|---|
| Age | 47.01 (14.71) | 47.45 (14.01) |
| (Missing) | 7 | 4 |
| Tumor Grade | | |
| I | 35 / 98 (36%) | 33 / 102 (32%) |
| II | 32 / 98 (33%) | 36 / 102 (35%) |
| III | 31 / 98 (32%) | 33 / 102 (32%) |

¹ Mean (SD); n / N (%)

There are multiple ways to specify the statistic= argument: a single formula, a list of formulas, or a named list. The following table shows equivalent ways to specify the mean statistic for continuous variables age and marker. Any {gtsummary} function argument that accepts formulas will accept each of these variations.

| Select with Helpers | Select by Variable Name | Select with Named List |
|---|---|---|
| all_continuous() ~ "{mean}" | c("age", "marker") ~ "{mean}" | list(age = "{mean}", marker = "{mean}") |
| list(all_continuous() ~ "{mean}") | c(age, marker) ~ "{mean}" | |
| | list(c(age, marker) ~ "{mean}") | |

### {gtsummary} functions to add information

The {gtsummary} package has functions for adding information or statistics to tbl_summary() tables.

| Function | Description |
|---|---|
| add_p() | add p-values to the output comparing values across groups |
| add_overall() | add a column with overall summary statistics |
| add_n() | add a column with N (or N missing) for each variable |
| add_difference() | add column for difference between two groups, confidence interval, and p-value |
| add_stat_label() | add label for the summary statistics shown in each row |
| add_stat() | generic function to add a column with user-defined values |
| add_q() | add a column of q-values to control for multiple comparisons |

### {gtsummary} functions to format table

The {gtsummary} package comes with functions specifically made to modify and format summary tables.

| Function | Description |
|---|---|
| modify_header() | update column headers |
| modify_footnote() | update column footnote |
| modify_spanning_header() | update spanning headers |
| modify_caption() | update table caption/title |
| bold_labels() | bold variable labels |
| bold_levels() | bold variable levels |
| italicize_labels() | italicize variable labels |
| italicize_levels() | italicize variable levels |
| bold_p() | bold significant p-values |

Example adding tbl_summary()-family functions

```r
trial2 %>%
  tbl_summary(by = trt) %>%
  add_p(pvalue_fun = ~ style_pvalue(.x, digits = 2)) %>%
  add_overall() %>%
  add_n() %>%
  modify_header(label ~ "**Variable**") %>%
  modify_spanning_header(c("stat_1", "stat_2") ~ "**Treatment Received**") %>%
  modify_footnote(
    all_stat_cols() ~ "Median (IQR) or Frequency (%)"
  ) %>%
  modify_caption("**Table 1. Patient Characteristics**") %>%
  bold_labels()
```

**Table 1. Patient Characteristics**

| Variable | N | Overall, N = 200¹ | Treatment Received: Drug A, N = 98¹ | Treatment Received: Drug B, N = 102¹ | p-value² |
|---|---|---|---|---|---|
| **Age** | 189 | 47 (38, 57) | 46 (37, 59) | 48 (39, 56) | 0.72 |
| Unknown | | 11 | 7 | 4 | |
| **Grade** | 200 | | | | 0.87 |
| I | | 68 (34%) | 35 (36%) | 33 (32%) | |
| II | | 68 (34%) | 32 (33%) | 36 (35%) | |
| III | | 64 (32%) | 31 (32%) | 33 (32%) | |

¹ Median (IQR) or Frequency (%)

² Wilcoxon rank sum test; Pearson's Chi-squared test

### {gt} functions to format table

The {gt} package is packed with many great functions for modifying table output, too many to list here. Review the package’s website for a full listing.

To use the {gt} package functions with {gtsummary} tables, the summary table must first be converted into a gt object. To this end, use the as_gt() function after modifications have been completed with {gtsummary} functions.

```r
trial2 %>%
  tbl_summary(by = trt, missing = "no") %>%
  add_n() %>%
  as_gt() %>%
  gt::tab_source_note(gt::md("*This data is simulated*"))
```

| Characteristic | N | Drug A, N = 98¹ | Drug B, N = 102¹ |
|---|---|---|---|
| Age | 189 | 46 (37, 59) | 48 (39, 56) |
| Grade | 200 | | |
| I | | 35 (36%) | 33 (32%) |
| II | | 32 (33%) | 36 (35%) |
| III | | 31 (32%) | 33 (32%) |

*This data is simulated*

¹ Median (IQR); n (%)

## Select Helpers

There is flexibility in how you select variables for {gtsummary} arguments, which allows for many customization opportunities! For example, if you want to show age and the marker levels to one decimal place in tbl_summary(), you can pass digits = c(age, marker) ~ 1. The selecting input is flexible, and you may also pass quoted column names.

Going beyond typing out specific variables in your data set, you can use:

1. All {tidyselect} helpers available throughout the tidyverse, such as starts_with(), contains(), and everything() (i.e. anything you can use with the dplyr::select() function).
2. Additional {gtsummary} selectors that are included in the package to supplement tidyselect functions.

   • Summary type: There are two primary ways to select variables by their summary type, all_continuous() and all_categorical(). This is useful, for example, when you wish to report the mean and standard deviation for all continuous variables: statistic = all_continuous() ~ "{mean} ({sd})". Dichotomous variables are, by default, included with all_categorical().

## Multi-line Continuous Summaries

Continuous variables may also be summarized on multiple lines, a common format in some journals. To summarize the continuous variables on multiple lines, update the summary type to "continuous2" (for summaries on two or more lines).

```r
trial2 %>%
  select(age, trt) %>%
  tbl_summary(
    by = trt,
    type = all_continuous() ~ "continuous2",
    statistic = all_continuous() ~ c(
      "{N_nonmiss}",
      "{median} ({p25}, {p75})",
      "{min}, {max}"
    ),
    missing = "no"
  ) %>%
  add_p(pvalue_fun = ~ style_pvalue(.x, digits = 2))
```

| Characteristic | Drug A, N = 98 | Drug B, N = 102 | p-value¹ |
|---|---|---|---|
| Age | | | 0.72 |
| N | 91 | 98 | |
| Median (IQR) | 46 (37, 59) | 48 (39, 56) | |
| Range | 6, 78 | 9, 83 | |

¹ Wilcoxon rank sum test

## Advanced Customization

The information in this section applies to all {gtsummary} objects. The {gtsummary} table has two important internal objects:

| Internal Object | Description |
|---|---|
| .$table_body | data frame that is printed as the gtsummary output table |
| .$table_styling | contains instructions for styling .$table_body when printed |
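
As a rough illustration of these two internal objects, the sketch below builds a small table and looks at each piece. The object name `tbl` is only illustrative, and the exact columns and list elements may differ between {gtsummary} versions.

```r
library(gtsummary)

# build a summary table from a few columns of the trial data
tbl <- tbl_summary(trial[c("trt", "age", "grade")])

# the data frame that is printed as the table
class(tbl$table_body)
names(tbl$table_body)

# the styling instructions are stored in a named list
names(tbl$table_styling)
```

Inspecting these objects is mainly useful for understanding what will be printed; the modify_*() functions are the intended way to change them.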

When you print output from the tbl_summary() function into the R console or into an R markdown document, the .$table_body data frame is formatted using the instructions listed in .$table_styling. The default printer converts the {gtsummary} object to a {gt} object with as_gt() via a sequence of {gt} commands executed on .$table_body. Here’s an example of the first few calls saved with tbl_summary():

```r
tbl_summary(trial2) %>%
  as_gt(return_calls = TRUE) %>%
  head(n = 4)
#> $gt
#> gt::gt(data = x$table_body, groupname_col = NULL, caption = NULL)
#>
#> $fmt_missing
#> $fmt_missing[[1]]
#> gt::sub_missing(columns = gt::everything(), missing_text = "")
#>
#>
#> $cols_align
#> $cols_align[[1]]
#> gt::cols_align(columns = c("variable", "var_type", "var_label",
#>     "row_type", "stat_0"), align = "center")
#>
#> $cols_align[[2]]
#> gt::cols_align(columns = "label", align = "left")
#>
#>
#> $indent
#> $indent[[1]]
#> gt::text_transform(locations = gt::cells_body(columns = "label",
#>     rows = c(2L, 3L, 5L, 7L, 8L, 9L)), fn = function(x) paste0("    ",
#>     x))
```

The {gt} functions are called in the order they appear, beginning with gt::gt().
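
To see which calls are saved for a given table, one option is to list the names of the returned calls. This is a small sketch: the object names `tbl` and `gt_calls` are illustrative, and the full set of call names may vary by package version.

```r
library(gtsummary)

tbl <- tbl_summary(trial[c("trt", "age", "grade")])

# return the unevaluated {gt} calls instead of a rendered gt table
gt_calls <- as_gt(tbl, return_calls = TRUE)

# names of the saved calls, in the order they would be run
names(gt_calls)
```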

If the user does not want a specific {gt} function to run (i.e. would like to change default printing), any {gt} call can be excluded with the include= argument of as_gt(). In the example below, the cols_align calls are excluded, so the default {gt} alignment is restored.

After the as_gt() function is run, additional formatting may be added to the table using {gt} functions. In the example below, a source note is added to the table.

```r
tbl_summary(trial2, by = trt) %>%
  as_gt(include = -cols_align) %>%
  gt::tab_source_note(gt::md("*This data is simulated*"))
```

| Characteristic | Drug A, N = 98¹ | Drug B, N = 102¹ |
|---|---|---|
| Age | 46 (37, 59) | 48 (39, 56) |
| Unknown | 7 | 4 |
| Grade | | |
| I | 35 (36%) | 33 (32%) |
| II | 32 (33%) | 36 (35%) |
| III | 31 (32%) | 33 (32%) |

*This data is simulated*

¹ Median (IQR); n (%)

## Set Default Options with Themes

The {gtsummary} tbl_summary() function and the related functions have sensible defaults for rounding and presenting results. If you would like to change these defaults, you can do so with the {gtsummary} themes function set_gtsummary_theme(). The package includes prespecified themes, and you can also create your own. Themes can control baseline behavior, for example, how p-values and percentages are rounded, which statistics are presented in tbl_summary(), default statistical tests in add_p(), etc.

For details on creating a theme and setting personal defaults, review the themes vignette.
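
As a brief sketch of how a theme changes the defaults, the example below applies one of the prespecified themes, builds a table, and then restores the package defaults. It assumes the theme_gtsummary_mean_sd() and reset_gtsummary_theme() functions are available in your installed version; the object name `tbl_theme` is illustrative.

```r
library(gtsummary)

# apply a prespecified theme that reports mean (SD) for continuous variables
theme_gtsummary_mean_sd()

# tables created while the theme is active pick up the new defaults
tbl_theme <- tbl_summary(trial[c("trt", "age", "grade")])

# restore the package defaults so later tables are unaffected
reset_gtsummary_theme()
```

Resetting the theme at the end keeps the change from leaking into other tables created in the same session.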

## Survey Data

The {gtsummary} package also supports survey data (objects created with the {survey} package) via the tbl_svysummary() function. The syntax of tbl_svysummary() is nearly identical to that of tbl_summary(), and the examples above apply to survey summaries as well.

To begin, install the {survey} package and load the apiclus1 data set.

```r
# install.packages("survey")
# loading the api data set
data(api, package = "survey")
```

Before we begin, we convert the data frame to a survey object, registering the ID and weighting columns, and setting the finite population correction column.

```r
svy_apiclus1 <-
  survey::svydesign(
    id = ~dnum,
    weights = ~pw,
    data = apiclus1,
    fpc = ~fpc
  )
```

After creating the survey object, we can now summarize it similarly to a standard data frame using tbl_svysummary(). Like tbl_summary(), tbl_svysummary() accepts the by= argument and works with the add_p() and add_overall() functions.

It is not possible to pass custom functions to the statistic= argument of tbl_svysummary(). You must use one of the pre-defined summary statistic functions (e.g. {mean}, {median}) which leverage functions from the {survey} package to calculate weighted statistics.

```r
svy_apiclus1 %>%
  tbl_svysummary(
    # stratify summary statistics by the "both" column
    by = both,
    # summarize a subset of the columns
    include = c(api00, api99, both),
    label = list(
      api00 ~ "API in 2000",
      api99 ~ "API in 1999"
    )
  ) %>%
  add_p() %>% # comparing values by "both" column
  add_overall() %>% # adding the Overall column shown in the table below
  modify_spanning_header(c("stat_1", "stat_2") ~ "**Met Both Targets**")
```

| Characteristic | Overall, N = 6,194¹ | Met Both Targets: No, N = 1,692¹ | Met Both Targets: Yes, N = 4,502¹ | p-value² |
|---|---|---|---|---|
| API in 2000 | 652 (552, 718) | 631 (556, 710) | 654 (551, 722) | 0.4 |
| API in 1999 | 615 (512, 691) | 632 (548, 698) | 611 (497, 686) | 0.2 |

¹ Median (IQR)

² Wilcoxon rank-sum test for complex survey samples
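
Returning to the note on the statistic= argument, the following sketch requests the pre-defined {mean} and {sd} statistics for the continuous variables of a survey object. It assumes the {survey} package is installed; the object names `svy_design` and `tbl_svy` are illustrative and not part of either package.

```r
library(gtsummary)

# load the api data and declare the survey design, as in the text above
data(api, package = "survey")
svy_design <- survey::svydesign(
  id = ~dnum,
  weights = ~pw,
  data = apiclus1,
  fpc = ~fpc
)

# weighted mean (SD) using the pre-defined statistic names
tbl_svy <- svy_design %>%
  tbl_svysummary(
    include = c(api00, api99),
    statistic = all_continuous() ~ "{mean} ({sd})"
  )
```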

tbl_svysummary() can also handle weighted survey data where each row represents several individuals:

```r
Titanic %>%
  as_tibble() %>%
  survey::svydesign(data = ., ids = ~1, weights = ~n) %>%
  tbl_svysummary(include = c(Age, Survived))
```

| Characteristic | N = 2,201¹ |
|---|---|
| Age | |
| Child | 109 (5.0%) |
| Survived | 711 (32%) |

¹ n (%)
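
To see why the weights are needed here, it can help to look at the shape of the Titanic data in base R. The sketch below uses as.data.frame(), which names the count column Freq, whereas as_tibble() names it n; the object name `titanic_df` is illustrative.

```r
# Titanic ships with base R as a 4-dimensional contingency table
titanic_df <- as.data.frame(Titanic)

# one row per combination of Class, Sex, Age, and Survived
nrow(titanic_df)
#> [1] 32

# the counts sum to the number of people, matching N in the table above
sum(titanic_df$Freq)
#> [1] 2201

# the weighted count of children matches the Child row
sum(titanic_df$Freq[titanic_df$Age == "Child"])
#> [1] 109
```

Each of the 32 rows stands for many passengers, so passing the count column as the weight lets tbl_svysummary() report totals for individuals rather than for rows.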

## Cross Tables

Use tbl_cross() to compare two categorical variables in your data. tbl_cross() is a wrapper for tbl_summary() that:

• Uses percent = "cell" by default.
• Adds row and column margin totals (customizable through the margin argument).
• Displays missing data in both row and column variables (customizable through the missing argument).

```r
trial %>%
  tbl_cross(
    row = stage,
    col = trt,
    percent = "cell"
  ) %>%
  add_p()
```

| | Chemotherapy Treatment: Drug A | Chemotherapy Treatment: Drug B | Total | p-value¹ |
|---|---|---|---|---|
| T Stage | | | | 0.9 |
| T1 | 28 (14%) | 25 (12%) | 53 (26%) | |
| T2 | 25 (12%) | 29 (14%) | 54 (27%) | |
| T3 | 22 (11%) | 21 (10%) | 43 (22%) | |
| T4 | 23 (12%) | 27 (14%) | 50 (25%) | |
| Total | 98 (49%) | 102 (51%) | 200 (100%) | |

¹ Pearson's Chi-squared test
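
The percent, margin, and missing arguments mentioned above can be combined. The sketch below is one possible variation, not a recommended default: it requests row percentages, a single margin, and no missing-data rows or columns. The object name `cross_tbl` is illustrative.

```r
library(gtsummary)

cross_tbl <- trial %>%
  tbl_cross(
    row = stage,
    col = trt,
    percent = "row",  # row percentages instead of cell percentages
    margin = "row",   # request only one of the two margins
    missing = "no"    # do not display missing data
  )
```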