# Introduction to the chainbinomial package

The Chain Binomial model for infectious disease spread is especially suitable for modelling of small outbreaks, such as outbreaks in households. This package contains tools for analyzing data using the Chain Binomial model.

The chain Binomial model has a single parameter, which is the the secondary attack rate (SAR). The household secondary attack rate is defined as the probability that an infected household member infects a susceptible household member.

This package contains functions related to the Chain Binomial probability distribution, as well as functions for estimating the SAR parameter and regression modelling relating the SAR to predictive factors. To get started the package need to be loaded as usual:

library(chainbinomial)
require(dplyr)
require(tidyr)

# The Chain Binomial model

The Chain Binomial model is a simple model for the spread of a infectious disease in a closed population, such as a household. An infectious disease is introduced into the household at time point 0 by one or more primary cases, which is denoted as $$I_0$$. The disease then spreads among the remaining $$S_0$$ household members, who are all susceptible to the disease, in discrete time steps called generations. In each generation 0 or more of the remaining susceptibles become infected, and the infected individuals from the previous generation is considered recovered and immune, and does not anymore contribute to the spread of the disease. The number of new infections in generation $$g+1$$ is modeled as a binomial model that depends on the number of infected $$I_g$$ and the number of remaining susceptibles $$S_g$$, in addition to the SAR.

$P(I_{g+1} | I_{g}, S_g. \theta) = \binom{S_g}{I_{g+1}} \pi_{g+1}^{I_{g+1}} (1-\pi_{g+1})^{S_g - I_{g+1}}$

where

$\pi_{g+1} = 1 - (1-\theta)^{I_{g}}$ is the per-person risk for getting infected, and depends on the the secondary attack rate (denoted $$\theta$$) and the number of infected individuals in generation $$g$$. This is the binomial probability of getting at least one “success” in $$I_g$$ binomial trials with probability $$\theta$$. This model is also referred to as the Reed-Frost model.

The point of the chainbinomial package is not to analyze the number of new infections in each generation since that is already possible using the glm function already included in R. A tutorial for doing this can be found by typing vignette('chain_glm', package = "chainbinomial"). Instead, the goal of this package is to analyze the final size of the outbreaks, that is, the final counts of the number of infected in each outbreak.

## The final Chain Binomial count probabilities

The Chain binomial probability of getting an outbreak of size $$I = x$$ in a household of $$S_0$$ initially susceptible individuals, with $$I_0$$ introductory cases is

$P(I = x; S_0, I_0, \theta) = \binom{S_0}{x} P(x,x) (1-\theta)^{S_0 - x + I_0} (1-\theta)^{x(S_0 - x)}$ Notice that this formula is recursive, as it also depends on $$P(x,x)$$, which is given as

$P(I = x, S_0 = x) = 1 - \sum_{j=1}^{x-1} P(I = j, S_0 = x)$ The recursion bottoms out at

$P(I=1, S_0 = 1; I_0) = 1 - P(I=0, S_0 = 1; I_0) = 1 - (1-\theta)^{I_0}$

This formula was derived by Ludwig (1975). The dchainbinom function can be used to compute these probabilities, and it works similarly to other discrete probability mass functions in R such as the dbinom and dpois.

Consider a household with 4 persons. A single household member becomes infected by a contagious disease outside of the household, and the other 3 household members are susceptible to the disease. Assuming a secondary attack rate of 0.23, we can compute the probability that 2 of the 3 susceptible household members becomes infected as follows:

dchainbinom(x = 2, s0 = 3, i0 = 1, sar = 0.23)
#> [1] 0.1840275

We can also compute the entire final size distribution

dchainbinom(x = 0:3, s0 = 3, i0 = 1, sar = 0.23)
#> [1] 0.4565330 0.2425560 0.1840275 0.1168835

Suppose instead that 2 of the 4 household members were infected simultaneously outside of the household. Then $$I_0 = 2$$. We can again compute the final size distribution. Note that the number of initial susceptible household members $$S_0$$ is now 2.

dchainbinom(x = 0:2, s0 = 2, i0 = 2, sar = 0.23)
#> [1] 0.3515304 0.3717092 0.2767604

## The incomplete Chain Binomial count probabilities

Now suppose that we don’t have observed the entire outbreak, but have observed the outbreak for a time corresponding to two generations. There is no simple formula for the probability of an outbreak after a given number of generations, but it is possible to compute it by considering all possible scenarios that lead to the desired number of cases, see Lindstrøm et al. (2024).

The entire probability distribution after 2 generations can be computed using the generations argument. By default the generations argument is Inf, meaning that the outbreak is assumed to be completely observed.

dchainbinom(x = 0:3, s0 = 3, i0 = 1, sar = 0.23, generations = 2)
#> [1] 0.45653300 0.24255598 0.21735536 0.08355566

# Estimating the SAR

The data we will look at comes from a study of of the common cold in 66 families, all of which consisted of a mother, father, and three children (Brimblecombe et al., 1958). In total 664 outbreaks were recorded in these families over a period of 1 and a half year. The data was analyzed by Heasman and Reid in a 1961 paper, where each infection in an outbreak was classified according to who the index case was. The data is included in the package as heasman_reid_1961_intro_case_status and contains the data from Table II in the 1961 paper.

Lets take a look at the data:

heasman_reid_1961_intro_case_status
#>   furter_cases father mother school_child pre_school_child
#> 1            0     53     75          148              147
#> 2            1     31     25           77               66
#> 3            2      4      4           22                9
#> 4            3      0      1            2                0
#> 5            4      0      0            0                0

The table counts the number of outbreaks that falls into each category (type of index case by number of infected). For analysis we need to make the data into a suitable long format, with one row for each outbreak.

library(dplyr)
library(tidyr)

heasman_reid_1961_intro_case_status %>%
pivot_longer(cols = -1,
names_to = 'intro_case',
values_to = 'N') %>%
uncount(weights = N) -> intro_case_status_long

#> # A tibble: 6 × 2
#>   furter_cases intro_case
#>          <int> <chr>
#> 1            0 father
#> 2            0 father
#> 3            0 father
#> 4            0 father
#> 5            0 father
#> 6            0 father

For the purpose of illustration, we will only estimate the SAR for outbreaks where the fathers were the index case. We can estimate the SAR using the estimate_sar function. We need to give it the number of infected and s0 as input. s0 will be 4 in this case, since number of index cases is always 1 and all families are of size 5. We can also give the arguments i0 and generations as in the dchainbinom function, but the default values are the correct ones in this case (i0 = 1 and generations = Inf).

intro_case_status_long %>%
filter(intro_case == 'father') -> intro_case_status_long_fathers

sar_est <- estimate_sar(infected = intro_case_status_long_fathers$furter_cases, s0 = 4) sar_est$sar_hat
#> [1] 0.08412651

Now let us take a look at the point estimate:

sar_est$sar_hat #> [1] 0.08412651 We can also compute the confidence intervals: confint(sar_est) #> 2.5 % 97.5 % #> 0.0582205 0.1160180 ## Regression model With an estimate of the SAR in families where the primary case was the father, a natural question would be what the SAR is when other family members are the primary case. The cbmod function let us do a regression analysis similar to the glm function, with predictors for the SAR. The cbmod function does not implement the formula interface as the glm function does, but you can use the model.matrix function instead. Note that s0, i0, and generations should not be thought of as predictor variables in the traditional sense and should in general not be included in the X matrix. You can also specify the link function to be used. Here we specify the identity link, which gives coefficients that are easy to interpret. Other options are log, logit, and cloglog. The identity link might not be suitable if there are more than one predictor or the predictor is numerical instead of categorical. xmat <- model.matrix(~ intro_case, data = intro_case_status_long) cbmod_res <- cbmod(y = intro_case_status_long$furter_cases,
s0 = rep(4, nrow(intro_case_status_long)),
x = xmat,
i0 = 1,

summary(cbmod_res)
#> Chain Binomial model with identity link.
#> Model successfully fitted in 2.37 seconds
#>
#> Model log-likelihood:            -582.5
#> Null log-likelihood:             -584.9
#> Chisq (df = 3):                   4.808
#> p-value:                          0.186
#>
#> Coefficients:
#>                            Estimate Std. Error  P-value
#> (Intercept)                   0.084     0.013     0.000
#> intro_casemother             -0.015     0.017     0.388
#> intro_casepre_school_child   -0.010     0.015     0.522
#> intro_caseschool_child        0.011     0.015     0.480

The output from the summary function gives you the coefficients and the associated standard errors and p-values of the null hypothesis that the coefficient is 0. Above the table of coefficients there are also an omnibus test of the entire model that tests if the model has better fit than a model with intercept-only (the null model).

Confidence intervals for all the coefficients can also be computed:

confint(cbmod_res)
#>                                  2.5 %     97.5 %
#> (Intercept)                 0.05886205 0.10939728
#> intro_casemother           -0.04812925 0.01871199
#> intro_casepre_school_child -0.03919720 0.01990321
#> intro_caseschool_child     -0.01903459 0.04045655

predict, vcov, coef, tidy, and glance methods are also available for cbmod objects.

# References

• Ludwig, D. (1975) Final Size Distributions for Epidemics. Mathematical Biosciences, 23, 33-46. https://doi.org/10.1016/0025-5564(75)90119-4

• Lindstrøm JC, et al (2024) Estimating the household secondary attack rate with the Incomplete Chain Binomial model. https://doi.org/10.48550/arXiv.2403.03948

• Brimblecombe et al. (1958) Family Studies of Respiratory Infections

• Heasman, M.A. and Reid, D.D. (1961) Theory and Observation in Family Epidemics of the Common Cold