Often, we’d like to explore data generation and modeling under
different scenarios. For example, we might want to understand the
operating characteristics of a model given different variance or other
parametric assumptions. There is functionality built into `simstudy`
to facilitate this type of dynamic exploration. First, the functions
`updateDef` and `updateDefAdd` essentially allow us to edit lines of
existing data definition tables. Second, there is a built-in mechanism,
called *double-dot* reference, to access external variables that do not
exist in a defined data set or data definition.

The `updateDef` function updates a row in a definition table created
by the functions `defData` or `defRead`. Analogously, the
`updateDefAdd` function updates a row in a definition table created by
the functions `defDataAdd` or `defReadAdd`.

The original data set definition includes three variables `x`, `y`,
and `z`, all normally distributed:

```
library(simstudy)

defs <- defData(varname = "x", formula = 0, variance = 3, dist = "normal")
defs <- defData(defs, varname = "y", formula = "2 + 3*x", variance = 1, dist = "normal")
defs <- defData(defs, varname = "z", formula = "4 + 3*x - 2*y", variance = 1, dist = "normal")

defs
```

```
## varname formula variance dist link
## 1: x 0 3 normal identity
## 2: y 2 + 3*x 1 normal identity
## 3: z 4 + 3*x - 2*y 1 normal identity
```

In the first case, we are changing the relationship of `y` with `x`
as well as the variance:

```
defs <- updateDef(dtDefs = defs, changevar = "y", newformula = "x + 5", newvariance = 2)
defs
```

```
## varname formula variance dist link
## 1: x 0 3 normal identity
## 2: y x + 5 2 normal identity
## 3: z 4 + 3*x - 2*y 1 normal identity
```

In this second case, we are changing the distribution of `z` to
*Poisson* and updating the link function to *log*:

```
defs <- updateDef(dtDefs = defs, changevar = "z", newdist = "poisson", newlink = "log")
defs
```

```
## varname formula variance dist link
## 1: x 0 3 normal identity
## 2: y x + 5 2 normal identity
## 3: z 4 + 3*x - 2*y 1 poisson log
```

And in the last case, we remove a variable from a data set
definition. Note that for a definition created by `defData`, it is not
possible to remove a variable that is a predictor of a subsequent
variable, such as `x` or `y` in this case.

```
defs <- updateDef(dtDefs = defs, changevar = "z", remove = TRUE)
defs
```

```
## varname formula variance dist link
## 1: x 0 3 normal identity
## 2: y x + 5 2 normal identity
```

For a truly dynamic data definition process, `simstudy` (as of
`version 0.2.0`) allows users to reference variables
that exist outside of data generation. These can be thought of as a type
of hyperparameter of the data generation process. The reference is made
directly in the formula itself, using a double-dot (“..”) notation
before the variable name. Here is a simple example:

```
def <- defData(varname = "x", formula = 0,
  variance = 5, dist = "normal")
def <- defData(def, varname = "y", formula = "..B0 + ..B1 * x",
  variance = "..sigma2", dist = "normal")

def
```

```
## varname formula variance dist link
## 1: x 0 5 normal identity
## 2: y ..B0 + ..B1 * x ..sigma2 normal identity
```

```
B0 <- 4
B1 <- 2
sigma2 <- 9

set.seed(716251)
dd <- genData(100, def)

fit <- summary(lm(y ~ x, data = dd))

coef(fit)
```

```
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.00 0.284 14.1 2.56e-25
## x 2.01 0.130 15.4 5.90e-28
```

`fit$sigma`

`## [1] 2.83`

It is easy to create a new data set on the fly with a different variance assumption without having to go to the trouble of updating the data definitions.

```
sigma2 <- 16

dd <- genData(100, def)
fit <- summary(lm(y ~ x, data = dd))

coef(fit)
```

```
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.35 0.427 10.19 4.57e-17
## x 2.12 0.218 9.75 4.32e-16
```

`fit$sigma`

`## [1] 4.21`

The double-dot notation can be flexibly applied using `lapply` (or
the parallel version `mclapply`) to create a range of data sets under
different assumptions:

```
library(data.table) # for rbindlist
library(ggplot2)

sigma2s <- c(1, 2, 6, 9)

# Generate one data set for a given value of sigma2 and record that value
gen_data <- function(sigma2, d) {
  dd <- genData(200, d)
  dd$sigma2 <- sigma2
  dd
}

dd_4 <- lapply(sigma2s, function(s) gen_data(s, def))
dd_4 <- rbindlist(dd_4)

ggplot(data = dd_4, aes(x = x, y = y)) +
  geom_point(size = .5, color = "grey30") +
  facet_wrap(sigma2 ~ .) +
  theme(panel.grid = element_blank())
```

The double-dot notation is also *array-friendly*. For example,
if we want to create a mixture distribution from a vector of values
(which we can also do using a *categorical* distribution), we can
define the mixture formula in terms of the vector. In this case we are
generating permuted block sizes of 2 and 4:

```
defblk <- defData(varname = "blksize",
  formula = "..sizes[1] | .5 + ..sizes[2] | .5", dist = "mixture")

defblk
```

```
## varname formula variance dist link
## 1: blksize ..sizes[1] | .5 + ..sizes[2] | .5 0 mixture identity
```

```
sizes <- c(2, 4)
genData(1000, defblk)
```

```
## id blksize
## 1: 1 4
## 2: 2 4
## 3: 3 4
## 4: 4 2
## 5: 5 4
## ---
## 996: 996 2
## 997: 997 2
## 998: 998 4
## 999: 999 4
## 1000: 1000 4
```

In this second example, there is a vector variable *tau* of
positive real numbers that sum to 1, and we want to calculate the
weighted average of three numbers using *tau* as the weights. We
could use the following code to estimate a weighted average
*theta*:

```
tau <- rgamma(3, 5, 2)
tau <- tau / sum(tau)
tau
```

`## [1] 0.319 0.550 0.132`

```
d <- defData(varname = "a", formula = 3, variance = 4)
d <- defData(d, varname = "b", formula = 8, variance = 2)
d <- defData(d, varname = "c", formula = 11, variance = 6)
d <- defData(d, varname = "theta", formula = "..tau[1]*a + ..tau[2]*b + ..tau[3]*c",
  dist = "nonrandom")

set.seed(1)
genData(4, d)
```

```
## id a b c theta
## 1: 1 1.75 8.47 12.4 6.84
## 2: 2 3.37 6.84 10.3 6.18
## 3: 3 1.33 8.69 14.7 7.13
## 4: 4 6.19 9.04 12.0 8.52
```

We can simplify the calculation of *theta* by using matrix
multiplication:

```
d <- updateDef(d, changevar = "theta", newformula = "t(..tau) %*% c(a, b, c)")

set.seed(1)
genData(4, d)
```

```
## id a b c theta
## 1: 1 1.75 8.47 12.4 6.84
## 2: 2 3.37 6.84 10.3 6.18
## 3: 3 1.33 8.69 14.7 7.13
## 4: 4 6.19 9.04 12.0 8.52
```
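
To see why the two formulas give the same *theta*, the equivalence can be checked for a single record in base R. This is only an illustrative sketch: the weights `w` and the values `vals` are made-up numbers, not the ones generated above.

```r
# Hypothetical weights (summing to 1) and one record's values of a, b, c
w <- c(0.2, 0.5, 0.3)
vals <- c(3, 8, 11)

# Term-by-term weighted sum
theta_sum <- w[1] * vals[1] + w[2] * vals[2] + w[3] * vals[3]

# The same quantity as a (1 x 3) by (3 x 1) matrix product
theta_mat <- as.numeric(t(w) %*% vals)

all.equal(theta_sum, theta_mat)
```

The matrix form scales more gracefully: adding a fourth component only requires a longer weight vector, not a longer formula.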

These arrays can also have **multiple dimensions**, as
in a \(2 \times 2\) matrix. If we want
to specify the mean outcomes for a factorial study design with two
interventions \(a\) and \(b\), we can use a simple matrix and draw
the means directly from the matrix, which in this example is stored in
the variable *effect*:

```
effect <- matrix(c(0, 4, 5, 7), nrow = 2)
effect
```

```
## [,1] [,2]
## [1,] 0 5
## [2,] 4 7
```

Using double-dot notation, it is possible to reference the matrix cell values directly:

```
d1 <- defData(varname = "a", formula = ".5;.5", variance = "1;2", dist = "categorical")
d1 <- defData(d1, varname = "b", formula = ".5;.5", variance = "1;2", dist = "categorical")
d1 <- defData(d1, varname = "outcome", formula = "..effect[a, b]", dist = "nonrandom")
```

```
dx <- genData(8, d1)
dx
```

```
## id a b outcome
## 1: 1 2 2 7
## 2: 2 2 2 7
## 3: 3 2 1 4
## 4: 4 2 1 4
## 5: 5 1 1 0
## 6: 6 2 2 7
## 7: 7 2 1 4
## 8: 8 1 2 5
```
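
The cell lookup can be reproduced in base R for the eight records shown above. This sketch uses two-column matrix indexing, which is one way to pick out one cell per record; it is not necessarily how `simstudy` evaluates the formula internally. The vectors `a` and `b` are copied from the output above.

```r
effect <- matrix(c(0, 4, 5, 7), nrow = 2)

# Arm assignments copied from the eight generated records above
a <- c(2, 2, 2, 2, 1, 2, 2, 1)
b <- c(2, 2, 1, 1, 1, 2, 1, 2)

# Each row of cbind(a, b) selects one (row, column) cell of the matrix
outcome <- effect[cbind(a, b)]
outcome
```

The resulting vector matches the `outcome` column in the generated data: each record receives the mean stored in row `a` and column `b` of `effect`.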

It is possible to generate normally distributed data based on these means:

```
d1 <- updateDef(d1, "outcome", newvariance = 9, newdist = "normal")
dx <- genData(1000, d1)
```

The plot shows the individual values as well as the mean values by intervention arm:
