# Chapter 7 Creating Fake Data

Fake data are generated by sampling from one of R’s random sampling functions. These functions sample from different distributions, including

- uniform – `runif(n, min=0, max=1)`, which samples `n` continuous values between `min` and `max`.
- normal (Gaussian) – `rnorm(n, mean=0, sd=1)`, which samples `n` continuous values from a distribution with the specified mean and standard deviation. The default is the “standard” normal distribution.
- Poisson – `rpois(n, lambda)`, which samples `n` counts from a distribution with mean and variance equal to `lambda`.
- negative binomial – `rnegbin(n, mu=n, theta)` (from the MASS package), which samples `n` counts with mean `mu` and variance `mu + mu^2/theta`.
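As a quick check, each function can be called directly at the console. A minimal sketch (the seed and parameter values here are arbitrary choices for illustration; `set.seed` makes the draws reproducible):

```
set.seed(1)
runif(5, min=0, max=10) # 5 continuous values between 0 and 10
rnorm(5, mean=100, sd=15) # 5 continuous values centered at 100
rpois(5, lambda=4) # 5 counts with mean and variance 4
MASS::rnegbin(5, mu=4, theta=1) # 5 overdispersed counts with mean 4
```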

### 7.0.1 Continuous X (fake observational data)

A very simple simulation of an observational design (the \(X\) are not at “controlled” levels):

```
n <- 25
# the parameters
beta_0 <- 25 # the true intercept
beta_1 <- 3.4 # the true slope
sigma <- 2 # the true standard deviation
x <- rnorm(n)
y <- beta_0 + beta_1*x + rnorm(n, sd=sigma)
qplot(x, y) # qplot() is from the ggplot2 package
```

How well does a model fit to the data recover the true parameters?

```
fit <- lm(y ~ x)
knitr::kable(coefficients(summary(fit)), digits=c(1, 2, 1, 4))
```

| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 25.8 | 0.34 | 75.8 | 0 |
| x | 4.2 | 0.40 | 10.4 | 0 |

The coefficient of \(x\) is the “Estimate”. How close is the estimate? Run the simulation several times to look at the variation in the estimate – this will give you a sense of the uncertainty. Increase \(n\) and explore this uncertainty, all the way up to \(n=10^5\). Commenting out the `qplot` line will make this exploration easier.
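One way to automate this exploration is to wrap the simulation in a function and call it repeatedly. A sketch (the function name `sim_slope` and the 1000 replicates are illustrative choices, not from the text; the parameter defaults match the simulation above):

```
sim_slope <- function(n, beta_0=25, beta_1=3.4, sigma=2) {
  x <- rnorm(n)
  y <- beta_0 + beta_1*x + rnorm(n, sd=sigma)
  coef(lm(y ~ x))["x"] # return only the estimated slope
}
# 1000 estimates of the slope at n=25
slopes <- replicate(1000, sim_slope(n=25))
mean(slopes) # should be near the true slope, 3.4
sd(slopes)   # the spread approximates the standard error
```

Increasing `n` inside `sim_slope` shrinks `sd(slopes)`, which is the uncertainty the text asks you to explore.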

### 7.0.2 Categorical X (fake experimental data)

Similar to the above, but the \(X\) are at controlled levels, so this simulates an experimental design:

```
n <- 5 # the sample size per treatment level
# data.table() is from the data.table package
fake_data <- data.table(Treatment=rep(c("control", "treated"), each=n))
beta_0 <- 10.5 # mean of untreated
beta_1 <- 2.1 # difference in means (treated - untreated)
sigma <- 3 # the error standard deviation
# the Y variable ("Response") is a function of treatment. We use some matrix
# algebra to get this done.
# Turn the Treatment assignment into a model matrix. Take a peek at X!
X <- model.matrix(~ Treatment, fake_data)
# to make the math easier the coefficients are collected into a vector
beta <- c(beta_0, beta_1)
# you will see the formula Y=Xb many times. Here it is coded in R.
# note there are 2*n rows, so we need 2*n random errors
fake_data[, Response:=X%*%beta + rnorm(2*n, sd=sigma)]
# plot it with a strip chart (often called a "dot plot");
# ggstripchart() is from the ggpubr package
ggstripchart(data=fake_data, x="Treatment", y="Response", add = c("mean_se"))
```
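If you take that peek at `X`, its structure explains how `X %*% beta` generates the group means. A minimal, self-contained sketch repeating the setup above:

```
library(data.table)
n <- 5
fake_data <- data.table(Treatment=rep(c("control", "treated"), each=n))
X <- model.matrix(~ Treatment, fake_data)
# column 1 (intercept) is all 1s; column 2 is a 0/1 dummy that is 1
# only for "treated" rows
X
beta <- c(10.5, 2.1)
# each row's expected value: 10.5 for control, 10.5 + 2.1 = 12.6 for treated
X %*% beta
```

Adding `rnorm(2*n, sd=sigma)` to these expected values scatters the fake responses around the two group means.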

```
# fit using base R linear model function
fit <- lm(Response ~ Treatment, data=fake_data)
# display a pretty table of the coefficients
knitr::kable(coefficients(summary(fit)), digits=3)
```

| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 11.626 | 1.097 | 10.601 | 0.000 |
| Treatmenttreated | 2.100 | 1.551 | 1.354 | 0.213 |

Check that the intercept is close to `beta_0` and that the coefficient for Treatment is close to `beta_1`. This coefficient is the difference in means between the treatment levels – it is the simulated effect. Again, change \(n\). Good values are \(n=20\) and \(n=100\). Again, comment out the plot line to make exploration more efficient.
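As with the observational simulation, this exploration can be made systematic by wrapping the whole simulation in a function of \(n\). A sketch (the function name `sim_experiment` and the 1000 replicates are illustrative choices):

```
library(data.table)
sim_experiment <- function(n, beta_0=10.5, beta_1=2.1, sigma=3) {
  fake_data <- data.table(Treatment=rep(c("control", "treated"), each=n))
  X <- model.matrix(~ Treatment, fake_data)
  fake_data[, Response := X %*% c(beta_0, beta_1) + rnorm(2*n, sd=sigma)]
  fit <- lm(Response ~ Treatment, data=fake_data)
  coef(fit)["Treatmenttreated"] # return only the estimated effect
}
# the spread of the effect estimates shrinks as n grows
sd(replicate(1000, sim_experiment(n=5)))
sd(replicate(1000, sim_experiment(n=100)))
```

Comparing the two standard deviations shows directly why \(n=100\) recovers the simulated effect of 2.1 much more reliably than \(n=5\).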