# Chapter 7 Creating Fake Data

Fake data are generated by sampling from one of R’s random sampling functions. These functions sample from different distributions including

1. uniform – function runif(n, min=0, max=1), which samples n continous values between min and max.
2. normal (Gaussian) – function rnorm(n, mean=0, sd=1), which samples n continous values from a distribution with the specified mean and standard deviation. The default is the “standard” normal distribution.
3. poisson – function rpois(n, lambda), which samples n counts from a distribution with mean and variance equal to lambda.
4. negative binomial – rnegbin(n, mu=n, theta), which samples n counts with mean mu and variance mu + mu^2/theta.

### 7.0.1 Continuous X (fake observational data)

A very simple simulation of observational design (the $$X$$ are not at “controlled” levels)

n <- 25
# the paramters
beta_0 <- 25 # the true intercept
beta_1 <- 3.4 # the true slope
sigma <- 2 # the true standard deviation

x <- rnorm(n)
y <- beta_0 + beta_1*x + rnorm(n, sd=sigma)
qplot(x, y) How well does a model fit to the data recover the true parameters?

fit <- lm(y ~ x)
knitr::kable(coefficients(summary(fit)), digits=c(1, 2, 1, 4))
Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.8 0.34 75.8 0
x 4.2 0.40 10.4 0

The coefficient of $$x$$ is the “Estimate”. How close is the estimate? Run the simulation several times to look at the variation in the estimate – this will give you a sense of the uncertainty. Increase $$n$$ and explore this uncertainty. Increase all the way up to $$n=10^5$$. Commenting out the qplot line will make this exploration easier.

### 7.0.2 Categorical X (fake experimental data)

Similar to above but the $$X$$ are at controlled levels and so this simulates an experimental design

n <- 5 # the sample size per treatment level

fake_data <- data.table(Treatment=rep(c("control", "treated"), each=n))
beta_0 <- 10.5 # mean of untreated
beta_1 <- 2.1 # difference in means (treated - untreated)
sigma <- 3 # the error standard deviation
# the Y variable ("Response") is a function of treatment. We use some matrix
# algebra to get this done.
# Turn the Treatment assignment into a model matrix. Take a peak at X!
X <- model.matrix(~ Treatment, fake_data)
# to make the math easier the coefficients are collected into a vector
beta <- c(beta_0, beta_1)
# you will see the formula Y=Xb many times. Here it is coded in R
fake_data[, Response:=X%*%beta + rnorm(n, sd=sigma)]
# plot it with a strip chart (often called a "dot plot")
ggstripchart(data=fake_data, x="Treatment", y="Response", add = c("mean_se")) # fit using base R linear model function
fit <- lm(Response ~ Treatment, data=fake_data)
# display a pretty table of the coefficients
knitr::kable(coefficients(summary(fit)), digits=3)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.626 1.097 10.601 0.000
Treatmenttreated 2.100 1.551 1.354 0.213

Check that the intercept is close to beta_0 and the coefficient for Treatment is close to beta_1. This coefficient is the difference in means between the treatment levels. It is the simulated effect. Again, change $$n$$. Good values are $$n=20$$ and $$n=100$$. Again, comment out the plot line to make exploration more efficient.

### 7.0.3 Correlated X (fake observational data)

#### 7.0.3.1 Generating correlated X variables

It’s useful to think about how correlated data are generated because often we want to generate fake_data with an expected correlation. Let’s say we want to generate two X variables that have an expected correlation of 0.6. To generate this, we take advantage of the fact that two variables, X1 and X2, are correlated if they share a “common cause” – a variable Z that effects (or “causes”) both X1 and X2. If the expected variances of X1, X2, and Z are all 1, then the expected correlation between X1 and X2 is the product of the causal effect from Z to each X. The easiest way to implement this is to simply make the effect from Z to both X equal to $$\sqrt(0.6)$$.

n <- 10^3
z <- rnorm(n) # the common cause, with sigma = 1
rho <- 0.6 # the true correlation between X1 and X2
beta_z <- sqrt(rho) # the easiest way to get effects of z on X1 and X2 that generates rho
sigma_x <- sqrt(1 - rho) # we will make the variance of X1 and X2 = 1, so the "explained" variance in X is beta_z^2 = rho so this is the sqrt of the unexplained variance
x1 <- beta_z*z + rnorm(n, sd=sigma_x)
x2 <- beta_z*z + rnorm(n, sd=sigma_x)
# check
cov(data.frame(X1=x1, X2=x2)) # is the diagonal close to 1, 1?
##           X1        X2
## X1 1.0523598 0.5978408
## X2 0.5978408 0.9713292
cor(x1, x2) # is the value close to rho?
##  0.5913168
qplot(x1, x2) Now create fake $$Y$$ that is a function of both $$X_1$$ and $$X_2$$. Create “standardized” fake data, where $$\sigma_Y = 1$$.

beta_0 <- 3.2
beta_1 <- 0.7
beta_2 <- -0.3
explained_sigma <- beta_1^2 + beta_2^2 + 2*beta_1*beta_2*rho # Wright's rules! Compare to Trig!
sigma_Y.X <- sqrt(1 - explained_sigma) # sqrt unexplained variance
y <- beta_0 + beta_1*x1 + beta_2*x2 + rnorm(n, sd=sigma_Y.X)

# check
var(y) # should be close to 1 as n gets bigger
##  0.9791502
# check parameters
coef(summary(lm(y ~ x1 + x2))) # should be near parameters as n gets bigger
##               Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)  3.1887712 0.02536483 125.716255 0.000000e+00
## x1           0.6885565 0.03064195  22.471043 8.422461e-91
## x2          -0.3037457 0.03189446  -9.523465 1.223898e-20

Note that the variance of $$Y$$ is the variance of the explained part due to X1 and X2 and the unexplained part and if the expected variance of $$Y=1$$ then this sets an upper limit for the explained part. This means that

$\begin{equation} \beta_1^2 + \beta_2^2 + 2 \beta_1 \beta_2 \rho < 1 \end{equation}$

which means the magnitude of $$\beta_1$$ and $$\beta_2$$ should generally be less than 1.

#### 7.0.3.2 Creating mulitple X variables using the package mvtnorm

The package mvtnorm provides a function to generate multivariate data (multiple columns) with a specified vector of means (the means of each column) and covariance matrix among the means.

rcov1 <- function(p){
# p is the number of columns or number of variables
pp <- p*(p-1)/2 # number of elements in lower tri
max_r <- 0.7
r <- rexp(pp)
r <- r*max_r/max(r)

# create correlation matrix
R <- matrix(1, nrow=p, ncol=p)
R[lower.tri(R, diag=FALSE)] <- r
R <- t(R)
R[lower.tri(R, diag=FALSE)] <- r

# convert to covariance matrix
L <- diag(sqrt(rexp(p))) # standard deviations
S <- L%*%R%*%L

# check -- these should be the same
# R
# cov2cor(S)
return(S)
}

Now let’s use mvtnorm to generate fake correlated X

p <- 5 # number of X variables
S <- rcov1(p)

# make the fake X
n <- 10^5
mu <- runif(p, min=10, max=100) # vecctor of p means
X <- rmvnorm(n, mean=mu, sigma=S)

# how close? (check the cor as this is easier to scan)
cov2cor(S)
##           [,1]       [,2]       [,3]       [,4]       [,5]
## [1,] 1.0000000 0.15511020 0.11863165 0.48788949 0.14545246
## [2,] 0.1551102 1.00000000 0.70000000 0.03514487 0.04345391
## [3,] 0.1186317 0.70000000 1.00000000 0.05346376 0.54083807
## [4,] 0.4878895 0.03514487 0.05346376 1.00000000 0.44504979
## [5,] 0.1454525 0.04345391 0.54083807 0.44504979 1.00000000
cor(X)
##           [,1]       [,2]       [,3]       [,4]       [,5]
## [1,] 1.0000000 0.15618693 0.12146387 0.48987841 0.14569745
## [2,] 0.1561869 1.00000000 0.69709529 0.03879653 0.03961509
## [3,] 0.1214639 0.69709529 1.00000000 0.05598629 0.54138395
## [4,] 0.4898784 0.03879653 0.05598629 1.00000000 0.44214983
## [5,] 0.1456974 0.03961509 0.54138395 0.44214983 1.00000000

#### 7.0.3.3 The rcov1 algorithm is naive

A problem with generating a fake covariance matrix as above is that it is likely to be singular as $$p$$ get’s bigger. A singular covariance matrix is one where there are fewer orthogonal axes of variation then there are variables. Imagine a multidimensional scatterplot of a data set with the fake covariance matrix. If we zoom around this multidimensional space, we will come across a “view” in which all the points are compressed along a single line – that is there is no variation on the axis orthogonal to this line of points. This is bad, as it means we cannot fit a linear model using least squares (because the inverse of the covariance matrix doesn’t exist).

Let’s explore this. If the a covariance matrix is singular, then at least one eigenvalue of the matrix is negative (eigenvalues is a multivariate term beyond the scope of this text but, effectively, these are the variances of the orthogonal axes referred to above). Here I compute the frequency of covariance matrices with at least one negative eigenvalue as $$p$$ increases

niter <- 1000
p_vec <- 3:10
counts <- numeric(length(p_vec))
i <- 0
for(p in 3:10){
i <- i+1
for(iter in 1:niter){
counts[i] <- ifelse(eigen(rcov1(p))$values[p] < 0, counts[i]+1, counts[i]) } } data.table(p=p_vec, freq=counts/niter) ## p freq ## 1: 3 0.000 ## 2: 4 0.025 ## 3: 5 0.066 ## 4: 6 0.135 ## 5: 7 0.190 ## 6: 8 0.225 ## 7: 9 0.275 ## 8: 10 0.368 #### 7.0.3.4 Generating multiple columns of X variables with a non-singular covariance matrix This section uses some ideas from matrix algebra. The goal is to create a $$n \times p$$ matrix of $$X$$ variables that have some random covariance structure that is full-rank (not singular, or no negative eigenvalues). The algorithm starts with a $$p \times p$$ random eigenvector matrix $$\mathbf{E}$$ and a $$p \times p$$ random eigenvalue matrix $$\mathbf{L}$$ and then computes the random covariance matrix using $$\mathbf{E}\mathbf{L}\mathbf{E}^\top$$ 1. Generate a random $$p \times p$$ random eigenvector matrix from a covariance matrix of $$p \times p$$ matrix of random normal variables. fake.eigenvectors <- function(p){ a <- matrix(rnorm(p*p), p, p) # only orthogonal if p is infinity so need to orthogonalize it a <- t(a)%*%a # this is the sum-of-squares-and-cross-product-matrix E <- eigen(a)$vectors # decompose to truly orthogonal columns
return(E)
}
1. Generate $$p$$ random eigenvalues in descending order and that sum to 1. There are several ways to create this sequence. Here are two:
# force the eigenvalues to descend at a constant rate
fake.eigenvalues <- function(p, m=p, start=1, rate=1){
# m is the number of positive eigenvalues
# start and rate control the decline in the eigenvalue
s <- start/seq(1:m)^rate
s <- c(s, rep(0, p-m)) # add zero eigenvalues
L <- diag(s/sum(s)*m) # rescale so that sum(s)=m and put into matrix,
# which would occur if all the traits are variance standardized
return(L)
}

# random descent
fake.eigenvalues.exp <- function(p, m=p, rate=1){
# exponential distribution of eigenvalues
# m is the number of positive eigenvalues
# start and rate control the decline in the eigenvalue
s <- rexp(m, rate)
s <- s[order(s, decreasing=TRUE)] # re-order into descending order
s <- c(s, rep(0, p-m)) # add zero eigenvalues
L <- diag(s/sum(s)*m) # rescale so that sum(s)=m and put into matrix,
# which would occur if all the traits are variance standardized
return(L)
}
1. Generate the random covariance matrix
fake.cov.matrix <- function(p){
# p is the size of the matrix (number of cols and rows)
E <- fake.eigenvectors(p)
L <- diag(fake.eigenvalues(p))
S <- E%*%L%*%t(E)
return(S)
}
1. Generate the random $$X$$ variables using $$\mathbf{X} = \mathbf{X}' (\mathbf{E}\sqrt{\mathbf{L}})^\top$$
# two functions to compute the random data
fake.X <- function(n,p,E,L){
# n is number of observations
# p is number of variables
X <- matrix(rnorm(n*p),nrow=n,ncol=p) %*% t(E%*%sqrt(L))
return(X)
}

An example

n <- 10^5
p <- 5
E <- fake.eigenvectors(p)
L <- fake.eigenvalues(p, start=1, rate=1)
X <- fake.X(n, p, E, L)
colnames(X) <- paste0("X", 1:p)
E
##            [,1]        [,2]        [,3]        [,4]        [,5]
## [1,]  0.4762049  0.76679336  0.02189702 -0.37886958 -0.20306443
## [2,] -0.3226816  0.26514622 -0.21147219  0.50680583 -0.72387940
## [3,]  0.6343127 -0.07414265  0.40381446  0.65434666  0.03024185
## [4,]  0.2450305 -0.53523178  0.22900376 -0.40906173 -0.65856872
## [5,] -0.4546568  0.22305882  0.85982044 -0.06406746 -0.01166667
E <- eigen(cov(X))\$vectors
scores <- X%*%E
colnames(scores) <- paste0("pc", 1:p)
cor(cbind(X, scores))[1:p, (p+1):(p*2)]
##           pc1         pc2          pc3        pc4          pc5
## X1  0.6381822  0.71536319  0.009665104  0.2581645 -0.119317350
## X2 -0.5701649  0.32632936 -0.213505507 -0.4289443 -0.582102493
## X3  0.8417769 -0.06005797  0.312645510 -0.4358132  0.011149679
## X4  0.4130194 -0.64271492  0.232581875  0.3517727 -0.488358607
## X5 -0.6579685  0.23893819  0.712352851  0.0501983 -0.004418134