Statistical interactions can be challenging to understand and interpret. This simple example illustrates a regression model with an interaction effect.
First I generate some fake data, so that the underlying structure of the data is known. Let y be some generic response variable (e.g., weight). Each y has a corresponding sex variable (M/F) and age variable (0 through 4). For each age/sex combination I generate five independent observations from a normal distribution with a standard deviation of one. For the M data the mean of the normals increases linearly with age (intercept of zero, slope of one), and for the F observations the mean is held constant at zero. The code for generating the data is below, followed by the first 10 observations.
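set.seed(123)
sd <- 1  # to change the std. deviation if needed
# 25 M observations followed by 25 F observations
sex <- c(rep('M', 25), rep('F', 25))
# ages 0 through 4, five observations each, repeated for both sexes
age <- rep(rep(0:4, each = 5), 2)
# M: mean increases linearly with age (slope of one)
male <- c(rnorm(5, mean = 0, sd = sd), rnorm(5, mean = 1, sd = sd), rnorm(5, mean = 2, sd = sd),
          rnorm(5, mean = 3, sd = sd), rnorm(5, mean = 4, sd = sd))
# F: mean held constant at zero
female <- rnorm(25, mean = 0, sd = sd)
dat <- data.frame(y = c(male, female), age = age, sex = sex)
head(dat, n = 10)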
##              y age sex
## 1  -0.56047565   0   M
## 2  -0.23017749   0   M
## 3   1.55870831   0   M
## 4   0.07050839   0   M
## 5   0.12928774   0   M
## 6   2.71506499   1   M
## 7   1.46091621   1   M
## 8  -0.26506123   1   M
## 9   0.31314715   1   M
## 10  0.55433803   1   M
The plot below shows the data with trend lines. Note that the y-intercept of both lines is near zero, the slope for M is around one, and the slope for F is around zero. This is no surprise, since this is how the data were generated.
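The figure itself is not reproduced here. As a rough sketch of how a similar plot could be drawn in base R: the compact data-generation step is a restatement of the code above, while the colours, plotting symbols, and the names cols and fits are my own assumptions, not the original plotting code.

```r
# Regenerate the data (compact restatement of the generation code above)
set.seed(123)
dat <- data.frame(
  y   = c(rnorm(25, mean = rep(0:4, each = 5), sd = 1),  # M: mean equals age
          rnorm(25, mean = 0, sd = 1)),                   # F: mean zero
  age = rep(rep(0:4, each = 5), 2),
  sex = rep(c("M", "F"), each = 25)
)

# Hypothetical colour choice, one colour per sex
cols <- c(F = "firebrick", M = "steelblue")

# Scatter plot of y against age, coloured by sex
plot(y ~ age, data = dat, col = cols[as.character(dat$sex)], pch = 19,
     xlab = "age", ylab = "y")

# Separate least-squares trend line for each sex
fits <- lapply(split(dat, dat$sex), function(d) lm(y ~ age, data = d))
for (s in names(fits)) abline(fits[[s]], col = cols[s], lwd = 2)
legend("topleft", legend = names(cols), col = cols, pch = 19, lwd = 2)

# Intercept and slope of each trend line
sapply(fits, coef)
```

The last line prints the intercept and slope of each line, which can be compared with the values the data were generated from.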
Let’s first work with the data knowing their structure and fit a model with a sex by age interaction.
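The Call line in the output below shows the formula that was used. A minimal sketch of the corresponding code follows, with the data regenerated so the block runs on its own; the object name fit_int is my own label, not the author's.

```r
# Regenerate the data (compact restatement of the generation code above)
set.seed(123)
dat <- data.frame(
  y   = c(rnorm(25, mean = rep(0:4, each = 5), sd = 1),  # M: mean equals age
          rnorm(25, mean = 0, sd = 1)),                   # F: mean zero
  age = rep(rep(0:4, each = 5), 2),
  sex = rep(c("M", "F"), each = 25)
)

# sex * age is shorthand for sex + age + sex:age
fit_int <- lm(y ~ sex * age, data = dat)
summary(fit_int)
```

R orders the levels of sex alphabetically, so F is the reference level and the sexM and sexM:age rows describe how the M line differs from the F line.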
##
## Call:
## lm(formula = y ~ sex * age, data = dat)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.9068 -0.5332 -0.1707  0.6498  2.1258
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.22013    0.32366   0.680    0.500
## sexM         0.08650    0.45772   0.189    0.851
## age         -0.05899    0.13213  -0.446    0.657
## sexM:age     0.88902    0.18687   4.758 1.97e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9343 on 46 degrees of freedom
## Multiple R-squared:  0.6604, Adjusted R-squared:  0.6382
## F-statistic: 29.81 on 3 and 46 DF,  p-value: 7.381e-11
The model is coded with F as the reference level of sex. For the F group the regression line is $\hat{y} = 0.220 - 0.059 \times \text{age}$, where neither the intercept nor the slope is significantly different from zero (p=0.50 and 0.66 respectively). For the M group the regression line is $\hat{y} = (0.220 + 0.087) + (-0.059 + 0.889) \times \text{age} = 0.307 + 0.830 \times \text{age}$. We would need to test separately whether the M slope and intercept are different from zero, but the interaction effect is significant (which is really what we care about here).
In terms of statistics, the model with an interaction lets both the intercept and the slope differ between the F and M categories, and the significant sexM:age coefficient says that the age slope does depend on sex.
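The two group-specific lines can be read off the coefficient vector, and one way to get tests for the M line directly is to refit with M as the reference level. The following is a sketch under that assumption, not necessarily how the author would do it; the names fit_int, line_F, line_M, dat_M, and fit_M are hypothetical.

```r
# Regenerate the data (compact restatement of the generation code above)
set.seed(123)
dat <- data.frame(
  y   = c(rnorm(25, mean = rep(0:4, each = 5), sd = 1),  # M: mean equals age
          rnorm(25, mean = 0, sd = 1)),                   # F: mean zero
  age = rep(rep(0:4, each = 5), 2),
  sex = rep(c("M", "F"), each = 25)
)
dat$sex <- factor(dat$sex)  # levels "F", "M"; F is the reference

fit_int <- lm(y ~ sex * age, data = dat)
b <- coef(fit_int)

# F line: the reference-level coefficients
line_F <- c(intercept = unname(b["(Intercept)"]),
            slope     = unname(b["age"]))
# M line: reference coefficients plus the sexM offsets
line_M <- c(intercept = unname(b["(Intercept)"] + b["sexM"]),
            slope     = unname(b["age"] + b["sexM:age"]))
rbind(F = line_F, M = line_M)

# Refit with M as the reference level: the (Intercept) and age rows
# now give the estimates and t tests for the M line itself
dat_M <- transform(dat, sex = relevel(sex, ref = "M"))
fit_M <- lm(y ~ sex * age, data = dat_M)
summary(fit_M)$coefficients[c("(Intercept)", "age"), ]
```

Releveling does not change the fitted model, only which group the un-interacted coefficients refer to, so the interaction estimate simply flips sign.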
Suppose we want to work with only the main effects of age and sex. Doing so is wrong both statistically (the models are misspecified) and in interpretation. That is, we know that the age effect is different in the M group than in the F group, so ignoring the interaction conveys incorrect information. But we’ll try anyway.
If we omit the interaction term we assume that the age effect does not change with sex. That is, we estimate one slope for age, but allow different y-intercepts for F and M. The model output and corresponding plot are below.
##
## Call:
## lm(formula = y ~ sex + age, data = dat)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.9963 -1.0246  0.1595  0.7968  2.4347
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -0.6689     0.3193  -2.095  0.04163 *
## sexM          1.8645     0.3193   5.839 4.72e-07 ***
## age           0.3855     0.1129   3.414  0.00133 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.129 on 47 degrees of freedom
## Multiple R-squared:  0.4932, Adjusted R-squared:  0.4717
## F-statistic: 22.87 on 2 and 47 DF,  p-value: 1.155e-07
We know that this is the wrong model, and can observe the poor fit in the F category. The model does detect the age and sex effects that are present in the data, but because it forces a common slope we incorrectly infer that y increases with age for F.
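To make the wrong inference for F concrete, the sketch below compares what the additive model and the interaction model predict for F at ages 0 and 4. It also uses anova() to compare the two nested fits, which is my own addition and not part of the original analysis. The names fit_add, fit_int, and new_F are hypothetical.

```r
# Regenerate the data (compact restatement of the generation code above)
set.seed(123)
dat <- data.frame(
  y   = c(rnorm(25, mean = rep(0:4, each = 5), sd = 1),  # M: mean equals age
          rnorm(25, mean = 0, sd = 1)),                   # F: mean zero
  age = rep(rep(0:4, each = 5), 2),
  sex = factor(rep(c("M", "F"), each = 25))
)

fit_add <- lm(y ~ sex + age, data = dat)  # common slope, separate intercepts
fit_int <- lm(y ~ sex * age, data = dat)  # separate slopes and intercepts

# Predicted y for F at the youngest and oldest ages
new_F <- data.frame(sex = factor("F", levels = levels(dat$sex)),
                    age = c(0, 4))
pred_add <- predict(fit_add, newdata = new_F)
pred_int <- predict(fit_int, newdata = new_F)
rbind(additive = pred_add, interaction = pred_int)

# F test of the additive model against the interaction model
cmp <- anova(fit_add, fit_int)
cmp
```

The additive predictions for F rise with age because the single slope is pulled up by the M data, whereas the interaction model keeps the F line nearly flat.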
Finally, we can consider the main effects of age and sex separately. First we’ll look for a sex difference in y. The side-by-side box plots show there is a difference (a t test gives p<0.0001 for the difference in means, output omitted). There is a difference in y between the sexes, but we know that the difference also depends on age. That is, the difference is minimal at age=0 and more pronounced at age=4.
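A sketch of the omitted comparison follows. The source does not say which t test variant was used, so R's default Welch test is an assumption here, and the names tt, means, and d are hypothetical. The table of cell means shows how the M minus F difference changes with age.

```r
# Regenerate the data (compact restatement of the generation code above)
set.seed(123)
dat <- data.frame(
  y   = c(rnorm(25, mean = rep(0:4, each = 5), sd = 1),  # M: mean equals age
          rnorm(25, mean = 0, sd = 1)),                   # F: mean zero
  age = rep(rep(0:4, each = 5), 2),
  sex = factor(rep(c("M", "F"), each = 25))
)

# Side-by-side box plots of y by sex
boxplot(y ~ sex, data = dat)

# Two-sample t test ignoring age (Welch test, R's default; an assumption)
tt <- t.test(y ~ sex, data = dat)
tt

# Mean of y in each age/sex cell, and the M minus F difference at each age
means <- tapply(dat$y, list(age = dat$age, sex = dat$sex), mean)
d <- means[, "M"] - means[, "F"]
cbind(means, diff = d)
```

A single pooled difference hides the pattern in the diff column, where the gap between the sexes grows with age.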
Next we look for an overall age effect. The plot below shows the overall regression line for age. The age slope is estimated to be 0.39 (p=0.012, output omitted), so there appears to be an overall age effect. Again, we are making the error of concluding that y increases with age for both F and M.
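The omitted age-only fit can be sketched as below, alongside the slope estimated within each sex. The names fit_age and slopes are hypothetical.

```r
# Regenerate the data (compact restatement of the generation code above)
set.seed(123)
dat <- data.frame(
  y   = c(rnorm(25, mean = rep(0:4, each = 5), sd = 1),  # M: mean equals age
          rnorm(25, mean = 0, sd = 1)),                   # F: mean zero
  age = rep(rep(0:4, each = 5), 2),
  sex = factor(rep(c("M", "F"), each = 25))
)

# Regression of y on age alone, ignoring sex
fit_age <- lm(y ~ age, data = dat)
summary(fit_age)$coefficients

# Age slope estimated separately within each sex
slopes <- sapply(split(dat, dat$sex),
                 function(d) unname(coef(lm(y ~ age, data = d))["age"]))
slopes

# Pooled slope next to the average of the two sex-specific slopes
c(pooled = unname(coef(fit_age)["age"]), average = mean(slopes))
```

Because the ages are balanced across the sexes, the pooled slope equals the average of the two sex-specific slopes, so it describes neither group well.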
I hope this tutorial explained how to read R output for interaction effects, and why interpretations of main effects are misleading when an interaction effect is present.