TITLE: Figuring out Sums of Squares in ANOVA
DATE: 2018-09-20
AUTHOR: John L. Godlee
====================================================================


I was teaching on a field course and, when the students started
analysing their data in R, one of them noticed that if they switched
around the independent variables in lm() they got different results
with different methods of computing the analysis of variance table.
I wanted to investigate it more, and this is the resulting R
Markdown report that I wrote:

The report can be found [here], and is also pasted below.

  [here]: https://johngodlee.github.io/files/anova_ssq/anova_ssq.zip

Issues with lm(), aov(), anova(), Anova() in R, methods of
estimating sums of squares
John Godlee
13/09/2018

The methods by which sums of squares are calculated within a linear
model differ depending on the function used to fit the model (lm(),
aov()) and the function used to calculate the ANOVA table (anova(),
car::Anova()). This results in different test statistic values and
corresponding P-values.
Methods of estimating Sums of Squares (SSQ)
Type I (Sequential)
Steps for partitioning variance among multiple independent
variables:
SS(A) - Main effect of independent variable A.
SS(B | A) - Main effect of B after the main effect of A.
SS(A*B | A, B) - Interaction A*B after all main effects.
Because the main factors are tested in a particular order (defined
in the model specification), if the factors are unbalanced
(i.e. different numbers of observations for each level of the
factors) the order of factors in the specification matters.
Often there is a degree of multicollinearity among independent
variables. When this is the case, there is a portion of the variance
in the dependent variable which could potentially be attributed to
multiple independent variables. In this method of partitioning, any
variance that could be attributed to both independent variables is
given to the one entered into the model specification first.

Type II (Partial with separate interactions)
Steps for partitioning variance among multiple independent
variables:
SS(A | B) - Main effect of A after the main effect of B.
SS(B | A) - Main effect of B after the main effect of A.
SS(A*B | A, B) - Interaction A*B after both main effects.
In this method, the variance is partitioned simultaneously between A
with respect to B and B with respect to A. Any variance that can be
attributed to both independent variables is ignored until the
interaction term is calculated. Unlike Type III, this method doesn’t
adjust SSQ(A) and SSQ(B) in response to the interaction term.

Type III (Partial with combined interactions)
Steps for partitioning variance among multiple independent
variables:
SS(A | B, A*B) - Main effect of A after the main effect of B and the interaction.
SS(B | A, A*B) - Main effect of B after the main effect of A and the interaction.
SS(A*B | A, B) - Interaction A*B after both main effects.
In this method, like Type II, the variance is partitioned
simultaneously between A and B with respect to each other, but this
time the interaction term is also taken into account during this
partitioning. Then, like Type II, it calculates the variance
attributed to the interaction term with respect to both A and B.
This method disregards any potentially multi-attributable variance
completely, meaning that the term and residual sums of squares no
longer add up to the total (SSQ(A) + SSQ(B) + SSQ(A*B) +
SSQ(residual) != SSQ(total)). For this reason, many statisticians
think this method is pretty rubbish except in very specific
hypothesis testing circumstances.

Testing methods in R
All the code in this section uses data adapted from the mtcars
dataset, in the {datasets} package.
First, getting set up: loading packages, data and some new grouping
variables:
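Something like the following, where the exact splits used to create
the unbalanced factors group and group2 are illustrative guesses:

    # Load {car} for Anova(); mtcars comes from {datasets}
    library(car)

    # Two unbalanced grouping factors derived from mtcars; the exact
    # splits used in the original report are assumptions
    mtcars$group <- factor(ifelse(mtcars$mpg > 20, "high", "low"))
    mtcars$group2 <- factor(ifelse(mtcars$disp > 200, "big", "small"))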

Basic linear model - testing reporting methods
Let's see what happens when I run a basic linear model and then
create summary tables using different methods:
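A sketch, assuming wt as the dependent variable and hp as the
independent variable:

    # Simple linear model of weight on horsepower
    mod_basic <- lm(wt ~ hp, data = mtcars)

    summary(mod_basic)  # Regression output: coefficients and SEs
    anova(mod_basic)    # ANOVA table, Type I (sequential) SSQ
    Anova(mod_basic)    # ANOVA table from {car}, Type II by default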
All of these methods give the same P-values for the effect of hp on
wt. The only difference is that summary() gives the standard error
and model fit coefficients for the independent variable, while
anova() and Anova() give the Sums of Squares. Basically, summary()
gives a linear regression test output, while anova() and Anova()
give analysis of variance test outputs; no surprise there.

Multivariate models - testing ordering of variables using
summary(lm())
First, create a bunch of models:
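For example, the same dependent variable with the independent
variables entered in different orders (the exact models here are
assumptions):

    mod_1 <- lm(wt ~ hp + group, data = mtcars)      # Continuous first
    mod_2 <- lm(wt ~ group + hp, data = mtcars)      # Factor first
    mod_3 <- lm(wt ~ group + group2, data = mtcars)  # Two factors
    mod_4 <- lm(wt ~ group2 + group, data = mtcars)  # Swapped order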
Then test the output of summary() on all these models:
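Comparing the coefficient tables of each:

    summary(mod_1)
    summary(mod_2)
    summary(mod_3)
    summary(mod_4)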
In all the above cases, summary(lm()) gives the same outputs,
regardless of the order or type of variables. Because summary(lm())
isn't necessarily fitting an ANOVA, it doesn't use sequential sums
of squares (Type I); instead it uses Type III partial sums of
squares.
And just to prove that the cell sizes (i.e. number of data points in
each level of each factor) are different between group and group2:
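A quick check with table(), using the illustrative factors from the
setup:

    table(mtcars$group)
    table(mtcars$group2)
    table(mtcars$group, mtcars$group2)  # Cell sizes per combination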

Comparing methods of fitting ANOVA type linear models, e.g. aov(),
lm()
Create some more models, using lm() and aov():
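A sketch, giving each fitting function both orderings of the two
grouping factors:

    mod_lm_a <- lm(wt ~ group + group2, data = mtcars)
    mod_lm_b <- lm(wt ~ group2 + group, data = mtcars)
    mod_aov_a <- aov(wt ~ group + group2, data = mtcars)
    mod_aov_b <- aov(wt ~ group2 + group, data = mtcars)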
Then compare the model outputs using summary():
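For example:

    summary(mod_lm_a)
    summary(mod_lm_b)
    summary(mod_aov_a)
    summary(mod_aov_b)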
lm() decomposes the estimates by factor level, whereas aov() just
gives a single F value for each variable. Models fitted with aov()
are affected by the order in which factors are entered into the
model, resulting in different Sums of Squares estimates.
summary(aov()) uses Type I (sequential) SSQ, while summary(lm())
reports marginal tests equivalent to Type III (partial) SSQ.

Comparing lm() models using different ANOVA table methods, anova(),
Anova(), summary()
Now that I’ve tested the model fitting functions, I need to test the
difference in reporting methods.
First, create some models of unequal cell size (unbalanced design):
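Again assuming the unbalanced factors defined in the setup, with the
factor order swapped between the two models:

    mod_unbal_a <- lm(wt ~ group + group2, data = mtcars)
    mod_unbal_b <- lm(wt ~ group2 + group, data = mtcars)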
anova() fits using Type I Sums of Squares by default, giving all the
SSQ it can to the first factor and the leftover to the second factor:
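Using the sketched models from above:

    anova(mod_unbal_a)
    anova(mod_unbal_b)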
The Sums of Squares estimates change when the order of the factors
is changed!
Next, Anova() fits using Type II Sums of Squares, which
simultaneously assigns SSQ to both factors, excluding any shared SSQ:
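Again with the sketched models:

    Anova(mod_unbal_a)  # type = 2 is the default in car::Anova()
    Anova(mod_unbal_b)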
Note that the Sums of Squares estimates don't change when the order
of the factors is changed.

Comparing Type III and Type II SSQ methods from Anova()
Anova() from the {car} package allows specification of the Sums of
squares estimation method, so let’s test how they change the
results.
Create models:
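Presumably something along these lines, using the type argument of
Anova(); note that for a model without an interaction term, Type II
and Type III partitioning should coincide:

    mod_type <- lm(wt ~ group + group2, data = mtcars)

    Anova(mod_type, type = 2)
    Anova(mod_type, type = 3)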
All the above methods give identical results. Phew.

Conclusions
If the design of the analysis is balanced, the method of estimating
SSQ doesn't matter, but if it is unbalanced, Type I SSQ estimation
will give different estimates depending on the order of the factors
in the model specification.
Using Type I SSQ estimation is recommended for some types of
analysis, e.g. for an ANCOVA where you only want to attribute to the
grouping factor the variance that cannot be attributed to the
continuous variable.
Using Type I SSQ estimation, a second factor may appear less
significant than it actually is, if your hypothesis doesn’t assume
that the second factor is subsidiary.
If running an ANCOVA, generally specify the model as follows, unless
you have a very specific reason to preferentially load variance onto
the grouping variable, or want to disregard multi-attributable
variance:
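A plausible specification (an assumption, not necessarily the
report's original) puts the continuous covariate first so that,
under Type I (sequential) SSQ, any shared variance is attributed to
the covariate rather than the grouping factor:

    # Hypothetical ANCOVA specification: continuous covariate first
    anova(lm(wt ~ hp * group, data = mtcars))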