TITLE: Figuring out Sums of Squares in ANOVA
DATE: 2018-09-20
AUTHOR: John L. Godlee
====================================================================

I was teaching on a field course and, when the students started analysing their data in R, one of them noticed that if they switched around the independent variables in an lm() they got different results with different methods of computing analysis of variance tables. I wanted to investigate it further, and this is the resulting R Markdown report that I wrote.

The report can be found here: https://johngodlee.github.io/files/anova_ssq/anova_ssq.zip and is also pasted below.

Issues with lm(), aov(), anova(), Anova() in R, methods of estimating sums of squares
John Godlee
13/09/2018

The methods by which sums of squares are calculated within a linear model differ depending on the function used to fit the model (lm(), aov()) and the function used to calculate the ANOVA table (anova(), car::Anova()). This results in different test statistic values and corresponding P-values.

Methods of estimating Sums of Squares (SSQ)

Type I (Sequential)

Steps for partitioning variance among multiple independent variables:

SS(A) - Main effect of independent variable A.
SS(B | A) - Main effect of B after the main effect of A.
SS(AB | B, A) - Interaction A*B after all main effects.

Because the main factors are tested in a particular order (defined in the model specification), if the factors are unbalanced (i.e. different numbers of observations for each level of the factors), the order of factors in the specification matters. Often there is a degree of multicollinearity among independent variables. When this is the case, there is a portion of the variance in the dependent variable which could potentially be attributed to more than one independent variable. In this method of partitioning, any variance that could be attributed to both independent variables is given to the one entered into the model specification first.

Type II (Partial with separate interactions)

Steps for partitioning variance among multiple independent variables:

SS(A | B)
SS(B | A)
SS(A*B | A, B)

In this method, the variance is partitioned simultaneously between A with respect to B and B with respect to A. Any variance that could be attributed to both independent variables is ignored until the interaction term is calculated. Unlike Type III, this method doesn't adjust SSQ(A) and SSQ(B) in response to the interaction term.

Type III (Partial with combined interactions)

Steps for partitioning variance among multiple independent variables:

SS(A | B, A*B)
SS(B | A, A*B)
SS(A*B | A, B)

In this method, like Type II, the variance is partitioned simultaneously between A and B with respect to each other, but this time the interaction term is also taken into account during the partitioning. Then, like Type II, the variance attributed to the interaction term is calculated with respect to both A and B. This method disregards any potentially multi-attributable variance completely, meaning that the term SSQs no longer add up to the total (SSQ(A) + SSQ(B) + SSQ(A*B) != SSQ(total)). For this reason, many statisticians think this method is pretty rubbish except in very specific hypothesis-testing circumstances.

Testing methods in R

All the code tested in this section uses data adapted from the mtcars dataset, in the {datasets} package.
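The code chunks from the report are not pasted into this post, so here is a minimal sketch of the kind of setup described below. The grouping variables group and group2 are my reconstruction (two unbalanced factors derived from mtcars columns), not the exact code from the report:

```r
# {car} provides Anova(); mtcars comes with base R ({datasets})
library(car)

dat <- mtcars

# Two hypothetical unbalanced grouping factors adapted from mtcars columns.
# The cut points are arbitrary; the report's actual groupings will differ.
dat$group  <- factor(ifelse(dat$disp > median(dat$disp), "big", "small"))
dat$group2 <- factor(ifelse(dat$hp > 150, "high", "low"))

# Confirm the cell sizes are unequal (an unbalanced design)
table(dat$group, dat$group2)
```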
First, getting set up: loading packages and data, and creating some new grouping variables (sketched above).

Basic linear model - testing reporting methods

Let's see what happens when I run a basic linear model and then create summary tables using different methods. All of these methods give the same P-values for the effect of hp on wt. The only difference is that summary() gives the standard error and model fit coefficients for the independent variable, while anova() and Anova() give the sums of squares. Basically, summary() gives a linear regression test output, while anova() and Anova() give analysis of variance test outputs; no surprise there.

Multivariate models - testing ordering of variables using summary(lm())

First, create a bunch of models, then test the output of summary() on all of them. In every case, summary(lm()) gives the same output, regardless of the order or type of variables. Because summary(lm()) isn't necessarily fitting an ANOVA, it doesn't use sequential sums of squares (Type I); instead it uses Type III partial sums of squares. And just to prove it, the cell sizes (i.e. the number of data points in each level of each factor) really are different between group and group2.

Comparing methods of fitting ANOVA type linear models, e.g. aov(), lm()

Create some more models using lm() and aov(), and compare the model output using summary(). summary(lm()) decomposes the estimates by factor level, whereas summary(aov()) just gives a single F value for that variable. Models fitted with aov() are affected by the order in which factors are entered into the model, resulting in different sums of squares estimates: summary(aov()) uses Type I (sequential) SSQ, while summary(lm()) uses Type III (partial) SSQ.

Comparing lm() models using different ANOVA table methods, anova(), Anova(), summary()

Now that I've tested the model fitting functions, I need to test the differences between the reporting methods. First, create some models of unequal cell size (an unbalanced design); a condensed sketch of these comparisons is given at the end of this post. anova() fits using Type I sums of squares by default, giving as much SSQ as possible to the first factor and the leftover to the second factor, so the sums of squares estimates change when the order of factors is changed! Next, fitting using Type II sums of squares (the default for car::Anova()), which assigns SSQ to both factors simultaneously, excluding any shared SSQ. Note that the sums of squares estimates don't change when the order of factors is changed.

Comparing Type III and Type II SSQ methods from Anova()

Anova() from the {car} package allows specification of the sums of squares estimation method, so let's test how they change the results. Create some models and compare: all the above methods give identical results. Phew. (Type II and Type III estimates coincide when the model has no interaction term, or when the design is balanced.)

Conclusions

If the design of the analysis is balanced, the method of estimating SSQ doesn't matter, but if it is unbalanced, Type I SSQ estimation will give different estimates depending on the order of factors in the model specification.

Using Type I SSQ estimation is recommended for some types of analysis, e.g. possibly for an ANCOVA where you only want to attribute to the grouping factor the variance that cannot be attributed to the continuous covariate.

Using Type I SSQ estimation, a second factor may appear less significant than it actually is, which is a problem if your hypothesis doesn't assume that the second factor is subsidiary.

If running an ANCOVA, generally specify the model as follows, unless you have a very specific reason to preferentially load variance onto the grouping variable, or want to disregard multi-attributable variance:
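The report's code chunks aren't reproduced in this post, so here is a condensed sketch of the comparisons described above, reusing the dat, group, and group2 objects from the setup sketch and finishing with the kind of ANCOVA specification the final recommendation seems to point to. The models, and especially the final formula, are my reconstructions rather than the report's actual code:

```r
library(car)  # for Anova()

# Basic linear model: same P-value for hp, different table layouts
m0 <- lm(wt ~ hp, data = dat)
summary(m0)  # regression-style output: coefficients and standard errors
anova(m0)    # ANOVA table, Type I (sequential) SSQ
Anova(m0)    # ANOVA table from {car}, Type II SSQ by default

# Order of terms: summary(lm()) is unaffected...
m1 <- lm(wt ~ group + group2, data = dat)
m2 <- lm(wt ~ group2 + group, data = dat)
summary(m1)
summary(m2)  # same coefficient tests as m1

# ...but Type I (sequential) tables are order-dependent with unbalanced data
anova(m1)
anova(m2)    # different SSQ for group and group2 than anova(m1)

# summary(aov()) also uses sequential SSQ, so order matters there too
summary(aov(wt ~ group + group2, data = dat))
summary(aov(wt ~ group2 + group, data = dat))

# Type II from car::Anova() is order-invariant
Anova(m1, type = 2)
Anova(m2, type = 2)  # same SSQ as for m1

# Type III: identical to Type II here because there is no interaction term
Anova(m1, type = 3)

# A guess at the ANCOVA-style specification hinted at in the conclusion:
# continuous covariate entered first, grouping factor second, Type I SSQ,
# so the factor is only credited with variance the covariate cannot explain
anova(lm(wt ~ hp * group, data = dat))
```

With Type I SSQ, writing group * hp instead would preferentially load any shared variance onto the grouping factor, which is the "very specific reason" case mentioned above.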