It’s A Math, Math World (ANOVA Part 1)
In previous weeks, we learned how to test a hypothesis about the difference between two population means, i.e., whether two means μ_{1} and μ_{2} are equal. What if we have more than two populations and want to see whether all of their means are equal? We then need to compare more than 2 population means at the same time. The procedure for doing so is called Analysis of Variance (ANOVA).
Note that we could conduct multiple pairwise tests of the equality of means, but this would inflate the overall Type I error rate considerably, because every additional test adds another chance of a false rejection. In the ANOVA case, we test the following hypothesis:
H_{0}: μ_{1} = μ_{2} = μ_{3} = … = μ_{k}
H_{A}: not all the means are equal
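To put a rough number on how quickly the error rate grows with multiple pairwise tests, here is a minimal Python sketch. The function name `familywise_error_rate` is made up for this illustration, and the formula assumes the tests are independent, which pairwise tests on shared samples are not, so treat the output as a guide rather than an exact figure.

```python
from math import comb


def familywise_error_rate(k, alpha=0.05):
    """Chance of at least one false rejection across all pairwise tests.

    Assumption: the tests are independent. Pairwise tests that share
    samples are not, so this is only a rough guide.
    """
    n_tests = comb(k, 2)  # number of distinct pairs among k groups
    return n_tests, 1 - (1 - alpha) ** n_tests


for k in (3, 4, 6):
    n_tests, fwer = familywise_error_rate(k)
    print(f"k={k}: {n_tests} pairwise tests, family-wise error rate ~ {fwer:.3f}")
```

With 4 groups there are already 6 pairs to test, which is why a single overall test is preferred.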
The methodology is as follows (we are assuming EQUAL sample sizes in this example). This example is from the textbook General Statistics (2000) by Chase and Bown:
A large chemical company uses 4 manufacturing plants to produce the same fertilizer. The plants were built to be equivalent, so the mean output of fertilizer from each plant should be the same and have the same variability. We want to test that the weekly mean output (tons of fertilizer produced) is the same for each plant. This will of course vary week to week, but we are interested in the true mean weekly production for a plant.
H_{0}: μ_{1} = μ_{2} = μ_{3} = μ_{4}
H_{A}: not all the means are equal (at least one is different)
Weekly Production Figures for 5 weeks for 4 Fertilizer Plants (weekly production is in tons)
Week | PLANT 1 | PLANT 2 | PLANT 3 | PLANT 4 |
Week 1 | 574 | 546 | 580 | 585 |
Week 2 | 578 | 556 | 570 | 582 |
Week 3 | 573 | 549 | 577 | 581 |
Week 4 | 568 | 551 | 575 | 589 |
Week 5 | 572 | 553 | 573 | 588 |
Sample mean (tons) | 573 | 551 | 575 | 585 |
Sample variance (tons^{2}) | 13 | 14.5 | 14.5 | 12.5 |
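The sample means and variances in the table can be checked with a short Python sketch using only the standard library. The dictionary name `production` and the `summary` layout are choices made for this illustration; `statistics.variance` uses the n-1 divisor, which is the sample variance used here.

```python
from statistics import mean, variance

# Weekly production in tons, copied from the table above.
production = {
    "Plant 1": [574, 578, 573, 568, 572],
    "Plant 2": [546, 556, 549, 551, 553],
    "Plant 3": [580, 570, 577, 575, 573],
    "Plant 4": [585, 582, 581, 589, 588],
}

# Map each plant to (sample mean, sample variance with divisor n - 1).
summary = {plant: (mean(tons), variance(tons)) for plant, tons in production.items()}

for plant, (plant_mean, plant_var) in summary.items():
    print(f"{plant}: mean = {plant_mean}, variance = {plant_var}")
```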
If the sample means are clustered close together, this would tend to support H_{0}.
A great degree of variability among the sample means would suggest that not all of the population means are equal, thus supporting H_{A}.
The key to testing for equality of several population means is to look at the variability between the sample means. A large amount of variability would suggest that not all of the population means are equal, so we would reject H_{0} in favor of H_{A}; otherwise we would not reject H_{0}.
“Large” is a relative term, and this variability must be measured in terms of something. We will call it large when the variability between the sample means is large in relation to the variability within the samples. When this is the case, we reject H_{0} and conclude that the population means are not all the same.
First we assume that the population variance, σ^{2}, is the same for all the plants, whether the means are equal or not. From our sample data, we will calculate 2 estimates:
- The within-sample estimate of σ^{2}
- The between-sample estimate of σ^{2}
Estimate #1: Within-Sample Estimate
We pool the sample variances by averaging them (a simple average works here because the sample sizes are equal):
Estimate 1 = (13+14.5+14.5+12.5)/4 = 13.625
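As a hedged aside, the simple average is a special case of the usual pooled variance, in which each sample variance is weighted by its degrees of freedom. The weighted form is the standard generalization and is not taken from the textbook example; the helper name `pooled_variance` is invented for this sketch.

```python
def pooled_variance(variances, sizes):
    """Average the sample variances, weighting each by its degrees of freedom."""
    numerator = sum((n - 1) * s2 for s2, n in zip(variances, sizes))
    denominator = sum(n - 1 for n in sizes)
    return numerator / denominator


variances = [13, 14.5, 14.5, 12.5]  # sample variances from the table
sizes = [5, 5, 5, 5]                # 5 weeks of data per plant

within_estimate = pooled_variance(variances, sizes)
simple_average = sum(variances) / len(variances)

print(within_estimate, simple_average)  # both are 13.625 when sizes are equal
```

With unequal sample sizes the two numbers would differ, and the weighted form is the one to use.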
Estimate #2: Between-Sample Estimate
Let us assume for the moment that H_{0} is true. We can then view the samples of production figures as 4 samples of size m = 5 from the same population. The 4 sample means are values of the random variable x_bar. From the sampling distribution of the mean, we know that the standard deviation of x_bar is:
σ_{x_bar} = sqrt(σ^{2}/m), or equivalently σ^{2} = m × (σ_{x_bar})^{2}
We use the sample variance of the 4 values of x_bar, which I will call s^{2}_{x_bar}, as an estimate of (σ_{x_bar})^{2}. Multiplying it by m then gives the between-sample estimate of σ^{2}.
We first need to find the grand mean of the 4 sample means, which is (573 + 551 + 575 + 585)/4 = 571.
We calculate the sample variance s^{2}_{x_bar} as follows:
Sample mean | Sample mean - grand mean | (Sample mean - grand mean)^{2} |
573 | 2 | 4 |
551 | -20 | 400 |
575 | 4 | 16 |
585 | 14 | 196 |
Grand mean = 571 | | SUM = 616 |
s^{2}_{x_bar} = SUM/(4-1) = 616/3 = 205.333
Estimate 2 = m × s^{2}_{x_bar} = 5 × 205.333 = 1026.667
We combine the estimates by dividing the between-sample estimate by the within-sample estimate:
F-stat = (Estimate #2)/(Estimate #1) = 1026.667/13.625 = 75.352
If H_{0} is true, both estimates measure the same σ^{2} and F-stat should be close to 1. If the population means differ, Estimate #2 is inflated by the spread of the means and F-stat becomes large.
The statistic, F-stat, follows an F distribution with df1 = k-1 and df2 = n-k degrees of freedom, where:
n = # of data values in all the samples
k = # of populations
We express the degrees of freedom as an ordered pair, df = (k-1, n-k).
In our example, n = 20 and k = 4, so we compare F-stat = 75.352 to the F distribution at α = 0.05 with df = (3, 16).
Our critical value is 3.24 (from the F distribution tables). Since F-stat > critical value, we reject H_{0} and conclude that the mean weekly output is not the same for all 4 plants.
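As a check on the hand arithmetic, here is a minimal Python sketch that recomputes both estimates and the F ratio (between-sample estimate over within-sample estimate) directly from the table of weekly production figures. The function name `f_statistic_equal_sizes` is invented for this illustration, the code only handles equal sample sizes, and the critical value 3.24 is hard-coded from the F table rather than computed.

```python
from statistics import mean, variance

# Weekly production in tons, one list per plant, copied from the table.
samples = [
    [574, 578, 573, 568, 572],
    [546, 556, 549, 551, 553],
    [580, 570, 577, 575, 573],
    [585, 582, 581, 589, 588],
]

CRITICAL_VALUE = 3.24  # F table value at alpha = 0.05, df = (3, 16)


def f_statistic_equal_sizes(samples):
    """Return (F, df, within, between) for k samples of equal size m."""
    k = len(samples)
    m = len(samples[0])
    if any(len(s) != m for s in samples):
        raise ValueError("this sketch assumes equal sample sizes")
    # Within-sample estimate: average of the k sample variances.
    within = mean([variance(s) for s in samples])
    # Between-sample estimate: m times the variance of the k sample means.
    between = m * variance([mean(s) for s in samples])
    n = k * m
    return between / within, (k - 1, n - k), within, between


f_stat, df, within, between = f_statistic_equal_sizes(samples)
print(f"within-sample estimate  = {within:.3f}")
print(f"between-sample estimate = {between:.3f}")
print(f"F = {f_stat:.3f} with df = {df}")
print("reject H0" if f_stat > CRITICAL_VALUE else "fail to reject H0")
```

If SciPy is installed, `scipy.stats.f_oneway(*samples)` runs the same one-way test and also returns a p-value, which avoids the table lookup.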
4 sample means which is = (573 + 571 + 575 + 573)/4
why not =(573+551+575+585)/4=571?
Michael –
The grand mean of the sample is 571 not 573.
Best,
Jeff
Thanks, Jeff. I made the correction. Thanks for your eagle eye and for keeping me honest. All the best, Mike
Yes, you are correct. I made the correction. That will teach me to proofread better.
Thanks again.
Very good example. Simple and relevant. Perhaps notations from one step to another are a little bit confusing (σ2 to s2x_bar), but it’s OK. Also, being a tutorial, it would be worthwhile to explain why the F statistic takes this form, what does it mean. In the end, are you sure about the number of degrees of freedom and the F value for those dfs?
Hi Dan, Thanks for the positive note. I am planning to do another post of ANOVA when sample sizes are not equal and maybe I can cover the topics you mentioned at that time. Very good points! As for the F value and the degrees of freedom, I will check it again because I transcribed this example from a text and I could have made a mistake. It would not be the first time! Take Care and thanks again.
Hello Mr. O’Brien:
Thank you for posting such useful material on your blog. It has been very helpful to me so far. I have a small suggestion: Could you give a suitable title (we see just the date now) to each blog post so that in the future, when you have many posts on your blog, it becomes easier to search for a topic one wants easily rather than going through all your posts. Thanks again.
Sincerely,
Sid
Thank you, Mr. O’Brien, it’s perfect. Waiting for your new post, because approaches like this one really help students to better understand how to use data analysis tools.
Dan