It’s A Math, Math World (Contingency Tables & Independence)
In this week’s post, we analyze categorical data with contingency tables. We want to see whether 2 or more characteristics are related (dependent) or unrelated (independent). The following examples are from the textbook General Statistics, by Chase and Brown (2000).
What do we mean by the independence of 2 characteristics? Suppose candidate A and candidate B are running for public office, and 75% of the voters favor candidate A while 25% favor candidate B. Consider two characteristics: choice of candidate and gender of voter. These characteristics are independent if the percentages of voters favoring candidate A and candidate B are the same for both genders (i.e. 75% of men and 75% of women favor candidate A, while 25% of men and 25% of women favor candidate B). If for some reason the percentage favoring candidate A were greater among men, then the characteristics would be related, or dependent.
We can create the following contingency table of the 4 possible combinations of the 2 factors:
GENDER | FAVOR CANDIDATE A | FAVOR CANDIDATE B |
FEMALES | Female and favor candidate A | Female and favor candidate B |
MALES | Male and favor candidate A | Male and favor candidate B |
Suppose 60% of voters in this election are female.
P (A) = Probability of vote for candidate A = 0.75
P (B) = Probability of vote for candidate B = 0.25
P (F) = Probability of female voter = 0.60
P (M) = Probability of male voter = 0.40
If candidate choice and gender of voter are independent, then
P (FA) = Probability of female votes for candidate A = P (F)*P (A) =0.6*0.75 = 0.45
P (FB) = Probability of female votes for candidate B = P (F)*P (B) =0.6*0.25 = 0.15
P (MA) = Probability of male votes for candidate A = P (M)*P (A) =0.4*0.75 = 0.30
P (MB) = Probability of male votes for candidate B = P (M)*P (B) =0.4*0.25 = 0.10
Otherwise they are dependent.
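As a quick check of the arithmetic above, here is a minimal Python sketch that builds the four joint probabilities from the marginal ones under the independence assumption. The variable names (`p_candidate`, `p_gender`, `joint`) are my own and not from the textbook.

```python
# Marginal probabilities taken from the example above.
p_candidate = {"A": 0.75, "B": 0.25}
p_gender = {"F": 0.60, "M": 0.40}

# Under independence, each joint probability is the product of its marginals.
joint = {
    (g, c): p_gender[g] * p_candidate[c]
    for g in p_gender
    for c in p_candidate
}

for (g, c), p in joint.items():
    print(f"P({g}{c}) = {p:.2f}")
```

The four products cover every voter exactly once, so they add up to 1. If real data gave joint proportions noticeably different from these products, that would point toward dependence.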
Example: The following are the results of a survey of 100 college students at Framingham State College. We are testing whether their political views are independent of their views on nuclear power.
The following 2 questions were asked:
1) What label most closely describes your political views (Democrat, Republican or Independent)?
2) What is your opinion on the use of nuclear power for the production of consumer energy (Approve, Disapprove or Undecided)?
Students' Political Views vs. Their Opinions on Nuclear Power
OPINION | DEMOCRAT | REPUBLICAN | INDEPENDENT | ROW TOTAL |
APPROVE | 10 | 15 | 20 | 45 |
DISAPPROVE | 9 | 2 | 16 | 27 |
UNDECIDED | 8 | 2 | 18 | 28 |
COLUMN TOTAL | 27 | 19 | 54 | 100 (GRAND TOTAL) |
We want to test the following hypothesis:
H_{0}: The two characteristics are independent
H_{A}: The two characteristics are related
As with the goodness-of-fit test we looked at in the previous post, we want to calculate the Expected frequency (E) for each cell of the table. Assuming H_{0} is true, E comes from the row and column totals of the Observed frequencies (O):
E (cell) = (row total)*(column total)/ (grand total)
Table of Observed Values (Expected Values)
OPINION | DEMOCRAT | REPUBLICAN | INDEPENDENT | ROW TOTAL |
APPROVE | 10 (12.15) | 15 (8.55) | 20 (24.30) | 45 |
DISAPPROVE | 9 (7.29) | 2 (5.13) | 16 (14.58) | 27 |
UNDECIDED | 8 (7.56) | 2 (5.32) | 18 (15.12) | 28 |
COLUMN TOTAL | 27 | 19 | 54 | 100 (GRAND TOTAL) |
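The expected values in parentheses can be reproduced with a few lines of Python. This is only a sketch using plain lists; the names `observed`, `row_totals`, `col_totals` and `expected` are my own choices.

```python
# Observed counts from the survey table: rows are opinions, columns are parties.
observed = [
    [10, 15, 20],  # Approve: Democrat, Republican, Independent
    [9, 2, 16],    # Disapprove
    [8, 2, 18],    # Undecided
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# E(cell) = (row total) * (column total) / (grand total)
expected = [
    [r * c / grand_total for c in col_totals]
    for r in row_totals
]

for row in expected:
    print([round(e, 2) for e in row])
```

Each expected row adds up to the same row total as the observed counts, which is a handy sanity check when doing this by hand.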
We will use the Chi-Square test of independence, whose test statistic sums (O-E)^{2}/E over all 9 cells of the table:
χ^{2} = ∑ ((O-E)^{2}/E) = 11.10
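To see where 11.10 comes from, here is a short Python sketch that sums (O-E)^{2}/E over the nine cells, with the observed and expected values copied straight from the table. The names `cells` and `chi_square` are just illustrative.

```python
# (observed, expected) pairs copied from the table in the post, row by row.
cells = [
    (10, 12.15), (15, 8.55), (20, 24.30),  # Approve
    (9, 7.29), (2, 5.13), (16, 14.58),     # Disapprove
    (8, 7.56), (2, 5.32), (18, 15.12),     # Undecided
]

# Each cell contributes (O - E)^2 / E to the test statistic.
chi_square = sum((o - e) ** 2 / e for o, e in cells)
print(round(chi_square, 2))
```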
We want to test at the α = 0.05 level of significance. We use Chi-Square tables with
df = (# of rows - 1)*(# of columns - 1) = 2*2 = 4
χ^{2}(0.05, df=4) = 9.488
Since test_statistic = 11.10 > 9.488 = critical_value, we reject the null hypothesis at the 0.05 level and conclude that the 2 characteristics (political views and opinion on nuclear power) are related.
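The decision can also be sketched in Python. The comparison against the table value is exactly the rule used above. As an extra that is not in the textbook example, the sketch also computes a p-value using the closed form of the Chi-Square upper tail, which holds only when df is even (as it is here). The function name `chi_square_sf_even_df` is hypothetical; in practice you would more likely get this from a statistics library.

```python
import math


def chi_square_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only covers positive even df")
    half = x / 2
    # For df = 2m: P(X > x) = exp(-x/2) * sum_{k=0}^{m-1} (x/2)^k / k!
    return math.exp(-half) * sum(
        half ** k / math.factorial(k) for k in range(df // 2)
    )


n_rows, n_cols = 3, 3
df = (n_rows - 1) * (n_cols - 1)

alpha = 0.05
test_statistic = 11.10  # value reported in the post
critical_value = 9.488  # table value for alpha = 0.05, df = 4

reject_null = test_statistic > critical_value
p_value = chi_square_sf_even_df(test_statistic, df)

print(df, reject_null, p_value < alpha)
```

Both routes agree: the statistic exceeds the critical value, and the p-value is below α.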
I was thinking maybe one could apply a similar strategy to pair trading algorithms, to figure out whether two given stocks are related.
I was trying to devise a case.
1. Pick two stocks (and later keep iterating this loop to filter possible candidates; possibly extend to more than two…).
2. Figure out if they are independent.
3. If not, pick them as a possible pair.
4. Go back to step 1.