14 Small sample tests

If the sample size n is less than 30 (n < 30), the sample is known as a small sample. For small samples, the commonly used sampling distributions of statistics are the χ2 (chi-square), F and t distributions. The study of sampling distributions of statistics for small samples is known as small sample theory.


14.1 Tests based on Student t distribution (t-tests)

Assumptions of t-test:

  • The parent population from which the sample is drawn is normal.

  • The sample is a random sample.

  • The population standard deviation, σ is unknown.

14.1.1 Test for a single population mean

Consider a population with unknown mean μ. We take a random sample of size n (n < 30) from the population and calculate the sample mean, denoted \(\overline{x}\). We want to test whether the unknown population mean μ is equal to some known constant μ0, based on the sample mean \(\overline{x}\).

The null hypothesis to be tested is

H0 : μ = μ0

The alternative hypothesis may be either

H1 : μ < μ0 (called left tailed alternative)

Or

H1 : μ > μ0 (called right tailed alternative)

Or

H1 : μ ≠ μ0 (called two tailed alternative)

The test statistic is

\[t = \frac{\overline{x} - \mu_{0}}{\frac{s}{\sqrt{n}}}\]

Where, \(s^{2} = \frac{\sum_{i = 1}^{n}\left( x_{i} - \overline{x} \right)^{2}}{n - 1}\)

Under the null hypothesis, t follows a t distribution with n-1 degrees of freedom.

14.1.2 Decision rule for t test

Let t be the calculated value, degrees of freedom = n-1, α be the level of significance, then we reject the null hypothesis if

  • |t| > tα/2 ; for two tailed test

  • t > tα ; for right tailed test

  • t < - tα ; for left tailed test

Where tα or tα/2 can be obtained from the table of the Student t distribution for the given degrees of freedom (n-1) and level of significance α. If the calculated value of the test statistic exceeds the critical value from the table, we reject the null hypothesis; otherwise, we fail to reject it.
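For readers who prefer to compute rather than consult the printed table, here is a minimal sketch (assuming Python with scipy is available) of how the critical values tα and tα/2 can be obtained:

```python
# Critical values of the Student t distribution (sketch; assumes scipy is installed)
from scipy import stats

alpha = 0.05  # level of significance
df = 9        # degrees of freedom, n - 1

t_two = stats.t.ppf(1 - alpha / 2, df)  # t_{alpha/2}: compare against |t|
t_one = stats.t.ppf(1 - alpha, df)      # t_{alpha}: one tailed cutoff

print(f"two tailed critical value: {t_two:.3f}")  # 2.262
print(f"one tailed critical value: {t_one:.3f}")  # 1.833
```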

Example 9:

Based on field experiments, a new variety of green gram is expected to give a yield of 12 quintals per hectare. The variety was tested on 10 randomly selected farmers’ fields. The yields (quintals per hectare) were recorded as 14.3, 12.6, 13.7, 10.9, 13.7, 12, 11.4, 12, 12.6, and 13.1. Do the results confirm the expectation?

Solution:

Null hypothesis, H0 : μ = 12

Alternate hypothesis, H1 : μ ≠ 12; two tailed test

Sample size (n) = 10

Sample mean, \(\overline{x}\) =\(\frac{\sum_{i = 1}^{n}x_{i}}{n} = (14.3+12.6+...+13.1)/10 = 126.3/10=12.63\)

Sample standard deviation (s) = 1.085306

μ0 = 12

Level of significance, α = 0.05

Calculation of sample mean and sample standard deviation

Sl No. Yield (\(x_{i}\)) \(\left( x_{i} - \overline{x} \right)\) \(\left( x_{i} - \overline{x} \right)^{2}\)
1 14.30 1.67 2.7889
2 12.60 -0.03 0.0009
3 13.70 1.07 1.1449
4 10.90 -1.73 2.9929
5 13.70 1.07 1.1449
6 12.00 -0.63 0.3969
7 11.40 -1.23 1.5129
8 12.00 -0.63 0.3969
9 12.60 -0.03 0.0009
10 13.10 0.47 0.2209
Sum: \(\sum_{i = 1}^{n}x_{i}\) = 126.30; \(\sum_{i = 1}^{n}\left( x_{i} - \overline{x} \right)^{2}\) = 10.6010
Mean: \(\overline{x} = \frac{\sum_{i = 1}^{n}x_{i}}{n}\) = 12.63; \(s^{2} = \frac{\sum_{i = 1}^{n}\left( x_{i} - \overline{x} \right)^{2}}{n - 1}\) = 1.177889
\(s = \sqrt{s^{2}}\) = 1.085306

\[t = \frac{\overline{x} - \mu_{0}}{\frac{s}{\sqrt{n}}}\]

\[t = \frac{12.63 - 12}{\frac{1.085306}{\sqrt{10}}} = \frac{0.63}{0.3432} = 1.836\]

Table value for t corresponding to 5% level of significance and 9 degrees of freedom is 2.262 (two tailed test) – see table 1.1 at the end of this chapter.

Since the calculated value (1.836) is less than the table value (2.262), we do not have enough evidence to reject the null hypothesis. So it can be stated that the mean yield is 12 quintals per hectare, confirming the expectation.
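As a cross-check, the same test can be run in a few lines (a sketch assuming Python with numpy and scipy is available):

```python
# Cross-checking Example 9 (sketch; assumes numpy and scipy are installed)
import numpy as np
from scipy import stats

yields = np.array([14.3, 12.6, 13.7, 10.9, 13.7, 12, 11.4, 12, 12.6, 13.1])

t_stat, p_value = stats.ttest_1samp(yields, popmean=12)  # H0: mu = 12, two tailed
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # t = 1.836, p ~ 0.10 > 0.05: H0 not rejected
```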

Example 10: Try it by yourself

The mean weekly sales of soap bars in departmental stores was 146.3 bars per store. After an advertising campaign, the mean weekly sales in 22 stores for a typical week was 153.7 bars with a standard deviation of 17.2. Was the advertising campaign successful?

14.2 Test for equality of two means

Let there be two normally distributed populations with means µ1 and µ2 and unknown standard deviations. Samples of sizes n1 and n2 are taken from these populations, with sample means \(\overline{x}_{1}\) and \(\overline{x}_{2}\) respectively. We want to test whether the population means differ significantly, based on the sample means.

There are two cases under this situation

  1. Population variances are equal

  2. Population variances are unequal

Before proceeding to the t-test, an F test is performed to test the homogeneity of the population variances (see section 14.6).

14.2.1 Case when the population variances are equal (homogenous)

The null hypothesis to be tested is

H0 : μ1 = μ2

The alternative hypothesis may be either

H1 : μ1 < μ2 (called left tailed alternative)

Or

H1 : μ1 > μ2 (called right tailed alternative)

Or

H1 : μ1 ≠ μ2 (called two tailed alternative)

We will calculate the test statistic \(t\) using the following formula.

\[t = \frac{{\overline{x}}_{1} - {\overline{x}}_{2}}{s\sqrt{\left( \frac{1}{n_{1}} + \frac{1}{n_{2}} \right)}}\]

Where, \(s^{2} = \frac{(n_{1}-1)s_{1}^2+(n_{2}-1)s_{2}^2}{n_{1} + n_{2} - 2}\), \(x_{1i}\) and \(x_{2i}\) are sample observations from population 1 & 2, respectively.

Under the null hypothesis, t follows a t distribution with \(n_{1} + n_{2} - 2\) degrees of freedom. The decision rule is the same as that of the previous t-test (section 14.1.2).
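A minimal sketch of the pooled t statistic, computed directly from the formulas above (assuming Python; the function name pooled_t is ours):

```python
# Pooled (equal variance) two sample t statistic from summary statistics
# (sketch; the function name pooled_t is ours)
import math

def pooled_t(xbar1, xbar2, s1_sq, s2_sq, n1, n2):
    """Return (t, df) for H0: mu1 = mu2 with a pooled variance estimate."""
    s_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)  # pooled variance
    t = (xbar1 - xbar2) / math.sqrt(s_sq * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2
```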

14.2.2 Case when the population variances are unequal

The Welch t-test is an adaptation of Student’s t-test. It is used to compare the means of two groups, when the variances are different.

The null hypothesis to be tested is

H0 : μ1 = μ2

The alternative hypothesis may be either

H1 : μ1 < μ2 (called left tailed alternative)

Or

H1 : μ1 > μ2 (called right tailed alternative)

Or

H1 : μ1 ≠ μ2 (called two tailed alternative)

We will calculate the test statistic \(t\) using the following formula.

\[t = \frac{{\overline{x}}_{1} - {\overline{x}}_{2}}{\sqrt{\left( \frac{s_{1}^{2}}{n_{1}} + \frac{s_{2}^{2}}{n_{2}} \right)}}\]

\(s_{1}\) and \(s_{2}\) are the sample standard deviations of the two samples, respectively.

The degrees of freedom of the Welch t-test are calculated as follows:

\[df = \frac{\left( \frac{s_{1}^{2}}{n_{1}} + \frac{s_{2}^{2}}{n_{2}} \right)^{2}}{\frac{s_{1}^{4}}{n_{1}^{2}\left( n_{1} - 1 \right)} + \frac{s_{2}^{4}}{n_{2}^{2}\left( n_{2} - 1 \right)}}\]

Once the t value is determined, read the critical value of the Student t distribution from the t table at the chosen significance level and the degrees of freedom above. The decision rule is the same as that of the previous t-test (section 14.1.2).
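A minimal sketch of the Welch statistic and its approximate degrees of freedom (assuming Python; the function name welch_t is ours):

```python
# Welch t statistic and Welch-Satterthwaite degrees of freedom
# (sketch; the function name welch_t is ours)
import math

def welch_t(xbar1, xbar2, s1_sq, s2_sq, n1, n2):
    """Return (t, df) for H0: mu1 = mu2 without assuming equal variances."""
    se_sq = s1_sq / n1 + s2_sq / n2
    t = (xbar1 - xbar2) / math.sqrt(se_sq)
    df = se_sq ** 2 / ((s1_sq / n1) ** 2 / (n1 - 1) + (s2_sq / n2) ** 2 / (n2 - 1))
    return t, df  # df is generally fractional; round down for table look-up
```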

Example 11:

In order to compare the effectiveness of two sources of nitrogen, namely ammonium chloride and urea on grain yield of paddy, an experiment was conducted. The results on the grain yield of paddy (kg/plot) under the two treatments are given below.

Ammonium chloride: 13.4, 10.9, 11.2, 11.8, 14, 15.3, 14.2, 12.6, 17, 16.2, 16.5, 15.7

Urea: 12, 11.7, 10.7, 11.2, 14.8, 14.4, 13.9, 13.7, 16.9, 16, 15.6, 16
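One possible way to work this example (a sketch assuming Python with numpy and scipy; the variable names are ours) is to check the homogeneity of variances first and then apply the matching t-test:

```python
# One way to work Example 11 (sketch; assumes numpy and scipy are installed)
import numpy as np
from scipy import stats

acl  = np.array([13.4, 10.9, 11.2, 11.8, 14, 15.3, 14.2, 12.6, 17, 16.2, 16.5, 15.7])
urea = np.array([12, 11.7, 10.7, 11.2, 14.8, 14.4, 13.9, 13.7, 16.9, 16, 15.6, 16])

# F test for homogeneity of variances: larger sample variance in the numerator
s1_sq, s2_sq = acl.var(ddof=1), urea.var(ddof=1)
F = max(s1_sq, s2_sq) / min(s1_sq, s2_sq)
print(f"F = {F:.3f}")

# If the variances are homogeneous, the pooled (equal variance) t-test applies
t_stat, p_value = stats.ttest_ind(acl, urea, equal_var=True)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```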

14.3 Paired t-test

Paired Student’s t-test is used to compare the means of two related samples, that is, when you have a pair of values for the same subjects. For example, suppose 20 cows received a treatment for 3 months, and the question is whether the treatment has an impact on milk yield at the end of the 3 months. The milk yield of the 20 cows is measured before and after the treatment, giving 20 values before treatment and 20 values after. Since the two sets of values being compared are related (one pair of values per cow), a paired t-test is used to test whether there is a significant difference between before and after.

Suppose we have two correlated random samples x1, x2, ..., xn and y1, y2, ..., yn. We want to test whether these population means are significantly different.


The null hypothesis to be tested is

H0 : μ1 = μ2

The alternative hypothesis may be either

H1 : μ1 < μ2 (called left tailed alternative)

Or

H1 : μ1 > μ2 (called right tailed alternative)

Or

H1 : μ1 ≠ μ2 (called two tailed alternative)

We will calculate the test statistic \(t\) using the following formula.

\[t = \frac{\left| \overline{d} \right|}{\frac{s}{\sqrt{n}}}\]

Where \(d_{i} = x_{i} - y_{i}\), \(\overline{d} = \frac{\sum_{i = 1}^{n}d_{i}}{n}\), \(s^{2} = \frac{\sum_{i = 1}^{n}\left( d_{i} - \overline{d} \right)^{2}}{n - 1}\)

Under the null hypothesis, t follows a t distribution with \(n - 1\) degrees of freedom. The decision rule is the same as that of the previous t-test (section 14.1.2).
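A minimal sketch of the paired test (assuming Python with numpy and scipy), illustrated on the soil-treatment data of Example 12 below:

```python
# Paired t-test (sketch; assumes numpy and scipy are installed)
import numpy as np
from scipy import stats

a = np.array([49, 53, 51, 52, 47, 50, 52, 53])  # soil treatment A
b = np.array([52, 55, 52, 53, 50, 54, 54, 53])  # soil treatment B

t_stat, p_value = stats.ttest_rel(a, b)  # tests mean(d) = 0 where d = a - b
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # t = -4.320; |t| matches the hand-worked 4.321 up to rounding
```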

Example 12:

In an experiment, the plots were divided into two equal parts. One part received soil treatment A and the other received soil treatment B. Each plot was planted with sorghum, and the sorghum yield (kg/plot) was recorded as shown below. Test the effectiveness of the soil treatments on sorghum yield.

Soil Treatment A 49 53 51 52 47 50 52 53
Soil Treatment B 52 55 52 53 50 54 54 53

Solution:

Null hypothesis, H0 : μ1 = μ2, i.e. there is no significant difference between the effects of the two soil treatments

Alternate hypothesis, H1 : μ1 ≠ μ2; two tailed test, i.e. there is a significant difference between the effects of the two soil treatments

Level of significance, α = 0.05

\[t = \frac{\left| \overline{d} \right|}{\frac{s}{\sqrt{n}}}\]

Sl No.  \[\mathbf{A}\] \[\mathbf{B}\] \[\mathbf{d}_{\mathbf{i}}\mathbf{= A - B}\] \[\mathbf{d}_{\mathbf{i}}\mathbf{-}\overline{\mathbf{d}}\] \[\left(\mathbf{d}_{\mathbf{i}}\mathbf{-}\overline{\mathbf{d}}\right)^{\mathbf{2}}\]
1 49 52 -3 -1 1
2 53 55 -2 0 0
3 51 52 -1 1 1
4 52 53 -1 1 1
5 47 50 -3 -1 1
6 50 54 -4 -2 4
7 52 54 -2 0 0
8 53 53 0 2 4
Sum: \(\sum_{i = 1}^{8}d_{i}\) = -16; \(\sum_{i = 1}^{8}\left( d_{i} - \overline{d} \right)^{2}\) = 12
\(\overline{d} = \frac{\sum_{i = 1}^{8}d_{i}}{n} = \frac{-16}{8} = -2\); \(s^{2} = \frac{\sum_{i = 1}^{8}\left( d_{i} - \overline{d} \right)^{2}}{n - 1} = \frac{12}{7} = 1.7143\)
\(s = \sqrt{1.7143} = 1.309\)

\[t = \frac{| - 2|}{\frac{1.309}{\sqrt{8}}}\]

\[= \frac{2}{\frac{1.309}{2.828}}\]

\[= \frac{2}{0.4629}\]

\[= 4.321\]

Table value of t for 7 degrees of freedom at 5% level of significance is 2.365

As the calculated value (4.321) is greater than the table value (2.365), we reject the null hypothesis H0 and conclude that there is a significant difference between soil treatments A and B. Soil treatment B increases the yield of sorghum significantly.

Example 13: Try it by yourself

A certain stimulus administered to each of 12 patients resulted in the following increases in blood pressure: 5, 2, 8, -1, 3, 0, -2, 1, 5, 0, 4, 6. Can it be concluded that the stimulus will, in general, be accompanied by an increase in blood pressure? (tip: the differences \(d_{i}\) are given)

14.4 Testing the significance of correlation coefficient

Let there be two normally distributed populations with means µ1 and µ2 and standard deviations σ1 and σ2 respectively, and let ρ be the correlation coefficient between them in the population. We want to test the null hypothesis that the population correlation coefficient is zero (ρ = 0); a t-test can be used for this purpose. If the sample gives enough evidence to reject the null hypothesis, we conclude that there is a significant correlation in the population (ρ ≠ 0).

The null hypothesis to be tested is

H0 : ρ = 0

The alternative hypothesis

H1 : ρ ≠ 0 (two tailed alternative)

\[t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^{2}}}\]

Under the null hypothesis, t follows a t distribution with \(n - 2\) degrees of freedom. We reject the null hypothesis if the calculated value is greater than the table value of t for \(n - 2\) degrees of freedom and level of significance α (in our case α = 0.05).

Example 14:

A coefficient of correlation of 0.2 is derived from a random sample of 625 pairs of observations. Test whether the population correlation coefficient is significant or not.

Solution:

Null hypothesis, H0 : ρ = 0 (Population correlation coefficient is zero)

Alternative hypothesis, H1 : ρ ≠ 0 (Population correlation coefficient is not zero)

Sample correlation coefficient (\(r\)) = 0.2

Number of pairs (n) = 625

\[t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^{2}}}\]

\[= \frac{0.2\sqrt{625 - 2}}{\sqrt{1 - 0.04}}\ = 5.095\]

Since the sample size is large (> 30), the t distribution can be approximated by the z distribution, whose critical value for a two tailed test at the 5% level of significance is 1.96. As the calculated value exceeds 1.96, we reject the null hypothesis and conclude that there is a significant correlation in the population.
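The same calculation as a sketch (assuming Python with scipy is available):

```python
# Significance of a sample correlation coefficient (sketch; assumes scipy is installed)
import math
from scipy import stats

r, n = 0.2, 625  # sample correlation and number of pairs (Example 14)

t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
p = 2 * stats.t.sf(abs(t), df=n - 2)  # two tailed p-value

print(f"t = {t:.3f}, p = {p:.2g}")  # t = 5.095: H0 (rho = 0) is rejected at the 5% level
```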

14.5 Chi square test (χ2)

Chi-square tests are based on the sampling distribution called the chi-square distribution (χ2 distribution). χ2 tests rest on the following assumptions:

  1. The sample observations are independent.

  2. The total frequency should be reasonably large, say, greater than 50.

  3. The theoretical cell frequencies should not be less than 5. If any theoretical cell frequency is less than 5, then for the application of the χ2 test it is pooled with the preceding or succeeding frequency so that the pooled frequency is more than 5, and the degrees of freedom are adjusted for the classes lost in pooling.

  4. Constraints on the cell frequencies should be linear, e.g., ∑Oi = ∑Ei (where O and E represent the observed and expected frequencies).

Note:

The χ2 tests do not make any assumptions regarding the parent population from which the observations are taken. Such tests do not involve any population parameter. Hence these tests are known as non-parametric tests or distribution free tests.

Degrees of freedom in χ2 tests: Degrees of freedom in χ2 tests refers to the number of independent variates which make up the statistic. The degrees of freedom in general is the total number of observations less the number of independent constraints imposed on the observations. For example, if k is the number of independent constraints in a set of data on n observations, then degrees of freedom = n-k.

Three important chi-square tests:

  • Chi-square test for goodness of fit

  • Chi-square test for independence of attributes

  • Chi-square test for a variance.

14.5.1 Chi square test (χ2) for goodness of fit

A very powerful test for testing the significance of the discrepancy between theory and experiment was given by Prof. Karl Pearson in 1900 and is known as the “χ2 test of goodness of fit”.

We want to test the null hypothesis, H0: there is no significant difference between theory and experiment,

against the alternative hypothesis, H1: there is a significant difference between theory and experiment.

If Oi (i=1,2,...,n) is a set of observed frequencies and Ei (i=1,2,...,n) is the corresponding set of expected (theoretical) frequencies, then Karl Pearson’s chi-square test statistic is given by

\[\chi^{2} = \sum_{i = 1}^{n}\frac{\left( O_{i} - E_{i} \right)^{2}}{E_{i}}\]

Here Oi represents the ith observed frequency and Ei represents the corresponding expected frequency according to the assumption regarding the theory behind the data. Under the null hypothesis, the statistic follows a chi-square distribution with n-1 degrees of freedom.

14.5.1.1 Decision rule for goodness of fit test

Let \(\chi_{\text{cal}}^{2}\) be the calculated value, degrees of freedom = n-1, and α the level of significance; then we reject the null hypothesis if \(\chi_{\text{cal}}^{2}\) > \(\chi_{\text{tab}}^{2}\), where \(\chi_{\text{tab}}^{2}\) is the table value of \(\chi^{2}\) for n-1 degrees of freedom and level of significance α. In the case of the \(\chi^{2}\) test, only a one tailed (right tailed) test is used.

Example 15:

In plant genetics, our interest may be to test whether the observed segregation ratios deviate significantly from the Mendelian ratios. In such situations we want to test the agreement between the observed and theoretical frequencies; such a test is called a test of goodness of fit. In a cross between parents of the genetic constitution AAbb and aaBB, the phenotypes in the sample are classified as follows:

AB Ab aB ab Total
87 29 32 12 160

They are expected to occur in a 9: 3: 3: 1 ratio. Do the data agree with the theoretical ratio?

Solution:

Phenotypes AB Ab aB ab Total
Observed (Oi) 87 29 32 12 160.000
Expected (Ei) \(\frac{9}{16}\ \times 160\) =90 \(\frac{3}{16}\ \times 160\) =30 \(\frac{3}{16}\ \times 160\) =30 \(\frac{1}{16}\ \times 160\) =10 160.000
\[\mathbf{O}_{\mathbf{i}}\mathbf{-}\mathbf{E}_{\mathbf{i}}\] -3 -1 2 2 NA
\[\left( \mathbf{O}_{\mathbf{i}}\mathbf{-}\mathbf{E}_{\mathbf{i}} \right)^{\mathbf{2}}\] 9 1 4 4 NA
\[\frac{\left( \mathbf{O}_{\mathbf{i}}\mathbf{-}\mathbf{E}_{\mathbf{i}} \right)^{\mathbf{2}}}{\mathbf{E}_{\mathbf{i}}}\] 0.100 0.033 0.133 0.400 0.667

\[\chi^{2} = \sum_{i = 1}^{n}\frac{\left( O_{i} - E_{i} \right)^{2}}{E_{i}}\]

\[\chi^{2} = 0.667\]

\(\chi_{\text{cal}}^{2}\) = 0.667; the table value of chi-square for 4-1 = 3 degrees of freedom and 5% level of significance is 7.815. Since the calculated value is less than the table value, we do not reject the null hypothesis H0 (there is no significant difference between theory and experiment) and conclude that the data follow the 9:3:3:1 ratio.
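The same calculation as a sketch (assuming Python with numpy and scipy is available); scipy.stats.chisquare uses n - 1 degrees of freedom by default, matching the test above:

```python
# Cross-checking Example 15 (sketch; assumes numpy and scipy are installed)
import numpy as np
from scipy import stats

observed = np.array([87, 29, 32, 12])
expected = 160 * np.array([9, 3, 3, 1]) / 16  # 9:3:3:1 ratio scaled to n = 160

chi2, p_value = stats.chisquare(observed, f_exp=expected)  # df = 4 - 1 = 3
print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}")  # chi2 = 0.667: H0 not rejected
```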

Example 16: Try by yourself

The number of yeast cells counted in a haemocytometer is compared to the theoretical value below. Does the experimental result support the theory?

Yeast per cell Observed Frequency Expected Frequency
0 103 106
1 143 141
2 98 93
3 42 41
4 8 14
5 6 5

14.5.2 Chi square test (χ2) for independence of attributes

The Chi-square test of independence checks whether two attributes are likely to be related or not. For example, chemical treatment and germination can be two attributes. If we want to know whether chemical treatment has any influence on germination, we can use chi-square test. For this purpose, we need the data arranged in the form of a contingency table.

14.5.2.1 Contingency table

A contingency table is a tabular representation of categorical data, consisting of a collection of cells containing counts. It usually shows frequencies for particular combinations of values of two discrete random variables X and Y; each cell in the table represents a mutually exclusive combination of X-Y values.

Example 17: Contingency table

In order to determine the possible effect of a chemical treatment on the rate of germination of cotton seeds, a pot culture experiment was conducted. The results are given below in the form of a contingency table (X = germination, Y = chemical treatment). Attribute X has two classes, X1 = germinated and X2 = not germinated. Attribute Y has two classes, Y1 = treated and Y2 = untreated.

Y \ X Germinated (X1) Not germinated (X2) Total
Treated(Y1) 118 22 140
Untreated(Y2) 120 40 160
Total 238 62 300

Let us consider two attributes A & B, A divided into r classes A1, A2, ..., Ar and B divided into s classes B1, B2, ..., Bs. The various cell frequencies can be expressed in the form of a table (called r × s contingency table) as shown below.

B \ A A1 A2 . . . Ar Total
B1 (A1B1) (A2B1) . . . (ArB1) (B1)
B2 (A1B2) (A2B2) . . . (ArB2) (B2)
. . . . . .
. . . . . .
. . . . . .
Bs (A1Bs) (A2Bs) . . . (ArBs) (Bs)
Total (A1) (A2) . . . (Ar) N

(AiBj) = The number of persons (items) possessing attributes Ai (i =1,2,..., r) and Bj (j =1,2,...,s)

(Ai) = The number of persons (items) possessing attribute Ai ( i =1,2,..., r)

(Bj) = The number of persons (items) possessing attribute Bj (j =1,2,..., s)

\(\sum_{i}\left( A_{i} \right) = \sum_{j}\left( B_{j} \right) = N\), the total frequency.

14.5.2.2 Expected frequencies

The expected frequencies corresponding to each observed frequency (AiBj) are calculated from the formula,

\[E_{\text{ij}} = \frac{\left( A_{i} \right)\left( B_{j} \right)}{N}\]

14.5.2.3 Degrees of freedom

Degrees of freedom for an r × s contingency table = (r – 1)(s – 1)

Test procedure

The null hypothesis to be tested is H0: The two attributes under consideration are independent.

The alternative hypothesis is H1: The two attributes under consideration are not independent.

Test statistic used is

\[\chi^{2} = \sum_{i = 1}^{r}{\sum_{j = 1}^{s}\frac{\left( O_{\text{ij}} - E_{\text{ij}} \right)^{2}}{E_{\text{ij}}}}\]

Where,

\(O_{\text{ij}}\) = observed frequencies

\(E_{\text{ij}}\) = expected frequencies

s = number of rows

r = number of columns

It can be verified that \(\sum_{i = 1}^{r}{\sum_{j = 1}^{s}O_{\text{ij}}} = \sum_{i = 1}^{r}{\sum_{j = 1}^{s}E_{\text{ij}}}\)

Under the null hypothesis, the test statistic follows a chi-square distribution with (r – 1) × (s – 1) degrees of freedom. The decision rule is the same as for the chi-square goodness of fit test (section 14.5.1.1).

Example 18:

In a survey, a random sample of 198 farms was classified into three classes according to tenure status: owned, rented and mixed. The farms were also classified according to the level of soil fertility: high fertile, moderately fertile and low fertile. The results are given below. Test whether tenure status depends on soil fertility.

Tenure Status
Soil fertility Owned Rented Mixed Total
High 40 12 10 62
Moderate 22 10 14 46
Low 22 26 42 90
Total 84 48 66 198

Solution:

The expected value \(E_{\text{ij}}\) for each cell is calculated by multiplying the corresponding row total and column total and dividing by the total frequency of the table above.

Soil fertility Owned Rented Mixed
High \(\frac{62\ \times 84}{198} =26.3\) \(\frac{62\ \times 48}{198} =15.0\) \(\frac{62\ \times 66}{198} =20.7\)
Moderate \(\frac{46\ \times 84}{198} =19.5\) \(\frac{46\ \times 48}{198} =11.2\) \(\frac{46\ \times 66}{198} =15.3\)
Low \(\frac{90\ \times 84}{198} =38.2\) \(\frac{90\ \times 48}{198} =21.8\) \(\frac{90\ \times 66}{198} =30.0\)
\[O_{\text{ij}}\] \[E_{\text{ij}}\] \[O_{\text{ij}} - E_{\text{ij}}\] \[\left(O_{\text{ij}} - E_{\text{ij}} \right)^{2}\] \[\frac{\left( O_{\text{ij}} - E_{\text{ij}} \right)^{2}}{E_{\text{ij}}}\]
40 26.3 13.7 187.6 7.1
12 15.0 -3.0 9.2 0.6
10 20.7 -10.7 113.8 5.5
22 19.5 2.5 6.2 0.3
10 11.2 -1.2 1.3 0.1
14 15.3 -1.3 1.8 0.1
22 38.2 -16.2 261.9 6.9
26 21.8 4.2 17.5 0.8
42 30.0 12.0 144 4.8
NA NA NA \[\chi_{\text{cal}}^{2} =\] 26.3

\(\chi_{\text{cal}}^{2}\)= 26.3, table value of chi-square for (3-1)(3-1) = 4 degrees of freedom and 5% level of significance is 9.488. Since the calculated value is greater than table value, we reject the null hypothesis, and conclude that the two attributes under consideration are not independent.
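The same analysis as a sketch (assuming Python with numpy and scipy is available); chi2_contingency computes the expected frequencies and degrees of freedom automatically:

```python
# Cross-checking Example 18 (sketch; assumes numpy and scipy are installed)
import numpy as np
from scipy import stats

table = np.array([[40, 12, 10],
                  [22, 10, 14],
                  [22, 26, 42]])  # rows: soil fertility; columns: tenure status

chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, df = {dof}, p = {p_value:.4f}")  # chi2 = 26.3 on 4 df
```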

14.5.3 Chi-square test for 2×2 contingency table

2 x 2 contingency table

When the number of rows and the number of columns are both equal to 2, the table is termed a 2 × 2 contingency table, of the form shown in Example 17. Its general form is shown below. Consider two attributes A and B with classes A1, A2 and B1, B2 respectively; a, b, c, d are the frequencies in each cell.

A1 A2 Row Total
B1 a b R1= a+b
B2 c d R2 = c+d
Column Total C1= a+c C2 = b+d n = a+b+c+d

R1, R2 and C1, C2 are row totals and column totals respectively. n is the total number of observations.

In the case of a 2 × 2 contingency table, \(\chi^{2}\) can be found directly using the following shortcut formula.

The null hypothesis to be tested is H0: The two attributes under consideration are independent.

The alternative hypothesis is H1: The two attributes under consideration are not independent.

\[\chi^{2} = \frac{n\left( ad - bc \right)^{2}}{C_{1}C_{2}R_{1}R_{2}}\]

Under the null hypothesis, the test statistic follows a chi-square distribution with (2 – 1) × (2 – 1) = 1 degree of freedom.

14.5.3.1 Yate’s correction for continuity

In a 2 × 2 contingency table, the number of degrees of freedom is (2-1) × (2-1) = 1. If any one of the cell frequencies is less than 5, the pooling method would result in a \(\chi^{2}\) with 0 degrees of freedom (the single degree of freedom is lost in pooling), which is meaningless. In this case we apply a correction due to Yates, usually known as Yates’ correction for continuity. The correction is made by adding 0.5 to the least cell frequency and adjusting the other cell frequencies so that the column and row totals remain the same. The shortcut formula for the test statistic given above is then modified as below.

Test statistic used is

\[\chi^{2} = \frac{{n\left( \left| ad - bc \right| - \frac{n}{2} \right)}^{2}}{C_{1}C_{2}R_{1}R_{2}}\]

Solution to Example 17

H0: The treatment does not improve the germination rate of cotton seeds. (independent)

H1: The chemical treatment improves the germination rate of cotton seeds.

\[\chi^{2} = \frac{{300\left( \left| 118 \times 40 - 22 \times 120 \right| - \frac{300}{2} \right)}^{2}}{238 \times 62 \times 140 \times 160}\]

\[= 3.381\]

\(\chi_{\text{cal}}^{2}\) = 3.381; the table value of chi-square for (2-1) × (2-1) = 1 degree of freedom and 5% level of significance is 3.841. Since the calculated value is less than the table value, we don’t have enough evidence to reject the null hypothesis: the chemical treatment does not improve the germination rate of cotton seeds significantly.
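Because the corrected and uncorrected statistics straddle the 5% critical value here, it is worth computing both; a sketch assuming Python with numpy and scipy (correction=True applies Yates’ correction for 2 × 2 tables):

```python
# Example 17 with and without Yates' correction (sketch; assumes numpy/scipy)
import numpy as np
from scipy import stats

table = np.array([[118, 22],
                  [120, 40]])  # rows: treated/untreated; columns: germinated or not

chi2_c, p_c, _, _ = stats.chi2_contingency(table, correction=True)   # Yates corrected
chi2_u, p_u, _, _ = stats.chi2_contingency(table, correction=False)  # uncorrected
print(f"corrected:   chi2 = {chi2_c:.3f}, p = {p_c:.4f}")  # 3.381: below 3.841
print(f"uncorrected: chi2 = {chi2_u:.3f}, p = {p_u:.4f}")  # 3.927: above 3.841
```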

Example 19: Try it for yourself

In an experiment on the effect of a growth regulator on fruit setting in muskmelon, the following results were obtained. Test whether the fruit setting in muskmelon and the application of growth regulator are independent at 5% level.

Fruit set Fruit not set
Treated 16 9
Control 4 21

14.5.4 Chi-square test for a population variance

Consider a normal population with mean μ and variance \(\sigma^{2}\), where both μ and \(\sigma^{2}\) are unknown. We take a random sample of size n from the population and want to test whether the unknown population variance \(\sigma^{2}\) is equal to some known constant \(\sigma_{0}^{2}\), based on the sample variance.

Null hypothesis H0: \(\sigma^{2} = \sigma_{0}^{2}\)

Against the alternative hypothesis H1: \(\sigma^{2} > \sigma_{0}^{2}\)

The test statistic is

\[\chi^{2} = \frac{(n - 1)s^{2}}{\sigma_{0}^{2}}\]

Where \(s^{2} = \frac{\sum_{i = 1}^{n}\left( x_{i} - \overline{x} \right)^{2}}{n - 1}\) is the sample variance

Under the null hypothesis, the test statistic follows a chi-square distribution with n-1 degrees of freedom. The decision rule is the same as in section 14.5.1.1.
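A minimal sketch of this test on the data of Example 20 below (assuming Python with scipy is available):

```python
# Chi-square test for a population variance, on the data of Example 20 below
# (sketch; assumes scipy is installed)
from scipy import stats

n, s_sq, sigma0_sq, alpha = 11, 0.01719, 0.16, 0.05

chi2_cal = (n - 1) * s_sq / sigma0_sq           # calculated value of the statistic
chi2_tab = stats.chi2.ppf(1 - alpha, df=n - 1)  # right tailed critical value

print(f"chi2 = {chi2_cal:.3f}, critical value = {chi2_tab:.3f}")
# reject H0 only if the calculated value exceeds the critical value
```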

Example 20: Try it for yourself

Test the null hypothesis that \(\sigma^{2}\) = 0.16 against the alternative hypothesis \(\sigma^{2}\) > 0.16, given that \(s^{2}\) = 0.01719 for a random sample of size 11 from a normal population.

14.6 F - test for testing equality of two population variances

Let there be two normally distributed populations with means µ1 and µ2 and variances \(\sigma_{1}^{2}\) and \(\sigma_{2}^{2}\) respectively. Samples of sizes n1 and n2 are taken from these populations. We want to test whether the population variances differ significantly, based on the sample variances.

Null hypothesis H0: \(\sigma_{1}^{2} = \sigma_{2}^{2}\)

Against the alternative hypothesis H1: \(\sigma_{1}^{2} > \sigma_{2}^{2}\)

Test statistic is

\[F = \frac{s_{1}^{2}}{s_{2}^{2}}\]

Under the null hypothesis, the test statistic follows an F distribution with \(n_{1} - 1\) and \(n_{2} - 1\) degrees of freedom.

14.6.1 Decision rule for F - test

If the calculated value is greater than the table value of F at the specified level of significance and the corresponding degrees of freedom (i.e. \(n_{1} - 1\) and \(n_{2} - 1\)), we reject the null hypothesis.

Note:

If \(s_{2}^{2} > s_{1}^{2}\), the test statistic will be

\[F = \frac{s_{2}^{2}}{s_{1}^{2}}\]

Under the null hypothesis, the test statistic follows an F distribution with \(n_{2} - 1\) and \(n_{1} - 1\) degrees of freedom.
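A minimal sketch of the F-test on the data of the next example (assuming Python with scipy; variable names are ours):

```python
# F-test for the equality of two variances, on the data of Example 21 below
# (sketch; assumes scipy is installed)
from scipy import stats

n1, s1_sq = 11, 21.87
n2, s2_sq = 8, 15.36

# Put the larger sample variance in the numerator, as described above
if s1_sq >= s2_sq:
    F, df_num, df_den = s1_sq / s2_sq, n1 - 1, n2 - 1
else:
    F, df_num, df_den = s2_sq / s1_sq, n2 - 1, n1 - 1

F_tab = stats.f.ppf(0.95, df_num, df_den)  # 5% critical value
print(f"F = {F:.3f}, critical value = {F_tab:.3f}")
```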

Example 21: Try it for yourself

For a random sample representing one normal population, we have \(n_{1}\) = 11 and \(s_{1}^{2}\) = 21.87. For another random sample representing the second normal population, we have \(n_{2}\) = 8 and \(s_{2}^{2}\) = 15.36. Test the equality of the variances.


“Like dreams, statistics are a form of wish fulfillment.” – Jean Baudrillard