Confidence Interval for a Population Proportion
In this lesson, you will learn how to estimate the proportion of a population having a specified characteristic.
Now suppose #X# is a binary variable measured on a population in which an unknown proportion #p# of the population meets some condition of interest, with #X = 1# if the subject meets that condition and #X = 0# otherwise. Thus: \[X \sim B(1,p)\]
Let #X_1,\ldots,X_n# denote planned measurements of #X# on a random sample of size #n# from the population. Then: \[S = X_1 + \cdots + X_n \sim B(n,p)\]
Let #\hat{p} = \cfrac{S}{n}#. We saw previously that: \[E(\hat{p}) = p \qquad \text{and} \qquad V(\hat{p}) = \cfrac{p(1 - p)}{n}\]
Moreover, we learned previously that when #n# is large, the Central Limit Theorem implies that #\hat{p}# has an approximate \[N\bigg(p,\cfrac{p(1-p)}{n}\bigg)\] distribution.
Unfortunately, we cannot use this distribution directly to construct a confidence interval (CI) for #p#, because the variance of #\hat{p}# depends on the unknown value of #p#.
But now consider #\tilde{p} = \cfrac{S + 2}{n + 4}#.
It has been found (Agresti and Coull, 1998) that when #n# is large, the distribution of #\tilde{p}# is well-approximated by the \[N\bigg(p,\cfrac{\tilde{p}(1 - \tilde{p})}{n + 4}\bigg)\] distribution.
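To see what the adjustment buys, the following sketch computes the exact probability that an interval contains #p# by summing binomial probabilities over all possible values of #S#. It compares the interval centred at #\tilde{p}# with a plug-in interval centred at #\hat{p}#. The plug-in interval is not part of this lesson and is included here only as an assumption for comparison; the function name `exact_coverage` and the choice #n = 20#, #p = 0.1# are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist


def exact_coverage(n, p, shift, confidence=0.95):
    """Probability that the interval contains p when S ~ B(n, p).

    shift=0: plug-in interval centred at S/n (assumption, for comparison only).
    shift=2: interval centred at (S + 2)/(n + 4), as in this lesson.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # z_{alpha/2}
    m = n + 2 * shift                                   # n or n + 4
    total = 0.0
    for s in range(n + 1):
        centre = (s + shift) / m
        margin = z * sqrt(centre * (1 - centre) / m)
        if centre - margin <= p <= centre + margin:
            # Add P(S = s) for every outcome whose interval covers p.
            total += comb(n, s) * p**s * (1 - p) ** (n - s)
    return total


plug_in = exact_coverage(20, 0.1, shift=0)
adjusted = exact_coverage(20, 0.1, shift=2)
print(f"plug-in: {plug_in:.3f}, adjusted: {adjusted:.3f}")
```

For this particular small sample, the adjusted interval comes out much closer to the nominal #95\%# level than the plug-in interval. Other values of #n# and #p# give different numbers, so this is one illustration rather than a general proof.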
Confidence Interval for a Population Proportion
Suppose #X\sim B(1,p)#.
Let #X_1,\ldots,X_n# denote planned measurements of #X# on a random sample of size #n# from the population. Then: \[S = X_1 + \cdots + X_n \sim B(n,p)\]
Furthermore, let: \[\tilde{p} = \cfrac{S + 2}{n + 4}\]
When #n# is large, an approximate #(1 - \alpha)100\%# confidence interval for the population proportion #p# is:
\[(l,u) = \Bigg(\tilde{p} - z_{\alpha /2}\sqrt{\cfrac{\tilde{p}(1-\tilde{p})}{n + 4}},\,\,\,\,\,\tilde{p} + z_{\alpha /2}\sqrt{\cfrac{\tilde{p}(1-\tilde{p})}{n + 4}}\Bigg)\]
If the lower limit is negative, we replace it with #0#, and if the upper limit is larger than #1#, we replace it with #1#, since #p# must fall into the interval #[0,1]#.
A study in 2008 investigated the use of nicotine patches among HIV-positive smokers. The researchers surveyed a random sample of #444# HIV-positive smokers and found that #170# of them reported using a nicotine patch.
Thus:
\[\tilde{p} = \cfrac{S + 2}{n + 4}=\cfrac{170 + 2}{444 + 4}= \cfrac{172}{448}\approx 0.384\]
The margin of error of a #95\%# CI for the proportion of all HIV-positive smokers who use a nicotine patch is then:
\[z_{\alpha /2}\sqrt{\cfrac{\tilde{p}(1-\tilde{p})}{n + 4}}= 1.96\sqrt{\cfrac{0.384(1 - 0.384)}{448}} \approx 0.045\]
(Here we recalled that #z_{0.05/2} = 1.96#.)
So the #95\%# CI for #p# is then:
\[(l,u) = (0.384 - 0.045,\,\,\,\,\,0.384 + 0.045) = (0.339,\,\,\,\,\,0.429)\]
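The calculation above can be reproduced with a short Python sketch. The function name `agresti_coull_ci` and its parameters are illustrative choices, not part of the lesson, and #z_{\alpha/2}# is taken from the standard library's `NormalDist` rather than rounded to #1.96#.

```python
from math import sqrt
from statistics import NormalDist


def agresti_coull_ci(successes, n, confidence=0.95):
    """Approximate two-sided CI for p based on p~ = (S + 2) / (n + 4)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)          # z_{alpha/2}
    p_tilde = (successes + 2) / (n + 4)
    margin = z * sqrt(p_tilde * (1 - p_tilde) / (n + 4))
    lower = max(0.0, p_tilde - margin)               # replace negative limit by 0
    upper = min(1.0, p_tilde + margin)               # replace limit above 1 by 1
    return p_tilde, margin, lower, upper


# Nicotine patch example: S = 170 out of n = 444.
p_tilde, margin, lower, upper = agresti_coull_ci(170, 444)
print(round(p_tilde, 3), round(margin, 3), round(lower, 3), round(upper, 3))
# prints: 0.384 0.045 0.339 0.429
```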
If you are planning a study and you want a CI for #p# to have a margin of error no larger than some #K > 0# at a given confidence level #1 - \alpha#, you must determine the minimum sample size #n# necessary to achieve this goal.
However, the estimated variance of #\tilde{p}# depends on #\tilde{p}#, which is unknown to you before data are collected. But this variance is maximized when #\tilde{p}(1-\tilde{p})# is maximized, that is, when #\tilde{p} = 0.5#.
Controlling the Margin of Error
To guarantee that the margin of error of a confidence interval for a population proportion will not exceed #K# and ensure at least #(1 - \alpha)100\%# confidence that your CI includes #p#, you can assume #\tilde{p} = 0.5#.
That is
\[z_{\alpha /2}\sqrt{\cfrac{\tilde{p}(1 - \tilde{p})}{n + 4}} \leq z_{\alpha /2}\sqrt{\cfrac{0.5(1 - 0.5)}{n + 4}} = \cfrac{0.5 z_{\alpha /2}}{\sqrt{n + 4}} \leq K\]
requires that
\[n \geq \bigg(\cfrac{0.5 z_{\alpha /2}}{K}\bigg)^2 - 4\]
You would then round this lower bound up to the next integer.
Continuing the example above, if we want the margin of error of the #95\%# confidence interval to be no larger than #0.04#, then we would need to survey a random sample of at least:
\[n \geq \bigg(\cfrac{0.5 z_{\alpha /2}}{K}\bigg)^2 - 4 = \bigg(\cfrac{(0.5)(1.96)}{0.04}\bigg)^2 - 4 = 596.25\] HIV-positive smokers, i.e., at least #597# HIV-positive smokers.
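The sample-size rule can be sketched in Python as follows. The function name `min_sample_size` and the optional `z` argument are illustrative assumptions; passing `z=1.96` mirrors the rounded quantile used in the example, while omitting it uses the unrounded quantile from `NormalDist`.

```python
from math import ceil
from statistics import NormalDist


def min_sample_size(max_margin, confidence=0.95, z=None):
    """Smallest n with 0.5 * z / sqrt(n + 4) <= max_margin (worst case p~ = 0.5)."""
    if z is None:
        z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # z_{alpha/2}
    bound = (0.5 * z / max_margin) ** 2 - 4
    return max(1, ceil(bound))  # round up; a sample needs at least one subject


print(min_sample_size(0.04, z=1.96))  # prints: 597
```

Using the unrounded quantile, `min_sample_size(0.04)`, gives the same answer here, since the bound stays between #596# and #597#.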
One-sided Confidence Intervals for a Population Proportion
It is also possible to construct one-sided confidence intervals for a population proportion.
A lower #(1 - \alpha)100\%# CI for #p# would be:
\[\Bigg(\tilde{p} - z_{\alpha}\sqrt{\cfrac{\tilde{p}(1 - \tilde{p})}{n + 4}},\,\,\,\,\,1\Bigg)\]
where #z_{\alpha}# is the quantile of the #N(0,1)# distribution such that #\mathbb{P}(Z > z_{\alpha}) = \alpha# when #Z \sim N(0,1)#.
An upper #(1 - \alpha)100\%# CI for #p# would be:
\[\Bigg(0,\,\,\,\,\,\tilde{p} + z_{\alpha}\sqrt{\cfrac{\tilde{p}(1 - \tilde{p})}{n + 4}}\Bigg)\]
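As a sketch, both one-sided intervals can be computed with the same ingredients as the two-sided interval, using #z_{\alpha}# in place of #z_{\alpha/2}#. The function name `one_sided_ci` and the `side` parameter are illustrative assumptions, and clipping the computed limit to #[0,1]# is carried over from the two-sided case rather than stated for one-sided intervals in this lesson.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_ci(successes, n, confidence=0.95, side="lower"):
    """One-sided approximate CI for p based on p~ = (S + 2) / (n + 4)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha)              # z_alpha, with P(Z > z_alpha) = alpha
    p_tilde = (successes + 2) / (n + 4)
    margin = z * sqrt(p_tilde * (1 - p_tilde) / (n + 4))
    if side == "lower":
        return (max(0.0, p_tilde - margin), 1.0)     # lower CI: (bound, 1)
    if side == "upper":
        return (0.0, min(1.0, p_tilde + margin))     # upper CI: (0, bound)
    raise ValueError("side must be 'lower' or 'upper'")


# Reusing S = 170 and n = 444 from the nicotine patch example.
lower_ci = one_sided_ci(170, 444, side="lower")
upper_ci = one_sided_ci(170, 444, side="upper")
print(round(lower_ci[0], 3), round(upper_ci[1], 3))
# prints: 0.346 0.422
```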