Sample Size Calculation - Continuous Endpoint

Yanhong Zhou, Ying Yuan, J. Jack Lee, and Haitao Pan

Department of Biostatistics, MD Anderson Cancer Center, Houston, TX 77030


PID: 966 ; v2.1.0.0 ; Last Updated: 01/28/2022




1. Input

Mean for Historical Control (\(\mu_c\)): true mean response for the historical control.

Mean for Treatment (\(\mu_t\)): true mean response for the treatment.

Standard Deviation (\(s\)): standard deviation calculated by \(s=\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_{i}-\bar{x})^2}\), where \(\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_{i}\) and \(x_{i}\) is the observed response for the \(i\)th patient obtained from previous research or literature, \(i=1,\cdots,n\).

Type I Error Rate (\(\alpha\)): false positive rate.

Power \((1-\beta)\): where \(\beta\) is type II error rate (i.e., false negative rate).

2. Example (equality test for a one-group design)

Suppose one is interested in determining whether a new treatment is better than a historical control in terms of mean response. The mean response for the historical control is 0.2. The standard deviation of response is approximately 1. If an increase of 0.3 in the mean response is clinically meaningful, how many subjects are needed to detect the difference with a power of 0.8?

Input: \(\mu_c=0.2, \mu_t=0.2+0.3=0.5, s=1, \alpha=0.05, 1-\beta=0.8\), and assume a one-sided test.

Output:

In a one-sided t-test for one-sample mean, at the significance level of 0.05, a sample size of 71 is needed to achieve 80% power when the mean for the historical control is 0.2 and the mean for the treatment is 0.5.
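As a rough cross-check of this example, the normal-approximation formula \(n=\left((z_{1-\alpha}+z_{1-\beta})\,s/(\mu_t-\mu_c)\right)^2\) can be evaluated directly. The sketch below is only an illustration under that assumption, not the App's own code, and the function name is hypothetical. Because it uses normal rather than t quantiles, it returns 69, slightly below the t-test result of 71 reported above.

```python
import math
from statistics import NormalDist


def one_sample_n_normal(mu_c, mu_t, s, alpha=0.05, power=0.8, two_sided=False):
    """Normal-approximation sample size for a one-sample test of the mean.

    Hypothetical helper for illustration; the App reports a t-test based value.
    """
    z = NormalDist().inv_cdf
    tail_alpha = alpha / 2 if two_sided else alpha
    z_alpha = z(1 - tail_alpha)   # critical value of the test
    z_beta = z(power)             # quantile matching the target power
    raw = ((z_alpha + z_beta) * s / (mu_t - mu_c)) ** 2
    return math.ceil(raw)         # round up to a whole subject


n_approx = one_sample_n_normal(mu_c=0.2, mu_t=0.5, s=1.0)
print(n_approx)
```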

1. Input

Difference in Mean (\(\mu_t-\mu_c\)): \(\mu_t\) and \(\mu_c\) are the true mean responses for the treatment and control, respectively.

Standard Deviation (\(s\)): standard deviation calculated by \(s=\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_{i}-\bar{x})^2}\), where \(\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_{i}\) and \(x_{i}\) is the observed response for the \(i\)th patient obtained from previous research or literature, \(i=1,\cdots,n\).

Subject Allocation Ratio (\(k=n_t/n_c\)): the ratio of number of subjects assigned to treatment to the number of subjects in control where \(n_t, n_c\) are sample size for treatment and control, respectively.

Type I Error Rate (\(\alpha\)): false positive rate.

Power \((1-\beta)\): where \(\beta\) is type II error rate (i.e., false negative rate).

2. Example (equality test for a two-group design)

Suppose one is interested in determining whether a new treatment is better than a standard control in terms of mean response. The mean response for the control is 0.2. The standard deviation of response is approximately 1. If an increase of 0.3 in the mean response is clinically meaningful, how many subjects are needed to detect the difference with a power of 0.8 given equal allocation?

Input: \(\mu_t-\mu_c=0.3, s=1, k=1,\alpha=0.05, 1-\beta=0.8\), and assume a one-sided test.

Output:

In a one-sided t-test for two-sample mean, at the significance level of 0.05, 139 subjects for the treatment group and 139 subjects for the control group are needed to achieve 80% power to detect the mean difference of 0.3 between treatment and control.
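The role of the allocation ratio \(k=n_t/n_c\) can be illustrated with the normal-approximation formula \(n_c=(1+1/k)\left((z_{1-\alpha}+z_{1-\beta})\,s/(\mu_t-\mu_c)\right)^2\), \(n_t=k\,n_c\). This is a sketch under that assumption with a hypothetical function name; it is not the App's implementation. For the inputs above it gives 138 per group, slightly below the t-test result of 139.

```python
import math
from statistics import NormalDist


def two_sample_n_normal(diff, s, k=1.0, alpha=0.05, power=0.8):
    """Normal-approximation group sizes for a one-sided two-sample test.

    diff is mu_t - mu_c and k is n_t / n_c. Returns (n_t, n_c).
    Hypothetical helper for illustration only.
    """
    z = NormalDist().inv_cdf
    base = ((z(1 - alpha) + z(power)) * s / diff) ** 2
    n_c = math.ceil((1 + 1 / k) * base)   # control group size
    n_t = math.ceil(k * n_c)              # treatment group size from the ratio
    return n_t, n_c


print(two_sample_n_normal(diff=0.3, s=1.0, k=1.0))
```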

1. Input

Equivalence Limit (\(\delta>0\)): \(\delta\) is the length of the margin, which is called (1) the equivalence limit in an equivalence test, (2) the noninferiority margin in a noninferiority test, and (3) the superiority margin in a superiority test. The difference among the three types of tests is shown intuitively in the following Figure.

Mean for Historical Control (\(\mu_c\)): true mean response for control treatment.

Mean for Treatment (\(\mu_t\)): true mean response for experimental treatment.

Standard Deviation (\(s\)): standard deviation calculated by \(s=\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_{i}-\bar{x})^2}\), where \(\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_{i}\) and \(x_{i}\) is the observed response for the \(i\)th patient obtained from previous research or literature, \(i=1,\cdots,n\).

Type I Error Rate (\(\alpha\)): false positive rate.

Power \((1-\beta)\): where \(\beta\) is type II error rate (i.e., false negative rate).

2. Example (equivalence test for a one-group design)

Suppose an investigator is interested in determining whether a new treatment is equivalent to a historical control in terms of mean response. The mean response for the historical control is 0.2. The standard deviation of response is approximately 0.1. The equivalence limit is 0.2. The investigator believes that the new treatment has a mean response of 0.35. How many subjects are needed to have a power of 0.8 to determine that the new treatment is equivalent to the historical control?

Input: \(\delta=0.2, \mu_c=0.2, \mu_t=0.35, s=0.1, \alpha=0.05, 1-\beta=0.8\).

Output:

At the significance level of 0.05, with an equivalence limit of 0.2, a sample size of 27 is required to achieve 80% power when the absolute mean difference between treatment and the historical control is 0.15.
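One way to see what drives this number is to approximate the power of the two one-sided tests (TOST) procedure with normal quantiles and search for the smallest \(n\) that reaches the target. The sketch below assumes the approximation \(\text{power}\approx\Phi\big((\delta-\epsilon)\sqrt{n}/s-z_{1-\alpha}\big)+\Phi\big((\delta+\epsilon)\sqrt{n}/s-z_{1-\alpha}\big)-1\) with \(\epsilon=\mu_t-\mu_c\); the function names are hypothetical and this is not necessarily the method used by the App. It returns 25 for the inputs above, a little below the reported 27.

```python
from statistics import NormalDist


def tost_power_one_sample(n, delta, mu_c, mu_t, s, alpha=0.05):
    """Approximate TOST power for a one-sample equivalence test (normal quantiles)."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha)
    eps = mu_t - mu_c
    root_n = n ** 0.5
    upper = nd.cdf((delta - eps) * root_n / s - z_alpha)  # test against +delta
    lower = nd.cdf((delta + eps) * root_n / s - z_alpha)  # test against -delta
    return max(0.0, upper + lower - 1.0)


def smallest_equivalence_n(delta, mu_c, mu_t, s, alpha=0.05, power=0.8, n_max=100000):
    """Smallest n whose approximate TOST power reaches the target."""
    if abs(mu_t - mu_c) >= delta:
        raise ValueError("|mu_t - mu_c| must be less than the equivalence limit")
    for n in range(2, n_max + 1):
        if tost_power_one_sample(n, delta, mu_c, mu_t, s, alpha) >= power:
            return n
    raise ValueError("target power not reached within n_max")


print(smallest_equivalence_n(delta=0.2, mu_c=0.2, mu_t=0.35, s=0.1))
```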

1. Input

Equivalence Limit (\(\delta>0\)): \(\delta\) is the length of the margin, which is called (1) the equivalence limit in an equivalence test, (2) the noninferiority margin in a noninferiority test, and (3) the superiority margin in a superiority test. The difference among the three types of tests is shown intuitively in the following Figure.

Difference in Mean (\(\mu_t-\mu_c\)): \(\mu_t\) and \(\mu_c\) are the true mean responses for the treatment and control, respectively.

Standard Deviation (\(s\)): standard deviation calculated by \(s=\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_{i}-\bar{x})^2}\), where \(\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_{i}\) and \(x_{i}\) is the observed response for the \(i\)th patient obtained from previous research or literature, \(i=1,\cdots,n\).

Subject Allocation Ratio (\(k=n_t/n_c\)): the ratio of number of subjects assigned to treatment to the number of subjects in control where \(n_t, n_c\) are sample size for treatment and control, respectively.

Type I Error Rate (\(\alpha\)): false positive rate.

Power \((1-\beta)\): where \(\beta\) is type II error rate (i.e., false negative rate).

2. Example (equivalence test for a two-group design)

Suppose an investigator is interested in determining whether a new treatment is equivalent to a standard control in terms of mean response. The mean response for the control is 0.2. The standard deviation of response is approximately 0.1. The equivalence limit is 0.2. The investigator believes that the new treatment has a mean response of 0.35. How many subjects are needed to have a power of 0.8 to determine that the new treatment is equivalent to the control given equal patient allocation?

Input: \(\delta=0.2, \mu_t-\mu_c=0.35-0.2=0.15, s=0.1, k=1, \alpha=0.05, 1-\beta=0.8\).

Output:

At the significance level of 0.05, with an equivalence limit of 0.2, 51 subjects for the treatment group and 51 subjects for the control group are needed to achieve 80% power when the mean response difference between treatment and control is 0.15.

1. Input

Noninferiority margin (\(\delta>0\)): \(\delta\) is the length of the margin, which is called (1) the equivalence limit in an equivalence test, (2) the noninferiority margin in a noninferiority test, and (3) the superiority margin in a superiority test. The difference among the three types of tests is shown intuitively in the following Figure.

Mean for Historical Control (\(\mu_c\)): true mean response for control treatment.

Mean for Treatment (\(\mu_t\)): true mean response for experimental treatment.

Standard Deviation (\(s\)): standard deviation calculated by \(s=\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_{i}-\bar{x})^2}\), where \(\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_{i}\) and \(x_{i}\) is the observed response for the \(i\)th patient obtained from previous research or literature, \(i=1,\cdots,n\).

Type I Error Rate (\(\alpha\)): false positive rate.

Power \((1-\beta)\): where \(\beta\) is type II error rate (i.e., false negative rate).

2. Example (noninferiority test for a one-group design)

Suppose an investigator is interested in claiming that a new treatment is not worse than a historical control in terms of mean response. The mean response for the historical control is 0.3. The standard deviation of response is approximately 1. The noninferiority margin is 0.2. The investigator believes that the new treatment has a mean response of 0.2. How many subjects are needed to have a power of 0.8 to claim that the new treatment is not worse than the historical control?

Input: \(\delta=0.2, \mu_c=0.3, \mu_t=0.2, s=1, \alpha=0.05, 1-\beta=0.8\).

Output:

At the significance level of 0.05, with a noninferiority margin of 0.2, a sample size of 620 is required to achieve 80% power when the mean difference between treatment and the historical control is -0.1.
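In a noninferiority test the margin effectively shifts the difference being detected: the relevant distance is \((\mu_t-\mu_c)+\delta\), here \(-0.1+0.2=0.1\). A minimal sketch, assuming the normal-approximation formula \(n=\left((z_{1-\alpha}+z_{1-\beta})\,s/(\mu_t-\mu_c+\delta)\right)^2\) and a hypothetical function name, is given below. It returns 619, just under the reported value of 620, and is not the App's own code.

```python
import math
from statistics import NormalDist


def noninferiority_n_one_sample(delta, mu_c, mu_t, s, alpha=0.05, power=0.8):
    """Normal-approximation sample size for a one-sample noninferiority test."""
    shifted = (mu_t - mu_c) + delta   # distance from the noninferiority boundary
    if shifted <= 0:
        raise ValueError("mu_t - mu_c must be greater than -delta")
    z = NormalDist().inv_cdf
    raw = ((z(1 - alpha) + z(power)) * s / shifted) ** 2
    return math.ceil(raw)


print(noninferiority_n_one_sample(delta=0.2, mu_c=0.3, mu_t=0.2, s=1.0))
```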

1. Input

Noninferiority margin (\(\delta>0\)): \(\delta\) is the length of the margin, which is called (1) the equivalence limit in an equivalence test, (2) the noninferiority margin in a noninferiority test, and (3) the superiority margin in a superiority test. The difference among the three types of tests is shown intuitively in the following Figure.

Difference in Mean (\(\mu_t-\mu_c\)): \(\mu_t\) and \(\mu_c\) are the true mean responses for the treatment and control, respectively.

Standard Deviation (\(s\)): standard deviation calculated by \(s=\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_{i}-\bar{x})^2}\), where \(\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_{i}\) and \(x_{i}\) is the observed response for the \(i\)th patient obtained from previous research or literature, \(i=1,\cdots,n\).

Subject Allocation Ratio (\(k=n_t/n_c\)): the ratio of number of subjects assigned to treatment to the number of subjects in control where \(n_t, n_c\) are sample size for treatment and control, respectively.

Type I Error Rate (\(\alpha\)): false positive rate.

Power \((1-\beta)\): where \(\beta\) is type II error rate (i.e., false negative rate).

2. Example (noninferiority test for a two-group design)

Suppose an investigator is interested in claiming that a new treatment is not worse than a standard control in terms of mean response. The mean response for the control is 0.4. The standard deviation of response is approximately 0.1. The noninferiority margin is 0.2. The investigator believes that the new treatment has a mean response of 0.3. How many subjects are needed to have a power of 0.8 to determine that the new treatment is not worse than the control given equal patient allocation?

Input: \(\delta=0.2, \mu_t-\mu_c=0.3-0.4=-0.1, s=0.1, k=1, \alpha=0.05, 1-\beta=0.8\).

Output:

At the significance level of 0.05, with a noninferiority margin of 0.2, 14 subjects for treatment group and 14 subjects for control group are needed to achieve 80% power when the mean response difference between treatment and control is -0.1.

1. Input

Superiority margin (\(\delta>0\)): \(\delta\) is the length of the margin, which is called (1) the equivalence limit in an equivalence test, (2) the noninferiority margin in a noninferiority test, and (3) the superiority margin in a superiority test. The difference among the three types of tests is shown intuitively in the following Figure.

Mean for Historical Control (\(\mu_c\)): true mean response for control treatment.

Mean for Treatment (\(\mu_t\)): true mean response for experimental treatment.

Standard Deviation (\(s\)): standard deviation calculated by \(s=\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_{i}-\bar{x})^2}\), where \(\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_{i}\) and \(x_{i}\) is the observed response for the \(i\)th patient obtained from previous research or literature, \(i=1,\cdots,n\).

Type I Error Rate (\(\alpha\)): false positive rate.

Power \((1-\beta)\): where \(\beta\) is type II error rate (i.e., false negative rate).

2. Example (superiority test for a one-group design)

Suppose an investigator is interested in determining whether a new treatment is superior to a historical control in terms of mean response. The mean response for the historical control is 0.3. The standard deviation of response is approximately 0.5. The superiority margin is 0.15. The investigator believes that the new treatment has a mean response of 0.5. How many subjects are needed to have a power of 0.8 to claim that the new treatment is superior to the historical control?

Input: \(\delta=0.15, \mu_c=0.3, \mu_t=0.5,s=0.5, \alpha=0.05, 1-\beta=0.8\).

Output:

At the significance level of 0.05, with a superiority margin of 0.15, a sample size of 620 is required to achieve 80% power when the mean difference between treatment and the historical control is 0.2.
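A reported sample size can also be sanity-checked in the other direction, by computing the approximate power it yields. The sketch below assumes the normal approximation \(\text{power}\approx\Phi\big((\mu_t-\mu_c-\delta)\sqrt{n}/s-z_{1-\alpha}\big)\); the function name is hypothetical and the App's t-based computation may differ slightly. With \(n=620\) the approximate power comes out just above 0.80.

```python
from statistics import NormalDist


def superiority_power_one_sample(n, delta, mu_c, mu_t, s, alpha=0.05):
    """Approximate power of a one-sample superiority test (normal quantiles)."""
    nd = NormalDist()
    shifted = (mu_t - mu_c) - delta   # distance beyond the superiority margin
    if shifted <= 0:
        raise ValueError("mu_t - mu_c must be greater than the superiority margin")
    return nd.cdf(shifted * n ** 0.5 / s - nd.inv_cdf(1 - alpha))


power_at_620 = superiority_power_one_sample(620, delta=0.15, mu_c=0.3, mu_t=0.5, s=0.5)
print(round(power_at_620, 3))
```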

1. Input

Superiority margin (\(\delta>0\)): \(\delta\) is the length of the margin, which is called (1) the equivalence limit in an equivalence test, (2) the noninferiority margin in a noninferiority test, and (3) the superiority margin in a superiority test. The difference among the three types of tests is shown intuitively in the following Figure.

Difference in Mean (\(\mu_t-\mu_c\)): \(\mu_t\) and \(\mu_c\) are the true mean responses for the treatment and control, respectively.

Standard Deviation (\(s\)): standard deviation calculated by \(s=\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_{i}-\bar{x})^2}\), where \(\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_{i}\) and \(x_{i}\) is the observed response for the \(i\)th patient obtained from previous research or literature, \(i=1,\cdots,n\).

Subject Allocation Ratio (\(k=n_t/n_c\)): the ratio of number of subjects assigned to treatment to the number of subjects in control where \(n_t, n_c\) are sample size for treatment and control, respectively.

Type I Error Rate (\(\alpha\)): false positive rate.

Power \((1-\beta)\): where \(\beta\) is type II error rate (i.e., false negative rate).

2. Example (superiority test for a two-group design)

Suppose an investigator is interested in determining whether a new treatment is superior to a standard control in terms of mean response. The mean response for the control is 0.3. The standard deviation of response is approximately 0.1. The superiority margin is 0.25. The investigator believes that the new treatment has a mean response of 0.6. How many subjects are needed to have a power of 0.8 to determine that the new treatment is superior to the control given equal patient allocation?

Input: \(\delta=0.25, \mu_t-\mu_c=0.6-0.3=0.3, s=0.1, k=1, \alpha=0.05, 1-\beta=0.8\).

Output:

At the significance level of 0.05, with a superiority margin of 0.25, 51 subjects for treatment group and 51 subjects for control group are needed to achieve 80% power when the mean response difference between treatment and control is 0.3.

1. Input

Correlation Coefficient \((r)\) under Alternative Hypothesis: the correlation coefficient expected.

Type I Error Rate \((\alpha)\): false positive rate.

Power \((1-\beta)\): where \(\beta\) is type II error rate (i.e., false negative rate).

2. Example

Suppose that an investigator is interested in testing whether two groups are correlated in terms of their outcome values. The null hypothesis is that there is no correlation between the two groups, while the alternative is that the correlation is 0.4. At the significance level of 0.05, how many subjects are required to have a power of 0.8 for the test?

Input: \(r=0.4,\alpha=0.05, 1-\beta=0.8\), assume a two-sided test.

Output:


Result based on t-test:

In a two-sided t-test, at the significance level of 0.05, a sample size of 46 is needed to achieve 80% power when the correlation coefficient under the alternative is 0.4.


Result based on z-test:

In a two-sided z-test, at the significance level of 0.05, a sample size of 47 is needed to achieve 80% power when the correlation coefficient under the alternative is 0.4.
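The z-test figure can be reproduced with the widely used Fisher z-transformation approximation, \(n=\left((z_{1-\alpha/2}+z_{1-\beta})/\operatorname{arctanh}(r)\right)^2+3\). The sketch below assumes this formula (whether the App uses exactly this expression is not stated here) and uses a hypothetical function name; for \(r=0.4\) it returns 47.

```python
import math
from statistics import NormalDist


def correlation_n_fisher_z(r, alpha=0.05, power=0.8, two_sided=True):
    """Sample size to detect correlation r against zero via Fisher's z."""
    if not -1 < r < 1 or r == 0:
        raise ValueError("r must be nonzero and strictly between -1 and 1")
    z = NormalDist().inv_cdf
    tail_alpha = alpha / 2 if two_sided else alpha
    fisher_z = math.atanh(r)   # 0.5 * ln((1 + r) / (1 - r))
    raw = ((z(1 - tail_alpha) + z(power)) / fisher_z) ** 2 + 3
    return math.ceil(raw)


print(correlation_n_fisher_z(0.4))
```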

1. Input

Number of Groups \((m)\): the number of experimental groups considered.

Effect size \((f)\): defined as

\(f=\sqrt{\frac{\sigma^2_m}{\sigma^2}}=\frac{\sigma_m}{\sigma}\). Enter this value directly or calculate it using the App. Details for \(f\) are available in Document.

\(\sigma^2\): the variance of the outcome values within the populations (i.e., common variance for the groups).

\(\sigma^2_m\): the variance of the \(m\) true means, calculated by \(\sum_{i=1}^{m}(\mu_i-\bar{\mu})^2/m\), where \(\bar{\mu}=\sum_{i=1}^{m}\mu_i/m\) and \(\mu_i\), \(i=1,\cdots, m\), is the true mean response for the \(i\)th group.

Type I Error Rate \((\alpha)\): false positive rate.

Power \((1-\beta)\): where \(\beta\) is type II error rate (i.e., false negative rate).
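The effect size \(f=\sigma_m/\sigma\) defined above can be computed from the group means and the common standard deviation. The following is a small sketch; the group means and \(\sigma\) are hypothetical values chosen only so that \(f=0.25\), and the function name is an assumption. Note that \(\sigma^2_m\) divides by \(m\), so the population standard deviation is used.

```python
from statistics import pstdev


def anova_effect_size(group_means, sigma):
    """Effect size f = sigma_m / sigma for a one-way ANOVA.

    sigma_m is the population SD of the group means (divides by m, not m - 1).
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    sigma_m = pstdev(group_means)
    return sigma_m / sigma


# Hypothetical inputs: four group means and a common SD of 4.
f = anova_effect_size([1.0, 1.0, 3.0, 3.0], sigma=4.0)
print(f)
```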

2. Example

Suppose that an investigator is interested in conducting a four-arm (\(m=4\)) parallel, double-blinded, equal-randomized clinical trial to compare four treatments. The effect size is assumed to be 0.25. At the significance level of 0.05, how many subjects are required to have a power of 0.8 for the investigation?

Input: \(m=4, f=0.25, \alpha=0.05, 1-\beta=0.8\).

Output:

In a one-way ANOVA test for a 4-arm design, at the significance level of 0.05, 45 subjects per group are needed to achieve 80% power to detect the effect size of 0.25.

1. Input

Effect size \((\Delta_d)\): effect size defined as \(\frac{|\mu_1-\mu_2|}{\sigma_d}\), where

\(\mu_1\) is pre-study mean;

\(\mu_2\) is post-study mean;

\(\sigma_d\) is the standard deviation of pre-post difference within each subject.

Type I Error Rate (\(\alpha\)): false positive rate.

Power \((1-\beta)\): where \(\beta\) is type II error rate (i.e., false negative rate).

2. Example (paired t-test)

Suppose one is interested in determining the effect of an experimental treatment. The pre-study mean is known to be 0.3, the post-study mean is assumed to be 0.5, and the standard deviation of the pre-post difference is 0.5. In a two-sided test, how many subjects are required to achieve 80% power at the significance level of 0.05?

Input: We know that \(\mu_1=0.3, \mu_2=0.5, \sigma_d=0.5, \alpha=0.05, 1-\beta=0.8\), and the test is two-sided. We can select "Calculate effect size \(\Delta_d=|\mu_1-\mu_2|/\sigma_d\)" and type in the known values. Alternatively, we can calculate the effect size \(|0.3-0.5|/0.5=0.4\) by hand and enter it directly.

Output:

If effect size is entered directly:

In a two-sided paired t-test, at the significance level of 0.05, 52 subjects are needed to achieve 80% power to detect the effect size of 0.4.


If effect size is calculated by the App:

In a two-sided paired t-test, at the significance level of 0.05, 52 subjects are needed to achieve 80% power to detect the effect size of 0.4 which is calculated given the pre-study mean 0.3, post-study mean 0.5 and a standard deviation of the mean difference: 0.5.
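The effect size and a quick approximate sample size for the paired design can be computed as follows. This is a sketch assuming the normal-approximation formula \(n=\left((z_{1-\alpha/2}+z_{1-\beta})/\Delta_d\right)^2\), with hypothetical function names; it returns 50, whereas the paired t-test result reported above is 52, since t quantiles require a few more subjects.

```python
import math
from statistics import NormalDist


def paired_effect_size(mu_1, mu_2, sigma_d):
    """Effect size |mu_1 - mu_2| / sigma_d for pre/post measurements."""
    return abs(mu_1 - mu_2) / sigma_d


def paired_n_normal(effect_size, alpha=0.05, power=0.8, two_sided=True):
    """Normal-approximation number of subjects (pairs) for a paired test."""
    z = NormalDist().inv_cdf
    tail_alpha = alpha / 2 if two_sided else alpha
    raw = ((z(1 - tail_alpha) + z(power)) / effect_size) ** 2
    return math.ceil(raw)


delta_d = paired_effect_size(mu_1=0.3, mu_2=0.5, sigma_d=0.5)
print(delta_d, paired_n_normal(delta_d))
```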