Supplemental Materials: An Ethical Approach to Peeking at Data

Brad J. Sagarin, James K. Ambler, and Ellen M. Lee
Northern Illinois University

paugmented

paugmented represents the Type I error rate implied by the augmentation decisions and observed outcomes (both interim and final) of a study. Consider, for example, a study in which a researcher runs 100 subjects (N1 = 100), producing a p-value of .15 (p1 = .15). The researcher evaluates this against the usual .05 criterion for significance (pcrit = .05). Because the results are non-significant, the researcher augments the dataset with another 50 subjects (N2 = 50), producing a p-value in the combined dataset of .09 (p12 = .09). Because the results are only marginally significant, the researcher then augments the dataset a second time with another 50 subjects (N3 = 50), producing a p-value in the combined dataset of .03 (p123 = .03). The results are now significant, so the researcher stops data collection.

As discussed in the main article, paugmented represents a range of Type I error rates. The lower bound of paugmented represents the best-case scenario, in which pmax (the largest p-value at which the researcher would have continued collecting data) equals the maximum p-value actually observed on any of the interim steps; thus, in this case, pmax = .15 for the lower bound of paugmented. The upper bound of paugmented represents the worst-case scenario, in which pmax = 1. Because a Type I error rate represents the probability of a significant result given that the null hypothesis is true, the distributions of p-values for N1, N2, and N3 used to calculate paugmented are uniform distributions.
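The scenario above can be checked with a simple Monte Carlo simulation. This is our illustrative sketch, not the authors' algorithm: under the null hypothesis, each batch of subjects contributes an independent standard-normal Z-statistic, and the combined sample's Z-score follows the square-root-of-N weighted combination used throughout this supplement. The function name augmented_type1_rate and the assumption of a one-tailed test are ours.

```python
# Monte Carlo estimate of the lower bound of paugmented for the running example:
# N1 = 100, N2 = 50, N3 = 50, pcrit = .05, pmax = .15, final criterion .03.
# Assumes a one-tailed test (illustrative sketch only).
import math
import random
from statistics import NormalDist

ND = NormalDist()  # standard normal: cdf() plays the role of pnorm


def augmented_type1_rate(n_sims=100_000, seed=1):
    rng = random.Random(seed)
    n1, n2, n3 = 100, 50, 50
    significant = 0
    for _ in range(n_sims):
        z1 = rng.gauss(0, 1)                       # Z for the initial N1 subjects
        p1 = 1 - ND.cdf(z1)                        # one-tailed p-value
        if p1 < .05:                               # significant at the first look
            significant += 1
            continue
        if p1 >= .15:                              # above pmax: stop, non-significant
            continue
        z2 = rng.gauss(0, 1)                       # Z for the N2 batch
        z12 = (math.sqrt(n1) * z1 + math.sqrt(n2) * z2) / math.sqrt(n1 + n2)
        p12 = 1 - ND.cdf(z12)
        if p12 < .05:
            significant += 1
            continue
        if p12 >= .15:
            continue
        z3 = rng.gauss(0, 1)                       # Z for the N3 batch
        z123 = (math.sqrt(n1 + n2) * z12 + math.sqrt(n3) * z3) / math.sqrt(n1 + n2 + n3)
        if 1 - ND.cdf(z123) < .03:                 # final criterion: the observed .03
            significant += 1
    return significant / n_sims
```

Because the significant outcomes include every run with p1 < .05 plus the runs rescued by augmentation, the simulated rate necessarily exceeds the nominal .05.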
For each round of data collection (the initial round: N1, the first round of augmentation: N2, the second round of augmentation: N3), we divide the distribution of possible p-values into 10,000 slices of width .0001, with each slice represented by a p-value from .00005 to .99995 (dividing the distribution into 100,000 slices of width .00001 or 1,000,000 slices of width .000001 does not change the results appreciably). A particular combination of slices from N1, N2, and N3 would then produce a significant result within the lower bound if (a) the p-value in the initial N1 participants < .05, or (b) the p-value in the initial N1 participants < .15 and the p-value in the combined N1+N2 participants < .05, or (c) the p-value in the combined N1+N2 participants < .15 and the p-value in the combined N1+N2+N3 participants < .03. As can be seen in (a) and (b), the criterion for significance is pcrit (.05, in this case) for the initial round of data collection and for all but the last round of augmentation. As can be seen in (c), the criterion for significance is the final observed p-value (.03, in this case) for the last round of augmentation. As can be seen in (b) and (c), for the lower bound, pmax = .15 (the highest interim p-value). For the upper bound, pmax = 1. Therefore, a particular combination would produce significant results within the upper bound if (a) the p-value in the initial N1 participants < .05, or (b) the p-value in the combined N1+N2 participants < .05, or (c) the p-value in the combined N1+N2+N3 participants < .03.

The p-values for the combined subsamples (N1+N2, N1+N2+N3) are calculated using the equation for the weighted Z-method (Whitlock, 2005):

Z_w = \frac{\sum_{i=1}^{k} w_i Z_i}{\sqrt{\sum_{i=1}^{k} w_i^2}}

in which Zi is the Z-score associated with the p-value for Ni and wi = √Ni.

The most straightforward method to calculate paugmented would be to test all combinations of possible p-values for each of the rounds of data collection.
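The weighted Z-method combination can be sketched in a few lines. This is a minimal illustration with weights wi = √Ni, matching the combination used in the text; the function name combined_p and the use of Python's statistics.NormalDist in place of R's qnorm/pnorm are ours.

```python
# Weighted Z-method (Whitlock, 2005) for one-tailed p-values, with w_i = sqrt(N_i).
import math
from statistics import NormalDist

ND = NormalDist()  # standard normal: inv_cdf() is qnorm, cdf() is pnorm


def combined_p(p_values, ns):
    """Combine one-tailed p-values from independent subsamples of sizes ns."""
    zs = [ND.inv_cdf(1 - p) for p in p_values]   # Z_i = qnorm(1 - p_i)
    ws = [math.sqrt(n) for n in ns]              # w_i = sqrt(N_i)
    z_w = sum(w * z for w, z in zip(ws, zs)) / math.sqrt(sum(w * w for w in ws))
    return 1 - ND.cdf(z_w)                       # p-value of the combined sample
```

As a sanity check, combining p = .5 with p = .5 returns .5 (both Z-scores are 0), and combining two equally small p-values yields something smaller than either.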
The proportion of combinations that produce significant results would represent paugmented. With 10,000 slices, there are 10,000^3 combinations of possible p-values for N1, N2, and N3. Because the number of combinations is an exponential function of the number of rounds of data collection, the calculations become extremely lengthy with more than two rounds of augmentation. To mitigate this, we made two changes to the algorithm to increase its efficiency: (a) when incorporating a new round of dataset augmentation, we combined earlier rounds of data collection into a single distribution, thus changing the algorithm from an exponential to a linear function of the number of rounds, and (b) when incorporating the final round of dataset augmentation, we used the cumulative distribution function of the normal distribution to avoid the need to explicitly combine slices from the earlier rounds of data collection with slices from the final round. Instead, for each slice from the earlier rounds of data collection, the proportion of the distribution from the final round of dataset augmentation that would produce significant results is calculated directly. Below, we derive the formulas for making this latter calculation.

The formulas are presented in a general form in which Nprev is the sample prior to the last round of augmentation, Nnew is the supplemental sample added as the last round of augmentation, and pcrit is the criterion for significance. From the example above (the study in which the researcher runs 100 subjects and then augments the dataset twice with 50 additional subjects each time), these formulas would be applied for the second round of dataset augmentation.
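The two efficiency changes can be sketched as follows. This is our reconstruction in Python, not the authors' code, and it assumes a one-tailed test and a reduced slice count; it computes the lower bound of paugmented for the running example, collapsing earlier rounds into a single weighted distribution of combined Z-scores (change a) and handling the final round through the normal CDF (change b), using the closed-form slice value derived later in this supplement.

```python
# Sliced calculation of the lower bound of paugmented for the running example:
# N1 = 100, N2 = 50, N3 = 50, pcrit = .05, pmax = .15, final criterion .03.
import math
from statistics import NormalDist

ND = NormalDist()  # standard normal: inv_cdf() is qnorm, cdf() is pnorm


def p_augmented_lower(n_slices=1000):
    n1, n2, n3 = 100, 50, 50
    p_crit, p_max, p_final = .05, .15, .03
    w = 1.0 / n_slices                               # probability mass per slice
    slice_ps = [(i + 0.5) / n_slices for i in range(n_slices)]
    slice_zs = [ND.inv_cdf(1 - p) for p in slice_ps]

    significant = 0.0
    # Round 1: significant slices stop here; slices with pcrit <= p < pmax continue.
    continuing = []                                  # (combined Z so far, mass)
    for p, z in zip(slice_ps, slice_zs):
        if p < p_crit:
            significant += w
        elif p < p_max:
            continuing.append((z, w))

    # Round 2: combine the collapsed earlier distribution with the N2 slices
    # (change (a): cost grows linearly, not exponentially, with rounds).
    next_continuing = []
    for z_prev, mass in continuing:
        for z_new in slice_zs:
            z12 = (math.sqrt(n1) * z_prev + math.sqrt(n2) * z_new) / math.sqrt(n1 + n2)
            p12 = 1 - ND.cdf(z12)
            if p12 < p_crit:
                significant += mass * w
            elif p12 < p_max:
                next_continuing.append((z12, mass * w))

    # Final round: no explicit third loop (change (b)); for each surviving slice,
    # the proportion of the N3 distribution reaching significance comes directly
    # from the normal CDF.
    n_prev, n_new = n1 + n2, n3
    z_crit = ND.inv_cdf(1 - p_final)
    for z_prev, mass in next_continuing:
        z_needed = (z_crit * math.sqrt(n_prev + n_new)
                    - math.sqrt(n_prev) * z_prev) / math.sqrt(n_new)
        significant += mass * (1 - ND.cdf(z_needed))
    return significant
```

The result sits above the nominal .05, since the first-look significant slices alone contribute .05 and every continuation path adds more.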
Thus, for that example, Nprev = 150 (because the sample prior to the last round of augmentation consists of the initial 100 subjects plus the 50 subjects from the first round of dataset augmentation), Nnew = 50 (because the final round of dataset augmentation contained 50 subjects), and pcrit = .03 (because the criterion for significance after the final round of dataset augmentation is .03).

This calculation begins with the equation for the weighted Z-method (Whitlock, 2005):

Z_w = \frac{\sum_{i=1}^{k} w_i Z_i}{\sqrt{\sum_{i=1}^{k} w_i^2}}

Because there are two samples (Nprev and Nnew), k = 2:

Z_w = \frac{w_{\mathrm{prev}} Z_{\mathrm{prev}} + w_{\mathrm{new}} Z_{\mathrm{new}}}{\sqrt{w_{\mathrm{prev}}^2 + w_{\mathrm{new}}^2}}

Here, Zw = Zcrit (the Z-score associated with pcrit), wprev = √Nprev, wnew = √Nnew, Zprev is the Z-score associated with the significance test of the Nprev participants, and Znew is the Z-score associated with the significance test of the supplemental Nnew participants. Substituting these values and solving for Znew yields the Z-score associated with the exact level of significance that must appear in the supplemental Nnew participants such that the significance of the combined Nprev+Nnew sample will be equal to pcrit:

Z_{\mathrm{new}} = \frac{Z_{\mathrm{crit}} \sqrt{N_{\mathrm{prev}} + N_{\mathrm{new}}} - \sqrt{N_{\mathrm{prev}}}\, Z_{\mathrm{prev}}}{\sqrt{N_{\mathrm{new}}}}

For a one-tailed test with the region of significance appearing in the upper tail of the normal distribution, Znew represents the minimum Z-score that would produce a significance level of the combined Nprev+Nnew sample ≤ pcrit.
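The algebra can be verified numerically with a round trip: compute Znew from the solved equation, recombine it with Zprev through the k = 2 weighted Z-method, and confirm that Zcrit is recovered. The variable names and the arbitrary Zprev value below are ours, for illustration only.

```python
# Round-trip check of the Znew formula for the example values in the text:
# Nprev = 150, Nnew = 50, pcrit = .03 (one-tailed).
import math
from statistics import NormalDist

ND = NormalDist()  # standard normal: inv_cdf() is qnorm

n_prev, n_new, p_crit = 150, 50, .03
z_crit = ND.inv_cdf(1 - p_crit)

z_prev = 1.2  # an arbitrary Z-score for the Nprev participants (illustration only)
z_new = (z_crit * math.sqrt(n_prev + n_new)
         - math.sqrt(n_prev) * z_prev) / math.sqrt(n_new)

# Recombining z_prev and z_new with weights sqrt(N) must reproduce z_crit exactly.
z_w = (math.sqrt(n_prev) * z_prev + math.sqrt(n_new) * z_new) / math.sqrt(n_prev + n_new)
```

The recombined z_w matches z_crit to floating-point precision, confirming that Znew is exactly the boundary Z-score for the supplemental sample.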
Using the R functions qnorm(p) and pnorm(z) (which return, respectively, the Z-score associated with a p-value for a normal distribution with a mean of 0 and a standard deviation of 1, and the cumulative distribution function for a particular Z-score for a normal distribution with a mean of 0 and a standard deviation of 1), the value assigned for the slice of Nprev with a p-value of p is:

1 - pnorm((qnorm(1 - pcrit) * sqrt(Nprev + Nnew) - sqrt(Nprev) * qnorm(1 - p)) / sqrt(Nnew))

For a two-tailed test with half of the region of significance appearing in the upper tail of the normal distribution and half appearing in the lower tail, +Zcrit and -Zcrit represent the positive and negative Z-scores associated with pcrit/2. For +Zcrit, Znew represents the minimum Z-score that would produce a significance level in the combined Nprev+Nnew sample ≤ pcrit. For -Zcrit, Znew represents the maximum Z-score that would produce a significance level in the combined Nprev+Nnew sample ≤ pcrit. Using the same R functions, the value assigned for the slice of Nprev with a p-value of p is:

1 - pnorm((qnorm(1 - pcrit/2) * sqrt(Nprev + Nnew) - sqrt(Nprev) * qnorm(1 - p)) / sqrt(Nnew))
  + pnorm((-qnorm(1 - pcrit/2) * sqrt(Nprev + Nnew) - sqrt(Nprev) * qnorm(1 - p)) / sqrt(Nnew))

pactual

The calculation of pactual is the same as the calculation of paugmented with two exceptions: (a) the value of pmax is specified explicitly, producing a single value for pactual (rather than a range, as for paugmented), and (b) the criterion for significance after the final round of dataset augmentation is pcrit (rather than the observed final p-value, as for paugmented).

Adjusted pcrit

The calculation of the adjusted pcrit necessary to maintain a desired Type I error rate while allowing for dataset augmentation proceeds through an iterative process: pactual is calculated repeatedly, and pcrit is adjusted up or down until pactual converges on the desired Type I error rate.
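The iterative adjustment can be sketched for a simple case. The assumptions here are ours: a single round of augmentation (N1 = 100, N2 = 50), a one-tailed test, and the worst case pmax = 1 (the researcher always continues after a non-significant first look). Under those assumptions pactual increases with pcrit, so bisection (our choice of search method; the text specifies only an iterative up/down adjustment) finds the adjusted pcrit at which pactual equals the desired .05.

```python
# Adjusted pcrit by bisection for one augmentation round, pmax = 1, one-tailed.
import math
from statistics import NormalDist

ND = NormalDist()  # standard normal: inv_cdf() is qnorm, cdf() is pnorm


def p_actual(p_crit, n1=100, n2=50, n_slices=2000):
    """Type I error rate: significant if p1 < p_crit, or otherwise if the
    combined N1+N2 p-value < p_crit (via the CDF shortcut from the text)."""
    z_crit = ND.inv_cdf(1 - p_crit)
    w = 1.0 / n_slices
    total = 0.0
    for i in range(n_slices):
        p1 = (i + 0.5) / n_slices
        if p1 < p_crit:
            total += w                    # already significant at the first look
            continue
        z1 = ND.inv_cdf(1 - p1)
        z_needed = (z_crit * math.sqrt(n1 + n2)
                    - math.sqrt(n1) * z1) / math.sqrt(n2)
        total += w * (1 - ND.cdf(z_needed))
    return total


def adjusted_p_crit(target=.05, lo=1e-4, hi=.05, iters=40):
    """Bisect on p_crit until p_actual converges on the target rate."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if p_actual(mid) > target:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2
```

Because unadjusted peeking inflates the error rate, p_actual(.05) exceeds .05, and the adjusted criterion the search converges on is correspondingly smaller than .05.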

© Copyright 2020