HOW TO GO ABOUT DATA GENERATION AND ANALYSIS // INDEFINITE GAMES AND DISCOUNT RATES ELICITATION (Lecture 12, 2009_03_11)

### Please recall my remarks in Lecture 1 about the nature of the lecture notes.
### Outlines of "lit reviews"?
### Review (in somewhat reversed order):

On to public good provision (in the lab)

Results: see lecture notes …

This is a chapter in Cherry et al. (2008), Environmental Economics, Experimental Methods, Routledge.

- Voluntary contribution mechanism (VCM) experiments are typically parameterized/designed so that each player has a dominant strategy of not contributing (to the public account).
- In one-shot (single-round) VCM experiments, subjects contribute, contrary to the theoretical prediction, about 40%-60% of their endowment.
- In finitely-repeated VCM experiments, subjects contribute about the same initially, but contributions then decline towards zero (though they rarely reach zero).
- "Thus, there seem to be motives for contributing that outweigh the incentive to free ride" (CFV, p. 194)
- Possible "motives": "pure altruism", "warm-glow" (also called "impure altruism"), "conditional cooperation", "confusion"
- "Confusion" describes individuals' failure to identify (in the laboratory setup) the dominant strategy of no contribution.
- Finding: every study that looks for confusion finds that it plays a significant role in observed contributions.
o "The level of confusion in all experiments is both substantial and troubling." (p. 196)
o "The experiments provide evidence that confusion is a confounding factor in investigations that discriminate among motives for public contributions, …" (p. 196)
- Solutions:
o Increase monetary rewards in VCM experiments! (Inadequate monetary rewards have been identified as a potential cause of contributions provided out of confusion.)
o Make sure instructions are understandable! (Poorly prepared instructions have been identified as a possible source of confusion.)
o Make sure, more generally, that subjects manage to identify the dominant strategy! (The inability of subjects to decipher the dominant strategy has been identified as a possible source of confusion.)
o "Our results call into question the standard, "context-free" instructions used in public good games." (p. 208)
- Some graphs: [not reproduced]

A SHORT LIST OF PRACTICAL POINTS ON HOW TO GO ABOUT DATA GENERATION AND ANALYSIS
(see p. 64 of Friedman & Cassar, whose chapter 5 I will send you if you send me a message):

All true! Also true (well, at least in my view):

You have a problem if you have to torture the data to make them confess! Ideally, you should be able to tell a story (your story!) by way of descriptive data (graphs, summary statistics). Tufte (1983) is a good book indeed. But even simpler, take note of what you like in articles you read.

And, yes, it is very important to go through the "qualitative phase" and really understand the data. Such an analysis will help you spot unexpected (ir)regularities (recording mistakes, computer malfunctions, human choice idiosyncrasies, etc.). [Because of the ease with which programs like STATA can be used nowadays, some experimenters are tempted to skip this step. Bad idea!]

Importantly, IT IS WORTH BEARING IN MIND THAT STATISTICAL TESTS ARE UNABLE TO DETECT FLAWS IN EXPERIMENTAL DESIGN AND IMPLEMENTATION … (Ondrej's write-up p. 7)

A little excursion of a more general nature:

The following text has been copied and edited from a wonderful homepage that I recommend to you enthusiastically: http://www.socialresearchmethods.net/ It's an excellent resource for social science research methods.

Four interrelated components that influence the conclusions you might reach from a statistical test in a research project.
<snip>

The four components are:
- sample size, or the number of units (e.g., people) accessible to the study
- effect size, or the salience of the treatment relative to the noise in measurement
- alpha level (α, or significance level), or the odds that the observed result is due to chance
- power, or the odds that you will observe a treatment effect when it occurs

Given values for any three of these components, it is possible to compute the value of the fourth.

<snip>

Figure 1 shows the basic decision matrix involved in a statistical conclusion. All statistical conclusions involve constructing two mutually exclusive hypotheses, termed the null (labeled H0) and alternative (labeled H1) hypothesis. Together, the hypotheses describe all possible outcomes with respect to the inference. The central decision involves determining which hypothesis to accept and which to reject. For instance, in the typical case, the null hypothesis might be:
H0: Program Effect = 0
while the alternative might be
H1: Program Effect ≠ 0

<snip>

Figure 1 below is a complex figure that you should take some time studying.

<snip>

Type I [false positive] is the same as the α or significance level and labels the odds of finding a difference or effect by chance alone. (Is there a psychological, or reporting, bias here?)
Type II [false negative] suggests that you find that the program was not demonstrably effective. (There may be a psychological bias here too but probably a healthy one.)

Columns (In reality...):
- Left column: H0 (null hypothesis) true, H1 (alternative hypothesis) false. There is no relationship. There is no difference, no gain. Our theory is wrong.
- Right column: H0 (null hypothesis) false, H1 (alternative hypothesis) true. There is a relationship. There is a difference or gain. Our theory is correct.

Upper row: We accept the null hypothesis (H0). We reject the alternative hypothesis (H1). We say... "There is no relationship", "There is no difference, no gain", "Our theory is wrong".
- Upper left box: 1-α (e.g., .95), THE CONFIDENCE LEVEL
o The odds of saying there is no relationship, difference, gain, when in fact there is none
o The odds of correctly not confirming our theory
o 95 times out of 100, when there is no effect, we'll say there is none
- Upper right box: β (e.g., .20), TYPE II ERROR
o The odds of saying there is no relationship, difference, gain, when in fact there is one
o The odds of not confirming our theory when it's true
o 20 times out of 100, when there is an effect, we'll say there isn't

Bottom row: We reject the null hypothesis (H0). We accept the alternative hypothesis (H1). We say... "There is a relationship", "There is a difference or gain", "Our theory is correct".
- Bottom left box: α (e.g., .05), TYPE I ERROR (SIGNIFICANCE LEVEL)
o The odds of saying there is a relationship, difference, gain, when in fact there is not
o The odds of confirming our theory incorrectly
o 5 times out of 100, when there is no effect, we'll say there is one
o We should keep this small when we can't afford/risk wrongly concluding that our program works
- Bottom right box: 1-β (e.g., .80), POWER
o The odds of saying that there is a relationship, difference, gain, when in fact there is one
o The odds of confirming our theory correctly
o 80 times out of 100, when there is an effect, we'll say there is
o We generally want this to be as large as possible

Figure 1. The Statistical Inference Decision Matrix

See worksheet for L 11.
- the lower the α, the lower the power; the higher the α, the higher the power
- the lower the α, the less likely it is that you will make a Type I Error (i.e., reject the null when it's true)
- the lower the α, the more "rigorous" the test
- an α of .01 (compared with .05 or .10) means the researcher is being relatively careful: s/he is only willing to risk being wrong 1 time in 100 in rejecting the null when it's true (i.e., saying there's an effect when there really isn't)
- an α of .01 (compared with .05 or .10) limits one's chances of ending up in the bottom row, of concluding that the program has an effect. This means that both your statistical power and the chances of making a Type I Error are lower.
- an α of .01 means you have a 99% chance of saying there is no difference when there in fact is no difference (being in the upper left box)
- increasing α (e.g., from .01 to .05 or .10) increases the chances of making a Type I Error (i.e., saying there is a difference when there is not), decreases the chances of making a Type II Error (i.e., saying there is no difference when there is) and decreases the rigor of the test
- increasing α (e.g., from .01 to .05 or .10) increases power because one will be rejecting the null more often (i.e., accepting the alternative) and, consequently, when the alternative is true, there is a greater chance of accepting it (i.e., power)

A little excursion of a more general nature:

Robert M. Becker at Cornell University illustrates these concepts masterfully, and entertainingly, by way of the OJ Simpson trial (note that this is actually a very nice illustration of the advantages of contextualization, although there may be order effects here): http://www.socialresearchmethods.net/OJtrial/ojhome.htm

H0: OJ Simpson was innocent (although our theory is that in fact he was guilty as charged)
HA: Guilty as charged (double murder)

Can H0 be rejected, at a high level of confidence? I.e. …

Type I error? Returning a guilty verdict when the defendant is innocent.
Type II error? Returning a not guilty verdict when the defendant is guilty.

The tradeoff (The Jury's Dilemma): Do we want to make sure we put guilty people in jail (that would mean choosing a higher α, i.e., having less stringent demands on the evidence needed), or do we want to keep innocent people out of jail (that would mean choosing a lower α, i.e., having higher demands on the evidence needed)?

Ondrej's appendix to my book …

1. Introduction

1.1 Descriptive statistics

Descriptive statistics - tools for presenting various characteristics of subjects' behavior as well as their personal characteristics in the form of tables and graphs, and methods of summarizing the characteristics by measures of central tendency, variability, and so on. One normally observes variation in characteristics between (or across) subjects, but sometimes also within subjects, for example if subjects' performance varies from round to round of an experiment.

Inferential statistics - formal statistical methods of making inferences (i.e., conclusions) or predictions regarding subjects' behavior.

Types of variables (Stevens 1946)
- categorical variables (e.g., gender, or field of study)
- ordinal variables (e.g., performance rank)
- interval variables (e.g., wealth or income bracket)
- ratio variables (e.g., performance score, or the number of subjects choosing option A rather than option B).

Different statistical approaches may be required by different types of variables.
1.1.1 Measures of central tendency and variability

Measures of central tendency
- the (arithmetic) mean (the average of a variable's values)
- the mode (the most frequently occurring value(s) of a variable)
- the median (the middle-ranked value of a variable), useful when the variable's distribution is asymmetric or contains outliers

Measures of variability
- the variance (the average of the squared deviations of a variable's values from the variable's arithmetic mean)
- an unbiased estimate of the population variance, ŝ² = n·s²/(n-1), where s² is the sample variance as defined in words directly above, and n is the number of observations on the variable under study
- the standard deviation (the square root of the variance)
- the range (the difference between a variable's highest and lowest value)
- the interquartile range (the difference between a variable's values at the third quartile (i.e., the 75th percentile) and the first quartile (i.e., the 25th percentile))
- Furthermore, … measures assessing the shape of a variable's distribution, such as the degree of symmetry (skewness) and peakedness (kurtosis) of the distribution, which are useful when comparing the variable's distribution to a theoretical probability distribution (such as the normal distribution, which is symmetric and moderately peaked).

1.1.2 Tabular and graphical representation of data

ALWAYS inspect the data by visual means before conducting formal statistical tests! And do it at as disaggregated a level as possible!

1.2 Inferential statistics

We use a sample statistic such as the sample mean to make inferences about an (unknown) population parameter such as the population mean.[1] The difference between the two is the sampling error, which tends to decrease as the sample size grows.
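The measures listed in Section 1.1.1 can be computed with a few lines of Python's standard library. The following is only a sketch: the contribution data are hypothetical, and the quartile convention ("inclusive") is one of several in use, so other software may report slightly different quartiles.

```python
import statistics

# Hypothetical contributions (in tokens) of ten subjects; illustrative only.
contributions = [0, 2, 5, 5, 5, 8, 10, 10, 12, 20]
n = len(contributions)

mean = statistics.mean(contributions)        # arithmetic mean
median = statistics.median(contributions)    # middle-ranked value
modes = statistics.multimode(contributions)  # most frequent value(s)

s2 = statistics.pvariance(contributions)          # average squared deviation from the mean
s2_unbiased = statistics.variance(contributions)  # same sum of squares divided by n - 1
sd = s2 ** 0.5                                    # standard deviation
value_range = max(contributions) - min(contributions)

# Quartiles depend on the interpolation convention; "inclusive" is an assumption here.
q1, q2, q3 = statistics.quantiles(contributions, n=4, method="inclusive")
iqr = q3 - q1

print(mean, median, modes, s2, s2_unbiased, sd, value_range, iqr)
```

Note that `s2_unbiased` equals `n * s2 / (n - 1)`, which is the relation between the two variance measures stated above.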
Sample statistics draw on measures of central tendency and variability, so the fields of descriptive and inferential statistics are closely related: A sample statistic can be used for summarizing sample behavior as well as for making inferences about a corresponding population parameter.

1.2.1 Hypothesis testing (as opposed to estimation of population parameters – see 2.1.1.)

The classical hypothesis testing model pits H0, of no effect (or no difference), against H1, of the presence of an effect (or presence of a difference). H1 is stated as nondirectional (two-tailed) if there is no prediction about the direction of the effect or difference, or as directional (one-tailed) if there is such a prediction (researchers sometimes speak of two-tailed and one-tailed statistical tests, respectively). A more conservative approach is to use a nondirectional (two-tailed) H1.

Can we reject H0 in favor of H1?

Example: Two groups of subjects face different experimental conditions: Does the difference in experimental conditions affect subjects' average performance?
H0: µ1 = µ2 and H1: µ1 ≠ µ2, or H1: µ1 > µ2 or H1: µ1 < µ2, if we have theoretical or practical reasons for entertaining a directional research hypothesis, where µi denotes the mean performance of subjects in Population i from which Sample i was drawn.

How confident are we about our conclusion?

[1] As further discussed below, random sampling is important for making a sample representative of the population we have in mind, and consequently for drawing valid conclusions about population parameters based on sample statistics. Recall the problematic recruiting procedure in Hoelzl & Rustichini (2005) and Harrison et al.'s (2005) critique of the unbalanced subject pools in Holt & Laury (2002).
1.2.2 The basics of inferential statistical tests

- compute a test statistic based on sample data
- compare it to the theoretical probability distribution of the test statistic constructed assuming that H0 is true
- if the computed value of the test statistic falls in the extreme tail(s) of the theoretical probability distribution (the tail(s) being delimited from the rest of the distribution by the so-called critical value(s)), conclude that H0 is rejected in favor of H1; otherwise conclude that H0 of no effect (or no difference) cannot be rejected.

By rejecting H0, we declare that the effect on (or difference in) behavior observed in our subject sample is statistically significant, meaning that the effect (or difference) is highly unlikely to be due to chance (i.e., random variation) but rather due to some systematic factors.

By convention, the level of statistical significance (or significance level), α, is often set at 5% (α=.05), sometimes at 1% (α=.01) or 10% (α=.10).

Alternatively, one may instead (or additionally) wish to report the exact probability value (or p-value), p, at which statistical significance would be declared.

The significance level at which H0 is evaluated and the type of H1 (one-tailed or two-tailed) ought to be chosen (i.e., predetermined) by the researcher prior to conducting the statistical test or even prior to data collection.

The critical values of common theoretical probability distributions of test statistics, for various significance levels and both types of H1, are usually listed in special tables in appendices of statistics (text)books and in Appendix X of Ondrej's chapter.

1.2.3 Type I and Type II errors, power of a statistical test, and effect size

Lowering α (for a given H1)
- increases the probability of a Type II error, β, which is committed when a false H0 is erroneously accepted despite H1 being true.
- decreases the power of a statistical test, 1-β, the probability of rejecting a false H0.
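The relation between α, β and power stated in Section 1.2.3 can be illustrated numerically. The sketch below assumes the simplest possible case, a one-tailed single-sample z test with known variance; the function name `z_test_power` and the chosen effect size (0.5 standard deviations) and sample size (25) are illustrative assumptions, not values from the lecture.

```python
from statistics import NormalDist

def z_test_power(effect_size_d, n, alpha):
    """Power of a one-tailed, single-sample z test (H1: mean above the H0 value).

    effect_size_d: true shift of the mean in standard deviation units (assumed known)
    n: sample size
    alpha: significance level
    """
    std_normal = NormalDist()
    critical_z = std_normal.inv_cdf(1 - alpha)   # H0 is rejected above this value
    shift = effect_size_d * n ** 0.5             # mean of the test statistic under H1
    return 1 - std_normal.cdf(critical_z - shift)

for alpha in (0.01, 0.05, 0.10):
    power = z_test_power(effect_size_d=0.5, n=25, alpha=alpha)
    print(f"alpha={alpha:.2f}  power={power:.3f}  beta={1 - power:.3f}")
```

Running the loop shows power rising (and β falling) as α is raised. With an effect size of zero the function returns α itself, which is the Type I error rate.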
Thus, in choosing a significance level at which to evaluate H0, one faces a tradeoff between the probabilities of committing the above statistical errors.

Other things equal, the larger the sample size and the smaller the sampling error, the higher the likelihood of rejecting H0 and hence the higher the power of a statistical test.

The probability of committing a Type II error as well as the power of a statistical test can only be determined after specifying the value of the relevant population parameter(s) under H1.

Other things equal, the test's power increases with the difference between the values of the relevant population parameter(s) under H0 and H1. This difference, when expressed in standard deviation units of the variable under study, is sometimes called the effect size (or Cohen's d index).

Especially in the context of parametric statistical tests, some scientists prefer to do a power-planning exercise prior to conducting an experiment: After specifying a minimum effect size they wish to detect in the experiment, they determine the sample size that yields what they deem to be sufficient power of the statistical test to be used.

Note, however, that one may not know a priori which statistical test is most appropriate and thus how to perform the calculation. In addition, existing criteria for identifying what constitutes a large or small effect size are rather arbitrary (Cohen (1977) proposes that d greater than 0.8 (0.5, 0.2) standard deviation units represents a large (medium, small) effect size).

Other things equal, however, the smaller the (expected) effect size, the larger the sample size required to yield a sufficiently powerful test capable of detecting the effect. See, e.g., [S] pp. 164-173 and pp. 408-412 for more details.
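A power-planning exercise of the kind described above can be sketched as follows. This is a rough approximation under stated assumptions: a two-tailed comparison of two independent groups of equal size, with the normal distribution standing in for the t distribution (so the result slightly understates what a t test would require). The function name `n_per_group` and the defaults α=.05 and power=.80 are illustrative choices.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size_d, alpha=0.05, power=0.80):
    """Approximate subjects needed per group to detect Cohen's d = effect_size_d.

    Normal-approximation formula: n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # two-tailed critical value
    z_power = z.inv_cdf(power)           # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size_d) ** 2)

# Cohen's conventional large, medium and small effect sizes.
for d in (0.8, 0.5, 0.2):
    print(f"d={d}: about {n_per_group(d)} subjects per group")
```

The required sample size grows with the inverse square of the effect size, which is why small expected effects quickly become expensive to detect.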
Criticisms of the classical hypothesis testing model:

With a large enough sample size, one can almost always obtain a statistically significant effect, even for a negligible effect size (by the same token, of course, a relatively large effect size may turn out statistically insignificant in small samples). Yet if one statistically rejects H0 in a situation where the observed effect size is practically or theoretically negligible, one is in a practical sense committing a Type I error. For this reason, one should strive to assess whether or not the observed effect size (i.e., the observed magnitude of the effect on, or difference in, behavior) is of any practical or theoretical significance. To do so, some researchers prefer to report what is usually referred to as the magnitude of treatment effect, which is also a measure of effect size (and is in fact related to Cohen's d index). We discuss the notion of treatment effect in Sections 2.2.1 and 2.3.1; see also [S] pp.1037-1061 for more details.

Another criticism: improper use, particularly in relation to the true likelihood of committing a Type I and Type II error. Within the context of a given research hypothesis, statistical comparisons and their significance level should be specified prior to conducting the tests. If additional unplanned tests are conducted, the overall likelihood of committing a Type I error in such an analysis is inevitably inflated well beyond the α significance level prespecified for the additional tests. For explanation, and possible remedies, see Ondrej's text.

Alternatives:
- the minimum-effect hypothesis testing model
- the Bayesian hypothesis testing model
See Cohen (1994), Gigerenzer (1993) and [S] pp. 303-350 for more details, also text.

1.3 The experimental method and experimental design

How experimental economists and other scientists design experiments to evaluate research hypotheses.
Proper design and execution of your experiment ensure reliability of your data and hence also the reliability of your subsequent statistical inference. (Statistical tests are unable to detect flaws in experimental design and implementation.)

A typical research hypothesis involves a prediction about a causal relationship between an independent and a dependent variable (e.g., the effect of financial incentives on risk aversion, or on effort, etc.).

A common experimental approach to studying the relationship is to compare the behavior of two groups of subjects: the treatment (or experimental) group and the control (or comparison) group.

The independent variable is the experimental conditions, manipulated by the experimenter, that distinguish the treatment and control groups (one can have more than one treatment group and hence more than two levels of the independent variable).

The dependent variable is the characteristic of subjects' behavior predicted by the research hypothesis to depend on the level of the independent variable (one can also have more than one dependent variable).

In turn, one uses an appropriate inferential statistical test to evaluate whether there indeed is a statistically significant difference in the dependent variable between the treatment and control groups.

What we describe above is commonly referred to as a true experimental design, characterized by a random assignment of subjects into the treatment and control groups (i.e., there exists at least one adequate control group) and by the independent variable being exogenously manipulated by the experimenter in line with the research hypothesis.

These characteristics jointly limit the influence of confounding factors and thereby maximize the likelihood of the experiment having internal validity, which is achieved to the extent that observed differences in the dependent variable can be unambiguously attributed to a manipulated independent variable.
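Random assignment into treatment and control groups, as required by a true experimental design, can be implemented very simply. The sketch below is one possible way of doing it; the function name `assign_groups`, the subject IDs and the seed are hypothetical, and fixing a seed is merely a convenience that makes the assignment reproducible and auditable.

```python
import random

def assign_groups(subject_ids, seed=None):
    """Randomly split subjects into treatment and control groups of (near-)equal size."""
    rng = random.Random(seed)        # local generator; does not disturb global state
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)            # every ordering of subjects is equally likely
    half = len(shuffled) // 2
    return {
        "treatment": sorted(shuffled[:half]),
        "control": sorted(shuffled[half:]),
    }

# Twenty hypothetical subjects numbered 1 to 20.
groups = assign_groups(range(1, 21), seed=12)
print(groups)
```

Because assignment depends only on the shuffle, it cannot vary systematically with subjects' characteristics, although in any single draw the two groups may still differ by chance.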
Confounding factors are variables systematically varying with the independent variable (e.g., yesterday's seminar?), which may produce a difference in the dependent variable that only appears like a causal effect.

Unlike true experiments, other types of experiments conducted outside the laboratory, such as what is commonly referred to as natural experiments, exercise less or no control over random assignment of subjects and exogenous manipulation of the independent variable, and hence are more prone to the potential effect of confounding variables and have lower internal validity.

Random assignment of subjects conveniently maximizes the probability of obtaining control and treatment groups equivalent with respect to potentially relevant individual differences such as demographic characteristics. As a result, any difference in the dependent variable between the treatment and control groups is most likely attributable to the manipulation of the independent variable and hence to the hypothesized causal relationship.

Nevertheless, the equivalence of the control and treatment groups is rarely achieved in practice, and one should control for any differences between the control and treatment groups if deemed necessary (e.g., as illustrated in Harrison et al. 2005, in their critique of Holt & Laury 2002).

Similarly, one should not simply assume that a subject sample is drawn randomly and hence is representative of the population under study. Consciously or otherwise, we often deal with nonrandom samples. Volunteer subjects, or subjects selected based on their availability at the time of the experimental sessions, are unlikely to constitute true random samples but rather convenience samples. As a consequence, the external validity of our results (i.e., the extent to which our conclusions generalize beyond the subject sample(s) used in the experiment) may suffer.[2]

Choosing an appropriate experimental design often involves tradeoffs.
One must pay attention to the costs of the design in terms of the number of subjects and the amount of money and time required, to whether the design will yield reliable results in terms of internal and external validity, and to the practicality of implementing the design. In other words, you may encounter practical, financial or ethical limitations preventing you from employing the theoretically best design in terms of internal and external validity.

1.4 Selecting an appropriate inferential statistical test

- Determine whether the hypothesis (and hence your data set) involves one or more samples.

Single sample: use a single-sample statistical test to test for the absence or presence of an effect on behavior, along the lines described in the first example in Section 1.2.1.

Two samples: use a two-sample statistical test for the absence or presence of a difference in behavior, along the lines described in the second example in Section 1.2.1.

The most common single- and two-sample statistical tests are discussed in Sections 2 to 6; other statistical tests and procedures intended for two or more samples are not discussed in this book but can be reviewed, for example, in [S] pp. 683-898.

[2] In the case of convenience samples, one usually does not know the probability of a subject being selected. Consequently one cannot employ methods of survey research that use known probabilities of subjects' selection to correct for the nonrandom selection and thereby to make the sample representative of the population. One should rather employ methods of correcting for how subjects select into participating in the experiment (see, e.g., Harrison et al., UCF WP 2005, forthcoming in JEBO?).

When making a decision on the appropriate two-sample test, one first needs to determine whether the samples (usually the treatment and control groups/conditions in the context of the true experimental design described in Section 1.3) are independent or dependent.
Independent samples design (or between-subjects design, or randomized-groups design): subjects are randomly assigned to two or more experimental and control groups, and one employs a test for two (or more) independent samples.

Dependent samples design (or within-subjects design, or randomized-blocks design): each subject serves in each of the k experimental conditions, or, in the matched-subjects design, each subject is matched with one subject from each of the other (k-1) experimental conditions based on some observable characteristic(s) believed to be correlated with the dependent variable. Here one employs a test for dependent samples.

One needs to ensure internal validity of the dependent samples design by controlling for order effects (so that differences between experimental conditions do not arise solely from the order of their presentation to subjects), and, in the matched-subjects design, by ensuring that matched subjects are closely similar with respect to the matching characteristic(s) and (within each pair) are assigned randomly to the experimental conditions.

Finally, in factorial designs, one simultaneously evaluates the effect of several independent variables (factors) and conveniently also their interactions, which usually requires using a test for factorial analysis of variance or other techniques (which are not discussed in this book but can be reviewed, e.g., in [S] pp.900-955).

Sections 2 to 6: we discuss the most common single- and two-sample parametric and nonparametric inferential statistical tests.

The parametric label is usually used for tests that make stronger assumptions about the population parameter(s) of the underlying distribution(s) for which the tests are employed, as compared to nonparametric tests that make weaker assumptions (for this reason, the nonparametric label may be slightly misleading since nonparametric tests are rarely free of distributional and other assumptions).
Some researchers instead prefer to make the parametric-nonparametric distinction based on the type of variables analyzed by the tests, with nonparametric tests analyzing primarily categorical and ordinal variables with lower informational content (see Section 1.1).

Behind the alternative classification is the widespread (but not universal) belief that a parametric test is generally more powerful than its nonparametric counterpart provided the assumption(s) underlying the former test are satisfied, but that a violation of the assumption(s) calls for transforming the data into a format of (usually) lower informational content and analyzing the transformed data by a nonparametric test.

Alternatively, … use parametric tests even if some of their underlying assumptions are violated, but make adjustments to the test statistic to improve its reliability.

While the reliability and validity of statistical conclusions depend on using appropriate statistical tests, one often cannot fully validate the assumptions underlying specific tests and hence faces the risk of making wrong inferences. For this reason, one is generally advised to conduct both parametric and nonparametric tests to evaluate a given statistical hypothesis, and, especially if results of alternative tests disagree, to conduct multiple experiments evaluating the research hypothesis under study and jointly analyze their results by using meta-analytic procedures. See, e.g., [S] pp.1037-1061 for further details. Of course, that's only possible if you have enough resources.

2. t tests for evaluating a hypothesis about population mean(s)

2.1 Single-sample t test (and z test)

The single-sample t test and z test are similar parametric statistical tests usually employed for interval or ratio variables (see Section 1.1). They evaluate the null hypothesis that, in the population that our sample represents, the variable under study has a population mean equal to a specific value, µ.
If the value of the variable's population variance is known and the sample size is relatively large (usually deemed at least n>25 but sometimes n>40), one can employ a single-sample z test, the test statistic of which follows the standard normal probability distribution.

However, since the population variance is rarely known (and hence must be estimated using the sample standard deviation) and experimental datasets tend to be rather small, it is usually more appropriate to use the t test, which relaxes the two assumptions and is based on the t probability distribution (also called the Student's t distribution, which approaches the standard normal probability distribution as n approaches infinity).

The assumptions behind the t test are that (1) the subject sample has been drawn randomly from the population it represents, and (2) in the population the sample represents, the variable under study is normally distributed. When the normality assumption is violated (see Section 6.1 for tests of normality), the test's reliability may be compromised and one may prefer to instead use alternative nonparametric tests, such as the Wilcoxon signed-ranks test (see Section 3.1) or the binomial sign test (see Section 3.3, where we also discuss its application referred to as the single-sample test for the median).

2.1.1 Estimation of confidence intervals for the single-sample t test

As mentioned in Section 1.2.1, another methodology of inferential statistics besides hypothesis testing is estimation of population parameters. A common method is interval estimation of the so-called confidence interval for a population parameter: a range of values that, with a high degree of confidence, contains the true value of the parameter. For example, a 95% confidence interval is constructed so that, across repeated samples, it contains the population parameter with probability 0.95.
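The single-sample t test and the corresponding 95% confidence interval can be sketched as follows. The sample values and the hypothesized mean `mu_0` are made up for illustration, and the critical value 2.262 (two-tailed, α=.05, 9 degrees of freedom) is taken from a standard t table because Python's standard library has no t distribution.

```python
import statistics

# Hypothetical sample of n = 10 observations; illustrative only.
sample = [4.1, 5.3, 6.0, 4.8, 5.9, 5.1, 4.4, 6.2, 5.5, 4.7]
mu_0 = 4.5  # population mean under H0 (an assumption made for this example)

n = len(sample)
mean = statistics.mean(sample)
se = statistics.stdev(sample) / n ** 0.5  # stdev uses the n - 1 denominator
t_stat = (mean - mu_0) / se               # test statistic with n - 1 degrees of freedom

T_CRIT = 2.262  # two-tailed critical value for alpha = .05 and df = 9 (t table)
ci = (mean - T_CRIT * se, mean + T_CRIT * se)  # 95% confidence interval for the mean
reject_h0 = abs(t_stat) > T_CRIT

print(f"t = {t_stat:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f}), reject H0: {reject_h0}")
```

The two-tailed test rejects H0 at the 5% level exactly when `mu_0` lies outside the 95% confidence interval, which is the sense in which both procedures use the same sample information.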
Confidence interval estimation uses the same sample information as hypothesis testing (and in fact tends to be viewed as part of the classical hypothesis testing model).

2.2 t test for two independent samples

The assumptions behind the t test for two independent samples are that (1) each sample has been drawn randomly from the population it represents, (2) in the populations that the samples represent, the variable under study is normally distributed, and (3) in the populations that the samples represent, the variances of the variable are equal (also known as the homogeneity of variance assumption).

…

When the normality assumption underlying the t test is violated, one may prefer to instead use a nonparametric test, such as the Mann-Whitney U test (see Section 4.1) or the chi-square test for r x c tables (see Section 4.3, where we also discuss its application referred to as the median test for two independent samples).

However, nonparametric tests usually sacrifice information by transforming the original interval or ratio variable into an ordinal or categorical format. For this reason, some researchers prefer to use the t test even when the normality assumption is violated (also because the t test actually tends to perform relatively well even with its assumptions violated) but use more conservative (i.e., larger in magnitude) critical values to avoid inflating the likelihood of committing a Type I error.

Nevertheless, one should attempt to understand why and to what extent the t test's assumptions are violated. For example, the presence of outliers may cause violation of both the normality and variance homogeneity assumptions.

Before using the t test, one may wish to verify the homogeneity of variance assumption. To do so, one can use an F test for two population variances, or Hartley's Fmax test for homogeneity of variance, or other tests.
Both the abovementioned tests rest on normality of the underlying distributions from which the samples are drawn, and the former is more appropriate in the case of unequal sample sizes. See, e.g., [S] pp.403-408 and 722-725 for a detailed discussion. One may alternatively prefer to use nonparametric tests, such as the Siegel-Tukey test for equal variability (see, e.g., [S] pp.485-498) or the Moses test for equal variability (see, e.g., [S] pp.499-512).

When the t test is employed despite the homogeneity of variance assumption being violated (a violation which may be a key result in itself, among other things possibly signaling the presence of outliers, or that the control and treatment groups differ along some underlying dimension, or that the treatment group subjects react heterogeneously to the treatment), the likelihood of committing a Type I error increases. The literature offers several remedies that make the t test more conservative, such as adjusting upwards the (absolute value of) critical value(s) of the t test, or adjusting downwards the number of degrees of freedom. Alternatively, one may prefer to instead use a nonparametric test that does not rest on the homogeneity of variance assumption, such as the Kolmogorov-Smirnov test for two independent samples (see Section 4.2).

2.2.1 Measuring the magnitude of treatment effect

As discussed in Section 1.2.3, one should be cautious when the observed effect on (or difference in) behavior is of little practical or theoretical relevance and statistical significance of the effect arises primarily from having a large enough sample size.
Besides judging the practical (economic) significance of an observed effect, one may wish to evaluate its magnitude, in a manner more or less independent of sample size, by determining what is referred to as the magnitude of treatment effect, that is, the fraction of variation in the dependent variable (i.e., the variable under study) attributable to variation in the independent variable (usually the treatment variation in experimental conditions between the treatment and control groups). For a given experimental design, there frequently exist multiple measures of the magnitude of treatment effect, which tend to yield different results. …

2.3 t test for two dependent samples
2.3.1 Measuring the magnitude of treatment effect

2.4 Examples of using t tests

Example 1:
Kovalchik et al. (2005) compare the behavior between two samples of young students (Younger) and elderly adults (Older). In one of the tasks, the authors elicit willingness-to-pay (WTP) for a group of subjects in the role of buyers and willingness-to-accept (WTA) for another group of subjects in the role of sellers of a real mug (see the data in the last column of their Table 2). The authors claim that there is no significant difference between WTP and WTA for either the Younger or the Older (and also for the pooled sample), and that there is no significant difference between the Younger and the Older in either their WTP or their WTA. Although the authors give no details as to how they conducted the statistical inferential tests, the third column of Table 2 offers us sufficient information to evaluate the hypotheses for the mug’s WTP/WTA using the t test for two independent samples.
Before doing so, it is worth noting that although we do not have full information on the shape of the WTP and WTA distributions, the normality assumption underlying the t test might well be violated, especially for the WTP of the Older and the WTA of the Younger, since for either group there is a large difference between the sample mean and the sample median, which indicates distribution asymmetry. Similarly, judging from the rather large differences in the four standard deviations, the homogeneity of variance assumption behind the t test might also be violated. We nevertheless calculate the sequence of t tests for illustration purposes.

Evaluating first the null hypothesis that the mean WTP and WTA do not differ from each other for the population of the Younger (Older), say, against a two-tailed H1 at the 5% significance level, we calculate the tYounger (tOlder) test statistic as follows (using the formula for equal sample sizes, disregarding that there is an extra subject in the third group since this will not affect the t test statistic in any essential way):

$$t_{Younger} = \frac{3.88 - 2.24}{\sqrt{\frac{4.88^2}{26-1} + \frac{1.75^2}{25-1}}} = 1.578$$

$$t_{Older} = \frac{2.48 - 3.25}{\sqrt{\frac{1.7^2}{25-1} + \frac{3.04^2}{25-1}}} = -1.083$$

Next, we evaluate the null hypothesis that the mean WTP (WTA) does not differ between the Younger and Older populations, again against a two-tailed H1 at the 5% significance level. We calculate the tWTP (tWTA) test statistic as follows (using again the formula for equal sample sizes):

$$t_{WTA} = \frac{2.48 - 3.88}{\sqrt{\frac{1.7^2}{25-1} + \frac{4.88^2}{26-1}}} = -1.352$$

$$t_{WTP} = \frac{3.25 - 2.24}{\sqrt{\frac{3.04^2}{25-1} + \frac{1.75^2}{25-1}}} = 1.411$$

We compare the four computed t test statistics with the two-tail upper and lower critical values, t0.975(df) and t0.025(df), where there are either 49 or 48 degrees of freedom depending on the particular comparison (remember that df=n1+n2–2).
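The four statistics above can be recomputed in a few lines. This is a sketch using only the means, standard deviations, and sample sizes that appear in the formulas; the assignment of each triple to WTA or WTP and to Younger or Older is inferred from the order of the terms in those formulas, and the function name `t_stat` is ours. The value 2.000 is the lower end of the critical-value range quoted from Table A2.

```python
import math

def t_stat(mean1, sd1, n1, mean2, sd2, n2):
    """t statistic for two independent samples, in the form used in the text:
    difference in means over sqrt(s1^2/(n1-1) + s2^2/(n2-1))."""
    std_err = math.sqrt(sd1**2 / (n1 - 1) + sd2**2 / (n2 - 1))
    return (mean1 - mean2) / std_err

# (mean, standard deviation, n), as they appear in the formulas above.
wta_younger = (3.88, 4.88, 26)
wtp_younger = (2.24, 1.75, 25)
wta_older = (2.48, 1.70, 25)
wtp_older = (3.25, 3.04, 25)

t_younger = t_stat(*wta_younger, *wtp_younger)
t_older = t_stat(*wta_older, *wtp_older)
t_wta = t_stat(*wta_older, *wta_younger)
t_wtp = t_stat(*wtp_older, *wtp_younger)

# Lower end of the tabulated range for the two-tailed 5% critical value (df = 48 or 49).
CRITICAL_LOWER_BOUND = 2.000

for name, t in [("Younger", t_younger), ("Older", t_older),
                ("WTA", t_wta), ("WTP", t_wtp)]:
    print(f"t_{name} = {t:.3f}, |t| exceeds critical value: {abs(t) > CRITICAL_LOWER_BOUND}")
```

None of the four absolute values reaches 2.000, which is what matters for the comparison with the critical values.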
Based on Table A2 in Appendix X, t0.975(49)=- t0.025(49) and t0.975(48)=- t0.025(48) lie between 2.000 and 2.021. Since all four test statistics are smaller than 2.000 in absolute value, we cannot reject the null hypothesis of no difference in any of the four cases above. This confirms the conclusions of Kovalchik et al. (2005). [Or does it?]

3. Nonparametric alternatives to the single-sample t test
3.1 Wilcoxon signed-ranks test
3.2 Chi-square goodness-of-fit test
3.3 Binomial sign test for a single sample
3.3.1 Single-sample test for the median
3.3.2 z test for a population proportion

3.4 Examples of using single-sample tests

Example 1:
Ortmann et al. (2000) study trust and reciprocity in an investment setting and how it is affected by the form of presenting other subjects’ past behavior and by a questionnaire prompting strategic thinking. The authors do not find significant differences in investment behavior across their five between-subjects treatments (i.e., they have one control group called the Baseline and four treatment groups varying in history presentation and in the presence or absence of the questionnaire), using the two-tailed Mann-Whitney U test (see Table 3). We can entertain an alternative research hypothesis and test whether, in the authors’ fifth treatment, which would be theoretically most likely to decrease trust, investment differs from the theoretical prediction of zero. For this purpose, we analyze the data from the authors’ Tables A5 and A5R for two different subject groups participating in the fifth treatment, which, as the authors report, differ in their behavior as indicated by the two-tailed Mann-Whitney U test (see Table 3, p=0.02). In the left and right bar graphs below, you can see a clear difference in the shape of the sample distributions of investment for the A5 and A5R subjects, respectively, namely that the investment of zero (ten) is much more common for A5 (A5R) subjects.
For that reason (and predominantly for illustration purposes), we focus on whether investment behavior of the A5 subjects adheres to the above stated theoretical prediction.

[Figure: bar graphs, by treatment (A5 left, A5R right), of the sample distributions of investment; horizontal axis: investment (0 to 10), vertical axis: density.]

We know from Ortmann et al.’s (2000) Table 2 that the (sample) mean and median investment of the A5 subjects is 2.2 and 0.5 units, respectively. To start, we can conduct a two-sided single-sample t test at the 5% significance level (using, for example, the “ttest” command in Stata) to investigate whether, in the population that the A5 subjects represent, the mean investment is zero. The t test yields a highly significant result (p=0.011). However, we see from the above bar graph that the normality assumption underlying the t test is very likely violated.

Since the above displayed distribution of investment is so asymmetric, one would also have trouble justifying the use of the Wilcoxon signed-ranks test. Yet a more general problem is the corner-solution nature of the theoretical prediction, implying that ΣR– calculated for the Wilcoxon signed-ranks test would inevitably be zero, which ultimately makes the test unsuitable for evaluating the research hypothesis. Similarly, note that, based on the corner-solution theoretical prediction, the expected cell frequencies for the chi-square goodness-of-fit test would be zero for all positive-investment categories (cells), which clearly makes the chi-square test unsuitable, too (not only due to the statistical problem that assumption (3) of the test would be strongly violated). A similar problem arises also for the binomial sign test (and its large-sample approximations), where it would be impossible to evaluate the null hypothesis of whether, in the population that A5 subjects represent, the proportion of zero investments is equal to π=1 (to see why, check the formula for calculating the binomial probabilities).
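The problem with the binomial sign test under π=1 can be seen directly from the binomial probability formula, P(X=x) = C(n,x) π^x (1−π)^(n−x). The sketch below uses a hypothetical sample of 16 subjects of whom 8 invest zero (the counts are ours, chosen only to mimic a median near zero; they are not the A5 data). With π=1 the factor (1−π)^(n−x) is zero for every x<n, so the null distribution puts all its mass on "everybody invests zero".

```python
from math import comb

def binom_pmf(x, n, pi):
    """Binomial probability of exactly x 'successes' in n trials."""
    return comb(n, x) * pi**x * (1 - pi) ** (n - x)

# Hypothetical counts (illustration only): 16 subjects, 8 of whom invest zero.
n_subjects = 16
n_zero_investments = 8

# Null hypothesis: the population proportion of zero investments is pi = 1.
p_observed = binom_pmf(n_zero_investments, n_subjects, 1.0)
p_all_zero = binom_pmf(n_subjects, n_subjects, 1.0)

print("P(8 zeros out of 16 | pi = 1)  =", p_observed)
print("P(16 zeros out of 16 | pi = 1) =", p_all_zero)
```

Any sample containing at least one positive investment has probability zero under this null, so the test has no sampling variability to work with.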
One could adopt an alternative (and often appealing) research hypothesis to determine whether subjects’ investment behavior in fact deviates from what would be expected under random choice, meaning that all 11 choice categories (0,1,…,10) would be chosen equally often in the population. In principle, one could test this hypothesis using the chi-square goodness-of-fit test. However, note that even if we pooled the A5 and A5R subject groups, giving a total of 34 subjects (see Table 2), this would give us only slightly above 3 subjects in each of the eleven expected frequency cells (i.e., 34/11). Hence the test reliability could be compromised due to the violation of its assumption (3).

4. Nonparametric alternatives to the t test for two independent samples
4.1 Mann-Whitney U test (Wilcoxon rank-sum test) [3]
4.2 Kolmogorov-Smirnov test for two independent samples [4]
4.3 Chi-square test for r x c tables [5]
4.3.1 Fisher exact test
4.3.2 z test for two independent proportions
4.3.3 Median test for independent samples
4.3.4 Additional notes on the chi-square test for r x c tables

[3] Note that there exist two versions of the test that yield comparable results: we describe the version developed by Mann and Whitney (1947) which is also referred to as the Mann-Whitney-Wilcoxon test, while the other version, usually referred to as the Wilcoxon rank-sum test or the Wilcoxon-Mann-Whitney test, was developed independently by Wilcoxon (1949).
[4] The test was developed by Smirnov (1939) and for that reason is sometimes referred to as the Smirnov test, but because of its similarity to the Kolmogorov-Smirnov goodness-of-fit test for a single sample (see Section 6.1), the test described here is most commonly named the Kolmogorov-Smirnov test for two independent samples.
[5] The test is an extension of the chi-square goodness-of-fit test (see Section 3.2) to two-dimensional contingency tables.

4.4 Computer-intensive tests

Computer-intensive (or data-driven) tests have become a widely used alternative to traditional parametric and nonparametric tests. The tests are also referred to as permutation tests, randomization (or rerandomization) tests, or exact tests. They employ resampling of data (a process of repeatedly randomly drawing subsamples of observations from the original dataset) to construct the sampling distribution of a test statistic based directly on the data, rather than based on an underlying theoretical probability distribution as in the classical hypothesis testing approach. As a consequence, computer-intensive tests have the appeal of relying on few if any distributional assumptions. The tests differ from each other mainly in the nature of their resampling procedure. Most computer-intensive tests are employed for comparing behavior between two independent random samples and hence are discussed briefly in this section (you can find more details in the references cited below, in modern statistical textbooks, or in manuals of statistical packages in which the tests are frequently pre-programmed).

- randomization test for two independent samples (sampling without replacement)
- bootstrap ((re)sampling with replacement)
- jackknife (recomputing the statistic while systematically leaving out one observation, or one group of observations, at a time)

4.5 Examples of using tests for two independent samples

Example 1:
Abbink and Hennig-Schmidt (2006) use the Mann-Whitney U test in the context of a bribery experiment to compare behavior of two independent groups of subjects under neutral and loaded (corruption-framed) experimental instructions. They find that bribe offers (averaged across 30 rounds of the experiment) do not significantly differ across the two groups of subjects, with a one-tail p-value of 0.39. This result may look surprising given that the median (averaged) bribe offers differ rather widely: 1.65 for the loaded group and 3.65 for the neutral group, as we calculate from the authors’ Table 2.
We plot the two bribe offer distributions below (using the command “kdensity” in the Stata software) to illustrate that the difference in medians results from the bi-modal nature of the distributions (note that the medians fall in the trough of the distributions) as well as their different shape (compare the two humps).

[Figures: kernel density estimates of the averaged bribe offers for the loaded group (kernel = epanechnikov, bandwidth = 1.15) and for the neutral group (kernel = epanechnikov, bandwidth = 1.05).]

If the distributions’ shape is indeed different, this would violate assumption (3) of the Mann-Whitney U test. Conducting the F test for two population variances (for example, using the command “sdtest” in Stata), we find that the variances of the two bribe offer distributions do not differ significantly from each other, with a two-tail p-value of 0.72. However, note that, given the bi-modality of the sample distributions, first, the normality assumption of the F test is likely violated, so the reliability of the test might be compromised, and second, variance is unlikely to be the most appropriate criterion when comparing the shape of the two distributions.

One might wish to compare the above distributions of averaged bribe offers using the Kolmogorov-Smirnov test for two independent samples, which does not rely on the shape of the population distributions being identical. When we conduct the two-tail test (for example, using the command “ksmirnov” in Stata), it yields no significant difference between the two distributions at any conventional significance level (p=0.936).

Given the end-game effect noted by the authors as well as a potential “warming-up” effect in early rounds of the experiment (when different forms of learning might be going on as compared to later rounds), we examine whether leaving out the first and last five rounds from the above analysis influences the results.
Although the medians are now somewhat different (1.175 for the loaded group and 3.425 for the neutral group), the bimodal shape of the distributions is again present and the difference between the two distributions, as judged by either the Mann-Whitney U test or the Kolmogorov-Smirnov test for two independent samples, remains statistically highly insignificant (p-values not reported).

Example 2:
Kovalchik et al. (2005) compare the behavior between two samples of young students (Younger) and elderly adults (Older). In their second gambling experiment, subjects sequentially make six choices between two decks of cards, one of which contains cards with lower average payoff and higher variance (the authors call this card deck risky). The sample distributions of the number of risky deck choices (ranging from 0 to 6) for the Younger and Older are depicted in the authors’ Figure 3, where only the percentages of subjects choosing the risky deck six times differ (at the 10% significance level) across the two groups of subjects, as revealed by the Mann-Whitney U test. Based on the figure and Table 1, the authors conclude that there is no difference in the choice behavior of the Younger and the Older.

We use the data from Figure 3 to illustrate that, instead of using the Mann-Whitney U test separately for each given number of risky deck choices, one can make an alternative (and perhaps more suitable) comparison of the entire distributions of the number of risky deck choices by using the chi-square test for r x c tables. In particular, the chi-square test for homogeneity can be used to evaluate the null hypothesis of whether the two independent samples of the Younger and Older, each having the variable’s values organized in the seven categories represented by the number of risky deck choices, are homogeneous with respect to the proportion of observations in each category.
In the tables below, we display, first, the observed frequencies of risky deck choices, then the expected frequencies calculated as outlined in Section 4.3, and finally the calculation of the chi-square test statistic using the formula given in Section 4.3. Note that since 4 out of the 14 expected frequencies in the middle table fall below 5, one could argue that assumption (3) of the chi-square test is not met, but two of the 4 cases are only marginal and so we proceed with conducting the chi-square test, if only for illustration purposes.

Observed frequencies:
# risky choices        6        5        4        3        2        1        0    Row total
Younger               11       10        9        8        4        4        5        51
Older                  4        9        8        4        5        0        6        36
Column total          15       19       17       12        9        4       11        87

Expected frequencies:
# risky choices        6        5        4        3        2        1        0    Row total
Younger           8.7931   11.138   9.9655   7.0345   5.2759   2.3448   6.4483        51
Older             6.2069   7.8621   7.0345   4.9655   3.7241   1.6552   4.5517        36
Column total          15       19       17       12        9        4       11        87

(Observed-Expected)^2/Expected:
# risky choices        6        5        4        3        2        1        0    Row total
Younger           0.5539   0.1163   0.0935   0.1325   0.3085   1.1684   0.3253   2.6983939
Older             0.7847   0.1647   0.1325   0.1877   0.4371   1.6552   0.4608   3.8227247
Column total      1.3386    0.281   0.2261   0.3203   0.7456   2.8235   0.7861   6.5211185

We compare the computed value of the χ2 test statistic (the bottom-right entry, i.e., the sum of the 14 cells in the last contingency table) with the appropriate critical value from the χ2(6) probability distribution since df=(2-1)(7-1)=6, as tabulated in Table A4 in Appendix X. Since the two-tail critical value for the 5% significance level, χ2.95(6)=12.59, is far greater than the χ2 test statistic of 6.52 (rounded to 2 decimal places), we cannot reject at the 5% significance level the null hypothesis of no difference between the distributions of risky deck choices for the Younger and Older (the same conclusion would clearly be reached also at the 10% significance level for which the two-tail critical value is χ2.90(6)=10.64).
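The three tables of Example 2 can be reproduced from the observed frequencies alone. The following is a sketch of the chi-square test for homogeneity as described above (expected frequency = row total × column total / grand total); the variable names are ours, and the critical value 12.59 is the tabulated 5% value for 6 degrees of freedom quoted in the text.

```python
# Observed numbers of subjects by number of risky deck choices (6, 5, 4, 3, 2, 1, 0).
younger = [11, 10, 9, 8, 4, 4, 5]
older = [4, 9, 8, 4, 5, 0, 6]
observed = [younger, older]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected frequency of each cell under homogeneity.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-square statistic: sum over all cells of (O - E)^2 / E.
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(col_totals) - 1)

CRITICAL_05_DF6 = 12.59  # tabulated chi-square critical value, 5% level, df = 6
reject_null = chi2 > CRITICAL_05_DF6

print(f"chi2 = {chi2:.4f}, df = {df}, reject at 5%: {reject_null}")
```

The statistic of about 6.52 lies well below 12.59, so the null hypothesis of homogeneous distributions is not rejected at the 5% level.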
Example 3:
Kovalchik et al. (2005) compare the behavior between two samples of young students (Younger) and elderly adults (Older). In their last task, subjects play the p-beauty contest game with p=2/3. Subjects’ choices (guesses) are displayed in the authors’ Figure 5 using a stem-and-leaf diagram, based on which the authors argue that the Younger and Older subjects behave similarly in the game. We show how one can compare the distribution of choices of the Older and Younger more formally by using the Kolmogorov-Smirnov test for two independent samples (as done for the same p-beauty contest game, for example, in Grosskopf and Nagel, forthcoming in GEB).

Before we do that, we plot the sample distributions of choices for the Younger and Older subjects, noting informally that normality is likely violated, especially for the latter group. Also, verifying the homogeneity assumption by conducting the F test for two population variances (for example, using the command “sdtest” in Stata), we find that the variances of the two choice distributions differ significantly from each other, with a two-tail p-value of 0.036. Thus neither the t test for two independent samples nor the Mann-Whitney U test would be entirely appropriate, while the Kolmogorov-Smirnov test is more appropriate in that it does not rely on the shape of the population distributions being identical.

[Figures: kernel density estimates of the choices of the Younger (kernel = epanechnikov, bandwidth = 4.25) and of the Older (kernel = epanechnikov, bandwidth = 7.29).]

When comparing the choice distributions of the Younger and Older using a two-tailed Kolmogorov-Smirnov test for two independent samples (for example, using the command “ksmirnov” in Stata), the test indicates no significant difference between the two choice distributions at any conventional significance level (p=0.337), which confirms the argument of Kovalchik et al. (2005).
We note that a similar result would be obtained using the Mann-Whitney U test (using, for example, the Stata command “ranksum”) for which the two-tail p-value is 0.745.

5. Nonparametric alternatives to the t test for two dependent samples
5.1 Wilcoxon matched-pairs signed-ranks test [6]
5.2 Binomial sign test for two dependent samples [7]
5.3 McNemar test

[6] This test is a two-sample extension of the Wilcoxon signed-ranks test (see Section 3.1).
[7] This test is a two-sample extension of the binomial sign test for a single sample (see Section 3.3).

5.4 Examples of using tests for two dependent samples

Example 1:
Blume and Ortmann (2005) study the effect of pre-play communication on behavior in median- and minimum-effort games. Here we evaluate a “convergence” research hypothesis in the loose sense of whether, in the two treatments with pre-play communication (i.e., Median sessions M1Me-M8Me and Minimum sessions M1Min-M8Min in Figures 1 and 2, respectively), there is an overall upward drift in subjects’ choices towards the Pareto-efficient choice between the first and last round. For various reasons, this alternative definition of choice convergence may be less appropriate than that used by Blume and Ortmann, but it will serve the purpose of illustrating the use of tests for two dependent samples.

6. A brief discussion of other statistical tests
6.1 Tests for evaluating population skewness and kurtosis
6.2 Tests for evaluating population variability

A LITTLE ASIDE OF RELEVANCE HERE:

From: "dan friedman" <[email protected]>
To: "ESA Experimental Methods Discussion" <[email protected]>

Hi Timothy (and Karim and Karl)-

As Karim says, this is an old and sensitive topic. Your reasons for randomly pairing are sensible. Karim's suggestion on smaller groups does help increase the number of (pretty much) independent observations, but at the same time it tends to undermine your goal to study experienced players in a one-shot game.
(with smaller groups, the repeated interactions might alter incentives.)* It is a dilemma, especially for experimentalists with limited budgets. Independence is a strong assumption in general. When your observations are not independent, but you run the usual parametric or nonparametric stats as if they were independent, then you still get unbiased estimates but the significance level is overstated. How overstated it is hard to say. My own standard approach to such matters is to report bounds: run the tests with individual actions, report the significance as a lower bound on the pvalue, and also run tests on session averages, providing an upper bound on the p-value. In some cases, you can run regressions with individual subject effects and/or session effects (fixed or random) that plausibly capture most of the problem, but I would still regard these as overstating the significance level. I'd regard the position taken by Karl and Karim as fairly conservative, but not the most conservative. There might be lab or experimenter or instruction or subject pool effects, so even session averages aren't quite guaranteed to be independent. Most people don't worry about that, at least formally, but everyone is happier to see a result replicated in a different lab. My bottom line: think through the econometrics before you start, but in the end, you must create a lab environment that corresponds to what you want to test. If that requires random matching (or mean-matching), so be it. Just be prepared to deal with conservative referees afterward. --Dan *of course, you can mislead your Ss into believing that they might be matched with any of the 23 others when in fact they are matched only with 5 others ... but some referees will condemn even that mild sort of deception. Dilemmas abound! 
From: "John Kagel" <[email protected] To: [email protected], "ESA Experimental Methods Discussion" <[email protected]> 34 I am loath to reply to the group as a whole on anything but I could not disagree more strongly with Karim's remarks. Notice that the operative word with respect to full and complete contamination (so that the outcome reduces to SINGLE observation) is MAY so where's the proof that it happens?? It's pretty thin as far as I can tell. And there are tradeoffs as I outline below: Each of XX's sessions had 12 subjects, who were told they would be randomly matched with another participant. In practice, the set of 12 subjects was divided into 3 groups of 4 subjects each with rotation within each group. This was done in an effort to obtain "three independent sets of observations per session instead of only one" as the unit of observation employed in the analysis is primarily session level data. The idea behind "only one" independent observation per session if randomly rotating among all 12 bidders in the session is given the repeated interactions between subjects this generates session level effects that will dominate the data. In this regard he is among a growing number of experimenters who believe this, and who break up their sessions into smaller subgroups in an effort to obtain more "independent" observations per session. This practice [of what Karim calls below, sterile subgroups] ignores the role of appropriate panel data techniques to correct for dependencies across and between subjects within a given experimental session. There are several important and unresolved issues in choosing between these two procedures. In both cases experimenters are trying to squeeze as much data as they can from a limited subject-payment budget. 
As experimenters who have consistently employed random rematching between all subjects recruited for a given session, and applied panel data analysis to appropriately account for the standard errors of the estimates, we are far from unbiased with respect to this issue. With this in mind we point out several things: First, advocates of repeated matching of the same small sbset of subjects within an experimental sessions to generate more "independent" observations ignore the fact that there is no free lunch as: (i) they are implicitly lying to/deceiving subjects by not reporting the rotation rule employed and (ii) if subjects are as sensitive to repeated matching effects as they seem to assume under random matching between all subjects in a given experimental session, it seems plausible that repeated play within a small subset might generate super-game effects that will contaminate the data. Second, and more importantly, there have been a few experiments which have devoted treatments to determine the severity of possible session level effects from random rematching for the group as a whole. More often than not these studies find no differences, e.g. Cooper et al. (1993; footnote 13, p. 1308), Duffy and Ochs (2006). Also see Walker et al. (1987) and Brosig and Rei (2007) who find no differences when comparing bids in auctions with all human bidders against humans bidding against computers who follow the RNNE bidding strategy. For more on the econometrics of this issue see Frechette (2007). JK On Mar 9, 5:45 am, karim <[email protected]> wrote: 35 > Dear Timothy, > > these discussions are very ancient and most of us were hoping that they will never come back. > If I'm getting you right, you have 24 subjects interacting over the > course of your experiment. Obviously, if one of them does freaky > things, this "virus" may spread through your entire subject pool. 
Just > imagine it is really a virus: Anybody who has contacted the person > with the virus may be contaminated. Anybody who has contacted anybody > who has contacted the person with the virus may also be contaminated. > Anybody ... etc. > > Since everyone in your experiment has interacted - i.e. there are no > steril subgroups - you end up with a single independent observation. > Analyzing the subjects on an individual level is a good idea, but it > doesn't make them independent, because they have been interacting, > i.e. contaminating, each other in the course of the experiment. > > It is a good idea to let the subjects gain experience by playing > randomly matched games over and over for a couple of rounds, but you > don't have to mix them all with one another. You can take 24 subjects > and separate them into 2 or 3 independent subgroups (i.e. "steril" > subgroups) and then rematch in the subgroups. Add 2 sessions of this > kind to your experiment and you will have a total of at least 5 and at > most 7 independent observations. That sounds like a good minimum > number of ind obs for an experimental paper. > > When analyzing the data, you can then use non-parametric statistics or > run regressions, in which you explicitly take the random effects that > pertain to the subgroup membership into account. So you see, once you > have enough observations, all types of statistical testing are > available. > > If you don't trust me on all this (and why should you?!), please, look > it up in one of the many textbooks on experimental economics or on > experimental methods in general (e.g. in social psychology). > > Sorry folks, for putting this on the list. I really don't often bug > you, by speaking up on the list. But, this time I felt I had to say > something out loud to remind us all of the importance of sticking to > the word "science" in ESA, instead of replacing it with "speculation." 
> It certainly will be better for the reputation of experimental > economics, if we stick to the minimum requirements that were hammered > out many years ago. Letting go of what was once a general agreement > amongst experimental economists will deteriorate our credibility and 36 > our standing within economics and amongst the sciences. > > Greetings, > karim > > On 9 Mrz., 08:42, Timothy Dang <[email protected]> wrote: > > > Hello Karl et al> > > On Sun, Mar 8, 2009 at 8:10 AM, Karl Schlag <[email protected]> wrote: > > > I would prefer to discuss either why you thought this is the right decision > > > or what kind of parametric model you could apply that allows for > > > dependencies. > > > First, an off-list reply prompts me to be a bit more specific. I have > > 24 subjects--12 in each of two roles--playing a 2x2 game for 50 > > periods, with random re-matching every period. I'm not aiming to treat > > each of the 600 game-plays as an observation. Rather, for most of my > > analysis, I'm planning to treat a player as an observation. For > > instance, one data point would be "How many times did subject 5 play > > Up?" > > > My motivation for random re-matching was the traditional motivation: I > > wanted to get play which is strategically close to one-shot. But I > > also want the subjects to have an opportunity to learn the game, and > > I'd like data with a bit less variance than I'd to get from truly > > one-shot play. > > > > In such a forum I would not be in favor of discussing how to justify > > > treating observations independently when these are not independent. > > > I hope others don't share your qualms ;). I hope it's clear that I'm > > not looking for a way to mis-represent my data. I made a pragmatic > > decision, and I'm looking for the best way to present the results > > which is both informative and forthright. 
> > > -Timothy > > > > Timothy Dang wrote: > > >> Hello ESA> > > >> I've recently run a 50-period 2x2 game experiment, with random > > >> re-matching of players each period. Now I need to report the results. > > >> I know random re-matching has tradition behind it, and I also know > > >> that it's been subject to some good criticism when statistics get > > >> applied as if the games are independent. In spite of those critiques, > > >> it seemed the right decision to me. > > > >> But now, what's the right way to actually report the results? My best > > >> feeling is that I should go ahead with the stats as if the > > >> observations were independent, but with the conspicuous caveat that > > >> this isn't truly legit. I've been having trouble finding clean > > >> examples of how this is handled in recent papers. > > > >> Thanks > > > >> -Timothy > > > > -> > > --------------------------------------------------------------------> > > Karl Schlag > > > Professor Tel: +34 93 542 1493 > > > Department of Economics and Business Fax: +34 93 542 1746 > > > Universitat Pompeu Fabra email: [email protected] > > > Ramon Trias Fargas 25-27 www.iue.it/Personal/Schlag/ > > > Barcelona 08005, Spain room: 20-221 Jaume I > > > -> > -----------------------------> > Timothy O'Neill Dang / Cretog8 > > 520-884-7261 > > One monkey don't stop no show.

### Harrison, G.W. & Lau, M.I. 2005. Is the evidence for hyperbolic discounting in humans just an experimental artifact? Behavioral and Brain Sciences 28(5): 657

- What’s the basic point?

The elicitation of the time value of payments at different points in time (specifically at t = 0, and t > 0) may be confounded by transaction costs (including experimenter commitment/credibility issues) afflicting the later payments in the typical comparison scheme; these transaction costs may lead subjects to discount the time value of those later payments.
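A small numerical sketch may help to see how such a confound operates. It assumes, purely for illustration, a subject who discounts exponentially with an annual factor of 0.9 but who additionally scales down every payment that is not immediate by a "credibility" factor of 0.95; both numbers, the one-month delay, and the function names are our assumptions, not estimates from the literature. The implied annual discount rate is backed out from the subject's indifference between a sooner and a later payment.

```python
def subjective_weight(t, delta, credibility):
    """Weight the subject puts on one unit paid at time t (in years).
    Immediate payments (t == 0) are fully trusted; later ones are scaled down."""
    return delta**t if t == 0 else credibility * delta**t

def implied_annual_rate(start, horizon, delta=0.9, credibility=0.95):
    """Annualized discount rate implied by indifference between a payment
    at `start` and a payment at `start + horizon`."""
    gross_return = (subjective_weight(start, delta, credibility)
                    / subjective_weight(start + horizon, delta, credibility))
    return gross_return ** (1.0 / horizon) - 1.0

horizons = [0.5, 1, 2, 4]  # years

# "Money today" versus "money later": only the later payment bears the transaction cost.
rates_no_delay = [implied_annual_rate(0, h) for h in horizons]
# Both payments delayed by one month: the transaction cost cancels out.
rates_with_delay = [implied_annual_rate(1 / 12, h) for h in horizons]

for h, r0, r1 in zip(horizons, rates_no_delay, rates_with_delay):
    print(f"horizon {h:>3} y: no delay {r0:.3f}, with delay {r1:.3f}")
```

In this sketch the rates elicited against an immediate payment fall as the horizon lengthens, although the underlying preferences are exponential, while the rates elicited when both payments are delayed stay constant at 1/0.9 − 1.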
As Harrison & Lau say, “(t)he subject is being asked to compare ‘good apples today’ with ‘bad apples tomorrow’”. This effect induces deviations from the exponential curve towards a curve that is more bowed / present-oriented (“hyperbolic discounting”, or “time inconsistency”).

Conceptually, a FED (front end delay) can be used to understand the importance of this confound (e.g., is the discount rate for a given horizon elicited with a FED different from the discount rate for the same horizon elicited with no FED?). In fact, Harrison, Lau and Williams (AER 2002) use a FED in a field experiment in Denmark and find that elicited discount rates are “proximately invariant” with respect to horizon.

The problem: There are settings where payments at t = 0 (“money today”) have to be compared to payments at t > 0 (“money in the future”); the Harrison, Lau and Williams result seems to suggest that transaction costs (lack of credibility etc.) are the source of present bias. Similarly, Harrison & Lau (2005) argue that the results in Coller & Williams (EE 1999) point in the same direction. The work of Coller, Harrison & Rutstroem (wp 2003) suggests that it takes as little as a 7-day FED to overcome the effects of subjective transaction costs.

Related manuscript:
- Andersen, S., Harrison, G.W., Lau, M.I. & Rutstrom, E.E. 2008. Eliciting risk and time preferences. Econometrica. Key result: Eliciting risk and time preferences together reduces time discount rates dramatically. (Very important paper.)

Infinite repetition: [Drawing also on Kreps, Binmore, and also MCWG 12D and 12App]

The main idea: If a game is played repeatedly then the mutually desirable outcome (Nash equilibrium) *may* differ from that of the “stage game”. Imagine that A plays the 1sPDG not just against one P but many P’s indexed P1, P2, ... Each Pn is only interested in the payoff from his interaction with A. For A, however, an outcome is now a sequence of results -- what happens with P1, what happens with P2, etc. ...
– and hence the sum of payoffs u1, u2, etc., duly discounted:

u1 + δ u2 + δ² u3 + ...

where δ ∈ (0,1) could be the discount factor associated with the fixed interest rate r.

Crucially, prospective employee Pn, when deciding whether to take employment, is aware of A’s history of treatment of workers. This set-up changes the decision problem for A, who now has to compare the payoffs from exploiting and from not exploiting. Whether the former payoff-dominates the latter is a function of the strategies being used. For example, if all workers follow the decision rule “Never seek employment with an employer who in the past exploited a worker”, then the employer faces a “grim” or “trigger” strategy. (This is nothing but what MCWG, p. 401, call the “Nash reversion strategy: Firms cooperate until someone deviates, and any deviation triggers a permanent retaliation in which both firms thereafter set their prices equal to cost, the one-period Nash strategy.”)

In terms of our example, the decision problem of the employer becomes

1 + δ + δ² + ... > 2 + 0 + 0 + ...

which holds if δ > 1/2, since the left-hand side sums to 1/(1 − δ). So, for all δ > 1/2, the “grim” or “trigger” strategy constitutes a subgame perfect equilibrium in the PA game as parameterized above. (Compare to equations 12.D.1 and 12.D.2 in MCWG: note that the above result is nothing but Proposition 12.D.1 for the 1sPDG. Proposition 12.D.1 deals with the 2sPDG case of the indefinitely repeated Bertrand duopoly game.)

Note: A cool result here (which is highly applicable to experimental work): Whenever two people interact in that kind of scenario and you watch them for just one instance, you might think they are really nice, altruistic, what not, when in fact they are just self-interested utility maximizers (!)

Continuing with in(de)finite repetitions – see Binmore 2007, pp. 328 - 346:

Unfortunately, the “grim” or “trigger” strategy is one of many, many strategies.
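Before turning to those other strategies, the employer's comparison above can be checked numerically. The following is a minimal sketch, with δ denoting the discount factor and using the payoffs of the example (1 per period when not exploiting; 2 once and 0 forever after when exploiting). The function names are hypothetical and chosen only for this illustration.

```python
def value_of_not_exploiting(delta, per_period_payoff=1.0):
    """Discounted sum 1 + delta + delta^2 + ... = 1 / (1 - delta)."""
    return per_period_payoff / (1.0 - delta)


def value_of_exploiting(delta, one_time_payoff=2.0, punishment_payoff=0.0):
    """Payoff 2 today, then the punishment payoff in every later period."""
    return one_time_payoff + delta * punishment_payoff / (1.0 - delta)


def grim_trigger_deters_exploitation(delta):
    """True if not exploiting is strictly better than exploiting once."""
    return value_of_not_exploiting(delta) > value_of_exploiting(delta)


if __name__ == "__main__":
    for delta in (0.3, 0.5, 0.7, 0.9):
        print(delta,
              value_of_not_exploiting(delta),
              value_of_exploiting(delta),
              grim_trigger_deters_exploitation(delta))
```

Changing the one-time or punishment payoff in the function defaults shows how the threshold on δ moves with the parameterization of the stage game.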
Let’s focus, as Binmore does, on those strategies that can be represented by finite automata (“idealized computing machines”) that can remember only a finite number of things (actions) and therefore cannot keep track of all possible histories in a long repeated game. Figure 11.5 shows all 26 one-state and two-state finite automata that can play the PDG.

[Figure 11.5: the 26 one-state and two-state finite automata]

The folk theorem is of fundamental importance for political philosophy. But it creates an equilibrium selection problem: Which of the many equilibria will ultimately be selected? (Well, it depends.)

### Pedro Dal Bo, Cooperation under the Shadow of the Future: Experimental Evidence from Infinitely Repeated Games - guiding questions (and some answers)

- Infinitely or indefinitely?
- What’s the purpose of this paper? (see abstract, intro, conclusion)
- What is “the shadow of the future”?
- What is an innovation of this paper? (p. 1591) Or, in other words, why is it problematic to assume, without further controls, that an increase in cooperation brought about by an increase in the probability of continuation is due solely to the increase in the probability of continuation? (e.g., p. 1594)
- What are the simple stage games used in this article?
- What’s the difference between the two games called PD1 and PD2 and shown in Table 2? (p. 1595) Intuitively, for which of these two games would you expect more collaboration? Why?
- Does Table 3 confirm your intuition? See also hypotheses 3 and 4.
- What exactly was the design? (pp. 1595 – 1597)
  o How did subjects interact?
  o What exactly was the “matching procedure”? (p. 1595, p. 1596)
  o Was there a trial period? (pp. 1598 – 1599)
  o What were the players’ earnings? (p. 1595, p. 1598; see also the first paragraph in section III. on p. 1597)
  o What were the three important new elements of the experimental design? (p. 1595 - 1597)
  o What exactly is the relation of the “Dice” sessions to the “Finite” sessions? (Make sure you understand fn 16 well!)
  o What distinguishes “Normal” from “UD” sessions?
  o Explain exactly why this experiment consisted of eight sessions with three treatments each. (p. 1597)
- What exactly are the theoretical predictions? (pp. 1597 – 1598) Explain what Table 3 says!
- Explain why the first two hypotheses are fairly general while the last two are specific. (Make sure to connect your answer to what you see in Table 3.)
- How exactly were the experiments implemented? (p. 1598)
- What exactly were the results of the experiment? Discuss Table 4 in detail.
- Discuss Table 5 in detail. What are some of the key results?
- Does the shadow of the future increase cooperation (as suggested by theory)? (pp. 1599 – 1600) If so, how large is the effect? Discuss Table 6 in detail.
- How do levels of cooperation differ between Dice and Finite sessions? (pp. 1600 – 1601) Pay attention to the analysis of individual actions (“strategies”).
- Do payoff details matter? (pp. 1601 – 1602) And how much do they matter?
- What are the conclusions (the key findings) that can be drawn from the study?
  o strong support for the theory of infinitely repeated games
  o the shadow of the future matters: it significantly reduces opportunistic behavior
  o more cooperation as the continuation probability goes up
  o more cooperation in indefinitely repeated games than in finitely repeated games with the same expected length
  o behavioral differences in reaction to seemingly small payoff changes
  o interesting (and unpredicted) effects of experience, which might be of interest for theories of equilibrium selection in indefinitely repeated games
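The comparison between indefinitely repeated games and finitely repeated games "with the same expected length" rests on a simple fact: if a match continues after each round with probability δ, the expected number of rounds is 1/(1 − δ). The sketch below checks this by simulating random termination. The continuation probabilities and function names are illustrative assumptions for these notes, not a description of the procedure in the paper.

```python
import random


def expected_rounds(continuation_prob):
    """Expected number of rounds when the match continues with this probability."""
    return 1.0 / (1.0 - continuation_prob)


def simulate_match_length(continuation_prob, rng):
    """Play one match: round 1 always happens, then continue at random."""
    rounds = 1
    while rng.random() < continuation_prob:
        rounds += 1
    return rounds


def mean_simulated_length(continuation_prob, n_matches=100_000, seed=0):
    """Average length over many simulated matches (seeded for reproducibility)."""
    rng = random.Random(seed)
    total = sum(simulate_match_length(continuation_prob, rng)
                for _ in range(n_matches))
    return total / n_matches


if __name__ == "__main__":
    for delta in (0.0, 0.5, 0.75):
        print(delta, expected_rounds(delta), round(mean_simulated_length(delta), 2))
```

A finitely repeated game with a known horizon equal to `expected_rounds(delta)` has the same expected length but no shadow of the future in its last round, which is what makes the comparison informative.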
