10/8/19

Recently, three scientists jointly won the 2019 Nobel Prize in Physiology or Medicine for their pioneering work on how human cells sense and respond to changing oxygen levels: William Kaelin Jr., Sir Peter Ratcliffe, and Gregg Semenza. I understand that science is certainly __much more__ than just statistics, but I thought to myself, "We hear from critics how p-values, statistical significance, frequentism, etc., are supposedly bad for science, but they never seem to talk about the good these techniques do, or they outright deny there is any. What if these Nobel Prize winners use p-values and statistical significance?"

So I checked a sample of the laureates' papers, past and present, in the general area of research for which their Nobel Prize was awarded. Note that this is not to say the researchers never use any Bayesian techniques, or that they always use p-values and statistical significance. In fact, I found a small number of papers using only Bayesian techniques, some using frequentist and Bayesian techniques together, and some using none at all. Rather, this is to show that scientists of the highest caliber, doing some of the most important work, use p-values and statistical significance. I also didn't check publication dates carefully, but I'm fairly sure some of their papers were published after the ASA II editorial and the Nature piece urging researchers to stop saying "significant" and to stop dichotomizing.

Here is a small sampling of the results:

- "P values from pairwise Wilcoxon's rank-sum tests"
- "P value was calculated by Pearson's correlation"
- "P values were calculated by Students' t test"
- "P values were estimated from the empirical Bayes moderated t statistics, and q values were estimated using the Benjamini-Hochberg method"
- "Pairwise comparisons between groups (experimental versus control) were performed using an unpaired 2-tailed Student’s t test or Kruskal-Wallis test as appropriate. P < 0.05 was considered to be statistically significant."
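As an aside for readers unfamiliar with the Benjamini-Hochberg method mentioned above: it adjusts p-values from many simultaneous tests to control the false discovery rate, and the adjusted values are often reported as q-values. Here is a minimal pure-Python sketch (the function name and example p-values are mine, not from any of the papers):

```python
def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvals)
    # Indices of p-values, sorted from largest to smallest.
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for step, i in enumerate(order):
        rank = m - step  # 1-based rank of pvals[i] among the sorted p-values
        # Enforce monotonicity: an adjusted p-value can't exceed
        # the adjusted value of any larger raw p-value.
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Example: four raw p-values from four hypothetical tests.
benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```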

- "For all panels, data presented are means ± SD; *p < 0.05; **p < 0.01; ***p < 0.001. Two-tailed p values were determined by unpaired t test."
- "Two-tailed p values were determined by unpaired t test. n.s. = nonsignificant."

- "P value was determined by a mean-based longitudinal mixed-effects model to accommodate repeated measurements within animals."
- "P value was determined by log-rank test."
- "Two-tailed P values were determined by unpaired t test. n.s., nonsignificant."
- "P values for all comparisons other than those pertaining to tumor growth, survival, and gender composition of mouse cohorts were calculated by unpaired two-tailed t test. For comparisons of two groups with significantly different variances, Welch’s t test was used. For comparisons of two groups without significant differences in variances, Student’s t test was used."
- "...was performed using Fisher’s exact test. Statistical significance for all comparisons was determined using a nominal P value <0.05"
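For readers curious about the Welch's t test mentioned in that last quote: the statistic and its Welch-Satterthwaite degrees of freedom are straightforward to compute. A stdlib-only Python sketch (illustrative only, not the laureates' code; the sample data are made up):

```python
import math
from statistics import mean, variance

def welch_t(x, y):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom
    for two independent samples with possibly unequal variances."""
    n1, n2 = len(x), len(y)
    v1, v2 = variance(x), variance(y)  # sample variances (n - 1 denominator)
    se2 = v1 / n1 + v2 / n2            # squared standard error of the difference
    t = (mean(x) - mean(y)) / math.sqrt(se2)
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df

# Two hypothetical groups with visibly different variances:
t, df = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```

The p-value then comes from the t distribution with (fractional) df degrees of freedom; that CDF is not in the standard library, which is why packages like SciPy are typically used in practice.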

- "P_{vi} is the P-value for gene i"
- "(P = 2 x 10^{-16}, Wilcoxon signed-rank test)"
- "(P-value < 0.0001)"

- "Data are shown as the mean ± SEM. Statistical analyses were performed using unpaired Student's t tests. For repeated measures, data were analysed by ANOVA followed by Tukey's multiple comparison test or t test with Holm–Sidak correction for multiple comparisons as appropriate and as described in Hodson et al. (2016). P < 0.05 was considered statistically significant."
- "Significance was tested using two-way ANOVAs (right hand column P value = chronic hypoxia factor; bottom row P value = genotype factor; right column, bottom row P value = chronic hypoxia/genotype interaction factor), followed by t tests (with Holm–Sidak correction) for analysis of individual time points; P < 0.05 highlighted in bold."

- "p=6.33e–14"
- "p<0.0001"
- "Mann–Whitney U-test or analysis of variance (ANOVA) followed by Bonferroni post-test for multiple comparisons was used to determine p-values."

- "*P<0.05, **P<0.01, ***P<0.001 compared to normoxia, unpaired Student’s t-test. n=3 independent experiments from 3 biological replicates"

- "Kaplan–Meier curves were generated using Kaplan-Meier plotter (kmplot.com) and the log-rank test was performed. For tumorigenicity assays, the Fisher exact test was performed. For all other assays, differences between two groups were analyzed by Student t test, whereas differences between multiple groups were analyzed by ANOVA with Bonferroni posttest. P values < 0.05 were considered significant for all analyses."
- "P < 0.0001"

A multifaceted program causes lasting progress for the very poor: Evidence from six countries

- "One year after the end of the intervention, 36 months after the productive asset transfer, 8 out of 10 indices still showed statistically significant gains, and there was very little or no decline in the impact of the program on the key variables (consumption, household assets, and food security). Income and revenues were significantly higher in the treatment group in every country."
- "All treatment effects are presented as standardized z-score indices and 95% confidence intervals"
- "The aggregate test, reported in Panel C, finds that we are not able to reject equality of means across all ten measures (p-value = 0.689)"
- "Second, given that multiple families of outcomes are being reported, we correct for the potential issue of simultaneous inference using multiple inference testing."
- "An exception is Peru, where we see three results out of ten statistically significant at the 5% level."
- "P-value from t-test of equality of means"

- "Finally, for each of these outcomes, we report both the standard p-value and the p-value adjusted for multiple hypotheses testing across all the indices."
- "* significant at the 10% level, ** at the 5% level, *** at the 1% level."
- "Confidence intervals are cluster-bootstrapped at the neighborhood level."
- "F-statistics (and corresponding p-values) are from a joint test of significance in a regression of treatment on all eight variables in each round."

"Within economics, Duflo and her colleagues are sometimes referred to as the randomistas. They have borrowed, from medicine, what Duflo calls a "very robust and very simple tool": they subject social-policy ideas to randomized control trials, as one would use in testing a drug. This approach filters out statistical noise; it connects cause and effect. The policy question might be: Does microfinance work? Or: Can you incentivize teachers to turn up to class? Or: When trying to prevent very poor people from contracting malaria, is it more effective to give them protective bed nets, or to sell the nets at a low price, on the presumption that people are more likely to use something that they've paid for? (A colleague of Duflo's did this study, in Kenya.) As in medicine, a J-PAL trial, at its simplest, will randomly divide a population into two groups, and administer a "treatment" - a textbook, access to a microfinance loan - to one group but not to the other. Because of the randomness, both groups, if large enough, will have the same complexion: the same mixture of old and young, happy and sad, and every other possible source of experimental confusion. If, at the end of the study, one group turns out to have changed—become wealthier, say - then you can be certain that the change is a result of the treatment. A researcher needs to ask the right question in the right way, and this is not easy, but then the trial takes over and a number drops into view. There are other statistical ways to connect cause and effect, but none so transparent, in Duflo's view, or so adept at upsetting expectations. Randomization "takes the guesswork, the wizardry, the technical prowess, the intuition, out of finding out whether something makes a difference," she told me. And so: in the Kenya trial, the best price for bed nets was free."
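The randomization step Duflo describes is simple enough to sketch in a few lines of Python. This is a toy illustration (the `randomize` helper and the `ages` covariate are invented for the example): with a large enough sample, a random split balances covariates across the two arms, which is exactly why differences in outcomes can be attributed to the treatment.

```python
import random
from statistics import mean

def randomize(units, seed=12345):
    """Randomly split a study population into treatment and control arms."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    shuffled = list(units)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical covariate: ages of 10,000 study participants.
ages = [20 + (i % 50) for i in range(10_000)]
treatment, control = randomize(ages)
# mean(treatment) and mean(control) come out close to each other,
# so neither arm is systematically older than the other.
```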

This was for some 2019 laureates, but what about for some past years' Nobel Prizes?

- "Each curve represents 3 independent experiments of 10 mice per group. P values were calculated using the Log-rank (Mantel-Cox) test (* - p<=0.05, ** - p<=0.01, ***-p<0.001)"
- "We conclude that the combination therapy creates higher CD8/Treg ratios than CTLA-4 blockade alone (23:1 versus 11:1, p=0.0002), while also providing a significantly higher CD4 T-effector/Treg ratio compared to either a4-1BB alone (2.8:1 versus 2:1, p=0.027) or to FVAX alone (2.8:1 versus 1.8:1, p=0.0077) which a4-1BB therapy alone lacks."
- "Student’s t-tests were performed to determine statistical significance between samples (* - p<=0.05, ** - p<=0.01, ***-p<0.001)"
- "Mice receiving FVAX and both aCTLA-4 and a4-1BB show enhanced ratios of CD8+ T-cells relative to CD11b+GR-1+ MDSC when compared to either FVAX (3.9 vs. 0.7, p<0.0001) or aCTLA-4 alone (3.9 vs. 2.2, p=0.03)"

- "The treated mice showed significant increases (P=0.02) in the numbers of LCMV-specific CD8 T cells, as measured by three different MHC class I tetramers"
- "As shown in Fig. 3h, there were significant reductions in virus levels in the spleen (P=0.008), liver (P < 0.0001), lung (P=.0002) and serum (P=0.003) in the treated mice."

- "Values compared by using paired t test."
- Table 2 shows p-values

- "The graph indicates the percentage of surviving mice over time. Significant (P=0.04, log rank test) difference was found between B16GM-CSF and anti–CTLA-4–treated mice injected with control antibody and mice injected with CD4-depleting Ab"
- "Significant (A, P = 0.025; B, P = 0.0004, log rank test) differences were found between B16-GM-CSF vaccinated mice that received either anti–CTLA-4 Ab or were injected with anti-CD25 Ab plus anti-CTLA-4 Ab"
- "For each group the mean is shown +- SEM. Significant (P < 0.02, Student’s t test) difference was found between B16-GM-CSF vaccinated mice that received either anti–CTLA-4 Ab alone or in combination with anti-CD25 Ab"
- "Significant difference (P = 0.03, Student’s t test) was found between mice from groups 3 and 4"

- "Significance was determined by one-way ANOVA with Tukey post hoc analysis"
- "Data are represented as mean +- SEM. Unless otherwise noted, significance was determined using t tests (*p<0.05, **p<0.01, ***p<0.001, **** p<0.0001). ns, not significant"
- "Significance was assayed using chi-square tests"
- "Significance was determined using t tests"
- "Statistical analysis of pathological scores, flow cytometry, and immunohistochemical quantifications were performed by using Student's t test, one-way ANOVA, or Fisher's exact test with GraphPad Prism (GraphPad Software). Limiting dilution assay was evaluated using SPSS (SPSS) with chi-square tests. For survival analysis, Kaplan-Meier plots were drawn and statistical differences evaluated using the log rank Mantel-Cox test. A p value < 0.05 was considered statistically significant."

- "In order to avoid overfitting the data, p-value testing is used to determine which fragments make a significant contribution to predicting chimera folding status"
- "Blocks 1, 5, 7, and block pair 1–7 remained highly significant in the second round, whereas pairs 1–5 and 5–8 dropped in significance to p > 10^{-3}, a threshold established previously."

- "We observed that lower consensus energies are associated with higher T_{50} values (Fig. 2a; Pearson r = -0.58, P << 10^{-9}). Furthermore, folded proteins tend to have lower consensus energies than unfolded ones (Fig. 2b; Wilcoxon signed rank test P << 10^{-9})."

- "The random field’s expected fraction of functional sequences shows quantitative agreement with experimental results (r=0.95 with p<0.005). Error bars represent the binomial 95% confidence intervals calculated using the Clopper-Pearson method. (B) The expected additivity agrees well with experimentally determined values (r=0.78 with p=0.21). While the small data set limits the statistical significance of this correlation, all E[A]s are large and within the ranges that are observed experimentally."
- "A three-way analysis of variance shows the protein fold (p<0.001), specific breakpoints (p<0.001), and parent sequence identity (p<0.001) all make significant contributions to the E[f_{F}]." (This paper also references "Fisher's fundamental theorem of natural selection", from the same R.A. Fisher of statistics fame.)

- "This observation, which has a significance of 5.9 standard deviations, corresponding to a background fluctuation probability of 1.7 x 10^{-9}, is compatible with the production and decay of the Standard Model Higgs boson."
- "95% confidence level (CL)"
- "Figure 1 shows the expected local p-values ..."
- "Both the local and global p-values can be expressed as a corresponding number of standard deviations using the one-sided Gaussian tail convention."
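The "one-sided Gaussian tail convention" in that last quote is just a conversion between a tail probability and a number of standard deviations. A stdlib-only Python sketch (the function names are mine):

```python
import math
from statistics import NormalDist

def p_to_sigma(p):
    """Number of standard deviations for a one-sided tail probability p."""
    return NormalDist().inv_cdf(1.0 - p)

def sigma_to_p(z):
    """One-sided Gaussian tail probability beyond z standard deviations."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))
```

Plugging in the quoted background fluctuation probability of 1.7 x 10^{-9} gives roughly 5.9 standard deviations, matching the paper.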

Also, I and others (Mayo comes to mind) are not convinced that researchers are *not* using things like p-values and statistical significance just because they don't include them in their published papers. Researchers would have to use *some*thing like this if they are making claims about the strength (or lack thereof) of relationships, terms in models, differences in distributions, and so on. In other words, researchers could still be checking p-values and statistical significance "offline" or "behind the scenes", and then write up their papers and have them published without mentioning p-values and statistical significance, according to their personal tastes and/or arbitrary journal standards. Therefore, it is not inconceivable that the examples of p-value and statistical-significance language collected here are an *under*count.

Thanks for reading.

If you enjoyed *any* of my content, please consider supporting it in a variety of ways:

- Check out a random article at http://statisticool.com/random.htm
- Read a randomly selected poem at http://www.statisticool.com/poetrysample.htm
- Take my "Five Poem Challenge" at http://www.statisticool.com/fivepoemchallenge.htm
- Become a Patreon here
- Follow me on Twitter here
- Buy what you need on Amazon from my affiliate link
- Visit my Amazon author page and read my books
- Share my Shutterstock photo gallery
- Sign up to be a Shutterstock contributor