(Updated 11/16/18)

8/14/18


This article is my response to various arguments against "frequentism", which is roughly defined as:

**frequentist definition of probability**

- Implicit in Aristotle, but see:
  - System of Logic by John Stuart Mill
  - The Theory of Probability and The Concept of Probability in the Mathematical Representation of Reality by Reichenbach
  - The Logic of Chance by Venn
  - Probability, Statistics, and Truth and Mathematical Theory of Probability and Statistics by von Mises
- Two main types: hypothetical (which includes infinite) and finite

**frequentist concept/practice of statistics**

- sample space/distribution
- fixed/constant parameters
- hypothesis testing, commonly called "null hypothesis significance testing" or "NHST"
- p-values, confidence intervals
- simulation, permutations, bootstrap

Representative criticisms of frequentism include:

- 15 Arguments Against Finite Frequentism and 15 Arguments Against Hypothetical Frequentism by Hajek
- The Fallacy of the Null-Hypothesis Significance Test by Rozeboom
- The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives by Ziliak and McCloskey
- A Litany of Problems With p-values, My Journey From Frequentist to Bayesian Statistics, and Null Hypothesis Significance Testing Never Worked by Harrell
- A Dirty Dozen: Twelve P-Value Misconceptions by Goodman
- Everything Wrong With P-Values Under One Roof and Is Presuming Innocence A Bayesian Prior? by Briggs
  - My response to the latter: Boole's Probability of Judgments

**Frequentism does not take all types of uncertainty into account, so it cannot hold as a concept of probability**

Frequentism is *the definition* of __probability__. Much like a specific science limits its field of study to be well-defined, frequentism *purposefully limits* probability to long-term relative frequency instead of granting every type of general uncertainty the same status as probability. A frequentist certainly *could* model some of these other types of uncertainty, but understands that doing so becomes an exercise in modelling under strong assumptions, not studying probability per se. This is not to say that studying non-relative-frequency interpretations is unimportant, or that *you* shouldn't learn about them or use them.

**Frequentism cannot handle n = 1 or one-time events.**

No approach to probability or statistics has a very satisfactory answer for n = 1, small samples, or one-time events. For n = 1 you can __only__ have 0% or 100% if using relative frequency to define probability. However, a relative frequency would change with more data as n increases away from 1. In some cases we *can* assign probability to single events using a prediction rule, for example P(A_{n+1}) = xbar_n, where it is just a matter of choosing an appropriate statistical model, as Spanos notes, and making your assumptions known. There is also a "many worlds" interpretation of frequentism: the "sci-fi" idea that, for a one-time event with probability p = X/N, the event occurred in X worlds out of the N worlds, and this one-time event just happened to occur in our world. To some (but not to me) this "answers" the paradox of trying to supply a probability for one-time events using frequentism.

**Strong Law of Large Numbers (SLLN)**

The SLLN says that it is almost certain that between the m-th and n-th observations in a group of length n, the relative frequency of Heads will remain near the fixed value p, whatever p may be (i.e. it doesn't have to be 1/2), and be within the interval [p-e, p+e] for __any__ small e > 0, provided that m and n are sufficiently large. That is, P(relative frequency of Heads in [p-e, p+e]) > 1 - 1/(m*e^2). von Mises talked about such sequences, Wald proved their existence, and Kolmogorov even rested his axiomatic probability on it.

**Referring to coin flip experiments is too simplistic to be useful for real life**

On the contrary, these are the simplest experiments for discussing probability and statistics, so we don't get bogged down in the weeds and go off course. Note that in a coin flip experiment (I'm not talking about statistics or mathematics here, just the experiment) one does not need to refer to any likelihood or prior.

**How do you know frequencies are stable/converging?**

The Strong Law of Large Numbers (SLLN) provides the mathematical theory, but one can simply observe, in coin flip experiments for example, the relative frequency of heads settling down to a horizontal line, getting closer as the number of flips increases. What is this limiting behavior if not "probability"? It certainly isn't a subjective belief.

**There are repetitive events that have probabilities that don't converge, and this refutes the frequentist notion of probability.**

Actually, this refutes only the claim that these specific sequences have anything to do with probability, a claim frequentists never made in the first place. As von Mises and Wald detail, the required properties of the sequences are very specific, for example convergence and the irrelevance of place selection (randomness).

**You can never observe an infinite amount of trials**

One can never actually observe an infinite number of ever-skinnier rectangles under a curve either, yet we are confident that integration (area under a curve) works. The long-term relative frequency "settles down" in [p-e, p+e] by the Strong Law of Large Numbers (SLLN) for any small e > 0. We can get closer and closer to p, whatever p is. In our finite world, we can say that for any very small d > 0, if at the end of n trials |f_n - p| < d, then we are justified in saying f_n ~ p (read "f_n is approximately p") for all intents and purposes. If the "true" p is .5, for example, do you worry whether the observed relative frequency is .4999999999 or .500000001? Engineers don't need to take all digits of pi (= 3.14159...) into account to do engineering. There are, moreover, finite versions of the laws of large numbers. I'd say if infinity is too large, how about we agree on 1,000,000,000 (much less than infinity)? Why, if we already have at least 1 trial, and we actively plan for replication in science, is the notion of repeating trials, even hypothetically, unbelievable? Note that the "relies on an infinite number of trials" and "bad for one-time events" charges both ignore the middle ground of finite frequentism.
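To see this settling-down concretely, here is a minimal Python sketch (my own illustration; the coin probability, flip count, and seed are arbitrary choices) that tracks the running relative frequency of heads:

```python
import random

def running_relfreq(n_flips: int, p: float = 0.5, seed: int = 42) -> list[float]:
    """Running relative frequency of heads after each of n_flips flips."""
    rng = random.Random(seed)
    heads = 0
    freqs = []
    for t in range(1, n_flips + 1):
        heads += rng.random() < p  # adds 1 on heads, 0 on tails
        freqs.append(heads / t)
    return freqs

freqs = running_relfreq(100_000)
print(freqs[9], freqs[-1])  # rough early on, close to p = 0.5 by the end
```

Plotting `freqs` against the flip number shows the familiar curve flattening onto a horizontal line at p.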
**Bayesian updates probability, frequentism doesn't**

The Bayesian saying is "today's posterior is tomorrow's prior", even if that is rarely actually done in practice. However, a cumulative relative frequency "updates" itself over trials, without using any beliefs. See Streaming mean and standard deviation, which discusses

relfreq(Heads)_t = ((t-1)*relfreq(Heads)_{t-1} + I_t)/t,

where I_t = 1 if Heads is observed on the t-th trial and 0 otherwise. Of course, the lessons learned and results from experiments are used to inform future experiments and projects. Are there examples of Bayesian updating (posterior_t used as prior_{t+1}) being done long-term? I'd only rely on the results if they had good long-term frequentist properties. Updating will be bad if there is garbage in (the GI from GIGO). Do Bayesians guarantee that at any time t along the way there will not be garbage in? See Compounding Errors for the general idea of how small errors now can create big errors later in a process. Owhadi wrote, *"How do you make sure that your predictions are robust, not only with respect to the choice of prior but also with respect to numerical instabilities arising in the iterative application of the Bayes rule?"*

**The probability, as frequentists define it, can only be in the form a/b, where a and b are natural numbers**

This is not an issue, as probability is defined in the limit. We could also argue that no one would ever actually observe a probability of 1/pi, for example, but only the digits their measuring device shows them. Additionally, I'd rather be confined to ratios of natural numbers from experiments, and their limits, than allow probability based on subjective beliefs.

**There is Bayesian uncertainty, propensity, and other definitions of probability or approaches**

Yes, and these are all inferior definitions or approaches (in my opinion). I address Bayesian primarily on this webpage. Propensity relies on frequentism, so it is redundant. For likelihood, see Why I am Not a Likelihoodist by Gandenberger; my summary is that likelihoodism gives no good guidance for belief or action.

**Clearly parameters are random variables (Bayesian) and not fixed constants (frequentism)**

I'd say we are, of course, intuitively "uncertain" about the values of most parameters, but also that they are fixed constants, at least at a given time t. For example, what is the total weight of everyone in the United States *right now* (time = 1)? It is W_1. Rather, I should say it *was* W_1, but *right now* (time = 2) it is W_2. W_1 and W_2 were (and still are) certainly unknown constants. Is c, the speed of light, *really* a constant forever, or does it change over time, with us just witnessing c_t for the time period we are in?
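The cumulative relative-frequency update discussed above translates directly into code (a minimal sketch of mine; the function name and example flips are illustrative):

```python
def update_relfreq(prev_freq: float, t: int, heads: bool) -> float:
    """One streaming update: f_t = ((t-1)*f_{t-1} + I_t) / t."""
    return ((t - 1) * prev_freq + (1 if heads else 0)) / t

# The streaming result matches the batch relative frequency.
flips = [True, False, True, True, False, True]
f = 0.0
for t, h in enumerate(flips, start=1):
    f = update_relfreq(f, t, h)
print(f)  # 4 heads out of 6 flips
```

No prior or belief enters anywhere: the update uses only the trial count and the new observation.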

**My students or colleagues or clients get confused by the definitions of p-values, hypothesis testing, etc.**

Students getting confused, or a teacher being ineffective, is not __any__ justification for concluding frequentism is flawed. I've personally advised a variety of people, groups, students, and professionals, and have never had much problem communicating these concepts. With Bayesian credible intervals you are not *really* saying P(mu in interval) = .80, in my opinion, but are instead saying something like P(mu in interval | my personal beliefs/strong assumptions) = .80, or equivalently Belief(mu in interval) = .80, or Chance(mu in interval) = .80, or Uncertainty(mu in interval) = .80. Frequentism can also be easy to understand: relative frequencies converge to probability, and we can do experiments to show this. We can make errors when reasoning from data. P-values are just test statistics expressed on another scale. P-values and confidence intervals over time make for good science. Here are some graphs that can be used for teaching these concepts:

**Bayesian is "natural", we have "Bayesian brains"**

Is it natural to be forced to use Markov Chain Monte Carlo (MCMC) to solve problems? Is it natural to think of improper priors? "Natural" may simply not be a well-defined concept, but more like a preference. Keeping track of the number of times an event A occurs in N trials as N increases is *more* natural, in my opinion. Counts and histograms are examples of frequencies that are totally natural in probability and statistics.

**Bayesian mathematics is harder and frequentists just don't want to put in the effort**

Many frequentists *have* put in the effort and found that Bayesian methods were over-promising, so they weren't "getting the bang for the buck", especially since in a lot of cases the two approaches give similar answers.
Note this contradicts the "nuisance parameters are harder to deal with in frequentism" charge.

**It is too easy to get a small p-value.**

This contradicts the "it is difficult to replicate small p-values" charge.

**It is too difficult to replicate small p-values that others found.**

This contradicts the "it is too easy to get a small p-value" charge.

**Frequentist terms are too confusing. We should switch to using terms that align with Bayesian ideals.**

Some Bayesians, such as McElreath in his Bayesian Statistics without Frequentist Language talk, would like to make the following changes to our statistical vocabulary:

| Convention | Proposal |
| --- | --- |
| Data | Observed variable |
| Parameter | Unobserved variable |
| Likelihood | Distribution |
| Prior | Distribution |
| Posterior | Conditional distribution |
| Estimate | *banished* |
| Random | *banished* |

This would be a mistake: data and parameters differ in more aspects than just observed versus unobserved; likelihoods and priors are *very* different and have different uses even if both are "just distributions"; and wanting to banish the terms "estimate" and "random" is just silly. One could argue that Bayesians may want to blur the differences between likelihoods and priors, and banish the words "estimate" and "random", to blunt criticism of problematic but fundamental Bayesian concepts while simultaneously diminishing frequentist contributions. McElreath adds, however, that at times he uses these terms and that sometimes their use is OK, so determining exactly what he is proposing is rather confusing.

**Frequentists use randomness to avoid dealing with hard problems**

Modern science can use randomization to make inferences of cause and effect and to infer from samples to populations. These two examples alone have revolutionized science and our understanding of the world. One can also use randomness for spicing up exercise routines, overcoming boredom, choosing a restaurant, making flash cards for studying any topic, revitalizing chess with randomized starting positions, casinos, lotteries, making fair decisions, making scatterplots more readable by jittering, varying video game experiences with each play, generating strong passwords, shuffling the music you listen to, endeavors such as poetry and art, and on and on. Random numbers play a huge role in modern life. See The Drunkard's Walk: How Randomness Rules Our Lives by Mlodinow.

**The American Statistical Association (ASA) wrote a document against p-values**

It is important to correct critics' misinformation, over and over again: the ASA report is not anti p-values, but only says not to use a p-value, or any other single measure, as the sole deciding factor in an analysis. Here is a quote from a critic as the type of misinformation I am speaking about:

As mentioned, the ASA document was not against p-values but against the *misunderstanding and misuse* of p-values. In that document they wrote that other approaches, like Bayesian ones, "...have further assumptions". I was *always* taught not to just check p < .05 and leave it at that, but to have good experimental or survey design, give confidence intervals and graphs, avoid arbitrary cutoffs, and so on. See Regarding the ASA Statement on P-Values and The Statistical Sleuth by Ramsey and Schafer. Mayo writes, *"Misinterpretations and abuses of tests, warned against by the very founders of the tools, shouldn't be the basis for supplanting them with methods unable or less able to assess, control, and alert us to erroneous interpretations of data."* **By the way, these warnings about p-values are all things that we have known since Fisher's time.**

**Experiments**

Frequentism lends itself to experiments really well. It is especially good at discovering probabilities. See Flipping Tacks, Probability of Finding Money, How Many Cars Have Old Antennas?, and Probability of Finding Sticks for Self Defense.

**History**

The book Games, Gods And Gambling: The Origins And History Of Probability And Statistical Ideas From The Earliest Times To The Newtonian Era by Florence Nightingale David explains how the *origins* of probability and statistics were based on games of chance with simple frequency interpretations. Because the origin of probability was based on frequency concepts, one can fairly conclude frequency concepts are natural.

**Lindley said the future is Bayesian**

He is a great statistician (understatement), but this might be wishful thinking. For example, it is known (now?) that Bayesian inference is "brittle". See On the Brittleness of Bayesian Inference by Owhadi, Scovel, and Sullivan and Qualitative Robustness in Bayesian Inference by Owhadi and Scovel.
Also, Judea Pearl does not think Bayesianism is good for causality (presumably he does not think frequentism is either); see Bayesianism and Causality, or, Why I am Only a Half-Bayesian.

**Maximum likelihood estimation is also "brittle" because it does not provide the full picture of the parameter surface.**

You might just be getting a full picture of your beliefs, which might not be too useful because Bayesian inference is brittle, as already discussed. "Brittle" here refers to a specific mathematical definition. Additionally, frequentists can use more than just maximum likelihood estimation, for example the method of moments and bootstrapping.

**Bayesian is the new probability and statistics, replacing the old frequentist style of probability and statistics**

Actually, most people used to be Bayesian (Laplacian!) until results (as in, getting results) from frequentism took over in science. Bayesian methods are making a comeback due to computation being better now. Bayesian statistics is now a "pop culture" thing being rediscovered and popularized mostly in communities *outside* of statistics proper, like machine learning and artificial intelligence.

**Everyone should be Bayesian**

See Efron's Why Isn't Everyone a Bayesian and Bayes Theorem in the Twenty-First Century. Also see Senn's You May Believe You Are a Bayesian But You Are Probably Wrong. Also, Mayo details that Gelman has made the remark that *a Bayesian wants everybody else to be a non-Bayesian*: that way, he wouldn't have to divide out others' priors before he does his own Bayesian analysis.

**Bayesian credible interval interpretation is more natural, and it is what everyone using frequentist confidence intervals wants to say anyway.**

It *is* much easier to say "the probability mu is in the interval is 80%" than to reason "if we repeated this process X times, the true mu would be in 80% of the intervals", but it may not be correct, since a credible interval can be strongly influenced by subjective beliefs, and the "probability" Bayesians talk about may not be probability as properly defined but rather "chance", "uncertainty", or "personal belief". In science, we are interested in replication and objectivity, which the frequentist confidence interval gives a nod to. One could argue the other way: that Bayesians *really* want to say that their procedures have good long-term performance.

**Frequentism is too indirect. Direct statements are better.**

The logic of modus tollens (MT) says P -> Q, and if we observe not Q, we conclude not P. Here P is the null hypothesis H_0, and Q is what we'd expect the test statistic T to be under H_0. A concrete example: we agree on assuming a fair coin model. We therefore expect about 50 heads if we flip a coin 100 times. However, we observe 96 heads (96 put on a p-value scale would be an extremely small p-value). Therefore, we conclude the fair coin model is not good. This type of logical argument is valid and essential for falsification and good science a la Popper.

**Modus tollens (MT) is false when put in probability terms**

No, it is still valid, but of course we always carry risk when making decisions based on data. Modus tollens and modus ponens put in terms of probability effectively introduce bounds, much like in linear programming. See Boole's Logic and Probability by Hailperin, and Modus Tollens Probabilized by Wagner.

**Bayesian deals with nuisance factors more easily**

Note this contradicts the "frequentists don't want to deal with hard math" charge. Nuisance parameters are a ~~nuisance~~ problem for statistics based on profile likelihood ratios, but the distribution can become independent of the nuisance parameters in the limit.

**Multiple testing is confusing, and the outcome shouldn't depend on the number of comparisons**

However, recognizing and adjusting for multiple comparisons is in line with a good understanding of probability and science. Note this contradicts the "a lot of experiments leads to spurious results" charge and the "frequentists don't want to deal with hard math" charge. Because frequentists are often willing to adjust alpha, it also slightly contradicts the "using alpha = .05 is arbitrary" charge.
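The 96-heads-in-100-flips example above can be checked exactly with nothing but the standard library (a sketch of mine, not a full hypothesis-testing routine):

```python
from math import comb

def binom_upper_tail(k: int, n: int, p: float = 0.5) -> float:
    """P(X >= k) for X ~ Binomial(n, p): the exact one-sided p-value."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

p_value = binom_upper_tail(96, 100)
print(p_value)  # astronomically small: the fair-coin model is untenable
```

The tail probability is on the order of 10^-24, so modus tollens, probabilized, tells us to abandon the fair-coin model.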

**Strong Law of Large Numbers (SLLN) requires infinity**

Actually, finite versions of the laws of large numbers exist. See The Laws of Large Numbers Compared by Verhoeff. Also, consider the argument of agreeing on an n much less than infinity. Let's just agree on using n = 1,000,000. Do you truly believe you *wouldn't* learn a lot about a coin (phenomenon, claim) from that many flips (experiments, trials)?

**Sample spaces and hypothetical repeated experiments are bad, nonsensical, etc.**

We literally learn by sampling from the world. Also, if you obtained one or a few samples, it is not outrageous to suggest you can get another sample. There are *many* sample surveys, for example, that have been going on for a long time, and many that are done not only every year, but every quarter or even every month. Simulation, Monte Carlo, and the bootstrap are done in science all the time, yet these are not actually observed data. Counterfactuals are also used, and are even essential, in studying causality. See The Book of Why: The New Science of Cause and Effect, Causal Inference in Statistics: A Primer, and Causality: Models, Reasoning and Inference by Pearl for the importance of counterfactual reasoning. Counterfactual reasoning is also used in science in the notion of severity and how well a claim has been probed. See Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction by Mayo and Spanos and Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars by Mayo.

**Let's look at assumptions. Bayesian: distributional + prior assumption. Frequentism: distributional + sampling distribution assumption. You don't need a prior to be "true", you need it to be defendable. "Given this prior uncertainty, what do the data suggest?" Can you defend the existence of a sampling distribution?**

How do you "defend the existence" of a subjective prior that can be anything you believe in your mind? There's a reason sampling distributions do not have a separate variety called "subjective" like priors do. Sampling distributions have to be tied to the real world via sampling; they cannot just be anything.

**Bayesians can write down their prior while frequentists can't even write down their sample space**

One can write down the sample space, say, for N flips of a coin. For 1 flip, the sample space is S = {H, T}. For 2 flips, it is S = {HH, TT, HT, TH}, and so on. A computer does sample space enumeration easily. Consider the Monty Hall Let's Make a Deal problem. If we don't switch doors, the sample space is S = {(1,2,1,WIN), (1,3,1,WIN), (2,3,2,LOSE), (3,2,3,LOSE)}, and if we decide to switch doors, the sample space is S = {(2,3,1,WIN), (3,2,1,WIN), (1,2,3,LOSE), (1,3,2,LOSE)}. I do agree that writing down a sample space for difficult problems is... difficult, however. Note that this apparently contradicts the "frequentists don't want to deal with hard math" charge.

**Frequentist appeal to asymptotics is silly**

Actually, calling appeals to asymptotic results silly is what is really silly. The Strong Law of Large Numbers (SLLN) and the Central Limit Theorem (CLT), for example, are among the most important results in mathematical statistics. Approximations are a good thing, especially when the "exact" calculation doesn't differ much from the approximation. The CLT is a mathematical fact, and we see it work in simulations as well. There is also a "Bayesian CLT", the Bernstein-von Mises theorem. Obviously, just blindly applying asymptotics (or anything else) is not wise; statisticians simply need to make sure their sample size is large enough and check any and all assumptions (again, just like anything else) to be justified in using asymptotic theory.

Some critics have suggested that the CLT has poor performance for, say, a lognormal population. However, if there were a lognormal population, a statistician would make a histogram, observe that it is skewed, and probably consider a transformation of the data, such as a log. The critic then says "ah, but you can't do that" or "but you don't know what transformation to take". However, how can the critic even know the population is exactly lognormal to begin with? A statistician in real life would simply observe the skew, take a transform, then back-transform to get, for example, a confidence interval on the original non-transformed scale. The critic would then say "ah, but now this is an interval on the median, not the mean". The frequentist would reply that skewed distributions like the lognormal are often described better by their median than their mean, and moreover, equations exist for confidence intervals of their mean anyway. And on and on and on!
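The transform-and-back-transform procedure can be sketched as follows (my illustration; I use a normal quantile rather than a t quantile for simplicity, which is a stated approximation that is fine at this sample size):

```python
import random
from math import exp, log
from statistics import NormalDist, mean, stdev

def lognormal_median_ci(data, conf=0.95):
    """Normal-theory interval for the mean of log(data), back-transformed,
    which is a confidence interval for the median of the original data.
    Uses a z quantile instead of t for simplicity (fine for large n)."""
    logs = [log(x) for x in data]
    z = NormalDist().inv_cdf(0.5 + conf / 2)
    half = z * stdev(logs) / len(logs) ** 0.5
    center = mean(logs)
    return exp(center - half), exp(center + half)

rng = random.Random(0)
data = [rng.lognormvariate(1.0, 0.8) for _ in range(2000)]
lo, hi = lognormal_median_ci(data)
print(lo, hi)  # an interval that typically brackets the true median exp(1.0)
```

On the log scale the data are symmetric, the normal theory applies cleanly, and exponentiating maps the interval back to the original scale.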

**Frequentist hypothesis testing requires H_0 to be exactly true**

Nothing requires any model to be exactly true outside of the mathematics. If any assumptions do not hold exactly in reality, the fields of robust and nonparametric statistics can address these issues.

**One-sided hypothesis tests are biased, have greater Type I error, contribute to the replication crisis, have more assumptions, are controversial, etc.**

At OneSided.org, Georgiev addresses these unfair portrayals of one-sided hypothesis tests and many related topics. He writes: "We publish articles explaining one-sided statistical tests, resolving paradoxes and proving the need for using one-sided tests of significance and confidence intervals when claims corresponding to directional hypotheses are made. There are interactive simulations and code for simulations you can run yourself. You will also find links to related literature: both for and against one-sided tests."
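For a symmetric statistic such as z, the relationship between one-sided and two-sided p-values is simple to compute (a minimal sketch of mine, not Georgiev's code):

```python
from statistics import NormalDist

def z_pvalues(z: float) -> tuple[float, float]:
    """One-sided (upper-tail) and two-sided p-values for a z statistic."""
    nd = NormalDist()
    upper = 1 - nd.cdf(z)
    two_sided = 2 * min(upper, nd.cdf(z))
    return upper, two_sided

one, two = z_pvalues(1.96)
print(one, two)  # roughly 0.025 and 0.05
```

The one-sided p-value is half the two-sided one here, which is exactly why the directional claim being made should drive the choice of test.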

**Frequentism relies on "i.i.d." assumptions**

This is false, though obviously a lot of theory and teaching is done using "i.i.d." assumptions, with complexity increasing from there.

**Bayesian statistics uses MCMC to solve problems**

Bayesian statistics often relies on frequentist concepts for support. For example, the basic Bayes Rule is itself frequentist. In some forms of Bayesian statistics, prior distributions often come from previous experiments. Also, sampling from the posterior distribution using Markov Chain Monte Carlo (MCMC) has a frequentist *feeling* about it. For example:

- __Use a burn-in period?__ Make the number of coin flips larger than some small number, since relative frequency is "rough" for a small number of flips.
- __Use more iterations?__ Flip the coin more times; you know it will have a better chance of convergence.
- __Use more chains?__ Flip more coins; multiple pieces of evidence of convergence are better than few.
- __Start with a different seed?__ If it still converges with different seeds, this is like entering a "collective" randomly and still getting the same relative frequency.
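The multiple-chains/multiple-seeds analogy can be made concrete with a small sketch (my illustration; the seeds, burn-in length, and flip counts are arbitrary):

```python
import random

def relfreq_after(n_flips: int, seed: int, burn_in: int = 0) -> float:
    """Relative frequency of heads after n_flips fair flips,
    discarding the first burn_in flips (the MCMC burn-in analogy)."""
    rng = random.Random(seed)
    flips = [rng.random() < 0.5 for _ in range(n_flips)]
    kept = flips[burn_in:]
    return sum(kept) / len(kept)

# Four "chains" (different seeds) all settle near the same value,
# echoing MCMC convergence diagnostics across chains.
estimates = [relfreq_after(50_000, seed=s, burn_in=1_000) for s in range(4)]
print(estimates)
```

Different seeds give different sequences, but every sequence settles near the same relative frequency, which is the frequentist content of "the chains converged".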

In addition, Bayesian statistics regularly uses other frequentist concepts such as histograms, distributions, sampling, simulation, model checking, calibration, nonparametrics, and asymptotic procedures, to name a few.

**Priors**

Do Bayesians observe all values of a prior/posterior that has a continuous distribution, or the thousands of realizations from an MCMC run? If not, then they are using "data" they didn't actually observe.

**Frequentism is bad for science**

Bayesian claims that frequentism is bad for science fail to mention the examples of frequentism being good for science, which is selective reporting. There are plenty of examples of frequentism being good, if not great, for science: survey sampling, polling, quality control, the Framingham Heart Study, studies showing smoking is bad for you, Rothamsted Experimental Station experimental design, casinos, life insurance, weather prediction (NWS MOS), lotteries, the German tank problem, randomization, ecology, and Bayes Theorem itself is a frequentist theorem. I **strongly** believe that __probability and statistics are the method of the scientific method__. See the books The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century by Salsburg, and Creating Modern Probability: Its Mathematics, Physics and Philosophy in Historical Perspective by von Plato. Also see Frequentism is Good, my response to the "airport fallacy" article by Gunter and Tong that appeared in the 10/2017 issue of Significance, where I say, *"I agree that frequentism is an embarrassment, but it is actually an embarrassment of riches."*

**success of meta-analysis**

The general results of standard meta-analyses, for example by Cochrane and others, demonstrate that the compilation of frequentist results over time produces scientific knowledge.

**There are many published false positives. Therefore, frequentism is bad**

Any false positive is a result of working with data, where making a decision entails risk, as well as the fault of arbitrary journal standards, such as *requiring* "statistical significance" at p < .05 before they will consider publishing your work. The same thing could easily occur with Bayes factor (BF) cutoffs. Not to mention, Bayesian and other false positive calculators are very sensitive to the choice of priors and other assumptions. Also see Why Most Published Research Findings Are False by Ioannidis.

**There are proofs of God existing, the resurrection of Jesus, and other miracles, that rely on Bayesian statistics**

These are silly, but arguably use priors correctly, especially in the subjective Bayesian paradigm. Some argue that it was Bayes'/Price's intention to use the theorem to refute Hume's argument against miracles. For some examples, see The Probability of God: A Simple Calculation That Proves the Ultimate Truth by Unwin, The Existence of God by Swinburne, and Bayesian evaluation for the likelihood of Christ's resurrection. Note that this could not possibly occur with frequentism. Are these therefore a mark against Bayesian statistics *as a whole*? Of course not! So why should misuses or misunderstandings of frequentist statistics, hypothesis testing, or p-values count against frequentism?

**Examples of Bayesian probability or statistics not working, or paradoxes**

There are examples, but they are not well-known or popularized. There are also examples of generally good science, like Bayesian Methods in the Search for MH370, that do not seem to be working. Note that searches using these and related search theory methods have helped find submarines and other planes (although they are *never* the only thing used in the search). On this important issue, statistician Mike Chillit has said, "...while Bayesian is a powerful analysis tool in the right hands, it is not without risk. A Bayes formula that is front-loaded with a controlling assumption that MH370 'flew due south with no human input until fuel was exhausted' will always return whimsical results unless that is precisely what happened."

"By far the most serious error in this search was the attempt to make Bayesian statistics resolve location issues. In truth, Bayesian cannot be constructed even after the fact to find the correct location. It is simply not the tool for this challenge."

*"They thought they were incredibly clever. Bragged about their analysis skills in endless articles; spent more time writing a book on Bayesian than looking. They believed they'd find it within a month. But the analysis was way beyond naive."*

**The "replication crisis" was/is caused by frequentist null hypothesis significance testing**

Everyone knows that a replication is technically never absolutely identical to another replication. In real life, we come as close as we can in the experimental setup, and this is the "similar" category. Plus, we are working with random data. Whether frequentist or Bayesian, our decisions will have errors associated with them because of this fact of nature.

**Questionable Bayesian research practices**

Critics of NHST focus on questionable research practices as if they only apply to NHST. However, questionable research practices obviously exist with Bayesian approaches too. Elise Gould has said, *"Non-NHST research is just as susceptible to QRPs as NHST."*

**Wrong definition of "replication crisis"**

The standard meaning of "replication crisis" is that the effect size, statistic, or general results of a current study did not match or reproduce those of a previous, similarly designed study. However, that is not the standard experimental design definition of a "replication". The __only__ thing "replication" means in experimental design is that the similarly designed study was conducted, __not__ that it obtained a similar effect size or statistic as a previous similarly designed study. In other words, if the replication "goes the other way", that is actually *good* information for scientific knowledge, not the "crisis" the standard narrative perpetuates.

**German tank problem**

See Wikipedia's entry on the German tank problem. Frequentism worked there just fine, and Bayes did too, but I would say the frequentist solution is easier to do and explain.
The Wikipedia article says the German tank problem is "...a practical estimation question whose answer is simple (especially in the frequentist setting) but not obvious (especially in the Bayesian setting)."
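The simplicity of the frequentist answer is easy to check. With sample maximum m and sample size k, the standard frequentist (minimum-variance unbiased) estimator is m + m/k - 1. A quick simulation sketch (the function name and settings are mine):

```python
import random

def tank_estimate(serials):
    """Frequentist minimum-variance unbiased estimator for the German
    tank problem: m + m/k - 1, where m is the sample maximum and k the
    sample size."""
    m, k = max(serials), len(serials)
    return m + m / k - 1

# Simulate: true number of tanks N = 300, capture k = 5 serial numbers
# at a time, and average the estimate over many repetitions.
random.seed(1)
N, k, runs = 300, 5, 20000
estimates = [tank_estimate(random.sample(range(1, N + 1), k)) for _ in range(runs)]
mean_est = sum(estimates) / runs
print(round(mean_est, 1))  # close to the true N = 300, illustrating unbiasedness
```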

**frequentism and law** Frequentism and null hypothesis significance testing have proven very effective in law. See Legal Sufficiency of Statistical Evidence by Gelbach and Kobayashi. They say: "Our core result is that mathematical statistics and black-letter law combine to create a simple standard: statistical estimation evidence is legally sufficient when it fits the litigation position of the party relying on it. This means statistical estimation evidence is legally sufficient when the p-value is less than 0.5; equivalently, the preponderance standard is frequentist hypothesis testing with a significance level of just below 0.5."

"Finally, we show that conventional significance levels such as 0.05 require elevated standards of proof tantamount to clear-and-convincing or beyond-a-reasonable-doubt."

**Federalist Papers** See Applied Bayesian and Classical Inference: The Case of The Federalist Papers by Mosteller and Wallace. This is a __fantastic__ book where they use frequentist ("classical") discriminant analysis to determine authorship of the Federalist Papers with unknown authorship, and contrast this with a Bayesian approach. The level of detail they give is mind-boggling, and they really set the standard for these types of analyses. I'd recommend everyone read this book at some point in their statistical life. My take on this work is that the frequentist approach gives basically the same answer (spoiler: Madison) with *much* fewer assumptions and much less work (one can easily see this from the page counts of the frequentist and Bayesian sections). The Bayesian approach here is very dependent on the choice of prior distributions and parameters. I'd like to point out that their Bayesian approach also relies heavily on *frequencies* of words and combinations. Is a Bayesian analysis relying on frequencies a type of frequentism?

**We learn from sampling the world** See Sampling Algorithms by Tille. If we have a population of N things and we sample n of them, uncertainty about what is being measured decreases as n/N, the "sampling fraction", goes to 1.

**We learn from repetition** We'd have strong suspicion a coin is biased, for example, after flipping it many times, using the Strong Law of Large Numbers (SLLN) as well as frequentist results from quality control. We'd have a better strategy for game theory situations after more repetitions. See Games and Decisions: Introduction and Critical Survey by Luce and Raiffa.

**Assumptions and Ockham's Razor** I believe that frequentism has fewer assumptions going into it, because Bayes has all that frequentism has, plus priors and parameters and hyperparameters, and more overall subjectivity.
If we let E stand for an event, H_{1} for one hypothesis, and H_{2} for the other hypothesis, then Ockham's Razor is: *if hypotheses H_{1} (with m assumptions) and H_{2} (with n assumptions) explain event E equally well, choose H_{1} as the best working hypothesis if m < n*.

**severity** The notion of "severity" demonstrates frequentism and hypothesis testing and their relation to good science. See Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars by Mayo, and Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction by Mayo and Spanos. It essentially formalizes Popper's notion of thoroughly testing a claim. They write: "The intuition behind requiring severity is that: Data **x**_{0} in test T provide good evidence for inferring H (just) to the extent that H passes severely with **x**_{0}, i.e., to the extent that H would (very probably) not have survived the test so well were H false."

**Nonparametric** The nonparametric statistics approach has even fewer assumptions than standard frequentist or Bayesian statistics. See Nonparametric Statistical Methods by Hollander, Wolfe, and Chicken. Also see Nonparametric Statistical Inference by Gibbons and Chakraborti.

**General skepticism of Bayesian interpretations** See Frequentism as Positivism: a three-sided interpretation of probability by Lingamneni. In it he shows how probability interpretations are hierarchical, and says: "...while I consider myself a frequentist, I affirm the value of Bayesian probability...My skepticism is confined to claims such as the following: all probabilities are Bayesian probabilities, all knowledge is Bayesian credence, and all learning is Bayesian conditionalization".

Also see Bayesian Just-So Stories in Psychology and Neuroscience by Bowers and Davis. In it, they say

*"According to Bayesian theories in psychology and neuroscience, minds and brains are (near) optimal in solving a wide range of tasks. We challenge this view and argue that more traditional, non-Bayesian approaches are more promising."*

Also see Is it Always Rational to Satisfy Savage's Axioms? by Gilboa, Postlewaite, and Schmeidler. In it, they say

*"This note argues that, under some circumstances, it is more rational not to behave in accordance with a Bayesian prior than to do so. The starting point is that in the absence of information, choosing a prior is arbitrary. If the prior is to have meaningful implications, it is more rational to admit that one does not have sufficient information to generate a prior than to pretend that one does. This suggests a view of rationality that requires a compromise between internal coherence and justification, similarly to compromises that appear in moral dilemmas. Finally, it is argued that Savage's axioms are more compelling when applied to a naturally given state space than to an analytically constructed one; in the latter case, it may be more rational to violate the axioms than to be Bayesian."*

**Null hypothesis significance testing (NHST) is too difficult with planned/unplanned "data looks" and stopping rules in clinical trials** Of course, this somewhat contradicts the "frequentists avoid the harder Bayesian math" charge. On the contrary, various "adaptive designs" have been worked out, and more are being explored in medicine, sample surveys, and other areas. See Adaptive Designs for Clinical Trials and the math details by Bhatt and Mehta. Do multiple data looks and the like make things harder? Absolutely, in both frequentist and Bayesian approaches. Are the extra difficulties insurmountable? Probably not. Not to mention, so-called NHST is the dominant statistical method in practice, so how is it too difficult as claimed if most everyone is actually doing it?

**If two persons work on the same data and have different stopping intentions, they may get two different p-values.** If one stopped after m trials and one stopped after n trials, with m different from n, the results probably *should* differ, because the data are not the same as claimed. As mentioned above, things like this can be accounted for in sequential and adaptive designs.
Frequentism can directly address stopping rule issues, while Bayesian inference sweeps the issue under the rug because it only considers the data that were actually observed (observed data convoluted with a possibly non-observed subjective prior, that is). As Steele notes in Stopping rules matter to Bayesians too: "If a drug company presents some results to us - "a sample of n patients showed that drug X was more effective than drug Y" - and this sample could i) have had size n fixed in advance, or ii) been generated via an optional stopping test that was 'stacked' in favour of accepting drug X as more effective - do we care which of these was the case? Do we think it is relevant to ask the drug company what sort of test they performed when making our final assessment of the hypotheses? If the answer to this question is 'yes', then the Bayesian approach seems to be wrong-headed or at least deficient in some way."

See also Why optional stopping is a problem for Bayesians by Heide and Grunwald.
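Why the frequentist cares about stopping intentions is easy to see by simulation. A minimal sketch of my own, assuming N(0, 1) data and a two-sided z-test with known sigma: testing once at the final sample size keeps the false positive rate near the nominal 5%, while peeking after every new observation inflates it severalfold.

```python
import math, random

def p_value(xs):
    # Two-sided z-test of H0: mu = 0 for N(mu, 1) data (sigma known).
    n = len(xs)
    z = abs(sum(xs) / n) * math.sqrt(n)
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

random.seed(7)
runs, n_max = 2000, 100
fixed_hits = optional_hits = 0
for _ in range(runs):
    xs = [random.gauss(0, 1) for _ in range(n_max)]  # H0 is true
    if p_value(xs) < 0.05:
        fixed_hits += 1                              # test once, at n = 100
    if any(p_value(xs[:n]) < 0.05 for n in range(10, n_max + 1)):
        optional_hits += 1                           # peek after every new point
print(fixed_hits / runs, optional_hits / runs)
# fixed-n stays near the nominal 0.05; optional stopping is well above it
```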

**I dislike Ronald Fisher, therefore frequentism is false** Most of the dislike is Fisher envy. He created maximum likelihood, experimental design, ANOVA, the F distribution, and sufficiency; co-founded the field of population genetics; conducted important research on natural selection and inheritance; and gave us many statistical terms. What have any critics of frequentism, or any prominent Bayesians for that matter, done in comparison? A case of sour grapes perhaps? Some quotes I found on Fisher are:

- greatest statistician ever
- one of the greatest scientists of the 20th century
- greatest biologist since Charles Darwin
- __Statistical Methods for Research Workers__ occupies a position in quantitative biology similar to Isaac Newton's __Principia__ in physics

**Fisher liked smoking, therefore frequentism is false** The likes of a man or woman have no bearing on statistical theory. Obvious? He did accept the correlation between smoking and lung cancer, but not the causation, and said more research needed to be done on the issue.

**Fisher studied eugenics, therefore frequentism is false** Studying eugenics was socially acceptable at the time.

**I do not like the label "Inverse Probability" for Bayesianism, therefore frequentism is false** Some critics, unaware of the history of probability and statistics, claimed that Fisher created the term "Inverse Probability" and that this was intellectually dishonest. "Direct probability" refers to the likelihood, because that is where the data you directly observe enter, whereas "inverse probability" refers to probability distributions of unobserved parameters. However, the term "inverse probability" for Bayesian probability was used long before Fisher, in an 1830s paper by de Morgan referencing work by Laplace. In fact, Fisher used the term "inverse probability" early on, but later was one of the first to use the adjective "Bayesian". Fienberg discusses this in When Did Bayesian Inference Become "Bayesian"?.

**But...Fisher!** Complain about Fisher all you'd like, but many others have also pointed out various flaws in the Bayesian approach, for example Boole, Venn, von Mises, Le Cam, Neyman, Mayo, Efron, Taleb, and on and on. Critics need to address the flaws rather than the person.

**Frequentism is used to p-hack** The loudest claims of "p-hacking" may really just be "p-envy", or perhaps what Wasserman calls "frequentist pursuit". If anything, Bayesian inference can increase these problems, or create a different set of problems, because in addition to the usual myriad of things to choose from in any analysis, we now have an infinite number of priors and other statistics to choose from. See Degrees of Freedom in Planning, Running, Analyzing, and Reporting Psychological Studies: A Checklist to Avoid p-Hacking by Wicherts, Veldkamp, et al. for a good discussion of p-hacking. Also, make sure not to look at the data prior to making the prior, and don't retry your analysis with different priors. Of course, __any__ method, frequentist or Bayesian (or anything else), can be "hacked" or "gamed". The article Possible Solution to Publication Bias Through Bayesian Statistics, Including Proper Null Hypothesis Testing by Konijn et al discusses "BF-hacking" in Bayesian analysis, and notes: "God would love a Bayes Factor of 3.01 nearly as much as a BF of 2.99."

**Pre-registration** The idea that all scientific studies should "pre-register", to help prevent p-hacking/prior-fiddling behavior by scientists and researchers, is, I think, really great: for Bayesian, frequentist, anything.

**Frequentism is "ad hoc"** Is "ad hoc" a bad word? There *are* many ways to interpret even simple 2x2 tables, but why is that bad? Note that this contradicts the "frequentists apply their stuff too mechanistically" charge. Let's talk about priors. How many ways can we assign priors, hyperparameters, etc.? How often do Bayesians go back and tweak their prior to get convergence or the prior predictive distribution or other results "just right"?

**Specifying statistical tests is too arbitrary** Mostly one conditions on sufficient statistics. Is specifying a prior not arbitrary? Where do you stop with parameters on priors, hyperparameters, and on and on? Conjugate priors seem completely artificial (conjugacy means the combination of the prior with the likelihood yields a posterior belonging to the same family of distributions as the prior, which simplifies the analyses).

**Frequentists apply their methods too mechanistically** Bayesians do too. Get prior. Get likelihood. MCMC to get posterior. Use posterior_{t} for prior_{t+1} ("today's posterior is tomorrow's prior"). Both, however, are caricatures. The careful statistician, Bayesian or frequentist, does not operate mindlessly. Of course, this somewhat contradicts the "ad hoc" charge.

**Cutoff of p<.05 is arbitrary** Yes, I agree, and Fisher himself pointed it out (see his Statistical Methods, Experimental Design, and Scientific Inference). He said: "It is open to the experimenter to be more or less exacting in respect of the smallness of the probability he would require before he would be willing to admit that his observations have demonstrated a positive result. It is obvious that an experiment would be useless of which no possible result would satisfy him."

Arbitrary cutoffs are a standard of many journals, not __any__ problem with the statistical theory itself. What is causing you to publish there? What is stopping you from just reporting the observed p-value? What is stopping you from using a different alpha? Why are you not also focusing on experimental design and power? Why don't you replicate your experiment yourself a few times before thinking about publishing? The Neyman-Pearson approach looks more at rules to govern behavior, helping to ensure that in the long run we are not often wrong.

**big data** Mayo writes *"In some cases it's thought Big Data foisted statistics on fields unfamiliar with its dangers.."*

**Frequentism only considers sampling error** This is a __very__ common misconception held by critics of frequentism. The total survey error approach in survey statistics, for example, focuses on many types of errors, not just sampling error. In the 1940s, Deming discussed many types of non-sampling errors in his classic Some Theory of Sampling. Also see Total Survey Error in Practice by Biemer, Leeuw, et al. It is also very necessary to mention Pierre Gy's theory of sampling. Gy developed a total error approach for sampling solids, liquids, and gases, which is very different from survey sampling. See A Primer for Sampling Solids, Liquids, and Gases: Based on the Seven Sampling Errors of Pierre Gy by Patricia Smith. Statisticians do their best to minimize sampling and non-sampling errors.

**With a large enough sample size you can declare anything statistically significant** Merely increasing sample size increases the test's sensitivity along with its power, and this shows up in the severity measure. Additionally, increasing n by, say, b data points will only really matter if you get the "right data" to make your measure statistically significant after adding the b additional pieces of data.

**You can't learn anything from hypothesis testing, or all you learn is that it is unlikely you would have gotten these data if the null were true, and there is literally no alternative theory to estimate the probability of, just "not null."** Bayesian and other critics often attempt to limit frequentism to be *only* null hypothesis significance testing, which in reality is just a single component of what is under the umbrella of frequentism. Knowledge does not happen in a vacuum; combined with experimental design, science, survey sampling, and other sound statistics, "all you learn" from hypothesis testing can be quite a lot. See the deflection of light example from above. Additionally, Neyman-Pearson tests can be most powerful or uniformly most powerful, which seems very important. Wikipedia notes that Neyman-Pearson tests are used in things like economics of land value, electronics engineering, design and use of radar systems, digital communication systems, signal processing systems, minimizing false alarms or missed detections, particle physics, and tests for signatures of new physics against nominal Standard Model predictions.

**All null models are actually false, therefore hypothesis testing is worthless** This is somewhat irrelevant, as "all models are false, some are useful".

**Null results from hypothesis testing aren't useful**

- This is *totally* false. See HEP physics looking at p-values, where they say: "Statistical methods continue to play a crucial role in HEP analyses; recent Higgs discovery is an important example. HEP has focused on frequentist tests for both p-values and limits; many tools developed."

- Another example is economic activity in a NAICS industry, looking at changes from year to year (the null is no change). This information is used by agencies in their official statistics for gross domestic product (GDP), income accounts, and policy decision making.
- Another example is in medicine. See Effects of n-3 Fatty Acid Supplements in Diabetes Mellitus, where a null result was very useful.
- I believe changepoint analysis is yet another great example of the importance and success of hypothesis testing. Changepoint detection is the task of estimating the point at which various statistical properties of a sequence of observations change. In the paper changepoint: An R Package for Changepoint Analysis, Killick and Eckley write:
"The detection of a single changepoint can be posed as a hypothesis test. The null hypothesis, H_{0}, corresponds to no changepoint (m = 0) and the alternative hypothesis, H_{1}, is a single changepoint (m = 1)"
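A minimal sketch of posing a single mean-shift changepoint as a hypothesis test, in the spirit of that quote. This is my own illustration using a permutation null, not the changepoint package's exact method; all names are mine.

```python
import math, random

def max_shift_stat(xs):
    # Test statistic for H0: no changepoint vs H1: one mean shift -
    # the largest standardized difference in segment means over all splits.
    best = 0.0
    for k in range(5, len(xs) - 4):          # keep both segments non-trivial
        a, b = xs[:k], xs[k:]
        z = abs(sum(a) / len(a) - sum(b) / len(b)) / math.sqrt(1 / len(a) + 1 / len(b))
        best = max(best, z)
    return best

def changepoint_p_value(xs, n_perm=500, seed=0):
    # Permutation null: shuffling the series destroys changepoint structure.
    rng = random.Random(seed)
    observed = max_shift_stat(xs)
    hits = 0
    for _ in range(n_perm):
        perm = xs[:]
        rng.shuffle(perm)
        hits += max_shift_stat(perm) >= observed
    return hits / n_perm

rng = random.Random(42)
flat  = [rng.gauss(0, 1) for _ in range(60)]                                   # H0 true
shift = [rng.gauss(0, 1) for _ in range(30)] + [rng.gauss(2, 1) for _ in range(30)]
print(changepoint_p_value(flat), changepoint_p_value(shift))
# the shifted series yields a far smaller p-value than the flat one
```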

**Frequentism focuses on proper variance estimation** The "right" variance is key in survey design (and all areas) because it allows you, for example, to get an accurate denominator in a test statistic (observed-expected)/standard error, and hence a more correct probability and decision. See Introduction to Variance Estimation by Wolter.

**Frequentists suffer from "dichotomania" - always making a decision based on two forced outcomes, such as reject or fail to reject** As mentioned, frequentism is not *just* null hypothesis significance testing. But if we focus on that and need to make a decision, logically our choices need to exhaust the parameter space. If one wants to make a Yes/No decision, frequentism would provide estimates and just about any other statistic, not *just* the Yes/No decision. Of course, other times we may have more than two decisions to choose from, and frequentism handles these cases as well. I personally would rather "suffer from dichotomania" and make decisions than suffer from extreme subjectivity, using brittle priors, and pretending belief is probability.

**P-values are bad, but my other statistic is better** See In defense of P values by Murtaugh. The p-value, CI, AIC, BIC, and BF are all very much related. I think of a p-value as a test statistic put on a different scale. Saying a p-value is "bad" is like saying use Fahrenheit (F) over Celsius (C) because C is bad. As an example, suppose the world starts using Bayes Factors (BF) instead of p-values. A question naturally arises: for what values of BF do things become something like "statistically significant"? Consider the following informal loose correspondence between p-values and BF:

- p-value .05: BF roughly 3 - 5
- p-value .01: BF roughly 12 - 20
- p-value .005: BF roughly 25 - 50
- p-value .001: BF roughly 100 - 200

Perhaps there would be academic journals that would not let one publish if the BF is not greater than 5. Maybe there would be replications of studies that had a large BF that now have a smaller BF.
There would probably be plenty of papers saying we need reform because of the misunderstanding of BF even among professional statisticians, or that some other statistic or approach is better than BF. A few people would mention that the first users of BF pointed out these stumbling blocks and misuses of BF a long time ago. One sees the point, I hope. Statisticians also use tables, graphs, and other statistics to make conclusions, so an over-emphasis on p-values, BF, etc., is somewhat misguided.
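The correspondence between p-values and Bayes factors is informal, and different calibrations exist. One published calibration, the Sellke-Bayarri-Berger bound (not the table above, and an assumption on my part that it is the relevant comparison), says that for p < 1/e the Bayes factor in favor of the null is at least -e * p * ln(p), so the evidence against the null is capped:

```python
import math

def max_evidence_against_null(p):
    """Sellke-Bayarri-Berger bound: for p < 1/e, the Bayes factor in favor
    of H0 is at least -e * p * ln(p), so the evidence against H0 is at
    most 1 / (-e * p * ln(p))."""
    assert 0 < p < 1 / math.e
    return 1 / (-math.e * p * math.log(p))

for p in (0.05, 0.01, 0.005, 0.001):
    print(p, round(max_evidence_against_null(p), 1))
# e.g. p = 0.05 caps the odds against H0 at about 2.5 : 1
```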

**Reference class problem** A reference class problem exists not just with frequentism, but also with Bayesianism. For example, does the prior you're using on the ability of soccer players apply to all players, all male players, all players within a given year, all players on a given team, etc.? No matter the measure, we always need to define what it is a measure of. Fisher himself pointed this out in the 1930s, and Venn 70 years before that! Just as all probabilities are conditional, we all belong to different classes, and frequency(male) is different from frequency(male wears glasses), etc. This is why the *context* of the problem, where you define all of these things, is important. Often if a class is too small or empty, and could not meet distributional assumptions, one can "collapse" to the next class that is not too small or empty. As an example, consider the North American Industry Classification System, or NAICS, levels. If there were few or no observations in NAICS 11116 "Rice Farming", you could collapse to NAICS 1111 "Oilseed and Grain Farming". If there were few or no observations in NAICS 1111, you could collapse to NAICS 111 "Crop Production". And last, if there were few or no observations in NAICS 111, you could collapse to NAICS 11 "Agriculture, Forestry, Fishing and Hunting". A scenario like this could come into play if setting up cells for imputation, for example.

**Frequentism violates the Strong Likelihood Principle** Yes, in some sense, and so? The likelihood itself doesn't obey probability rules, needs to be calibrated, and doesn't have as high a status as probability, so using the likelihood alone is not a reasonable way to do science. See In All Likelihood: Statistical Modelling and Inference Using Likelihood by Pawitan. Additionally, the SLP violation charge has itself been severely critiqued and found wanting.
Writing WCP for the Weak Conditionality Principle and SP for the Sufficiency Principle: in On the Birnbaum Argument for the Strong Likelihood Principle, Mayo writes: "Although his [Birnbaum] argument purports that [(WCP and SP) entails SLP], we show how data may violate the SLP while holding both the WCP and SP."

**Maximum likelihood methods, mostly used by frequentists, can have problems when the arbitrarily defined space of possible parameter values includes regions that make no sense.** On one hand, this can allow possible parameter values that make little sense. On the other hand, by letting the data speak, it can help prevent subjective beliefs and strong, possibly unwarranted, assumptions from dictating the allowable parameter values. Efron has mentioned that one can solve many of these issues by using the bootstrap. One could also use a penalized likelihood approach. Frequentists are not "stuck" using only maximum likelihood.

**Basu's elephant shows the flaws with Horvitz-Thompson (HT) estimators** On the contrary, it shows a silly example where you don't do proper survey design. The "paradox" disappears entirely if you create your survey and weights appropriately, for example by weighting by a measure of size (elephant weight). Even in the flawed example, if we had a larger n the paradox would disappear. But of course, if the person just wanted to estimate the weight of all elephants from the weight of a single elephant, then state your assumptions and methodology and just do it, and there is no need for HT or any other type of sampling whatsoever.

**Various confidence interval (CI) paradoxes** There *are* some well-known paradoxes with confidence intervals.

**Is the confidence 100% or 50%, or 75%?** From In All Likelihood: Statistical Modelling and Inference Using Likelihood by Pawitan (adapted from Berger and Wolpert):

Someone picks a fixed integer theta and asks you to guess it based on some data as follows. He is going to toss a coin twice (you do not see the outcomes), and from each toss he will report theta+1 if it turns out heads, or theta-1 otherwise. Hence the data x_{1} and x_{2} are an i.i.d. sample from a distribution that has probability .5 on theta-1 or theta+1. For example, he may report x_{1} = 5 and x_{2} = 5.

The following guess will have a 75% probability of being correct:

C(x_{1}, x_{2}) = x_{1} - 1 if x_{1} = x_{2}, and (x_{1} + x_{2})/2 otherwise.

However, if x_{1} ≠ x_{2}, we should be 100% confident that the guess is correct, otherwise we are only 50% confident. It will be absurd to insist that on observing x_{1} ≠ x_{2} you only have 75% confidence in (x_{1} + x_{2})/2.

First, you *have* to love us mathematical statisticians, because this example is one of the most contrived examples I've ever seen! Second, there is actually no paradox, because the "confidence" is in the entire process. If you want to break down a process into subsets of the process (the 100% and 50% parts), you can do that as well. This "paradox" is basically the confidence interval version of the reference class problem. See this spreadsheet for a simulation.

**Jaynes' truncated exponential failure times example** Consider the model:

p(x|theta) = e^{(theta-x)} if x > theta, and 0 if x < theta.

We observe {10, 12, 15}. What is a 95% confidence or credible interval for theta if it is known that theta must be less than 10? It turns out that a naive frequentist confidence interval, using an unbiased estimator approach, gives a 95% confidence interval of (10.2, 12.2). The fact that the *lower* limit of the confidence interval is greater than 10 is a problem, because logically theta must be smaller than the smallest observed data point. The Bayesian credible interval, using a flat prior, gives a 95% credible interval of (9, 10), which is more realistic. This example unfortunately doesn't permit frequentists to consider any other approach for calculating confidence intervals. I believe order statistics (for example, the minimum) and the bootstrap could be useful with this problem. See this spreadsheet for some explorations.

Note that these paradoxes are typically resolved by one or more of these approaches:

- realizing they aren't paradoxes at all ("not a bug but a feature")
- using larger samples
- using different data
- taking information withheld from the frequentist into account
- using other frequentist approaches that the counterexample prohibits

**False positives** These are not counterexamples but an outcome of working with data and making decisions in the face of risks, and of having to conform to arbitrary journal standards. Let's not pretend that there aren't, or wouldn't be, any false positives if we used a Bayesian analysis.

**Counterexamples, paradoxes, or issues in Bayesian probability and statistics**

- Consider a prior on the parameter theta where theta ~ U(0,1). What about the distribution of theta^{2}, log[theta/(1-theta)], or 1/theta? It is true that I'd expect theta^{500} to be closer to 0 than 1, but the general question still stands: *how does ignorance on one scale translate into knowledge on another?*
- Cromwell's rule. Priors of 0 cannot be updated away from 0, no matter how much data you obtain. This can be interpreted to mean that hard convictions are insensitive to counter-evidence.
- "Bayesian divergence". From Wikipedia:
"An example of Bayesian divergence of opinion is based on Appendix A of Sharon Bertsch McGrayne's 2011 book The Theory That Would Not Die: How Bayes' Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries of Controversy. Tim and Susan disagree as to whether a stranger who has two fair coins and one unfair coin (one with heads on both sides) has tossed one of the two fair coins or the unfair one; the stranger has tossed one of his coins three times and it has come up heads each time.

Tim assumes that the stranger picked the coin randomly - i.e., assumes a prior probability distribution in which each coin had a 1/3 chance of being the one picked. Applying Bayesian inference, Tim then calculates an 80% probability that the result of three consecutive heads was achieved by using the unfair coin, because each of the fair coins had a 1/8 chance of giving three straight heads, while the unfair coin had an 8/8 chance; out of 24 equally likely possibilities for what could happen, 8 out of the 10 that agree with the observations came from the unfair coin. If more flips are conducted, each further head increases the probability that the coin is the unfair one. If no tail ever appears, this probability converges to 1. But if a tail ever occurs, the probability that the coin is unfair immediately goes to 0 and stays at 0 permanently.

Susan assumes the stranger chose a fair coin (so the prior probability that the tossed coin is the unfair coin is 0). Consequently, Susan calculates the probability that three (or any number of consecutive heads) were tossed with the unfair coin must be 0; if still more heads are thrown, Susan does not change her probability. Tim and Susan's probabilities do not converge as more and more heads are thrown."
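The arithmetic in the Wikipedia example (Tim's 80%, Susan's permanent 0) is easy to verify. A sketch of my own, with hypothetical names, using exact fractions:

```python
from fractions import Fraction

def posterior_unfair(prior_unfair, heads_in_a_row):
    """Posterior probability the two-headed coin was picked after seeing
    only heads. Likelihoods: a fair coin gives (1/2)^n, the unfair coin 1."""
    pu = Fraction(prior_unfair)
    like_unfair = Fraction(1)
    like_fair = Fraction(1, 2) ** heads_in_a_row
    return pu * like_unfair / (pu * like_unfair + (1 - pu) * like_fair)

tim   = posterior_unfair(Fraction(1, 3), 3)   # uniform prior over the 3 coins
susan = posterior_unfair(Fraction(0), 3)      # prior of exactly 0 on the unfair coin
print(tim, susan)  # 4/5 and 0 - Susan's zero prior can never move
```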

- "Bayesian convergence". Also from Wikipedia:
An example of Bayesian convergence of opinion is in Nate Silver's 2012 book The Signal and the Noise: Why so many predictions fail - but some don't. After stating, "Absolutely nothing useful is realized when one person who holds that there is a 0 (zero) percent probability of something argues against another person who holds that the probability is 100 percent", Silver describes a simulation where three investors start out with initial guesses of 10%, 50% and 90% that the stock market is in a bull market; by the end of the simulation (shown in a graph), "all of the investors conclude they are in a bull market with almost (although not exactly of course) 100 percent certainty."

Bayesian convergence in this case is simply a nicer way of saying that the likelihood swamped the priors.

- Bayesian does not explain why we would even need a prior/belief for a coin flip experiment to see the Strong Law of Large Numbers (SLLN) in action. You might have knowledge the coin comes from a mint or a magician, or you might be mistaken in your beliefs, but the data from a good experiment would reveal this.
- If there were no human, or other, brains around to observe an event with probability p, frequentist probability would still work for estimating p, but a subjective Bayesian approach wouldn't.
- The article Possible Solution to Publication Bias Through Bayesian Statistics, Including Proper Null Hypothesis Testing by Konijn et al discusses "BF-hacking".
- Bayesian statisticians can tweak their priors until convergence and other criteria look good. Is this tweaking accounted for? Is this even still doing Bayesian statistics?
- Bayesian statistics does not tell you specifically how to select a prior for all situations. The issue gets very complex in multidimensional settings, as well as trying to select a prior that is good for many parameters.
- Simulations can give slightly different answers unless the seed for the pseudo random number generator is fixed. This could be seen as contradicting the claim that Bayesian inferences are exact, since MCMC is being used to solve problems.
- Two people using the same data and likelihood, but even slightly different priors, can reach different conclusions. Why should personal belief matter more than data?
- Bayesian use of priors or updating does not prevent against poor assumptions and models. Bayesian statistics is not the right tool for every job, and using Bayesian analysis does not automatically make you right.
- Is a Bayes Theorem with a subjective prior a Drake Equation type of thing? That is, the equation is "correct", but you can change the inputs to get whatever output you want.
- Bayesian analyses that are not simple rely exclusively on MCMC with possibly no way to verify your results analytically.
- Every analysis relying on Bayesian statistics automatically requires a sensitivity analysis on the priors.
- In A Systematic Review of Bayesian Articles in Psychology: The Last 25 Years by van de Schoot et al, the popularity of Bayesian analysis has increased since 1990 in psychology articles. However, quantity is not necessarily quality, and they write
"...31.1% of the articles did not even discuss the priors implemented"

"Another 24% of the articles discussed the prior superficially, but did not provide enough information to reproduce the prior settings..."

"The discussion about the level of informativeness of the prior varied article-by-article and was only reported in 56.4% of the articles. It appears that definitions categorizing "informative," "mildly/weakly informative," and "noninformative" priors is not a settled issue."

"Some level of informative priors was used in 26.7% of the empirical articles. For these articles we feel it is important to report on the source of where the prior information came from. Therefore, it is striking that 34.1% of these articles did not report any information about the source of the prior."

"Based on the wording used by the original authors of the articles, as reported above 30 empirical regression-based articles used an informative prior. Of those, 12 (40%) reported a sensitivity analysis; only three of these articles fully described the sensitivity analysis in their articles (see, e.g., Gajewski et al., 2012; Matzke et al., 2015). Out of the 64 articles that used uninformative priors, 12 (18.8%) articles reported a sensitivity analysis. Of the 73 articles that did not specify the informativeness of their priors, three (4.1%) articles reported that they performed a sensitivity analysis, although none fully described it."

Because Bayesian analysis is "brittle" and often highly dependent on the priors, and no one can replicate your work if you don't detail your priors, these practices are __very__ worrying.
- Analysis for large and small samples proceeds the same, which is consistent, but mistaken. It is known that priors are even more influential with small samples.
- There are no deep distinctions between statistics and parameters, priors and likelihoods, and fixed and random effects. And yes, I "get" that Bayesians view this as a strength rather than a weakness.
- Priors can misrepresent opinion as experiment, and vice versa. Le Cam writes
Thus if we follow the theory and communicate to another person a density C*theta^{100}*(1-theta)^{100}, this person has no way of knowing whether (1) an experiment with 200 trials has taken place or (2) no experiment took place and this is simply an a priori expression of opinion. Since some of us would argue that the case with 200 trials is more "reliable" than the other, something is missing in the transmission of information.
- Bartlett paradox. If a prior is flat on an infinite-volume parameter manifold, this scenario always favors the smaller model. At first this seems like a sensible application of Occam's Razor, but the paradox is that this happens regardless of the goodness of fit. See The Lindley paradox: the loss of resolution in Bayesian inference by LaMont and Wiggins for a good discussion of the Lindley and Bartlett paradoxes and related issues.
- Optional stopping, or multiplicities in general, can be an issue for Bayesians (as well as frequentists), but Bayesians often claim it is not an issue.
- Are Bayesian statistics conferences and societies safe for women? These experiences are totally heartbreaking and disgusting. I understand, obviously, that the vast majority of Bayesians are not like this, and there are probably some frequentist creeps out there as well. However, this is a concrete example that happened. Please read this post by Lum, and also this disturbing article. These articles have the following quotes:
"...a band performing at the closing party made jokes about sexual assault. This is a band that is composed mostly of famous academics in machine learning and statistics..."

"At ISBA 2010 (the same conference where the comments were made about my dress) [the International Society for Bayesian Analysis - J], I saw and experienced things that, in retrospect, were instrumental in my decision to (mostly) leave the field."

"There really is just a lot of sexual harassment of women in Bayesian statistics and machine learning..."

"The researchers involved are experts in Bayesian statistics, which underpins a powerful type of AI known as machine learning. The accusations have surfaced during a growing debate over the lack of diversity among machine learning researchers..."

- There is one type of frequentism but many different types of Bayesianism (usually based on how you get the priors), with some in-fighting among them. Are you "empirical", "subjective", "objective", or something else? Do you not look at the data before making the prior, or do you constantly fiddle with and change the prior until the diagnostics "look good"? Probably joking a little, Good notes that there are quite a few varieties of Bayesian.
- Despite Bayesian insistence on inference not depending on counterfactual reasoning, Bayesians *are* interested in such issues during the experimental design stage, which is inconsistent.
- Proofs of God existing, the resurrection of Jesus, and other miracles rely on Bayesian statistics.
- Using asymmetric prior distributions in medicine could delay recognition of costly effects of treatments in the unexpected direction. For example, for many years physicians prescribed low-fiber diets for bowel problems until evidence accumulated that they were more harmful than beneficial, some physical therapies can unexpectedly worsen injuries, and antibiotics have often been given for conditions that antibiotics can worsen.

- Consider a prior on the parameter theta where theta ~ U(0,1). What about the distribution of theta^{2}, or of any other nonlinear transformation of theta? It is no longer uniform, so the supposedly "noninformative" flat prior does not stay flat under reparameterization.
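This reparameterization point is easy to illustrate by simulation (my own sketch, not from the article's sources): if theta is uniform on (0,1), then theta^2 is not uniform, so the proportion of draws below 0.5 changes under the transformation.

```python
import random

# Sketch: draw theta ~ Uniform(0,1), then compare how often theta and
# theta^2 fall below 0.5. If the flat prior were invariant under
# reparameterization, the two proportions would match.
random.seed(1)
n = 100_000
draws = [random.random() for _ in range(n)]

p_theta = sum(t < 0.5 for t in draws) / n        # ~0.5 by construction
p_theta_sq = sum(t**2 < 0.5 for t in draws) / n  # ~sqrt(0.5), about 0.707

print(p_theta, p_theta_sq)
```

The second proportion is near sqrt(0.5), not 0.5, because theta^2 < 0.5 exactly when theta < sqrt(0.5).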

**But** this contradicts the "frequentism is bad for science" critique.

**Frequentism is really just Bayesian statistics with a flat prior**

This is like saying atheism is really religion, but without belief. Frequentism is statistics, period. There is no need for priors at all; such priors are not solicited. The lack of a prior is not a flat prior: even if the mathematics may come out identical (in certain cases, or at least the decision from the analysis is in the same direction), the interpretation is not the same, nor is it needed. Additionally, frequentists can use a penalized likelihood approach, or a ridge regression approach, which gets away from the Bayesian belief that frequentism is only equivalent to using a "flat prior". Another take on this charge: if it were true, what would the Bayesian problems with frequentism be? You wouldn't complain that frequentism is false if it is just Bayesian statistics, which you hold true. The statistician Edwards said

"It is sometimes said, in defence of the Bayesian concept, that the choice of prior distribution is unimportant in practice, because it hardly influences the posterior distribution at all when there are moderate amounts of data. The less said about this 'defence' the better."

**Bayes Theorem**

The simple Bayes Theorem, despite the name, *is* fully in the frequentist statistics domain and is a basic result from the multiplication rule and conditional probability. The equation is

P(A|B) = [P(B|A)P(A)]/P(B)
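As a quick numeric sketch of the binary-event form, with rates invented purely for illustration (a hypothetical diagnostic test, not from any study):

```python
# Minimal numeric sketch of Bayes Theorem for a binary event.
# All rates below are made up for illustration.
p_A = 0.01             # P(A): base rate of a condition
p_B_given_A = 0.95     # P(B|A): test positive given the condition
p_B_given_notA = 0.05  # P(B|not A): false positive rate

# P(B) = P(B|A)P(A) + P(B|not A)P(not A)
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

# P(A|B) = P(B|A)P(A) / P(B)
p_A_given_B = p_B_given_A * p_A / p_B

print(round(p_A_given_B, 3))  # 0.161
```

Note the posterior probability is far below 0.95 because the base rate P(A) is low; nothing here requires a subjective prior, only relative frequencies.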

where A and B are general events. Bayes Theorem becomes technically "Bayesian" when P(A) is a probability distribution for a parameter, and *definitely* "Bayesian" when P(A) is based, not on prior experiment or objectivity, but on subjectivity. For a binary event, P(B) = P(B|A)P(A) + P(B|not A)P(not A). Note that the ratio often becomes computationally intractable because of very difficult integrals in the numerator and denominator.

**Bayesian is just conditional probability, nothing more, nothing less.**

The standard Bayes Theorem, yes, or even when the prior is based on a lot of frequency data. However, when the prior is subjective, or the sample size is very small, or it has poor frequentist properties, I do not believe it is justified to be called "probability". Otherwise you might have to accept absurdities like the Bayesian proofs of God, for example, as "probability" just because they go through the process and produce a number between 0 and 1 that satisfies the axioms.

**Frequentists are hypocrites because of latent variable models!**

For example, consider Latent Variable Models and Factor Analysis: A Unified Approach by Bartholomew et al. It is claimed by critics that this is a frequentist text, that therefore frequentists have cognitive dissonance, and that frequentists will complain about the priors in other people's Bayesian analyses yet are happy to apply latent variable models, stating that the results don't depend on the prior. In actuality, this text has pages on Bayesian analysis. Also, there are times when priors strongly influence a Bayesian analysis, for example when there is not a lot of data, so frequentists "complaining" is certainly justified in those cases. Bartholomew makes the point that

"There can be no empirical justification for choosing one prior distribution for y rather than another."

That is, he is making very clear that latent variable models are some hybrid of the two approaches. He also writes

*The link between the two is expressed by the distribution of x given w [I changed his symbol to w. -J]. Frequentist inference treats w as fixed; Bayesian inference treats w as a random variable. In latent variables analysis we may think of x as partitioned into two parts x and y where x is observed and y, the latent variable, is not observed. Formally then, we have a standard inference problem in which some of the variables are missing. The model will have to begin with the distribution of x given w and y. A purely frequentist approach would treat w and y as parameters whereas the Bayesian would need a joint prior distribution for w and y. However, there is now an intermediate position, which is more appropriate in many applications, and that is to treat y as a random variable with w fixed.*

This is apparently because X is sufficient for y in the Bayesian sense. In other words, if no priors matter, it is like not using a prior in the first place.

**Likelihood swamps the prior**

It is well known that the likelihood swamps the prior as n increases, especially if it is related to effect size. There is probably more agreement on likelihood models than on priors. So if the likelihood model is a good candidate for "truth", we see Bayesian results converge to frequentist results as n increases, for any choice of prior. This is not a strong argument for using priors, especially when one can incorporate expert knowledge in other ways, such as experimental design, subject matter expertise, survey sampling, the likelihood, etc. If priors are irrelevant for large n, then they are still irrelevant for small n, even if they have more pull. Although, for small n, as you may have expected, most frequentist and even Bayesian analyses (almost any type of analysis) are of dubious value. See A Closer Look at Han Solo Bayes.

**Bayesian model is a special case of the classical model**

Patriota has noted that the Bayesian model is just a special case of the more general classical model. That is, imposing a prior does not lead to a more general structure, because when you impose a rule you are restricting the mathematical structure. Patriota also proposed an s-value as an alternative to the p-value, although he notes that finding thresholds for the s-value to decide about a H_{0} is still an open problem, and that the asymptotic p-value can be used to test H_{0}. I interpret all of this to mean that the classical model and p-values are in fact doing a pretty good job.
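The likelihood-swamping point above can be sketched with a conjugate Beta-Binomial example (my own illustration; the "true" proportion and prior strengths are chosen arbitrarily):

```python
# Sketch: posterior means under two very different Beta priors.
# The posterior for a Binomial likelihood with a Beta(a, b) prior is
# Beta(a + heads, b + n - heads). As n grows, both posterior means
# approach the sample proportion: the likelihood swamps the prior.
def posterior_mean(a, b, heads, n):
    """Posterior mean of Beta(a + heads, b + n - heads)."""
    return (a + heads) / (a + b + n)

true_p = 0.7
for n in (10, 100, 10_000):
    heads = int(true_p * n)  # idealized data at the true proportion
    flat = posterior_mean(1, 1, heads, n)       # Beta(1,1) "flat" prior
    skeptic = posterior_mean(50, 50, heads, n)  # strong prior centered at 0.5
    print(n, round(flat, 3), round(skeptic, 3))
```

At n = 10 the two posterior means disagree noticeably; by n = 10,000 they are essentially identical and match the sample proportion.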

**Everything?**

That is very doubtful. If "everything is subjective" is true, then the claim "everything is subjective" is itself subjective, and therefore I doubt it very much. Arguably the main purpose of science is to be as objective as possible.

**Frequentists are being very subjective when they choose to calculate a 95% confidence interval instead of a 90% confidence interval**

This is not *as* subjective a choice as subjective Bayesians make it out to be. Ideally, the choice of confidence/alpha should be based on error, cost, subject matter expertise, and other considerations. The choice is not, or should not be, the statistician just willy-nilly deciding on a number. Either way, the confidence interval is a process to make intervals that capture an objective unknown constant parameter a certain percent of the time, which can, for example, easily be demonstrated to work in simulations.

**Those wanting to justify their alpha will always fail.**

The critique is that it is unclear how exactly the researcher should go about the process of justifying an alpha. In The fallacy of the null-hypothesis significance test, Rozeboom wrote

"Now surely the degree to which a datum corroborates or impugns a proposition should be independent of the datum-assessor's personal temerity. Yet according to orthodox significance-test procedure, whether or not a given experimental outcome supports or disconfirms the hypothesis in question depends crucially upon the assessor's tolerance for Type I risk."

Alpha is the probability of making a Type I error. Statisticians should therefore make alpha directly tied to the cost of making a Type I error (and then further adjust alpha smaller if needed). This cost can be an actual dollar amount, lives lost, or the general cost of "if the claim being tested were true, how would that disrupt our current understanding of the world?", for example, in the case of testing claims of ESP. Moreover, Rozeboom presumably has no problems with using "personal temerity" in choosing a prior distribution.
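The earlier claim that confidence interval coverage "can easily be demonstrated to work in simulations" can be sketched in a few lines (known sigma and the 1.96 normal critical value, for simplicity):

```python
import random
import statistics

# Sketch: repeated-sampling check that a nominal 95% CI for a normal
# mean covers the true parameter about 95% of the time.
random.seed(0)
true_mu, sigma, n, reps = 10.0, 2.0, 30, 2000
covered = 0
for _ in range(reps):
    sample = [random.gauss(true_mu, sigma) for _ in range(n)]
    xbar = statistics.fmean(sample)
    half = 1.96 * sigma / n**0.5  # known-sigma interval for simplicity
    if xbar - half <= true_mu <= xbar + half:
        covered += 1
coverage = covered / reps
print(coverage)  # close to 0.95
```

The observed coverage hovers around the nominal 95% regardless of anyone's beliefs about true_mu, which is the frequentist guarantee being claimed.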

**Justifying an alpha "does not turn weak evidence into strong evidence".**

If each of three people conducts the same test on the same data and coincidentally each gets p-value = .047, but their alphas were .05, .10, and .001, respectively, this is said to break the rules by supposedly turning weak evidence into strong evidence. However, it is clear that the evidence remains weak in each case. All this critique really shows is that the Bayesian bad habit of relying on unchecked subjectivity ("personal temerity") to set alpha can be remedied by objective frequentist standards (the cost of making a Type I error) for setting alpha. Such criticisms tend to completely ignore notions of replication of experiments, as well as ignore similar "cutoff" issues if Bayes Factors, or any other statistic, were used instead of p-values to denote something like "statistical significance".

**Asking me to set alpha so you can make a decision means your conclusion would depend on my criterion. If so, isn't that weird, because my criterion didn't influence the evidence, right?**

The criterion didn't influence the evidence, but it can influence the decision. A decision depends on evidence __and__ criteria. The decision to bring an umbrella depends on the amount of rain __and__ one's tolerance or cost for getting wet. Ideally alpha should be set based on the cost of making a Type I error, and not on arbitrary and completely subjective beliefs. The article "Setting an Optimal Alpha That Minimizes Errors in Null Hypothesis Significance Tests" by Mudge et al discusses a more intelligent way to set alpha.

**Everyone makes assumptions and us Bayesians at least make our assumptions explicit**

Frequentists do too; they spell out assumptions in detail as well. Consider a possible Bayesian response

"Ever hear a frequentist call their stats 'subjective'? Some frequentism popularizers even have the audacity to teach that the main difference between Bayes and Frequentist is that Bayesian is subjective and frequentism is objective."

If everything is supposedly subjective, why the label in the first place? Why is there a separate "subjective Bayesian" term created by Bayesians? Obviously the subjective part refers to the priors and not anything else, like taking previous experiments into account, likelihoods, expert opinion, etc. Anyone can flip a coin and observe the relative frequency of Heads tending to converge to a horizontal line as the number of flips increases, or watch ball bearings cascade down a Galton board to form an approximate normal distribution; these are therefore not subjective.
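The coin-flip convergence just mentioned is trivial to check empirically; a minimal simulation sketch:

```python
import random

# Sketch: running relative frequency of Heads for a fair coin,
# illustrating the convergence anyone can observe for themselves.
random.seed(42)
heads = 0
for i in range(1, 100_001):
    heads += random.random() < 0.5  # one simulated flip
    if i in (10, 100, 1000, 100_000):
        print(i, heads / i)  # relative frequency settles near 0.5
rel_freq = heads / 100_000
```

No prior belief about the coin enters anywhere; the stabilization of the relative frequency is a property of the process itself.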

**Passing the buck on subjectivity**

Using experts to back up subjective priors doesn't solve the problem of subjectivity. Are you using experts (and hence priors) from the plaintiff or the defendant?

**Don't critique us Bayesians for using priors since frequentists use background knowledge all the time**

Scientists of all stripes are always permitted to use knowledge of things: experimental design, results from previous experiments, subject matter expertise, logic, scientific knowledge, penalized likelihood, direction of statistical tests, etc. Priors, however, are very specific mathematical objects, and that is what is being referred to. To compare Bayesian priors, which put possibly subjective probability distributions on parameters, to using any inputs for an analysis and say "well, both are the same thing *really*" is simply mistaken.

**Bayesian can take expert opinion into account using the prior**

It can also take personal beliefs into account that can vary from person to person. Do the good and bad cancel out? Frequentism can take expert opinion, background knowledge, and results from experiments into account as well, just not using priors. There is obviously some subjectivity in choice of models, analysis, significance level, etc. If testing claims of ESP, consider lowering alpha drastically, because it is an extraordinary claim that, if true, would change our fundamental knowledge about the world. The James Randi Educational Foundation had a $1,000,000 challenge to anyone demonstrating ESP, psychic, paranormal, etc., powers in a controlled setting, and they took this approach. Suffice to say, using good experimental design and a low alpha precluded winning the money merely by chance. The JREF rightly recognized that setting alpha should be based on the cost of making a Type I error.
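The logic of a stringent threshold can be sketched numerically. This is my own illustration of the idea, not the JREF's actual protocol, and the trial counts and thresholds are invented:

```python
from math import comb

# Sketch: chance of passing a guessing test by luck alone.
# A claimant guesses among 5 options (p = 0.2) over 100 trials.
def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

n, p = 100, 0.2
print(binom_tail(n, 30, p))  # modest threshold: occasionally passable by luck
print(binom_tail(n, 50, p))  # stringent threshold: essentially never by chance
```

Raising the required hit count (i.e., shrinking alpha) drives the probability of a lucky pass toward zero, which is exactly why a high-cost claim warrants a very small alpha.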

Here is some "expert opinion" that led to Bayesian proofs of God existing: The Probability of God: A Simple Calculation That Proves the Ultimate Truth by Unwin, and The Existence of God by Swinburne.

**Bayesians are more honest than frequentists**

Presumably because of making assumptions explicit. However, this is just opinion/experience, and has no bearing on the mathematical theory. As mentioned, frequentists do make their assumptions known.

**Don't get hung up on models**

Like Gelman writes at his blog Statistical Modeling, Causal Inference, and Social Science, we all use models; the advice is not to get hung up on them, but to check them (mostly using frequentist concepts), iterate, etc. However, I believe that while we all use models, things like priors *are* different from other models, since they literally can be anything, while likelihood models are more agreed upon and "constrained", for lack of a better word. For example, see Golf Putting Models. I know how to interpret a logistic regression, how to extend it with more predictors, and so on. How do I do that with Gelman's (sensible, mind you) unique model? How do I compare everything about his model with models others could use to analyze this putting data? It isn't really clear statistically how to do that, and surely a simple "fit" may not be the best way to decide the winner. Of course, this contradicts the "all null models are actually false" charge.

**Mathematics based on subjectivity is not well-defined**

A Bayesian approach does not, or cannot, give a full account of the "mathematical rules of engagement" for working with subjective quantities. Simply put, just because a number is between 0 and 1 and one feels it is a probability does not mean it is a probability, properly defined. Probably most frequentists are fine with it being "chance", "uncertainty", or "personal belief", however.

**Frequentist statistics tries to make the world a *correct* place. It is objective. Bayesian Statistics tries to make the world a *better* place. It is subjective.**

This contradicts the "Some frequentism popularizers even have the audacity to teach that the main difference between Bayes and Frequentist is that Bayesian is subjective and frequentism is objective" charge. There are many similar sayings I've come across in looking at critiques of frequentism. So while this was meant to be another "funny" one, I'll respond seriously. Much "frequentism" has made and continues to make the world better: studies showing smoking is bad, risk factors for CHD, numerous sample surveys informing people, enjoyment of games of chance, experimental design for science, weather prediction, the benefits of randomization, etc. As mentioned before, also see The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century by Salsburg. I personally do not believe that extreme forms of subjectivity have ever been aligned with good scientific practice.

**Background**

B.S. Mathematics, M.S. Statistics, Mathematical Statistician for ~15 years and counting. Earned a B in Bayesian statistics (p < .0001).

Thanks for reading and sending in comments/corrections!


If you enjoyed *any* of my content, please consider supporting it in a variety of ways:

- **Check out a random article at http://statisticool.com/random.htm**
- **Buy what you need on Amazon from my affiliate link**
- **Share my Shutterstock photo gallery**
- **Sign up to be a Shutterstock contributor**