8/14/18
This article is my response to various arguments against frequentism that I've read or heard. Frequentism is roughly defined as:
relfreq(Heads)_{t} = ((t-1)*relfreq(Heads)_{t-1} + I_{t})/t, where
I_{t} = 1 if Heads is observed on the t^{th} trial, 0 otherwise
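The recursive definition can be sketched in a few lines of Python. This is only an illustration: the function name `running_relfreq`, the seed, and the number of flips are my own choices, and the coin is assumed to be fair.

```python
import random

def running_relfreq(outcomes):
    """Return the running relative frequency of Heads after each trial.

    outcomes: iterable of 0/1 indicators (1 = Heads on that trial).
    """
    relfreq = 0.0
    history = []
    for t, indicator in enumerate(outcomes, start=1):
        # Recursive update: new value = ((t-1) * previous value + I_t) / t
        relfreq = ((t - 1) * relfreq + indicator) / t
        history.append(relfreq)
    return history

# Assumed setup: a fair coin, simulated with a fixed seed for repeatability.
rng = random.Random(0)
flips = [1 if rng.random() < 0.5 else 0 for _ in range(10_000)]
freqs = running_relfreq(flips)
for t in (10, 100, 1_000, 10_000):
    print(t, round(freqs[t - 1], 4))
```

The printed values show the relative frequency settling down as t grows, which is the behavior the definition is meant to capture.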
Of course, the lessons learned and results from experiments are used to inform future experiments and projects. Are there examples of Bayesian updating (posterior_{t} used as prior_{t+1}) being done long-term? I'd only rely on the results if they had good long-term frequentist properties. Updating will be bad if there is GI (garbage in, from GIGO). Do Bayesians guarantee that at any time t along the way there will not be GI? See Compounding Errors, which shows the general idea of how small errors now can create big errors later on in a process. Owhadi wrote
How do you make sure that your predictions are robust, not only with respect to the choice of prior but also with respect to numerical instabilities arising in the iterative application of the Bayes rule?
This would be a mistake because data and parameters differ in more aspects than just observed and unobserved, likelihoods and priors are very different and have different uses even if both are "just distributions", and wanting to banish use of the terms "estimate" and "random" is just silly. One could probably argue that Bayesians may want to blur the differences between likelihoods and priors, and banish the words estimate and random, to blunt the criticism against problematic but fundamental Bayesian concepts and simultaneously diminish frequentist contributions. McElreath adds, however, that at times he uses these terms, that sometimes their use is OK, so determining exactly what he is proposing is rather confusing.
As mentioned, the ASA document was not against p-values but against the misunderstanding and misuse of p-values. In that document they wrote that other approaches, like Bayesian, "...have further assumptions". I was always taught to not just do p < .05 and leave it at that, but to have good experimental or survey design, give confidence intervals, graphs, not have arbitrary cutoffs, and so on. See Regarding the ASA Statement on P-Values and The Statistical Sleuth by Ramsey and Schafer. Mayo writes
"Misinterpretations and abuses of tests, warned against by the very founders of the tools, shouldn't be the basis for supplanting them with methods unable or less able to assess, control, and alert us to erroneous interpretations of data."
By the way, these warnings about p-values are all things that we have known since Fisher's time.
One could argue the other way, that Bayesians really want to say that their procedures have good long-term performance.
Some critics have suggested that the CLT has poor performance for, say, a lognormal population. However, if there were a lognormal population, a statistician would make a histogram, observe that it is skewed, and probably consider taking a transformation of the data, such as a log. The critic then says 'ah, but you can't do that' or 'but you don't know what transformation to take'. However, how can the critic even know the population is exactly lognormal to begin with? A statistician in real life would simply observe the skew, take a transform, then back-transform to get, for example, a confidence interval on the original non-transformed scale. But the critic would then say 'ah, but now this is an interval on the median, not the mean'. But the frequentist would then say that skewed distributions like the lognormal are often described better by their median than by their mean, and moreover, equations exist for confidence intervals of their mean anyway. And on and on and on!
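Here is a rough sketch of the transform and back-transform workflow in Python. The function name, the simulated lognormal sample, and the seed are assumptions for illustration only. The t critical value of 2.045 (29 degrees of freedom, two-sided 95%) is taken from a standard t table.

```python
import math
import random
import statistics

def median_ci_via_log(data, t_crit):
    """Interval for the median of positive, right-skewed data via a log transform.

    t_crit is the Student t critical value for n-1 degrees of freedom,
    supplied by the caller (for example, from a t table).
    Returns (lower, point estimate, upper) on the original scale.
    """
    logs = [math.log(x) for x in data]
    n = len(logs)
    mean_log = statistics.mean(logs)
    se_log = statistics.stdev(logs) / math.sqrt(n)
    lower = mean_log - t_crit * se_log
    upper = mean_log + t_crit * se_log
    # Back-transform: exp(mean of logs) is the geometric mean, which
    # estimates the median when the logged data are roughly symmetric.
    return math.exp(lower), math.exp(mean_log), math.exp(upper)

# Assumed example: 30 draws from a lognormal whose true median is exp(0) = 1.
rng = random.Random(1)
sample = [rng.lognormvariate(0, 1) for _ in range(30)]
print(median_ci_via_log(sample, t_crit=2.045))
```

The interval is computed on the log scale, where the usual t-based machinery is reasonable, and only the endpoints are exponentiated. That is why the result describes the median rather than the mean.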
We publish articles explaining one-sided statistical tests, resolving paradoxes and proving the need for using one-sided tests of significance and confidence intervals when claims corresponding to directional hypotheses are made. There are interactive simulations and code for simulations you can run yourself. You will also find links to related literature: both for and against one-sided tests.
In addition, Bayesian statistics regularly uses other frequentist concepts such as histograms, distributions, sampling, simulation, model checking, calibration, nonparametrics, and asymptotic procedures, to name a few.
I strongly believe that probability and statistics are the method of the scientific method. See the books The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century by Salsburg, and Creating Modern Probability: Its Mathematics, Physics and Philosophy in Historical Perspective by von Plato. Also, see Frequentism is Good, my response to the "airport fallacy" article by Gunter and Tong that appeared in the 10/2017 issue of Significance, in which I say
"I agree that frequentism is an embarrassment, but it is actually an embarrassment of riches."
"...while Bayesian is a powerful analysis tool in the right hands, it is not without risk. A Bayes formula that is frontloaded with a controlling assumption that MH370 "flew due south with no human input until fuel was exhausted" will always return whimsical results unless that is precisely what happened."
"By far the most serious error in this search was the attempt to make Bayesian statistics resolve location issues. In truth, Bayesian cannot be constructed even after the fact to find the correct location. It is simply not the tool for this challenge."
"They thought they were incredibly clever. Bragged about their analysis skills in endless articles; spent more time writing a book on Bayesian than looking. They believed they'd find it within a month. But the analysis was way beyond naive."
"NonNHST research is just as susceptible to QRPs as NHST."
"...a practical estimation question whose answer is simple (especially in the frequentist setting) but not obvious (especially in the Bayesian setting)."
"Our core result is that mathematical statistics and blackletter law combine to create a simple standard: statistical estimation evidence is legally sufficient when it fits the litigation position of the party relying on it. This means statistical estimation evidence is legally sufficient when the pvalue is less than 0.5; equivalently, the preponderance standard is frequentist hypothesis testing with a significance level of just below 0.5.""Finally, we show that conventional significance levels such as 0.05 require elevated standards of proof tantamount to clearandconvincing or beyondareasonabledoubt."
if hypotheses H_{1}(m) and H_{2}(n), with assumptions m and n respectively, explain event E equally well, choose H_{1} as the best working hypothesis if m < n
"The intuition behind requiring severity is that:Data x_{0} in test T provide good evidence for inferring H (just) to the extent that H passes severely with x_{0}, i.e., to the extent that H would (very probably) not have survived the test so well were H false."
"...while I consider myself a frequentist, I affirm the value of Bayesian probability...My skepticism is confined to claims such as the following: all probabilities are Bayesian probabilities, all knowledge is Bayesian credence, and all learning is Bayesian conditionalization".
Also see Bayesian Just-So Stories in Psychology and Neuroscience by Bowers and Davis. In it, they say
"According to Bayesian theories in psychology and neuroscience, minds and brains are (near) optimal in solving a wide range of tasks. We challenge this view and argue that more traditional, non-Bayesian approaches are more promising."
Also see Is it Always Rational to Satisfy Savage's Axioms? by Gilboa, Postlewaite, and Schmeidler. In it, they say
"This note argues that, under some circumstances, it is more rational not to behave inaccordance with a Bayesian prior than to do so. The starting point is that in the absence of information, choosing a prior is arbitrary. If the prior is to have meaningful implications, it is more rational to admit that one does not have sufficient information to generate a prior than to pretend that one does. This suggests a view of rationality that requires a compromise between internal coherence and justification, similarly to compromises that appear in moral dilemmas. Finally, it is argued that Savage's axioms are more compelling when applied to a naturally given state space than to an analytically constructed one; in the latter case, it may be more rational to violate the axioms than to be Bayesian."
Also, in an interview Frederick Eberhardt has said
"My thinking is not Bayesian. In fact, years ago, together with David Danks, I wrote a paper arguing that several experiments in cognitive psychology that purported to show evidence of Bayesian reasoning in humans, showed no such thing, or only under very bizarre additional assumptions. It was not a popular paper among Bayesian cognitive scientists."
"If a drug company presents some results to us  "a sample of n patients showed that drug X was more effective than drug Y"  and this sample could i) have had size n fixed in advance, or ii) been generated via an optional stopping test that was 'stacked' in favour of accepting drug X as more effective  do we care which of these was the case? Do we think it is relevant to ask the drug company what sort of test they performed when making our final assessment of the hypotheses? If the answer to this question is 'yes', then the Bayesian approach seems to be wrongheaded or at least deficient in some way."
See also Why optional stopping is a problem for Bayesians by de Heide and Grünwald.
"God would love a Bayes Factor of 3.01 nearly as much as a BF of 2.99."
"It is open to the experimenter to be more or less exacting in respect of the smallness of the probability he would require before he would be willing to admit that his observations have demonstrated a positive result. It is obvious that an experiment would be useless of which no possible result would satisfy him".See his Statistical Methods, Experimental Design, and Scientific Inference Arbitrary cutoffs are a standard of many journals, not any problem with the statistical theory itself. What is causing you to publish there? What is causing you from just reporting the observed pvalue? What is causing you to not use a different alpha? Why are you not also focusing on experimental design and power? Why don't you replicate your experiment yourself a few times before thinking about publishing? The NeymanPearson approach looks more at rules to govern behavior and helping to insure that in the long run we are not often wrong.
"Statistical methods continue to play a crucial role in HEP analyses; recent Higgs discovery is an important example. HEP has focused on frequentist tests for both p values and limits; many tools developed."
"The detection of a single changepoint can be posed as a hypothesis test. The null hypothesis, H_{0}, corresponds to no changepoint (m = 0) and the alternative hypothesis, H_{1}, is a single changepoint (m = 1)"
Perhaps there would be academic journals that would not let one publish if the BF is not greater than 5. Maybe there would be replications of studies that had a large BF that now have a smaller BF. There would probably be plenty of papers saying we need reform because of the misunderstanding of BF even among professional statisticians, or that some other statistic or approach is better than BF. A few people would mention that the first users of BF pointed out these stumbling blocks and misuses of BF a long time ago. One sees the point, I hope. Statisticians also use tables, graphs, and other statistics to make conclusions, so an overemphasis on p-values, BF, etc., is somewhat misguided.
"Although his [Birnbaum] argument purports that [(WCP and SP) entails SLP], we show how data may violate the SLP while holding both the WCP and SP."
Someone picks a fixed integer theta and asks you to guess it based on some data as follows. He is going to toss a coin twice (you do not see the outcomes), and from each toss he will report theta+1 if it turns out heads, or theta-1 otherwise. Hence the data x_{1} and x_{2} are an i.i.d. sample from a distribution that has probability .5 on theta-1 or theta+1. For example, he may report x_{1} = 5 and x_{2} = 5. The following guess will have a 75% probability of being correct:
C(x_{1}, x_{2}) =
x_{1}-1, if x_{1} = x_{2}
(x_{1}+x_{2})/2, otherwise
However, if x_{1} ≠ x_{2}, we should be 100% confident that the guess is correct, otherwise we are only 50% confident. It will be absurd to insist that on observing x_{1} ≠ x_{2} you only have 75% confidence in (x_{1}+x_{2})/2.
First, you have to love us mathematical statisticians because this example is one of the most contrived examples I've ever seen! Second, there is actually no paradox because the "confidence" is in the entire process. If you want to break down a process into subsets of the process (the 100% or 50% parts), you can do that as well. This "paradox" is basically the confidence interval version of the reference class problem. See this spreadsheet for a simulation.
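For readers who prefer code to a spreadsheet, here is a small Python sketch of the same idea. The function names, the choice of theta = 5, the number of trials, and the seed are all my own assumptions. It estimates how often the guess is right over the whole process and within each of the two subsets.

```python
import random

def guess(x1, x2):
    # The rule C(x1, x2) from the example above
    return x1 - 1 if x1 == x2 else (x1 + x2) / 2

def simulate(theta=5, trials=100_000, seed=0):
    """Return (overall, equal-case, different-case) proportions of correct guesses."""
    rng = random.Random(seed)
    hits = {"equal": 0, "different": 0}
    counts = {"equal": 0, "different": 0}
    for _ in range(trials):
        # Each report is theta+1 or theta-1 with probability .5
        x1 = theta + rng.choice((-1, 1))
        x2 = theta + rng.choice((-1, 1))
        key = "equal" if x1 == x2 else "different"
        counts[key] += 1
        hits[key] += guess(x1, x2) == theta
    overall = sum(hits.values()) / trials
    return (overall,
            hits["equal"] / counts["equal"],
            hits["different"] / counts["different"])

overall, when_equal, when_different = simulate()
print(round(overall, 3), round(when_equal, 3), round(when_different, 3))
```

The overall proportion should land near 0.75, with roughly 0.5 when the two reports agree and exactly 1 when they differ. The 75% describes the entire process, and the 50% and 100% describe its subsets.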
p(x|theta) =
e^{(theta-x)}, if x > theta
0, if x < theta
We observe {10, 12, 15}. What is a 95% confidence or credible interval for theta if it is known that theta must be less than 10? It turns out that a naive frequentist confidence interval, using an unbiased estimator approach, gives a 95% confidence interval of (10.2, 12.2). The fact that the lower limit of the confidence interval is greater than 10 is a problem because logically theta must be smaller than the smallest observed data. The Bayesian credible interval, using a flat prior, gives a 95% credible interval of (9, 10) which is more realistic. This example unfortunately doesn't permit the frequentists to consider any other approach for calculating confidence intervals. I believe order statistics (for example the minimum) and bootstrap could be useful with this problem. See this spreadsheet for some explorations.
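As a sketch of the order-statistic idea, note that for this model the sample minimum minus theta follows an exponential distribution with rate n, so an interval can be built from the minimum alone. The Python below (function names are mine) computes that interval next to the flat-prior credible interval, whose posterior is proportional to e^{n*theta} for theta below the minimum.

```python
import math

def min_based_interval(data, level=0.95):
    """Frequentist interval for theta based on the sample minimum.

    For p(x|theta) = exp(theta - x), x > theta, the quantity (min - theta)
    is exponential with rate n, so P(0 <= min - theta <= c) = 1 - exp(-n*c).
    """
    n = len(data)
    x_min = min(data)
    c = -math.log(1 - level) / n
    return x_min - c, x_min

def flat_prior_credible_interval(data, level=0.95):
    """Highest-density credible interval under a flat prior on theta."""
    n = len(data)
    x_min = min(data)
    # The posterior density rises up to x_min, so the interval ends there;
    # the lower end solves exp(n * (lower - x_min)) = 1 - level.
    lower = x_min + math.log(1 - level) / n
    return lower, x_min

data = [10, 12, 15]
print(min_based_interval(data))
print(flat_prior_credible_interval(data))
```

For these data both intervals come out at roughly (9.0, 10). A frequentist who is allowed to use the minimum gets the same sensible answer as the flat-prior Bayesian.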
Note that these paradoxes are typically resolved by some of these approaches
"An example of Bayesian divergence of opinion is based on Appendix A of Sharon Bertsch McGrayne's 2011 book The Theory That Would Not Die: How Bayes' Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries of Controversy. Tim and Susan disagree as to whether a stranger who has two fair coins and one unfair coin (one with heads on both sides) has tossed one of the two fair coins or the unfair one; the stranger has tossed one of his coins three times and it has come up heads each time.Tim assumes that the stranger picked the coin randomly  i.e., assumes a prior probability distribution in which each coin had a 1/3 chance of being the one picked. Applying Bayesian inference, Tim then calculates an 80% probability that the result of three consecutive heads was achieved by using the unfair coin, because each of the fair coins had a 1/8 chance of giving three straight heads, while the unfair coin had an 8/8 chance; out of 24 equally likely possibilities for what could happen, 8 out of the 10 that agree with the observations came from the unfair coin. If more flips are conducted, each further head increases the probability that the coin is the unfair one. If no tail ever appears, this probability converges to 1. But if a tail ever occurs, the probability that the coin is unfair immediately goes to 0 and stays at 0 permanently.
Susan assumes the stranger chose a fair coin (so the prior probability that the tossed coin is the unfair coin is 0). Consequently, Susan calculates the probability that three (or any number of consecutive heads) were tossed with the unfair coin must be 0; if still more heads are thrown, Susan does not change her probability. Tim and Susan's probabilities do not converge as more and more heads are thrown."
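The arithmetic in the quoted example can be checked with a short Python sketch. The function name is my own, and exact fractions are used so that no rounding enters the comparison.

```python
from fractions import Fraction

def posterior_unfair(prior_unfair, n_heads):
    """Posterior probability of the two-headed coin after n_heads straight heads."""
    prior_unfair = Fraction(prior_unfair)
    like_unfair = Fraction(1)               # two-headed coin always shows heads
    like_fair = Fraction(1, 2) ** n_heads   # a fair coin gives (1/2)^n
    numerator = like_unfair * prior_unfair
    evidence = numerator + like_fair * (1 - prior_unfair)
    return numerator / evidence

tim = posterior_unfair(Fraction(1, 3), 3)   # Tim: each coin equally likely
susan = posterior_unfair(0, 3)              # Susan: prior of 0 on the unfair coin
print(tim, susan)  # prints: 4/5 0
```

Tim's 4/5 is the 80% in the quote. Susan's posterior stays at 0 for any number of heads, because a prior of exactly 0 cannot be moved by any amount of data.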
An example of Bayesian convergence of opinion is in Nate Silver's 2012 book The Signal and the Noise: Why So Many Predictions Fail - but Some Don't. After stating, "Absolutely nothing useful is realized when one person who holds that there is a 0 (zero) percent probability of something argues against another person who holds that the probability is 100 percent", Silver describes a simulation where three investors start out with initial guesses of 10%, 50% and 90% that the stock market is in a bull market; by the end of the simulation (shown in a graph), "all of the investors conclude they are in a bull market with almost (although not exactly of course) 100 percent certainty."
Bayesian convergence in this case is simply a nicer way of saying that the likelihood swamped the priors.
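A toy Python sketch of what "swamping" looks like follows. It is not Silver's simulation: the two likelihood values (a 60% chance of an up day in a bull market, 40% in a bear market), the number of days, and the seed are hypothetical numbers I picked for illustration.

```python
import random

# Hypothetical likelihoods, chosen for illustration only (not Silver's numbers)
P_UP_BULL, P_UP_BEAR = 0.6, 0.4

def update(prior_bull, up_day):
    """One application of Bayes' rule for the hypothesis 'bull market'."""
    like_bull = P_UP_BULL if up_day else 1 - P_UP_BULL
    like_bear = P_UP_BEAR if up_day else 1 - P_UP_BEAR
    numerator = like_bull * prior_bull
    return numerator / (numerator + like_bear * (1 - prior_bull))

def run(priors=(0.1, 0.5, 0.9), days=500, seed=0):
    """Feed the same simulated bull-market data to investors with different priors."""
    rng = random.Random(seed)
    beliefs = list(priors)
    for _ in range(days):
        up_day = rng.random() < P_UP_BULL   # data generated under a bull market
        beliefs = [update(b, up_day) for b in beliefs]
    return beliefs

print(run())
print(update(0.0, True))  # a prior of exactly 0 never moves
```

All three investors see the same data, so the same likelihood ratios multiply very different starting odds, and with enough days the starting odds stop mattering. The last line shows the exception from the Tim and Susan example: a prior of exactly 0 is never swamped.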
"...31.1% of the articles did not even discuss the priors implemented"Because Bayesian analysis is "brittle" and often highly dependent on the priors, and no one can replicate your work if you don't detail your priors, these practices are very worrying."Another 24% of the articles discussed the prior superficially, but did not provide enough information to reproduce the prior settings..."
"The discussion about the level of informativeness of the prior varied articlebyarticle and was only reported in 56.4% of the articles. It appears that definitions categorizing "informative," "mildly/weakly informative," and "noninformative" priors is not a settled issue."
"Some level of informative priors was used in 26.7% of the empirical articles. For these articles we feel it is important to report on the source of where the prior information came from. Therefore, it is striking that 34.1% of these articles did not report any information about the source of the prior."
"Based on the wording used by the original authors of the articles, as reported above 30 empirical regressionbased articles used an informative prior. Of those, 12 (40%) reported a sensitivity analysis; only three of these articles fully described the sensitivity analysis in their articles (see, e.g., Gajewski et al., 2012; Matzke et al., 2015). Out of the 64 articles that used uninformative priors, 12 (18.8%) articles reported a sensitivity analysis. Of the 73 articles that did not specify the informativeness of their priors, three (4.1%) articles reported that they performed a sensitivity analysis, although none fully described it."
Thus if we follow the theory and communicate to another person a density C*theta^{100}*(1-theta)^{100} this person has no way of knowing whether (1) an experiment with 200 trials has taken place or (2) no experiment took place and this is simply an a priori expression of opinion. Since some of us would argue that the case with 200 trials is more "reliable" than the other, something is missing in the transmission of information.
"...a band performing at the closing party made jokes about sexual assault. This is a band that is composed mostly of famous academics in machine learning and statistics...""At ISBA 2010 (the same conference where the comments were made about my dress) [the International Society for Bayesian Analysis  J], I saw and experienced things that, in retrospect, were instrumental in my decision to (mostly) leave the field."
"There really is just a lot of sexual harassment of women in Bayesian statistics and machine learning..."
"The researchers involved are experts in Bayesian statistics, which underpins a powerful type of AI known as machine learning. The accusations have surfaced during a growing debate over the lack of diversity among machine learning researchers..."
As Bayesian methods continue to grow in accessibility and popularity, more empirical studies are turning to Bayesian methods to model small sample data. Bayesian methods do not rely on asymptotics, a property that can be a hindrance when employing frequentist methods in small sample contexts. Although Bayesian methods are better equipped to model data with small sample sizes, estimates are highly sensitive to the specification of the prior distribution. If this aspect is not heeded, Bayesian estimates can actually be worse than frequentist methods, especially if frequentist small sample corrections are utilized. We show with illustrative simulations and applied examples that relying on software defaults or diffuse priors with small samples can yield more biased estimates than frequentist methods. We discuss conditions that need to be met if researchers want to responsibly harness the advantages that Bayesian methods offer for small sample problems as well as leading small sample frequentist methods.
"It is sometimes said, in defence of the Bayesian concept, that the choice of prior distribution is unimportant in practice, because it hardly influences the posterior distribution at all when there are moderate amounts of data. The less said about this 'defence' the better."
P(A|B) = [P(B|A)P(A)]/P(B)
where A and B are general events. Where Bayes Theorem becomes technically "Bayesian" is where the P(A) is a probability distribution for a parameter, and definitely "Bayesian" where the P(A) is based, not on prior experiment or objectivity, but on subjectivity. For a binary event, P(B) = P(B|A)P(A) + P(B|not A)P(not A). Note that the ratio often becomes computationally intractable because of very difficult integrals in the numerator and denominator.
It is claimed by critics that this is a frequentist text and that therefore frequentists have cognitive dissonance and frequentists will complain about the prior of other people's Bayesian analyses, yet they are happy to apply latent variable models, stating that the results don't depend on the prior. In actuality, this text has pages on Bayesian analysis. Also, there are times where priors strongly influence a Bayesian analysis, for example where there is not a lot of data, so frequentists "complaining" is certainly justified in those cases. Bartholomew makes the point that
The link between the two is expressed by the distribution of x given w [I changed his symbol to w. - J]. Frequentist inference treats w as fixed; Bayesian inference treats w as a random variable. In latent variables analysis we may think of x as partitioned into two parts x and y where x is observed and y, the latent variable, is not observed. Formally then, we have a standard inference problem in which some of the variables are missing. The model will have to begin with the distribution of x given w and y. A purely frequentist approach would treat w and y as parameters whereas the Bayesian would need a joint prior distribution for w and y. However, there is now an intermediate position, which is more appropriate in many applications, and that is to treat y as random variable with w fixed.
That is, he is making very clear that latent variable models are some hybrid of the two approaches. He also writes
There can be no empirical justification for choosing one prior distribution for y rather than another.
This is apparently because X is sufficient for y in the Bayesian sense. In other words, if no priors matter, it is like not using a prior in the first place.
"Now surely the degree to which a datum corroborates or impugns a proposition should be independent of the datumassessor's personal temerity. Yet according to orthodox significancetest procedure, whether or not a given experimental outcome supports or disconfirms the hypothesis in question depends crucially upon the assessor's tolerance for Type I risk."Alpha is the probability of making a Type I error. Statisticians should therefore make alpha directly tied to the cost of making a Type I error (and then further adjust alpha smaller if needed). This cost can be an actual dollar amount, lives lost, or the general cost of "if the claim being tested were true, how would that disrupt our current understanding of the world?", for example, in the case of testing claims of ESP. Moreover, Rozeboom presumably has no problems with using "personal temerity" in choosing a prior distribution.
Ever hear a frequentist call their stats "subjective?" Some frequentism popularizers even have the audacity to teach that the main difference between Bayes and Frequentist is that Bayesian is subjective and frequentism is objective.
If everything is supposedly subjective, why the label in the first place? Why is there a separate 'subjective Bayesian' term created by Bayesians? Obviously the subjective part refers to the priors and not anything else, like taking previous experiments into account, likelihoods, expert opinion, etc. Anyone can flip a coin and observe the relative frequency of Heads tending to converge to a horizontal line as the number of flips increases, or watch ball bearings cascade down a Galton board to form an approximate normal distribution, so these observations are not subjective.
If testing claims of ESP, consider lowering alpha drastically because it is an extraordinary claim that, if true, would change our fundamental knowledge about the world. The James Randi Educational Foundation had a $1,000,000 challenge to anyone demonstrating ESP, psychic, paranormal, etc., powers in a controlled setting, and they took this approach. Suffice it to say, using good experimental design and a low alpha precluded winning the money merely by chance. The JREF rightly recognized that setting alpha should be based on the cost of making a Type I error.
Here is some "expert opinion" that led to Bayesian proofs of God existing: The Probability of God: A Simple Calculation That Proves the Ultimate Truth by Unwin, and The Existence of God by Swinburne.
Of course, this contradicts the "all null models are actually false" charge.
Wasserman has noted that we all use p(X) for probability, but maybe should use f(X) for frequencies and b(X) for beliefs/Bayesian.
Thanks for reading and sending in comments/corrections!