
Tuesday, 6 April 2021

BAYESIAN HISTORY: 5. THE ENIGMATIC EIGHTIES

 

5. THE ENIGMATIC EIGHTIES

 

The real thing in this world is not where we stand, but in which direction we are moving (Julius Nyerere)

 

The Mayflower Rose, by Fabio Cunha

 

When I remember the promising early years of the Reagan era in America, I recall my metaphoric poem, The Mayflower Rose. Maybe Rose was America itself.

 

In what wondrous dream

Do I suppose

I met the Mayflower Rose?

Her petals turned pink

In a hug and a blink;

Her stem twisted in the breeze,

When I fell to my knees;

Her aura turned heavenly and angelic,

As I plied her with Dumnonian magic.

But when the prickly thistle flew in,

Rose was gone in the din

I twisted and turned

As I drank like a tank for ever and a day.

When Yank fought Assyrian in the Gulf of Tears,

She, the Voice behind the Screen,

Spoke as if I’d never been.

Now she beams across the mind waves

In my wondrous dreams.

 

The 1980s were made even more enigmatic by George Box’s seminal article [67], in which Box argued that while ‘coherent’ Bayesian inferences are appropriate given the assumed truth of a sampling model, the assumed model should always be checked out against a completely general alternative using a frequency-based significance test together with other non-Bayesian data analytic techniques. His not-so-original philosophy ‘All models are wrong, but some are useful’ has since become engrained in statistical folklore. Consequently, Bayesian researchers were forced to acknowledge that model-based developments do not usually completely solve the applied problem.

    Box’s approach posed a serious enigma. Should Bayesians continue blandly with their time-honoured model-based research, or try to compete in the ‘model unknown’ arena, where they would need to develop new talents and techniques? Many Bayesians opted out by focussing on the first of these options, and others were faced with daunting technical and conceptual difficulties while seeking theoretical procedures that might justify their choices of parameter-parsimonious sampling model. However, the increased computational feasibility of model-based inferences made it easier for them to concentrate more of their energies on trying to do this.

    Many economists persisted with the Savage-Lindley gob-smacking philosophy ‘A model should be as big as an elephant’ (see [68]) while failing to reasonably identify all the parameters in their, albeit highly meaningful, models from finite data sets. However, for sensibly parametrized non-linear models, importance sampling continued to play a key model-based inferential role, and AIC and other information criteria were frequently used for model comparison.

    When the sample space is continuous, Box referred to the prior predictive density p(y) of the n×1 observation vector y. This averages the hypothesised sampling density of y with respect to the prior distribution of the unknown parameter vector θ. The elements of y of course consist of the numerical realizations of the corresponding elements of a random vector Y. Box then recommended testing the hypothesised distribution against a general alternative by reference to the ‘significance probability’,

 

α = prob[ log p(Y) < log p(y) ]

 

    For this definition to make sense in general terms, ‘prob’ should be interpreted as shorthand for ‘prior predictive probability’ rather than sampling probability. Therefore α cannot in general be regarded as a classical significance probability. This criterion is heavily dependent upon the choice of prior distribution for θ, as evidenced by the various special cases considered in Box’s paper. In special cases, e.g. those involving the linear statistical model, α depends upon the realization of a low-dimensional model-based sufficient statistic and not further upon the observations or residuals, a curious property.
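
    To make the criterion concrete, here is a minimal Monte Carlo sketch of Box’s α for a toy conjugate set-up, in which y consists of n independent N(θ, σ²) observations and θ has a N(μ₀, τ²) prior, so that p(y) is available in closed form. The model, the prior settings and the simulated data are my own illustrative assumptions and are not taken from Box’s paper.

```python
# A minimal sketch of Box's prior predictive check, assuming a toy conjugate
# set-up: y_i ~ N(theta, sigma2) with a N(mu0, tau2) prior on theta, so that
# the joint prior predictive density p(y) is multivariate normal.
# All numerical settings are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, sigma2 = 20, 1.0            # sample size and (known) sampling variance
mu0, tau2 = 0.0, 4.0           # prior mean and variance for theta

def log_prior_predictive(y):
    # p(y) = N(mu0 * 1, sigma2 * I + tau2 * J) under the conjugate set-up
    cov = sigma2 * np.eye(n) + tau2 * np.ones((n, n))
    return stats.multivariate_normal.logpdf(y, mean=np.full(n, mu0), cov=cov)

def simulate_Y():
    # Draw Y from the prior predictive: theta from the prior, then the data
    theta = rng.normal(mu0, np.sqrt(tau2))
    return rng.normal(theta, np.sqrt(sigma2), size=n)

y_obs = rng.normal(loc=2.5, scale=1.0, size=n)        # 'observed' data (illustrative)
log_p_obs = log_prior_predictive(y_obs)
draws = np.array([log_prior_predictive(simulate_Y()) for _ in range(5000)])

alpha = np.mean(draws < log_p_obs)                    # Box's significance probability
print(f"estimated prior predictive p-value: {alpha:.3f}")
```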

    Shortly before George read his paper to the Royal Statistical Society, I talked to him about the large sample size properties of α (see Author’s Notes below). These seemed to suggest that α was overly dependent on the prior assumptions for effective practical model checking. George nevertheless went proudly ahead as planned, his paper was very well received by the assembled Fellows, and his path-breaking philosophies have become engrained in applied Bayesian Statistics ever since.

    Model testing procedures which adapted Box’s ideas by instead referring to a ‘posterior predictive p-value’ proved to be somewhat more appealing. See, for example, Andrew Gelman, Xiao-Li Meng and Hal Stern’s well-cited paper in Statistica Sinica (1996).

    Nevertheless, the question remains as to how small a p-value would be needed to justify rejecting the hypothesised model. In conceptual terms, you might wonder whether it is reasonable to aspire to checking out a model against a completely general alternative (including models with variously non-independent error terms) without any class of alternative models in mind. From a frequency perspective, any fixed-size test procedure which is reasonably powerful against some choices of alternative model might be less powerful against other choices. Moreover, a significance test is next to worthless if it possesses unconvincing power properties.

    If you don’t think that flagging your hypothesised sampling models with tail probabilities provides the final answer, then that leaves you with residual analyses and the well-versed criteria for model comparison, which nowadays include AIC, DIC and cross-validation. Assigning positive prior probabilities to several candidate models with several unknown parameters is not usually a good idea, unless the model parameters are estimated empirically, since the posterior probabilities will depend upon Bayes factors which are subject to the sorts of paradoxes and sensitivities that I described in Ch. 2.  However, in the empirical case, one possibility would be to estimate the sampling density by its estimated posterior mean value function.
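
    As a small illustration of the kind of model comparison criteria mentioned above, the following sketch compares two nested Gaussian regression models by AIC. The simulated data, the competing models and the by-hand AIC formula are my own illustrative choices, not a prescription.

```python
# A hedged sketch of AIC comparison for two nested normal regression models.
# AIC = 2k - 2 * maximised log-likelihood; smaller is better.
import numpy as np

rng = np.random.default_rng(1)
n = 50
x = rng.uniform(-2, 2, size=n)
y = 1.0 + 0.8 * x + rng.normal(scale=1.0, size=n)    # data generated from the linear model

def gaussian_aic(y, X):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)     # least squares = Gaussian MLE
    resid = y - X @ beta
    sigma2 = resid @ resid / len(y)                  # MLE of the error variance
    loglik = -0.5 * len(y) * (np.log(2 * np.pi * sigma2) + 1)
    k = X.shape[1] + 1                               # regression coefficients + variance
    return 2 * k - 2 * loglik

X_linear = np.column_stack([np.ones(n), x])
X_quad = np.column_stack([np.ones(n), x, x ** 2])    # adds an unneeded quadratic term
print(f"AIC (linear):    {gaussian_aic(y, X_linear):.1f}")
print(f"AIC (quadratic): {gaussian_aic(y, X_quad):.1f}")
```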


In his 1980 article in the Annals of Statistics, Jim Berger reported Bayesian procedures which improved on standard, but inadmissible, estimators in continuous exponential families. His books on Statistical Decision Theory were published in 1980 and 1985, and his 1984 monograph with Robert Wolpert is titled The Likelihood Principle: A Review and Generalisations. Jim has always analysed Bayesian-related problems with outstanding mathematical rigor and a strong regard for the frequency properties of the estimation, inference, and decision-making procedures. He is currently an Arts and Sciences Distinguished Professor in the Department of Statistical Sciences at Duke University.

    In their 1980 paper in the Annals of Statistics, Connie Shapiro (now Connie Page) and Bob Wardrop applied Dynkin’s identity to prove optimality of their sequential Bayes estimates for the rate of a Poisson process, with a loss function which involved costs per arrival and per unit time. They investigated the frequency properties of their procedures in an asymptotic situation.

 
Connie Page
 

    In the same year, Shapiro and Wardrop reported their Bayesian sequential estimation for one-parameter exponential families in an equally high quality paper in JASA.

    Now an Emeritus Professor at Wisconsin, and as ever an outstanding teacher, Bob Wardrop published his book Statistics: Learning in the Presence of Variation in 1995. Connie Page, who has always played an important leadership role in American Statistics, went on to focus on statistical consultancy at Michigan State.

 

Tom's Facebook friend Bob Wardrop with his three grandchildren, Lodi, Wisconsin, 2013

 

In 1981, Mike West, one of Adrian Smith’s many highly successful research students, published an important paper in JRSSB out of his Ph.D. thesis about robust sequential approximate Bayesian estimation. In 1983, Mike and Adrian reported an interesting application of the Kalman Filter to the monitoring of renal transplants. Mike’s path-breaking joint paper in JASA (1985) with Jeff Harrison and Helio Migon set fresh horizons for the forecasting of count data within a dynamic generalised linear model framework that was easily expressible in terms of two or more stages of a hierarchical distributional structure. Mike West reported his many achievements in Bayesian forecasting much later, in his magnificent books with Harrison, and Harrison and Pole.


In 1981, Dr. Jim Low of Kingston General Hospital, Ontario, and six co-authors, including Louis Broekhoven and myself, reported our 1978 Bayesian analysis of the Ontario foetal metabolic acidosis data in an invited discussion paper which was published in the American Journal of Obstetrics and Gynaecology. See [15], pp. 92-95, for a description. Various birth weight distributions were modelled using Edgeworth’s skewed normal distribution, which I thought that I was inventing for this purpose.

 

Dr. James Low, Kingston General Hospital

 

    The many advances in Bayesian medical statistics between 1981 and 2006 were capably reviewed by Deborah Ashby and published on-line by Wiley in 2006. Important developments during the early 1980s include Jay Kadane and Nell Sedransk’s Bayesian motivations towards more ethical clinical trials, the methodology developed by Laurence Freedman and David Spiegelhalter for the assessment of subjective opinion concerning clinical trials, the clinical decision-support systems devised by David Spiegelhalter and Robin Knill-Jones, with very useful applications in gastroenterology, and the empirical Bayes estimates of cancer mortality rates which were proposed by Robert Tsutakawa, Gary Shoop and Carl Marienfield.

    In his 1990 paper ‘Biostatistics and Bayes’ in Statistical Science, Norman Breslow reviewed the changing attitudes of biostatisticians towards the Bayesian and Empirical Bayesian paradigms.

    Peter Armitage, Geoffrey Berry and John Matthews are sympathetic towards Bayesian methods in medical statistics in their expository treatise Statistical Methods in Medical Research (4th edition, 2002).

    In 2002, John Duffy and I co-authored an article in Statistics in Medicine concerning the Mantel-Haenszel model for several 2×2 contingency tables. Our Laplacian approximations facilitated a meta-analysis of data from several clinical trials, and we applied our methodology to various sets of ear infection data.



The leading Bayesian statistician Morris De Groot bravely led an ASA delegation to Taiwan to investigate the brutal murder in Taipei in 1981 of the Han Chinese martyr Chen Wen Chen, who was De Groot’s colleague in the Department of Statistics at Carnegie-Mellon University. De Groot was lucky to get out of Taiwan alive when the medical doctor in the delegation performed an autopsy which revealed the true cause of death. The Associated Press reporter Tina Chou was blacklisted, and hounded for many years afterwards, for reporting the results of the autopsy. She has not been seen or heard of for at least a decade.

    Wen Chen’s assassination by the axe-wielding Taiwanese secret police is regarded as a seminal event in the history of Taiwan. It would appear, according to various reports I have received down my grapevine, that the crime may have been brought about by some scary interstate academic intrigue within the United States, maybe even with undertones of professional jealousy. Morry De Groot’s endeavours on behalf of the ASA, with the assistance of his very supportive colleagues at Carnegie-Mellon, were extremely laudable. Nevertheless, American Bayesians should not be completely proud of what happened, if an unexpected e-mail I received in Madison during the 1990s is to be believed. Chen Wen Chen’s case was reopened in Taiwan several years ago. It would, perhaps, still benefit from further investigation.

 

Chen Wen Chen (1950-1981)
Han Chinese Martyr

 

Morris De Groot (1931-1989)

 

In 1981, I helped George Box to organize a special statistical year at the U.S. Army’s Mathematics Research Center (M.R.C.) at the University of Wisconsin-Madison. George’s motivation was to orchestrate a big debate about the Bayesian-frequency controversy, and concerning which philosophy to adopt when checking out the specified sampling model. If he was looking for some sort of consensus of opinion that the frequency model-testing procedures he’d proposed in his 1980 paper [67] were the way to go, then that is what he eventually achieved after some relatively courteous disagreement from the diehard Bayesians.

    With these purposes in mind, a number of Bayesian and frequentist statisticians were invited to visit M.R.C. during the year. Their lengths of stay varied between a few weeks and a whole semester. Then, in December 1981 everybody was invited back to participate in an international conference in a beautiful auditorium overlooking Lake Mendota. The conference proceedings Scientific Inference, Data Analysis, and Robustness were edited by George, myself, and Jeff C.F. Wu, but I never knew how many copies were sold by Academic Press. Maybe the proceedings fell stillborn, David Hume-style, from the press.

    The Mathematics Research Center was housed at that time on the fifth, eleventh and twelfth floors of the elegantly thin WARF building on the extreme edge of the UW campus, near Picnic Point which protrudes from the south-east shoreline of the still quite temperamental Lake Mendota. The quite gobsmacking covert activities were not altogether confined to the thirteenth floor.

 
The WARF Building, Madison, Wisconsin
 

    During the Vietnam War, M.R.C. was instead housed in Sterling Hall in the middle of the UW campus. However, on August 24, 1970, it was bombed by four young people as a protest against the University’s research connections with the U.S. Military. The bombing resulted in the death of a university physics researcher and injuries to three others.

 

The Bombing of Sterling Hall
 
    Visitors during the special year included, as I remember, Hirotugu Akaike, Don Rubin, Mike Titterington, Peter Green, Granville Tunnicliffe-Wilson, Irwin Guttman, George Barnard, Colin Mallows, Morris De Groot, Toby Mitchell, Bernie Silverman, Phil Dawid, and Michael Goldstein. However, Dennis Lindley was the Bayesian who George Box really wanted to attract to Madison. Dennis was by then an expensive commodity on the American circuit, where he wasn’t always kowtowed to. He, for example, reportedly got into a classic confrontation with a Dean at Duke a year or so later when the Dean invited him to teach a basic level course on frequentist statistics.
 

Hirotugu Akaike (1927-2009)

 

     George emptied the military’s coffers to the tune of almost $50,000 in his ultimately successful efforts to tempt Dennis into visiting us for the first semester of 1981. George advised me that it is always important to bury the hatchet with your long-term adversaries since we’re all really ‘one big happy family’, or words to that effect (he and Dennis had crossed swords over twenty years previously on a non-academic issue, at which time Dennis had reportedly ‘displayed an iron fist from behind a velvet glove’). Dennis and his kindly wife Joan stayed in George Tiao’s enormous house on the icy west side of Madison for the duration of their visit.

    After an awkward settling-in period, Dennis and George proceeded to debate the fundamental philosophical issues during a series of special seminars at M.R.C. While I thought that Dennis was winning, primarily because of the way his mathematical sharpness and sense of self-belief contrasted with George’s aura of practical relevance and imminent greatness, one of my students thought that George was treating him (Dennis) like a punch bag. Anyway, the debate petered out in April 1981 when Dennis departed in his usual gentlemanly style to continue his early retirement routine at another American college.

    George presented a variation of his, albeit strangely flawed, 1980 paper [67] on prior predictive model-checking at M.R.C.’s international conference on the shores of Lake Mendota in December 1981. However, he’d previously declared victory when Dennis suddenly pulled out of the much-planned and expensively-funded confrontation, for reasons best known to himself, by abruptly declining to attend the meeting.

    The conference was nevertheless highly successful and George received lots of pats on the back for his refreshing re-interpretation of Bayesian Statistics. Indeed, Bob Hogg was so thrilled about it that he invited me to give a seminar on a similar topic in Iowa City. As I was to discover afterwards, this was so that he could tease and irritate one of his ‘coherently Bayesian’ colleagues with the good news. What a soap opera, but I’m sure that the U.S. Army got value for money.

 
Robert V. Hogg
 

    I still seriously wonder whether some of the research produced at M.R.C. during the special year was used by the U.S. Army Research Office to enhance Reagan’s Star Wars program. With some of the stuff going on behind closed doors involving the statistics of Pershing missiles, nuclear missiles hitting silos, and whatever, I wouldn’t put it past them.

    George Box’s academic life story is published in his autobiography The Accidental Statistician. Whilst never humble or an extremely brilliant mathematician, his forte lay in his ability to perceive simple practical solutions in situations where the mathematizers couldn’t untangle the wood from the trees. That was his saving grace, and at that he was supremely magnificent. By 1981, when he was 62, he had doubtlessly achieved his much-longed-for everlasting greatness. George did not however retire until many years afterwards, even though he was presented with a retirement cheque and a wooden rocking chair during a faculty reception in 1992.


In 1982, S. James Press impressed the statistical world with the second edition of his magnificently detailed book Applied Multivariate Analysis, which referred to both Bayesian and frequentist methods of inference in meaningful multi-dimensional models. Jim was to publish three further Bayesian books with John Wiley. The second of the trio was co-authored with Judith Tanur.

    Jim is an Emeritus Professor of Statistics at the University of California at Riverside. He was appointed Head of the Statistics Department there when the long-exiled iconic English statistician Florence David retired from the position in 1977.

    Jim founded the Bayesian Statistical Sciences Section of the American Statistical Association in 1992, in collaboration with Arnold Zellner and ISBA. He has always been a powerful figure in Bayesian Statistics, with an ultra-charming personality and a wry sense of humour. He learnt his trade as a coherent Bayesian while visiting UCL in 1971-2, but never lost sight of his practical roots.

 

Irving Jack Good

 

    In their 1982 JASA paper, Jack Good and James Crook used the concept of the strength of a significance test in the context of contingency table analysis. The strength of a test averages the power with respect to the prior distribution. Bayesian decision rules can yield optimal fixed-size strength properties. See Lehmann’s Testing Statistical Hypotheses, p. 91, and Exercise 3.10g on page 162 of [15]. When the null hypothesis is composite, the average of the probability of a Type 1 error with respect to a prior measure could be considered rather than the size.
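
    To illustrate the idea of strength in the simplest possible setting, here is a short Monte Carlo sketch that averages the power of a one-sided, fixed-size z-test over a prior for a normal mean. The normal-mean testing problem and the prior are my own illustrative assumptions, not the contingency-table setting of Good and Crook’s paper.

```python
# A small sketch of the 'strength' of a significance test: the power function
# averaged with respect to a prior distribution for the parameter.
# The z-test and the prior below are illustrative assumptions.
import numpy as np
from scipy import stats

n = 25
z_crit = stats.norm.ppf(0.95)       # one-sided size-0.05 z-test of H0: theta <= 0

def power(theta):
    # P(reject | theta) when the test rejects if sqrt(n) * ybar > z_crit
    return 1.0 - stats.norm.cdf(z_crit - np.sqrt(n) * theta)

rng = np.random.default_rng(2)
theta_draws = rng.normal(loc=0.5, scale=0.2, size=100_000)   # prior over the alternative
strength = power(theta_draws).mean()                          # prior-averaged power
print(f"size = 0.05, estimated strength = {strength:.3f}")
```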

    Valen Johnson pursued Good and Crook’s ideas further in his 2005 paper in JRSSB by considering Bayes factors based on test statistics whose distributions under the null hypothesis do not depend on the unknown parameters. Johnson’s significance testing approach removes much of the subjectivity that is usually associated with Bayes factors. In his 2009 article in JASA with Jianhua Hu, the courageous co-authors extended this methodology to model selection, a wonderful piece of work which should influence Bayesian model selectors everywhere. 


Valen Johnson is Professor of Statistics at Texas A&M University. His applied Bayesian research interests include educational assessment, ordinal and rank data, clinical design, image analysis, and reliability analysis. More to the point, his theoretical ideas are highly and refreshingly original, and I only wish that I’d thought of them myself. He is a worthy candidate for the Bayesian Hall of Fame, should that ever materialise.

 
Valen Johnson
 

    Debabrata Basu and Carlos Pereira published an important Bayesian paper in the Journal of Statistical Planning and Inference in 1982 on the problem of non-response when analysing categorical data.

    Basu is famous for his counterexamples to frequentist inference, and for Basu’s theorem, which states that any boundedly complete sufficient statistic is independent of any ancillary statistic.


S.H. Chew’s 1983 Econometrica paper ‘A generalization of the quasi-linear mean with applications to the measurement of income inequality and decision theory resolving the Allais paradox’ affected the viability of some aspects of the foundations of Bayesian decision theory.

    Also in 1983, Carl Morris published an important discussion paper in JASA on parametric empirical Bayes inference, in which he shrunk the unbiased estimators for several normal means towards a regression surface. Similar ideas had been expressed by Adrian Smith in Biometrika in 1973, using a hierarchical Bayesian approach, but Carl derived some excellent mean squared error properties for his estimators.
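
    The following is a rough sketch, in the general spirit of parametric empirical Bayes, of shrinking a set of group means towards a fitted regression surface; it uses a simple moment estimate of the between-group variance and a common shrinkage factor, and is not intended to reproduce Morris’s exact estimators.

```python
# A generic parametric empirical Bayes sketch: shrink observed group means
# towards a fitted regression surface. Assumes y_i ~ N(theta_i, V) with V known
# and theta_i ~ N(x_i' beta, A); A is estimated by a simple moment method.
# This is an illustrative sketch, not Morris's exact 1983 procedure.
import numpy as np

rng = np.random.default_rng(3)
k, V = 12, 1.0
X = np.column_stack([np.ones(k), np.arange(k, dtype=float)])   # intercept + one covariate
theta = rng.normal(X @ np.array([1.0, 0.3]), np.sqrt(2.0))     # 'true' group effects
y = rng.normal(theta, np.sqrt(V))                              # observed group means

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)               # fitted regression surface
resid = y - X @ beta_hat
A_hat = max(0.0, resid @ resid / (k - X.shape[1]) - V)         # moment estimate of A
B_hat = V / (V + A_hat)                                        # common shrinkage factor
theta_eb = (1 - B_hat) * y + B_hat * (X @ beta_hat)            # shrink towards the surface
print(np.round(theta_eb, 2))
```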

 
Carl Morris
 

    In contrast to the proposals by Carl Morris and Adrian Smith, Amy Racine-Poon developed a Bayesian approach to some non-linear random effects models in her 1985 Biometrics paper, using a Lindley-Smith-style hierarchical prior, where she estimated the hyperparameters by marginal posterior modes using an adaptation of the EM algorithm. Her method performed rather well in small sample simulations. In a real example, her method, when applied to only 40% of the observations, yielded similar results to those obtained from the whole sample.

    In their 1981 exposition in JASA, John Deely and Dennis Lindley had proposed a ‘Bayes Empirical Bayes’ approach to some of these problems. They argued that many empirical Bayes procedures are really non-Bayesian asymptotically optimal classical procedures for mixtures. However, an outsider could be forgiven for concluding that the difference between ‘Empirical Bayesian’ and ‘applied Bayesian’ is a question of semantics and a few technical differences, and certainly not something which is worth going to war over.

 

John Deely

 

    John Deely is Emeritus Professor of Statistics at the University of Canterbury in Christchurch, and a continuing itinerant lecturer in the Department of Statistics at Purdue. He has graced the applied Bayesian literature with many fine publications, and is wonderfully frank and honest about everything.

    At Valencia 6 in 1998, John was to provide me with several valuable Bayesian insights, though he got plastered while I was discussing them with Adrian Smith. Undeterred, I rose to the occasion during the final evening song and comedy skits, and played Rev. Thomas Bayes returning from Heaven in a white sheet for an interview with Tony O’Hagan, only to get into an unexpected tangle with the microphone.

 

Tom playing Rev. Thomas Bayes returning from Heaven
Valencia 6 Cabaret 1998. The Master of Ceremonies
Tony O'Hagan played ping pong with Tom at UCL

 

In similar spirit to the Deely-Lindley Bayes Empirical Bayes approach, Rob Kass and Duane Steffey’s ‘Bayesian hierarchical’ methodology for non-linear models, which they published in JASA in 1989, referred to estimates for the hyperparameters which were calculated empirically from the data. Like Deely and Lindley, they used Laplacian approximations as part of their procedure.

    The Kass-Steffey conditional Laplacian approach was to create the basis for many of the routines in the INLA package, some of which refer to marginal mode estimates for the hyperparameters; the package was reported by Rue, Martino and Chopin in their extremely long paper in JRSSB, fully twenty years later in 2009.

    Back in 1983, Bill DuMouchel and Jeffrey Harris published an important discussion paper in JASA about the combining of results of cancer studies in humans and other species. They employed a two-way interaction model for their estimated response slopes together with a hierarchical prior distribution for the unknown row, column, and interaction effects. Their methodology preceded more recent developments in Bayesian meta-analysis.

    In his 1984 paper in the Annals of Statistics, the statistician and psychologist Don Rubin proposed three types of Bayesianly justifiable and relevant frequency calculations which he thought might be useful for the applied statistician. This followed Don’s introduction in the Annals of Statistics (1981) of the Bayesian bootstrap, which can, at least in principle, be used to simulate the posterior distribution of the unknown parameters, in rather similar fashion to Galton’s Quincunx.

    The Bayesian bootstrap is similar operationally to the more standard frequency-based bootstrap. While the more standard bootstrap is presumably not as old as the Ark, it probably is as old as the first graduate student who fudged his answers by using it. It is however usually credited to Bradley Efron, and does produce wide-rangingly useful results. Both bootstraps can be regarded as helpful rough-and-ready devices for scientists who are unwilling or unprepared to develop a detailed theoretical solution to the statistical problem at hand.
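
    For readers who have not met it, here is a minimal sketch of Rubin’s Bayesian bootstrap alongside the standard bootstrap, for the mean of a single sample; the simulated data are purely illustrative.

```python
# A minimal sketch of the Bayesian bootstrap next to the standard bootstrap,
# both applied to the mean of one sample. The data are illustrative.
import numpy as np

rng = np.random.default_rng(4)
y = rng.gamma(shape=2.0, scale=1.5, size=30)
n, B = len(y), 4000

# Standard bootstrap: resample the observations with replacement.
std_draws = np.array([rng.choice(y, size=n, replace=True).mean() for _ in range(B)])

# Bayesian bootstrap: keep the data fixed and draw Dirichlet(1, ..., 1) weights,
# then form the weighted mean; each draw approximates a posterior draw of the mean.
weights = rng.dirichlet(np.ones(n), size=B)
bayes_draws = weights @ y

print(f"standard bootstrap: mean {std_draws.mean():.3f}, sd {std_draws.std():.3f}")
print(f"Bayesian bootstrap: mean {bayes_draws.mean():.3f}, sd {bayes_draws.std():.3f}")
```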

    Don Rubin later co-authored the highly recommendable book Bayesian Data Analysis with Andrew Gelman, John Carlin and Hal Stern.

    Andrew Gelman is Professor of Statistics and Political Science at Columbia University. He has co-authored four other books:

 

Teaching Statistics: A Bag of Tricks,

Data Analysis Using Regression and Multilevel/Hierarchical Models,

Red State, Blue State, Rich State, Poor State: Why Americans Vote the Way They Do,

and he co-edited A Quantitative Tour of the Social Sciences with Jeronimo Cortina.

    Gelman has received the Outstanding Statistical Applications Award from the ASA, the award for best article published in the American Political Science Review, and the COPSS award for outstanding contributions by a statistician under the age of 40. He has even researched arsenic in Bangladesh and radon in your basement.

 
Andrew Gelman
 


In 1985, John S.J. Hsu, nowadays Professor and Head of Statistics and Applied Probability at UCSB, made a remarkable computational discovery while working as a graduate student at the University of Wisconsin-Madison. He found that conditional Laplacian approximations to the marginal posterior density of an unknown scalar parameter θ can be virtually identical to the exact result, to three-decimal-place accuracy even way out in the tails of the exact density. He then proceeded to show that this was true in more general terms.

    Suppose that θ and a vector ξ of further unknown parameters possess posterior density π(θ, ξ). Assume further that the parameters have already been suitably transformed to ensure that the conditional posterior distribution of ξ given θ is roughly multivariate normal for each fixed θ. Then the conditional Laplacian approximation to the marginal posterior density of θ can be computed by reference to the following three steps:

1. For each θ, replace ξ in π(θ, ξ) by the vector which maximises π(θ, ξ) with respect to ξ for that value of θ.

2. For each θ, divide your maximised density by the square root of the determinant of the conditional posterior information matrix R(θ) of ξ given θ.

3. The function of θ thus derived is proportional to your final approximation. So integrate it numerically over all possible values of θ, and divide your function by the value of this integral, thus ensuring that it integrates to unity.
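
    A rough numerical sketch of these three steps is given below for a toy model, y_i ~ N(θ, e^ξ) with flat priors on θ and ξ, chosen because the exact marginal posterior of θ (a Student t) is available for comparison; the data, grid and optimiser settings are my own illustrative choices. In this particular toy case the approximation is exact up to its normalising constant, so the comparison mainly checks the numerics.

```python
# A rough sketch of the three-step conditional Laplacian approximation, assuming
# the toy model y_i ~ N(theta, exp(xi)) with flat priors on theta and xi.
# All numerical settings are illustrative.
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(5)
y = rng.normal(loc=1.0, scale=2.0, size=15)
n = len(y)

def log_post(theta, xi):
    # log pi(theta, xi), up to an additive constant, under flat priors
    return -0.5 * n * xi - 0.5 * np.sum((y - theta) ** 2) * np.exp(-xi)

def laplace_ordinate(theta, eps=1e-4):
    # Step 1: maximise over the nuisance parameter xi for this value of theta.
    xi_hat = optimize.minimize_scalar(lambda xi: -log_post(theta, xi)).x
    # Step 2: divide by the square root of the conditional information
    # (minus the second derivative of log_post in xi, computed numerically).
    d2 = (log_post(theta, xi_hat + eps) - 2 * log_post(theta, xi_hat)
          + log_post(theta, xi_hat - eps)) / eps ** 2
    return np.exp(log_post(theta, xi_hat)) / np.sqrt(-d2)

# Step 3: evaluate on a grid of theta values and normalise to integrate to one.
grid = np.linspace(y.mean() - 4 * y.std(), y.mean() + 4 * y.std(), 400)
vals = np.array([laplace_ordinate(t) for t in grid])
approx_density = vals / np.trapz(vals, grid)

# Exact marginal posterior of theta: Student t on n-1 df, location ybar, scale s/sqrt(n).
exact = stats.t.pdf(grid, df=n - 1, loc=y.mean(), scale=y.std(ddof=1) / np.sqrt(n))
print(f"max absolute discrepancy on the grid: {np.max(np.abs(approx_density - exact)):.5f}")
```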

    Similar approximations are available to the prior predictive density of a vector of observations y, or the posterior predictive density of y given the realizations of a vector z of previous observations. I justified all of these approximations during 1981 via backwards applications of Bayes’ theorem. The results were published in JASA in my 1982 comment on a paper by Lejeune and Faulkenberry about predictive likelihood.

 

John and Serene Hsu, Santa Barbara, California (Christmas 2012)

 

    Following my three month visit to the Lindquist Center of Measurement in Iowa City in the summer of 1984, the approximations described above to marginal posterior densities were published in 1986 by myself and Mel Novick in the Journal of Educational Statistics as part of our ‘Bayesian full rank marginalization’ of two-way contingency tables, which was used to analyse the ‘Marine Corps’ psychometric testing data, and numerically validated by Jim Albert in his wonderful 1988 JASA paper. They were also employed by the now famous geologist and Antarctic explorer Jon Nyquist, then a fellow chess player at the University of Wisconsin, when he was estimating the depth of the Mid-Continent Rift. See Nyquist and Wang [69].

During my visit to Iowa City in 1984, I helped Shin-ichi Mayekawa to develop an empirical Bayesian/EM algorithm approach to factor analysis for his 1985 Ph.D. thesis ‘Bayesian Factor Analysis’, which was supervised by Mel Novick. I don’t know whether this important work ever got published. Shin-ichi is currently a highly accomplished professor in the Graduate School of Decision Science and Technology at the Tokyo Institute of Technology.

    For various generalisations and numerical investigations of conditional Laplacian approximations, see ‘Bayesian Marginal Inference’, which was published in JASA in 1989 by John Hsu, Kam-Wah Tsui and myself, John Hsu’s 1990 University of Wisconsin Ph.D. thesis Bayesian Inference and Marginalization, and sections 5.1B and 5.12A of [15] and the further references reported therein, which include an application, with Christian Ritter, of the ‘Laplacian t-approximation’ to the Ratkowsky MG chemical data set.

    The conditional Laplacian approximation in my 1982 JASA note, and a Kass, Tierney and Kadane rearrangement thereof, were adopted as standard techniques in Astronomy and Astrophysics. After the two astrophysicists A.J. Cooke and Brian Espey visited the University of Edinburgh STATLAB in 1996, they published their paper [70] with Robert Carswell of the Institute of Astronomy, University of Cambridge, which concerned ionizing flux at high redshifts.

    See also Tom Loredo’s Cornell University short course Bayesian Inference in Astronomy and Astrophysics, and the paper by Tom Loredo and D.Q. Lamb in the edited volume Gamma Ray Bursts (Cambridge University Press, 1992). Brian Espey is currently senior lecturer in Physics and Astrophysics at Trinity College Dublin.

    Meanwhile, Luke Tierney and Jay Kadane, and Tierney, Kass, and Kadane, investigated the asymptotic properties of these and related procedures when the sample size n is large, though without demonstrating the remarkable finite-n numerical accuracy which John Hsu and I illustrated in our joint papers. This was partly because their computations were computer-package orientated while ours referred to specially devised routines, and partly because we always employed a preliminary approximately normalising transformation of the nuisance parameters. In some cases Kass, Tierney and Kadane were able to prove saddle-point accuracy of the conditional Laplacian approximations. However, saddle-point accuracy does not necessarily imply good finite-n accuracy. In one of the numerical examples reported in our 1989 JASA paper, an over-ambitious saddle-point approximation proposed by Tierney, Kass, and Kadane is way off target.

    In their widely cited 1993 JASA paper, Norman Breslow and David Clayton used Laplacian approximations to calculate very precise inferences for the parameters in generalised linear mixed models.

    During the 1990s conditional Laplacian approximations were, nevertheless, to become less popular than MCMC, which employed complicated simulations from conditional posterior distributions when attempting, sometimes in vain, to achieve the exact Bayesian solution. However, during the early 21st century, the enthusiasm for the time-consuming MCMC began to wear off. For example, the INLA package developed in 2008 by Håvard Rue of the Norwegian University of Science and Technology in Trondheim and his associates (see Martino and Rue [71] for a manual for the INLA program) implements integrated nested Laplacian approximations for hierarchical models, with some emphasis on spatial processes, and uses posterior predictive probabilities and cross-validation to check the adequacy of the proposed models. Rue et al are, for example, able to compute approximate Bayesian solutions for the log-Gaussian doubly stochastic process. The INLA package offers all sorts of attractive possibilities (see my Ch. 7 for more details).

 

The celebrated Bayesian psychometrician Melvin R. Novick, of Lord and Novick fame, died of a second major heart attack at age 54 while he was visiting ETS in Princeton, New Jersey during May 1986. His wife Naomi was at his side. He had experienced a debilitating heart attack in Iowa City in early 1982 while I was on the way to visit him and Bob Hogg at the University of Iowa to present a seminar on the role of coherence in statistical modeling, and I was devastated to hear about his sudden heart attack from Bob, who was thoroughly distraught, on my arrival.

    In 2012, I was contacted from New York by Charlie Lewis, a Bayesian buddy of Mel’s at the American College Testing Program in Iowa City way back in June 1972. At that time, I’d advised Charlie, in somewhat dramatic circumstances, about Bayesian marginalization in a binomial exchangeability model, and this enabled him to co-author the paper ‘Marginal Distributions for the Estimation of Proportions in M Groups’ with Mel Novick and Ming-Mei Wang in Psychometrika (1975) which earned him his tenure at the University of Illinois. While the co-authors briefly cited my A.C.T. technical report on the same topic, they did not acknowledge my advice or offer me a co-authorship. I was paid $400 plus expenses for my work at A.C.T. that summer on a pre-doctoral fellowship.

    Beware the ‘We cited your technical report’ trick, folk.  It happens all too often in academia, and sometimes leads to misunderstandings and even to the side-lining of the person with the creative ideas in favour of authors in need of a meal ticket. It happened to me on another occasion, when a gigantic practically-minded statistician desperate for a publication wandered into my office, grabbed a technical report from my desk, and walked off with it.

    Charlie is nowadays a presidential representative to ETS regarding the fairness and validity of educational testing in the US. After an exchange of pleasantries, forty years on, he came close to agreeing with me that

 
‘Bayesians never die, but their data analyses go on before them’.
 
Charlie Lewis receiving a Graduate Professor of the Year Award at Fordham University
 

THE AXIOMS OF SUBJECTIVE PROBABILITY: In 1986, Peter Fishburn, a mathematician working at Bell Labs in Murray Hill, New Jersey, reviewed the various ‘Axioms of Subjective Probability’ in a splendid article in Statistical Science. These have been regarded by many Bayesians as justifying their inferences, even to the extent that a scientific worker can be regarded as incoherent if his statistical methodology is not completely equivalent to a Bayesian procedure based on a proper (countably additive) prior distribution, i.e. if he does not act like a Bayesian (with a proper prior).

      This highly normative philosophy has given rise to numerous practical and real-life implications, for instance,

(1) Many empirical and applied Bayesians have been ridiculed and even ‘cast out’ for being at apparent variance with the official paradigm.

(2) Bayesians have at times risked having their papers rejected if their methodology is not perceived as being sufficiently coherent, even if their theory made sense in applied terms.

(3) Careers have been put at risk, e.g. in groups or departments which pride themselves on being completely Bayesian.

(4) Some Bayesians have wished to remain controlled by the concept of coherence even in situations where the choice of sampling model is open to serious debate.

(5) Many Bayesians still use Bayes factors e.g. as measures of evidence or for model comparison, simply because they are ‘coherent’, even in situations where they defy common sense (unless of course a Bayesian p-value of the type recently proposed by Baskurt and Evans is also calculated).

    Before 1986, the literature regarding the Axioms of Subjective Probability was quite confusing, and some of it was downright incorrect. When put to the question, some ‘coherent’ Bayesians slid from one axiom system to another, and others expressed confusion or could only remember the simple axioms within any particular system while conveniently forgetting the more complicated ones. However, Peter Fishburn describes the axiom systems in beautifully erudite terms, thereby enabling a frank and honest discussion.

    Fishburn refers to a binary relation for any two events A and B which are expressible as subsets A and B of the parameter space Θ. In other words, you are required to say whether your parameter θ is more likely (Fishburn uses the slightly misleading term ‘more probable’) to belong to A than to B, or whether the reverse is true, or whether you judge the two events to be equally likely. Most of the axiom systems assume that you can do this for any pair of events A and B in Θ. This is in itself a very strong assumption. For example, if Θ is finite and contains just 30 elements then there are over a billion subsets, and of the order of a billion billion pairwise comparisons to consider.

    The question then arises as to whether there exists a subjective probability distribution p defined on events in Θ which is in agreement with all your binary relations. You would certainly require your probabilities to satisfy p(A) > p(B) whenever you think that θ is more likely to belong to A than to B. It will turn out that some further, extremely strong, assumptions are needed to ensure the existence of such a probability distribution.

    Note that when the parameter space contains a finite number k of elements, a probability distribution must, by definition, assign non-negative probabilities summing to unity to these k different elements. In 1931, De Finetti attempted to justify representing your subjective information and beliefs by a probability distribution by assuming that your binary relations satisfy a reasonably simple set of ‘coherency’ axioms.

    However, in 1959, Charles Kraft, John Pratt, and the distinguished Berkeley algebraist Abraham Seidenberg proved that De Finetti’s ‘Axioms of Coherence’ were insufficient to guarantee the existence of a probability distribution on Θ that was in agreement with your binary relations, and they instead assumed that the latter should be taken to satisfy a strong additivity property that is horribly complicated.

    Indeed, the strong additivity property requires you to contrast any m events with any other m events, and to be able to do this for all m = 2, 3, …. In Ch. 4 of his 1970 book Utility Theory for Decision Making, Fishburn proved that if your binary relations are non-trivial then strong additivity is a necessary and sufficient condition for the existence of an appropriate probability distribution.

    The strong additivity property is valuable in the sense that it provides a mathematically rigorous axiomatization of subjective probability when the parameter space is finite, in which the axioms do not themselves refer to probability. However, in intuitive terms, it would surely be much simpler to replace it by the more readily comprehensible assumption:

 

Axiom T: You can represent your prior information and beliefs about θ by assigning non-negative relative weights (to subsequently be regarded as probabilities), summing to unity, to the k elements or ‘outcomes’ in Θ, in the knowledge that you can then calculate your relative weight of any event A by summing the relative weights of the corresponding outcomes in Θ.

    I do not see why we need to contort our minds in order to justify using Axiom T. You should just be able to look at it and decide whether you are prepared to use it or not. When viewed from an inductive scientific perspective, the strong additivity property just provides us with highly elaborate window dressing. If you don’t agree with even a tiny part of it, then it doesn’t put you under any compulsion to concur with Axiom T. The so-called concept of coherence is pie in the sky! You either want to be a Bayesian (or act like one) or you don’t.

     When the parameter space Θ is infinite, probabilities can, for measure-theoretic reasons, only be defined on subsets of Θ, known as events, which belong to a σ-algebra, e.g. an infinite Boolean algebra of subsets of Θ. Any probability distribution which assigns probabilities to members of this σ-algebra must, by definition, satisfy Kolmogorov’s ‘countable additivity property’; in other words, your probabilities should add up sensibly when you are evaluating the probability of the union of any finite or infinite sequence of disjoint events. Most axiom systems require you to state the binary relationship between any two of the multitudinous events in the σ-algebra, after considering which of them is more likely.

    Bruno De Finetti and Jimmie Savage proposed various quite complicated axiom systems while attempting to guarantee the existence of a probability distribution that would be in agreement with your infinity of infinities of binary relations. However, in his 1964 paper on qualitative probability σ-algebras, in the Annals of Mathematical Statistics, the brilliant Uruguayan mathematician Cesareo Villegas proved that a rather strong property is needed, in addition to De Finetti’s four basic axioms, in order to ensure countable additivity of your probability distribution, and he called this the monotone continuity property.

    Trivialities can be more essential than generalities, as my friend Thomas the Tank-Engine once said.

    Unfortunately, monotone continuity would appear to be more complicated in mathematical terms than Kolmogorov’s countable additivity axiom, and it would seem somewhat easier in scientific terms to simply decide whether you want to represent your subjective information or beliefs by a countably-additive probability distribution, or not. If monotone continuity is supposed to be a part of coherence, then this is the sort of coherence that is likely to glue your brain cells together.  Any blame associated with this phenomenon falls on the ways the mathematics was later misinterpreted, and we are all very indebted to Cesareo Villegas for his wonderful conclusions.

    Your backwards inductive thought processes might well suggest that most other sets of Axioms of Subjective Probability which are pulled out from under a stone would by necessity include assumptions which are similar in strength to strong additivity and monotone continuity, since they would otherwise not be able to imply the existence of a countably additive probability distribution on a continuous parameter space which is in agreement with your subjectively derived binary relations. See, for example, my discussion in [8] of the five axioms described by De Groot on pages 71-76 of Optimal Statistical Decisions. Other axiom systems are brought into play by the discussants of Peter Fishburn’s remarkable paper.

    In summary, the Axioms of Subjective Probability are NOT essential ingredients of the Bayesian paradigm. They’re a torrent, rather than a mere sprinkling, of proverbial Holy Water. If we put them safely to bed, then we can broaden our wonderfully successful paradigm in order to give it even more credibility in Science, Medicine, Socio-Economics, and wherever people can benefit from it.

    Meanwhile, some of the more diehard Bayesian ‘High Priests’ have been living in ‘airy fairy’ land. They have been using the axioms in their apparent attempts to control our discipline from a narrow-minded powerbase, and we should all now make a determined effort to break free from these Romanesque constraints in our search for scientific truth and reasonably evidence-based medical and social conclusions which will benefit Society and ordinary people.

    Please see Author’s Notes (below) for an alternative view on these key issues, which has been kindly contributed by Angelika van der Linde, who has recently retired from the University of Bremen.

    Cesareo Villegas (1921-2001) worked for much of his career at the Institute of Mathematics in Montevideo before moving to Simon Fraser University, where he became Emeritus Professor of Statistics after his retirement. He published eight theoretical papers in the IMS’s Annals journals, and three in JASA. He is well-known for his development of priors satisfying certain invariance properties, was actively involved with Jose Bernardo in creating the Bayesian Valencia conferences, and seems to have been an unsung hero.

    Bruno de Finetti’s contributions to the Axioms of Subjective Probability were far surpassed by his development of the key concept of ‘exchangeability’, both in terms of mathematical rigor and the credibility of its subsequent interpretation, and it is for the latter that our Italian maestro should be longest remembered.

 
Bruno de Finetti
 
Also in 1986, Michael Goldstein published an important paper on ‘Exchangeable Prior Structures’ in JASA, where he argued that expectation (or prevision) should be the fundamental quantification of individuals’ statements of uncertainty and that inner products (or belief structures) should be the fundamental organizing structure for the collection of such statements. That’s an interesting possibility. I’m sure that Michael enjoys chewing the rag with Phil Dawid.

    Maybe Michael should be known as ’Mr. Linear Bayes U.K.’. He and his co-authors, including his erstwhile Ph.D. supervisor Adrian Smith, have followed in the footsteps of Kalman and Bucy by publishing many wonderful papers on linear Bayes estimation. See Michael’s overview in the Encyclopedia of Statistical Sciences, and his book Bayes Linear Statistics with David Wooff.
 
Michael Goldstein
 

    The 1987 monograph Differential Geometry in Statistical Inference by Amari, Barndorff-Nielsen, Kass, Lauritzen and Rao is of fundamental importance in Mathematical Statistics. In his 1989 article  ‘The Geometry of Asymptotic Inference’ in Statistical Science, Rob Kass provides readers with a deep understanding of the ideas of Sir Ronald Fisher and Sir Harold Jeffreys as they relate to Fisher’s expected information matrix.

    Rob Kass is one of the most frequently cited mathematicians in academia, and he is one of Jay Kadane’s proudly Bayesian colleagues at Carnegie-Mellon University.

    Kathryn Chaloner, more recently Professor of Statistics, Biostatistics, and Actuarial Science at the University of Iowa, also made some important contributions during the 1980s. These include her optimal Bayesian experimental designs for the linear model and non-linear models with Kinley Larntz, and her Bayesian estimation of the variance components in the unbalanced one-way model. More recently, she and Chao-Yin Chen applied their Bayesian stopping rule for a single arm study to a case study of stem cell transplantation. Then she and several co-authors developed a Bayesian analysis for doubly censored data that used a hierarchical ‘Cox’ model, and this was published in Statistics in Medicine.

 
Kathryn Chaloner
 

    Kathryn was elected Fellow of the American Association for the Advancement of Science in 2003. After graduating from Somerville College Oxford, she’d obtained her Master’s degree from University College London in 1976 and her Ph.D. from Carnegie-Mellon in 1982.
 

John Van Ryzin, a pioneer of the empirical Bayes approach, died heroically from AIDS in March 1987 at the age of 51, after continuing to work on his academic endeavours until a few days before his death. He was the co-author, with Herbert Robbins, of An Introduction to Statistics, an authority on medical statistics, survival analysis, and the adverse effects of radiation and toxicity, and a member of the science review panel of the Environmental Protection Agency. A fine family man, he was also a most impressive seminar speaker.

    In 1987, Gregory Reinsel and George Tiao employed random effects regression/time series models to investigate trends and possible holes in the stratospheric ozone layer at a time when it was being seriously affected by polluting chemicals. Their approach, which they reported in an outstanding paper in JASA, is effectively hierarchical Bayesian.

 
 
Greg Reinsel (1948-2005) and George Tiao
 

    In 1988, Jim Albert, previously one of Jim Berger’s Ph.D. students at Purdue, reported some computational methods in JASA, based upon Laplacian approximations to posterior moments, for generalised linear models where the parameters are taken to possess a hierarchical prior distribution. He, for example, analysed a binomial/beta exchangeable logit model, which is a special case of the Leonard-Novick formulation in our 1986 paper in the Journal of Educational Statistics and provided an alternative to my exchangeable prior analysis for binomial logits in my 1972 Biometrika paper.

    Jim has also published a series of papers that assume Dirichlet prior distributions, and mixtures thereof, for the cell probabilities in r×s contingency tables, thereby very soundly extending the work of I.J. Good.

 
 

THE AXIOMS OF UTILITY: In 1989, Peter Wakker, a leading Dutch Bayesian economist on the faculty of Erasmus University in Rotterdam, published his magnum opus Additive Representations of Preferences: A New Foundation of Decision Analysis. Peter, for example, analysed ‘Choquet expected utility’ and other models using a general trade-off technique for analysing cardinal utility, based on a continuum of outcomes. In his later book Prospect Theory for Risk and Ambiguity he analyses Tversky and Kahneman’s cumulative prospect theory as yet another valid alternative to Savage’s expected utility. In 2013, Peter was awarded the prestigious Frank P. Ramsey Medal by the INFORMS Decision Analysis Society for his high quality endeavours.

    Peter has recently advised me (personal communication) that modifications to Savage’s expected utility which put positive premiums on the positive components of a random monetary reward which are regarded as certain to occur, and negative premiums on those negative components which are thought to be certain to occur, are currently regarded as the state of the art e.g. as it relates to Portfolio Analysis. For an easy introduction to these ideas see Ch. 4 of my 1999 book [15].

    In an important special case, which my Statistics 775: Bayesian Decision and Control students, including the Catalonian statistician Josep Ginebra Molins and the economist Jean Deichmann, validated by a small empirical study in Madison, Wisconsin during the 1980s, a positive premium ε can be shown to satisfy ε = 2φ − 1 whenever a betting probability φ, which should be elicited from the investor, exceeds 0.5. For example, an elicited betting probability of φ = 0.75 corresponds to a premium of ε = 0.5.

    The preceding easy-to-understand approach, which contrasts with the ramifications of Prospect theory, was axiomatized in 2007 by Alain Chateauneuf, Jürgen Eichberger and Simon Grant in their article ‘Choice under uncertainty with the best and the worst in mind: New additive capacities’ which appeared in the Journal of Economic Theory.

    While enormously complex, the new axiom system is nevertheless the most theoretically convincing alternative that I know of to the highly normative Savage axioms. Their seven mathematically expressed axioms, which refer to a preference relation on the space of monetary returns, may be referred to under the following headings:

Axiom 0: Non-trivial preferences

Axiom 1: Ordering (i.e. the preference relation is complete, reflexive and transitive)

Axiom 2: Continuity

Axiom 3: Eventwise monotonicity

Axiom 4: Binary comonotonic independence

Axiom 5: Extreme events sensitivity

Axiom 6: Null event consistency

    If you are able to understand all these axioms and wish to comply with them, then that puts you under some sort of obligation to concur with either the Expected Utility Hypothesis, or the simple modification suggested in Ch.4 of [15] or obvious extensions of this idea. Alternatively, you could just replace expected utility by whatever modification or alternative best suits your practical situation. Note that, if your alternative criterion is sensibly formulated, then it might, at least in principle, be possible to devise a similarly complex axiom system that justifies it, if you really wanted to. In many such cases, the cart has been put before the proverbial horse. It’s rather like a highly complex spiritual biblical prophecy being formulated after the event to be prophesied has actually occurred. Maybe the Isaiahs of decision theoretic axiomatics should follow in the footsteps of Maurice Allais and focus a bit more on empirical validations, and the scientific and socio-economic implications of the methodology which they are striving to self-justify by these somewhat over-left-brained, potentially Nobel prize-winning theoretical discourses.

In their 1989 paper in the Journal of Econometrics, the Dutch econometrician Mark Steel and Jean-François Richard proposed a Bayesian analysis of seemingly unrelated regression equations that used a recursive extended conjugate prior density. Mark Steel published several papers on related topics during the next few years, e.g. on robust Bayesian and marginal Bayesian inferences in skewed sampling distributions and elliptical regression models. Mark’s research ideas are extremely imaginative and he has continued to publish them until this very day.

 

Mark Steel

 

    Mark Steel is currently Professor of Statistics at the University of Warwick and also holds a Chair of Excellence at the Carlos III University of Madrid. His interests have now moved away from Econometrics, and into mainstream theoretical and applied Bayesian Statistics. Mark was awarded his Ph.D. by the Catholic University of Louvain in 1987. His Ph.D. topic was A Bayesian Analysis of Exogeneity: A Monte Carlo Approach.

    Susie M. J. Bayarri and Morris De Groot co-authored thirteen joint papers between 1987 and 1993, including ‘A Bayesian view of selection models’, ‘Gaining weight: A Bayesian approach’, and ‘What Bayesians expect of each other’. This was doubtlessly one of the most prolific Bayesian co-authorships of all time, and it helped Susie to become one of the leading Bayesians in Spain and one of the most successful women Bayesians in history.

 
Bayesettes: Alicia Carriquiry, Susie Bayarri and Jennifer Hill
(Archived from Brad Carlin's Collection)
 

    The leading Brazilian statistician Carlos Pereira published his Associate Professor thesis in 1985, on the topic of the Bayesian and classical interpretations of multi-dimensional hypothesis testing. He has published over 200 papers in scientific journals, many of them Bayesian, with important applications in a number of disciplines, including medicine and genetics. Carlos has also supervised 21 Ph.D. theses and 23 Masters dissertations, and encouraged numerous other Bayesian researchers. His book The Elements of Bayesian Inference was co-authored with Marlos Viana and published in Portuguese in 1982. Other notable Brazilian Bayesians include Carlos’s brother Basilio Pereira, Gustavo Gilardoni, Heleno Bolfarine, Helio Migon, Dani Gamerman, Jorge Achcar, Alexandra Schmidt, Hedibert Lopes and Vinicius Mayrink.

    The paper ‘Aspects of reparametrization in approximate Bayesian Inference’ by Jorge Achcar and Adrian Smith in Bayesian and Likelihood Methods in Statistics and Econometrics (1990) is particularly impressive.

 
 

Carlos Pereira

 

Vinicius Mayrink

 
    Carlos Pereira’s contributions during and beyond the 1980s epitomise the success of the Bayesian approach in evolving into a paradigm that is socially and practically relevant in the way it addresses data sets in the context of their social, medical, and scientific backgrounds, sometimes at a grass roots level, and draws logical conclusions that would be unavailable from within the Fisherian and likelihood paradigms.
 
 

Author’s Notes.  In Exercise 6.1.a on pages 248-249 of [15] I outline the asymptotic properties of Box’s model-checking criterion [67]. However, my equations (6.1.14) and (6.1.15) should be corrected to include extra terms involving the logs of the determinants of the observed likelihood information matrix and Fisher’s information matrix. In 1980, I’d made the mistake of equating the observed information matrix with its random counterpart. I noticed this error, over thirty-three years later, during the preparation of this article. In 1980, George’s negative reaction to my asymptotics motivated me to only submit a very tame written contribution to the discussion of his article. George advised me afterwards that he hadn’t responded to it because ‘it didn’t say anything’, and I totally agree.

    George, and his experiences with the military, inspired the completely fictional interplanetary character Professor Brad Redfoot in my self-published novel Grand Schemes on Qinsatorix. While George’s statistical research at M.R.C. was generously funded by the U.S. Army, he stoically refused to get involved in military applications (to do so was against Wisconsin State Law). On one occasion in California, the generals invited him to board a helicopter to review the troops. George reverted to the Anglo-Saxon vernacular and his reply is unprintable.

    On another occasion, the generals persisted in interrupting a Statistics conference in the Naval Postgraduate School in Monterey with their own lengthy party pieces. When one of the stentorian interlopers declared ‘A small amount of brutishness is worth lots of pity’, Bradley Efron (who was there to talk about the Bootstrap) expressed his indignation and George took us for a walk along the beach.

 
Monterey, California
 

During my preparation of Ch. 5, Angelika van der Linde kindly sent me the following (now slightly edited) alternative viewpoint regarding axiom systems for subjective probability:

    Although I agree that no (Bayesian) statistical approach should be dogmatic or normative, I consider efforts to fix our principles and inferences to be valuable. As a trained mathematician, I am interested in ‘economy of thinking’ which, in the long run, can be summarised in (complete or alternative) axiom systems. Also, I appreciate the type of rationality that statisticians try to make explicit, since criticism then gets easier to handle. In particular, the extent to which Bayesian Statistics may be considered as an extension of frequentist statistics may be, or has already been, considered in this way.

    Rob Kass’s geometric interpretations, which you mention, are very positive in this respect. From my point of view, it is unfortunate that statistics is hardly accepted as a part of mathematics any more. To me, theory is induced by applications, but requires mathematical efforts in its own terms as well. I, for example, do think that more ‘statistical geometry’, for instance differential geometry, or a ‘topology of distributions’ using information theory, could underpin a much more general theory of utility and decisions in Statistics.

    There have indeed been many attempts to develop a comprehensive theory. For example, computer scientists have tried using the principle of minimum description length. But most of these explorations/branches are not as yet related.

    Thank you for your valuable insights, Angelika. Maybe readers would like to focus the full force of their inductive thought processes (i.e. lateral thinking) on these important issues.


 


 
