
Tuesday 6 April 2021

BAYESIAN HISTORY: 6. TO THE STARS IN THE NINETIES

 


6. TO THE STARS IN THE NINETIES

 

The human race will go to the stars (James A. Koutsky, 1940-1994)

 

James A. Koutsky, Professor of Chemical Engineering, University of Wisconsin-Madison

 

In their highly influential JASA 1990 paper, Alan Gelfand and Adrian Smith projected the Bayesian paradigm towards the stars when they recommended Markov Chain Monte Carlo (MCMC) simulations as a way of computing Bayesian estimates and inferences for the parameters in a wide range of complicated sampling models, in situations where it was well-nigh impossible to achieve a solution using ordinary Monte Carlo or Importance Sampling techniques. Alan and Adrian are to be congratulated, throughout the ages, for their wonderful insights.

    The MCMC procedure recommended by Gelfand and Smith is a special case of the Metropolis-Hastings algorithm, which was first introduced in the Journal of Chemical Physics in 1953 with the objective of facilitating calculations by fast computing machines. The algorithm was generalised by the Canadian mathematician W. Keith Hastings, who introduced it into the Statistics literature in 1970 in a widely cited paper in Biometrika. Acceptance sampling is another important simulation technique which can be employed in some situations where MCMC is difficult to implement. See the source papers by Nicholas Metropolis, Arianna and Marshall Rosenbluth, and Augusta Teller and Edward Teller [76], and W. Keith Hastings [77], and the easy-to-read article by Siddhartha Chib and Edward Greenberg in The American Statistician (1995).

    MCMC is similar to the Gibbs Sampling approach developed by Stuart and Donald Geman in their 1984 paper ‘Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images’ in IEEE Transactions on Pattern Analysis and Machine Intelligence. The authors reported the first ever proof of the convergence of the simulated annealing algorithm in this, the highest cited paper in Engineering.

    Stuart Geman is James Manning Professor of Applied Mathematics at Brown, and Donald Geman is an internationally eminent professor at Johns Hopkins.


W. Keith Hastings was awarded his Ph.D. at the University of Toronto in 1962. His thesis topic was Invariant Fiducial Distributions, and his Ph.D. supervisors were Don Fraser and Geoffrey Watson.

    Hastings wrote his seminal 1970 paper while an Associate Professor of Mathematics at UT, and after giving consulting advice to John Valleau, a UT Professor of Chemistry.

    Hastings received his tenure from the University of Victoria in British Columbia in 1974. Now retired, he is still living in Victoria. It is unclear to me whether Keith has been made aware of the full impact of his work upon Bayesian inference.


One of the consequences of Gelfand and Smith’s (albeit somewhat politically charged) recommendation was that Bayesian techniques could now potentially be used to analyse a large number of detailed models whose justification was based upon specialised scientific or economic theory, including many of the complex theoretical models that have been devised by experts in many other subjects. More recently, MCMC has also been used to calculate the Deviance Information Criterion DIC for model comparison, as an alternative to the occasionally prohibitively complicated maximum likelihood procedures which are needed to calculate Akaike’s criterion AIC.

    While the Bayesian MCMC ‘cult’ has proved to some to be somewhat all-constraining, MCMC and its extensions have enormously benefited the Bayesian paradigm in all sorts of ways. Areas of application which have benefited from MCMC include medical diagnosis, ecology, geology, computer science, artificial intelligence, machine learning, genetics, astrophysics, archaeology, psychometrics, educational performance, and sports modeling. During the 1990s, many Bayesians focussed their energies on applying MCMC to medical and pharmaceutical data sets, and others to various models in Econometrics.

    Suppose that two vectors of parameters θ and ξ possess a joint posterior density (given the realizations of the observations from a specified sampling model) which is proportional to h(θ, ξ) as a function of θ and ξ. Then the MCMC simulations may, if technically possible, be completed as follows:

1. Simulate a θ vector from the conditional distribution of θ given the latest simulated vector for ξ.

2. Simulate a ξ vector from the conditional distribution of ξ given the latest simulated vector for θ.

3. Keep cycling between (1) and (2) until you have achieved enough simulations of (θ, ξ) for the purposes that you wish to put them to.

    The conditional distributions in (1) and (2) need to belong to simple families of distributions from which these simulations are possible. To implement this procedure you will need to specify an initial vector for ξ. Then discard, say, the first thousand of your simulations for (θ, ξ) as a ‘burn-in’ period, in order to minimise the dependency of your subsequent simulated vectors upon your initial vector for ξ. Finally, pretend that all further simulations for (θ, ξ) are ordinary Monte Carlo simulations, and complete your calculations accordingly.
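    For concreteness, here is a minimal sketch of the two-step sampler just described, written in Python for a deliberately simple toy model (normal data with unknown mean θ and precision ξ, under vague but proper priors) rather than for any of Gelfand and Smith’s actual examples. All of the data values, hyperparameters and function names below are illustrative assumptions of mine.

```python
# A minimal two-block Gibbs sampler, as a hedged illustration of steps (1)-(3).
# Toy model: y_1,...,y_n ~ N(theta, 1/xi), theta ~ N(m0, 1/t0), xi ~ Gamma(a0, b0).
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(2.0, 1.5, size=50)           # purely illustrative data
n, ybar = len(y), y.mean()
m0, t0, a0, b0 = 0.0, 0.01, 0.01, 0.01      # vague but proper hyperparameters

def gibbs(n_iter=11000, burn_in=1000, xi_init=1.0):
    theta_draws, xi_draws = [], []
    xi = xi_init                             # the initial value for xi, as in the text
    for it in range(n_iter):
        # Step 1: simulate theta from its conditional given the latest xi (a normal)
        prec = t0 + n * xi
        theta = rng.normal((t0 * m0 + n * xi * ybar) / prec, 1.0 / np.sqrt(prec))
        # Step 2: simulate xi from its conditional given the latest theta (a gamma)
        xi = rng.gamma(a0 + 0.5 * n, 1.0 / (b0 + 0.5 * np.sum((y - theta) ** 2)))
        # Step 3: keep cycling, discarding the burn-in simulations
        if it >= burn_in:
            theta_draws.append(theta)
            xi_draws.append(xi)
    return np.array(theta_draws), np.array(xi_draws)

theta_draws, xi_draws = gibbs()
# Treat the retained draws like ordinary Monte Carlo simulations:
print(theta_draws.mean(), np.quantile(theta_draws, [0.025, 0.975]))
```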

    The posterior expectation of any bounded function g(θ, ξ) of θ and ξ may be computed by averaging g(θ, ξ) with respect to a sufficiently large number of your further simulations for (θ, ξ). The marginal posterior density of any scalar component λ of (θ, ξ) may be computed, often quite efficiently, as follows:

1. For each fixed value of λ, average h(θ, ξ), with λ held at that value, with respect to your further simulations for the remaining elements of (θ, ξ). This computes an ‘unnormalised marginal posterior density’ of λ.

2. Use another numerical routine to integrate your unnormalised density across all possible values of λ. Then the normalised density may be computed by dividing the unnormalised density by this integral. If the parameter λ is unbounded then its posterior moments, if they exist, may be computed by reference to this density and the appropriate numerical integrations.
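    Continuing the toy sketch above (and reusing y, n, ybar, the hyperparameters and the simulated xi_draws from it), one common way of carrying out this kind of density averaging is the conditional-density-averaging variant often called ‘Rao-Blackwellisation’, in which the full conditional density of λ given each simulated draw of the remaining parameters is averaged over those draws; a grid-based normalisation, as in step (2), then simply guards against truncation of the grid. The code below is a hedged sketch of that variant, with λ taken to be θ.

```python
# Marginal posterior density of lambda = theta by conditional-density averaging,
# in the spirit of steps (1) and (2) above.  Reuses y, n, ybar, m0, t0 and
# xi_draws from the Gibbs sketch earlier in this section.
import numpy as np
from scipy.stats import norm

theta_grid = np.linspace(1.0, 3.0, 201)
cond_means = (t0 * m0 + n * xi_draws * ybar) / (t0 + n * xi_draws)
cond_sds = 1.0 / np.sqrt(t0 + n * xi_draws)

# Step 1: for each grid value of theta, average the conditional density over the draws
unnorm = np.array([np.mean(norm.pdf(t, cond_means, cond_sds)) for t in theta_grid])

# Step 2: normalise numerically over the grid (guards against grid truncation)
density = unnorm / (unnorm.sum() * (theta_grid[1] - theta_grid[0]))
```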

    While ‘ultimate convergence’ of these procedures is theoretically guaranteed, it is typically quite difficult to investigate whether the standard errors of simulation are finite. Indeed MCMC can take a prohibitively long time to converge, and sometimes never appears to do so. It may also be difficult to check whether the procedure has converged or not, since it can appear to almost converge before diverging all over the place. In such circumstances it is best to let your simulations run for a while longer to see whether they settle down again. Any lack of convergence could be exaggerated by rounding errors in your calculations.
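    As one hedged illustration of how such checking might be attempted in practice, the toy Gibbs sampler sketched above can be run from several different starting values and the chains compared via a crude between-versus-within variance ratio (in the spirit of the Gelman-Rubin diagnostic); the particular starting values below are arbitrary.

```python
# A crude multi-chain convergence check for the toy Gibbs sampler sketched above.
import numpy as np

chains = [gibbs(n_iter=6000, burn_in=1000, xi_init=x0)[0] for x0 in (0.01, 1.0, 100.0)]
m = len(chains[0])
W = np.mean([c.var(ddof=1) for c in chains])          # average within-chain variance
B = m * np.var([c.mean() for c in chains], ddof=1)    # between-chain variance
R_hat = np.sqrt(((m - 1) / m * W + B / m) / W)        # close to 1 once the chains have mixed
print([c.mean() for c in chains], R_hat)
```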

    In its more general form, MCMC addresses k subvectors of the model parameters, and refers to successive simulations from the conditional posterior distributions of the individual subvectors, given the remaining k-1 subvectors. The procedure converges best in models which have been parsimoniously parametrized and shown to fit the data well via a preliminary data analysis. For poorly fitting models, or models with too many parameters, you may well find yourself iterating until the cows come home, and you might wish to go to the beach to take a break.

    This simple account is based on my experiences and those of my research associate Orestis Papasouliotis when he was analysing multivariate hierarchical ANCOVA models and the Scottish sex offender data. His methodology was reported in 2000 in the first four chapters of Orestis’ University of Edinburgh Ph.D. thesis.

    My account is different in several respects from the somewhat tentative suggestions made by Gelfand and Smith in their seminal 1990 paper. Following that monumental leap for mankind, at least one handsome Bayesian was reportedly occasionally given to blinking rather more than usual while he was confirming the convergence of his simulations! However, the MCMC methodology has been considerably refined and generalised over the years. See, for example, the excellent books authored by the Brazilian duo Dani Gamerman and Hedibert Lopes, and by Wally Gilks, Sylvia Richardson and David Spiegelhalter.

    Arnold Frigessi, Fabio Martinelli and Julian Stander applied these ideas to Markov random fields in their 1997 paper in Biometrika. Julian Stander and Yuzhi Cai’s 2008 paper about quantile self-exciting threshold autoregressive models is also worth a read. Just grab the right issue of the Journal of Time Series Analysis from your bookcase.

 

Julian Stander

 
    Julian Stander leads a small Bayesian group in my hometown of Plymouth, Devon, where he is a Reader in Statistics. I was born in Flete House, Yealmpton, near the estuary of the Erme, in 1948, shortly after Adrian Smith was born in nearby Dawlish, on the estuary of the Exe, where the waves lash against the windows while the train chugs around the coast. Adrian was raised in the delightful seaside village of Teignmouth. Julian’s contributions add colour to the Bayesian Devonian tradition, and his applied Bayesian colleague David Wright, who lives in Ivybridge, has previously assisted both him and me with our research endeavours. Dennis Lindley lives just over the border, in Somerset, where the cider is pronounced ‘zider’ but isn’t as strong as the Buckfastleigh scrumpy in Devonshire. Folk from the West Country are sometimes known as ‘Janners’.
 
The coast train to Plymouth at Dawlish, Devon
 
[Author's Note: On the fifth of February 2014, part of the seawall at Dawlish was destroyed during a storm, and the trains to Plymouth were cancelled until further notice. Maybe the Gods of Synchronicity have indeed been reading my manuscript.]
 

    Before using MCMC you should always check that your problem cannot be solved analytically or by numerical integration, or sufficiently well approximated by a more readily computable technique that might involve a multivariate normal or a conditional Laplacian approximation. Nowadays, you don’t always need to include MCMC in your paper to get it accepted. Please remember that while MCMC can be used to calculate theoretical solutions, it can’t compensate for any limitations in your statistical data.



During the 1990s, a broad spectrum of fascinating practical applications and fresh theoretical innovations came continuously into view. Their diversity created an impressively complex kaleidoscope through which to view the success of the Bayesian paradigm.

In their 1991 article in the Canadian Journal of Statistics, Joe Gastwirth, Wesley Johnson, and Dana Reneau described an excellent Bayesian analysis of some AIDS/HIV data, and in their 1994 paper in Biometrics, Scott Zeger and Peter Diggle reported a more complicated empirical hierarchical Bayes approach to another AIDS/HIV problem. The collection and analysis of AIDS/HIV data was co-ordinated during the 1980s by Steve Lagakos of Harvard University, and he continued to do so until his tragic death in a car accident in 2010. Steve believed that one of the key roles of statisticians was to emphasise what you COULDN’T conclude from haphazardly collected AIDS/HIV data. Bayesians should always beware the possibly profound influences of confounding variables, influences which could, for example, be concealed by the complexities of an apparently all-encompassing hierarchical model. For instance, it is now well-known in the at-risk community that, while usually highly beneficial, anti-viral drugs such as AZT can cause a number of serious physical conditions which were previously thought to have been caused by AIDS itself. When constructing your sampling model it will sometimes be important to take this phenomenon into account.

    Scott Zeger is Professor of Biostatistics at Johns Hopkins University in Baltimore. More recently his work has been on Bayesian models for the etiology of children’s pneumonia, and on methods for personalised medicine, which at Johns Hopkins they call ‘personalised health’.

   

    Michael Lavine wrote two single-authored papers in JASA in 1991, on the sensitivity of Bayesian inference, and on Bayesian robustness. Mark Berliner published a paper in JASA on likelihood and Bayesian inferences in chaotic systems.

 

In 1992, Nhu Le and Jim Zidek co-authored an important paper in the Journal of Multivariate Analysis titled ‘Interpolation with uncertain covariances: A Bayesian alternative to kriging’. In 1995, Jim Zidek and Constance van Eeden incorporated their group Bayesian procedures for estimating the exponential mean as a component of their review of the Wald theory in the edited volume Statistical Decision Theory and Related Topics. Jim Zidek first learnt about the nuances of the Bayesian paradigm during his sabbatical year at University College London during 1971-2. After a tentative start during the 1970s, he became one of the leading applied Bayesians in Canada.

 

Jim Zidek FRSC

 

    Constance van Eeden is nowadays an honorary professor of statistics at the University of British Columbia. She co-authored three Bayesian papers with Alec Charras during the early 1990s, and she has also made many Bayes-related contributions to decision theory. She is regarded as the grande dame of Canadian Statistics.

    Canada is indeed rich in Bayesians. For example, Irwin Guttman’s areas of interest include statistical inference, design problems, variable selection, and medical diagnosis. Now a Professor Emeritus at the University of Toronto, Irwin worked in the Department of Statistics there during the 1970s and 80s when it was teeming with Don Fraser’s highly political structural fiducialists. In between times, he served as Chairman of the once-celebrated Department of Statistics at SUNY Buffalo. Irwin is a compulsive Bayesian, a pal of Norman Draper, and one of our most likeable personalities.

 

Irwin Guttman

 
    Prem Goel and N. Sreenivas Iyengar co-edited the splendid volume Bayesian Analysis in Statistics and Econometrics in 1992. This was based on the papers presented at an Indo-American Bayesian workshop in Bangalore, India, in 1988.
 
Prem Goel advising a student
 

    Prem Goel is Professor of Statistics at Ohio State. His many areas of research interest include Bayesian hierarchical modeling for non-linear dynamic systems, and image processing for automatic pattern recognition of vehicles in airborne and high-resolution satellite images. He is much respected.

    In his outstanding 1992 paper in Statistica Sinica, Jun Shao proposed some ingenious empirical Bayes estimators for several heteroscedastic variances in the linear statistical model. He moreover investigated the invariance, robustness, asymptotic, and mean squared error properties of his estimators. This was a very thorough study from the pre-MCMC era.

    In their 1992 JASA papers, Ludwig Fahrmeir used posterior modes to extend the Kalman filter to non-linear multivariate dynamic models, Isabella Verdinelli and Jay Kadane co-authored the paper ‘Bayesian Design for Maximising Information and Outcome’, and Guido Consonni and Piero Veronese proposed some conjugate priors for exponential families with quadratic variance functions.

 
 

Ludwig Fahrmeir and Guido Consonni
 

Still in 1992, John Hsu and I reported a Bayesian procedure in the Annals of Statistics for drawing inferences about the elements of the p×p covariance matrix C of a multivariate normal distribution, using a non-conjugate prior distribution with a very flexible parametric structure when contrasted with the quite restrictive parameterization of the conjugate inverted Wishart distribution first proposed by Gwyn Evans in 1965.

    The matrix logarithm A of a positive definite covariance matrix C possesses the same eigenvectors as C, but its eigenvalues are equal to the logs of the corresponding eigenvalues of C. Let q = p(p+1)/2, and let α denote a q×1 vector consisting of the diagonal and upper triangular elements of A, arranged in some specified order. Then α is unconstrained in q-dimensional real space. John Hsu and I took the prior distribution of α to be multivariate normal with specified mean vector μ and covariance matrix D. There are q(q+3)/2 prior parameters in this specification, when compared with the q+1 parameters appearing in the inverted Wishart distribution.
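    The following small numerical sketch, with a purely illustrative 3×3 covariance matrix, may help to fix ideas: the matrix logarithm shares the eigenvectors of C, its eigenvalues are the logs of those of C, and its q = p(p+1)/2 distinct elements form the unconstrained vector α.

```python
# Matrix-logarithm parameterisation of a covariance matrix: a hedged numerical sketch.
import numpy as np
from scipy.linalg import expm, logm

C = np.array([[2.0, 0.6, 0.3],
              [0.6, 1.5, 0.4],
              [0.3, 0.4, 1.0]])             # an illustrative positive definite matrix

vals, vecs = np.linalg.eigh(C)              # eigenvalues and eigenvectors of C
A = vecs @ np.diag(np.log(vals)) @ vecs.T   # matrix log: same eigenvectors, log eigenvalues
assert np.allclose(A, logm(C))              # agrees with a general-purpose matrix logarithm
assert np.allclose(expm(A), C)              # exponentiating A recovers C

p = C.shape[0]
alpha = A[np.triu_indices(p)]               # diagonal and upper-triangular elements of A
print(len(alpha), "== p(p+1)/2 ==", p * (p + 1) // 2)   # q = 6, unconstrained in R^q
```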

    In the special case where the prior mean matrix of A assumes intra-class form, we propose specifying two further prior parameters, so that there are in total four prior parameters. This expresses more flexible degrees of belief than the inverted Wishart prior formulation employed by Chan-Fu Chen in his 1979 JRSSB paper, which invokes three prior parameters.

    We applied our general prior specification to the situation where the observation vectors constitute a random sample from a multivariate normal distribution with zero mean vector and covariance matrix C. However, with minor algebraic adjustments (e.g. replacing n by n-1), our procedure can also be employed in the situation where the unknown population mean vector θ is a priori uniformly distributed over p-dimensional Euclidean space.

    Our algebraically explicit approximate Bayesian inferences and exact importance sampling procedures (which are based on simulations from a generalised multivariate t-distribution) refer to the mathematical physicist Richard Bellman’s recursive solution of a Volterra integral equation, a second-order Taylor series approximation to the log-likelihood of α, and a spectral representation, in terms of its eigenvalues and eigenvectors, of the maximum likelihood estimate of A.

    In their insightful 2013 paper in the Journal of Computational and Graphical Statistics, Xinwei Deng and Kam-Wah Tsui generalise our multivariate normal approximation to the likelihood of α to the situation where the sample covariance is non-singular, and show that the quadratic form in their special case prior density behaves like a roughness penalty. Their posterior estimate of C simultaneously regularizes the smallest and largest eigenvalues of the covariance matrix.

    In 1996, Tom Y.M. Chiu, Kam-Wah Tsui and I extended our 1992 paper in an article in JASA by formulating the matrix logarithmic covariance model for n independent multivariate normal vectors with different covariance matrices. We indicated that the model could be analysed using Bayesian, as well as maximum likelihood, methods.

    Our methodologies relating to the matrix logarithm of a covariance matrix have since been referenced by a number of authors in the Econometrics literature, and applied and extended to spatial processes, random effects models and multivariate time series models, some of which provide alternatives to the classic stochastic volatility models, of the type described by Neil Shephard and Stephen Pitt in their 1996 paper in Bayesian Statistics 6.

    See, for example, p224 of Introduction to Spatial Econometrics by James LeSage and Kelley Pace, the review of multivariate stochastic volatility by Manabu Asai, Michael McAleer and Jun Yu in Econometric Reviews (2006), and the matrix exponential GARCH time series models investigated by Hiroyuki Kawakatsu in the Journal of Econometrics (2006).

 
 

   
R. Kelley Pace and Manabu Asai
 
    In their 2013 CIRJE discussion paper ‘Matrix exponential stochastic volatility with cross leverage’, Tsunehiro Ishihara, Yasuhiro Omori, and Manabu Asai extend our Bayesian ideas in quite brilliant fashion, while describing the 1996 paper by Chiu et al as ‘seminal’.
 
    I have frequently been ridiculed for suggesting that the methodology for the linear model approach with unequal variances described in my 1975 Technometrics paper yielded, as a special case, an early approach for handling stochastically volatile data. However, several authors have now shown that multivariate extensions of my hierarchical 1975 model can handle stochastically volatile data in even more general terms. It's time to stop laughing, Nick Polson. You too, Neil Shephard!
 
 

In 1993, Julian Besag and Peter Green read a paper to the Royal Statistical Society about modern Bayesian computational methods in spatial statistics. It was well-received.

    Julian Besag F.R.S. (1945-2010) was known chiefly for his work in spatial statistics, with applications in epidemiology, image analysis, and agricultural science, and in Bayesian computation using MCMC. I recall drinking with him in Newcastle, Leamington, Oxford, Madison, and San Francisco (see Author’s Notes below), and him advising me in the early 1970s that ‘the overriding problem with Bayesian inference is that the model’s never right’. I will always remember him. Both of us had a tendency for speaking the honest truth, and I, personally, have no problem with that.

    Also in 1993, the eminent medical statistician Jane Hutton co-authored ‘Bayesian sample size calculations and prior beliefs about sexual abuse’ with R.G. Owens, and ‘A Bayesian analysis for case control studies in cancer epidemiology’ with Deborah Ashby and Magnus McGee. She published a further article about Bayesian epidemiology with Deborah Ashby in 1996. In 2012, she, Lorna Barclay, and Jim Smith reported their experiences while embellishing a Bayesian network using a chain event graph, and also co-authored a paper entitled ‘Chain graphs for informed missingness’ in Bayesian Analysis. Jane always mixes high quality mathematics with practical common sense.

    In their 1993 papers in JASA, Kung-Sik Chan reported on the asymptotic behaviour of the Gibbs sampler, Richard McKelvey and Thomas Palfrey described their Bayesian sequential study of learning in games, and Jim Berger and Dongchu Sun reported their Bayesian analysis of the Poly-Weibull distribution.


Still in 1993, the International Society for Bayesian Analysis (ISBA) held their first ever conference, in the Hotel Nikko in San Francisco, concurrently with the annual meetings of the Institute of Mathematical Statistics (IMS) and the American Statistical Association (ASA).

    J. Stuart Hunter wrote that he would address the conference dinner with the following words:

You asked for a ‘title of my presentation’. I do not plan to do more than confess my Bayesianism, and say a few words of greetings as the president of the ASA

Since then, ISBA has evolved into a broadly-based interdisciplinary organization, linking many areas of science, medicine and socio-economics. The Society’s electronic journal Bayesian Analysis has become a popular and much-respected resource, thanks to the efforts of Rob Kass, Brad Carlin, and several other leading statisticians including Angelika van der Linde. Bayesian Analysis currently has the sixth highest impact factor among the 117 listed Statistics and Probability journals. ISBA later took over the organizational responsibilities for the Bayesian Valencia Conferences, but with Jose Bernardo still playing a leading role.

 

Tom Leonard, Arnold Zellner and other Bayesians attending the inaugural ISBA conference in the Hotel Nikko in San Francisco in 1993.
The four other Bayesians are Gordon Kaufmann, Wes Johnson, Carl Morris and Shanti Gupta

 

Brad Carlin and his colleagues made numerous important contributions during the 1990s. They include an article with Stuart Klugman about hierarchical Bayesian Whittaker graduation, which appeared in the Scandinavian Actuarial Journal; a discussion paper with Kathleen Chaloner, Tom Louis and Frank Rhame on elicitation, monitoring and analysis for an AIDS/HIV trial; an application of MCMC to model choice with Sid Chib; ‘Bayesian Tobit modeling of longitudinal ordinal clinical trial compliance data with non-ignorable missingness’, with Mary Kathryn Cowles and John Connett, which related to a lung health study; and an excellent 1996 comparative review in Statistical Science, with Mary Cowles, of convergence diagnostics for MCMC, where blinking was not even mentioned.

    Brad Carlin and Tom Louis’s bestselling text Bayesian Methods for Data Analysis combines the best elements of Bayesian and Fisherian Statistics. And Brad is doubtlessly the most humorous musical Bayesian in the entire world, as evidenced by his Bayesian Songbook.

    In 1994, the much respected Indian statistician Prakash Laud was appointed Acting Director and Professor in the Division of Biostatistics at the University of Missouri. His medical interests include the treatment of injuries, and breast cancer and osteoporosis screening. He specialises in parametric and semi-parametric Bayesian techniques for generalised linear and mixed effects models, models for time to event data, and genetic association studies.

    In two joint papers published in 1994 in JRSSB and Biometrika, Nick Polson and Gareth Roberts investigated the geometric convergence of the Gibbs sampler, and Bayes factors for discrete observations from diffusion processes. In the same year, Eric Jacquier, Nick Polson and Peter Rossi published their wonderful invited discussion paper ‘Bayesian Analysis of Stochastic Volatility Models’. This seminal contribution was named one of the most influential articles in the 20th anniversary issue of the Journal of Business and Economic Statistics.

    The economic applications were important. The authors used their highly complex MCMC computations for the analysis of stocks and portfolios. Maybe their models were simple enough to facilitate a more algebraically explicit approximate Bayesian analysis. It might, for example, be possible to find some stochastic approximations which yield easily applicable updating formulae of the Kalman type.

    Nick Polson obtained his Ph.D. in 1988 from the University of Nottingham, where he was supervised by Adrian Smith. He is currently Robert Law Jr. Professor of Econometrics and Statistics at the University of Chicago. He is also an excellent gossip. I don’t know what historians would do without the likes of Professor Polson.

    Jon Forster and Allan Skene published some computing algorithms in Statistics and Computing in 1994 for the marginal posterior densities of the parameters of multinomial distributions. In 1998, Jon Forster and Fred Smith reported their model-based inferences from categorical survey data subject to non-ignorable nonresponse in JRSSB.

    Jon Forster has an outstanding track record in the Bayesian analysis of categorical and multivariate ordinal data, with a view to applications in the Social Sciences. He is Professor of Mathematics at the University of Southampton, where he is a very fine teacher of Bayesian technology.

    In their 1994 papers in JASA, Thomas Severini showed how to derive approximate Bayesian inferences when the prior information is summarised by a system of interval estimates, Ming-Hui Chen developed a procedure for importance weighted marginal posterior density estimation, and Daniel Phillips and Adrian Smith constructed faces using hierarchical template modeling.

 

During the mid-1990s there were only a handful of active Bayesian statisticians in Germany. In 1995, Ludwig Fahrmeir and Gerhard Tutz helped to spread Bayesian ideas with their book Multivariate Statistical Modelling Based on Generalized Linear Models. In 1995, Angelika van der Linde published her Bayesian interpretation of smoothing splines in Test, and in 2000 she reported her reference priors for smoothing and shrinkage parameters in the Journal of Statistical Planning and Inference.

    Angelika is a well-known and highly accomplished Bayesian, with an excellent perspective on life. She recently retired as Extraordinary Professor of Mathematics at the University of Bremen.

 

Tom's former colleague Angelika van der Linde (Oldenburg, 2012)

 

    Ludwig Fahrmeir is Emeritus Professor of Statistics at the University of Munich. His recent research interests include Bayesian inference, regularisation, smoothing and prediction, and applications in many areas, including childhood morbidity, forest health, and survival data.

    In 1997, Katja Ickstadt of the University of Dortmund published an important joint paper in JASA with Nicky Best and Robert Wolpert, on spatial Poisson regression for health and exposure data. Katja has published numerous Bayesian papers on medical topics ever since, and nowadays is one of the leading Bayesian statisticians in Germany.

 
Katja Ickstadt

 

In 1995, James A. Smith, Michael Goldstein, Peter Craig and Allan Seheult read an invited paper to the Royal Statistical Society about their linear Bayes approach for matching hydrocarbon reservoir history. The expert knowledge of the reservoir engineer was incorporated. The contributions of the fledgling first co-author were, according to one of his co-authors, close to zilch (and reportedly even less than zilch!). This was apparently since he was quite frequently side-tracked, when working on his PC, by his socially important obligations as a Boy Scout cub leader.

    Christopher Bishop’s 1995 text Neural Networks for Pattern Recognition took many previously published advanced Bayesian techniques into Artificial Intelligence (A.I.).

    Bishop was to take numerous pre-existing Bayesian semi-parametric and hierarchical Bayes techniques even further into Machine Intelligence in his 2005 book Pattern Recognition and Machine Learning. The positive impact of this transfer of knowledge upon the disciplines of Artificial Intelligence and Machine Intelligence has been enormous. It’s rather like the way the Moors handed over much of their vast knowledge of the ancient Greek and Roman cultures to the Christians during the 11th century, in the Spanish city of Toledo.

    Christopher Bishop is to be congratulated on his tremendous perceptions and insights which have led to immense advances in Machine Intelligence, following very much in the spirit of Alan Turing. He is a Distinguished Scientist at Microsoft Research Ltd in Cambridge, England, and Professor of Computer Science at the University of Edinburgh.

    Michael I. Jordan has made similarly impressive efforts. He is Pehong Chen Distinguished Professor in the Departments of Electrical Engineering and Computer Science, and Statistics at the University of California at Berkeley.

    In 1995 and 1996, Rob Kass published two major review articles in JASA, the first with Adrian Raftery on Bayes factors and the second with Larry Wasserman on prior distributions.

    Kass was to later co-author two influential reviews on Statistics in neuroscience. He was the founding editor-in-chief in 2006 of ISBA’s interdisciplinary journal Bayesian Analysis.

    In their 1995 JASA paper, Guido Consonni and Piero Veronese proposed using hierarchical partition models to combine results from several binomial experiments. In similar spirit, Peter Green proposed a method for model determination in Biometrika (1995) that constructs Markov chain samplers which make reversible jumps between parameter spaces of different dimensionality. The convergence problems were phenomenal. Nevertheless, Peter’s paper has received over 3000 citations in the scientific literature. He certainly seemed to make everybody jump.

    Guido Consonni is Professor of Statistics at the Catholic University of Milan. He and Fabrizio Ruggeri, a research director with the National Research Council in Milan and a previous president of ISBA, are two of the leading Bayesians in Italy.

    Piero Veronese is Professor of Statistics at Bocconi University. His research interests include the main theoretical issues that Bayesians should be involved in. As a sportsman, he believes that ‘the important thing is to spend your free time outdoors’. He is fond of mountain skiing, trekking with snowshoes and crampons, and fantastic experiences in the Algerian desert near Tassili n’Ajjer.

 
 
Piero Veronese
 

    In their 1995 JASA paper, Michael Escobar and Mike West proposed their Bayesian methodology for estimating densities by mixtures, and, in his 1995 JRSSB article, Peter Green combined his adventures with reversible jump MCMC with Bayesian inference in complex stochastic systems and spatial processes, forensic genetics, Bayesian semi-parametrics, and graphical procedures.

    An erstwhile President of the Royal Statistical Society, and the recipient of Guy Medals in silver and bronze, Peter is an Emeritus Professor of Statistics at the University of Bristol, where he worked for many years with Bernie Silverman. Their high quality book Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach was published in 1994.

    Sylvia Richardson and Wally Gilks reported two Bayesian methods in 1993 and 1994, in Statistics in Medicine and the American Journal of Epidemiology, for analysing conditional independence models for epidemiological data.  In 1997, Sylvia Richardson and Peter Green co-authored a widely-cited paper in JRSSB concerning their Bayesian analysis of mixtures with unknown numbers of components. Since then, Sylvia has become a very keen proponent of MCMC and other stochastic algorithms, which she applies to many areas of genetics and medicine, including genomics and meta-analysis. She advocated analysing Bayesian hierarchical models using the MCMC and acceptance sampling procedures in WinBUGS.

 

Sylvia Richardson

 

    Sylvia Richardson, who is one of the leading French Bayesians, held the Chair of Biostatistics at Imperial College London until 2012. She is now Professor of Biostatistics and Director of the MRC Biostatistics Unit at the University of Cambridge. She’d met a number of similarly enthusiastic Bayesians during the 1970s while she was a lecturer in Statistics at the University of Warwick.

    Chris Glasbey and Graham Horgan published their book Image Analysis for the Biological Sciences in 1995 with John Wiley. The authors worked at BIOSS (Biomathematics and Statistics Scotland) which is housed in the King’s Buildings, Edinburgh, and applied their methodology to microscopy, medical image systems, and remote sensing. Chris was soon to be a Doctor of Science and Honorary Professor of the University of Edinburgh. He was later elected Fellow of the Royal Society of Edinburgh for his somewhat Bayesian contributions, many of which found application in agriculture. Chris also worked with the Roslin Institute e.g. with Caroline Robinson who used a Bayesian template method to estimate the amount of meat in sheep.

 

In 1996, the eminent Irish statistician Adrian Raftery published ‘Approximate Bayes factors and accounting for model uncertainty in generalised linear models’ in Biometrika. In their 1997 article in Applied Statistics, Chris Volinsky, David Madigan, Adrian Raftery, and Richard Kronmal used their Bayesian model averaging procedures for proportionate hazards models to assess the risk of strokes. A wonderful accomplishment!

    Adrian Raftery is Professor of Statistics and Sociology at the University of Washington in Seattle. He was the world’s most cited mathematician for the entire decade 1995-2005. In 2012 he was awarded the prestigious Parzen Prize by Texas A&M University. His citation mentioned his Bayesian applications in probabilistic forecasting, model-based clustering and classification, time series, image analysis, sociology, demography, environmental sciences, and health sciences. He was recently elected to the Royal Irish Academy.

    In their 1996 JASA papers, Valen Johnson reported his Bayesian analysis of multirater ordinal data, with an application to automated essay grading, Michael Newton, Claudia Czado and Rick Chappell described their Bayesian inferences for semi-parametric binary regression, and Cinzia Carota, Giovanni Parmigiani and Nick Polson investigated some diagnostic measures for model criticism.

    Also in 1996, Keith Abrams, Deborah Ashby, and Doug Errington reported their Bayesian analysis of Weibull survivor time models in Lifetime Data Analysis, together with their applications to cancer trials. In his 1990 Ph.D. thesis, John Hsu found that mixtures of Weibull distributions gave a better fit to cancer survival data. In 1997, Fayers, Ashby and Parmar published a biostatistics tutorial on Bayesian monitoring in clinical trials.

 

Deborah Ashby

 

    Deborah Ashby is a Bayesian biostatistician with interests in many areas of medicine. She received her O.B.E. in 2009 and was elected to the Academy of Medical Sciences in 2012. She holds the Chair in Medical Statistics and Clinical Trials at Imperial College London.

    Again in 1996, James Bennett, Amy Racine-Poon and Jon Wakefield described how MCMC can be employed for the analysis of non-linear hierarchical models. Their paper appeared in Markov Chain Monte Carlo in Practice (edited by Wally Gilks, Sylvia Richardson and David Spiegelhalter).

    In that same year, Stephen Walker and Jon Wakefield reported their Bayesian semi-parametric approach, in Bayesian Statistics 5, for the population modeling of a monotonic dose response curve.

    In their 1997 paper in JASA, Peter Müller and Gary Rosner applied a Bayesian population model with non-linear hierarchical mixture priors to blood count data.

    The Austrian Bayesian Peter Müller is Professor of Statistics at the University of Texas at Austin, and a past president of ISBA. He works on semi-parametric Bayesian inference, design problems, biomedical research, dependence structures, graphical models, high throughput genomic data, and population pharmacokinetic and pharmacodynamic models.

    Radford Neal published his book Bayesian Learning for Neural Networks in 1996. Radford currently holds the Canada Research Chair in Statistics and Machine Learning at the University of Toronto, and he has since published numerous high quality Bayesian papers, many with a view to applications in Machine Intelligence, but some with more general applicability.

 
Radford Neal
 

    In 1996, Stuart Coles and Elwyn Powell reviewed the ongoing developments of Bayesian methods in extreme value modeling in the International Statistical Review.

    Stuart Coles and Antony Davison co-authored their book An Introduction to Statistical Modeling of Extreme Values in 2001.

    Stuart is currently an Associate Professor of Statistics at the University of Padua.

    

In 1997, Jon Wakefield, Leon Aarons and Amy Racine-Poon co-authored a Bayesian approach to pharmacokinetic/pharmacodynamic modeling in Case Studies in Bayesian Statistics (edited by Brad Carlin and six equally worthy co-editors).

    One of the co-authors’ principal aims was to discover, for a particular drug, the relationship between the dose administered, drug concentrations in the body, and efficacy/toxicity. They derived a sophisticated three-stage hierarchical model from a set of differential equations, and this helped them to achieve their objectives.

    Amy Racine-Poon worked for the Pharma division of Novartis in Basel, Switzerland. She has co-authored a number of high quality applied Bayesian papers which also refer to elegant mathematical theory, and she ranks highly among mainland European Bayesians.

    Jon Wakefield is Professor of Statistics and Biostatistics at the University of Washington in Seattle. He has authored important applied Bayesian papers in numerous areas, including genetic epidemiology and genome-wide association studies. He is always genuinely concerned about the frequency properties of his Bayesian procedures, and should perhaps, like Brad Carlin, be regarded in philosophical terms as a ‘Bayesian-Fisherian’ statistician.

    Jon Wakefield is yet another of Sir Adrian Smith’s highly successful former Ph.D. students.
 

    Sir Adrian’s influence across the discipline was by the mid-1990s becoming enormous. After assuming a number of important leadership roles, he is currently Vice-Chancellor of the University of London, and also Deputy Chair of the U.K. Statistics Authority. Adrian has supervised 41 successful Bayesian Ph.D. students altogether, most of whom have gone on to achieve greater heights. They include Michael Goldstein, Uri Makov, Allan Skene, Lawrence Pettit, John Naylor, Ewart Shaw, Susan Hills, Nick Polson, David Spiegelhalter and Mike West, a phenomenal achievement. Both John Naylor and Ewart Shaw provided Adrian with remarkably sound computing expertise during the 1980s, before Bayesian MCMC came into vogue. They, for example, developed a computer package known as Bayes 4 which employed some reassuringly convincing, algebraically expressed, approximate Bayesian techniques. Bayes 4 has recently been incorporated into Ewart Shaw’s larger package BINGO. Maybe Sir Adrian should be regarded as the Sir Isaac Newton of modern Bayesian Statistics.

 
Ewart Shaw
 

    Also in 1997, Bob Mau and Michael Newton proposed using MCMC, in their article in the Journal of Computational and Graphical Statistics, when addressing phylogenetic inference for binary data on dendrograms.

    In the same year, Bruce Craig, Michael Newton, Robert Garrott, John Reynolds and J. Ross Wilcox proposed using MCMC, in their Biometrics paper, to analyse aerial survey data on Florida Manatee. In 1996, Bruce, the son of University of Wisconsin Dean Judy Craig, had won an ENAR student paper prize for his endeavours.

    Michael Newton, who hails from Nova Scotia, Canada, is currently director of the Biostatistics program and co-director of the Cancer Genetics program at the University of Wisconsin-Madison, and is one of our most brilliant younger middle-aged Bayesians. He was the recipient of the George Snedecor Award and the COPSS Presidents Award, in 1997 and 2004, and he has received several further top honours.

 
Michael Newton
 

    In 1997, Edward George and Robert McCulloch reported their Bayesian approaches to variable selection in Statistica Sinica.

    Ed George is Universal Furniture Professor of Statistics at the University of Pennsylvania. He is interested in the Bayes/ empirical Bayes compromise, and his application areas include business, Bayesian ensemble learning, and serial genetics. He has made many wonderful contributions.

    Rob McCulloch is Katherine Dusak Miller Professor of Econometrics and Statistics at the University of Chicago. Similarly prolific, he is also interested in machine learning.

    John Kent reviewed the literature of Bayesian methods for image analysis by deformable templates in 1997 in The Proceedings in the Art and Science of Image Analysis. The observation vectors are usually modelled by a mixture of multivariate normal distributions with fixed locations and simple covariance structures, and (assumption-sensitive) prior distributions are then assigned to the mixing probabilities and a single dispersion parameter. John co-authored a paper with Duncan Lee and Kanti Mardia in the same proceedings, where they used a related Bayesian approach to tag cardiac MR images.

    In their 1997 papers in JASA, Iain Weir reported his fully Bayesian reconstructions for single photon emission computed tomography data, Michael Evans, Zvi Gilula, Irwin Guttman and Tim Swartz described their Bayesian analysis of stochastically ordered distributions of categorical variables, Cindy Christiansen and Carl Morris analysed their hierarchical Poisson regression models, and Jim Albert and Sid Chib proposed their Bayesian tests and diagnostics in conditionally independent hierarchical models.


Newton Bowers, James Hickman, Cecil Nesbitt, Donald Jones and Hans Gerber published their magnum opus Actuarial Mathematics in 1997.

    James C. Hickman (1927-2006) was the predominant force in introducing Bayesian inference into the actuarial sciences. He was the erstwhile Dean of the University of Wisconsin-Madison Business School. I first met him and his lovely wife in Iowa City in 1972. He was a man of vision who was always ready to incorporate subjective information into his analyses e.g. when trying to set the insurance premiums for Jumbo jets in the era before any Jumbos had crashed.

 

James C. Hickman (1927-2006)

 
Christian Robert co-authored six important Bayesian papers during 1998. He and Constantinos Goutis used Kullback-Leibler projections to make Bayesian choices between competing generalised linear models. To cap that, Christian combined with Mike Titterington to develop some reparametrization strategies for hidden Markov models, and Bayesian approaches to maximum likelihood estimation. Christian also published four joint papers during 1998 on MCMC and its applications to Bayesian inference.
 
Christian Robert
 

    Christian Robert has been an impressive advocate of the Bayesian paradigm ever since, as epitomised by his influential book The Bayesian Choice. He is currently Professor of Statistics at Université Paris Dauphine. Like his compatriot Sylvia Richardson, he was very much influenced by his earlier good times with the Bayesian school at the University of Warwick.

    Also in 1998, Malay Ghosh, Kannan Natarajan, Tom Stroud, and Brad Carlin reported some MCMC procedures in JASA for the analysis of sample survey data via generalised linear models for small area estimation. They extended their general theorem to the case of spatial models and reviewed the related literature.

    Malay Ghosh and Glen Meeden had published their important text Bayesian Methods for Finite Population Sampling in 1997. Ghosh, an eminent Indian statistician who was born in Bengal, is a Distinguished Professor at the University of Florida. He served from 1996 to 2001 on the US Census Advisory Committee.

    Glen Meeden is Professor of Statistics at the University of Minnesota. He has published extensively in Bayesian inference and decision theory. When I first met Glen, at Ann Arbor, Michigan, in 1978, he exclaimed, ‘Oh, you’re the guy who obtained that neat estimate for the mean of a normal distribution’.

    Glen was presumably referring to my note in Biometrika 1974, when I modified the usual Bayes estimate of a normal mean by reference to a prior with infinitely thick tails. If I’d used the posterior mean rather than the mode, then my highly robustified generalised Bayes estimate would have been even more convincing. Maybe a keen student somewhere would like to work out the algebra.

    Tom Stroud has now retired from Queen’s University, Kingston, Ontario, after a very productive career in applied Bayesian Statistics during which he worked with Louis Broekhoven in his department’s STATLAB.

    Louis made various contributions to spline theory and Bayesian applications. He also contributed much of the expertise during the preparation of our joint paper with Jim Low in the American Journal of Obstetrics and Gynaecology (1981). Louis was a former student of Florence David’s at UCL, where he and a cohort of further postgraduates were expected to grind out endless asymptotic expressions.

    Florence once famously declared ‘You’re not getting into my car, George Box!’ after George had criticised her talk on a similarly boring topic to the Royal Statistical Society.


In their JASA 1998 papers, Claudia Tebaldi and Mike West reported their Bayesian inferences for network traffic data, Jim Dickey and Thomas Jiang described their filtered-variate prior distributions for histogram smoothing, and Babette Brumback and John Rice presented a prestigious discussion paper on smoothing spline models for the analysis of nested and crossed samples of curves.

    The volume Maximum Entropy and Bayesian Methods was co-edited in Garching, Germany in 1998 by Wolfgang von der Linden, Volker Dose, Rainer Fischer and Roland Preuss as a contribution to the fundamental theories of Physics.

    This outstanding volume includes a dedication to Edwin Thompson Jaynes (1922-98) by G. Larry Bretthorst. Jaynes was born and raised in Iowa.

    Other valuable contributions include:

A. A paper by Fischer, Jacob, von der Linden and Dose on the Bayesian reconstruction of electron energy distributions from emission line intensities.

B. An article by Richard Silver of the Los Alamos National Laboratory in New Mexico on quantum entropy regularization.

C. A Bayesian reflection on surfaces by David R. Wolf.

    Larry Bretthorst is a professor in the Department of Chemistry and Radiology at Washington University in St. Louis. He has published numerous exciting papers e.g. on Bayesian spectrum analysis.

 

In 1999, Bob Mau, Michael Newton and Bret Larget reported their most recent MCMC procedures for Bayesian phylogenetic inferences, in Biometrics.

    Bob Mau is a senior scientist in the Genome Evolutionary Laboratory at the University of Wisconsin-Madison, and he has co-authored several Bayesian articles in his area of specialism.

    In their 1999 paper in Statistics in Medicine, Luke Tierney and Antonietta Mira of the University of Minnesota in Minneapolis developed some adaptive strategies which adjust the MCMC algorithm to a particular context, based upon information obtained during sampling together with information provided by the problem. The authors used their adaptive MCMC analysis of a pharmacokinetic model to investigate the plasma concentrations of the drug Cadralazine in cardiac failure patients.

    Also in 1999, the New Zealand Bayesian Russell Millar [78] used the WinBUGS code, with Renate Meyer, for fitting a state space surplus production model.

    Russell Millar is Associate Professor of Statistics at the University of Auckland, and editor of the Australian and New Zealand Journal of Statistics. Russell has co-authored several papers on the Bayesian state-space modeling of fisheries dynamics, including surplus production and age-structured models. He moreover published his book Maximum Likelihood Estimation and Inference with John Wiley in 2011.

 
Russell Millar
 

    In the same year, Paula Macrossan and five co-authors from the Queensland University of Technology reported their Bayesian neural network for prediction in the Australian dairy industry to the Third International Symposium on Intelligent Data Analysis in the Netherlands. One of the co-authors, Hussein Abbass, worked in his university’s machine learning centre. The authors were able to successfully predict dairy daughter milk production from dairy dam, sire, herd and environmental factors.

    To cap that, Steven MacEachern and Merlise Clyde reported their sequential importance sampling simulations for semi-parametric Bayesian models in the Canadian Journal of Statistics, and Merlise developed some Bayesian model averaging and model search strategies in Bayesian Statistics 6. Her model averaging procedures do seem to be potentially quite sensitive to small changes in the highly complex prior distributional assumptions.

    Merlise Clyde is a professor in the all-Bayesian Department of Statistical Science at Duke University. She applies her techniques to applications in proteomics, bioinformatics, astro-statistics, air pollution and health effects, and environmental sciences.

 
Merlise Clyde, ISBA President 2013
 

    Merlise is the current (for 2013), enormously productive and somewhat scatter-brained President of the International Society for Bayesian Analysis (ISBA).

     Not to be outdone, Helio Migon and Dani Gamerman published their splendid text Statistical Inference: An Integrated Approach, also in 1999. Once postgraduate students at the University of Warwick, the ‘Boys from Brazil’ are nowadays two of their country’s leading Bayesians.

    In their papers in JASA during 1999, Jim Berger and Julia Mortera proposed some default Bayes factors for one-sided hypothesis testing, Alan Gelfand and Sujit Sahu addressed identifiability problems, improper priors and Gibbs sampling for non-linear models, and Florence Forbes and Adrian Raftery co-authored the fascinating paper ‘Bayesian morphology: Fast unsupervised Bayesian image analysis’.


A leading UCL Bayesian was very much involved in the eventually very tragic 1999 Sally Clark case, when a 35-year-old solicitor was convicted of murdering her two babies, deaths which had originally been attributed to sudden infant death syndrome.

 
Sally Clark
 

    Professor Philip Dawid, an expert called by Sally Clark’s team during the appeals process, pointed out that applying the same flawed method to the statistics for infant murder in England and Wales in 1996 could suggest that the probability of two babies in one family being murdered was 1 in 2,152,224,291 (that is, roughly 1 in 2 billion), a figure even more outlandish than the 1 in 73 million suggested by the highly eminent prosecution expert witness Professor Sir Roy Meadow, when the relevant probability of innocence may have been as high as 2 in 3. Sir Roy was later struck off the medical register, but was reinstated upon appeal. Sally Clark committed suicide after being released from prison following several appeals.


As the decade drew to a close, Robert Cowell, Phil Dawid, David Spiegelhalter and Steffen Lauritzen combined their grey and silver matter to co-author the outstanding book Probabilistic Networks and Expert Systems, which earned them the prestigious 2001 DeGroot Prize from ISBA. Here are their chapter headings:

 
 1. Introduction
 2. Logic, Uncertainty and Probability
 3. Building and using Probabilistic Networks
 4. Graph Theory
 5. Markov Properties of Graphs
 6. Discrete Networks
 7. Gaussian and Mixed Discrete Gaussian Networks
 8. Discrete Multi-Stage Decision Networks
 9. Learning about Probabilities
 10. Checking Models against data
 

    The authors’ epilogue includes the standard conjugate analyses for discrete data models, a discussion of Gibbs sampling, and an information section about software on the worldwide web.

    This was one of the most exciting Bayesian books since Morris De Groot published Optimal Statistical Decisions in 1970. It is a must for all criminal investigators and forensic scientists, and a good bedtime read for all fledgling Bayesians.

    This is a good point to take a chronological break. I will discuss the Bayesian advances during Anno Domini 2000 in the next chapter. Dennis Lindley’s read paper will be pivotal in my discussion of the transition of Bayesian ideas from the twentieth century to the next, and several of the papers published in 2000 were very relevant to future developments. These include the American Statistical Association vignettes composed by Jim Berger and several other Bayesians.



The Bayesian history of the twentieth century would not, however, be complete without a discussion of the ways misguided versions of Bayes theorem were misused in the O.J. Simpson murder and Adams rape trials during attempts to quantify DNA evidence in the Courtroom.

    Suppose more generally that some traces of DNA are found at a crime scene, and that a suspect is subsequently apprehended and charged. Then it might be possible for the Court to assess the ‘prior odds’ Φ/(1-Φ) that the trace came from the DNA of the defendant, where the prior probability Φ refers to the other, human, evidence in the case. Then, according to an easy rearrangement of Bayes theorem, the ‘posterior odds of a perfect match’ multiplies the prior odds by a Bayes factor R which is a ‘measure of evidence’ in the sense that it measures the information provided by the forensic scientists relating to their observations of the suspect’s DNA. This has frequently been based on their observations of the suspect’s allele lengths during 15 purportedly independent probes.

    In the all too frequent special case where Φ is set equal to 0.5, the posterior probability of a perfect match is λ=R/(R+1). In the United Kingdom, R turns out to be a billion with remarkable regularity, in which case 1-λ, loosely speaking the posterior probability that the trace at the crime scene doesn’t come from the defendant’s DNA, is one in a billion and one, and this is usually regarded as overwhelming evidence against the defendant.
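    The arithmetic of the previous two paragraphs can be captured in a few lines; the following sketch is purely illustrative, and the figures are not taken from any actual case.

```python
# The odds form of Bayes theorem described above: posterior odds = prior odds x R.
def posterior_match_probability(phi, R):
    """Posterior probability of a match, from prior probability phi and factor R."""
    posterior_odds = (phi / (1.0 - phi)) * R
    return posterior_odds / (1.0 + posterior_odds)

R = 1_000_000_000                       # 'a billion with remarkable regularity'
lam = posterior_match_probability(0.5, R)
print(1.0 - lam)                        # about 1e-9, i.e. one in a billion and one
```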

    While this procedure is being continuously modified, it has frequently been, at least in the past, outrageously and unscientifically incorrect, for the following reasons:

1. If there is no prior evidence, then, according to Laplace’s Principle of Insufficient Reason, Φ should be set equal to 1/N, where N is the size of the population of suspects, for example the size, maybe around 25 million, of the population of eligible males in the United Kingdom.

    In contrast, a prior probability of 0.5 is often introduced into the courtroom via Erik Essen-Möller’s shameless device [79] of a ‘random man’, which is just a mathematical trick, or by similarly fallacious arguments which have even been advocated by some leading proponents of the Bayesian paradigm. Essen-Möller’s formula was published in Vienna in 1938 around the time of the Nazi annexation of Austria. Maybe Hitler was the random man!

 
Erik Essen-Möller
 

    Erik Essen-Möller later made various influential contributions to the ‘genetic study’ of psychiatry and psychology, including schizophrenia, and was much fêted in his field. Jesus wept!

2. The genocrats don’t represent R by a Bayes factor, but rather by a likelihood ratio. In the case where 15 DNA probes are employed, the combined likelihood ratio R incorporates empirical estimates of 15 population distributions of allele lengths. These non-parametric empirical estimates were, during the 1990s, frequently derived from 15 small, non-random samples of allele lengths, and they can be highly deficient in statistical terms. See, for example, the excellent 1994 discussion paper and review by Kathryn Roeder in Statistical Science.

3. The overall, combined likelihood ratio R has typically been calculated by multiplying 15 individual likelihood ratios together. The multiplications would be open to some sort of justification if the empirical evidence from the 15 probes could be regarded as statistically independent. The genocrats and forensic scientists seek to justify statistical independence by the genetic independence which occurs in homogeneous populations that are in a state of Hardy-Weinberg equilibrium, e.g. when individuals choose their mates at random from all individuals of the opposite gender in the population. However, most populations are highly heterogeneous, and most people I know don’t choose their partners at random. For example, most heterosexual people choose heterosexual or bisexual partners.

    An inappropriate homogeneity assumption can greatly inflate the overall likelihood ratio R as a purported measure of evidence against the defendant. For example, if the individual likelihood ratios are each equal to four, then R is four to the power fifteen, which exceeds a billion, in situations where the ‘true’ combined measure of evidence might be quite small. The numerical sketch below illustrates the arithmetic.
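    Continuing the sketch above, and using only the illustrative figures already quoted (fifteen probes, individual likelihood ratios of four, and a suspect population of around 25 million), multiplying the ratios as if the probes were independent gives the following.

R_combined = 4 ** 15                     # 1,073,741,824: comfortably over a billion
N = 25_000_000                           # illustrative suspect population size
prior = 1 / N                            # the Laplace-style prior of point 1
prior_odds = prior / (1 - prior)
posterior_odds = prior_odds * R_combined
posterior = posterior_odds / (1 + posterior_odds)
print(R_combined, round(posterior, 4))   # roughly 0.977 even with the 1/N prior
# If the probes are in fact positively dependent, the true combined measure of
# evidence could be far smaller than R_combined suggests.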

    During the trial (1995-6) of the iconic American football star O.J. Simpson for the murder of his long-suffering former wife Nicole, the celebrated prosecution expert witness Professor Bruce Weir of North Carolina State University attempted to introduce blood and DNA evidence into the Courtroom via a similarly misguided misapplication of Bayes Theorem. The scientific content of his evidence was refuted by our very own Aussie Bayesian, Professor Terry Speed of the University of California at Berkeley. Terry was assisted in his efforts by an arithmetic error made by Professor Weir during the presentation of the prosecution testimony. However, when Simpson was found not guilty in the criminal court, the media largely attributed this to the way a police officer had planted the forensic evidence by throwing it over a garden wall.

    During the early 1990s, the defendant in the Denis Adams rape case was convicted on the basis of the high value of a purported likelihood ratio, even though the victim firmly stated that he wasn’t the man who’d raped her. Adams moreover appeared to have had a cast-iron alibi, since several witnesses said that he was with them many miles away at the time of the crime.

    The defence expert witness, the much-respected Professor Peter Donnelly of the University of Oxford, asked the members of the jury to assess their prior probabilities by referring to the human evidence in the case, in a valiant attempt to counter the influence of the large purported combined likelihood ratio. In retrospect, Peter should have gone straight for the jugular by refuting the grossly inflated likelihood ratio along the applied statistical lines I describe above.

    When the Court of Appeal later upheld Adams’ conviction, it strongly criticised the use of prior probabilities as a way of assessing human evidence and effectively threw Bayes Theorem out of Court, in that particular case at least. Adrian Smith, at that time the President of the Royal Statistical Society, was not at all amused by that rude affront to his raison d’être.

    For alternative viewpoints regarding the apparent misapplications of Bayes theorem in legal cases, see the well-cited books by Colin Aitken and Franco Taroni, and by David Balding, and the 1997 Royal Statistical Society invited discussion paper by L.A. Foreman, Adrian Smith and Ian Evett.

    Colin Aitken is Professor of Forensic Statistics at the University of Edinburgh. Jack Good’s earlier ‘justification’ of the widespread use of Bayes factors as measures of evidence is most clearly reported on page 247 of Colin and Franco’s widely cited book Statistics and the Evaluation of Evidence for Forensic Scientists, and on page 389 of Good’s 1988 paper ‘The Interface between Statistics and Philosophy of Science’ in Statistical Science.

    Jack Good invokes Themis, the ancient Greek goddess of justice, who was said to be holding a pair of scales on which she weighed opposing arguments. Colin Aitken is a leading advocate of the use of Bayes factors in criminal cases, and his officially documented recommendations to British Courts of Law depend heavily on Good’s key conclusion that any sensible additive weight of evidence must be the log of a Bayes factor. That creates visions of the Goddess Themis weighing the logs of Bayes factors on her scales and putting herself at loggerheads with the judge.

    Maybe I’m missing something, though perhaps not. Good’s ‘justification’ seems to be rather circuitous, and indeed little more than a regurgitation of the additive property,

 

Posterior log-odds = Prior log-odds + log(Bayes factor),

 

which can of course be extended to justify the addition of the logs of the Bayes factors from successive experiments, as spelt out below. This simple rearrangement of Bayes Theorem would, at first sight, appear to justify Good’s apparently seminal conclusion. However, since Bayes factors frequently possess counterintuitive properties (see Ch. 2), the entire idea of assigning a positive probability to a ‘sharp’, i.e. simple or only partly composite, null hypothesis is open to serious question in situations where the alternative hypothesis is composite.
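    To spell that extension out, in the same notation as the displayed formula above, and assuming the successive pieces of evidence are conditionally independent under both the null and the alternative hypothesis, the update after n experiments is

Posterior log-odds = Prior log-odds + log(Bayes factor 1) + log(Bayes factor 2) + ... + log(Bayes factor n),

so that the weights of evidence, in Good’s sense, simply add.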

    The approach to multivariate binary discrimination described by John Aitchison and Colin Aitken in their 1976 paper in Biometrika is much more convincing, as they use a non-parametric kernel method to empirically estimate the denominator in the Bayes factor. But a posterior probability cannot justifiably be associated with it, even as a limiting approximation.
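    For readers who would like a feel for what such a kernel estimate looks like, here is a minimal Python sketch of a kernel density estimate for multivariate binary data of the Aitchison-Aitken type; the function name, the smoothing value and the simulated data are illustrative assumptions of mine rather than anything taken from their paper.

import numpy as np

def binary_kernel_density(x, sample, lam=0.9):
    # Kernel estimate at the binary vector x from a sample of binary vectors:
    # each coordinate contributes lam when it agrees with the training point
    # and (1 - lam) when it does not, and the contributions are averaged.
    x = np.asarray(x)
    sample = np.asarray(sample)
    agree = (sample == x)                               # (n, k) boolean array
    weights = np.where(agree, lam, 1.0 - lam).prod(axis=1)
    return weights.mean()

# Toy likelihood-ratio calculation from two made-up training samples.
rng = np.random.default_rng(1)
group_1 = rng.integers(0, 2, size=(40, 6))
group_2 = rng.integers(0, 2, size=(40, 6))
x_new = np.array([1, 0, 1, 1, 0, 0])
lr = binary_kernel_density(x_new, group_1) / binary_kernel_density(x_new, group_2)
print(round(lr, 3))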

    Colin’s parametric Bayes factors are of course still useful if they are employed as test statistics, in which case they will always possess appealing frequency properties. See [15], pp 162-163. It’s when you try to use Bayes theorem to directly convert a Bayes factor into a posterior probability that there’s been trouble at Mill. The trouble could be averted by associating each of Colin’s Bayes factors with a Baskurt-Evans style Bayesian p-value. Perhaps Colin should guide the legal profession further, by writing another book on the subject.

    In [2] and [3], Jack Good used Bayes factors as measures of evidence when constructing an ambiguously defined measure of explicativity, which is, loosely speaking, ‘the extent to which one proposition or event explains why another should be believed’. I again find Jack’s mathematical formulation of an interesting, though not all pervading, philosophical concept to be a bit too fanciful.

    The co-authors of a 1997 R.S.S. invited paper on topics relating to the genetic evaluation of evidence included Dr. Ian Evett of the British Forensic Science Service and my nemesis Adrian Smith. They, however, chose not to substantively reply, in their formal response to the discussants, to the further searching questions which I included in my written contribution to the discussion of their paper.

    These were inspired by my numerous experiences as a defence expert witness in U.S. Courts. In 1992, I’d successfully challenged an alleged probability of paternity of 99.99994% in Phillips, Wisconsin, and district attorneys in the Mid-West used to settle paternity testing cases when they heard I was coming. The 1992 ‘Rosie and the ten construction workers case’, when I refuted a prior probability of paternity of 0.5 and a related posterior probability of over 99.99% in Decorah, Iowa, seemed to turn a Forensic Statistics conference in Edinburgh in 1996 head over heels, and Phil Dawid and Julia Mortera made lots of amusing jokes about it over dinner while Ian Evett and Bruce Weir fumed in the background. I, however, always declined to participate in the gruesome rape and murder cases in Chicago.

    Nevertheless, during a subsequent light-hearted public talk on a different topic at the 1997 Science Festival in Edinburgh, Adrian somewhat petulantly singled me out as a person who disagreed with him. I’m glad that I had the temerity to do so.

    Our leading statisticians and forensic scientists should be ultra-careful not to put the public, or the remainder of their profession, into a state of mystification. Jimmie Savage had a habit of putting down statisticians who questioned him, e.g. his very sharp, though self-effacing, brother-in-law Frank Anscombe, who was an occasional Bayesian. John Tukey is said to have put pressure on George Box to leave Princeton in 1960 for related reasons, after Tukey received some of the same medicine when he visited Ronald Fisher for afternoon tea. Holier-than-thou statisticians should realise that they are likely to be wrong some of the time, at least in concept, along with everybody else. As George Box once said, ‘We should always be prepared to forgive ourselves when we screw up.’

    Some further problems inherent in the Bayesian evaluation of legal evidence are satirized in Ch. 14: Scottish Justice of my self-published novel In the Shadows of Calton Hill.

    At the risk of provoking further negative reactions, e.g. from the genetics and forensic science professions, here are some possible suggestions for resolving the immensely socially damaging DNA evidence situation:

1. Since most of the genetic theory that underpins them is both suspect and subject to special assumptions, combined likelihood ratios and misapplications of Bayes theorem should be abandoned altogether. In the case where there are fifteen DNA probes, an exploratory data analysis should be performed on the 15 corresponding (typically non-independent) samples of the allele lengths or their logs, and used to contrast the 15x1 vector X of log allele lengths measured from the trace at the crime scene with the vector Y of the log allele lengths measured from the suspect’s DNA. The quality of the data collection would need to be greatly improved to justify doing this.

2. A Bayesian analysis of the n 15x1 vectors of log-allele lengths should then be employed to obtain a posterior predictive distribution for the vector Z of the 15 log-allele lengths of a randomly chosen individual from the reference population. Various predictive probabilities may then be used to contrast the elements of X and Y, and some criterion would need to be decided upon by the Courts in order to judge whether X and Y are close enough to indicate a convincing match.

3. As an initial suggestion, it might be reasonable to assume that the n vectors of log-allele lengths constitute a random sample from a multivariate normal distribution with unknown and unconstrained mean vector μ and covariance matrix C. If the prior distribution of μ and C is taken to belong to the conjugate multivariate normal/inverted-Wishart family (e.g. [15], p290), then the posterior predictive distribution of Z is generalised multivariate-t. This facilitates technically feasible inferential Bayesian calculations for contrasting the elements of X and Y, and prior information about μ and C can be incorporated if available, e.g. by reference to other samples; see the sketch immediately after this list.
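    As a rough indication of how the conjugate updating in point 3 might be coded, here is a minimal Python sketch that returns the parameters of the generalised multivariate-t posterior predictive distribution for Z; the function and hyperparameter names are my own illustrative choices, and the formulae follow the standard normal/inverted-Wishart conjugate results rather than any particular calculation in [15].

import numpy as np

def predictive_t_parameters(data, mu0, kappa0, nu0, Psi0):
    # Conjugate multivariate normal / inverted-Wishart updating.
    # data: (n, d) array of log-allele length vectors (d = 15 in the text).
    # mu0, kappa0, nu0, Psi0: prior mean vector, prior precision scalar,
    # prior degrees of freedom and prior scale matrix.
    data = np.asarray(data, dtype=float)
    mu0 = np.asarray(mu0, dtype=float)
    Psi0 = np.asarray(Psi0, dtype=float)
    n, d = data.shape
    xbar = data.mean(axis=0)
    S = (data - xbar).T @ (data - xbar)          # sample scatter matrix
    kappa_n = kappa0 + n
    nu_n = nu0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    diff = (xbar - mu0).reshape(-1, 1)
    Psi_n = Psi0 + S + (kappa0 * n / kappa_n) * (diff @ diff.T)
    df = nu_n - d + 1                            # predictive degrees of freedom
    scale = Psi_n * (kappa_n + 1) / (kappa_n * df)
    # Z then has a multivariate-t distribution with df degrees of freedom,
    # location mu_n and scale matrix 'scale'.
    return df, mu_n, scale

A vague but proper prior could, for instance, take kappa0 small, nu0 a little larger than d - 1, and Psi0 proportional to the identity matrix; the resulting predictive distribution could then be used to judge how surprising the crime-scene vector X is relative to the suspect’s vector Y.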

     In the meantime, thousands of potentially innocent people are still being falsely convicted using versions of Bayes Theorem. This was easily the most terrifying misapplication of the Bayesian paradigm of the twentieth century, and it is something which the Bayesian profession should not be proud of.

[In early January 2014, the Bayesian forensics expert Professor David Balding of University College London kindly advised me that many of the statistical problems inherent in the evaluation of DNA evidence are now in the process of being overcome. See, for example, David’s paper ‘Statistical Evaluation of Forensic DNA Profile Evidence’ (with Christopher Steele, Ann. Rev. Stat. Appl., 2014).]

 
 

Author’s Notes: Julian Besag was one of the finest and most hardworking applied Bayesians I have ever known. I learnt lots of statistical ideas from him during drinking parties with our colleagues. On 14th November 1974, during Julian’s sabbatical year in Oxford, Jeff Harrison, always the generous host, plied him with sherry and wine before and after his excellent seminar at the University of Warwick. We were consequently delighted when he bought all eight of us, including Jim Smith and Keith Ord, a pint of beer during a festive dinner that evening in Leamington. He, however, studiously sat there until each of us had bought him a pint back. Feeling replete, he then jumped into his sports car and drove back to Oxford.

    

     Following his similarly impressive seminar in Madison, Wisconsin during the early 1990s, I took Julian and my colleagues for pints of Milwaukee beer in the Essenhaus restaurant, just off the Capitol Square. Julian clearly didn’t like the place. When the German band came on, and played some jolly, though rather camp, music, Julian said, ‘Tom, this is the worst bar that I’ve ever been taken to’. Sometime later, when Julian was living on a houseboat on Puget Sound, he and I talked at length in a much classier bar which overlooked Fisherman’s Wharf in San Francisco. I left around one in the morning, after the remainder of our colleagues had departed in various states of intoxication. However, Julian just sat there, staring soulfully into the night.

 
Julian Ernst Besag F.R.S. (1945-2010)
 

    I left Wisconsin for the University of Edinburgh in August 1995 to be closer to my children in England, but my research career in academia effectively ended with unexpected abruptness during a period of ill-health in the summer of 2000, soon after I had published a couple of books and become involved in Bayesian applications in Geophysics (oil wells and earthquakes) with the eminent geophysicist Ian Main, Kes Heffer of BP, and Orestis Papasouliotis, and while I was still trying to sort out Bayes factors. I retired early in August 2001 at the age of 53, and did not recover my cognition until October 2011.

    Anno Domini 2000 was a very busy year for Bayesians, and it catalysed many wonderful advances during the early part of the 21st century, even though 9/11 affected our philosophies, societies and economies for many years to come. Maybe the Goddess of Statistics was watching over us. Maybe she is the pagan Goddess Fortune.

 
