## is there such a thing as optimal subsampling?

**T**his idea of optimal thinning and burn-in has been around since the early days of the MCMC revolution and has not led to a definite answer. For instance, from a pure estimation perspective, subsampling always increases the variance of the resulting estimator. My personal approach is to ignore both burn-in and thinning and rather waste time on running several copies of the code to check for potential discrepancies and get a crude notion of the variability. And to refuse to answer questions like *is 5000 iterations long enough for burn-in?*
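To make the variance claim concrete, here is a minimal sketch on a toy case where everything is available in closed form: a stationary AR(1) chain with unit marginal variance. The AR(1) model, the function names, and the values of `n`, `rho` and `k` are illustrative assumptions, not taken from any of the papers discussed here. It compares the exact variance of the average over the whole chain with the variance of the average over every k-th value.

```python
import numpy as np

def var_of_mean_ar1(n, rho):
    """Exact variance of the average of n consecutive values of a stationary
    AR(1) chain with unit marginal variance and lag-1 autocorrelation rho."""
    lags = np.arange(1, n)
    # Var(mean) = (1/n^2) * sum_{i,j} rho^|i-j|
    return (n + 2.0 * np.sum((n - lags) * rho ** lags)) / n ** 2

def var_of_thinned_mean_ar1(n, rho, k):
    """Same quantity when only every k-th value is kept: n // k values remain
    (k is assumed to divide n) and their lag-1 autocorrelation is rho ** k."""
    return var_of_mean_ar1(n // k, rho ** k)

n, rho = 10_000, 0.9   # hypothetical chain length and autocorrelation
for k in (1, 2, 5, 10, 50):
    print(k, var_of_thinned_mean_ar1(n, rho, k))
```

In this toy setting the variance grows with the thinning factor `k`, which only illustrates the general statement for one positively correlated chain and does not prove it.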

A recent arXival by Riabiz et al. readdresses the issue. In particular concerning this notion that the variance of the subsampled version is higher: this only applies to a deterministic subsampling, as opposed to an MCMC-based subsampling (although this intricacy only makes the problem harder!). I however fail to understand the argument in favour of subsampling based on storage issues (p.4), as a dynamic storage of the running mean for all quantities of interest does not cost anything if the integrand is not particularly demanding. I also disagree with the pessimistic view that the asymptotic variance of the MCMC estimate is hard to estimate: papers by Flegal, Hobert, Jones, Vats and others have rather clearly shown how batch means can produce converging estimates of this asymptotic variance.
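Both points can be sketched in a few lines. The class and function names below are hypothetical, and the default batch size of about √n is only one common choice, so this is an illustration under those assumptions rather than the estimator of any specific paper. `RunningMean` keeps an average in constant memory, and `batch_means_variance` returns a batch means estimate of the asymptotic variance σ² in the CLT √n (mean − truth) → N(0, σ²).

```python
import numpy as np

class RunningMean:
    """Online average of h(x_t): constant memory whatever the chain length."""
    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean

def batch_means_variance(values, batch_size=None):
    """Batch means estimate of the asymptotic variance of the chain average."""
    values = np.asarray(values, dtype=float)
    n = values.size
    if batch_size is None:
        batch_size = int(np.floor(np.sqrt(n)))   # assumed default, about sqrt(n)
    n_batches = n // batch_size
    trimmed = values[: n_batches * batch_size]   # drop the incomplete last batch
    means = trimmed.reshape(n_batches, batch_size).mean(axis=1)
    # batch size times the sample variance of the batch averages
    return batch_size * means.var(ddof=1)

# Toy usage on an AR(1) chain standing in for MCMC output (illustrative only).
rng = np.random.default_rng(0)
chain = np.empty(20_000)
chain[0] = 0.0
for t in range(1, chain.size):
    chain[t] = 0.5 * chain[t - 1] + rng.normal()

running = RunningMean()
for value in chain:
    running.update(value)
print(running.mean, batch_means_variance(chain))
```

The estimate divided by the chain length gives a crude variance for the running mean, which is the quantity one would otherwise get from running several copies of the code.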

“We do not attempt to solve a continuous optimisation problem for selection of the next point [in the sample]. Such optimisation problems are fundamentally difficult and can at best be approximately solved. Instead, we exactly solve the discrete optimisation problem of selecting a suitable element from a supplied MCMC output.”

One definitely positive aspect of the paper is that the (thinning) method is called Stein thinning, in connection with *Stein’s discrepancy*, and this honours Charles Stein. The method looks at the optimal subsample, with optimality defined in terms of minimising Stein’s discrepancy from the true target over a reproducing kernel Hilbert space. And then over a subsample to minimise the distance from the empirical distribution to the theoretical distribution. The kernel (11) is based on the gradient of the target log density and the solution is determined by greedy algorithms that determine which next entry to add to the empirical distribution. Which is of complexity *O(nm²)* if the subsample is of size *m*. Some entries may appear more than once and the burn-in step could be automatically included as (relatively) unlikely values are never selected (at least this was my heuristic understanding). While the theoretical backup for the construct is present and backed by earlier papers of some of the authors, I do wonder at the use of the most rudimentary representation of an approximation to the target when smoother versions could have been chosen and optimised on the same ground. And I am also surprised at the dependence of both estimators and discrepancies on the choice of the (sort-of) covariance matrix in the inner kernel, as seen in the ODE examples provided in the paper (see, e.g., Figure 7). (As an aside and at a shallow level, the approach also reminded me of the principal points of my late friend Bernhard Flury…) Keeping all MCMC simulations for a later post-processing is of course costly in terms of storage, at O(nm). Unless a “secretary problem” approach can be proposed to make the selection sequential. Another possible alternative would be to consider directly the chain of the accepted values (à la vanilla Rao-Blackwellisation). Overall, since the stopping criterion is based on a fixed sample size, and hence depends on the sub-efficiency of evaluating the mass of different modes, I am unsure the method is anything but what-you-get-is-what-you-see, i.e. prone to get misled by a poor exploration of the complete support of the target.
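For readers who want to see the greedy step, here is a rough one-dimensional sketch, written under explicit assumptions and not to be read as the authors' implementation. It assumes an inverse multiquadric base kernel k(x, y) = (c² + (x − y)²)^(−1/2), where the scalar `c` plays the role of the (sort-of) covariance matrix, plugs it into the Langevin Stein kernel k_P(x, y) = ∂ₓ∂ᵧk + ∂ₓk·s(y) + ∂ᵧk·s(x) + k·s(x)s(y), with s the gradient of the log target, and picks at each step the index minimising k_P(xᵢ, xᵢ)/2 plus the sum of k_P between xᵢ and the points already selected. The function names, the N(0, 1) target and the fake chain are all hypothetical.

```python
import numpy as np

def stein_kernel_row(x, score, j, c=1.0):
    """Values k_P(x_j, x_i) for all i, in dimension one, for the assumed
    inverse multiquadric base kernel k(x, y) = (c^2 + (x - y)^2)^(-1/2)."""
    r = x[j] - x                       # x_j - x_i for every i
    u = c ** 2 + r ** 2
    k = u ** -0.5
    dk_first = -r * u ** -1.5          # derivative of k in its first argument
    dk_second = r * u ** -1.5          # derivative of k in its second argument
    d2k = u ** -1.5 - 3.0 * r ** 2 * u ** -2.5   # mixed second derivative
    return d2k + dk_first * score + dk_second * score[j] + k * score[j] * score

def stein_thinning(x, score, m, c=1.0):
    """Greedy selection of m indices (repeats allowed) from the sample x."""
    x = np.asarray(x, dtype=float)
    score = np.asarray(score, dtype=float)
    diag = c ** -3.0 + score ** 2 / c  # k_P(x_i, x_i) for every candidate
    objective = 0.5 * diag
    selected = []
    for _ in range(m):
        j = int(np.argmin(objective))
        selected.append(j)
        # add the interaction of every candidate with the newly selected point
        objective = objective + stein_kernel_row(x, score, j, c)
    return np.array(selected)

# Toy usage: N(0, 1) target, so the score is -x; the first ten values mimic
# a burn-in transient far from the mode (purely illustrative).
rng = np.random.default_rng(0)
sample = np.concatenate([np.linspace(50.0, 30.0, 10), rng.normal(size=300)])
picked = stein_thinning(sample, -sample, m=20)
print(np.sort(sample[picked]))
```

In this toy run the transient values carry a very large diagonal term k_P(xᵢ, xᵢ), which is the mechanism behind the heuristic that burn-in values are never selected. Changing `c` changes which points get picked, in line with the sensitivity to the inner kernel noted above.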

“This paper focuses on nonuniform subsampling and shows that it is more efficiency than uniform subsampling.”

Two weeks later, Guanyu Hu and Hai Ying Wang arXived their Most Likely Optimal Subsampled Markov Chain Monte Carlo, in what I first thought of as an answer to the above! But both actually have little in common as this second paper considers subsampling on the data, rather than the MCMC output, towards producing scalable algorithms. Building upon Bardenet et al. (2014) and Korattikara et al. (2014). Replacing thus the log-likelihood with a random sub-sampled version and deriving the sample size based on a large deviation inequality. By a Cauchy-Schwarz inequality, the authors find sampling probabilities proportional to the individual log-likelihoods. Which depend on the running value of the MCMC’ed parameters. And thus replaced with the values at a fixed parameter, with cost O(n) but only once, but not so much optimal. (The large deviation inequality therein is only concerned with an approximation to the log-likelihood, without examining the long term impact on the convergence of the approximate Markov chain as this is no longer pseudo-marginal MCMC. For instance, both current and prospective log-likelihoods are re-estimated at each iteration. The paper compares with uniform sampling on toy examples, to demonstrate a smaller estimation error for the statistical problem, rather than convergence to the true posterior.)
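The flavour of nonuniform data subsampling can be sketched as follows, on a toy N(θ, 1) model that is purely an assumption for illustration and with hypothetical function names; this is not the algorithm of Hu and Wang. Data points are drawn with probabilities proportional to the absolute value of their log-likelihood terms at a fixed pilot value `theta0`, computed once in O(n), and each drawn term is reweighted by the inverse of its probability so that the estimate of the full log-likelihood remains unbiased.

```python
import numpy as np

def loglik_terms(data, theta):
    """Individual N(theta, 1) log-likelihood terms (all strictly negative)."""
    return -0.5 * np.log(2.0 * np.pi) - 0.5 * (data - theta) ** 2

def subsampled_loglik(data, theta, probs, r, rng):
    """Unbiased estimate of the full log-likelihood from r points drawn with
    replacement with probabilities probs, each term reweighted by 1 / probs."""
    idx = rng.choice(data.size, size=r, replace=True, p=probs)
    return np.mean(loglik_terms(data[idx], theta) / probs[idx])

def single_draw_variance(data, theta, probs):
    """Exact variance of the estimator above when r = 1."""
    terms = loglik_terms(data, theta)
    return np.sum(terms ** 2 / probs) - terms.sum() ** 2

rng = np.random.default_rng(1)
data = rng.normal(loc=0.3, scale=1.0, size=5_000)
theta0 = data.mean()                            # fixed pilot value, one O(n) pass
weights = np.abs(loglik_terms(data, theta0))
probs = weights / weights.sum()                 # nonuniform probabilities
uniform = np.full(data.size, 1.0 / data.size)   # uniform benchmark

theta = theta0 + 0.1                            # a running value of the parameter
print(single_draw_variance(data, theta, probs),
      single_draw_variance(data, theta, uniform))
```

At `theta0` itself the nonuniform estimate is exact, and its variance only grows as the running parameter moves away from the pilot value, which is one way of reading the remark that fixed-parameter probabilities are cheap but not so much optimal.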

June 12, 2020 at 4:03 am

It could be ignorance on my part, but I have often sensed a dissonance between MCMC-seen-as-sampling and some of the attitudes taken by people clearly skilled in the business of survey sampling. I am most familiar with sampling for ecological and natural sciences as represented by Hankin, Mohr, and Newman’s book, *Sampling Theory*, 2019. But the author who continues to make a deep impression upon me in most of his work is Professor Yves Tillé (https://www.unine.ch/yves.tille), especially in his books like *Sampling Algorithms*. These of course are extensions of the work in model-assisted survey sampling done elsewhere, but Professor Tillé has often expressed definite ideas on things, which I greatly respect and try to understand as best I can. In *Sampling Algorithms* he states, for instance:

Professor Tillé then goes on (on the opening pages of *Sampling Algorithms*) to illustrate with the task of estimating iron production in a country whose iron production is dominated by two large companies. He argues that the two companies are what matter, and the remainder of the industry should be included, but according to a sampling design.

The relevance, as I see it, to MCMC and the like is that sampling ought to serve the estimator being sought. While this, in the technology of MCMC, breaches a point of standardization which makes MCMC attractive, I can see and understand that if the estimator offers a lens with which to see the population through, there might be things it implies about how the sampling is done.

Of course, practical matters attend. If generating the MCMC just took days and a scholar isn’t sure that this particular estimator is the only one of interest, it may serve to pick the sample as the setup for using a broader class of estimators, and supporting the wider field.

June 13, 2020 at 7:11 am

This is a very good point (and not only for citing Yves whom I came to know when we were together at CREST). There are many analogues to unbalanced survey sampling in simulation, from tempering to importance sampling, SMC, and to Wang-Landau for MCMC.