# David B. Dunson - Duke University

## Contact Details

- Name: David B. Dunson
- Affiliation: Duke University
- City: Cedar Rapids
- Country: United States

## Pub Categories

- Statistics - Methodology (24)
- Mathematics - Statistics (13)
- Statistics - Theory (13)
- Statistics - Computation (12)
- Statistics - Machine Learning (11)
- Statistics - Applications (10)
- Computer Science - Learning (5)
- Computer Science - Distributed, Parallel, and Cluster Computing (2)
- Physics - Computational Physics (1)
- Quantitative Biology - Genomics (1)
- Computer Science - Computational Complexity (1)

## Publications Authored By David B. Dunson

There has been substantial recent interest in record linkage, attempting to group the records pertaining to the same entities from a large database lacking unique identifiers. This can be viewed as a type of "microclustering," with few observations per cluster and a very large number of clusters. A variety of methods have been proposed, but there is a lack of literature providing theoretical guarantees on performance. Read More

Data augmentation is a common technique for building tuning-free Markov chain Monte Carlo algorithms. Although these algorithms are very popular, autocorrelations are often high in large samples, leading to poor computational efficiency. This phenomenon has been attributed to a discrepancy between Gibbs step sizes and the rate of posterior concentration. Read More

Non-linear latent variable models have become increasingly popular in a variety of applications. However, there has been little study on theoretical properties of these models. In this article, we study rates of posterior contraction in univariate density estimation for a class of non-linear latent variable models where unobserved U(0,1) latent variables are related to the response variables via a random non-linear regression with an additive error. Read More

There is considerable interest in studying how the distribution of an outcome varies with a predictor. We are motivated by environmental applications in which the predictor is the dose of an exposure and the response is a health outcome. A fundamental focus in these studies is inference on dose levels associated with a particular increase in risk relative to a baseline. Read More

In studying structural inter-connections in the human brain, it is common to first estimate fiber bundles connecting different regions of the brain relying on diffusion MRI. These fiber bundles act as highways for neural activity and communication, snaking through the brain and connecting different regions. Current statistical methods for analyzing these fibers reduce the rich information into an adjacency matrix, with the elements containing a count of the number of fibers or a mean diffusion feature (such as fractional anisotropy) along the fibers. Read More

Studying the neurological, genetic and evolutionary basis of human vocal communication mechanisms using animal vocalization models is an important field of neuroscience. The data sets typically comprise structured sequences of syllables or 'songs' produced by animals from different genotypes under different social contexts. We develop a novel Bayesian semiparametric framework for inference in such data sets. Read More

Variational inference (VI) provides fast approximations of a Bayesian posterior in part because it formulates posterior approximation as an optimization problem: to find the closest distribution to the exact posterior over some family of distributions. For practical reasons, the family of distributions in VI is usually constrained so that it does not include the exact posterior, even as a limit point. Thus, no matter how long VI is run, the resulting approximation will not approach the exact posterior. Read More

The Bayesian paradigm provides a natural way to deal with uncertainty in model selection through assigning each model in a list of models under consideration a posterior probability. Unfortunately, this framework relies on the assumption that one of the models in the list is the true model. When this assumption is violated and all the models are imperfect, interpretation of posterior model probabilities is unclear. Read More

We study full Bayesian procedures for sparse linear regression when errors have a symmetric but otherwise unknown distribution. The unknown error distribution is endowed with a symmetrized Dirichlet process mixture of Gaussians. For the prior on regression coefficients, a mixture of point masses at zero and continuous distributions is considered. Read More

There is increasing interest in learning how human brain networks vary as a function of a continuous trait, but flexible and efficient procedures to accomplish this goal are limited. We develop a Bayesian semiparametric model, which combines low-rank factorizations and flexible Gaussian process priors to learn changes in the conditional expectation of a network-valued random variable across the values of a continuous predictor, while including subject-specific random effects. The formulation leads to a general framework for inference on changes in brain network structures across human traits, facilitating borrowing of information and coherently characterizing uncertainty. Read More

Two-component mixture priors provide a traditional way to induce sparsity in high-dimensional Bayes models. However, several aspects of such a prior, including computational complexities in high dimensions, interpretation of exact zeros and non-sparse posterior summaries under standard loss functions, have motivated an amazing variety of continuous shrinkage priors, which can be expressed as global-local scale mixtures of Gaussians. Interestingly, we demonstrate that many commonly used shrinkage priors, including the Bayesian Lasso, do not have adequate posterior concentration in high-dimensional settings. Read More
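For orientation, a global-local scale mixture of Gaussians is commonly written in the generic form below. The symbols $\beta_j$, $\psi_j$, $\tau$, $f$ and $g$ are notation assumed here for illustration; they do not appear in the abstract.

$$
\beta_j \mid \psi_j, \tau \sim \mathrm{N}(0, \psi_j \tau), \qquad \psi_j \sim f, \qquad \tau \sim g, \qquad j = 1, \dots, p.
$$

Here $\tau$ is a global scale shared by all coefficients and $\psi_j$ is a coefficient-specific local scale. The Bayesian Lasso fits this template with exponentially distributed local variances $\psi_j$, which yields Laplace marginals for $\beta_j$.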

Many modern applications collect large sample size and highly imbalanced categorical data, with some categories being relatively rare. Bayesian hierarchical models are well motivated in such settings in providing an approach to borrow information to combat data sparsity, while quantifying uncertainty in estimation. However, a fundamental problem is scaling up posterior computation to massive sample sizes. Read More

There is a lack of simple and scalable algorithms for uncertainty quantification. Bayesian methods quantify uncertainty through posterior and predictive distributions, but it is difficult to rapidly estimate summaries of these distributions, such as quantiles and intervals. Variational Bayes approximations are widely used, but may badly underestimate posterior covariance. Read More

High-throughput genetic and epigenetic data are often screened for associations with an observed phenotype. For example, one may wish to test hundreds of thousands of genetic variants, or DNA methylation sites, for an association with disease status. These genomic variables can naturally be grouped by the gene they encode, among other criteria. Read More

Hamiltonian Monte Carlo (HMC) has become routinely used for sampling from posterior distributions. Its extension Riemann manifold HMC (RMHMC) modifies the proposal kernel through distortion of local distances by a Riemannian metric. The performance depends critically on the choice of metric, with the Fisher information providing the standard choice. Read More

Hybrid Monte Carlo (HMC) generates samples from a prescribed probability distribution in a configuration space by simulating Hamiltonian dynamics, followed by the Metropolis (-Hastings) acceptance/rejection step. Compressible HMC (CHMC) generalizes HMC to a situation in which the dynamics is reversible but not necessarily Hamiltonian. This article presents a framework to further extend the algorithm. Read More
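As a point of reference for the baseline algorithm described above, the following is a minimal sketch of standard HMC only (not CHMC or the extension the article presents). It assumes a one-dimensional target, unit mass, and a leapfrog integrator; the function names, step size, trajectory length and the standard normal toy target are all illustrative choices, not taken from the article.

```python
import math
import random


def leapfrog(q, p, grad_u, step, n_steps):
    """Approximate Hamiltonian dynamics with the leapfrog scheme (unit mass)."""
    p -= 0.5 * step * grad_u(q)        # initial half step for the momentum
    for i in range(n_steps):
        q += step * p                  # full step for the position
        if i < n_steps - 1:
            p -= step * grad_u(q)      # full momentum step between position updates
    p -= 0.5 * step * grad_u(q)        # final half step for the momentum
    return q, p


def hmc_sample(u, grad_u, q0, n_samples, step=0.2, n_steps=10, seed=0):
    """Draw samples from the density proportional to exp(-u(q))."""
    rng = random.Random(seed)
    q = q0
    draws = []
    for _ in range(n_samples):
        p = rng.gauss(0.0, 1.0)        # resample the momentum
        q_new, p_new = leapfrog(q, p, grad_u, step, n_steps)
        h_old = u(q) + 0.5 * p * p
        h_new = u(q_new) + 0.5 * p_new * p_new
        # Metropolis step: accept with probability min(1, exp(h_old - h_new)).
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            q = q_new
        draws.append(q)
    return draws


# Toy target (an assumption for illustration): standard normal, U(q) = q^2 / 2.
def potential(q):
    return 0.5 * q * q


def grad_potential(q):
    return q


samples = hmc_sample(potential, grad_potential, q0=0.0, n_samples=5000)
```

The acceptance step corrects for the discretization error of the integrator, so the chain targets the prescribed distribution even though the simulated dynamics are only approximate.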

We consider the problem of shape restricted nonparametric regression on a closed set $X \subset \mathbb{R}$, where it is reasonable to assume the function has no more than $H$ local extrema interior to $X$. Following a Bayesian approach, we develop a nonparametric prior over a novel class of local extrema splines. This approach is shown to be consistent when modeling any continuously differentiable function within the class of functions considered, and is used to develop methods for hypothesis testing on the shape of the curve. Sampling algorithms are developed, and the method is applied in simulation studies and data examples where the shape of the curve is of interest. Read More

We develop a generalized method of moments (GMM) approach for fast parameter estimation in a new class of Dirichlet latent variable models with mixed data types. Parameter estimation via GMM has been demonstrated to have computational and statistical advantages over alternative methods, such as expectation maximization, variational inference, and Markov chain Monte Carlo. The key computational advantage of our method (MELD) is that parameter estimation does not require instantiation of the latent variables. Read More

Fitting statistical models is computationally challenging when the sample size or the dimension of the dataset is huge. An attractive approach for down-scaling the problem size is to first partition the dataset into subsets and then fit using distributed algorithms. The dataset can be partitioned either horizontally (in the sample space) or vertically (in the feature space). Read More

Hamiltonian Monte Carlo (HMC) and related algorithms have become routinely used in Bayesian computation with their utilities highlighted by the probabilistic programming software packages Stan and PyMC. In this article, we present a simple and provably accurate method to improve the efficiency of HMC and related algorithms with essentially no extra computational cost. This is achieved by recycling the intermediate leap-frog steps used in approximating the trajectories of Hamiltonian dynamics. Read More

Asymptotic theory of tail index estimation has been studied extensively in the frequentist literature on extreme values, but rarely in the Bayesian context. We investigate whether popular Bayesian kernel mixture models are able to support heavy tailed distributions and consistently estimate the tail index. We show that posterior inconsistency in tail index is surprisingly common for both parametric and nonparametric mixture models. Read More

There is growing interest in understanding how the structural interconnections among brain regions change with the occurrence of neurological diseases. Diffusion weighted MRI imaging has allowed researchers to non-invasively estimate a network of structural cortical connections made by white matter tracts, but current statistical methods for relating such networks to the presence or absence of a disease cannot exploit this rich network information. Standard practice considers each edge independently or summarizes the network with a few simple features. Read More

In a variety of application areas, there is a growing interest in analyzing high dimensional sparse count data, with sparsity exhibited by an over-abundance of zeros and small non-zero counts. Existing approaches for analyzing multivariate count data via Poisson or negative binomial log-linear hierarchical models with zero-inflation cannot flexibly adapt to the level and nature of sparsity in the data. We develop a new class of continuous local-global shrinkage priors tailored for sparse counts. Read More

Complex network data problems are increasingly common in many fields of application. Our motivation is drawn from strategic marketing studies monitoring customer choices of specific products, along with co-subscription networks encoding multiple purchasing behavior. Data are available for several agencies within the same insurance company, and our goal is to efficiently exploit co-subscription networks to inform targeted advertising of cross-sell strategies to currently mono-product customers. Read More

This article proposes a Bayesian approach to regression with a scalar response against vector and tensor covariates. Tensor covariates are commonly vectorized prior to analysis, failing to exploit the structure of the tensor, and resulting in poor estimation and predictive performance. We develop a novel class of multiway shrinkage priors for the coefficients in tensor regression models. Read More

We propose a novel approach, WASP, for Bayesian inference when the massive size of the data prohibits posterior computation. WASP is estimated in three steps. First, data are divided into smaller computationally tractable subsets. Read More

The Markov Chain Monte Carlo method is the dominant paradigm for posterior computation in Bayesian analysis. It has long been common to control computation time by making approximations to the Markov transition kernel. Comparatively little attention has been paid to convergence and estimation error in these approximating Markov Chains. Read More

We propose an extrinsic regression framework for modeling data with manifold valued responses and Euclidean predictors. Regression with manifold responses has wide applications in shape analysis, neuroscience, medical imaging and many other areas. Our approach embeds the manifold where the responses lie onto a higher dimensional Euclidean space, obtains a local regression estimate in that space, and then projects this estimate back onto the image of the manifold. Read More
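The embed, estimate, project recipe described above can be sketched on a toy example. This sketch assumes the responses lie on the unit sphere in three dimensions (so the embedding is the inclusion map and the projection is normalization) and uses a Gaussian-kernel weighted average as the local regression estimate. The function names, the bandwidth and the simulated data are hypothetical choices for illustration, not the paper's implementation.

```python
import math


def embed(y):
    """Embed a point of the unit sphere into R^3 (here simply the inclusion map)."""
    return list(y)


def project(v):
    """Project a point of R^3 back onto the unit sphere by normalizing it."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        raise ValueError("projection is undefined at the origin")
    return [c / norm for c in v]


def extrinsic_regression(x_train, y_train, x0, bandwidth=0.3):
    """Kernel-weighted mean of the embedded responses, projected onto the sphere."""
    weights = [math.exp(-0.5 * ((x - x0) / bandwidth) ** 2) for x in x_train]
    total = sum(weights)
    embedded = [embed(y) for y in y_train]
    # Local regression estimate in the ambient Euclidean space.
    local_mean = [
        sum(w * e[k] for w, e in zip(weights, embedded)) / total
        for k in range(3)
    ]
    # The Euclidean estimate generally lies off the manifold, so project it back.
    return project(local_mean)


# Toy data (an assumption): noiseless responses moving along a great circle.
x_train = [0.1 * i for i in range(32)]
y_train = [(math.cos(x), math.sin(x), 0.0) for x in x_train]
estimate = extrinsic_regression(x_train, y_train, x0=1.0)
```

The weighted Euclidean mean falls slightly inside the sphere, and the final normalization returns a valid point on the manifold.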

In many contexts, there is interest in selecting the most important variables from a very large collection, commonly referred to as support recovery or variable, feature or subset selection. There is an enormous literature proposing a rich variety of algorithms. In scientific applications, it is of crucial importance to quantify uncertainty in variable selection, providing measures of statistical significance for each variable. Read More

We consider the problem of flexible modeling of higher order Markov chains when an upper bound on the order of the chain is known but the true order and nature of the serial dependence are unknown. We propose Bayesian nonparametric methodology based on conditional tensor factorizations, which can characterize any transition probability with a specified maximal order. The methodology selects the important lags and captures higher order interactions among the lags, while also facilitating calculation of Bayes factors for a variety of hypotheses of interest. Read More

The standard approach to Bayesian inference is based on the assumption that the distribution of the data belongs to the chosen model class. However, even a small violation of this assumption can have a large impact on the outcome of a Bayesian procedure. We introduce a simple, coherent approach to Bayesian inference that improves robustness to perturbations from the model: rather than condition on the data exactly, one conditions on a neighborhood of the empirical distribution. Read More

We utilize copulas to constitute a unified framework for constructing and optimizing variational proposals in hierarchical Bayesian models. For models with continuous and non-Gaussian hidden variables, we propose a semiparametric and automated variational Gaussian copula approach, in which the parametric Gaussian copula family is able to preserve multivariate posterior dependence, and the nonparametric transformations based on Bernstein polynomials provide ample flexibility in characterizing the univariate marginal posteriors. Read More

Learning of low dimensional structure in multidimensional data is a canonical problem in machine learning. One common approach is to suppose that the observed data are close to a lower-dimensional smooth manifold. There are a rich variety of manifold learning methods available, which allow mapping of data points to the manifold. Read More

The modern scale of data has brought new challenges to Bayesian inference. In particular, conventional MCMC algorithms are computationally very expensive for large data sets. A promising approach to solve this problem is embarrassingly parallel MCMC (EP-MCMC), which first partitions the data into multiple subsets and runs independent sampling algorithms on each subset. Read More
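The partition-then-sample idea can be illustrated with a toy sketch. It assumes Gaussian data with known unit variance, a flat prior on the mean, and a random-walk Metropolis sampler on each subset. The abstract is cut off before any combination step is described, so the precision-weighted Gaussian combination used here is an assumption standing in for whatever rule the paper adopts; all function names and tuning values are hypothetical.

```python
import math
import random


def metropolis_mean(data, n_iter=4000, burn_in=500, step=0.3, seed=0):
    """Random-walk Metropolis for the mean of N(mu, 1) data under a flat prior."""
    rng = random.Random(seed)

    def log_post(mu):
        return -0.5 * sum((y - mu) ** 2 for y in data)

    mu = sum(data) / len(data)
    current = log_post(mu)
    draws = []
    for i in range(n_iter):
        proposal = mu + rng.gauss(0.0, step)
        candidate = log_post(proposal)
        if rng.random() < math.exp(min(0.0, candidate - current)):
            mu, current = proposal, candidate
        if i >= burn_in:
            draws.append(mu)
    return draws


def combine_gaussian(sub_draws):
    """Precision-weighted combination of sub-posterior draws (assumed rule)."""
    precisions, weighted_means = [], []
    for draws in sub_draws:
        m = sum(draws) / len(draws)
        v = sum((d - m) ** 2 for d in draws) / (len(draws) - 1)
        precisions.append(1.0 / v)
        weighted_means.append(m / v)
    total_precision = sum(precisions)
    return sum(weighted_means) / total_precision, 1.0 / total_precision


# Simulated data (an assumption): 200 draws from N(2, 1), split into 4 subsets.
data_rng = random.Random(1)
data = [data_rng.gauss(2.0, 1.0) for _ in range(200)]
subsets = [data[k::4] for k in range(4)]

# Each subset is sampled independently; in practice these would run in parallel.
sub_draws = [metropolis_mean(s, seed=k) for k, s in enumerate(subsets)]
combined_mean, combined_var = combine_gaussian(sub_draws)
```

With a flat prior the product of the sub-posteriors equals the full-data posterior, which is why the combined mean and variance in this toy case should land close to the sample mean and 1/200.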

Ordinary least squares (OLS) is the default method for fitting linear models, but is not applicable for problems with dimensionality larger than the sample size. For these problems, we advocate the use of a generalized version of OLS motivated by ridge regression, and propose two novel three-step algorithms involving least squares fitting and hard thresholding. The algorithms are methodologically simple to understand intuitively, computationally easy to implement efficiently, and theoretically appealing for choosing models consistently. Read More

Our focus is on realistically modeling and forecasting dynamic networks of face-to-face contacts among individuals. Important aspects of such data that lead to problems with current methods include the tendency of the contacts to move between periods of slow and rapid changes, and the dynamic heterogeneity in the actors' connectivity behaviors. Motivated by this application, we develop a novel method for Locally Adaptive DYnamic (LADY) network inference. Read More

The major goal of this paper is to study the second order frequentist properties of the marginal posterior distribution of the parametric component in semiparametric Bayesian models, in particular, a second order semiparametric Bernstein-von Mises (BvM) Theorem. Our first contribution is to discover an interesting interference phenomenon between Bayesian estimation and frequentist inferential accuracy: more accurate Bayesian estimation on the nuisance function leads to higher frequentist inferential accuracy on the parametric component. As the second contribution, we propose a new class of dependent priors under which Bayesian inference procedures for the parametric component are not only efficient but also adaptive (w. Read More

Variable screening is a fast dimension reduction technique for assisting high dimensional feature selection. As a preselection method, it selects a moderate size subset of candidate variables for further refining via feature selection to produce the final model. The performance of variable screening depends on both computational efficiency and the ability to dramatically reduce the number of variables without discarding the important ones. Read More

In cargo logistics, a key performance measure is transport risk, defined as the deviation of the actual arrival time from the planned arrival time. Neither earliness nor tardiness is desirable for customers and freight forwarders. In this paper, we investigate ways to assess and forecast transport risks using a half-year of air cargo data, provided by a leading forwarder on 1336 routes served by 20 airlines. Read More

Network data are increasingly collected along with other variables of interest. Our motivation is drawn from neurophysiology studies measuring brain connectivity networks for a sample of individuals along with their membership to a low or high creative reasoning group. It is of paramount importance to develop statistical methods for testing of global and local changes in the structural interconnections among brain regions across groups. Read More

We discuss functional clustering procedures for nested designs, where multiple curves are collected for each subject in the study. We start by considering the application of standard functional clustering tools to this problem, which leads to groupings based on the average profile for each subject. After discussing some of the shortcomings of this approach, we present a mixture model based on a generalization of the nested Dirichlet process that clusters subjects based on the distribution of their curves. Read More

Graphical models express conditional independence relationships among variables. Although methods for vector-valued data are well established, functional data graphical models remain underdeveloped. We introduce a notion of conditional independence between random functions, and construct a framework for Bayesian inference of undirected, decomposable graphs in the multivariate functional data context. Read More

Although Bayesian density estimation using discrete mixtures has good performance in modest dimensions, there is a lack of statistical and computational scalability to high-dimensional multivariate cases. To combat the curse of dimensionality, it is necessary to assume the data are concentrated near a lower-dimensional subspace. However, Bayesian methods for learning this subspace along with the density of the data scale poorly computationally. Read More

For massive data sets, efficient computation commonly relies on distributed algorithms that store and process subsets of the data on different machines, minimizing communication costs. Our focus is on regression and classification problems involving many features. A variety of distributed algorithms have been proposed in this context, but challenges arise in defining an algorithm with low communication, theoretical guarantees and excellent practical performance in general settings. Read More

Our focus is on constructing a multiscale nonparametric prior for densities. The Bayes density estimation literature is dominated by single scale methods, with the exception of Polya trees, which favor overly-spiky densities even when the truth is smooth. We propose a multiscale Bernstein polynomial family of priors, which produce smooth realizations that do not rely on hard partitioning of the support. Read More

Bayesian sparse factor models have proven useful for characterizing dependence, but scaling computation to high dimensions is problematic. We propose expandable factor analysis for scalable estimation. The method relies on a novel multiscale generalized double Pareto shrinkage prior that allows efficient estimation of low-rank and sparse loadings matrices through weighted $\ell_1$-regularized regression. Read More

Replicated network data are increasingly available in many research fields. In connectomic applications, inter-connections among brain regions are collected for each patient under study, motivating statistical models which can flexibly characterize the probabilistic generative mechanism underlying these network-valued data. Available models for a single network are not designed specifically for inference on the entire probability mass function of a network-valued random variable and therefore lack flexibility in characterizing the distribution of relevant topological structures. Read More

We present a data augmentation scheme to perform Markov chain Monte Carlo inference for models where data generation involves a rejection sampling algorithm. Our idea, which seems to be missing in the literature, is a simple scheme to instantiate the rejected proposals preceding each data point. The resulting joint probability over observed and rejected variables can be much simpler than the marginal distribution over the observed variables, which often involves intractable integrals. Read More

Nonparametric regression for massive numbers of samples (n) and features (p) is an increasingly important problem. In big n settings, a common strategy is to partition the feature space, and then separately apply simple models to each partition set. We propose an alternative approach, which avoids such partitioning and the associated sensitivity to neighborhood choice and distance metrics, by using random compression combined with Gaussian process regression. Read More

In broad applications, it is routinely of interest to assess whether there is evidence in the data to refute the assumption of conditional independence of $Y$ and $X$ conditionally on $Z$. Such tests are well developed in parametric models but are not straightforward in the nonparametric case. We propose a general Bayesian approach, which relies on an encompassing nonparametric Bayes model for the joint distribution of $Y$, $X$ and $Z$. Read More