Statistics - Computation Publications (50)



In this article we consider static Bayesian parameter estimation for partially observed diffusions that are discretely observed. We work under the assumption that one must resort to discretizing the underlying diffusion process, for instance using the Euler-Maruyama method. Given this assumption, we show how one can use Markov chain Monte Carlo (MCMC) and particularly particle MCMC [Andrieu, C. Read More
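The Euler-Maruyama discretization mentioned above is easy to sketch. The snippet below is our own minimal illustration (an Ornstein-Uhlenbeck process as a stand-in example; all names are hypothetical, and this is not the paper's sampler): it simulates one path of dX = a(X) dt + b(X) dW on a fixed grid.

```python
import math
import random

def euler_maruyama(drift, diffusion, x0, dt, n_steps, rng):
    """Simulate one path of dX = drift(X) dt + diffusion(X) dW
    with the Euler-Maruyama scheme on a grid of step dt."""
    x = x0
    path = [x]
    for _ in range(n_steps):
        dw = rng.gauss(0.0, math.sqrt(dt))   # Brownian increment ~ N(0, dt)
        x = x + drift(x) * dt + diffusion(x) * dw
        path.append(x)
    return path

# Ornstein-Uhlenbeck example: dX = -0.5 X dt + 0.3 dW
rng = random.Random(1)
path = euler_maruyama(lambda x: -0.5 * x, lambda x: 0.3,
                      x0=1.0, dt=0.01, n_steps=1000, rng=rng)
```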

Although applications of Bayesian analysis to numerical quadrature problems have been considered before, it is only very recently that statisticians have focused on the connections between statistics and the numerical analysis of differential equations. In line with this recent trend, we show how certain commonly used finite difference schemes for the numerical solution of ordinary and partial differential equations can be considered in a regression setting. Focusing on this regression framework, we apply a simple Bayesian strategy to obtain confidence intervals for the finite difference solutions. Read More
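To make the finite-difference-as-regression connection concrete, here is a toy illustration of our own (not the paper's construction): a least-squares quadratic fit through three equally spaced points has a second derivative at the centre that reproduces the classical central-difference weights (1, -2, 1)/h².

```python
import numpy as np

# Fit a quadratic through three equally spaced points by least squares;
# with 3 points and degree 2 the fit interpolates exactly, and its
# curvature equals the central-difference second-derivative estimate.
h = 0.1
x = np.array([-h, 0.0, h])
u = np.exp(x)                      # sample function values; exp'' = exp
coeffs = np.polyfit(x, u, 2)       # highest-degree coefficient first
second_deriv = 2.0 * coeffs[0]     # d^2/dx^2 of a*x^2 + b*x + c is 2a
central_diff = (u[0] - 2.0 * u[1] + u[2]) / h**2
```

Both quantities approximate exp''(0) = 1 with the usual O(h²) error.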

We introduce Fisher consistency in the sense of unbiasedness as a criterion to distinguish potentially suitable and unsuitable estimators of prior class probabilities in test datasets under prior probability and more general dataset shift. The usefulness of this unbiasedness concept is demonstrated with three examples of classifiers used for quantification: Adjusted Classify & Count, EM-algorithm and CDE-Iterate. We find that Adjusted Classify & Count and EM-algorithm are Fisher consistent. Read More
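The Adjusted Classify & Count estimator mentioned above admits a short sketch (the standard textbook formulation, our own illustration, not this paper's code): under prior probability shift the positive-labelling rate satisfies q = p·tpr + (1 - p)·fpr, and the estimator inverts this relation for the prevalence p.

```python
def adjusted_classify_and_count(q, tpr, fpr):
    """Adjusted Classify & Count estimate of positive-class prevalence.

    q   : fraction of test items the classifier labels positive
    tpr : true positive rate, estimated on labelled training data
    fpr : false positive rate, estimated on labelled training data
    """
    if tpr == fpr:
        raise ValueError("uninformative classifier (tpr == fpr)")
    p = (q - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, p))   # clip to the valid probability range

# With true prevalence p the expected positive-labelling rate is
# q = p * tpr + (1 - p) * fpr, so the estimator simply inverts this.
p_true, tpr, fpr = 0.3, 0.8, 0.1
q = p_true * tpr + (1 - p_true) * fpr
p_hat = adjusted_classify_and_count(q, tpr, fpr)
```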

The HYGARCH process is a commonly used long-memory process for modeling long-range dependence in volatility. Financial time series are characterized by transitions between phases of different volatility levels. The smooth transition HYGARCH (ST-HYGARCH) model is proposed to model this time-varying structure with the long memory property. Read More

Affiliations: 1Harvard University, 2Harvard University, 3University of Bristol, 4Université Paris-Dauphine PSL and University of Warwick

In purely generative models, one can simulate data given parameters but not necessarily evaluate the likelihood. We use Wasserstein distances between empirical distributions of observed data and empirical distributions of synthetic data drawn from such models to estimate their parameters. Previous interest in the Wasserstein distance for statistical inference has been mainly theoretical, due to computational limitations. Read More
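On the real line, the Wasserstein distance between two equal-size empirical distributions reduces to matching sorted samples, which makes a tiny self-contained sketch possible (our own illustration of the distance itself, not the paper's estimation procedure):

```python
def wasserstein_1d(xs, ys, p=1):
    """p-Wasserstein distance between two equal-size empirical
    distributions on the real line: the optimal coupling simply
    matches the i-th smallest point of one sample to the i-th
    smallest of the other."""
    assert len(xs) == len(ys)
    xs, ys = sorted(xs), sorted(ys)
    return (sum(abs(a - b) ** p for a, b in zip(xs, ys)) / len(xs)) ** (1.0 / p)

# Shifting every point by a constant c moves the empirical
# distribution exactly |c| in 1-Wasserstein distance.
d = wasserstein_1d([0.0, 1.0, 2.0], [3.0, 4.0, 5.0])
```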

We develop a constructive approach to estimating sparse, high-dimensional linear regression models. The approach is a computational algorithm motivated from the KKT conditions for the $\ell_0$-penalized least squares solutions. It generates a sequence of solutions iteratively, based on support detection using primal and dual information and root finding. Read More
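The paper's algorithm combines primal and dual information with root finding; as a hedged stand-in for the general idea of iterative support detection under an $\ell_0$ constraint, here is plain iterative hard thresholding, a simpler relative (our own sketch, not the authors' method):

```python
import numpy as np

def iht(X, y, k, step=None, n_iter=200):
    """Iterative hard thresholding for k-sparse least squares:
    a gradient step on 0.5 * ||y - X beta||^2, then keep only the
    k largest-magnitude coefficients (the detected support)."""
    n, p = X.shape
    if step is None:
        step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / Lipschitz constant
    beta = np.zeros(p)
    for _ in range(n_iter):
        z = beta - step * (X.T @ (X @ beta - y))  # gradient step
        keep = np.argsort(np.abs(z))[-k:]         # support detection
        beta = np.zeros(p)
        beta[keep] = z[keep]
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
beta_true = np.zeros(20)
beta_true[[2, 7, 11]] = [3.0, -2.0, 1.5]
y = X @ beta_true + 0.01 * rng.standard_normal(100)
beta_hat = iht(X, y, k=3)
```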

For a given target density, there exist an infinite number of diffusion processes which are ergodic with respect to this density. As observed in a number of papers, samplers based on nonreversible diffusion processes can significantly outperform their reversible counterparts both in terms of asymptotic variance and rate of convergence to equilibrium. In this paper, we take advantage of this in order to construct efficient sampling algorithms based on the Lie-Trotter decomposition of a nonreversible diffusion process into reversible and nonreversible components. Read More

Piecewise deterministic Monte Carlo methods (PDMC) consist of a class of continuous-time Markov chain Monte Carlo methods (MCMC) which have recently been shown to hold considerable promise. Being non-reversible, the mixing properties of PDMC methods often significantly outperform classical reversible MCMC competitors. Moreover, in a Bayesian context they can use sub-sampling ideas, so that they need only access one data point per iteration, whilst still maintaining the true posterior distribution as their invariant distribution. Read More

We present a new inference method based on approximate Bayesian computation for estimating parameters governing an entire network based on link-traced samples of that network. To do this, we first take summary statistics from an observed link-traced network sample, such as a recruitment network of subjects in a hard-to-reach population. Then we assume prior distributions, such as multivariate uniform, for the distribution of some parameters governing the structure of the network and behaviour of its nodes. Read More

We propose Edward, a Turing-complete probabilistic programming language. Edward builds on two compositional representations---random variables and inference. By treating inference as a first class citizen, on a par with modeling, we show that probabilistic programming can be as flexible and computationally efficient as traditional deep learning. Read More

Extensions in the field of joint modeling of correlated data and dynamic predictions improve the development of prognosis research. The R package frailtypack provides estimations of various joint models for longitudinal data and survival events. In particular, it fits models for recurrent events and a terminal event (frailtyPenal), models for two survival outcomes for clustered data (frailtyPenal), models for two types of recurrent events and a terminal event (multivPenal), models for a longitudinal biomarker and a terminal event (longiPenal) and models for a longitudinal biomarker, recurrent events and a terminal event (trivPenal). Read More

Recombinant binomial trees are binary trees where each non-leaf node has two child nodes, but adjacent parents share a common child node. Such trees arise in finance when pricing an option. For example, valuation of a European option can be carried out by evaluating the expected value of asset payoffs with respect to random paths in the tree. Read More
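Because adjacent parents share a child, depth i of a recombinant tree has only i + 1 distinct nodes, so backward induction costs O(n²) rather than O(2ⁿ). A minimal Cox-Ross-Rubinstein-style pricing sketch (our own illustration with hypothetical parameter names, not this paper's code):

```python
import math

def european_call_binomial(s0, strike, rate, sigma, maturity, n):
    """Price a European call on a recombinant (CRR) binomial tree."""
    dt = maturity / n
    u = math.exp(sigma * math.sqrt(dt))       # up factor
    d = 1.0 / u                               # down factor
    q = (math.exp(rate * dt) - d) / (u - d)   # risk-neutral up-probability
    disc = math.exp(-rate * dt)
    # payoffs at the n + 1 distinct leaf nodes (j = number of up-moves)
    values = [max(s0 * u**j * d**(n - j) - strike, 0.0) for j in range(n + 1)]
    # backward induction: node j's children are nodes j (down) and j+1 (up)
    for i in range(n, 0, -1):
        values = [disc * (q * values[j + 1] + (1 - q) * values[j])
                  for j in range(i)]
    return values[0]

price = european_call_binomial(100.0, 100.0, rate=0.05, sigma=0.2,
                               maturity=1.0, n=500)
```

With these inputs the tree price should be close to the Black-Scholes value of roughly 10.45.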

We discuss how the kernel convolution approach can be used to accurately approximate the spatial covariance model on a sphere using spherical distances between points. A detailed derivation of the required formulas is provided. The proposed covariance model approximation can be used for non-stationary spatial prediction and simulation in the case when the dataset is large and the covariance model can be estimated separately in the data subsets. Read More

Many clustering algorithms for data that are curves or functions have recently been proposed. However, the presence of contamination in the sample of curves can influence the performance of most of them. In this work we propose a robust, model-based clustering method based on an approximation to the "density function" for functional data. Read More

Next generation sequencing allows the identification of genes consisting of differentially expressed transcripts, a term which usually refers to changes in the overall expression level. A specific type of differential expression is differential transcript usage (DTU) and targets changes in the relative within gene expression of a transcript. The contribution of this paper is to: (a) extend the use of cjBitSeq to the DTU context, a previously introduced Bayesian model which is originally designed for identifying changes in overall expression levels and (b) propose a Bayesian version of DRIMSeq, a frequentist model for inferring DTU. Read More

There is an increasing focus in several fields on learning how the distribution of an outcome changes with a set of covariates. Bayesian nonparametric dependent mixture models provide a useful approach to flexibly address this goal, however many representations are characterized by difficult interpretation and intractable computational methods. Motivated by these issues, we describe a flexible formulation for Bayesian density regression, which is defined as a potentially infinite mixture model whose probability weights change with the covariates via a stick-breaking construction relying on a set of logistic regressions. Read More

New directions in research on master equations are showcased by example. Magnus expansions, time-varying rates, and pseudospectra are highlighted. Exact eigenvalues are found and contrasted with the large errors produced by standard numerical methods in some cases. Read More

We present a novel method for variable selection in regression models when covariates are measured with error. The iterative algorithm we propose, MEBoost, follows a path defined by estimating equations that correct for covariate measurement error. Via simulation, we evaluate our method and compare its performance to the recently proposed Convex Conditioned Lasso (CoCoLasso) and to the "naive" Lasso, which does not correct for measurement error. Read More

In many real applications of statistical learning, a decision made from misclassification can be too costly to afford; in this case, a reject option, which defers the decision until further investigation is conducted, is often preferred. In recent years, there has been much development for binary classification with a reject option. Yet, little progress has been made for the multicategory case. Read More

We present a new algorithm which detects the maximal number of matched disjoint pairs satisfying a given caliper when the matching is done with respect to a scalar index (e.g., propensity score), and constructs a corresponding matching. Read More
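The paper's algorithm guarantees a maximum-cardinality matching; as a hedged sketch of what caliper matching on a scalar index looks like, here is a simple two-pointer greedy pass over both sorted lists (our own illustration, not the authors' algorithm, and not guaranteed maximal in general):

```python
def caliper_match(treated, control, caliper):
    """Greedily pair treated and control units whose scalar index
    (e.g. a propensity score) differs by at most `caliper`, scanning
    both lists in sorted order with two pointers."""
    t, c = sorted(treated), sorted(control)
    i = j = 0
    pairs = []
    while i < len(t) and j < len(c):
        if abs(t[i] - c[j]) <= caliper:
            pairs.append((t[i], c[j]))
            i += 1
            j += 1
        elif t[i] < c[j]:
            i += 1   # no admissible control remains for this treated unit
        else:
            j += 1   # this control is too far below every remaining treated
    return pairs

pairs = caliper_match([0.10, 0.30, 0.52], [0.12, 0.55, 0.90], caliper=0.05)
```

Here 0.10 pairs with 0.12 and 0.52 pairs with 0.55, while 0.30 and 0.90 stay unmatched.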

In state space models, smoothing refers to the task of estimating a latent stochastic process given noisy measurements related to the process. We propose the first unbiased estimator of smoothing expectations. The lack-of-bias property has methodological benefits, as it allows for a complete parallelization of the algorithm and for computing accurate confidence intervals. Read More

Whilst there are many approaches to detecting changes in mean for a univariate time series, the problem of detecting multiple changes in slope has been comparatively neglected. Part of the reason for this is that detecting changes in slope is much more challenging. For example, simple binary segmentation procedures do not work for this problem, whilst efficient dynamic programming methods that work well for the change-in-mean problem cannot be directly used for detecting changes in slope. Read More

Mixed-effects models have emerged as the gold standard of statistical analysis in different sub-fields of linguistics (Baayen, Davidson & Bates, 2008; Johnson, 2009; Barr, et al, 2013; Gries, 2015). One problematic feature of these models is their failure to converge under maximal (or even near-maximal) random effects structures. The lack of convergence is relatively unaddressed in linguistics and, when it is addressed, has resulted in statistical practices (e. Read More

Dynamically typed programming languages like R allow programmers to write generic, flexible and concise code and to interact with the language using an interactive Read-eval-print-loop (REPL). However, this flexibility has its price: As the R interpreter has no information about the expected variable type, many base functions automatically convert the input instead of raising an exception. Unfortunately, this frequently leads to runtime errors deeper down the call stack which obfuscates the original problem and renders debugging challenging. Read More

The Log-Gaussian Cox Process is a commonly used model for the analysis of spatial point patterns. Fitting this model is difficult because of its doubly-stochastic property, i.e. Read More

With the advent of big data sets, much of the computational science and engineering community has been moving toward data-driven approaches to regression and classification. However, these problems present a significant challenge due to their increasing size, complexity and dimensionality. In this paper a multi-level kriging method that scales well with dimension is developed. Read More

Minimizing the sum of two functions under a linear constraint is what we call the splitting problem. This convex optimization problem has wide applications in machine learning, such as the Lasso, Group Lasso and sparse logistic regression. A recent paper by Gu et al. (2015) developed a Semi-Proximal-Based Strictly Contractive Peaceman-Rachford Splitting Method (SPB-SPRSM), which is an extension of the Strictly Contractive Peaceman-Rachford Splitting Method (SPRSM) proposed by He et al. (2014). Read More

Sequential Monte Carlo (SMC) methods comprise one of the most successful approaches to approximate Bayesian filtering. However, SMC without good proposal distributions struggles in high dimensions. We propose nested sequential Monte Carlo (NSMC), a methodology that generalises the SMC framework by requiring only approximate, properly weighted, samples from the SMC proposal distribution, while still resulting in a correct SMC algorithm. Read More

Random forest (RF; Breiman 2001a) is a non-parametric statistical method requiring no distributional assumptions on how covariates relate to the response. RF is a robust, nonlinear technique that optimizes predictive accuracy by fitting an ensemble of trees to stabilize model estimates. Random survival forests (RSF) (Ishwaran and Kogalur 2007; Ishwaran et al. Read More

As illustrated via numerical experiments with an implementation in Spark (the popular platform for distributed computation), randomized algorithms solve two ubiquitous problems: (1) calculating a full principal component analysis or singular value decomposition of a highly rectangular matrix, and (2) calculating a low-rank approximation in the form of a singular value decomposition to an arbitrary matrix. Several optimizations to recently introduced methods yield results that are uniformly superior to those of the stock implementations. Read More
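A minimal sketch of the randomized low-rank SVD idea in the Halko-Martinsson-Tropp style (NumPy on one machine rather than Spark, and our own illustration, not the authors' optimized implementation): sketch the range of A with a Gaussian test matrix, then take an exact SVD of the small projected matrix.

```python
import numpy as np

def randomized_svd(A, k, oversample=10, seed=None):
    """Randomized rank-k SVD: Gaussian range sketch + small exact SVD."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    omega = rng.standard_normal((n, k + oversample))  # test matrix
    Q, _ = np.linalg.qr(A @ omega)                    # orthonormal range basis
    B = Q.T @ A                                       # small (k+p) x n matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]

rng = np.random.default_rng(0)
# an exactly rank-5 test matrix, 200 x 40
A = (rng.standard_normal((200, 5)) @ rng.standard_normal((5, 40)))
U, s, Vt = randomized_svd(A, k=5)
```

For an exactly rank-5 matrix the sketch recovers A to floating-point accuracy.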

We extend the application of Hamiltonian Monte Carlo to allow for sampling from probability distributions defined over symmetric or Hermitian positive definite matrices. To do so, we exploit the Riemannian structure induced by Cartan's century-old canonical metric. The geodesics that correspond to this metric are available in closed-form and---within the context of Lagrangian Monte Carlo---provide a principled way to travel around the space of positive definite matrices. Read More

Ranking data represent a peculiar form of multivariate ordinal data taking values in the set of permutations. Despite the numerous methodological contributions to increase the flexibility of ranked data modeling, the application of more sophisticated models is limited by the related computational issues. The PLMIX package offers a comprehensive framework aiming at endowing the R environment with the recent methodological advancements. Read More

Model-based trees are used to find subgroups in data which differ with respect to model parameters. In some applications it is natural to keep some parameters fixed globally for all observations while asking if and how other parameters vary across the subgroups. Existing implementations of model-based trees can only deal with the scenario where all parameters depend on the subgroups. Read More

Modern imaging methods rely strongly on Bayesian inference techniques to solve challenging imaging problems. Currently, the predominant Bayesian computation approach is convex optimisation, which scales very efficiently to high dimensional image models and delivers accurate point estimation results. However, in order to perform more complex analyses, for example image uncertainty quantification or model selection, it is necessary to use more computationally intensive Bayesian computation techniques such as Markov chain Monte Carlo methods. Read More

In genome-wide association (GWA) studies the goal is to detect associations between genetic markers and a given phenotype. The number of genetic markers can be large, and effective methods for controlling the overall error rate are a central topic when analyzing GWA data. The Bonferroni method is known to be conservative when the tests are dependent. Read More
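For reference, the Bonferroni method itself is a one-liner: reject hypothesis i iff p_i ≤ α/m, which controls the family-wise error rate at level α under arbitrary dependence, and is conservative for exactly that reason (our own sketch):

```python
def bonferroni(p_values, alpha=0.05):
    """Bonferroni correction: reject H_i iff p_i <= alpha / m.
    Controls the family-wise error rate at level alpha under any
    dependence structure among the m tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

# With m = 4 tests at alpha = 0.05, the per-test threshold is 0.0125.
rejections = bonferroni([0.001, 0.02, 0.04, 0.6], alpha=0.05)
```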

Estimating failure probabilities of engineering systems is an important problem in many engineering fields. In this work we consider such problems where the failure probability is extremely small (e.g., $\leq 10^{-10}$). Read More

The R package REPPlab is designed to explore multivariate data sets using one-dimensional unsupervised projection pursuit. It is useful in practice as a preprocessing step to find clusters or as an outlier detection tool for multivariate numerical data. Apart from the package tourr, which implements smooth sequences of projection matrices, and rggobi, which provides an interface to the dynamic graphics package GGobi, there is no implementation of exploratory projection pursuit tools available in R, especially in the context of outlier detection. Read More

Big Data has become a pervasive and ubiquitous component of modern data analysis. Due to the pathologies of Big Data, such as network distribution or infeasible scalings of computational time, strategies are required for conducting effective analysis under such conditions. A traditional approach of computer science that has found success in Big Data analysis is the divide-and-conquer paradigm. Read More
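A toy sketch of the divide-and-conquer paradigm for one simple statistic (our own illustration): each chunk is summarized locally, and only the small summaries are combined, so the full data never has to sit in one place.

```python
def divide_and_conquer_mean(chunks):
    """Combine per-chunk (count, mean) summaries into the global mean."""
    summaries = [(len(c), sum(c) / len(c)) for c in chunks]  # map step
    total = sum(n for n, _ in summaries)
    return sum(n * m for n, m in summaries) / total          # reduce step

data = list(range(10))              # pretend this is distributed over 3 nodes
chunks = [data[0:3], data[3:7], data[7:10]]
m = divide_and_conquer_mean(chunks)
```

The same pattern extends to any statistic with a sufficient low-dimensional summary (counts, sums, sums of squares, ...).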

This paper introduces methodology for performing Bayesian inference sequentially on a sequence of posteriors on spaces of different dimensions. We show how this may be achieved through the use of sequential Monte Carlo (SMC) samplers (Del Moral et al., 2006, 2007), making use of the full flexibility of this framework in order that the method is computationally efficient. Read More

The classical multi-set split feasibility problem seeks a point in the intersection of finitely many closed convex domain constraints, whose image under a linear mapping also lies in the intersection of finitely many closed convex range constraints. Split feasibility generalizes important inverse problems including convex feasibility, linear complementarity, and regression with constraint sets. When a feasible point does not exist, solution methods that proceed by minimizing a proximity function can be used to obtain optimal approximate solutions to the problem. Read More

For log-linear analysis, the hyper Dirichlet conjugate prior is available to work in the Bayesian paradigm. With this prior, the MC3 algorithm allows for exploration of the space of models to try to find those with the highest posterior probability. Once top models have been identified, a block Gibbs sampler can be constructed to sample from the posterior distribution and to estimate parameters of interest. Read More

Bayesian inference is a popular method to build learning algorithms but it is hampered by the fact that its key object, the posterior probability distribution, is often uncomputable. Expectation Propagation (EP) (Minka (2001)) is a popular algorithm that solves this issue by computing a parametric approximation (e.g: Gaussian) to the density of the posterior. Read More

We present the BayesBD package providing Bayesian inference for boundaries of noisy images. The BayesBD package implements flexible Gaussian process priors indexed by the circle to recover the boundary in a binary or Gaussian noised image, and achieves four aims: guaranteed geometric restriction, a (nearly) minimax optimal rate adaptive to the smoothness level, convenience for joint inference, and computational efficiency. The core sampling tasks are carried out in C++ using the packages Rcpp and RcppArmadillo. Read More

The R software package excursions contains methods for calculating probabilistic excursion sets, contour credible regions, and simultaneous confidence bands for latent Gaussian stochastic processes and fields. It also contains methods for uncertainty quantification of contour maps and computation of Gaussian integrals. This article describes the theoretical and computational methods used in the package. Read More

Riemann manifold Hamiltonian Monte Carlo (RHMC) holds the potential for producing high-quality Markov chain Monte Carlo output even for very challenging target distributions. To this end, a symmetric positive definite scaling matrix for RHMC is proposed, derived via a modified Cholesky factorization from the potentially indefinite negative Hessian of the target log-density. The methodology is able to exploit sparsity of said Hessian, stemming from conditional independence modeling assumptions, and thus admits fast and highly automatic implementation of RHMC even for high-dimensional target distributions. Read More

Understanding the spread of any disease is a highly complex and interdisciplinary exercise, as biological, social, geographic, economic, and medical factors may shape the way a disease moves through a population and the options for its eventual control or eradication. Disease spread poses a serious threat in animal and plant health and has implications for ecosystem functioning and species extinctions, as well as for society through food security and potential disease spread in humans. Space-time epidemiology is based on the concept that various characteristics of the pathogenic agents and the environment interact to alter the probability of disease occurrence and form temporal or spatial patterns. Read More

Manifold optimization appears in a wide variety of computational problems in the applied sciences. In recent statistical methodologies such as sufficient dimension reduction and regression envelopes, estimation relies on the optimization of likelihood functions over spaces of matrices such as the Stiefel or Grassmann manifolds. Recently, Huang, Absil, Gallivan, and Hand (2016) have introduced the library ROPTLIB, which provides a framework and state of the art algorithms to optimize real-valued objective functions over commonly used matrix-valued Riemannian manifolds. Read More

Two multiplierless pruned 8-point discrete cosine transform (DCT) approximations are presented. Both transforms present lower arithmetic complexity than state-of-the-art methods. The performance of the new methods was assessed in the image compression context. Read More

A Monte Carlo algorithm typically simulates a prescribed number of samples, taking some random real time to complete the computations necessary. This work considers the converse: to impose a real-time budget on the computation, so that the number of samples simulated is random. To complicate matters, the real time taken for each simulation may depend on the sample produced, so that the samples themselves are not independent of their number, and a length bias with respect to computation time is introduced. Read More
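A minimal sketch of a real-time-budgeted Monte Carlo run (our own illustration, estimating π by rejection; not this paper's construction): the number of samples drawn is random because the loop stops at a wall-clock deadline rather than at a fixed sample count.

```python
import random
import time

def estimate_pi_with_budget(budget_seconds, rng):
    """Run a Monte Carlo estimator until a real-time budget is spent;
    the number of samples drawn is then random, not fixed in advance."""
    deadline = time.monotonic() + budget_seconds
    hits = n = 0
    while time.monotonic() < deadline:
        x, y = rng.random(), rng.random()
        hits += x * x + y * y <= 1.0   # point falls inside the quarter disc
        n += 1
    return 4.0 * hits / n, n

estimate, n_samples = estimate_pi_with_budget(0.1, random.Random(42))
```

As the abstract notes, if per-sample cost depends on the sample, stopping at a deadline length-biases the collection toward cheap samples; this naive sketch ignores that correction.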

We propose a novel coding theoretic framework for mitigating stragglers in distributed learning. We show how carefully replicating data blocks and coding across gradients can provide tolerance to failures and stragglers for synchronous Gradient Descent. We implement our scheme in MPI and show how we compare against baseline architectures in running time and generalization error. Read More
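A classic small instance of gradient coding from the literature (three workers, each holding two of three data blocks; not necessarily this paper's exact construction) shows how the full gradient survives any single straggler: each worker sends one fixed linear combination of its local block-gradients, and the master can decode the sum from any two replies.

```python
import numpy as np

# Toy block-gradients and their (desired) sum
g1, g2, g3 = np.array([1.0, 2.0]), np.array([3.0, -1.0]), np.array([0.5, 4.0])
full = g1 + g2 + g3

# Coded messages: each worker combines the gradients of its two blocks
w1 = 0.5 * g1 + g2        # worker 1 holds blocks 1 and 2
w2 = g2 - g3              # worker 2 holds blocks 2 and 3
w3 = 0.5 * g1 + g3        # worker 3 holds blocks 1 and 3

# The master recovers g1 + g2 + g3 from ANY two of the three replies:
from_12 = 2 * w1 - w2
from_13 = w1 + w3
from_23 = w2 + 2 * w3
```

The decoding coefficients are fixed in advance per straggler pattern, so the master just picks the row matching whichever two workers responded first.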