Statistics - Computation Publications (50)


Statistics - Computation Publications

Network modeling has become increasingly popular for analyzing genomic data, to aid in the interpretation and discovery of possible mechanistic components and therapeutic targets. However, genomic-scale networks are high-dimensional models and are usually estimated from a relatively small number of samples. Therefore, their usefulness is hampered by estimation instability. Read More

Statistical models for network epidemics usually assume a Bernoulli random graph, in which any two nodes have the same probability of being connected. This assumption provides computational simplicity but does not describe real-life networks well. We propose an epidemic model based on the preferential attachment model, which adds nodes sequentially by simple rules to generate a network. Read More

Feature subset selection arises in many high-dimensional applications in machine learning and statistics, such as compressed sensing and genomics. The $\ell_0$ penalty is ideal for this task, the caveat being it requires the NP-hard combinatorial evaluation of all models. A recent area of considerable interest is to develop efficient algorithms to fit models with a non-convex $\ell_\gamma$ penalty for $\gamma\in (0,1)$, which results in sparser models than the convex $\ell_1$ or lasso penalty, but is harder to fit. Read More

The R package BigVAR allows for the simultaneous estimation of high-dimensional time series by applying structured penalties to the conventional vector autoregression (VAR) and vector autoregression with exogenous variables (VARX) frameworks. Our methods can be utilized in many forecasting applications that make use of time-dependent data such as macroeconomics, finance, and internet traffic. Our package extends solution algorithms from the machine learning and signal processing literatures to a time dependent setting: selecting the regularization parameter by sequential cross validation and provides substantial improvements in forecasting performance over conventional methods. Read More

We provide an algorithm for sampling the space of abstract simplicial complexes on a fixed number of vertices that aims to provide a balanced sampling over non-isomorphic complexes. Although sampling uniformly from geometrically distinct complexes is a difficult task with no known analytic algorithm, our generative and descriptive algorithm is designed with heuristics to help balance the combinatorial multiplicities of the states and more widely sample across the space of inequivalent configurations. We provide a formula for the exact probabilities with which this algorithm will produce a requested labeled state, and compare the algorithm to Kahle's multi-parameter model of exponential random simplicial complexes, demonstrating analytically that our algorithm performs better with respect to worst-case probability bounds on a given complex and providing numerical results illustrating the increased sampling efficiency over distinct classes. Read More

Principal component analysis (PCA) is fundamental to statistical machine learning. It extracts latent principal factors that contribute to the most variation of the data. When data are stored across multiple machines, however, communication cost can prohibit the computation of PCA in a central location and distributed algorithms for PCA are thus needed. Read More

The R package frailtySurv for simulating and fitting semi-parametric shared frailty models is introduced. frailtySurv implements semi-parametric consistent estimators for a variety of frailty distributions, including gamma, log-normal, inverse Gaussian and power variance function, and provides consistent estimators of the standard errors of the parameters' estimators. The parameters' estimators are asymptotically normally distributed, and therefore statistical inference based on the results of this package, such as hypothesis testing and confidence intervals, can be performed using the normal distribution. Read More

Robust PCA methods are typically batch algorithms which requires loading all observations into memory before processing. This makes them inefficient to process big data. In this paper, we develop an efficient online robust principal component methods, namely online moving window robust principal component analysis (OMWRPCA). Read More

The problem of large scale multiple testing arises in many contexts, including testing for pairwise interaction among large numbers of neurons. With advances in technologies, it has become common to record from hundreds of neurons simultaneously, and this number is growing quickly, so that the number of pairwise tests can be very large. It is important to control the rate at which false positives occur. Read More

Gaussian Markov random fields (GMRFs) are popular for modeling temporal or spatial dependence in large areal datasets due to their ease of interpretation and computational convenience afforded by conditional independence and their sparse precision matrices needed for random variable generation. Using such models inside a Markov chain Monte Carlo algorithm requires repeatedly simulating random fields. This is a nontrivial issue, especially when the full conditional precision matrix depends on parameters that change at each iteration. Read More

In this paper we present an objective approach to change point analysis. In particular, we look at the problem from two perspectives. The first focuses on the definition of an objective prior when the number of change points is known a priori. Read More

We present a new variable selection method based on model-based gradient boosting and randomly permuted variables. Model-based boosting is a tool to fit a statistical model while performing variable selection at the same time. A drawback of the fitting lies in the need of multiple model fits on slightly altered data (e. Read More

We consider the issue of performing accurate small sample inference in beta autoregressive moving average model, which is useful for modeling and forecasting continuous variables that assumes values in the interval $(0,1)$. The inferences based on conditional maximum likelihood estimation have good asymptotic properties, but their performances in small samples may be poor. This way, we propose bootstrap bias corrections of the point estimators and different bootstrap strategies for confidence interval improvements. Read More

The Integrated Nested Laplace Approximation (INLA) is a convenient way to obtain approximations to the posterior marginals for parameters in Bayesian hierarchical models when the latent effects can be expressed as a Gaussian Markov Random Field (GMRF). In addition, its implementation in the R-INLA package for the R statistical software provides an easy way to fit models using INLA in practice. R-INLA implements a number of widely used latent models, including several spatial models. Read More

The emergent field of probabilistic numerics has thus far lacked rigorous statistical principals. This paper establishes Bayesian probabilistic numerical methods as those which can be cast as solutions to certain Bayesian inverse problems, albeit problems that are non-standard. This allows us to establish general conditions under which Bayesian probabilistic numerical methods are well-defined, encompassing both non-linear and non-Gaussian models. Read More

The Bayesian estimation of the unknown parameters of state-space (dynamical) systems has received considerable attention over the past decade, with a handful of powerful algorithms being introduced. In this paper we tackle the theoretical analysis of the recently proposed {\it nonlinear} population Monte Carlo (NPMC). This is an iterative importance sampling scheme whose key features, compared to conventional importance samplers, are (i) the approximate computation of the importance weights (IWs) assigned to the Monte Carlo samples and (ii) the nonlinear transformation of these IWs in order to prevent the degeneracy problem that flaws the performance of conventional importance samplers. Read More

Affiliations: 1Queensland University of Technology, 2University of Oxford, 3Queensland University of Technology

Likelihood-free methods, such as approximate Bayesian computation, are powerful tools for practical inference problems with intractable likelihood functions. Markov chain Monte Carlo and sequential Monte Carlo variants of approximate Bayesian computation can be effective techniques for sampling posterior distributions without likelihoods. However, the efficiency of these methods depends crucially on the proposal kernel used to generate proposal posterior samples, and a poor choice can lead to extremely low efficiency. Read More

We introduce a new class of Monte Carlo based approximations of expectations of random variables defined whose laws are not available directly, but only through certain discretisatizations. Sampling from the discretized versions of these laws can typically introduce a bias. In this paper, we show how to remove that bias, by introducing a new version of multi-index Monte Carlo (MIMC) that has the added advantage of reducing the computational effort, relative to i. Read More

Application of the minimum distance method to the linear regression model for estimating regression parameters is a difficult and time-consuming process due to the complexity of its distance function, and hence, it is computationally expensive. To deal with the computational cost, this paper proposes a fast algorithm which mainly uses technique of coordinate-wise minimization in order to estimate the regression parameters. R package based on the proposed algorithm and written in Rcpp is available online. Read More

Many clustering methods, including k-means, require the user to specify the number of clusters as an input parameter. A variety of methods have been devised to choose the number of clusters automatically, but they often rely on strong modeling assumptions. This paper proposes a data-driven approach to estimate the number of clusters based on a novel form of cross-validation. Read More

Probabilistic (or Bayesian) modeling and learning offers interesting possibilities for systematic representation of uncertainty based on probability theory. Recent advances in Monte Carlo based methods have made previously intractable problem possible to solve using only the computational power available in a standard personal computer. For probabilistic learning of unknown parameters in nonlinear state-space models, methods based on the particle filter have proven useful. Read More

Latent stochastic block models are flexible statistical models that are widely used in social network analysis. In recent years, efforts have been made to extend these models to temporal dynamic networks, whereby the connections between nodes are observed at a number of different times. In this paper we extend the original stochastic block model by using a Markovian property to describe the evolution of nodes' cluster memberships over time. Read More

Many contemporary statistical learning methods assume a Euclidean feature space, however, the "curse of dimensionality" associated with high feature dimensions is particularly severe for the Euclidean distance. This paper presents a method for defining similarity based on hyperspherical geometry and shows that it often improves the performance of support vector machine compared to other competing similarity measures. Specifically, the idea of using heat diffusion on a hypersphere to measure similarity has been proposed and tested by \citet{Lafferty:2015uy}, demonstrating promising results based on an approximate heat kernel, however, the exact hyperspherical heat kernel hitherto remains unknown. Read More

The Poisson-binomial distribution is useful in many applied problems in engineering, actuarial science, and data mining. The Poisson-binomial distribution models the distribution of the sum of independent but not identically distributed Bernoulli random variables whose success probabilities vary. In this paper, we extend the Poisson-binomial distribution to the generalized Poisson-binomial (GPB) distribution. Read More

For a large class of orthogonal basis functions, there has been a recent identification of expansion methods for computing accurate, stable approximations of a quantity of interest. This paper presents, within the context of uncertainty quantification, a practical implementation using basis adaptation, and coherence motivated sampling, which under assumptions has satisfying guarantees. This implementation is referred to as Basis Adaptive Sample Efficient Polynomial Chaos (BASE-PC). Read More

For massive data, the family of subsampling algorithms is popular to downsize the data volume and reduce computational burden. Existing studies focus on approximating the ordinary least squares estimate in linear regression, where statistical leverage scores are often used to define subsampling probabilities. In this paper, we propose fast subsampling algorithms to efficiently approximate the maximum likelihood estimate in logistic regression. Read More

A low-complexity 8-point orthogonal approximate DCT is introduced. The proposed transform requires no multiplications or bit-shift operations. The derived fast algorithm requires only 14 additions, less than any existing DCT approximation. Read More

Light detection and ranging (LiDAR) data provide critical information on the three-dimensional structure of forests. However, collecting wall-to-wall LiDAR data at regional and global scales is cost prohibitive. As a result, studies employing LiDAR data from airborne platforms typically collect data via strip sampling; leaving large swaths of the forest domain unmeasured by the instrument. Read More

We introduce a class of unbiased Monte Carlo estimators for the multivariate density of max-stable fields generated by Gaussian processes. Our estimators take advantage of recent results on exact simulation of max-stable fields combined with identities studied in the Malliavin calculus literature and ideas developed in the multilevel Monte Carlo literature. Our approach allows estimating multivariate densities of max-stable fields with precision $\varepsilon $ at a computational cost of order $O\left( \varepsilon ^{-2}\log \log \log \left( 1/\varepsilon \right) \right) $. Read More

Stochastic Gradient Descent (SGD) is widely used in machine learning problems to efficiently perform empirical risk minimization, yet, in practice, SGD is known to stall before reaching the actual minimizer of the empirical risk. SGD stalling has often been attributed to its sensitivity to the conditioning of the problem; however, as we demonstrate, SGD will stall even when applied to a simple linear regression problem with unity condition number for standard learning rates. Thus, in this work, we numerically demonstrate and mathematically argue that stalling is a crippling and generic limitation of SGD and its variants in practice. Read More

The latent position cluster model is a popular model for the statistical analysis of network data. This model assumes that there is an underlying latent space in which the actors follow a finite mixture distribution. Moreover, actors which are close in this latent space are more likely to be tied by an edge. Read More

A non-parametric method for evaluation of the aggregate loss distribution (ALD) by combining and numerically inverting the empirical characteristic functions (CFs) is presented and illustrated. This approach to evaluate ALD is based on purely non-parametric considerations, i.e. Read More

Ranking datasets is useful when statements on the order of observations are more important than the magnitude of their differences and little is known about the underlying distribution of the data. The Wallenius distribution is a generalisation of the Hypergeometric distribution where weights are assigned to balls of different colours. This naturally defines a model for ranking categories which can be used for classification purposes. Read More

The Integrated Nested Laplace Approximation (INLA) has established itself as a widely used method for approximate inference on Bayesian hierarchical models which can be represented as a latent Gaussian model (LGM). INLA is based on producing an accurate approximation to the posterior marginal distributions of the parameters in the model and some other quantities of interest by using repeated approximations to intermediate distributions and integrals that appear in the computation of the posterior marginals. INLA focuses on models whose latent effects are a Gaussian Markov random field (GMRF). Read More

We introduce a low dimensional function of the site frequency spectrum that is tailor-made for distinguishing coalescent models with multiple mergers from Kingman coalescent models with population growth, and use this function to construct a hypothesis test between these two model classes. The null and alternative sampling distributions of our statistic are intractable, but its low dimensionality renders these distributions amenable to Monte Carlo estimation. We construct kernel density estimates of the sampling distributions based on simulated data, and show that the resulting hypothesis test dramatically improves on the statistical power of a current state-of-the-art method. Read More

Phylogenetic comparative methods explore the relationships between quantitative traits adjusting for shared evolutionary history. This adjustment often occurs through a Brownian diffusion process along the branches of the phylogeny that generates model residuals or the traits themselves. For high-dimensional traits, inferring all pair-wise correlations within the multivariate diffusion is limiting. Read More

Models with intractable normalizing functions arise frequently in statistics. Common examples of such models include exponential random graph models for social networks and Markov point processes for ecology and disease modeling. Inference for these models is complicated because the normalizing functions of their probability distributions include the parameters of interest. Read More

Penalized regression models such as the lasso have been extensively applied to analyzing high-dimensional data sets. However, due to memory limitations, existing R packages like glmnet and ncvreg are not capable of fitting lasso-type models for ultrahigh-dimensional, multi-gigabyte data sets that are increasingly seen in many areas such as genetics, genomics, biomedical imaging, and high-frequency finance. In this research, we implement an R package called biglasso that tackles this challenge. Read More

In this article we consider static Bayesian parameter estimation for partially observed diffusions that are discretely observed. We work under the assumption that one must resort to discretizing the underlying diffusion process, for instance using the Euler-Maruyama method. Given this assumption, we show how one can use Markov chain Monte Carlo (MCMC) and particularly particle MCMC [Andrieu, C. Read More

Although applications of Bayesian analysis for numerical quadrature problems have been considered before, it's only very recently that statisticians have focused on the connections between statistics and numerical analysis of differential equations. In line with this very recent trend, we show how certain commonly used finite difference schemes for numerical solutions of ordinary and partial differential equations can be considered in a regression setting. Focusing on this regression framework, we apply a simple Bayesian strategy to obtain confidence intervals for the finite difference solutions. Read More

We introduce Fisher consistency in the sense of unbiasedness as a criterion to distinguish potentially suitable and unsuitable estimators of prior class probabilities in test datasets under prior probability and more general dataset shift. The usefulness of this unbiasedness concept is demonstrated with three examples of classifiers used for quantification: Adjusted Classify & Count, EM-algorithm and CDE-Iterate. We find that Adjusted Classify & Count and EM-algorithm are Fisher consistent. Read More

HYGARCH process is the commonly used long memory process in modeling the long-rang dependence in volatility. Financial time series are characterized by transition between phases of different volatility levels. The smooth transition HYGARCH (ST-HYGARCH) model is proposed to model time-varying structure with long memory property. Read More

Affiliations: 1Harvard University, 2Harvard University, 3University of Bristol, 4Université Paris-Dauphine PSL and University of Warwick

In purely generative models, one can simulate data given parameters but not necessarily evaluate the likelihood. We use Wasserstein distances between empirical distributions of observed data and empirical distributions of synthetic data drawn from such models to estimate their parameters. Previous interest in the Wasserstein distance for statistical inference has been mainly theoretical, due to computational limitations. Read More

We develop a constructive approach to estimating sparse, high-dimensional linear regression models. The approach is a computational algorithm motivated from the KKT conditions for the $\ell_0$-penalized least squares solutions. It generates a sequence of solutions iteratively, based on support detection using primal and dual information and root finding. Read More

For a given target density, there exist an infinite number of diffusion processes which are ergodic with respect to this density. As observed in a number of papers, samplers based on nonreversible diffusion processes can significantly outperform their reversible counterparts both in terms of asymptotic variance and rate of convergence to equilibrium. In this paper, we take advantage of this in order to construct efficient sampling algorithms based on the Lie-Trotter decomposition of a nonreversible diffusion process into reversible and nonreversible components. Read More

Piecewise deterministic Monte Carlo methods (PDMC) consist of a class of continuous-time Markov chain Monte Carlo methods (MCMC) which have recently been shown to hold considerable promise. Being non-reversible, the mixing properties of PDMC methods often significantly outperform classical reversible MCMC competitors. Moreover, in a Bayesian context they can use sub-sampling ideas, so that they need only access one data point per iteration, whilst still maintaining the true posterior distribution as their invariant distribution. Read More

We present a new inference method based on approximate Bayesian computation for estimating parameters governing an entire network based on link-traced samples of that network. To do this, we first take summary statistics from an observed link-traced network sample, such as a recruitment network of subjects in a hard-to-reach population. Then we assume prior distributions, such as multivariate uniform, for the distribution of some parameters governing the structure of the network and behaviour of its nodes. Read More

We propose Edward, a Turing-complete probabilistic programming language. Edward builds on two compositional representations---random variables and inference. By treating inference as a first class citizen, on a par with modeling, we show that probabilistic programming can be as flexible and computationally efficient as traditional deep learning. Read More

Extensions in the field of joint modeling of correlated data and dynamic predictions improve the development of prognosis research. The R package frailtypack provides estimations of various joint models for longitudinal data and survival events. In particular, it fits models for recurrent events and a terminal event (frailtyPenal), models for two survival outcomes for clustered data (frailtyPenal), models for two types of recurrent events and a terminal event (multivPenal), models for a longitudinal biomarker and a terminal event (longiPenal) and models for a longitudinal biomarker, recurrent events and a terminal event (trivPenal). Read More

Recombinant binomial trees are binary trees where each non-leaf node has two child nodes, but adjacent parents share a common child node. Such trees arise in finance when pricing an option. For example, valuation of a European option can be carried out by evaluating the expected value of asset payoffs with respect to random paths in the tree. Read More