Statistics - Methodology Publications (50)



Statistical models for network epidemics usually assume a Bernoulli random graph, in which any two nodes have the same probability of being connected. This assumption provides computational simplicity but does not describe real-life networks well. We propose an epidemic model based on the preferential attachment model, which adds nodes sequentially by simple rules to generate a network. Read More
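The contrast between the two network models can be made concrete with a small sketch. The abstract does not state the attachment rule, so the code below assumes the common Barabási–Albert-style variant in which each new node links to a single existing node chosen with probability proportional to its degree; the function names are illustrative and are not taken from the paper.

```python
import random


def bernoulli_graph(n, p, rng):
    """Bernoulli random graph: each pair of nodes is linked independently with probability p."""
    return [(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < p]


def preferential_attachment_graph(n, rng):
    """Grow a network node by node (assumed rule: one degree-proportional link per new node)."""
    edges = [(0, 1)]
    # Each node appears in `endpoints` once per incident edge, so a uniform
    # draw from this list is a degree-proportional draw over existing nodes.
    endpoints = [0, 1]
    for new in range(2, n):
        target = rng.choice(endpoints)
        edges.append((target, new))
        endpoints.extend([target, new])
    return edges


def degrees(n, edges):
    """Degree of every node, given an edge list over nodes 0..n-1."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


rng = random.Random(0)
pa_edges = preferential_attachment_graph(200, rng)
pa_deg = degrees(200, pa_edges)
```

Under this rule early nodes tend to accumulate many links, whereas in the Bernoulli graph every node has the same expected degree.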

Blind source separation is a common processing tool to analyse the constitution of pixels of hyperspectral images. Such methods usually suppose that pure pixel spectra (endmembers) are the same throughout the image for each class of materials. In the framework of remote sensing, such an assumption is no longer valid in the presence of intra-class variabilities due to illumination conditions, weathering, slight variations of the pure materials, etc. Read More

Spatiotemporal gene expression data of the human brain offer insights on the spatial and temporal patterns of gene regulation during brain development. Most existing methods for analyzing these data consider spatial and temporal profiles separately with the implicit assumption that different brain regions develop in similar trajectories, and that the spatial patterns of gene expression remain similar at different time points. Although these analyses may help delineate gene regulation either spatially or temporally, they are not able to characterize heterogeneity in temporal dynamics across different brain regions, or the evolution of spatial patterns of gene regulation over time. Read More

Evidence synthesis models that combine multiple datasets of varying design, to estimate quantities that cannot be directly observed, require the formulation of complex probabilistic models that can be expressed as graphical models. An assessment of whether the different datasets synthesised contribute information that is consistent with each other (and in a Bayesian context, with the prior distribution) is a crucial component of the model criticism process. However, a systematic assessment of conflict in evidence syntheses suffers from the multiple testing problem, through testing for conflict at multiple locations in a model. Read More

Standard penalized methods of variable selection and parameter estimation rely on the magnitude of coefficient estimates to decide which variables to include in the final model. However, coefficient estimates are unreliable when the design matrix is collinear. To overcome this challenge, an entirely new method of variable selection is presented within a generalized fiducial inference framework. Read More

Partially-observed Boolean dynamical systems (POBDS) are a general class of nonlinear models with application in estimation and control of Boolean processes based on noisy and incomplete measurements. The optimal minimum mean square error (MMSE) algorithms for POBDS state estimation, namely, the Boolean Kalman filter (BKF) and Boolean Kalman smoother (BKS), are intractable in the case of large systems, due to computational and memory requirements. To address this, we propose approximate MMSE filtering and smoothing algorithms based on the auxiliary particle filter (APF) method from sequential Monte-Carlo theory. Read More

We propose a novel Dirichlet-based Pólya tree (D-P tree) prior on the copula and, based on the D-P tree prior, a nonparametric Bayesian inference procedure. Through theoretical analysis and simulations, we show that the flexibility of the D-P tree prior ensures its consistency in copula estimation, making it able to detect more subtle and complex copula structures than earlier nonparametric Bayesian models, such as a Gaussian copula mixture. Further, the continuity of the imposed D-P tree prior leads to a more favorable smoothing effect in copula estimation over classic frequentist methods, especially with small sets of observations. Read More

In this paper, we propose to construct uniform confidence sets by bootstrapping the debiased kernel density estimator (for density estimation) and the debiased local polynomial regression estimator (for regression analysis). The idea of using a debiased estimator was first introduced in Calonico et al. (2015), where they construct a pointwise confidence set by explicitly estimating stochastic variations. Read More

Detecting causal associations in time series datasets is a key challenge for novel insights into complex dynamical systems such as the Earth system or the human brain. Interactions in high-dimensional dynamical systems often involve time-delays, nonlinearity, and strong autocorrelations. These present major challenges for causal discovery techniques such as Granger causality, leading to low detection power, biases, and unreliable hypothesis tests. Read More

Confidence interval procedures used in low dimensional settings are often inappropriate for high dimensional applications. When a large number of parameters are estimated, marginal confidence intervals associated with the most significant estimates have very low coverage rates: They are too small and centered at biased estimates. The problem of forming confidence intervals in high dimensional settings has previously been studied through the lens of selection adjustment. Read More

A powerful data transformation method named guided projections is proposed, creating new possibilities to reveal the group structure of high-dimensional data in the presence of noise variables. Utilising projections onto a space spanned by a selection of a small number of observations allows measuring the similarity of other observations to the selection based on orthogonal and score distances. Observations are iteratively exchanged from the selection, creating a non-random sequence of projections which we call guided projections. Read More

Empirical Bayes estimators are widely used to provide indirect and model-based estimates of means in small areas. The most common model is the two-stage normal hierarchical model known as the Fay-Herriot model. However, due to the normality assumption, it can be highly influenced by the presence of outliers. Read More

In this paper, we present a methodology to estimate the parameters of stochastically contaminated models under two contamination regimes. In both regimes, we assume that the original process is a variable length Markov chain that is contaminated by a random noise. In the first regime we consider that the random noise is added to the original source and in the second regime, the random noise is multiplied by the original source. Read More

This paper provides asymptotic theory for the Inverse Probability Weighting (IPW) and Locally Robust Estimator (LRE) of the Best Linear Predictor where the response is missing at random (MAR), but not completely at random (MCAR). We relax previous assumptions in the literature about the first-step nonparametric components, requiring only their mean square convergence. This relaxation allows the use of a wider class of machine learning methods for the first step, such as lasso. Read More
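As a rough illustration of the weighting idea only, the sketch below fits a one-covariate linear predictor by weighted least squares, giving each observed response the weight 1/propensity and each missing response weight zero. It assumes the response propensities are already known, whereas the paper estimates such first-step components nonparametrically; the function name and the toy data are hypothetical.

```python
def ipw_linear_predictor(x, y, observed, propensity):
    """Weighted least squares of y on (1, x) with weights R_i / pi_i.

    observed[i] is True when y[i] was recorded; propensity[i] is the
    (assumed known) probability that y[i] is recorded given x[i].
    Returns (intercept, slope).
    """
    w = [1.0 / p if r else 0.0 for r, p in zip(observed, propensity)]
    # Missing responses get a placeholder value; their weight is zero.
    yy = [yi if r else 0.0 for yi, r in zip(y, observed)]
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, yy)) / sw
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, yy))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


# Toy data on the exact line y = 2 + 3x, with one response missing.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [2.0, 5.0, None, 11.0, 14.0]
observed = [True, True, False, True, True]
propensity = [0.5, 0.8, 0.4, 0.8, 0.5]
intercept, slope = ipw_linear_predictor(x, y, observed, propensity)
```

Because the toy responses lie exactly on a line, any positive weights recover the same coefficients; with noisy data the weights matter because they rebalance the observed sample toward the full population.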

Eigenvector spatial filtering (ESF) is a spatial modeling approach, which has been applied in urban and regional studies, ecological studies, and so on. However, it is computationally demanding, and may not be suitable for large data modeling. The objective of this study is to develop fast ESF and random effects ESF (RE-ESF), which are capable of handling very large samples. Read More

Consider the problem of modeling hysteresis for finite-state random walks using higher-order Markov chains. This Letter introduces a Bayesian framework to determine, from data, the number of prior states of recent history upon which a trajectory is statistically dependent. The general recommendation is to use leave-one-out cross validation, using an easily-computable formula that is provided in closed form. Read More

Boolean matrix factorisation (BooMF) infers interpretable decompositions of a binary data matrix into a pair of low-rank, binary matrices: One containing meaningful patterns, the other quantifying how the observations can be expressed as a combination of these patterns. We introduce the OrMachine, a probabilistic generative model for BooMF and derive a Metropolised Gibbs sampler that facilitates very efficient parallel posterior inference. Our method outperforms all currently existing approaches for Boolean Matrix factorization and completion, as we show on simulated and real world data. Read More

In phylogenetics, alignments of molecular sequence data for a collection of species are used to learn about their phylogeny - an evolutionary tree which places these species as leaves and ancestors as internal nodes. Sequence evolution on each branch of the tree is generally modelled using a continuous time Markov process, characterised by an instantaneous rate matrix. Early models assumed the same rate matrix governed substitutions at all sites of the alignment, ignoring the variation in evolutionary constraints. Read More

This paper studies the nonparametric modal regression problem systematically from a statistical learning view. Originally motivated by pursuing a theoretical understanding of the maximum correntropy criterion based regression (MCCR), our study reveals that MCCR with a tending-to-zero scale parameter is essentially modal regression. We show that the nonparametric modal regression problem can be approached via the classical empirical risk minimization. Read More

Without unrealistic continuity and smoothness assumptions on the distributional density of a one-dimensional dataset, constructing an authentic possibly-gapped histogram becomes rather complex. The candidate ensemble is described via a two-layer Ising model, and its size is shown to grow exponentially. This exponential complexity makes any exhaustive search infeasible and all boundary parameters local. Read More

National statistical institutes in many countries are now mandated to produce reliable statistics for important variables such as population, income, unemployment, health outcomes, etc. for small areas, defined by geography and/or demography. Due to small samples from these areas, direct sample-based estimates are often unreliable. Read More

Most common parametric families of copulas are totally ordered, and in many cases they are also positively or negatively regression dependent and therefore lead to monotone regression functions, which makes them unsuitable for dependence relationships that imply or suggest a non-monotone regression function. In the present work we propose a gluing copula decomposition of the underlying copula in terms of copulas that are at least totally ordered; if in addition they are positively or negatively regression dependent, combining them by the gluing copula technique makes it possible to obtain a non-monotone regression function. Read More

In this paper we present an objective approach to change point analysis. In particular, we look at the problem from two perspectives. The first focuses on the definition of an objective prior when the number of change points is known a priori. Read More

In this paper we show that the negative sample distance covariance function is a quasi-concave set function of samples of random variables that are not statistically independent. We use these properties to propose greedy algorithms to combinatorially optimize some diversity (low statistical dependence) promoting functions of distance covariance. Our greedy algorithm obtains all the inclusion-minimal maximizers of this diversity promoting objective. Read More

This paper studies the sparse normal mean models under the empirical Bayes framework. We focus on the mixture priors with an atom at zero and a density component centered at a data driven location determined by maximizing the marginal likelihood or minimizing the Stein Unbiased Risk Estimate. We study the properties of the corresponding posterior median and posterior mean. Read More

Frequentist model averaging has been proposed as a method for incorporating "model uncertainty" into confidence interval construction. Such proposals have been of particular interest in the environmental and ecological statistics communities. A promising method of this type is the model averaged tail area (MATA) confidence interval put forward by Turek and Fletcher (2012). Read More

We propose an empirical Bayes estimator based on a Dirichlet process mixture model for estimating the sparse normalized mean difference, which could be directly applied to high-dimensional linear classification. In theory, we build a bridge connecting the estimation error of the mean difference and the misclassification error, and provide sufficient conditions for sub-optimal classifiers and optimal classifiers. In implementation, a variational Bayes algorithm is developed to compute the posterior efficiently and could be parallelized to deal with the ultra-high dimensional case. Read More

We propose a new Bayesian model for flexible nonlinear regression and classification using tree ensembles. The model is based on the RuleFit approach in Friedman and Popescu (2008) where rules from decision trees and linear terms are used in an L1-regularized regression. We modify RuleFit by replacing the L1-regularization by a horseshoe prior, which is well known to give aggressive shrinkage of noise predictors while leaving the important signal essentially untouched. Read More

Precision medicine is an emerging scientific topic for disease treatment and prevention that takes into account individual patient characteristics. It is an important direction for clinical research, and many statistical methods have been recently proposed. One of the primary goals of precision medicine is to obtain an optimal individual treatment rule (ITR), which can help make decisions on treatment selection according to each patient's specific characteristics. Read More

We consider estimation of an optimal individualized treatment rule (ITR) from observational and randomized studies when data for a high-dimensional baseline variable is available. Our optimality criterion is with respect to delaying time to occurrence of an event of interest (e.g. Read More

This paper studies non-separable models with a continuous treatment when control variables are high-dimensional. We propose an estimation and inference procedure for average, quantile, and marginal treatment effects. In the procedure, control variables are selected via a localized method of $L_1$-penalization at each value of the continuous treatment. Read More

Discriminant analysis is a useful classification method. Variable selection for discriminant analysis is becoming more and more important in a high-dimensional setting. This paper is concerned with the binary-class problems of main and interaction effects selection for the quadratic discriminant analysis. Read More

Parametric hypothesis testing associated with two independent samples arises frequently in several applications in biology, medical sciences, epidemiology, reliability and many more. In this paper, we propose robust Wald-type tests for testing such two-sample problems using the minimum density power divergence estimators of the underlying parameters. In particular, we consider the simple two-sample hypothesis concerning the full parametric homogeneity of the samples as well as the general two-sample (composite) hypotheses involving nuisance parameters. Read More

Causal inference has received great attention across different fields, from economics, statistics, education, and medicine to machine learning. Within this area, inferring causal effects at the individual level in observational studies has become an important task, especially in high-dimensional settings. In this paper, we propose a framework for estimating Individualized Treatment Effects in high-dimensional non-experimental data. Read More

Computation of asymptotic distributions is known to be a nontrivial and delicate task for the regression discontinuity designs (RDD) and the regression kink designs (RKD). It is even more complicated when a researcher is interested in joint or uniform inference across heterogeneous subpopulations indexed by covariates or quantiles. Hence, bootstrap procedures are often preferred in practice. Read More

A nonparametric Bayes approach is proposed for the problem of estimating a sparse sequence based on Gaussian random variables. We adopt the popular two-group prior with one component being a point mass at zero, and the other component being a mixture of Gaussian distributions. Although the Gaussian family has been shown to be suboptimal for this problem, we find that Gaussian mixtures, with a proper choice on the means and mixing weights, have the desired asymptotic behavior, e. Read More

Analysis of rare and extreme values through statistical modeling is an important issue in economic crises, climate forecasting, and risk management of financial portfolios. Extreme value theory provides the probability models needed for statistical modeling of the extreme values. There are generally two ways of identifying the extreme values in a data set: the block-maxima method and the peaks-over-threshold method. Read More
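The two ways of picking out extremes can be sketched in a few lines. The block size, the threshold, and the toy series below are arbitrary illustrative choices and do not come from the paper; in standard extreme value practice the block maxima would then typically be modelled with a generalized extreme value distribution and the threshold excesses with a generalized Pareto distribution.

```python
def block_maxima(series, block_size):
    """Maximum of each complete, non-overlapping block of length block_size."""
    n_blocks = len(series) // block_size
    return [max(series[i * block_size:(i + 1) * block_size]) for i in range(n_blocks)]


def peaks_over_threshold(series, threshold):
    """Excesses (value minus threshold) of observations strictly above the threshold."""
    return [x - threshold for x in series if x > threshold]


data = [1.0, 4.0, 2.0, 9.0, 3.0, 5.0, 7.0, 0.5]
maxima = block_maxima(data, 4)
excesses = peaks_over_threshold(data, 4.5)
```

Block maxima keep exactly one value per block, so a block holding several large values contributes only its largest, while the threshold approach keeps every observation above the cut-off.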

We analyze the problem of maximum likelihood estimation for Gaussian distributions that are multivariate totally positive of order two (MTP2). By exploiting connections to phylogenetics and single-linkage clustering, we give a simple proof that the maximum likelihood estimator (MLE) for such distributions exists based on at least 2 observations, irrespective of the underlying dimension. Slawski and Hein, who first proved this result, also provided empirical evidence showing that the MTP2 constraint serves as an implicit regularizer and leads to sparsity in the estimated inverse covariance matrix, determining what we name the ML graph. Read More

We present a procedure for controlling FWER when sequentially considering successive subfamilies of null hypotheses and rejecting at most one from each subfamily. Our procedure differs from previous procedures for controlling FWER by adjusting the critical values that are applied in subsequent rejection decisions by subtracting from the global significance level $\alpha$ quantities based on the p-values of rejected null hypotheses and the numbers of null hypotheses considered. Read More

The use of Kalman filtering, as well as its nonlinear extensions, for the estimation of system variables and parameters has played a pivotal role in many fields of scientific inquiry where observations of the system are restricted to a subset of variables. However, in the case of censored observations, where measurements of the system beyond a certain detection point are impossible, the estimation problem is complicated. Without appropriate consideration, censored observations can lead to inaccurate estimates. Read More

We consider causal inference from observational studies when confounders have missing values. When the confounders are missing not at random, causal effects are generally not identifiable. In this article, we propose a novel framework for nonparametric identification of causal effects with confounders missing not at random, but subject to instrumental missingness, that is, the missing data mechanism is independent of the outcome, given the treatment and possibly missing confounder values. Read More

This work aimed to determine the characteristics of activity series by applying concepts from fractal geometry, and to evaluate the possibility of identifying individuals with fibromyalgia. Activity level data were collected from 27 healthy subjects and 27 fibromyalgia patients, using clock-like devices equipped with accelerometers, for about four weeks, all day long. The activity series were evaluated through fractal and multifractal methods. Read More

Constraint-based causal discovery (CCD) algorithms require fast and accurate conditional independence (CI) testing. The Kernel Conditional Independence Test (KCIT) is currently one of the most popular CI tests in the non-parametric setting, but many investigators cannot use KCIT with large datasets because the test scales cubically with sample size. We therefore devise two relaxations called the Randomized Conditional Independence Test (RCIT) and the Randomized conditional Correlation Test (RCoT) which both approximate KCIT by utilizing random Fourier features. Read More

In order to understand underlying processes governing environmental and physical processes, and predict future outcomes, a complex computer model is frequently required to simulate these dynamics. However, there is inevitably uncertainty related to the exact parametric form or the values of such parameters to be used when developing these simulators, with ranges of plausible values prevalent in the literature. Systematic errors introduced by failing to account for these uncertainties have the potential to have a large effect on resulting estimates in unknown quantities of interest. Read More

The emergent field of probabilistic numerics has thus far lacked rigorous statistical principles. This paper establishes Bayesian probabilistic numerical methods as those which can be cast as solutions to certain Bayesian inverse problems, albeit problems that are non-standard. This allows us to establish general conditions under which Bayesian probabilistic numerical methods are well-defined, encompassing both non-linear and non-Gaussian models. Read More

Network topology evolves through time. A dynamic network model should account for both the temporal dependencies between graphs observed in time, as well as the structural dependencies inherent in each observed graph. We propose and investigate a family of dynamic network models, known as varying-coefficient exponential random graph models (VCERGMs), that characterize the evolution of network topology through smoothly varying parameters in an exponential family of distributions. Read More

In the following article we consider approximate Bayesian computation (ABC) inference. We introduce a method for numerically approximating ABC posteriors using multilevel Monte Carlo (MLMC). A sequential Monte Carlo version of the approach is developed and it is shown under some assumptions that for a given level of mean square error, this method for ABC has a lower cost than i. Read More

This paper shows that the Conditional Quantile Treatment Effect on the Treated can be identified using a combination of (i) a conditional Distributional Difference in Differences assumption and (ii) an assumption on the conditional dependence between the change in untreated potential outcomes and the initial level of untreated potential outcomes for the treated group. The second assumption recovers the unknown dependence from the observed dependence for the untreated group. We also consider estimation and inference in the case where all of the covariates are discrete. Read More

Hidden Markov models (HMMs) are commonly used to model animal movement data and infer aspects of animal behavior. An HMM assumes that each data point from a time series of observations stems from one of $N$ possible states. The states are loosely connected to behavioral modes that manifest themselves at the temporal resolution at which observations are made. Read More
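The state-based structure described here is what makes the likelihood of an observed series cheap to compute. As a minimal sketch, assuming discrete observation symbols for simplicity (an assumption made for this example; movement models usually work with continuous quantities) and using illustrative parameter values, the standard forward algorithm looks like this:

```python
def hmm_likelihood(obs, init, trans, emit):
    """Forward algorithm: probability of the observation sequence under an HMM.

    init[s]     probability of starting in state s
    trans[r][s] probability of moving from state r to state s
    emit[s][o]  probability that state s produces observation symbol o
    """
    n = len(init)
    # alpha[s] = P(observations so far, current state = s)
    alpha = [init[s] * emit[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [
            emit[s][o] * sum(alpha[r] * trans[r][s] for r in range(n))
            for s in range(n)
        ]
    return sum(alpha)


# Two hypothetical behavioural states and two observation symbols.
init = [0.6, 0.4]
trans = [[0.9, 0.1], [0.2, 0.8]]
emit = [[0.7, 0.3], [0.1, 0.9]]
likelihood = hmm_likelihood([0, 0, 1], init, trans, emit)
```

The recursion costs on the order of N squared operations per time step for N states, instead of summing over every possible state sequence.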

Randomized experiments in which the treatment of a unit can affect the outcomes of other units are becoming increasingly common in healthcare, economics, and in the social and information sciences. From a causal inference perspective, the typical assumption of no interference becomes untenable in such experiments. In many problems, however, the patterns of interference may be informed by the observation of network connections among the units of analysis. Read More