Statistics - Methodology Publications (50)



Bayesian inference for complex models is challenging due to the need to explore high-dimensional spaces and multimodality, and standard Monte Carlo samplers can have difficulty exploring the posterior effectively. We introduce a general-purpose, rejection-free ensemble Markov chain Monte Carlo (MCMC) technique to improve on existing poorly mixing samplers. This is achieved by combining parallel tempering and an auxiliary variable move to exchange information between the chains. Read More

The regsem package in R, an implementation of regularized structural equation modeling (RegSEM; Jacobucci, Grimm, and McArdle 2016), was recently developed with the goal of incorporating various forms of penalized likelihood estimation in a broad array of structural equation models. The forms of regularization include both the ridge (Hoerl and Kennard 1970) and the least absolute shrinkage and selection operator (lasso; Tibshirani 1996), along with sparser extensions. RegSEM is particularly useful for structural equation models that have a small parameter to sample size ratio, as the addition of penalties can reduce the complexity, thus reducing the bias of the parameter estimates. Read More

In the study of complex physical and biological systems represented by multivariate stochastic processes, an issue of great relevance is the description of the system dynamics spanning multiple temporal scales. While methods to assess the dynamic complexity of individual processes at different time scales are well-established, the multiscale evaluation of directed interactions between processes is complicated by theoretical and practical issues such as filtering and downsampling. Here we extend the very popular measure of Granger causality (GC), a prominent tool for assessing directed lagged interactions between joint processes, to quantify information transfer across multiple time scales. Read More

We introduce a new type of point process model to describe the incidence of contagious diseases. The model is a variant of the Hawkes self-exciting process and exhibits similar clustering but without the restriction that the component describing the contagion must remain static over time. Instead, our proposed model prescribes that the degree of contagion (or productivity) changes as a function of the conditional intensity; of particular interest is the special case where the productivity is inversely proportional to the conditional intensity. Read More
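For orientation, the conditional intensity of a standard Hawkes process (the baseline the abstract departs from) can be sketched as follows. This is a minimal illustration, not the proposed model: the exponential kernel and the names `mu`, `alpha`, and `beta` are assumptions chosen for the example, and here the productivity `alpha` is held fixed, which is exactly the restriction the abstract says is relaxed.

```python
import numpy as np

def hawkes_intensity(t, event_times, mu, alpha, beta):
    """Conditional intensity of a standard Hawkes process at time t.

    lambda(t) = mu + sum over past events t_i < t of alpha * beta * exp(-beta * (t - t_i)).
    The exponential kernel is an assumption made for this sketch;
    alpha is the (constant) productivity, i.e. expected offspring per event.
    """
    event_times = np.asarray(event_times, dtype=float)
    past = event_times[event_times < t]  # only strictly earlier events excite
    return mu + np.sum(alpha * beta * np.exp(-beta * (t - past)))

# Background rate only, before any event has occurred.
baseline = hawkes_intensity(0.5, [1.0, 3.0], mu=0.5, alpha=0.8, beta=1.0)
# One unit of time after the first event, the intensity is elevated.
excited = hawkes_intensity(2.0, [1.0, 3.0], mu=0.5, alpha=0.8, beta=1.0)
```

In a variant where productivity depends on the conditional intensity, `alpha` would become a function of the current value of `lambda(t)` rather than a constant.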

Continuous-time multi-state survival models can be used to describe health-related processes over time. In the presence of interval-censored times for transitions between the living states, the likelihood is constructed using transition probabilities. Models can be specified using parametric or semi-parametric shapes for the hazards. Read More

This paper focuses on the multivariate linear mixed-effects model, including all the correlations between the random effects when the marginal residual terms are assumed uncorrelated and homoscedastic with possibly different standard deviations. The random effects covariance matrix is Cholesky factorized to directly estimate the variance components of these random effects. This strategy enables a consistent estimate of the random effects covariance matrix, which is generally poorly estimated when it is estimated grossly (or directly) using methods such as the EM algorithm. Read More
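The appeal of a Cholesky parameterization can be illustrated with a small sketch: any unconstrained parameter vector maps to a valid (symmetric, positive definite) covariance matrix. The function name and the log-scale diagonal below are assumptions made for this illustration, not details taken from the paper.

```python
import numpy as np

def covariance_from_cholesky_params(theta, q):
    """Map q*(q+1)/2 unconstrained values to a q x q covariance matrix.

    theta fills the lower triangle of L row by row; the diagonal is
    exponentiated (an assumption of this sketch) so that L has a strictly
    positive diagonal and Sigma = L @ L.T is positive definite.
    """
    theta = np.asarray(theta, dtype=float)
    if theta.size != q * (q + 1) // 2:
        raise ValueError("theta must have q*(q+1)/2 entries")
    L = np.zeros((q, q))
    L[np.tril_indices(q)] = theta
    diag = np.diag_indices(q)
    L[diag] = np.exp(L[diag])
    return L @ L.T

# theta = (log L11, L21, log L22) for a 2 x 2 random effects covariance.
sigma = covariance_from_cholesky_params([0.0, 0.5, 0.0], q=2)
```

An optimizer can then search over `theta` freely, without having to enforce positive definiteness as a constraint.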

In this paper, we study a novel approach for the estimation of quantiles when facing potential right censoring of the responses. Contrary to the existing literature on the subject, the adopted strategy of this paper is to tackle censoring at the very level of the loss function usually employed for the computation of quantiles, the so-called "check" function. For interpretation purposes, a simple comparison with the latter reveals how censoring is accounted for in the newly proposed loss function. Read More
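As background, the standard (uncensored) check function can be sketched as below; the censored loss proposed in the paper is not reproduced here. The function names are hypothetical, and the brute-force search over data points is chosen only for clarity.

```python
import numpy as np

def check_loss(u, tau):
    """Check (pinball) loss: rho_tau(u) = u * (tau - 1{u < 0})."""
    u = np.asarray(u, dtype=float)
    return u * np.where(u < 0, tau - 1.0, tau)

def sample_quantile_via_check(y, tau):
    """Return a data point minimizing the summed check loss.

    The empirical check loss is piecewise linear in q, so a minimizer is
    always attained at one of the observations.
    """
    y = np.asarray(y, dtype=float)
    losses = [check_loss(y - q, tau).sum() for q in y]
    return y[int(np.argmin(losses))]

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
median = sample_quantile_via_check(y, 0.5)
```

With `tau = 0.5` the loss is half the absolute error, so the minimizer is the sample median, which is unaffected by the outlying value 100.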

Cross-validation is one of the most popular model selection methods in statistics and machine learning. Despite its wide applicability, traditional cross-validation methods tend to select overfitting models, unless the ratio between the training and testing sample sizes is much smaller than conventional choices. We argue that such an overfitting tendency of cross-validation is due to ignoring the uncertainty in the testing sample. Read More

Particle filters are a popular and flexible class of numerical algorithms to solve a large class of nonlinear filtering problems. However, standard particle filters with importance weights have been shown to require a sample size that increases exponentially with the dimension D of the state space in order to achieve a certain performance, which precludes their use in very high-dimensional filtering problems. Here, we focus on the dynamic aspect of this curse of dimensionality (COD) in continuous time filtering, which is caused by the degeneracy of importance weights over time. Read More

Energy statistics are estimators of the energy distance that depend on the distances between observations. The idea behind energy statistics is to consider a statistical potential energy that would parallel Newton's gravitational potential energy. This statistical potential energy is zero if and only if a certain null hypothesis relating two distributions holds true. Read More
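A two-sample energy statistic can be sketched directly from pairwise distances. The version below is the simple plug-in (V-statistic) estimate of 2 E|X - Y| - E|X - X'| - E|Y - Y'| with Euclidean distance; the function names are hypothetical and the O(n m) memory use is acceptable only for small samples.

```python
import numpy as np

def mean_pairwise_distance(a, b):
    """Average Euclidean distance over all pairs (a_i, b_j)."""
    diffs = a[:, None, :] - b[None, :, :]
    return np.linalg.norm(diffs, axis=-1).mean()

def energy_distance_estimate(x, y):
    """Plug-in estimate of the (squared) energy distance between two samples.

    x and y hold one observation per row; 1-D inputs are treated as
    univariate samples.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x = x.reshape(len(x), -1)
    y = y.reshape(len(y), -1)
    return (2.0 * mean_pairwise_distance(x, y)
            - mean_pairwise_distance(x, x)
            - mean_pairwise_distance(y, y))

same = energy_distance_estimate([0.0, 1.0], [0.0, 1.0])
shifted = energy_distance_estimate([0.0, 1.0], [10.0, 11.0])
```

Identical samples give an estimate of zero, while a location shift produces a clearly positive value, mirroring the "zero if and only if the distributions agree" property of the population quantity.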

Recent advances in bioinformatics have made high-throughput microbiome data widely available, and new statistical tools are required to maximize the information gained from these data. For example, analysis of high-dimensional microbiome data from designed experiments remains an open area in microbiome research. Contemporary analyses work on metrics that summarize collective properties of the microbiome, but such reductions preclude inference on the fine-scale effects of environmental stimuli on individual microbial taxa. Read More

In social and economic studies many of the collected variables are measured on a nominal scale, often with a large number of categories. The definition of categories is usually not unambiguous and different classification schemes using either a finer or a coarser grid are possible. Categorisation has an impact when such a variable is included as covariate in a regression model: a too fine grid will result in imprecise estimates of the corresponding effects, whereas with a too coarse grid important effects will be missed, resulting in biased effect estimates and poor predictive performance. Read More

Thermodynamic integration (TI) for computing marginal likelihoods is based on an inverse annealing path from the prior to the posterior distribution. In many cases, the resulting estimator suffers from high variability, which particularly stems from the prior regime. When comparing complex models with differences in a comparatively small number of parameters, intrinsic errors from sampling fluctuations may outweigh the differences in the log marginal likelihood estimates. Read More

Sufficient dimension reduction (SDR) continues to be an active research field for high-dimensional data. It aims to estimate the central subspace (CS) without making distributional assumptions. To overcome the large-$p$-small-$n$ problem we propose a new approach for SDR. Read More

It is generally accepted that all models are wrong -- the difficulty is determining which are useful. Here, a useful model is considered as one that is capable of combining data and expert knowledge, through an inversion or calibration process, to adequately characterize the uncertainty in predictions of interest. This paper derives conditions that specify which simplified models are useful and how they should be calibrated. Read More

Variational inference methods for latent variable statistical models have gained popularity because they are relatively fast, can handle large data sets, and have deterministic convergence guarantees. However, in practice it is unclear whether the fixed point identified by the variational inference algorithm is a local or a global optimum. Here, we propose a method for constructing iterative optimization algorithms for variational inference problems that are guaranteed to converge to the $\epsilon$-global variational lower bound on the log-likelihood. Read More

The problems of computational data processing involving regression, interpolation, reconstruction and imputation for multidimensional big datasets are becoming more important these days, because of the availability of data and their widely spread usage in business, technological, scientific and other applications. The existing methods often have limitations, which either do not allow, or make it difficult to accomplish many data processing tasks. The problems usually relate to algorithm accuracy, applicability, performance (computational and algorithmic), demands for computational resources, both in terms of power and memory, and difficulty working with high dimensions. Read More

Conditional density estimation (density regression) estimates the distribution of a response variable y conditional on covariates x. Utilizing a partition model framework, a conditional density estimation method is proposed using logistic Gaussian processes. The partition is created using a Voronoi tessellation and is learned from the data using a reversible jump Markov chain Monte Carlo algorithm. Read More

The popularity of online surveys has increased the prominence of sampling weights in claims of representativeness. Yet, much uncertainty remains regarding how these weights should be employed in the analysis of survey experiments: Should they be used or ignored? If they are used, which estimators are preferred? We offer practical advice, rooted in the Neyman-Rubin model, for researchers producing and working with survey experimental data. We examine simple, efficient estimators (Horvitz-Thompson, Hájek, "double-Hájek", and post-stratification) for analyzing these data, along with formulae for biases and variances. Read More
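The difference between the Horvitz-Thompson and Hájek estimators is easiest to see for a plain population mean. The sketch below covers only that textbook case, not the survey-experiment treatment effect estimators or the "double-Hájek" variant studied in the paper; the function names and toy numbers are made up for illustration.

```python
import numpy as np

def horvitz_thompson_mean(y, inclusion_prob, population_size):
    """Horvitz-Thompson estimate of a population mean.

    Each sampled value is weighted by 1 / pi_i and the total is divided
    by the known population size N.
    """
    y = np.asarray(y, dtype=float)
    w = 1.0 / np.asarray(inclusion_prob, dtype=float)
    return np.sum(w * y) / population_size

def hajek_mean(y, inclusion_prob):
    """Hajek estimate: same weighted total, normalized by the sum of weights."""
    y = np.asarray(y, dtype=float)
    w = 1.0 / np.asarray(inclusion_prob, dtype=float)
    return np.sum(w * y) / np.sum(w)

y = [2.0, 4.0, 6.0]
pi = [0.5, 0.25, 0.25]  # hypothetical inclusion probabilities
ht = horvitz_thompson_mean(y, pi, population_size=12)
hajek = hajek_mean(y, pi)
```

The two estimates differ whenever the weights do not sum to the population size: here the weights sum to 10 rather than 12, so the Hájek ratio form rescales the estimate accordingly.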

Many application domains such as ecology or genomics have to deal with multivariate non Gaussian observations. A typical example is the joint observation of the respective abundances of a set of species in a series of sites, aiming to understand the co-variations between these species. The Gaussian setting provides a canonical way to model such dependencies, but does not apply in general. Read More

We describe a way to construct hypothesis tests and confidence intervals after having used the Lasso for feature selection, allowing the regularization parameter to be chosen via an estimate of prediction error. Our estimate of prediction error is a slight variation on cross-validation. Using this variation, we are able to describe an appropriate selection event for choosing a parameter by cross-validation. Read More

The stochastic block model is widely used for detecting community structures in network data. How to test the goodness-of-fit of the model is one of the fundamental problems and has gained growing interest in recent years. In this paper, we propose a new goodness-of-fit test based on the maximum entry of the centered and re-scaled observed adjacency matrix for the stochastic block model in which the number of communities can be allowed to grow linearly with the number of nodes, ignoring a logarithmic factor. Read More

This article proposes a new graphical tool, the magnitude-shape (MS) plot, for visualizing both the magnitude and shape outlyingness of multivariate functional data. The proposed tool builds on the recent notion of functional directional outlyingness, which measures the centrality of functional data by simultaneously considering the level and the direction of their deviation from the central region. The MS-plot intuitively presents not only levels but also directions of magnitude outlyingness on the horizontal axis or plane, and demonstrates shape outlyingness on the vertical axis. Read More

A novel approach towards the spectral analysis of stationary random bivariate signals is proposed. Using the Quaternion Fourier Transform, we introduce a quaternion-valued spectral representation of random bivariate signals seen as complex-valued sequences. This makes possible the definition of a scalar quaternion-valued spectral density for bivariate signals. Read More

The regularization approach for variable selection was well developed for a completely observed data set in the past two decades. In the presence of missing values, this approach needs to be tailored to different missing data mechanisms. In this paper, we focus on a flexible and generally applicable missing data mechanism, which contains both ignorable and nonignorable missing data mechanism assumptions. Read More

When dealing with the problem of simultaneously testing a large number of null hypotheses, a natural testing strategy is to first reduce the number of tested hypotheses by some selection (screening or filtering) process, and then to simultaneously test the selected hypotheses. The main advantage of this strategy is to greatly reduce the severe effect of high dimensions. However, the first screening or selection stage must be properly accounted for in order to maintain some type of error control. Read More

This paper is concerned with learning of mixture regression models for individuals that are measured repeatedly. The adjective "unsupervised" implies that the number of mixing components is unknown and has to be determined, ideally by data driven tools. For this purpose, a novel penalized method is proposed to simultaneously select the number of mixing components and to estimate the mixing proportions and unknown parameters in the models. Read More

We consider the problem of identifying the support of the block signal in a sequence when both the length and the location of the block signal are unknown. The multivariate version of this problem is also considered, in which we try to identify the support of the rectangular signal in the hyper-rectangle. We allow the length of the block signal to grow polynomially with the length of the sequence, which greatly generalizes the previous results in [16]. Read More

A significant literature has arisen to study ways of employing prior knowledge to improve power and precision of multiple testing procedures. Some common forms of prior knowledge may include (a) a priori beliefs about which hypotheses are null, modeled by non-uniform prior weights; (b) differing importances of hypotheses, modeled by differing penalties for false discoveries; (c) partitions of the hypotheses into known groups, indicating (dis)similarity of hypotheses; and (d) knowledge of independence, positive dependence or arbitrary dependence between hypotheses or groups, allowing for more aggressive or conservative procedures. We present a general framework for global null testing and false discovery rate (FDR) control that allows the scientist to incorporate all four types of prior knowledge (a)-(d) simultaneously. Read More

Adopting the Bayesian methodology of adjusting for selection to provide valid inference in Panigrahi (2016), the current work proposes an approximation to a selective posterior, post randomized queries on data. Such a posterior differs from the usual one as it involves a truncated likelihood prepended with a prior belief on parameters in a Bayesian model. The truncation, imposed by selection, leads to intractability of the selective posterior, thereby posing a technical hurdle in sampling from such a posterior. Read More

The current work proposes a Monte Carlo free alternative to inference post randomized selection algorithms with a convex loss and a convex penalty. The pivots based on the selective law that is truncated to all selected realizations typically lack closed form expressions in randomized settings. Inference in these settings relies upon standard Monte Carlo sampling techniques, which can prove to be unstable for parameters far off from the chosen reference distribution. Read More

Integration against an intractable probability measure is among the fundamental challenges of statistical inference, particularly in the Bayesian setting. A principled approach to this problem seeks a deterministic coupling of the measure of interest with a tractable "reference" measure (e.g. Read More

We study the convergence properties of the Gibbs Sampler in the context of posterior distributions arising from Bayesian analysis of Gaussian hierarchical models. We consider centred and non-centred parameterizations as well as their hybrids including the full family of partially non-centred parameterizations. We develop a novel methodology based on multi-grid decompositions to derive analytic expressions for the convergence rates of the algorithm for an arbitrary number of layers in the hierarchy, while previous work was typically limited to the two-level case. Read More

Propensity score weighting is a tool for causal inference to adjust for measured confounders in observational studies. In practice, data often present complex structures, such as clustering, which make propensity score modeling and estimation challenging. In addition, for clustered data, there may be unmeasured cluster-specific variables that are related to both the treatment assignment and the outcome. Read More

Many environmental processes exhibit weakening spatial dependence as events become more extreme. Well-known limiting models, such as max-stable or generalized Pareto processes, cannot capture this, which can lead to a preference for models that exhibit a property known as asymptotic independence. However, weakening dependence does not automatically imply asymptotic independence, and whether the process is truly asymptotically (in)dependent is usually far from clear. Read More

The concept of entropy, first introduced in information theory, rapidly became popular in many applied sciences via Shannon's formula to measure the degree of heterogeneity among observations. A rather recent research field aims at accounting for space in entropy measures, as a generalization when the spatial location of occurrences ought to be accounted for. The main limit of these developments is that all indices are computed conditional on a chosen distance. Read More

There has been considerable interest in using decomposition methods in epidemiology (mediation analysis) and economics (Oaxaca-Blinder decomposition) to understand how health disparities arise and how they might change upon intervention. It has not been clear when estimates from the Oaxaca-Blinder decomposition can be interpreted causally because its implementation does not explicitly address potential confounding of target variables. While mediation analysis does explicitly adjust for confounders of target variables, it does so in a way that entails equalizing confounders across racial groups, which may not reflect the intended intervention. Read More

The intersection of causal inference and machine learning is a rapidly advancing field. We propose a new approach, the method of direct estimation, that draws on both traditions in order to obtain nonparametric estimates of treatment effects. The approach focuses on estimating the effect of fluctuations in a treatment variable on an outcome. Read More

In modern biomedical research, it is ubiquitous to have multiple data sets measured on the same set of samples from different views (i.e., multi-view data). Read More

A distributed multi-speaker voice activity detection (DM-VAD) method for wireless acoustic sensor networks (WASNs) is proposed. DM-VAD is required in many signal processing applications, e.g. Read More

Brain mapping, which learns the statistical links between brain images and subject-level features, is an increasingly important tool in neurology and psychiatry research for the realization of data-driven personalized medicine in the big data era. Taking images as responses, the task raises many challenges due to the high dimensionality of the image with a relatively small number of samples, as well as the noisiness of measurements in medical images. In this paper we propose a novel method, Smooth Image-on-scalar Regression (SIR), for recovering the true association between an image outcome and scalar predictors. Read More

Probabilistic Latent Component Analysis (PLCA) is a statistical modeling method for feature extraction from non-negative data. It has been fruitfully applied to various research fields of information retrieval. However, the EM-solved optimization problem coming with the parameter estimation of PLCA-based models has never been properly posed and justified. Read More

Vine copulas are pair-copula constructions enabling multivariate dependence modeling in terms of bivariate building blocks. One of the main tasks of fitting a vine copula is the selection of a suitable tree structure. For this the prevalent method is a heuristic called Dißmann's algorithm. Read More

The aim of this article is to design a moment transformation for Student-t distributed random variables, which is able to account for the error in the numerically computed mean. We employ Student-t process quadrature, an instance of Bayesian quadrature, which allows us to treat the integral itself as a random variable whose variance provides information about the incurred integration error. The advantage of Student-t process quadrature over traditional Gaussian process quadrature is that the integral variance also depends on the function values, allowing for a more robust modelling of the integration error. Read More

One-sided cross-validation (OSCV) is a bandwidth selection method initially introduced by Hart and Yi (1998) in the context of smooth regression functions. Martínez-Miranda et al. (2009) developed a version of OSCV for smooth density functions. Read More

This paper studies identification, estimation, and inference of quantile treatment effects in the fuzzy regression kink design with a binary treatment variable. We first show the identification of conditional quantile treatment effects given the event of local compliance. We then propose a bootstrap method of uniform inference for the local quantile process. Read More

Uncertainty quantification (UQ) has received much attention in the literature in the past decade. In this context, sparse polynomial chaos expansions (PCE) have been shown to be among the most promising methods because of their ability to model highly complex models at relatively low computational costs. A least-squares minimization technique may be used to determine the coefficients of the sparse PCE by relying on the so-called experimental design (ED), i. Read More

In this article we derive the almost sure convergence theory of the Bayes factor in the general set-up that includes even dependent data and misspecified models, as a simple application of a result of Shalizi (2009) to a well-known identity satisfied by the Bayes factor. Read More

Fully robust versions of the elastic net estimator are introduced for linear and logistic regression. The algorithms to compute the estimators are based on the idea of repeatedly applying the non-robust classical estimators to data subsets only. It is shown how outlier-free subsets can be identified efficiently, and how appropriate tuning parameters for the elastic net penalties can be selected. Read More

A method is derived for the quantitative analysis of signals that are composed of superpositions of isolated, time-localized "events". Here these events are taken to be well represented as rescaled and phase-rotated versions of generalized Morse wavelets, a broad family of continuous analytic functions. Analyzing a signal composed of replicates of such a function using another Morse wavelet allows one to directly estimate the properties of events from the values of the wavelet transform at its own maxima. Read More