Statistics - Methodology Publications (50)


In this paper, we generalize the metric-based permutation test for the equality of covariance operators proposed by Pigoli et al. (2014) to the case of multiple samples of functional data. To this end, the non-parametric combination methodology of Pesarin and Salmaso (2010) is used to combine all the pairwise comparisons between samples into a global test. Read More

We propose a new family of error distributions for model-based quantile regression, which is constructed through a structured mixture of normal distributions. The construction enables fixing specific percentiles of the distribution while, at the same time, allowing for varying mode, skewness and tail behavior. It thus overcomes the severe limitation of the asymmetric Laplace distribution -- the most commonly used error model for parametric quantile regression -- for which the skewness of the error density is fully specified when a particular percentile is fixed. Read More

We introduce a geometric approach for estimating a probability density function (pdf) given its samples. The procedure involves obtaining an initial estimate of the pdf and then transforming it via a warping function to reach the final estimate. The initial estimate is intended to be computationally fast, albeit suboptimal, but its warping creates a larger, flexible class of density functions, resulting in substantially improved estimation. Read More

In this paper, we construct the simultaneous confidence band (SCB) for the nonparametric component in partially linear panel data models with fixed effects. We remove the fixed effects, and further obtain estimators of the parametric and nonparametric components that do not depend on the fixed effects. We establish the asymptotic distribution of the maximum absolute deviation between the estimated nonparametric component and the true nonparametric component under some suitable conditions, and hence the result can be used to construct the simultaneous confidence band of the nonparametric component. Read More

We propose an objective Bayesian approach to estimate the number of degrees of freedom for the multivariate $t$ distribution and for the $t$-copula, when the parameter is considered discrete. Inference on this parameter has been problematic, as shown by the scarce literature for the multivariate $t$ and, more importantly, by the absence of any method for the $t$-copula. We employ an objective criterion based on loss functions, which allows us to overcome the issue of defining objective probabilities directly. Read More

In this article, we propose a new algorithm for supervised learning methods that both captures the non-linearity in the data and finds the best subset model. To produce an enhanced subset of the original variables, an ideal selection method should have the potential of adding a supplementary level of regression analysis that would capture complex relationships in the data via mathematical transformation of the predictors and exploration of synergistic effects of combined variables. The method that we present here has the potential to produce an optimal subset of variables, making the overall process of model selection more efficient. Read More

Optimized spatial partitioning algorithms are the cornerstone of many successful experimental designs and statistical methods. Of these algorithms, the Centroidal Voronoi Tessellation (CVT) is the most widely utilized. CVT-based methods require global knowledge of spatial boundaries, do not readily allow for weighted regions, have challenging implementations, and are inefficiently extended to high dimensional spaces. Read More

This paper presents a novel method for selecting main effects and a set of reparametrized predictors called conditional main effects (CMEs), which capture the conditional effect of a factor at a fixed level of another factor. CMEs represent highly interpretable phenomena for a wide range of applications in engineering, social sciences and genomics. The challenge in model selection lies in the grouped collinearity structure of CMEs, which can cause poor selection and prediction performance for existing methods. Read More

Relational arrays represent interactions or associations between pairs of actors, often over time or in varied contexts. We focus on the case where the elements of a relational array are modeled as a linear function of observable covariates. Due to the inherent dependencies among relations involving the same individual, standard regression methods for quantifying uncertainty in the regression coefficients for independent data are invalid. Read More

This article introduces a k-Inflated Negative Binomial mixture distribution/regression model as a more flexible alternative to the zero-inflated Poisson distribution/regression model. An EM algorithm is employed to estimate the model's parameters. This new model, along with a Pareto mixture model, is then employed to design an optimal rate-making system. Read More

A reinsurance contract should address the conflicting interests of the insurer and reinsurer. Most existing optimal reinsurance contracts consider the interests of only one party. This article combines the proportional and stop-loss reinsurance contracts and introduces a new reinsurance contract called proportional-stop-loss reinsurance. Read More

A usual reinsurance policy for insurance companies admits one or two layers of the payment deductions. Under the optimality criterion of minimizing the conditional tail expectation (CTE) risk measure of the insurer's total risk, this article generalizes an optimal stop-loss reinsurance policy to an optimal multi-layer reinsurance policy. To achieve such an optimal multi-layer reinsurance policy, this article starts from a given optimal stop-loss reinsurance policy $f(\cdot)$. Read More

This article, in a first step, considers two Bayes estimators for the relativity premium of a given Bonus-Malus system. It then develops a linear relativity premium that is close, in the sense of weighted mean square error loss, to these Bayes estimators. In a second step, it supposes that the claim size distribution for a given Bonus-Malus system can be formulated as a finite mixture distribution. Read More

We consider the recovery of regression coefficients, denoted by $\boldsymbol{\beta}_0$, for a single index model (SIM) relating a binary outcome $Y$ to a set of possibly high dimensional covariates $\boldsymbol{X}$, based on a large but 'unlabeled' dataset $\mathcal{U}$. On $\mathcal{U}$, we fully observe $\boldsymbol{X}$ and additionally a surrogate $S$ which, while not being strongly predictive of $Y$ throughout the entirety of its support, can forecast it with high accuracy when it assumes extreme values. Such datasets arise naturally in modern studies involving large databases such as electronic medical records (EMR) where $Y$, unlike $(\boldsymbol{X}, S)$, is difficult and/or expensive to obtain. Read More

Consider a multiple testing setup where we observe mutually independent pairs $((P_i, X_i))_{1\leq i \leq m}$ of p-values $P_i$ and covariates $X_i$, such that $P_i \perp X_i$ under the null hypothesis. Our goal is to use the information potentially available in the covariates to increase power compared to conventional procedures that only use the $P_i$, while controlling the false discovery rate (FDR). To this end, we recently introduced independent hypothesis weighting (IHW), a weighted Benjamini-Hochberg method, in which the weights are chosen as a function of the covariate $X_i$ in a data-driven manner. Read More
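As a rough illustration of the weighted Benjamini-Hochberg step mentioned above, the sketch below applies the ordinary BH step-up rule to the weighted p-values $P_i / w_i$. It is not the IHW procedure: the weights are supplied by hand rather than learned from the covariates, and the function name `weighted_bh` and the example numbers are hypothetical.

```python
def weighted_bh(pvalues, weights, alpha=0.1):
    """Weighted Benjamini-Hochberg: BH step-up applied to p_i / w_i.

    Assumption: weights are non-negative and chosen without looking at
    the p-values; they are rescaled here so that they average to 1.
    """
    m = len(pvalues)
    if len(weights) != m or m == 0:
        raise ValueError("pvalues and weights must be non-empty and of equal length")
    total = sum(weights)
    norm = [w * m / total for w in weights]
    # A hypothesis with zero weight can never be rejected.
    q = [min(p / w, 1.0) if w > 0 else 1.0 for p, w in zip(pvalues, norm)]
    order = sorted(range(m), key=lambda i: q[i])
    # Largest rank k such that the k-th smallest weighted p-value
    # is at most alpha * k / m.
    k = 0
    for rank, i in enumerate(order, start=1):
        if q[i] <= alpha * rank / m:
            k = rank
    rejected = [False] * m
    for i in order[:k]:
        rejected[i] = True
    return rejected


# Hypothetical example: up-weighting the first hypothesis lets it be rejected.
uniform = weighted_bh([0.04, 0.9], [1.0, 1.0], alpha=0.05)
weighted = weighted_bh([0.04, 0.9], [1.9, 0.1], alpha=0.05)
```

With uniform weights the function reduces to the usual BH procedure, so the only new ingredient is the division by the weights.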

Affiliations: 1Harvard University, 2Harvard University, 3University of Bristol, 4Université Paris-Dauphine PSL and University of Warwick

In purely generative models, one can simulate data given parameters but not necessarily evaluate the likelihood. We use Wasserstein distances between empirical distributions of observed data and empirical distributions of synthetic data drawn from such models to estimate their parameters. Previous interest in the Wasserstein distance for statistical inference has been mainly theoretical, due to computational limitations. Read More
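To make the idea concrete, here is a minimal one-dimensional sketch, assuming equal sample sizes, in which the Wasserstein distance between two empirical distributions reduces to comparing sorted samples. The toy location model, the grid search, and the names `wasserstein_1d` and `fit_location` are illustrative assumptions, not the authors' algorithm.

```python
import random


def wasserstein_1d(x, y, p=1):
    """p-Wasserstein distance between two equal-size 1-D empirical samples."""
    if len(x) != len(y) or not x:
        raise ValueError("samples must be non-empty and of equal size")
    xs, ys = sorted(x), sorted(y)
    # In one dimension the optimal coupling pairs the order statistics.
    return (sum(abs(a - b) ** p for a, b in zip(xs, ys)) / len(xs)) ** (1 / p)


def fit_location(observed, noise, grid):
    """Pick the shift theta whose synthetic sample theta + noise is closest to the data."""
    return min(
        grid,
        key=lambda theta: wasserstein_1d(observed, [theta + e for e in noise]),
    )


# Toy generative model (assumption): data = theta + standard normal noise.
rng = random.Random(0)
observed = [2.0 + rng.gauss(0, 1) for _ in range(200)]
noise = [rng.gauss(0, 1) for _ in range(200)]  # fixed draws reused for every theta
grid = [i / 10 for i in range(41)]             # candidate values 0.0, 0.1, ..., 4.0
theta_hat = fit_location(observed, noise, grid)
```

Reusing the same noise draws for every candidate value keeps the objective a deterministic function of the parameter, which makes the grid search well behaved. In higher dimensions the distance no longer has this sorting shortcut.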

The propensity score is a common tool for estimating the causal effect of a binary treatment in observational data. In this setting, matching, subclassification, imputation, or inverse probability weighting on the propensity score can reduce the initial covariate bias between the treatment and control groups. With more than two treatment options, however, estimation of causal effects requires additional assumptions and techniques, the implementations of which have varied across disciplines. Read More

We develop a constructive approach to estimating sparse, high-dimensional linear regression models. The approach is a computational algorithm motivated from the KKT conditions for the $\ell_0$-penalized least squares solutions. It generates a sequence of solutions iteratively, based on support detection using primal and dual information and root finding. Read More

We consider the linear regression problem under semi-supervised settings wherein the available data typically consist of: (i) a small or moderate-sized 'labeled' dataset, and (ii) a much larger 'unlabeled' dataset. Such data arise naturally from settings where the outcome, unlike the covariates, is expensive to obtain, a frequent scenario in modern studies involving large databases like electronic medical records (EMR). Supervised estimators like the ordinary least squares (OLS) estimator utilize only the labeled data. Read More

The Whittle likelihood is widely used for Bayesian nonparametric estimation of the spectral density of stationary time series. However, the loss of efficiency for non-Gaussian time series can be substantial. On the other hand, parametric methods are more powerful if the model is well-specified, but may fail entirely otherwise. Read More

In the spirit of recent asymptotic works on the General Poverty Index (GPI) in the field of Welfare Analysis, the asymptotic representation of the non-decomposable Takayama's index, which has failed to be incorporated in the unified GPI approach, is addressed and established here. This representation also allows recent results on the estimation of statistical decomposability gaps to be extended to it. The theoretical results are applied to real databases. Read More

Recent advances on overfitting Bayesian mixture models provide a solid and straightforward approach for inferring the underlying number of clusters and model parameters in heterogeneous data. In this study we demonstrate the applicability of such a framework in clustering multivariate continuous data with possibly complex covariance structure. For this purpose an overfitting mixture model of factor analyzers is introduced, assuming that the number of factors is known. Read More

The polygonal distributions are a class of distributions that can be defined via the mixture of triangular distributions over the unit interval. The class includes the uniform and trapezoidal distributions, and is an alternative to the beta distribution. We demonstrate that the polygonal densities are dense in the class of continuous and concave densities with bounded second derivatives. Read More

1. Analog forecasting has been successful at producing robust forecasts for a variety of ecological and physical processes. Analog forecasting is a mechanism-free nonlinear method that forecasts a system forward in time by examining how past states deemed similar to the current state moved forward. Read More

Employing nonparametric methods for density estimation has become routine in Bayesian statistical practice. Models based on discrete nonparametric priors such as Dirichlet Process Mixture (DPM) models are very attractive choices due to their flexibility and tractability. However, a common problem in fitting DPMs or other discrete models to data is that they tend to produce a large number of (sometimes) redundant clusters. Read More

We call change-point problem (CPP) the identification of changes in the probabilistic behavior of a sequence of observations. Solving the CPP involves detecting the number and position of such changes. In genetics, the study of how and which characteristics of an individual's genetic content might contribute to the occurrence and evolution of cancer has fundamental importance in the diagnosis and treatment of such diseases and can be formulated in the framework of change-point analysis. Read More

Several genetic alterations are involved in the genesis and development of cancers. The determination of whether and how each genetic alteration contributes to cancer development is fundamental for a complete understanding of human cancer etiology. Loss of heterozygosity (LOH) is one such genetic phenomenon, linked to a variety of diseases and characterized by the change from heterozygosity (the presence of both alleles of a gene) to homozygosity (the presence of only one type of allele) at a particular DNA locus. Read More

For a given target density, there exist an infinite number of diffusion processes which are ergodic with respect to this density. As observed in a number of papers, samplers based on nonreversible diffusion processes can significantly outperform their reversible counterparts both in terms of asymptotic variance and rate of convergence to equilibrium. In this paper, we take advantage of this in order to construct efficient sampling algorithms based on the Lie-Trotter decomposition of a nonreversible diffusion process into reversible and nonreversible components. Read More

Piecewise deterministic Monte Carlo methods (PDMC) consist of a class of continuous-time Markov chain Monte Carlo methods (MCMC) which have recently been shown to hold considerable promise. Being non-reversible, the mixing properties of PDMC methods often significantly outperform classical reversible MCMC competitors. Moreover, in a Bayesian context they can use sub-sampling ideas, so that they need only access one data point per iteration, whilst still maintaining the true posterior distribution as their invariant distribution. Read More

The two-level normal hierarchical model (NHM) has played a critical role in the theory of small area estimation (SAE), one of the growing areas in statistics with numerous applications in different disciplines. In this paper, we address major well-known shortcomings associated with the empirical best linear unbiased prediction (EBLUP) of a small area mean and its mean squared error (MSE) estimation by considering an appropriate model variance estimator that satisfies multiple properties. The proposed model variance estimator simultaneously (i) improves on the estimation of the related shrinkage factors, (ii) protects EBLUP from the common overshrinkage problem, and (iii) avoids complex bias correction in generating a strictly positive second-order unbiased MSE estimator by either the Taylor series or the single parametric bootstrap method. Read More

Maximum pseudolikelihood (MPL) estimators are useful alternatives to maximum likelihood (ML) estimators when likelihood functions are more difficult to manipulate than their marginal and conditional components. Furthermore, MPL estimators subsume a large number of estimation techniques including ML estimators, maximum composite marginal likelihood estimators, and maximum pairwise likelihood estimators. When considering only the estimation of discrete models (on a possibly countable infinite support), we show that a simple finiteness assumption on an entropy-based measure is sufficient for assessing the consistency of the MPL estimator. Read More

In causal inference, confounding may be controlled either through regression adjustment in an outcome model, or through propensity score adjustment or inverse probability of treatment weighting, or both. The latter approaches, which are based on modelling of the treatment assignment mechanism, and their doubly robust extensions have been difficult to motivate using formal Bayesian arguments; in principle, for likelihood-based inferences, the treatment assignment model can play no part in inferences concerning the expected outcomes if the models are assumed to be correctly specified. On the other hand, forcing dependency between the outcome and treatment assignment models by allowing the former to be misspecified results in loss of the balancing property of the propensity scores and the loss of any double robustness. Read More

This paper develops meshless methods for probabilistically describing discretisation error in the numerical solution of partial differential equations. This construction enables the solution of Bayesian inverse problems while accounting for the impact of the discretisation of the forward problem. In particular, this drives statistical inferences to be more conservative in the presence of significant solver error. Read More

The famous Hiemstra-Jones (HJ) test developed by Hiemstra and Jones (1994) plays a significant role in studying nonlinear causality. Over the last two decades, there have been numerous applications and theoretical extensions based on this pioneering work. However, several works note that counterintuitive results are obtained from the HJ test, and some researchers find that the HJ test is seriously over-rejecting in simulation studies. Read More

Insurers are faced with the challenge of estimating the future reserves needed to handle historic and outstanding claims that are not fully settled. A well-known and widely used technique is the chain-ladder method, which is a deterministic algorithm. To include a stochastic component one may apply generalized linear models to the run-off triangles based on past claims data. Read More
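The deterministic chain-ladder step can be sketched in a few lines. The example below assumes a cumulative run-off triangle stored as a list of rows of decreasing length (one row per accident year) and uses volume-weighted development factors; the triangle values and the function name `chain_ladder` are made up for illustration.

```python
def chain_ladder(triangle):
    """Complete a cumulative run-off triangle with the chain-ladder method.

    triangle[i][j] is the cumulative claim amount for accident year i at
    development period j; later accident years have fewer observed periods.
    Returns the development factors and the completed rows.
    """
    n = len(triangle[0])  # number of development periods
    factors = []
    for j in range(n - 1):
        # Use only accident years observed at both period j and period j + 1.
        num = sum(row[j + 1] for row in triangle if len(row) > j + 1)
        den = sum(row[j] for row in triangle if len(row) > j + 1)
        factors.append(num / den)
    completed = []
    for row in triangle:
        full = list(row)
        # Project the unobserved periods by successive multiplication.
        for j in range(len(row) - 1, n - 1):
            full.append(full[j] * factors[j])
        completed.append(full)
    return factors, completed


# Hypothetical 3 x 3 cumulative triangle.
triangle = [
    [100.0, 150.0, 165.0],
    [110.0, 165.0],
    [120.0],
]
factors, completed = chain_ladder(triangle)
# Reserve per accident year = projected ultimate minus latest observed value.
reserves = [full[-1] - row[-1] for full, row in zip(completed, triangle)]
```

Because the algorithm yields point projections only, it gives no measure of uncertainty, which is the gap the stochastic extensions mentioned above aim to fill.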

In the current paper, the estimation of the probability density function and the cumulative distribution function of the Topp-Leone distribution is considered. We derive the following estimators: maximum likelihood estimator, uniformly minimum variance unbiased estimator, percentile estimator, least squares estimator and weighted least squares estimator. A simulation study shows that the maximum likelihood estimator is more efficient than the other estimators. Read More

We consider an additive partially linear framework for modelling massive heterogeneous data. The major goal is to extract multiple common features simultaneously across all sub-populations while exploring heterogeneity of each sub-population. This work generalizes the partially linear framework proposed in Zhao et al. Read More

In this paper we develop a bivariate discrete generalized exponential distribution, whose marginals are discrete generalized exponential distributions as proposed by Nekoukhou, Alamatsaz and Bidram ("Discrete generalized exponential distribution of a second type", Statistics, 47, 876 - 887, 2013). It is observed that the proposed bivariate distribution is a very flexible distribution and the bivariate geometric distribution can be obtained as a special case of this distribution. The proposed distribution can be seen as a natural discrete analogue of the bivariate generalized exponential distribution proposed by Kundu and Gupta ("Bivariate generalized exponential distribution", Journal of Multivariate Analysis, 100, 581 - 593, 2009). Read More

The focus in this paper is Bayesian system identification based on noisy incomplete modal data where we can impose spatially-sparse stiffness changes when updating a structural model. To this end, based on a similar hierarchical sparse Bayesian learning model from our previous work, we propose two Gibbs sampling algorithms. The algorithms differ in their strategies to deal with the posterior uncertainty of the equation-error precision parameter, but both sample from the conditional posterior probability density functions (PDFs) for the structural stiffness parameters and system modal parameters. Read More

The Cox process is a stochastic process which generalises the Poisson process by letting the underlying intensity function itself be a stochastic process. Much work has focused on the Log-Gaussian Cox process, where the logarithm of the intensity is a Gaussian process. In this paper we propose the Gamma Gaussian Cox Process, under which the square root of the intensity is a Gaussian process. Read More

The presented methodology for single imputation of missing values borrows the idea from data depth, a measure of centrality defined for an arbitrary point of the space with respect to a probability distribution or a data cloud. It consists in iterative maximization of the depth of each observation with missing values, and can be employed with any properly defined statistical depth function. On each single iteration, imputation is narrowed down to optimization of a quadratic, linear, or quasiconcave function, solved analytically, by linear programming, or by the Nelder-Mead method, respectively. Read More

Maximum entropy modeling is a flexible and popular framework for formulating statistical models given partial knowledge. In this paper, rather than the traditional method of optimizing over the continuous density directly, we learn a smooth and invertible transformation that maps a simple distribution to the desired maximum entropy distribution. Doing so is nontrivial in that the objective being maximized (entropy) is a function of the density itself. Read More

In this paper, we develop a family of bivariate beta distributions that encapsulate both positive and negative correlations, and which can be of general interest for Bayesian inference. We then use these bivariate distributions in two contexts. The first is diagnostic testing in medicine, threat detection, and signal processing. Read More

In nonparametric estimation of the autocovariance matrices or the spectral density matrix of a second-order stationary multivariate time series, it is important to preserve positive-definiteness of the estimator. This is in order to ensure interpretability of the estimator as an estimated covariance or spectral matrix, but also to avoid computational issues in e.g. Read More

Variational approximation methods have proven to be useful for scaling Bayesian computations to large data sets and highly parametrized models. Applying variational methods involves solving an optimization problem, and recent research in this area has focused on stochastic gradient ascent methods as a general approach to implementation. Here variational approximation is considered for a posterior distribution in high dimensions using a Gaussian approximating family. Read More

Estimating treatment effects for subgroups defined by post-treatment behavior (i.e., estimating causal effects in a principal stratification framework) can be technically challenging and heavily reliant on strong assumptions. Read More

There is considerable interest in studying how the distribution of an outcome varies with a predictor. We are motivated by environmental applications in which the predictor is the dose of an exposure and the response is a health outcome. A fundamental focus in these studies is inference on dose levels associated with a particular increase in risk relative to a baseline. Read More

Models for fitting spatio-temporal point processes should incorporate spatio-temporal inhomogeneity and allow for different types of interaction between points (clustering or regularity). This paper proposes an extension of the spatial multi-scale area-interaction model to a spatio-temporal framework. This model allows for interaction between points at different spatio-temporal scales and the inclusion of covariates. Read More

We propose a novel methodology for feature screening in clustering massive datasets, in which both the number of features and the number of observations can potentially be very large. Taking advantage of a fusion penalization based convex clustering criterion, we propose a very fast screening procedure that efficiently discards non-informative features by first computing a clustering score corresponding to the clustering tree constructed for each feature, and then thresholding the resulting values. We provide theoretical support for our approach by establishing uniform non-asymptotic bounds on the clustering scores of the "noise" features. Read More

We introduce designs for order-of-addition (OofA) experiments, those that study the order in which m components are applied. Full designs require m! runs, so we investigate design fractions. Balance criteria for creating such designs employ an extension of orthogonal arrays (OA's) to OofA-OA's. Read More
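The $m!$ growth that motivates design fractions is easy to see by enumerating a full design directly. The short sketch below lists every ordering of a set of components and tabulates the run counts for small $m$; the component labels and the function name `full_oofa_design` are placeholders, and no attempt is made to construct the OofA-OA fractions themselves.

```python
from itertools import permutations
from math import factorial


def full_oofa_design(components):
    """Return every order in which the components can be applied (m! runs)."""
    return list(permutations(components))


# Full design for three hypothetical components: 3! = 6 runs.
design = full_oofa_design(["A", "B", "C"])

# Number of runs in the full design for m = 2, ..., 8 components.
run_counts = {m: factorial(m) for m in range(2, 9)}
```

Already at eight components the full design needs 40320 runs, which is why fractions of the full design are of practical interest.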