Statistics - Methodology Publications (50)


Statistics - Methodology Publications

Dynamic treatment regimes (DTRs) aim to formalize personalized medicine by tailoring treatment decisions to individual patient characteristics. G-estimation for DTR identification targets the parameters of a structural nested mean model known as the blip function from which the optimal DTR is derived. Despite considerable work deriving such estimation methods, there has been little focus on extending G-estimation to the case of non-additive effects, non-continuous outcomes or on model selection. Read More

Missing data are a common problem for both the construction and implementation of a prediction algorithm. Pattern mixture kernel submodels (PMKS) - a series of submodels for every missing data pattern that are fit using only data from that pattern - are a computationally efficient remedy for both stages. Here we show that PMKS yield the most predictive algorithm among all standard missing data strategies. Read More

The areas of model selection and model evaluation for predictive modeling have received extensive treatment in the statistics literature, leading to both theoretical advances and practical methods based on covariance penalties and other approaches. However, the majority of this work, and especially the practical approaches, are based on the "Fixed-X assumption", where covariate values are assumed to be non-random and known. By contrast, in most modern predictive modeling applications, it is more reasonable to take the "Random-X" view, where future prediction points are random and new. Read More

There is a growing demand for nonparametric conditional density estimators (CDEs) in fields such as astronomy and economics. In astronomy, for example, one can dramatically improve estimates of the parameters that dictate the evolution of the Universe by working with full conditional densities instead of regression (i.e. Read More

This note proposes a consistent bootstrap-based distributional approximation for cube root consistent estimators such as the maximum score estimator of Manski (1975) and the isotonic density estimator of Grenander (1956). In both cases, the standard nonparametric bootstrap is known to be inconsistent. Our method restores consistency of the nonparametric bootstrap by altering the shape of the criterion function defining the estimator whose distribution we seek to approximate. Read More

Hypothesis testing in the linear regression model is a fundamental statistical problem. We consider linear regression in the high-dimensional regime where the number of parameters exceeds the number of samples ($p> n$) and assume that the high-dimensional parameters vector is $s_0$ sparse. We develop a general and flexible $\ell_\infty$ projection statistic for hypothesis testing in this model. Read More

The missing phase problem in X-ray crystallography is commonly solved using the technique of molecular replacement, which borrows phases from a previously solved homologous structure, and appends them to the measured Fourier magnitudes of the diffraction patterns of the unknown structure. More recently, molecular replacement has been proposed for solving the missing orthogonal matrices problem arising in Kam's autocorrelation analysis for single particle reconstruction using X-ray free electron lasers and cryo-EM. In classical molecular replacement, it is common to estimate the magnitudes of the unknown structure as twice the measured magnitudes minus the magnitudes of the homologous structure, a procedure known as `twicing'. Read More

This paper develops a new family of estimators, MDPDEs, as a robust generalization of maximum likelihood estimator for the polytomous logistic regression model (PLRM) by using the DPD measure. Based on these estimators, the family of Wald-type test statistics for linear hypotheses is introduced and their robust properties are theoretically studied through the classical influence function analysis. Some numerical examples are presented to justify the requirement of a suitable robust statistical procedure of estimation in place of the MLE. Read More

This paper develops a new family of estimators, the minimum density power divergence estimators (MDPDEs), for the parameters of the one-shot device model as well as a new family of test statistics, Z-type test statistics based on MDPDEs, for testing the corresponding model parameters. The family of MDPDEs contains as a particular case the maximum likelihood estimator (MLE) considered in Balakrishnan and Ling (2012). Through a simulation study, it is shown that some MDPDEs have a better behavior than the MLE in relation to robustness. Read More

Hierarchical models for regionally aggregated disease incidence data commonly involve region specific latent random effects which are modelled jointly as having a multivariate Gaussian distribution. The covariance or precision matrix incorporates the spatial dependence between the regions. Common choices for the precision matrix include the widely used intrinsic conditional autoregressive model which is singular, and its nonsingular extension which lacks interpretability. Read More

Randomized controlled trials (RCTs) provide strong internal validity compared with observational studies. However, selection bias threatens the external validity of randomized trials. Thus, RCT results may not apply to either broad public policy populations or narrow populations, such as specific insurance pools. Read More

In many real problems, dependence structures more general than exchangeability are required. For instance, in some settings partial exchangeability is a more reasonable assumption. For this reason, vectors of dependent Bayesian nonparametric priors have recently gained popularity. Read More

In this paper, we introduce a wideband dictionary framework for estimating sparse signals. By formulating integrated dictionary elements spanning bands of the considered parameter space, one may efficiently find and discard large parts of the parameter space not active in the signal. After each iteration, the zero-valued parts of the dictionary may be discarded to allow a refined dictionary to be formed around the active elements, resulting in a zoomed dictionary to be used in the following iterations. Read More

In observational studies, estimation of a causal effect of a treatment on an outcome relies on proper adjustment for confounding. If the number of the potential confounders ($p$) is larger than the number of observations ($n$), then direct control for all these potential confounders is infeasible. Existing approaches for dimension reduction and penalization are for the most part aimed at predicting the outcome, and are not suited for estimation of causal effects. Read More

Advances in mobile computing technologies have made it possible to monitor and apply data-driven interventions across complex systems in real time. Markov decision processes (MDPs) are the primary model for sequential decision problems with a large or indefinite time horizon. Choosing a representation of the underlying decision process that is both Markov and low-dimensional is non-trivial. Read More

In practice, data often contain discrete variables. But most of the popular nonparametric estimation methods have been developed in a purely continuous framework. A common trick among practitioners is to make discrete variables continuous by adding a small amount of noise. Read More

We develop and evaluate tolerance interval methods for dynamic treatment regimes (DTRs) that can provide more detailed prognostic information to patients who will follow an estimated optimal regime. Although the problem of constructing confidence intervals for DTRs has been extensively studied, prediction and tolerance intervals have received little attention. We begin by reviewing in detail different interval estimation and prediction methods and then adapting them to the DTR setting. Read More

This article reviews the application of advanced Monte Carlo techniques in the context of Multilevel Monte Carlo (MLMC). MLMC is a strategy employed to compute expectations which can be biased in some sense, for instance, by using the discretization of a associated probability law. The MLMC approach works with a hierarchy of biased approximations which become progressively more accurate and more expensive. Read More

The multivariate linear regression model is an important tool for investigating relationships between several response variables and several predictor variables. The primary interest is in inference about the unknown regression coefficient matrix. We propose multivariate bootstrap techniques as a means for making inferences about the unknown regression coefficient matrix. Read More

In the probabilistic topic models, the quantity of interest---a low-rank matrix consisting of topic vectors---is hidden in the text corpus matrix, masked by noise, and Singular Value Decomposition (SVD) is a potentially useful tool for learning such a low-rank matrix. However, the connection between this low-rank matrix and the singular vectors of the text corpus matrix are usually complicated and hard to spell out, so how to use SVD for learning topic models faces challenges. We overcome the challenge by revealing a surprising insight: there is a low-dimensional $\textit{simplex}$ structure which can be viewed as a bridge between the low-rank matrix of interest and the SVD of the text corpus matrix, and which allows us to conveniently reconstruct the former using the latter. Read More

Current statistical inference problems in areas like astronomy, genomics, and marketing routinely involve the simultaneous testing of thousands -- even millions -- of null hypotheses. For high-dimensional multivariate distributions, these hypotheses may concern a wide range of parameters, with complex and unknown dependence structures among variables. In analyzing such hypothesis testing procedures, gains in efficiency and power can be achieved by performing variable reduction on the set of hypotheses prior to testing. Read More

The ensemble Kalman filter (EnKF) is a computational technique for approximate inference on the state vector in spatio-temporal state-space models. It has been successfully used in many real-world nonlinear data-assimilation problems with very high dimensions, such as weather forecasting. However, the EnKF is most appropriate for additive Gaussian state-space models with linear observation equation and without unknown parameters. Read More

Variable clustering is one of the most important unsupervised learning methods, ubiquitous in most research areas. In the statistics and computer science literature, most of the clustering methods lead to non-overlapping partitions of the variables. However, in many applications, some variables may belong to multiple groups, yielding clusters with overlap. Read More

We study the problem of testing for structure in networks using relations between the observed frequencies of small subgraphs. We consider the statistics \begin{align*} T_3 & =(\text{edge frequency})^3 - \text{triangle frequency}\\ T_2 & =3(\text{edge frequency})^2(1-\text{edge frequency}) - \text{V-shape frequency} \end{align*} and prove a central limit theorem for $(T_2, T_3)$ under an Erd\H{o}s-R\'{e}nyi null model. We then analyze the power of the associated $\chi^2$ test statistic under a general class of alternative models. Read More

A new recalibration post-processing method is presented to improve the quality of the posterior approximation when using Approximate Bayesian Computation (ABC) algorithms. Recalibration may be used in conjunction with existing post-processing methods, such as regression-adjustments. In addition, this work extends and strengthens the links between ABC and indirect inference algorithms, allowing more extensive use of misspecified auxiliary models in the ABC context. Read More

We provide compact algebraic expressions that replace the lengthy symbolic-algebra-generated integrals I6 and I8 in Part I of this series of papers [1]. The MRSE entries of Part I, Table 4.3 are thus updated to simpler algebraic expressions. Read More

Distributional approximations of (bi--) linear functions of sample variance-covariance matrices play a critical role to analyze vector time series, as they are needed for various purposes, especially to draw inference on the dependence structure in terms of second moments and to analyze projections onto lower dimensional spaces as those generated by principal components. This particularly applies to the high-dimensional case, where the dimension $d$ is allowed to grow with the sample size $n$ and may even be larger than $n$. We establish large-sample approximations for such bilinear forms related to the sample variance-covariance matrix of a high-dimensional vector time series in terms of strong approximations by Brownian motions. Read More

Objectives Motivated by two case studies using primary care records from the Clinical Practice Research Datalink, we describe statistical methods that facilitate the analysis of tall data, with very large numbers of observations. Our focus is on investigating the association between patient characteristics and an outcome of interest, while allowing for variation among general practices. Study design and setting We fit mixed effects models to outcome data, including predictors of interest and confounding factors as covariates, and including random intercepts to allow for heterogeneity in outcome among practices. Read More

Relative error approaches are more of concern compared to absolute error ones such as the least square and least absolute deviation, when it needs scale invariant of output variable, for example with analyzing stock and survival data. An h-relative error estimation method via the h-likelihood is developed to avoid heavy and intractable integration for a multiplicative regression model with random effect. Statistical properties of the parameters and random effect in the model are studied. Read More

We propose the misclassified Ising Model; a framework for analyzing dependent binary data where the binary state is susceptible to error. We extend the theoretical results of the model selection method presented in Ravikumar et. al. Read More

This article proposes a mixture modeling approach to estimating cluster-wise conditional distributions in clustered (grouped) data. We adapt the mixture-of-experts model to the latent distributions, and propose a model in which each cluster-wise density is represented as a mixture of latent experts with cluster-wise mixing proportions distributed as Dirichlet distribution. The model parameters are estimated by maximizing the marginal likelihood function using a newly developed Monte Carlo Expectation-Maximization algorithm. Read More

Pre-treatment selection or censoring (`selection on treatment') can occur when two treatment levels are compared ignoring the third option of neither treatment, in `censoring by death' settings where treatment is only defined for those who survive long enough to receive it, or in general in studies where the treatment is only defined for a subset of the population. Unfortunately, the standard instrumental variable (IV) estimand is not defined in the presence of such selection, so we consider estimating a new survivor-complier causal effect. Although this effect is not identified under standard IV assumptions, it is possible to construct sharp bounds. Read More

This article introduces new methods for inference with count data registered on a set of aggregation units. Such data are omnipresent in epidemiology due to confidentiality issues: it is much more common to know the county in which an individual resides, say, than know their exact location in space. Inference for aggregated data has traditionally made use of models for discrete spatial variation, for example conditional autoregressive models (CAR). Read More

The challenge of taking many variables into account in optimization problems may be overcome under the hypothesis of low effective dimensionality. Then, the search of solutions can be reduced to the random embedding of a low dimensional space into the original one, resulting in a more manageable optimization problem. Specifically, in the case of time consuming black-box functions and when the budget of evaluations is severely limited, global optimization with random embeddings appears as a sound alternative to random search. Read More

We search for the signature of universal properties of extreme events, theoretically predicted for Axiom A flows, in a chaotic and high dimensional dynamical system by studying the convergence of GEV (Generalized Extreme Value) and GP (Generalized Pareto) shape parameter estimates to a theoretical value, expressed in terms of partial dimensions of the attractor, which are global properties. We consider a two layer quasi-geostrophic (QG) atmospheric model using two forcing levels, and analyse extremes of different types of physical observables (local, zonally-averaged energy, and the average value of energy over the mid-latitudes). Regarding the predicted universality, we find closer agreement in the shape parameter estimates only in the case of strong forcing, producing a highly chaotic behaviour, for some observables (the local energy at every latitude). Read More

This paper addresses maximum likelihood (ML) estimation based model fitting in the context of extrasolar planet detection. This problem is featured by the following properties: 1) the candidate models under consideration are highly nonlinear; 2) the likelihood surface has a huge number of peaks; 3) the parameter space ranges in size from a few to dozens of dimensions. These properties make the ML search a very challenging problem, as it lacks any analytical or gradient based searching solution to explore the parameter space. Read More

In this paper, we propose a new method for estimation and constructing confidence intervals for low-dimensional components in a high-dimensional model. The proposed estimator, called Constrained Lasso (CLasso) estimator, is obtained by simultaneously solving two estimating equations---one imposing a zero-bias constraint for the low-dimensional parameter and the other forming an $\ell_1$-penalized procedure for the high-dimensional nuisance parameter. By carefully choosing the zero-bias constraint, the resulting estimator of the low dimensional parameter is shown to admit an asymptotically normal limit attaining the Cram\'{e}r-Rao lower bound in a semiparametric sense. Read More

In this paper, we apply shrinkage strategies to estimate regression coefficients efficiently for the high-dimensional multiple regression model, where the number of samples is smaller than the number of predictors. We assume in the sparse linear model some of the predictors have very weak influence on the response of interest. We propose to shrink estimators more than usual. Read More

We propose a class of dimension reduction methods for right censored survival data that exhibits significant numerical performances over existing approaches. The underlying framework of the proposed methods is based on a counting process representation of the failure process. Semiparametric estimating equations are built to estimate the dimension reduction space for the failure time model. Read More

In this work, we deal with a bivariate time series of wind speed and direction. Our observed data have peculiar features, such as informative missing values, non-reliable measures under a specific condition and interval-censored data, that we take into account in the model specification. We analyze the time series with a non-parametric Bayesian hidden Markov model, introducing a new emission distribution based on the invariant wrapped Poisson, the Poisson and the hurdle density, suitable to model our data. Read More

We consider modeling of angular or directional data viewed as a linear variable wrapped onto a unit circle. In particular, we focus on the spatio-temporal context, motivated by a collection of wave directions obtained as computer model output developed dynamically over a collection of spatial locations. We propose a novel wrapped skew Gaussian process which enriches the class of wrapped Gaussian process. Read More

Circular data arise in many areas of application. Recently, there has been interest in looking at circular data collected separately over time and over space. Here, we extend some of this work to the spatio-temporal setting, introducing space-time dependence. Read More

In this work we define log-linear models to compare several square contingency tables under the quasi-independence or the quasi-symmetry model, and the relevant Markov bases are theoretically characterized. Through Markov bases, an exact test to evaluate if two or more tables fit a common model is introduced. Two real-data examples illustrate the use of these models in different fields of applications. Read More

Regression discontinuity designs (RDDs) are natural experiments where treatment assignment is determined by a covariate value (or "running variable") being above or below a predetermined threshold. Because the treatment effect will be confounded by the running variable, RDD analyses focus on the local average treatment effect (LATE) at the threshold. The most popular methodology for estimating the LATE in an RDD is local linear regression (LLR), which is a weighted linear regression that places larger weight on units closer to the threshold. Read More

There has been great interest recently in applying nonparametric kernel mixtures in a hierarchical manner to model multiple related data samples jointly. In such settings several data features are commonly present: (i) the related samples often share some, if not all, of the mixture components but with differing weights, (ii) only some, not all, of the mixture components vary across the samples, and (iii) often the shared mixture components across samples are not aligned perfectly in terms of their location and spread, but rather display small misalignments either due to systematic cross-sample difference or more often due to uncontrolled, extraneous causes. Properly incorporating these features in mixture modeling will enhance the efficiency of inference, whereas ignoring them not only reduces efficiency but can jeopardize the validity of the inference due to issues such as confounding. Read More

We propose a framework to shrink a user-specified characteristic of a precision matrix estimator that is needed to fit a predictive model. Estimators in our framework minimize the Gaussian negative log-likelihood plus an $L_1$ penalty on a linear or affine function evaluated at the optimization variable corresponding to the precision matrix. We establish convergence rate bounds for these estimators and we propose an alternating direction method of multipliers algorithm for their computation. Read More

We extend some equatilites in the Owen's table of normal integrals (A table of normal integrals in Communication in Statistics-Simulation and Computation, 1980). Furthermore a new probabilistic model for a vector of binary random variables is proposed. Read More

Monte Carlo (MC) sampling methods are widely applied in Bayesian inference, system simulation and optimization problems. The Markov Chain Monte Carlo (MCMC) algorithms are a well-known class of MC methods which generate a Markov chain with the desired invariant distribution. In this document, we focus on the Metropolis-Hastings (MH) sampler, which can be considered as the atom of the MCMC techniques, introducing the basic notions and different properties. Read More

In this paper we provide the asymptotic theory of the general of $\phi$-divergences measures, which includes the most common divergence measures : Renyi and Tsallis families and the Kullback-Leibler measure. Instead of using the Parzen nonparametric estimators of the probability density functions whose discrepancy is estimated, we use the wavelets approach and the geometry of Besov spaces. One-sided and two-sided statistical tests are derived as well as symmetrized estimators. Read More

We derive the sample size formulae for comparing two negative binomial rates based on both the relative and absolute rate difference metrics in noninferiority and equivalence trials with unequal follow-up times, and establish an approximate relationship between the sample sizes required for the treatment comparison based on the two treatment effect metrics. The proposed method allows the dispersion parameter to vary by treatment groups. The accuracy of these methods is assessed by simulations. Read More