Statistics - Theory Publications (50)


This paper addresses the problem of estimating, in the presence of random censoring as well as competing risks, the extreme value index of the (sub)-distribution function associated with one particular cause, in the heavy-tail case. Asymptotic normality of the proposed estimator (which has the form of an Aalen-Johansen integral, and is the first estimator proposed in this context) is established. A small simulation study illustrates its finite-sample performance. Read More

In this contribution we are concerned with the asymptotic behaviour as $u\to \infty$ of $\mathbb{P}\{\sup_{t\in [0,T]} X_u(t)> u\}$, where $X_u(t),t\in [0,T],u>0$ is a family of centered Gaussian processes with continuous trajectories. A key application of our findings concerns $\mathbb{P}\{\sup_{t\in [0,T]} (X(t)+ g(t))> u\}$ as $u\to\infty$, for $X$ a centered Gaussian process and $g$ some measurable trend function. Further applications include the approximation of both the ruin time and the ruin probability of the Brownian motion risk model with constant force of interest. Read More

In this article we present a Bernstein inequality for sums of random variables which are $\beta$-mixing. The inequality can be used to derive concentration inequalities. It can be useful to obtain consistency properties for nonparametric estimators of conditional expectation functions. Read More

Fitting linear regression models can be computationally very expensive in large-scale data analysis tasks if the sample size and the number of variables are very large. Random projections are extensively used as a dimension reduction tool in machine learning and statistics. We discuss the applications of random projections in linear regression problems, developed to decrease computational costs, and give an overview of the theoretical guarantees of the generalization error. Read More
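As a rough illustration of the idea in the abstract above, the following minimal sketch compresses the rows of a regression problem with a Gaussian random projection and then solves the smaller least-squares problem. The function name `sketched_least_squares`, the choice of a Gaussian projection, and all sizes are assumptions made for this example; the truncated abstract does not say which projection schemes the paper covers.

```python
import numpy as np

def sketched_least_squares(X, y, sketch_size, rng):
    """Solve least squares on a randomly projected (row-compressed) problem.

    Hypothetical helper for illustration: a Gaussian "sketch-and-solve" scheme.
    """
    n = X.shape[0]
    # Gaussian projection with entries of variance 1 / sketch_size.
    S = rng.standard_normal((sketch_size, n)) / np.sqrt(sketch_size)
    # Regress the projected response on the projected design.
    beta, *_ = np.linalg.lstsq(S @ X, S @ y, rcond=None)
    return beta

rng = np.random.default_rng(0)
n, p = 2000, 5
X = rng.standard_normal((n, p))
beta_true = np.arange(1.0, p + 1.0)
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# Full least squares uses all 2000 rows; the sketch uses only 200 projected rows.
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_sketch = sketched_least_squares(X, y, sketch_size=200, rng=rng)
```

The sketched fit trades some statistical accuracy for a smaller problem, so `beta_sketch` is typically somewhat further from `beta_true` than `beta_full` is.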

We consider the recovery of regression coefficients, denoted by $\boldsymbol{\beta}_0$, for a single index model (SIM) relating a binary outcome $Y$ to a set of possibly high dimensional covariates $\boldsymbol{X}$, based on a large but 'unlabeled' dataset $\mathcal{U}$. On $\mathcal{U}$, we fully observe $\boldsymbol{X}$ and additionally a surrogate $S$ which, while not being strongly predictive of $Y$ throughout the entirety of its support, can forecast it with high accuracy when it assumes extreme values. Such datasets arise naturally in modern studies involving large databases such as electronic medical records (EMR) where $Y$, unlike $(\boldsymbol{X}, S)$, is difficult and/or expensive to obtain. Read More

Affiliations: 1Harvard University, 2Harvard University, 3University of Bristol, 4Université Paris-Dauphine PSL and University of Warwick

In purely generative models, one can simulate data given parameters but not necessarily evaluate the likelihood. We use Wasserstein distances between empirical distributions of observed data and empirical distributions of synthetic data drawn from such models to estimate their parameters. Previous interest in the Wasserstein distance for statistical inference has been mainly theoretical, due to computational limitations. Read More
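The estimation principle described above can be sketched in one dimension, where the Wasserstein distance between two equal-size empirical distributions reduces to the mean absolute difference of the sorted samples. The toy location model, the grid search, and the names `wasserstein_1d` and `estimate_location` are assumptions for illustration only, not the authors' procedure.

```python
import numpy as np

def wasserstein_1d(x, y):
    """W1 distance between two equal-size empirical distributions on the line."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equal sample sizes")
    # In 1D the optimal coupling matches order statistics.
    return float(np.mean(np.abs(x - y)))

def estimate_location(observed, candidates, seed=0):
    """Pick the candidate location whose synthetic sample is closest in W1.

    Assumed toy generative model: theta + standard normal noise.
    """
    observed = np.asarray(observed, dtype=float)
    rng = np.random.default_rng(seed)
    # Reuse the same noise for every candidate (common random numbers).
    noise = rng.standard_normal(observed.size)
    distances = [wasserstein_1d(observed, theta + noise) for theta in candidates]
    return float(candidates[int(np.argmin(distances))])

data_rng = np.random.default_rng(1)
observed = 2.0 + data_rng.standard_normal(500)
theta_hat = estimate_location(observed, np.linspace(0.0, 4.0, 41))
```

Only simulation from the model is needed here; the likelihood is never evaluated, which is the setting the abstract describes.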

Conditions for geometric ergodicity of multivariate ARCH processes, with the so-called BEKK parametrization, are considered. We show for a class of BEKK-ARCH processes that the invariant distribution is regularly varying. In order to account for the possibility of different tail indices of the marginals, we consider the notion of vector scaling regular variation, in the spirit of Perfekt (1997). Read More

Classification rules can be severely affected by the presence of disturbing observations in the training sample. Looking for an optimal classifier with such data may lead to unnecessarily complex rules. So, simpler effective classification rules could be achieved if we relax the goal of fitting a good rule for the whole training sample and instead consider only a fraction of the data. Read More

We study the maximum likelihood estimator of the density of $n$ independent observations, under the assumption that it is well approximated by a mixture with a large number of components. The main focus is on statistical properties with respect to the Kullback-Leibler loss. We establish risk bounds taking the form of sharp oracle inequalities both in deviation and in expectation. Read More

This paper discusses the properties of certain risk estimators recently proposed to choose regularization parameters in ill-posed problems. A simple approach is Stein's unbiased risk estimator (SURE), which estimates the risk in the data space, while a recent modification (GSURE) estimates the risk in the space of the unknown variable. It seems intuitive that the latter is more appropriate for ill-posed problems, since the properties in the data space do not tell much about the quality of the reconstruction. Read More

The probability density quantile (pdQ) carries essential information regarding shape and tail behavior of a location-scale family. The Kullback-Leibler divergences from uniformity of these pdQs are found and interpreted and convergence of the pdQ mapping to the uniform distribution is investigated. Read More

We consider the linear regression problem under semi-supervised settings wherein the available data typically consists of: (i) a small or moderate sized 'labeled' data, and (ii) a much larger sized 'unlabeled' data. Such data arises naturally from settings where the outcome, unlike the covariates, is expensive to obtain, a frequent scenario in modern studies involving large databases like electronic medical records (EMR). Supervised estimators like the ordinary least squares (OLS) estimator utilize only the labeled data. Read More

A new three-parameter cumulative distribution function defined on $(\alpha,\infty)$, for some $\alpha\geq0$, with an asymmetric probability density function that decays exponentially in both tails, is introduced. The new distribution is close to familiar distributions such as the gamma and log-normal distributions, but it has distinctive features of its own and does not generalize either of them. Hence, the new distribution constitutes a new alternative for fitting data with light-tailed behavior. Read More

In many biological, agricultural and military activity problems, and in some quality control problems, it is almost impossible to have a fixed sample size, because some observations are always lost for various reasons. Therefore, the sample size itself is frequently considered to be a random variable (rv). The class of limit distribution functions (df's) of the random bivariate extreme generalized order statistics (GOS) from independent and identically distributed rv's is fully characterized. Read More

We propose to estimate a metamodel and the sensitivity indices of a complex model m in the Gaussian regression framework. Our approach combines methods for sensitivity analysis of complex models and statistical tools for sparse non-parametric estimation in the multivariate Gaussian regression model. It rests on the construction of a metamodel for approximating the Hoeffding-Sobol decomposition of m. Read More

In the present paper we propose and study estimators for a wide class of bivariate measures of concordance for copulas. These measures of concordance are generated by a copula and generalize Spearman's rho and Gini's gamma. In the case of Spearman's rho and Gini's gamma the estimators turn out to be the usual sample versions of these measures of concordance. Read More

We consider a sparse linear regression model $Y=X\beta^{*}+W$ where $X$ has Gaussian entries, $W$ is the noise vector with mean zero Gaussian entries, and $\beta^{*}$ is a binary vector with support size (sparsity) $k$. Using a novel conditional second moment method we obtain an approximation, tight up to a multiplicative constant, of the optimal squared error $\min_{\beta}\|Y-X\beta\|_{2}$, where the minimization is over all $k$-sparse binary vectors $\beta$. The approximation reveals interesting structural properties of the underlying regression problem. Read More

Based on the convex least-squares estimator, we propose two different procedures for testing convexity of a probability mass function supported on N with an unknown finite support. The procedures are shown to be asymptotically calibrated. Read More

While scale invariance is commonly observed in each component of real world multivariate signals, it is also often the case that the inter-component correlation structure is not fractally connected, i.e., its scaling behavior is not determined by that of the individual components. Read More

In this article we present a Bernstein inequality for sums of random variables which are defined on a graphical network whose nodes grow at an exponential rate. The inequality can be used to derive concentration inequalities in highly-connected networks. It can be useful to obtain consistency properties for nonparametric estimators of conditional expectation functions which are derived from such networks. Read More

Drawing causal inference with observational studies is the central pillar of many disciplines. One sufficient condition for identifying the causal effect is that the treatment-outcome relationship is unconfounded conditional on the observed covariates. It is often believed that the more covariates we condition on, the more plausible this unconfoundedness assumption is. Read More

The two-level normal hierarchical model (NHM) has played a critical role in the theory of small area estimation (SAE), one of the growing areas in statistics with numerous applications in different disciplines. In this paper, we address major well-known shortcomings associated with the empirical best linear unbiased prediction (EBLUP) of a small area mean and its mean squared error (MSE) estimation by considering an appropriate model variance estimator that satisfies multiple properties. The proposed model variance estimator simultaneously (i) improves on the estimation of the related shrinkage factors, (ii) protects EBLUP from the common overshrinkage problem, and (iii) avoids complex bias correction in generating a strictly positive second-order unbiased MSE estimator by either the Taylor series or the single parametric bootstrap method. Read More

A regularized risk minimization procedure for regression function estimation is introduced that achieves near optimal accuracy and confidence under general conditions, including heavy-tailed predictor and response variables. The procedure is based on median-of-means tournaments, introduced by the authors in [8]. It is shown that the new procedure outperforms standard regularized empirical risk minimization procedures such as lasso or slope in heavy-tailed problems. Read More
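The tournament procedure in the abstract above builds on the median-of-means idea. The sketch below shows only the basic median-of-means estimator of a mean, not the authors' tournament or its regularized version; the function name `median_of_means` and the toy data are assumptions for illustration.

```python
import statistics

def median_of_means(values, n_blocks):
    """Basic median-of-means estimate of a mean.

    Splits the data, in the order given, into n_blocks equal-size blocks,
    averages each block, and returns the median of the block averages.
    Observations left over after the last full block are dropped.
    """
    values = list(values)
    if not 1 <= n_blocks <= len(values):
        raise ValueError("n_blocks must be between 1 and the sample size")
    block_size = len(values) // n_blocks
    block_means = [
        statistics.fmean(values[i * block_size:(i + 1) * block_size])
        for i in range(n_blocks)
    ]
    return statistics.median(block_means)

# One gross outlier ruins the sample mean but contaminates only one block.
data = [1, 2, 3, 4, 5, 6, 7, 8, 1000]
robust_estimate = median_of_means(data, n_blocks=3)  # block means: 2.0, 5.0, 338.33...
plain_mean = statistics.fmean(data)
```

The blocks are taken in the order given, which is harmless for independent and identically distributed data; with ordered data one would usually shuffle first.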

Nowadays data compressors are applied to many problems of text analysis, but many such applications are developed outside of the framework of mathematical statistics. In this paper we overcome this obstacle and show how several methods of classical mathematical statistics can be developed based on applications of the data compressors. Read More

This paper develops meshless methods for probabilistically describing discretisation error in the numerical solution of partial differential equations. This construction enables the solution of Bayesian inverse problems while accounting for the impact of the discretisation of the forward problem. In particular, this drives statistical inferences to be more conservative in the presence of significant solver error. Read More

In this article, we first review negative probability models and quasi-infinitely divisible distributions. We next extend Feller's characterization of discrete infinitely divisible distributions to signed discrete infinitely divisible distributions, which are discrete pseudo compound Poisson (DPCP) distributions with connections to the Lévy-Wiener theorem. This is a special case of an open problem proposed by Sato (2014) and Chaumont and Yor (2012). Read More

We consider an additive partially linear framework for modelling massive heterogeneous data. The major goal is to extract multiple common features simultaneously across all sub-populations while exploring heterogeneity of each sub-population. This work generalizes the partially linear framework proposed in Zhao et al. Read More

In survival analysis it often happens that some subjects under study do not experience the event of interest; they are considered to be `cured'. The population is thus a mixture of two subpopulations: the one of cured subjects, and the one of `susceptible' subjects. When covariates are present, a so-called mixture cure model can be used to model the conditional survival function of the population. Read More

We consider the Lasso for a noiseless experiment where one has observations $X \beta^0$ and uses the penalized version of basis pursuit. We compute for some special designs the compatibility constant, a quantity closely related to the restricted eigenvalue. We moreover show the dependence of the (penalized) prediction error on this compatibility constant. Read More

The Kaplan-Meier product-limit estimator is a simple and powerful tool in time to event analysis. However, by design, it is agnostic to the influence of covariates. Hence it is not suited for resolving issues of heterogeneity and differential censoring that may feature in real applications, except through extensions or modifications. Read More
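For reference, the product-limit estimator mentioned above multiplies, over the distinct event times, the factors (1 - d/n), where d is the number of events at that time and n is the number of subjects still at risk. A minimal sketch follows; the function name `kaplan_meier` and the toy data are assumptions for illustration, and covariates are ignored, which is exactly the limitation the abstract points out.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimates at each distinct event time.

    times  : observed times (event or censoring)
    events : 1 if the event was observed, 0 if the observation was censored
    Returns a list of (time, estimated survival probability) pairs.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all observations tied at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Events and censored subjects both leave the risk set after time t.
        at_risk -= removed
    return curve

curve = kaplan_meier(times=[1, 2, 3, 4, 5], events=[1, 0, 1, 1, 0])
```

In this toy sample the censored subjects at times 2 and 5 never trigger a drop in the curve; they only shrink the risk set for later event times.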

We introduce a robust estimator of the location parameter for the change-point in the mean based on the Wilcoxon statistic and establish its consistency for $L_1$ near epoch dependent processes. It is shown that the consistency rate depends on the magnitude of change. A simulation study is performed to evaluate finite sample properties of the Wilcoxon-type estimator in standard cases, as well as under heavy-tailed distributions and disturbances by outliers, and to compare it with a CUSUM-type estimator. Read More

In many real applications of statistical learning, a decision made from misclassification can be too costly to afford; in this case, a reject option, which defers the decision until further investigation is conducted, is often preferred. In recent years, there has been much development for binary classification with a reject option. Yet, little progress has been made for the multicategory case. Read More

We study the asymptotic behavior of estimators of a two-valued, discontinuous diffusion coefficient in a Stochastic Differential Equation, called an Oscillating Brownian Motion. Using the relation of the latter process with the Skew Brownian Motion, we propose two natural consistent estimators, which are variants of the integrated volatility estimator and take the occupation times into account. We show the stable convergence of the renormalized estimation errors toward some Gaussian mixture, possibly corrected by a term that depends on the local time. Read More

This paper gives upper and lower bounds on the minimum error probability of Bayesian $M$-ary hypothesis testing in terms of the Arimoto-Rényi conditional entropy of an arbitrary order $\alpha$. The improved tightness of these bounds over their specialized versions with the Shannon conditional entropy ($\alpha=1$) is demonstrated. In particular, in the case where $M$ is finite, we show how to generalize Fano's inequality under both the conventional and list-decision settings. Read More

We obtain estimation error rates and sharp oracle inequalities for a Birgé T-estimator that uses tests based on a regularized median-of-means principle. The results hold with exponentially large probability (the same as in the Gaussian framework with independent noise) under only weak moment assumptions, such as an $L_4/L_2$ assumption, and without assuming independence between the noise and the design $X$. The obtained rates are minimax optimal. Read More

This paper studies some robust regression problems associated with the $q$-norm loss ($q\ge1$) and the $\epsilon$-insensitive $q$-norm loss in the reproducing kernel Hilbert space. We establish a variance-expectation bound under an a priori noise condition on the conditional distribution, which is the key tool for deriving the error bound. Explicit learning rates are given under approximation ability assumptions on the reproducing kernel Hilbert space. Read More

The paper deals with planar segment processes given by a density with respect to the Poisson process. Parametric models involve reference distributions of directions and/or lengths of segments. These distributions generally do not coincide with the corresponding observed distributions. Read More

The orientation of a rigid object can be described by a rotation that transforms it into a standard position. For a symmetrical object the rotation is known only up to multiplication by an element of the symmetry group. Such ambiguous rotations arise in biomechanics, crystallography and seismology. Read More

With nonignorable missing data, likelihood-based inference should be based on the joint distribution of the study variables and their missingness indicators. These joint models cannot be estimated from the data alone, thus requiring the analyst to impose restrictions that make the models uniquely obtainable from the distribution of the observed data. We present an approach for constructing classes of identifiable nonignorable missing data models. Read More

Extreme value modeling has attracted the attention of researchers in diverse areas such as the environment, engineering, and finance. Multivariate extreme value distributions are particularly suitable to model the tails of multidimensional phenomena. The analysis of the dependence among multivariate maxima is useful to evaluate risk. Read More

This paper discusses the minimum distance estimation method in the linear regression model with dependent errors that are strongly mixing. The regression parameters are estimated by this method, and asymptotic distributional properties of the estimators are discussed. A simulation study compares the performance of the minimum distance estimator with that of other well-known estimators. Read More

We construct optimal designs for group testing experiments where the goal is to estimate the prevalence of a trait by using a test with uncertain sensitivity and specificity. Using optimal design theory for approximate designs, we show that the most efficient design for simultaneously estimating the prevalence, sensitivity and specificity requires three different group sizes with equal frequencies. However, if estimating prevalence as accurately as possible is the only focus, the optimal strategy is to have three group sizes with unequal frequencies. Read More

This article is an extended version of previous work of the authors [40, 41] on low-rank matrix estimation in the presence of constraints on the factors into which the matrix is factorized. Low-rank matrix factorization is one of the basic methods used in data analysis for unsupervised learning of relevant features and other types of dimensionality reduction. We present a framework to study the constrained low-rank matrix estimation for a general prior on the factors, and a general output channel through which the matrix is observed. Read More

Testing whether a probability distribution is compatible with a given Bayesian network is a fundamental task in the field of causal inference, where Bayesian networks model causal relations. Here we consider the class of causal structures where all correlations between observed quantities are solely due to the influence from latent variables. We show that each model of this type imposes a certain signature on the observable covariance matrix in terms of a particular decomposition into positive semidefinite components. Read More

In this article, we investigate large sample properties of model selection procedures in a general Bayesian framework when a closed form expression of the marginal likelihood function is not available or a local asymptotic quadratic approximation of the log-likelihood function does not exist. Under appropriate identifiability assumptions on the true model, we provide sufficient conditions for a Bayesian model selection procedure to be consistent and obey the Occam's razor phenomenon, i.e. Read More

We consider the problems of compressed sensing and optimal denoising for signals $\mathbf{x_0}\in\mathbb{R}^N$ that are monotone, i.e., $\mathbf{x_0}(i+1) \geq \mathbf{x_0}(i)$, and sparsely varying, i. Read More

Approximations of Laplace-Beltrami operators on manifolds through graph Laplacians have become popular tools in data analysis and machine learning. These discretized operators usually depend on bandwidth parameters whose tuning remains a theoretical and practical problem. In this paper, we address this problem for the unnormalized graph Laplacian by establishing an oracle inequality that opens the door to a well-founded data-driven procedure for the bandwidth selection. Read More

Nearly all estimators in statistical prediction come with an associated tuning parameter, in one way or another. Common practice, given data, is to choose the tuning parameter value that minimizes a constructed estimate of the prediction error of the estimator; we focus on Stein's unbiased risk estimator, or SURE (Stein, 1981; Efron, 1986), which forms an unbiased estimate of the prediction error by augmenting the observed training error with an estimate of the degrees of freedom of the estimator. Parameter tuning via SURE minimization has been advocated by many authors, in a wide variety of problem settings, and in general, it is natural to ask: what is the prediction error of the SURE-tuned estimator? An obvious strategy would be to simply use the apparent error estimate as reported by SURE, i. Read More

For finite parameter spaces under finite loss, every Bayesian procedure derived from a prior with full support is admissible, and every admissible procedure is Bayes. This relationship begins to break down as we move to continuous parameter spaces. Under some regularity conditions, admissible procedures can be shown to be the limits of Bayesian procedures. Read More

We attempt to give a mathematically correct definition of outliers. Our approach is based on the distance between the last two order statistics and appears to be connected to the law of large numbers. Key words: outliers, law of large numbers, heavy tails, stability index. Read More