Statistics - Theory Publications (50)



We obtain the optimal Bayesian minimax rate for the unconstrained large covariance matrix of a multivariate normal sample with mean zero, when both the sample size, n, and the dimension, p, of the covariance matrix tend to infinity. Traditionally the posterior convergence rate is used to compare the frequentist asymptotic performance of priors, but defining optimality in terms of it is elusive. We propose a new decision-theoretic framework for prior selection and define the Bayesian minimax rate. Read More

We propose the kl-UCB++ algorithm for regret minimization in stochastic bandit models with exponential families of distributions. We prove that it is simultaneously asymptotically optimal (in the sense of Lai and Robbins' lower bound) and minimax optimal. This is the first algorithm proved to enjoy these two properties at the same time. Read More
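The core computation in kl-UCB-style algorithms is inverting a KL divergence to get an upper confidence index. The sketch below does this for Bernoulli arms by bisection; the function names and the exploration term are illustrative, and this is generic kl-UCB, not the specific kl-UCB++ exploration schedule.

```python
import math

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean, pulls, exploration):
    """Largest q >= mean with pulls * KL(mean, q) <= exploration, found by bisection."""
    lo, hi = mean, 1.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if pulls * kl_bernoulli(mean, mid) <= exploration:
            lo = mid
        else:
            hi = mid
    return lo

# The index shrinks toward the empirical mean as the arm is pulled more often.
idx = kl_ucb_index(0.5, pulls=100, exploration=math.log(1000))
```

At each round the algorithm pulls the arm with the largest such index, so under-sampled arms keep an optimistic bonus.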

This paper argues that a class of Riemannian metrics, called warped metrics, plays a fundamental role in statistical problems involving location-scale models. The paper reports three new results: (i) the Rao-Fisher metric of any location-scale model is a warped metric, provided that this model satisfies a natural invariance condition, (ii) the analytic expression of the sectional curvature of this metric, (iii) the exact analytic solution of the geodesic equation of this metric. The paper applies these new results to several examples of interest, where it shows that warped metrics turn location-scale models into complete Riemannian manifolds of negative sectional curvature. Read More

In big data analysis for detecting rare and weak signals among $n$ features, grouping-test methods such as the Higher Criticism test (HC), the Berk-Jones test (B-J), and the $\phi$-divergence test share similar asymptotic optimality as $n \rightarrow \infty$. However, in practical data analysis $n$ is frequently small, or at most moderately large. In order to properly apply these optimal tests and choose among them wisely in practical studies, it is important to know how to obtain their p-values and statistical power. Read More
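For concreteness, here is a minimal sketch of the Higher Criticism statistic in its standard form (maximal standardized gap between sorted p-values and uniform quantiles over the lower fraction `alpha0`); the function name and the toy p-value lists are our own illustration.

```python
import math

def higher_criticism(pvals, alpha0=0.5):
    """HC statistic over the smallest alpha0-fraction of sorted p-values."""
    n = len(pvals)
    hc = -float("inf")
    for i, p in enumerate(sorted(pvals), start=1):
        if i / n > alpha0:
            break
        denom = math.sqrt(p * (1 - p))
        if denom > 0:
            hc = max(hc, math.sqrt(n) * (i / n - p) / denom)
    return hc

# A handful of very small p-values sticks out against a uniform background.
hc_null = higher_criticism([(i - 0.5) / 100 for i in range(1, 101)])
hc_signal = higher_criticism([1e-4] * 5 + [0.5] * 95)
```

Under the null of uniform p-values the statistic stays small, while a few rare, strong signals inflate it sharply.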

In this paper, we propose to construct uniform confidence sets by bootstrapping the debiased kernel density estimator (for density estimation) and the debiased local polynomial regression estimator (for regression analysis). The idea of using a debiased estimator was first introduced in Calonico et al. (2015), who construct a pointwise confidence set by explicitly estimating stochastic variations. Read More

We consider the recovery of a low-rank $M \times N$ matrix $S$ from its noisy observation $\tilde{S}$ in two different regimes. Under the assumption that $M$ is comparable to $N$, we propose two optimal estimators for $S$. Our analysis relies on the local behavior of large-dimensional rectangular matrices with finite-rank perturbation. Read More

The multi-armed restless bandit problem is studied in the case where the pay-offs are not necessarily independent over time nor across the arms. Even though this version of the problem provides a more realistic model for most real-world applications, it cannot be optimally solved in practice since it is known to be PSPACE-hard. The objective of this paper is to characterize special sub-classes of the problem where good approximate solutions can be found using tractable approaches. Read More

Principal component analysis (PCA) is fundamental to statistical machine learning. It extracts latent principal factors that capture the most variation in the data. When data are stored across multiple machines, however, communication cost can prohibit the computation of PCA in a central location, and distributed algorithms for PCA are thus needed. Read More
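A minimal sketch of one simple distributed scheme, in which each machine communicates only its local second-moment matrix and the central node eigendecomposes their aggregate; this is an illustration of the communication pattern, not necessarily the paper's algorithm, and the data-generating setup is invented.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data with one dominant direction, split across three "machines".
true_dir = np.array([1.0, 0.0, 0.0])
data = rng.normal(size=(600, 3)) + 3.0 * rng.normal(size=(600, 1)) * true_dir
chunks = np.array_split(data, 3)

# Each machine sends only a p x p matrix, not its raw n_i x p data.
local_moments = [c.T @ c / len(c) for c in chunks]
avg_moment = sum(len(c) * m for c, m in zip(chunks, local_moments)) / len(data)

# The central node recovers the principal components from the aggregate.
eigvals, eigvecs = np.linalg.eigh(avg_moment)
top_pc = eigvecs[:, -1]  # eigenvector of the largest eigenvalue
alignment = abs(top_pc @ true_dir)
```

Communication here is O(p^2) per machine regardless of local sample size, which is exactly the trade-off that motivates distributed PCA.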

We test three common information criteria (IC) for selecting the order of a Hawkes process with an intensity kernel that can be expressed as a mixture of exponential terms. These processes find application in high-frequency financial data modelling. The information criteria are Akaike's information criterion (AIC), the Bayesian information criterion (BIC) and the Hannan-Quinn criterion (HQ). Read More
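The three criteria compared differ only in how they penalize the number of parameters $k$ given a maximized log-likelihood; a minimal sketch (the function name and numbers are illustrative):

```python
import math

def information_criteria(loglik, k, n):
    """AIC, BIC and Hannan-Quinn for a model with k parameters fit on n points."""
    aic = -2 * loglik + 2 * k
    bic = -2 * loglik + k * math.log(n)
    hq = -2 * loglik + 2 * k * math.log(math.log(n))
    return {"AIC": aic, "BIC": bic, "HQ": hq}

ic = information_criteria(loglik=-100.0, k=3, n=1000)
```

For any reasonably large n, BIC penalizes extra exponential terms most heavily and AIC least, with HQ in between, which is why the three can select different Hawkes orders on the same data.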

We extend Fano's inequality, which controls the average probability of (disjoint) events in terms of the average of some Kullback-Leibler divergences, to work with arbitrary [0,1]-valued random variables. Our simple two-step methodology is general enough to cover the case of an arbitrary (possibly continuously infinite) family of distributions as well as [0,1]-valued random variables not necessarily summing up to 1. Several novel applications are provided, in which the consideration of random variables is particularly handy. Read More

This paper studies the nonparametric modal regression problem systematically from a statistical learning view. Originally motivated by pursuing a theoretical understanding of the maximum correntropy criterion based regression (MCCR), our study reveals that MCCR with a tending-to-zero scale parameter is essentially modal regression. We show that the nonparametric modal regression problem can be approached via classical empirical risk minimization. Read More

An important property of statistical estimators is qualitative robustness, that is, small changes in the distribution of the data result only in small changes in the distribution of the estimator. Moreover, in practice, the distribution of the data is commonly unknown; therefore bootstrap approximations can be used to approximate the distribution of the estimator. Hence qualitative robustness of the statistical estimator under the bootstrap approximation is a desirable property. Read More

We determine the joint limiting distribution of adjacent spacings around a central, intermediate, or an extreme order statistic $X_{k:n}$ of a random sample of size $n$ from a continuous distribution $F$. For central and intermediate cases, normalized spacings in the left and right neighborhoods are asymptotically i.i.d. Read More

A goodness-of-fit test for discriminating between two tail distributions using higher order statistics is proposed. The consistency of the proposed test is proved under two different alternatives. We do not assume that the corresponding distribution function belongs to a maximum domain of attraction. Read More

Many applications require stochastic processes specified on two- or higher-dimensional domains; spatial or spatial-temporal modelling, for example. In these applications it is attractive, for conceptual simplicity and computational tractability, to propose a covariance function that is separable; e.g. Read More

The problem of population recovery refers to estimating a distribution based on incomplete or corrupted samples. Consider a random poll of sample size $n$ conducted on a population of individuals, where each pollee is asked to answer $d$ binary questions. We consider one of the two polling impediments: (a) in lossy population recovery, a pollee may skip each question with probability $\epsilon$; (b) in noisy population recovery, a pollee may lie on each question with probability $\epsilon$. Read More
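The lossy observation model described above is easy to make concrete: each answer is independently replaced by a missing value with probability $\epsilon$. The sketch below simulates that corruption channel (the poll setup and variable names are invented for illustration).

```python
import random

random.seed(1)
d, n, eps = 5, 10000, 0.3
true_answer = [1, 0, 1, 1, 0]  # one fixed d-bit answer vector, for simplicity

polls = []
for _ in range(n):
    # In the lossy model each question is skipped independently with prob eps.
    polls.append([a if random.random() > eps else None for a in true_answer])

# Fraction of observed (non-skipped) answers per question, approx 1 - eps each.
observed = [sum(p[j] is not None for p in polls) / n for j in range(d)]
```

The recovery problem is to estimate the distribution over answer vectors from such partially erased (or, in the noisy variant, flipped) samples.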

We consider the problem of finding a proper confidence interval for the mean based on a single observation from a normal distribution with both mean and variance unknown. Portnoy (2017) characterizes the scale-sign invariant rules and shows that the Hunt-Stein construction provides a randomized invariant rule that improves on any given randomized rule in the sense that it has greater minimal coverage among all procedures with a fixed expected length. Mathematical results here provide a specific mixture of two non-randomized invariant rules that achieves minimax optimality. Read More

In this paper we present an objective approach to change point analysis. In particular, we look at the problem from two perspectives. The first focuses on the definition of an objective prior when the number of change points is known a priori. Read More

How many samples are sufficient to guarantee that the eigenvectors and eigenvalues of the sample covariance matrix are close to those of the actual covariance matrix? For a wide family of distributions, including distributions with finite second moment and distributions supported in a centered Euclidean ball, we prove that the inner product between eigenvectors of the sample and actual covariance matrices decreases proportionally to the respective eigenvalue distance. Our findings imply non-asymptotic concentration bounds for eigenvectors, eigenspaces, and eigenvalues. They also provide conditions for distinguishing principal components based on a constant number of samples. Read More
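The phenomenon is easy to observe numerically: with well-separated population eigenvalues, a moderate number of samples already pins down the leading eigenvector. A small simulation sketch (the dimensions, spectrum, and seed are arbitrary choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(42)
p, n = 5, 2000
# Population covariance with well-separated eigenvalues.
true_eigvals = np.array([10.0, 5.0, 2.0, 1.0, 0.5])
Q, _ = np.linalg.qr(rng.normal(size=(p, p)))
sigma = Q @ np.diag(true_eigvals) @ Q.T

X = rng.multivariate_normal(np.zeros(p), sigma, size=n)
sample_cov = X.T @ X / n

# Compare leading eigenvectors of the sample and population covariances.
_, vec_true = np.linalg.eigh(sigma)
_, vec_samp = np.linalg.eigh(sample_cov)
alignment = abs(vec_true[:, -1] @ vec_samp[:, -1])
```

Shrinking the gap between the top two population eigenvalues degrades the alignment, which is the eigenvalue-distance dependence the abstract describes.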

Estimation of the intensity of a point process is considered within a nonparametric framework. The intensity measure is unknown and depends on covariates, possibly many more than the observed number of jumps. Only a single trajectory of the counting process is observed. Read More

This work is closely related to the theories of set estimation and manifold estimation. Our object of interest is a, possibly lower-dimensional, compact set $S \subset {\mathbb R}^d$. The general aim is to identify (via stochastic procedures) some qualitative or quantitative features of $S$, of geometric or topological character. Read More

We study trend filtering, a relatively recent method for univariate nonparametric regression. For a given integer $r \geq 1$, the trend filtering estimator of order $r$ is defined as the minimizer of the sum of squared errors when we constrain (or penalize) the sum of the absolute discrete derivatives of order $r$ over the input points. For $r = 1$, the estimator reduces to total variation regularization which has received much attention in the statistics and image processing literature. Read More
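The objective being minimized is standard to write down: a least-squares fit plus an $\ell_1$ penalty on order-$r$ discrete derivatives. The sketch below builds the difference operator and evaluates the criterion; actually minimizing it requires a convex solver, which we omit.

```python
import numpy as np

def difference_operator(n, r):
    """Order-r discrete difference matrix D of shape (n - r, n)."""
    D = np.eye(n)
    for _ in range(r):
        D = np.diff(D, axis=0)  # each pass applies first differences to the rows
    return D

def trend_filter_objective(theta, y, lam, r):
    """Penalized least-squares criterion minimized by the trend filtering estimate."""
    D = difference_operator(len(y), r)
    return 0.5 * np.sum((y - theta) ** 2) + lam * np.sum(np.abs(D @ theta))
```

For $r = 1$ the penalty is the total variation of theta, recovering the special case mentioned above; for $r = 2$ a linear fit incurs zero penalty, which is why the solutions are piecewise polynomial.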

Gaussian mixture models are widely used in Statistics. A fundamental aspect of these distributions is the study of the local maxima of the density, or modes. In particular, the number of modes that can arise in a mixture of $k$ Gaussians in $d$ dimensions remains unknown in the general case. Read More

We prove a new concentration inequality for the excess risk of an M-estimator in least-squares regression with random design and heteroscedastic noise. This kind of result is a central tool in modern model selection theory, as well as in recent achievements concerning the behavior of regularized estimators such as LASSO, group LASSO and SLOPE. Read More

Power spectrum estimation is an important tool in many applications, such as the whitening of noise. The popular multitaper method enjoys significant success, but fails for short signals with few samples. We propose a statistical model where a signal is given by a random linear combination of fixed, yet unknown, stochastic sources. Read More

It is shown that Newton's inequalities and the related Maclaurin's inequalities provide several refinements of the fundamental Arithmetic mean - Geometric mean - Harmonic mean inequality in terms of the means and variance of positive real numbers. We also obtain some inequalities involving third and fourth central moments of real numbers. Read More
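The chain being refined is AM >= GM >= HM, and Maclaurin's inequalities interpolate between AM and GM via normalized elementary symmetric polynomials. A quick numerical check on the numbers 1, 2, 4 (our own worked example):

```python
import math

def am_gm_hm(xs):
    """Arithmetic, geometric and harmonic means of positive reals."""
    n = len(xs)
    am = sum(xs) / n
    gm = math.exp(sum(math.log(x) for x in xs) / n)
    hm = n / sum(1.0 / x for x in xs)
    return am, gm, hm

am, gm, hm = am_gm_hm([1.0, 2.0, 4.0])

# Maclaurin's chain p_1 >= p_2^(1/2) >= ... >= p_n^(1/n) interpolates AM >= GM.
e1, e2, e3 = 7.0, 14.0, 8.0          # elementary symmetric polynomials of 1, 2, 4
p1, p2, p3 = e1 / 3, e2 / 3, e3 / 1  # p_k = e_k / binom(3, k)
```

Here p_1 = 7/3 is the AM, p_3^(1/3) = 2 is the GM, and p_2^(1/2) ≈ 2.16 sits strictly between them.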

This paper studies robust regression in the settings of Huber's $\epsilon$-contamination models. We consider estimators that are maximizers of multivariate regression depth functions. These estimators are shown to achieve minimax rates in the settings of $\epsilon$-contamination models for various regression problems including nonparametric regression, sparse linear regression, reduced rank regression, etc. Read More

Multiple root estimation problems in statistical inference arise in many contexts in the literature. In the context of maximum likelihood estimation, the existence of multiple roots causes uncertainty in the computation of maximum likelihood estimators using hill-climbing algorithms, and consequent difficulties in the resulting statistical inference. In this paper, we study the multiple roots phenomenon in maximum likelihood estimation for factor analysis. Read More

Let $G = A_n$ be the graph corresponding to the graphical model of nearest neighbour interaction in a Gaussian character. We study Natural Exponential Families (NEF) of Wishart distributions on the convex cones $Q_G$ and $P_G$, where $P_G$ is the cone of positive definite real symmetric matrices with obligatory zeros prescribed by $G$, and $Q_G$ is the dual cone of $P_G$. The Wishart NEF that we construct include Wishart distributions considered earlier by Lauritzen (1996) and Letac and Massam (2007) for models based on decomposable graphs. Read More

We analyze the problem of maximum likelihood estimation for Gaussian distributions that are multivariate totally positive of order two (MTP2). By exploiting connections to phylogenetics and single-linkage clustering, we give a simple proof that the maximum likelihood estimator (MLE) for such distributions exists based on at least 2 observations, irrespective of the underlying dimension. Slawski and Hein, who first proved this result, also provided empirical evidence showing that the MTP2 constraint serves as an implicit regularizer and leads to sparsity in the estimated inverse covariance matrix, determining what we name the ML graph. Read More

Linear structural equation models relate the components of a random vector using linear interdependencies and Gaussian noise. Each such model can be naturally associated with a mixed graph whose vertices correspond to the components of the random vector. The graph contains directed edges that represent the linear relationships between components, and bidirected edges that encode unobserved confounding. Read More

We consider composite-composite testing problems for the expectation in the Gaussian sequence model where the null hypothesis corresponds to a convex subset $\mathcal{C}$ of $\mathbb{R}^d$. We adopt a minimax point of view and our primary objective is to describe the smallest Euclidean distance between the null and alternative hypotheses such that there is a test with small total error probability. In particular, we focus on the dependence of this distance on the dimension $d$ and the sample size/variance parameter $n$ giving rise to the minimax separation rate. Read More

We investigate the frequentist properties of Bayesian procedures for estimation based on the horseshoe prior in the sparse multivariate normal means model. Previous theoretical results assumed that the sparsity level, that is, the number of signals, was known. We drop this assumption and characterize the behavior of the maximum marginal likelihood estimator (MMLE) of a key parameter of the horseshoe prior. Read More

The emergent field of probabilistic numerics has thus far lacked rigorous statistical principles. This paper establishes Bayesian probabilistic numerical methods as those which can be cast as solutions to certain Bayesian inverse problems, albeit problems that are non-standard. This allows us to establish general conditions under which Bayesian probabilistic numerical methods are well-defined, encompassing both non-linear and non-Gaussian models. Read More

We study the relationship between information- and estimation-theoretic quantities in time-evolving systems. We focus on the Fokker-Planck channel defined by a general stochastic differential equation, and show that the time derivatives of entropy, KL divergence, and mutual information are characterized by estimation-theoretic quantities involving an appropriate generalization of the Fisher information. Our results vastly extend De Bruijn's identity and the classical I-MMSE relation. Read More
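For reference, the two classical results being generalized can be stated as follows, with $Z \sim \mathcal{N}(0,1)$ independent of $X$, $h$ the differential entropy, $J$ the Fisher information, and $\mathrm{mmse}$ the minimum mean-square error (the notation here is standard, not necessarily the paper's):

```latex
% De Bruijn's identity: entropy growth along the Gaussian channel
% equals half the Fisher information.
\frac{\mathrm{d}}{\mathrm{d}t}\, h\!\left(X + \sqrt{t}\,Z\right)
  = \tfrac{1}{2}\, J\!\left(X + \sqrt{t}\,Z\right)

% I-MMSE relation: mutual information grows at half the MMSE.
\frac{\mathrm{d}}{\mathrm{d}\,\mathrm{snr}}\,
  I\!\left(X;\ \sqrt{\mathrm{snr}}\,X + Z\right)
  = \tfrac{1}{2}\,\mathrm{mmse}(\mathrm{snr})
```

The abstract's claim is that analogues of these identities hold along the flow of a general Fokker-Planck equation, not only for the additive Gaussian channel.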

This paper shows that the Conditional Quantile Treatment Effect on the Treated can be identified using a combination of (i) a conditional Distributional Difference in Differences assumption and (ii) an assumption on the conditional dependence between the change in untreated potential outcomes and the initial level of untreated potential outcomes for the treated group. The second assumption recovers the unknown dependence from the observed dependence for the untreated group. We also consider estimation and inference in the case where all of the covariates are discrete. Read More

This paper considers the problem of inliers and empty cells and the resulting issue of relative inefficiency in estimation under pure samples from a discrete population when the sample size is small. Many minimum divergence estimators in the $S$-divergence family, although possessing very strong outlier stability properties, often have very poor small sample efficiency in the presence of inliers and some are not even defined in the presence of a single empty cell; this limits the practical applicability of these estimators, in spite of their otherwise sound robustness properties and high asymptotic efficiency. Here, we will study a penalized version of the $S$-divergences such that the resulting minimum divergence estimators are free from these issues without altering their robustness properties and asymptotic efficiencies. Read More

Bayesian networks, or directed acyclic graph (DAG) models, are widely used to represent complex causal systems. Since the basic task of learning a Bayesian network from data is NP-hard, a standard approach is greedy search over the space of DAGs or Markov equivalent DAGs. Since the space of DAGs on $p$ nodes and the associated space of Markov equivalence classes are both much larger than the space of permutations, it is desirable to consider permutation-based searches. Read More

Data with hierarchical structure arise in many fields. Estimating global effect sizes from nested data, and testing effects against global null hypotheses, is, however, more challenging than in the traditional setting of independent and identically distributed data. In this paper, we review statistical approaches to deal with nested data following either a fixed-effect or a random-effects model. Read More

We consider a point cloud $X_n := \{ x_1, \dots, x_n \}$ uniformly distributed on the flat torus $\mathbb{T}^d : = \mathbb{R}^d / \mathbb{Z}^d $, and construct a geometric graph on the cloud by connecting points that are within distance $\epsilon$ of each other. We let $\mathcal{P}(X_n)$ be the space of probability measures on $X_n$ and endow it with a discrete Wasserstein distance $W_n$ as defined by Maas. We show that as long as $\epsilon= \epsilon_n$ decays towards zero slower than an explicit rate depending on the level of uniformity of $X_n$, then the space $(\mathcal{P}(X_n), W_n)$ converges in the Gromov-Hausdorff sense towards the space of probability measures on $\mathbb{T}^d$ endowed with the Wasserstein distance. Read More

This paper develops a method to construct uniform confidence bands for a nonparametric regression function where a predictor variable is subject to a measurement error. We allow for the distribution of the measurement error to be unknown, but assume that there is an independent sample from the measurement error distribution. The sample from the measurement error distribution need not be independent from the sample on response and predictor variables. Read More

In this paper, we consider a high-dimensional linear regression model with fixed design. We present a unified analysis of the performance guarantees of exponential weighted aggregation and penalized estimators with a general class of priors that encourage objects conforming to some notion of simplicity/complexity. More precisely, we show that these two estimators satisfy sharp oracle inequalities for prediction ensuring their good theoretical performances. Read More

Determining risk contributions by unit exposures to portfolio-wide economic capital is an important task in financial risk management. Despite its practical demands, computation of risk contributions is challenging for most risk models because it often requires rare-event simulation. In this paper, we address the problem of estimating risk contributions when the total risk is measured by Value-at-Risk (VaR). Read More

The seminal paper of Caponnetto and de Vito (2007) provides minimax-optimal rates for kernel ridge regression in a very general setting. Its proof, however, contains an error in its bound on the effective dimensionality. In this note, we explain the mistake, provide a correct bound, and show that the main theorem remains true. Read More

There has been considerable interest across several fields in methods that reduce the problem of learning good treatment assignment policies to the problem of accurate policy evaluation. Given a class of candidate policies, these methods first effectively evaluate each policy individually, and then learn a policy by optimizing the estimated value function; such approaches are guaranteed to be risk-consistent whenever the policy value estimates are uniformly consistent. However, despite the wealth of proposed methods, the literature remains largely silent on questions of statistical efficiency: there are only limited results characterizing which policy evaluation strategies lead to better learned policies than others, or what the optimal policy evaluation strategies are. Read More

In this paper, we introduce the notion of DTM-signature, a measure on $\mathbb{R}_+$ that can be associated to any metric-measure space. This signature is based on the distance to a measure (DTM) introduced by Chazal, Cohen-Steiner and Mérigot. It leads to a pseudo-metric between metric-measure spaces, upper-bounded by the Gromov-Wasserstein distance. Read More

In nature and societies, power laws are present ubiquitously, and it is therefore important to investigate their mathematical characteristics in the recent era of big data. In this paper we prove that the superposition of non-identical stochastic processes with power laws converges in density to a unique stable distribution. This property can be used to explain the universality of stable laws, in the sense that the sums of the logarithmic returns of non-identical stock price fluctuations follow stable distributions. Read More

We provide a new simple characterization of the multivariate generalized Laplace distribution. In particular, our characterization implies that the product of a Gaussian matrix with independent and identically distributed columns and an independent isotropic Gaussian vector follows a symmetric multivariate generalized Laplace distribution. Read More

Stochastic Neighbor Embedding and its variants are widely used dimensionality reduction techniques -- despite their popularity, no theoretical results are known. We prove that the optimal SNE embedding of well-separated clusters from high dimensions to any Euclidean space R^d manages to successfully separate the clusters in a quantitative way. The result also applies to a larger family of methods including a variant of t-SNE. Read More

This paper presents a clustering method that allows for rigorous statistical error control similar to a statistical test. We develop estimators for both the unknown number of clusters and the clusters themselves. The estimators depend on a tuning parameter alpha which is similar to the significance level of a statistical hypothesis test. Read More