Adaptive p-values after cross-validation

We describe a way to construct hypothesis tests and confidence intervals after having used the Lasso for feature selection, allowing the regularization parameter to be chosen via an estimate of prediction error. Our estimate of prediction error is a slight variation on cross-validation. Using this variation, we are able to describe an appropriate selection event for choosing the parameter by cross-validation. Adjusting for this selection event, we derive a pivotal quantity with an asymptotically Unif(0,1) distribution, which can be used to test hypotheses or construct intervals. To enhance power, we consider the randomized Lasso with cross-validation. We derive a similar test statistic and develop an MCMC sampling scheme to construct valid post-selective confidence intervals empirically. Finally, we demonstrate via simulation that our procedure achieves high statistical power and FDR control, yielding results comparable to knockoffs (in simulations favorable to knockoffs).
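
As a minimal sketch (not the authors' implementation, and using an illustrative simulated design): fitting the Lasso with a cross-validated regularization parameter defines the selection event that the inference above conditions on; naive p-values computed on the selected model would ignore this selection.

import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 100, 50
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = 2.0                                # three true signals, purely illustrative
y = X @ beta + rng.standard_normal(n)

lasso = LassoCV(cv=5).fit(X, y)               # lambda chosen by (a form of) cross-validation
selected = np.flatnonzero(lasso.coef_ != 0)   # active set: the selection event to adjust for
print("chosen lambda:", lasso.alpha_, "selected features:", selected)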


Similar Publications

Continuous-time multi-state survival models can be used to describe health-related processes over time. In the presence of interval-censored times for transitions between the living states, the likelihood is constructed using transition probabilities. Models can be specified using parametric or semi-parametric shapes for the hazards.
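
As a hedged illustration of that construction (standard multi-state notation, not taken from the paper): for a subject observed in states s_0, s_1, ..., s_m at times t_0 < t_1 < ... < t_m, the interval-censored likelihood contribution is

\[
L \;=\; \prod_{j=1}^{m} P_{s_{j-1}\, s_j}(t_{j-1}, t_j),
\]

where P_{rs}(u, v) denotes the probability of occupying state s at time v given state r at time u, obtained from the transition hazards.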


This paper focuses on the multivariate linear mixed-effects model, including all the correlations between the random effects when the marginal residual terms are assumed uncorrelated and homoscedastic with possibly different standard deviations. The random effects covariance matrix is Cholesky factorized to directly estimate the variance components of these random effects. This strategy enables a consistent estimate of the random effects covariance matrix, which is generally poorly estimated when it is estimated as a whole (i.e., directly) using methods such as the EM algorithm.
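
A brief sketch of the idea behind that parameterization (standard notation, not necessarily the paper's): writing the random-effects covariance matrix as

\[
D \;=\; L L^{\top},
\]

with L lower triangular, guarantees that any estimate of D is positive semi-definite and lets the entries of L be treated as unconstrained parameters from which the variance components are recovered.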


In this paper, we study a novel approach for the estimation of quantiles when facing potential right censoring of the responses. Contrary to the existing literature on the subject, the adopted strategy of this paper is to tackle censoring at the very level of the loss function usually employed for the computation of quantiles, the so-called "check" function. For interpretation purposes, a simple comparison with the latter reveals how censoring is accounted for in the newly proposed loss function.
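
For reference, the standard (uncensored) check function that the proposed loss modifies is, in the usual notation,

\[
\rho_\tau(u) \;=\; u\,\bigl(\tau - \mathbf{1}\{u < 0\}\bigr),
\]

and the \tau-th conditional quantile minimizes E[\rho_\tau(Y - q)] over q; the loss proposed in the paper alters this criterion so that right censoring is handled directly at this level.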


Cross-validation is one of the most popular model selection methods in statistics and machine learning. Despite its wide applicability, traditional cross-validation methods tend to select overfitting models, unless the ratio between the training and testing sample sizes is much smaller than conventional choices. We argue that this overfitting tendency of cross-validation is due to ignoring the uncertainty in the testing sample.
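
As a small illustration of that point (a hypothetical simulated setup, not from the paper): comparing models only through the cross-validated point estimate of test error discards the fold-to-fold variability of that estimate, which is one way the uncertainty in the testing sample is lost.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_regression(n_samples=80, n_features=5, noise=10.0, random_state=1)
for degree in (1, 3):                          # a simple model and a more flexible one
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    mse = -cross_val_score(model, X, y, cv=10, scoring="neg_mean_squared_error")
    se = mse.std(ddof=1) / np.sqrt(len(mse))   # rough fold-to-fold standard error
    print(f"degree {degree}: CV error {mse.mean():.1f} +/- {se:.1f}")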


Particle filters are a popular and flexible class of numerical algorithms to solve a large class of nonlinear filtering problems. However, standard particle filters with importance weights have been shown to require a sample size that increases exponentially with the dimension D of the state space in order to achieve a certain performance, which precludes their use in very high-dimensional filtering problems. Here, we focus on the dynamic aspect of this curse of dimensionality (COD) in continuous time filtering, which is caused by the degeneracy of importance weights over time.
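
A minimal bootstrap particle filter for a one-dimensional linear-Gaussian model (an illustrative toy example, not the paper's setting), tracking the effective sample size to show how the importance weights degenerate:

import numpy as np

rng = np.random.default_rng(0)
T, N = 50, 500                                    # time steps and particles
x, ys = 0.0, []
for _ in range(T):                                # simulate x_t = 0.9 x_{t-1} + v_t, y_t = x_t + 0.5 w_t
    x = 0.9 * x + rng.standard_normal()
    ys.append(x + 0.5 * rng.standard_normal())

particles = rng.standard_normal(N)
w = np.full(N, 1.0 / N)
for y in ys:
    particles = 0.9 * particles + rng.standard_normal(N)   # propagate through the dynamics
    logw = np.log(w) - 0.5 * ((y - particles) / 0.5) ** 2  # update importance weights
    w = np.exp(logw - logw.max())
    w /= w.sum()
    ess = 1.0 / np.sum(w ** 2)                             # effective sample size
    if ess < N / 2:                                        # resample when the weights degenerate
        particles = particles[rng.choice(N, size=N, p=w)]
        w = np.full(N, 1.0 / N)
print("filtered mean at the final time:", np.sum(w * particles))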


Energy statistics are estimators of the energy distance that depend on the distances between observations. The idea behind energy statistics is to consider a statistical potential energy that would parallel Newton's gravitational potential energy. This statistical potential energy is zero if and only if a certain null hypothesis relating two distributions holds true.
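
For reference (the standard definition, not specific to this paper): for independent X, X' drawn from F and Y, Y' drawn from G, the energy distance is

\[
\mathcal{E}(F, G) \;=\; 2\,\mathbb{E}\|X - Y\| \;-\; \mathbb{E}\|X - X'\| \;-\; \mathbb{E}\|Y - Y'\|,
\]

which is nonnegative and equals zero if and only if F = G; this equality of distributions is the null hypothesis referred to above.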


Recent advances in bioinformatics have made high-throughput microbiome data widely available, and new statistical tools are required to maximize the information gained from these data. For example, analysis of high-dimensional microbiome data from designed experiments remains an open area in microbiome research. Contemporary analyses work on metrics that summarize collective properties of the microbiome, but such reductions preclude inference on the fine-scale effects of environmental stimuli on individual microbial taxa.


In social and economic studies many of the collected variables are measured on a nominal scale, often with a large number of categories. The definition of categories is usually not unambiguous and different classification schemes using either a finer or a coarser grid are possible. Categorisation has an impact when such a variable is included as a covariate in a regression model: a grid that is too fine results in imprecise estimates of the corresponding effects, whereas a grid that is too coarse misses important effects, leading to biased effect estimates and poor predictive performance.


Thermodynamic integration (TI) for computing marginal likelihoods is based on an annealing path, in inverse temperature, from the prior to the posterior distribution. In many cases, the resulting estimator suffers from high variability, which particularly stems from the prior regime. When comparing complex models that differ in a comparatively small number of parameters, intrinsic errors from sampling fluctuations may outweigh the differences in the log marginal likelihood estimates.
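
For context, a compact form of the thermodynamic integration identity (standard notation): with the power posterior p_t(\theta) \propto p(D \mid \theta)^{t}\, p(\theta) for t \in [0, 1],

\[
\log p(D) \;=\; \int_0^1 \mathbb{E}_{\theta \sim p_t}\bigl[\log p(D \mid \theta)\bigr]\, dt,
\]

so the path runs from the prior (t = 0) to the posterior (t = 1), and the variability noted above arises largely from samples near the prior end of the path.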


Sufficient dimension reduction (SDR) remains an active research field for high-dimensional data. It aims to estimate the central subspace (CS) without making distributional assumptions. To overcome the large-$p$-small-$n$ problem, we propose a new approach for SDR.