Statistics - Applications Publications (50)



Competing risk analysis considers event times due to multiple causes, or of more than one event type. Commonly used regression models for such data include 1) the cause-specific hazards model, which focuses on modeling one type of event while acknowledging other event types simultaneously; and 2) the subdistribution hazards model, which links the covariate effects directly to the cumulative incidence function. Their use, and in particular their statistical properties, in the presence of high-dimensional predictors are largely unexplored.
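
As a self-contained illustration of the quantity the subdistribution model targets, the sketch below computes a nonparametric (Aalen-Johansen-type) cumulative incidence function for two competing causes. The data are made up, ties are handled one observation at a time, and this is a simplification, not the paper's method.

```python
import numpy as np

def cumulative_incidence(times, causes, t_grid):
    """Nonparametric cumulative incidence per cause.

    times  : event/censoring times; causes : 0 = censored, 1/2/... = cause.
    Returns a dict mapping each cause to its CIF evaluated on t_grid.
    """
    order = np.argsort(times)
    times, causes = np.asarray(times)[order], np.asarray(causes)[order]
    cif = {k: np.zeros(len(t_grid)) for k in set(causes) if k != 0}
    surv = 1.0                 # overall (all-cause) survival just before t
    at_risk = len(times)
    for t, c in zip(times, causes):
        if c != 0:
            # CIF jump for cause c: S(t-) * (events of cause c / at risk)
            cif[c][t_grid >= t] += surv / at_risk
        surv *= 1.0 - (c != 0) / at_risk   # Kaplan-Meier update, any cause
        at_risk -= 1
    return cif

t_grid = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
cif = cumulative_incidence([1, 2, 3, 4, 5], [1, 2, 0, 1, 2], t_grid)
```

Each cause-specific curve is nondecreasing and the curves jointly account for at most the whole probability mass, which is the property the subdistribution hazards model exploits.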

Hypothesis testing in the linear regression model is a fundamental statistical problem. We consider linear regression in the high-dimensional regime where the number of parameters exceeds the number of samples ($p > n$) and assume that the high-dimensional parameter vector is $s_0$-sparse. We develop a general and flexible $\ell_\infty$ projection statistic for hypothesis testing in this model.

The missing phase problem in X-ray crystallography is commonly solved using the technique of molecular replacement, which borrows phases from a previously solved homologous structure, and appends them to the measured Fourier magnitudes of the diffraction patterns of the unknown structure. More recently, molecular replacement has been proposed for solving the missing orthogonal matrices problem arising in Kam's autocorrelation analysis for single particle reconstruction using X-ray free electron lasers and cryo-EM. In classical molecular replacement, it is common to estimate the magnitudes of the unknown structure as twice the measured magnitudes minus the magnitudes of the homologous structure, a procedure known as "twicing".
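
The twicing update itself is a one-liner; the sketch below applies it to made-up magnitude values (the clipping at zero is an added safeguard for the sketch, since magnitudes cannot be negative, and is not part of the classical formula).

```python
import numpy as np

# Hypothetical Fourier magnitudes on a tiny grid (illustrative values only).
F_measured   = np.array([10.0, 8.0, 6.5])   # |F| measured for the unknown structure
F_homologous = np.array([9.0, 8.5, 6.0])    # |F| of the solved homologous structure

# Twicing: twice the measured magnitudes minus the homologous ones,
# clipped at zero so the estimate stays a valid magnitude.
F_twiced = np.clip(2.0 * F_measured - F_homologous, 0.0, None)
```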

We describe the Bedside Patient Rescue (BPR) project, the goal of which is risk prediction of adverse events for non-ICU patients using ~200 variables (vitals, lab results, assessments, ...

In national accounts, relations between industries are analyzed using input-output tables. In the Czech Republic these tables are compiled only once every five years; for the remaining years the tables must be estimated.

We reconsider the classic problem of recovering exogenous variation from an endogenous regressor. Two-stage least squares recovers the exogenous variation by presuming the existence of an instrumental variable. We instead rely on the assumption that there is a positive measure of observations of the regressor that are exogenous, say as the result of a temporary natural experiment.

This paper is concerned with the nonparametric item response theory (NIRT) for estimating item characteristic curves (ICCs) and latent abilities of examinees on educational and psychological tests. In contrast to parametric models, NIRT models can estimate various forms of ICCs under mild shape restrictions, such as the constraints of monotone homogeneity and double monotonicity. However, NIRT models frequently suffer from estimation instability because of the great flexibility of nonparametric ICCs, especially when there is only a small amount of item-response data.

In many real problems, dependence structures more general than exchangeability are required. For instance, in some settings partial exchangeability is a more reasonable assumption. For this reason, vectors of dependent Bayesian nonparametric priors have recently gained popularity.

The present letter to the editor is one in a series of publications discussing the formulation of hypotheses (propositions) for the evaluation of strength of forensic evidence. In particular, the discussion focuses on the issue of what information may be used to define the relevant population specified as part of the different-speaker hypothesis in forensic voice comparison. The previous publications in the series are: Hicks et al. ...

This study aims to investigate the effects of violations of the sphericity assumption on Type I error rates for different methodical approaches of repeated measures analysis using a simulation approach. In contrast to previous simulation studies on this topic, up to nine measurement occasions were considered. Therefore, two populations representing the conditions of a violation vs. ...

The share of wind energy in total installed power capacity has grown rapidly in recent years around the world. Producing accurate and reliable forecasts of wind power production, together with a quantification of the uncertainty, is essential to optimally integrate wind energy into power systems. We build spatio-temporal models for wind power generation and obtain full probabilistic forecasts from 15 minutes to 5 hours ahead.

The problem of estimating trend and seasonal variation in time-series data has been studied over several decades, although mostly using single time series. This paper studies the problem of estimating these components from functional data, i.e. ...
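
As a toy version of the single-series case (the paper's functional-data setting is more general), a classical additive decomposition can be sketched as a centered moving-average trend plus a period-wise averaged seasonal component; the boundary handling here is deliberately crude.

```python
import numpy as np

def decompose(y, period):
    """Classical additive decomposition sketch: moving-average trend plus
    a mean-zero seasonal pattern averaged by phase."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Moving average of window `period` as the trend estimate
    # (edge effects at the boundaries are ignored in this sketch).
    trend = np.convolve(y, np.ones(period) / period, mode="same")
    # Seasonal component: average the detrended series within each phase,
    # then center it so the seasonal effects sum to zero.
    detrended = y - trend
    seasonal = np.array([detrended[i::period].mean() for i in range(period)])
    seasonal -= seasonal.mean()
    return trend, np.tile(seasonal, n // period + 1)[:n]

# Synthetic series: a pure period-2 oscillation, no trend.
y = np.tile([1.0, -1.0], 10)
trend, seas = decompose(y, period=2)
```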

It is becoming increasingly clear that complex interactions among genes and environmental factors play crucial roles in triggering complex diseases. Thus, understanding such interactions is vital, which is possible only through statistical models that adequately account for such intricate, albeit unknown, dependence structures. Bhattacharya & Bhattacharya (2016b) attempt such modeling, relating finite mixtures composed of Dirichlet processes that represent an unknown number of genetic sub-populations through a hierarchical matrix-normal structure that incorporates gene-gene interactions, and possible mutations, induced by environmental variables.

The "Planning in the Early Medieval Landscape" project (PEML) ...

In [16], a new family of vector-valued risk measures called multivariate expectiles is introduced. In this paper, we focus on the asymptotic behavior of these measures in a multivariate regular variations context. For models with equivalent tails, we propose an estimator of these multivariate asymptotic expectiles, in the Fréchet attraction domain case, with asymptotic independence, or in the comonotonic case.

Due to freely available, tailored software, Bayesian statistics is fast becoming the dominant paradigm in archaeological chronology construction. Such software provides users with powerful tools for Bayesian inference for chronological models with little need to undertake formal study of statistical modelling or computer programming. This runs the risk that the software is reduced to the status of a black box, which is not sensible given the power and complexity of the modelling tools it implements.

This paper proposes a new method to estimate dynamic panel data models with spatially dependent errors that allows for known/unknown group-specific patterns of slope heterogeneity. Analysis of this model is conducted in the framework of composite quasi-likelihood (CL) maximization. The proposed CL estimator is robust against some misspecification of the unobserved individual/group-specific fixed effects.

We propose a simple stochastic model for the dynamics of a limit order book, extending the recent work of Cont and de Larrard (2013), where the price dynamics are endogenous, resulting from market transactions. We also show that the conditional diffusion limit of the price process is the so-called Brownian meander.

The rates of respiratory prescriptions vary by GP surgery across Scotland, suggesting there are sizeable health inequalities in respiratory ill health across the country. The aim of this paper is to estimate the magnitude, spatial pattern and drivers of this spatial variation. Monthly data on respiratory prescriptions are available at the GP surgery level, which creates an interesting methodological challenge as these data are not the classical geostatistical, areal unit or point process data types.

Objective: We investigated the influence of risk of bias judgments from Cochrane reviews for sequence generation, allocation concealment and blinding on between-trial heterogeneity. Study Design and Setting: Bayesian hierarchical models were fitted to binary data from 117 meta-analyses, to estimate the ratio $\lambda$ by which heterogeneity changes for trials at high/unclear risk of bias, compared to trials at low risk of bias. We estimated the proportion of between-trial heterogeneity in each meta-analysis that could be explained by the bias associated with specific design characteristics.

We present a method to estimate a multivariate Gaussian distribution of diffusion tensor features in a set of brain regions based on a small sample of healthy individuals, and use this distribution to identify imaging abnormalities in subjects with mild traumatic brain injury. The multivariate model incorporates a priori knowledge in the form of a neighborhood graph imposed on the precision matrix, which models brain region interactions, and an additional $L_1$ sparsity constraint. The model is then estimated using the graphical LASSO algorithm and the Mahalanobis distance of healthy and TBI subjects to the distribution mean is used to evaluate the discriminatory power of the model.
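
A stripped-down sketch of the distance computation on synthetic data: here the precision matrix is simply the inverse sample covariance of simulated "healthy" features, whereas the paper's graphical-LASSO fit with a neighborhood-graph penalty would supply a sparse precision matrix in its place.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical healthy training sample: 200 subjects x 5 region features.
X = rng.normal(size=(200, 5))
mu = X.mean(axis=0)
# Stand-in for the graphical-LASSO estimate: plain inverse sample covariance.
precision = np.linalg.inv(np.cov(X, rowvar=False))

def mahalanobis(x, mu, precision):
    """Mahalanobis distance of a feature vector x to the distribution mean."""
    d = x - mu
    return float(np.sqrt(d @ precision @ d))

d_healthy = mahalanobis(mu, mu, precision)        # the mean itself: distance 0
d_subject = mahalanobis(mu + 3.0, mu, precision)  # a shifted "abnormal" subject
```

Healthy subjects should concentrate near the mean while injured subjects fall in the tail, which is what gives the distance its discriminatory power.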

We describe an exploratory and confirmatory factor analysis of the International Social Survey Programme Religion Cumulation (1991-1998-2008) data set, to identify the factors of individual religiosity and their interrelations in quantitative terms. The exploratory factor analysis was performed using data from the first two waves (1991 and 1998), and led to the identification of four strongly correlated and reliable factors which we labeled Religious formation, Supernatural beliefs, Belief in God, and Religious practice. The confirmatory factor analysis was run using data from 2008, and led to the confirmation of this four-factor structure with very good fit measures.

A new class of disturbance covariance matrix estimators for radar signal processing applications is introduced following a geometric paradigm. Each estimator is associated with a given unitary invariant norm and performs the sample covariance matrix projection into a specific set of structured covariance matrices. Regardless of the considered norm, an efficient solution technique to handle the resulting constrained optimization problem is developed.
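
As a toy instance of projecting a matrix onto a structured set, the sketch below performs the Frobenius-norm projection onto Toeplitz matrices, which amounts to averaging each diagonal; the paper's estimators use other unitary invariant norms and structure sets, so this is an illustration of the idea only.

```python
import numpy as np

def project_toeplitz(S):
    """Frobenius-norm projection onto Toeplitz matrices: replace every
    diagonal of S by its average value."""
    n = S.shape[0]
    T = np.zeros_like(S, dtype=float)
    for k in range(-n + 1, n):
        T += np.mean(np.diagonal(S, k)) * np.eye(n, k=k)
    return T

# A non-Toeplitz 2x2 "sample covariance"; its projection averages the
# main diagonal (1 and 3 -> 2) and leaves the off-diagonals unchanged.
T = project_toeplitz(np.array([[1.0, 2.0], [4.0, 3.0]]))
```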

In this paper, direction-of-arrival (DOA) estimation using non-coherent processing for partly calibrated arrays composed of multiple subarrays is considered. The subarrays are assumed to compute locally the sample covariance matrices of their measurements and communicate them to the processing center. A sufficient condition for the unique identifiability of the sources in the aforementioned non-coherent processing scheme is presented.

This study aims to analyze the methodologies that can be used to estimate the total number of unemployed, as well as the unemployment rates for 28 regions of Portugal, designated as NUTS III regions, using model based approaches as compared to the direct estimation methods currently employed by INE (National Statistical Institute of Portugal). Model based methods, often known as small area estimation methods (Rao, 2003), "borrow strength" from neighbouring regions and in doing so, aim to compensate for the small sample sizes often observed in these areas. Consequently, it is generally accepted that model based methods tend to produce estimates which have less variation.
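
The "borrowing strength" idea can be sketched with a Fay-Herriot-style composite estimator that shrinks a noisy direct survey estimate toward a model-based (synthetic) one; all numbers below are hypothetical, not Portuguese data.

```python
def composite_estimate(direct, var_direct, synthetic, sigma2_model):
    """Composite small-area estimator: weight the direct estimate by
    gamma = sigma2_model / (sigma2_model + var_direct), so areas with
    noisy direct estimates lean more on the synthetic estimate."""
    gamma = sigma2_model / (sigma2_model + var_direct)
    return gamma * direct + (1.0 - gamma) * synthetic

# Hypothetical region: direct unemployment estimate of 12% with large
# sampling variance, regression-based synthetic estimate of 9%.
est = composite_estimate(direct=0.12, var_direct=0.004,
                         synthetic=0.09, sigma2_model=0.001)
```

Because the sampling variance dominates here, the composite estimate lands much closer to the synthetic value, which is exactly the variance reduction the abstract describes.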

Background: For newborn infants in critical care, continuous monitoring of brain function can help identify infants at risk of brain injury. Quantitative features allow a consistent and reproducible approach to EEG analysis, but only when all implementation aspects are clearly defined. Methods: We detail quantitative features frequently used in neonatal EEG analysis and present a Matlab software package together with exact implementation details for all features.
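
The package described is in Matlab; purely to illustrate what one such quantitative feature looks like, here is a relative spectral band power computed from a plain FFT periodogram in Python. The signal, sampling rate, and band limits are all synthetic choices for the sketch.

```python
import numpy as np

def relative_band_power(x, fs, band, total=(0.5, 30.0)):
    """Fraction of spectral power in `band` (Hz) relative to `total` (Hz),
    a common quantitative EEG feature, via a raw FFT periodogram."""
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    psd = np.abs(np.fft.rfft(x)) ** 2
    in_band  = (freqs >= band[0]) & (freqs < band[1])
    in_total = (freqs >= total[0]) & (freqs < total[1])
    return psd[in_band].sum() / psd[in_total].sum()

fs = 64.0
t = np.arange(0, 8, 1 / fs)          # 8 s of synthetic signal at 64 Hz
x = np.sin(2 * np.pi * 2.0 * t)      # pure 2 Hz tone (delta band)
delta = relative_band_power(x, fs, band=(0.5, 4.0))
```

For the pure 2 Hz tone, essentially all the in-range power falls in the delta band, so the feature is close to 1.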

A special class of standard Gaussian Autoregressive Hilbertian processes of order one (Gaussian ARH(1) processes), with bounded linear autocorrelation operator, which does not satisfy the usual Hilbert-Schmidt assumption, is considered. To compensate for the slow decay of the diagonal coefficients of the autocorrelation operator, a faster decay velocity of the eigenvalues of the trace autocovariance operator of the innovation process is assumed. As usual, the eigenvectors of the autocovariance operator of the ARH(1) process are considered for projection, since, here, they are assumed to be known.

Information systems experience an ever-growing volume of unstructured data, particularly in the form of textual materials. This represents a rich source of information from which one can create value for people, organizations and businesses. For instance, recommender systems can benefit from automatically understanding preferences based on user reviews or social media.

Speckle reduction is a longstanding topic in synthetic aperture radar (SAR) imaging. Since most current and planned SAR imaging satellites operate in polarimetric, interferometric or tomographic modes, SAR images are multi-channel and speckle reduction techniques must jointly process all channels to recover polarimetric and interferometric information. The distinctive nature of SAR signal (complex-valued, corrupted by multiplicative fluctuations) calls for the development of specialized methods for speckle reduction.

Autoimmune diseases are characterized by highly specific immune responses against molecules in self-tissues. Different autoimmune diseases are characterized by distinct immune responses, making autoantibodies useful for diagnosis and prediction. In many diseases, the targets of autoantibodies are incompletely defined.

Winds from the North-West quadrant and lack of precipitation are known to lead to an increase of PM10 concentrations over a residential neighborhood in the city of Taranto (Italy). In 2012 the local government prescribed a reduction of industrial emissions by 10% every time such meteorological conditions are forecasted 72 hours in advance. Wind forecasting is addressed using the Weather Research and Forecasting (WRF) atmospheric simulation system by the Regional Environmental Protection Agency.

There has been great interest recently in applying nonparametric kernel mixtures in a hierarchical manner to model multiple related data samples jointly. In such settings several data features are commonly present: (i) the related samples often share some, if not all, of the mixture components but with differing weights, (ii) only some, not all, of the mixture components vary across the samples, and (iii) often the shared mixture components across samples are not aligned perfectly in terms of their location and spread, but rather display small misalignments either due to systematic cross-sample difference or more often due to uncontrolled, extraneous causes. Properly incorporating these features in mixture modeling will enhance the efficiency of inference, whereas ignoring them not only reduces efficiency but can jeopardize the validity of the inference due to issues such as confounding.

We investigate the rates of drug resistance acquisition in a natural population using molecular epidemiological data from Bolivia. First, we study the rate of direct acquisition of double resistance from the double sensitive state within patients and compare it to the rates of evolution to single resistance. In particular, we address whether or not double resistance can evolve directly from a double sensitive state within a given host.

Time-to-event models are a popular tool to analyse data where the outcome variable is the time to the occurrence of a specific event of interest. Here we focus on the analysis of time-to-event outcomes that are either intrinsically discrete or grouped versions of continuous event times. In the literature, there exists a variety of regression methods for such data.

Fragility curves are commonly used in civil engineering to assess the vulnerability of structures to earthquakes. The probability of failure associated with a prescribed criterion (e.g. ...

In many experiments in the life sciences, several endpoints are recorded per subject. The analysis of such multivariate data is usually based on MANOVA models assuming multivariate normality and covariance homogeneity. These assumptions, however, are often not met in practice.

Given the limited pool of donor organs, accurate predictions of survival on the wait list and post transplantation are crucial for cardiac transplantation decisions and policy. However, current clinical risk scores do not yield accurate predictions. We develop a new methodology (ToPs, Trees of Predictors) built on the principle that specific predictors should be used for specific clusters within the target population.

In this paper we propose an ad-hoc construction of the Likelihood Function in order to develop a data analysis procedure to be applied in atomic and nuclear spectral analysis. The classical Likelihood Function was modified by taking into account the underlying statistics of the phenomena studied, through inspection of the residues of the fitting, which should exhibit specific statistical properties. This new formulation was developed analytically, but the sought parameter must be evaluated numerically, since it cannot be obtained as a function of each one of the independent variables.

Foot-mounted inertial positioning (FMIP) and fingerprinting based WiFi indoor positioning (FWIP) are two promising solutions for indoor positioning. However, FMIP suffers from accumulative positioning errors in the long term while FWIP involves a very labor-intensive offline training phase. A new approach combining the two solutions is proposed in this paper, which can limit the error growth in FMIP and is free of any offline site survey phase.

We develop a Bayesian vector autoregressive (VAR) model that is capable of handling vast dimensional information sets. Three features are introduced to permit reliable estimation of the model. First, we assume that the reduced-form errors in the VAR feature a factor stochastic volatility structure, allowing for conditional equation-by-equation estimation.

This article proposes a systematic methodological review and objective criticism of existing methods enabling the derivation of time-varying Granger-causality statistics in neuroscience. The increasing interest and the huge number of publications related to this topic calls for this systematic review which describes the very complex methodological aspects. The capacity to describe the causal links between signals recorded at different brain locations during a neuroscience experiment is of primary interest for neuroscientists, who often have very precise prior hypotheses about the relationships between recorded brain signals that arise at a specific time and in a specific frequency band.

We consider the online and nonparametric detection of abrupt and persistent anomalies, such as a change in the regular system dynamics at a time instance due to an anomalous event (e.g., a failure, a malicious activity).
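
The paper's detector is nonparametric; as a simplified parametric stand-in for "abrupt and persistent" change detection, a classic one-sided CUSUM for an upward mean shift can be sketched as follows (the drift and threshold values are hypothetical tuning choices):

```python
import numpy as np

def cusum_detect(stream, drift=0.5, threshold=5.0):
    """One-sided CUSUM: accumulate evidence of observations exceeding
    `drift` and raise an alarm when the statistic crosses `threshold`."""
    g = 0.0
    for i, x in enumerate(stream):
        g = max(0.0, g + x - drift)   # reset to 0 while in control
        if g > threshold:
            return i                  # index at which the alarm fires
    return None

rng = np.random.default_rng(1)
normal  = rng.normal(0.0, 1.0, 200)   # in-control segment
shifted = rng.normal(2.0, 1.0, 50)    # persistent mean shift starting at t=200
alarm = cusum_detect(np.concatenate([normal, shifted]))
```

Because the post-change increments average well above the drift, the statistic climbs quickly once the shift begins, giving a short detection delay.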

In applications of graphical models, we typically have more information than just the samples themselves. A prime example is the estimation of brain connectivity networks based on fMRI data, where in addition to the samples themselves, the spatial positions of the measurements are readily available. With particular regard for this application, we are thus interested in ways to incorporate additional knowledge most effectively into graph estimation.

We present novel methods for predicting the outcome of large elections. Our first algorithm uses a diffusion process to model the time uncertainty inherent in polls taken with substantial calendar time left to the election. Our second model uses Online Learning along with a novel ex-ante scoring function to combine different forecasters along with our first model.
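
A minimal sketch of the diffusion idea: treat the polled lead as a driftless Brownian motion until election day, so the win probability is a normal tail probability and the same lead is worth more the closer the election is. The volatility value is hypothetical and the model is far simpler than the paper's.

```python
import math

def win_probability(lead, days_left, daily_vol=0.005):
    """P(final vote share > 50%) when the current lead (share - 0.5)
    diffuses as a driftless Brownian motion until election day."""
    if days_left == 0:
        return 1.0 if lead > 0 else 0.0
    sigma = daily_vol * math.sqrt(days_left)   # st. dev. of remaining drift
    z = lead / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

p_far  = win_probability(lead=0.02, days_left=100)  # 2-point lead, far out
p_near = win_probability(lead=0.02, days_left=4)    # same lead, near the end
```

The same 2-point lead translates into much higher confidence near the election, which is the "time uncertainty" the diffusion process captures.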

We develop methods to evaluate whether a political districting accurately represents the will of the people. To explore and showcase our ideas, we concentrate on the congressional districts for the U.S. ...

The extraction of natural gas from the earth has been shown to be governed by differential equations concerning flow through a porous material. Recently, models such as fractional differential equations have been developed to model this phenomenon. One key issue with these models is estimating the fractional order of the differential equation.

Predictive modeling from high-dimensional genomic data is often preceded by a dimension reduction step, such as principal components analysis (PCA). However, the application of PCA is not straightforward for multi-source data, wherein multiple sources of 'omics data measure different but related biological components. In this article we utilize recent advances in the dimension reduction of multi-source data for predictive modeling.

Generalized linear models are often used to fit propensity scores, which are then used to compute inverse probability weighted (IPW) estimators. In order to derive the asymptotic properties of IPW estimators, the propensity score is supposed to be bounded away from zero. This condition is known in the literature as strict positivity (or the positivity assumption) and, in practice, when it does not hold, IPW estimators are very unstable and have large variability.
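
A small simulation sketch of the issue: an IPW estimator of E[Y(1)] with propensities that approach zero, together with a truncated (clipped) variant often used in practice to tame the weights at the cost of some bias. The data-generating model and the clipping level are illustrative choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
x = rng.normal(size=n)
ps = 1.0 / (1.0 + np.exp(-3.0 * x))        # true propensity; near 0 for small x
treat = rng.uniform(size=n) < ps           # treatment assignment
y = 2.0 + rng.normal(scale=1.0, size=n)    # outcome with E[Y(1)] = 2 here

# Raw IPW estimator of E[Y(1)]: weights 1/ps blow up when ps is near zero...
ipw_raw = np.mean(treat * y / ps)
# ...so the propensity is often truncated away from zero for stability.
ipw_clip = np.mean(treat * y / np.clip(ps, 0.05, None))
```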

By constructing a sampling distribution for DVARS, we can create a standardized version of DVARS that should be more similar across scanners and datasets.
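
For reference, DVARS itself is the root-mean-square (across voxels) of the volume-to-volume signal change. The sketch below computes it on synthetic data and applies a naive median rescaling; the paper's standardization is based on a derived sampling distribution, not on this crude rescaling.

```python
import numpy as np

def dvars(data):
    """DVARS: RMS across voxels of the temporal difference, giving one
    value per successive pair of volumes (data is time x voxel)."""
    diffs = np.diff(data, axis=0)
    return np.sqrt(np.mean(diffs ** 2, axis=1))

rng = np.random.default_rng(0)
img = rng.normal(size=(50, 1000))   # synthetic scan: 50 volumes x 1000 voxels
d = dvars(img)
# Naive stand-in for standardization: rescale so typical values sit near 1.
d_std = d / np.median(d)
```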