Statistics - Applications Publications (50)



Tumor cell populations can be thought of as being composed of homogeneous cell subpopulations, with each subpopulation being characterized by overlapping sets of single nucleotide variants (SNVs). Such subpopulations are known as subclones and are an important target for precision medicine. Reconstructing such subclones from next-generation sequencing (NGS) data is one of the major challenges in precision medicine. Read More

Zoonotic diseases are a major cause of morbidity and productivity losses in both human and animal populations. Identifying the source of food-borne zoonoses (e.g. Read More

Estimating vaccination uptake is an integral part of ensuring public health. It was recently shown that vaccination uptake can be estimated automatically from web data, instead of slowly collected clinical records or population surveys. All prior work in this area assumes that features of vaccination uptake collected from the web are temporally regular. Read More

Our work was motivated by a recent study on birth defects of infants born to pregnant women exposed to a certain medication for treating chronic diseases. Outcomes such as birth defects are rare events in the general population, which often translate to very small numbers of events in the unexposed group. As drug safety studies in pregnancy are typically observational in nature, we control for confounding in this rare events setting using propensity scores (PS). Read More
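A minimal sketch of one common way propensity scores control confounding: inverse-probability-of-treatment weighting (IPTW). The paper may use matching or stratification instead; the data, confounder, and outcome model below are entirely synthetic and illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=n)                       # a single confounder
p_treat = 1 / (1 + np.exp(-x))               # true propensity: P(exposed | x)
z = rng.random(n) < p_treat                  # exposure indicator
# Rare binary outcome whose risk depends on the confounder but NOT on exposure.
p_outcome = 0.01 * np.exp(0.5 * x) / np.exp(0.125)
y = rng.random(n) < np.clip(p_outcome, 0, 1)

# Weight each subject by 1/PS (exposed) or 1/(1-PS) (unexposed).
w = np.where(z, 1 / p_treat, 1 / (1 - p_treat))

naive_rd = y[z].mean() - y[~z].mean()        # confounded risk difference
iptw_rd = (np.average(y[z], weights=w[z])
           - np.average(y[~z], weights=w[~z]))  # weighted risk difference
print(f"naive risk difference: {naive_rd:.4f}, IPTW: {iptw_rd:.4f}")
```

Since the true exposure effect here is zero, the weighted estimate should sit closer to zero than the confounded naive contrast.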

Detecting causal associations in time series datasets is a key challenge for novel insights into complex dynamical systems such as the Earth system or the human brain. Interactions in high-dimensional dynamical systems often involve time-delays, nonlinearity, and strong autocorrelations. These present major challenges for causal discovery techniques such as Granger causality, leading to low detection power, biases, and unreliable hypothesis tests. Read More
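The core of a Granger-style check (not the authors' method, which addresses the shortcomings above) is nested linear regression: does the past of x improve a prediction of y beyond y's own past? A minimal sketch on synthetic data where x drives y at lag one:

```python
import numpy as np

rng = np.random.default_rng(1)
n, lag = 2000, 1
x = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + 0.1 * rng.normal()  # x drives y

Y = y[lag:]
own = np.column_stack([np.ones(n - lag), y[:-lag]])   # restricted: y's own past
full = np.column_stack([own, x[:-lag]])               # + lagged x

def rss(A, b):
    """Residual sum of squares of an ordinary least-squares fit."""
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    r = b - A @ coef
    return float(r @ r)

rss_r, rss_f = rss(own, Y), rss(full, Y)
# A large relative RSS reduction suggests x "Granger-causes" y at this lag.
reduction = (rss_r - rss_f) / rss_r
print(f"relative RSS reduction from adding lagged x: {reduction:.3f}")
```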

Motivated by the modeling of liquidity risk in fund management in a dynamic setting, we propose and investigate a class of time series models with generalized Pareto marginals: the autoregressive generalized Pareto process (ARGP), a modified ARGP (MARGP) and a thresholded ARGP (TARGP). These models are able to capture key data features apparent in fund liquidity data and reflect the underlying phenomena via easily interpreted, low-dimensional model parameters. We establish stationarity and ergodicity, provide a link to the class of shot-noise processes, and determine the associated interarrival distributions for exceedances. Read More
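The marginal building block of such processes, a generalized Pareto distribution (GPD), can be simulated by inverse transform sampling. The parameter names (shape xi, scale sigma) below are generic conventions, not taken from the paper:

```python
import numpy as np

def gpd_sample(n, xi, sigma, rng):
    """Draw n GPD(xi, sigma) variates by inverting the distribution function."""
    u = rng.random(n)
    if xi == 0.0:                  # exponential limit as the shape goes to zero
        return -sigma * np.log(1.0 - u)
    return sigma / xi * ((1.0 - u) ** (-xi) - 1.0)

rng = np.random.default_rng(2)
x = gpd_sample(100_000, xi=0.2, sigma=1.0, rng=rng)
# GPD(xi, sigma) has mean sigma / (1 - xi) for xi < 1.
print(f"sample mean {x.mean():.3f} vs theoretical {1.0 / 0.8:.3f}")
```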

In this study, the authors develop a structural model that combines a macro diffusion model with a micro choice model to control for the effect of social influence on the mobile app choices of customers over app stores. Social influence refers to the density of adopters within the proximity of other customers. Using a large data set from an African app store and Bayesian estimation methods, the authors quantify the effect of social influence and investigate the impact of ignoring this process in estimating customer choices. Read More

The method of biomass estimation based on a volume-to-biomass relationship has conventionally been applied to estimate forest biomass through the mean volume (m^3 ha^-1). However, few studies have verified the volume-biomass equations regressed from field data. Possible bias may arise from the volume measurements and from extrapolations from sample plots to stands or a unit area. Read More
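The conventional volume-to-biomass step amounts to regressing plot biomass on plot volume and applying the fitted line to a stand mean volume. A sketch on synthetic plot data; the coefficients and units are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(3)
volume = rng.uniform(50, 400, size=200)                  # m^3 ha^-1 per plot
biomass = 5.0 + 0.6 * volume + rng.normal(0, 10, 200)    # Mg ha^-1 (synthetic)

# Fit biomass = a + b * volume by ordinary least squares.
A = np.column_stack([np.ones_like(volume), volume])
(a, b), *_ = np.linalg.lstsq(A, biomass, rcond=None)

mean_volume = 180.0                                      # stand mean volume
pred = a + b * mean_volume                               # extrapolation step
print(f"predicted stand biomass: {pred:.1f} Mg ha^-1")
```

The extrapolation in the last step is exactly where the abstract warns that unverified equations can introduce bias.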

This paper presents an estimator for semiparametric models that uses a feed-forward neural network to fit the nonparametric component. Unlike many methodologies from the machine learning literature, this approach is suitable for longitudinal/panel data. It provides unbiased estimation of the parametric component of the model, with associated confidence intervals that have near-nominal coverage rates. Read More

Electron ptychography has seen a recent surge of interest for phase sensitive imaging at atomic or near-atomic resolution. However, applications are so far mainly limited to radiation-hard samples because the required doses are too high for imaging biological samples at high resolution. We propose the use of non-convex, Bayesian optimization to overcome this problem and reduce the dose required for successful reconstruction by two orders of magnitude compared to previous experiments. Read More

Robust PCA methods are typically batch algorithms that require loading all observations into memory before processing. This makes them inefficient for processing big data. In this paper, we develop an efficient online robust principal component method, namely online moving window robust principal component analysis (OMWRPCA). Read More
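To illustrate the moving-window idea (a toy sketch, not the OMWRPCA algorithm itself): keep only the latest `window` observations in memory and re-estimate the principal subspace from that buffer, so the full data set never has to be loaded at once.

```python
import numpy as np

rng = np.random.default_rng(4)
d, window = 10, 200
basis = rng.normal(size=(d, 2))          # true 2-D subspace generating the data

buffer = []
for t in range(1000):                    # observations arrive one at a time
    obs = basis @ rng.normal(size=2) + 0.01 * rng.normal(size=d)
    buffer.append(obs)
    if len(buffer) > window:
        buffer.pop(0)                    # drop the oldest observation

# PCA of the current window via SVD of the centered window matrix.
X = np.array(buffer) - np.mean(buffer, axis=0)
_, s, vt = np.linalg.svd(X, full_matrices=False)
explained = (s[:2] ** 2).sum() / (s ** 2).sum()
print(f"variance explained by top-2 components of the window: {explained:.4f}")
```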

Goals are the result of pinpoint shots, and deciding when, how, and where to shoot is pivotal in soccer. The main contribution of this study is two-fold. First, after showing that there is high spatial correlation in the data on shots across games, we introduce a spatial process in the error structure to model the probability of conversion from a shot depending on positional and situational covariates. Read More

In this paper we present an objective approach to change point analysis. In particular, we look at the problem from two perspectives. The first focuses on the definition of an objective prior when the number of change points is known a priori. Read More

Methods for detecting structural changes, or change points, in time series data are widely used in many fields of science and engineering. This chapter sketches some basic methods for the analysis of structural changes in time series data. The exposition is confined to retrospective methods for univariate time series. Read More
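One of the most basic retrospective methods for a univariate series is locating a single mean shift by least squares: pick the split point that minimizes the within-segment sum of squares. A minimal sketch on synthetic data:

```python
import numpy as np

def single_changepoint(x):
    """Return the split index minimizing the two-segment sum of squares."""
    n = len(x)
    best_k, best_cost = None, np.inf
    for k in range(2, n - 1):            # candidate split points
        left, right = x[:k], x[k:]
        cost = (((left - left.mean()) ** 2).sum()
                + ((right - right.mean()) ** 2).sum())
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k

rng = np.random.default_rng(5)
# Mean shifts from 0 to 2 at observation 300.
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(2, 1, 200)])
est = single_changepoint(x)
print("estimated change point:", est)
```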

Evaluating the accuracy of models for match outcome prediction is all well and good, but in the end the real proof is in the money to be made by betting. To evaluate whether the models we developed could easily be used to make money via sports betting, we examine three cases: the NCAAB post-season, the NBA season, and the NFL season, and find that it is possible, yet not without its pitfalls. In particular, we illustrate that high accuracy does not automatically equal high pay-out by looking at the type of match-ups that are predicted correctly by different models. Read More
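The accuracy-versus-payout point can be made with a few lines of arithmetic: a model that is right mostly on heavy favourites wins many bets at short odds and can still lose money overall. The decimal odds and outcomes below are invented for illustration, not data from the paper.

```python
bets = [
    # (decimal_odds, model_correct)
    (1.10, True), (1.10, True), (1.10, True), (1.10, True),  # easy favourites
    (3.50, False), (3.50, False),                            # missed upsets
]
stake = 1.0
accuracy = sum(correct for _, correct in bets) / len(bets)
# Win pays (odds - 1) per unit staked; a loss forfeits the stake.
profit = sum((odds - 1) * stake if correct else -stake
             for odds, correct in bets)
print(f"accuracy {accuracy:.0%}, profit {profit:+.2f} units")
```

Despite being right two times out of three, the bettor ends up 1.60 units down.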

A number of popular measures of dependence between pairs of band-limited signals rely on analytic phase. A common misconception is that the dependence revealed by these measures must be specific to the spectral range of the filtered input signals. Implicitly or explicitly, obtaining analytic phase involves normalizing the signal by its own envelope, which is a nonlinear operation that introduces broad spectral leakage. Read More
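A sketch of how analytic phase is typically obtained: build the analytic signal by zeroing negative frequencies (an FFT-based Hilbert-transform construction) and then divide out the envelope. That division is the nonlinear normalization step the abstract identifies as a source of broad spectral leakage.

```python
import numpy as np

def analytic_signal(x):
    """Analytic signal via the FFT: zero negative frequencies, double positive."""
    n = len(x)
    X = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = 1.0
    h[1:n // 2] = 2.0              # double the positive-frequency bins
    if n % 2 == 0:
        h[n // 2] = 1.0            # keep the Nyquist bin for even n
    return np.fft.ifft(X * h)

t = np.linspace(0, 1, 1024, endpoint=False)
x = np.cos(2 * np.pi * 10 * t)     # band-limited test signal, 10 Hz
z = analytic_signal(x)
envelope = np.abs(z)
phase = np.angle(z / envelope)     # envelope-normalized analytic phase
print(f"mean envelope: {envelope.mean():.3f}")
```

For a pure sinusoid the envelope is flat and the real part of the analytic signal recovers the input exactly.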

In classical sparse representation based classification (SRC) and weighted SRC (WSRC) algorithms, the test samples are sparsely represented by all training samples. These algorithms emphasize the sparsity of the coding coefficients but do not consider the local structure of the input data. To overcome this shortcoming, and aiming at the difficult problem of plant leaf recognition on large-scale databases, a two-stage local similarity based classification learning method is proposed by combining a local mean-based classification method with local WSRC. Read More

Hypothesis tests based on linear models are widely accepted by organizations that regulate clinical trials. These tests are derived using strong assumptions about the data-generating process so that the resulting inference can be based on parametric distributions. Because these methods are well understood and robust, they are sometimes applied to data that depart from assumptions, such as ordinal integer scores. Read More

Recently, Eklund et al. (2016) analyzed clustering methods in standard FMRI packages: AFNI (which we maintain), FSL, and SPM [1]. They claimed: 1) false positive rates (FPRs) in traditional approaches are greatly inflated, questioning the validity of "countless published fMRI studies"; 2) nonparametric methods produce valid, but slightly conservative, FPRs; 3) a common flawed assumption is that the spatial autocorrelation function (ACF) of FMRI noise is Gaussian-shaped; and 4) a 15-year-old bug in AFNI's 3dClustSim significantly contributed to producing "particularly high" FPRs compared to other software. Read More

Recent reports of inflated false positive rates (FPRs) in FMRI group analysis tools by Eklund et al. (2016) have become a large topic within (and outside) neuroimaging. They concluded that: existing parametric methods for determining statistically significant clusters had greatly inflated FPRs ("up to 70%," mainly due to the faulty assumption that the noise spatial autocorrelation function is Gaussian-shaped and stationary), calling into question potentially "countless" previous results; in contrast, nonparametric methods, such as their approach, accurately reflected nominal 5% FPRs. Read More

In human microbiome studies, sequencing reads data are often summarized as counts of bacterial taxa at various taxonomic levels specified by a taxonomic tree. This paper considers the problem of analyzing two repeated measurements of microbiome data from the same subjects. Such data are often collected to assess the change of microbial composition after certain treatment, or the difference in microbial compositions across body sites. Read More

From doctors diagnosing patients to judges setting bail, experts often base their decisions on experience and intuition rather than on statistical models. While understandable, relying on intuition over models has often been found to result in inferior outcomes. Here we present a new method, select-regress-and-round, for constructing simple rules that perform well for complex decisions. Read More
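A hedged sketch of the select-regress-and-round idea: keep a few strong features, fit a linear model, then rescale and round the coefficients to small integers to get a simple scoring rule. The specific selection rule, rounding range, and data below are illustrative choices, not the paper's exact recipe.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, k = 500, 10, 3
X = rng.normal(size=(n, p))
true_w = np.zeros(p)
true_w[:3] = [2.0, -1.0, 0.5]                 # only 3 features actually matter
y = X @ true_w + 0.1 * rng.normal(size=n)

# Select: keep the top-k features by absolute correlation with the outcome.
corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])
selected = np.argsort(corr)[-k:]

# Regress: ordinary least squares on the selected features only.
coef, *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)

# Round: rescale so the largest weight is M in magnitude, round to integers.
M = 3
rounded = np.round(coef * M / np.abs(coef).max()).astype(int)
score = X[:, selected] @ rounded              # the resulting simple integer rule
print("selected features:", sorted(selected.tolist()),
      "integer weights:", rounded.tolist())
```

The point of the final step is interpretability: a human can apply a rule with three integer weights without a computer.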

Parametric hypothesis testing associated with two independent samples arises frequently in several applications in biology, medical sciences, epidemiology, reliability and many more. In this paper, we propose robust Wald-type tests for testing such two sample problems using the minimum density power divergence estimators of the underlying parameters. In particular, we consider the simple two-sample hypothesis concerning the full parametric homogeneity of the samples as well as the general two-sample (composite) hypotheses involving nuisance parameters. Read More

Geostatistical modeling of intrinsic reservoir properties starts with only sparse data available. The estimates depend largely on the number of wells and their locations. Drilling costs are so high that they do not allow new wells to be placed solely for uncertainty assessment. Read More

We consider a class of branching processes called Markovian binary trees, in which individuals' lifetimes and reproduction epochs are modeled using a transient Markovian arrival process (TMAP). We estimate the parameters of the TMAP based on population data containing information on age-specific fertility and mortality rates. Depending on the degree of detail of the available data, a weighted non-linear regression method or a maximum likelihood method is applied. Read More

In this work we explore the dissimilarity between symmetric word pairs, by comparing the inter-word distance distribution of a word to that of its reversed complement. We propose a new measure of dissimilarity between such distributions. Since symmetric pairs with different patterns could point to evolutionary features, we search for the pairs with the most dissimilar behaviour. Read More

In this work, a novel subspace-based method for blind identification of multichannel finite impulse response (FIR) systems is presented. Here, we directly exploit the embedded Toeplitz channel structure in the signal linear model to build a quadratic form whose minimization leads to the desired channel estimation up to a scalar factor. This method can be extended to estimate any predefined linear structure, e. Read More

Border inspection, and the challenge of deciding which of the tens of millions of consignments that arrive should be inspected, is a perennial problem for regulatory authorities. The objective of these inspections is to minimise the risk of contraband entering the country. As an example, for regulatory authorities in charge of biosecurity material, consignments of goods are classified before arrival according to their economic tariff number (Department of Immigration and Border Protection, 2016). Read More

Biosecurity risk material (BRM) presents a clear and significant threat to national and international environmental and economic assets. Intercepting BRM carried by non-compliant international passengers is a key priority of border biosecurity services. Global travel rates are constantly increasing, which complicates this important responsibility, and necessitates judicious intervention. Read More

We consider a new model of individual neuron of Integrate-and-Fire (IF) type with fractional noise. The correlations of its spike trains are studied and proved to have long memory, unlike classical IF models. To measure correctly long-range dependence, it is often necessary to know if the data are stationary. Read More

Short-term probabilistic wind power forecasting can provide critical quantified uncertainty information about wind generation for power system operation and control. Owing to the complicated characteristics of wind power prediction error, it is difficult to develop a universal forecasting model that dominates all alternative models. Therefore, a novel multi-model combination (MMC) approach for short-term probabilistic wind generation forecasting is proposed in this paper to exploit the advantages of different forecasting models. Read More

In the present study, the Parent/Teacher Disruptive Behavior Disorder (DBD) rating scale, based on the Diagnostic and Statistical Manual of Mental Disorders (DSM-IV-TR [APA, 2000]) and developed by Pelham and colleagues (Pelham et al., 1992), was translated and adapted for the assessment of childhood behavioral abnormalities, especially ADHD, ODD and CD, in Georgian children and adolescents. The DBD rating scale was translated into Georgian using the back-translation technique by English-language philologists, then checked and corrected by qualified psychologists and a psychiatrist of Georgia. Read More

Health economic evaluation studies are widely used in public health to assess health strategies in terms of their cost-effectiveness and inform public policies. We developed an R package for Markov models implementing most of the modelling and reporting features described in reference textbooks and guidelines: deterministic and probabilistic sensitivity analysis, heterogeneity analysis, time dependency on state-time and model-time (semi-Markov and non-homogeneous Markov models), etc. In this paper we illustrate the features of heemod by building and analysing an example Markov model. Read More
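To show what such a package evaluates, here is a minimal three-state Markov cohort model (healthy / sick / dead) written directly in Python. The state names, transition probabilities, costs, utilities, and discount rate are invented for illustration and this is not heemod's API.

```python
import numpy as np

P = np.array([                       # yearly transition probabilities
    [0.85, 0.10, 0.05],              # healthy -> healthy / sick / dead
    [0.00, 0.80, 0.20],              # sick    -> sick / dead
    [0.00, 0.00, 1.00],              # dead is absorbing
])
cost = np.array([100.0, 2000.0, 0.0])      # annual cost per state
utility = np.array([0.9, 0.5, 0.0])        # QALY weight per state
discount = 0.035

state = np.array([1.0, 0.0, 0.0])          # whole cohort starts healthy
total_cost = total_qaly = 0.0
for year in range(1, 31):                  # 30-year time horizon
    state = state @ P                      # advance the cohort one cycle
    df = 1.0 / (1.0 + discount) ** year    # discount factor for this year
    total_cost += df * float(state @ cost)
    total_qaly += df * float(state @ utility)

print(f"discounted cost {total_cost:.0f}, discounted QALYs {total_qaly:.2f}")
```

Comparing such totals between a baseline and a treatment strategy yields the incremental cost-effectiveness ratio used in the evaluations the abstract describes.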

We develop a supervised-learning-based approach for monitoring and diagnosing texture-related defects in manufactured products characterized by stochastic textured surfaces that satisfy the locality and stationarity properties of Markov random fields. Examples of stochastic textured surface data include images of woven textiles; image or surface metrology data for machined, cast, or formed metal parts; microscopy images of material microstructure samples; etc. To characterize the complex spatial statistical dependencies of in-control samples of the stochastic textured surface, we use rather generic supervised learning methods, which provide an implicit characterization of the joint distribution of the surface texture. Read More

Affiliations: 1. Department of Statistics, University of Oxford, UK; 2. Private practice of Orthodontics, Rome, Italy; 3. IMT School Lucca, Italy; 4. Department of Orthodontics, University of Florence, Italy

In this paper we use Bayesian networks to determine and visualise the interactions among various Class III malocclusion maxillofacial features during growth and treatment. We start from a sample of 143 patients characterised by up to 21 different craniofacial features. We estimate a network model from these data and we test its consistency by verifying some commonly accepted hypotheses on the evolution of these disharmonies by means of Bayesian statistics. Read More

This paper considers an alternative method for fitting CARR models using combined estimating functions (CEF) by showing its usefulness in applications in economics and quantitative finance. The associated information matrix for the corresponding new estimates is derived to calculate the standard errors. A simulation study is carried out to demonstrate its superiority relative to two competitors: linear estimating functions (LEF) and maximum likelihood (ML). Read More

Cancer cell lines have frequently been used to link drug sensitivity and resistance with genomic profiles. To capture genomic complexity in cancer, the Cancer Genome Project (CGP) (Garnett et al., 2012) screened 639 human tumor cell lines with 130 drugs ranging from known chemotherapeutic agents to experimental compounds. Read More

Intermediate mass black holes play a critical role in understanding the evolutionary connection between stellar mass and super-massive black holes. However, to date the existence of these species of black holes remains ambiguous and their formation process is therefore unknown. It has been long suspected that black holes with masses $10^{2}-10^{4}M_{\odot}$ should form and reside in dense stellar systems. Read More

There is a need for affordable, widely deployable maternal-fetal ECG monitors to improve maternal and fetal health during pregnancy and delivery. Building on diffusion-based channel selection, we present here the mathematical formalism and clinical validation of an algorithm capable of accurately separating maternal and fetal ECG from a two-channel signal acquired over the maternal abdomen. Read More

Wind has the potential to make a significant contribution to future energy resources; however, the task of locating the sources of this renewable energy on a global scale with climate models, along with the associated uncertainty, is hampered by the storage challenges associated with the extremely large amounts of computer output. Various data compression techniques can be used to mitigate this problem, but traditional algorithms deliver relatively small compression rates by focusing on individual simulations. We propose a statistical model that aims at reproducing the data-generating mechanism of an ensemble of runs by providing a stochastic approximation of global annual wind data and compressing all the scientific information in the estimated statistical parameters. Read More

The importance sampling (IS) method lies at the core of many Monte Carlo-based techniques. IS allows the approximation of a target probability distribution by drawing samples from a proposal (or importance) distribution, different from the target, and computing importance weights (IWs) that account for the discrepancy between these two distributions. The main drawback of IS schemes is the degeneracy of the IWs, which significantly reduces the efficiency of the method. Read More
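The mechanics described above fit in a few lines: textbook self-normalized importance sampling to estimate an expectation under a target using samples from a proposal, with the effective sample size (ESS) as the usual diagnostic of weight degeneracy. The particular target and proposal are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50_000
x = rng.normal(0.0, 2.0, size=n)            # proposal samples: N(0, 2^2)

log_target = -0.5 * (x - 1.0) ** 2          # unnormalized N(1, 1) target
log_proposal = -0.5 * (x / 2.0) ** 2 - np.log(2.0)
logw = log_target - log_proposal            # log importance weights
w = np.exp(logw - logw.max())               # stabilize before exponentiating
w /= w.sum()                                # self-normalize

estimate = np.sum(w * x)                    # estimate of E[X] = 1 under target
ess = 1.0 / np.sum(w ** 2)                  # ESS equals n when weights are uniform
print(f"E[X] estimate {estimate:.3f}, ESS {ess:.0f} of {n}")
```

When the proposal matches the target poorly, a few weights dominate and the ESS collapses; that is the degeneracy the abstract refers to.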

In this manuscript we review new ideas and first results on application of the Graphical Models approach, originated from Statistical Physics, Information Theory, Computer Science and Machine Learning, to optimization problems of network flow type with additional constraints related to the physics of the flow. We illustrate the general concepts on a number of enabling examples from power system and natural gas transmission (continental scale) and distribution (district scale) systems. Read More

We consider the parameter estimation of a 2D sinusoid. Although sinusoidal parameter estimation has been extensively studied, our model differs from those examined in the available literature by the inclusion of an offset term. We derive both the maximum likelihood estimation (MLE) solution and the Cramer-Rao lower bound (CRLB) on the variance of the model's estimators. Read More

Predictive modeling plays a key role in providing accurate prognoses and enables us to take a step closer to personalized treatment. We identified two potential sources of human-induced bias that can lead to disparate conclusions. We illustrate through a complex phenotype that robust results can still be drawn after accounting for such biases. Read More

In NMR spectroscopy, undersampling in the indirect dimensions causes reconstruction artifacts whose size can be bounded using the so-called coherence. In experiments with multiple indirect dimensions, new undersampling approaches were recently proposed: random phase detection (RPD) [Maciejewski11] and its generalization, partial component sampling (PCS) [Schuyler13]. The new approaches are fully aware of the fact that high-dimensional experiments generate hypercomplex-valued free induction decays; they randomly acquire only certain low-dimensional components of each high-dimensional hypercomplex entry. Read More

The eigenstructure of the discrete Fourier transform (DFT) is examined and new systematic procedures to generate eigenvectors of the unitary DFT are proposed. DFT eigenvectors are suggested as user signatures for data communication over the real adder channel (RAC). The proposed multiuser communication system over the 2-user RAC is detailed. Read More
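A standard construction behind such procedures, sketched here with generic choices: the unitary DFT matrix F satisfies F^4 = I, so its eigenvalues lie in {1, -1, i, -i}, and projecting any vector with P = (1/4) * sum_k lam^(-k) F^k yields an eigenvector for eigenvalue lam.

```python
import numpy as np

n = 8
a, b = np.meshgrid(np.arange(n), np.arange(n))
F = np.exp(-2j * np.pi * a * b / n) / np.sqrt(n)   # unitary DFT matrix

lam = 1j                                           # pick one of the 4 eigenvalues
# Projector onto the lam-eigenspace: (1/4) * sum over powers of F.
P = sum(lam ** (-p) * np.linalg.matrix_power(F, p) for p in range(4)) / 4

rng = np.random.default_rng(8)
v = P @ rng.normal(size=n)                         # project a random vector
v /= np.linalg.norm(v)                             # normalize the eigenvector
residual = np.linalg.norm(F @ v - lam * v)
print(f"||F v - lam v|| = {residual:.2e}")
```

The residual is at machine-precision level, confirming v is an eigenvector; systematic variants of this projection underlie eigenvector-generation procedures for the DFT.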

Agile localization of anomalous events plays a pivotal role in enhancing the overall reliability of the grid and avoiding cascading failures. This is especially significant in large-scale grids, owing to their geographical expanse and the large volume of data generated. This paper proposes a stochastic graphical framework that aims to localize anomalies with the minimum amount of data. Read More

Estimating the time lag between two hydrogeologic time series (e.g. precipitation and water levels in an aquifer) is of significance for a hydrogeologist-modeler. Read More
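A common first estimate of such a lag is the shift that maximizes the cross-correlation between the two series. A sketch on synthetic series with a known 7-step lag; the variable names (precipitation driving water level) follow the abstract's example:

```python
import numpy as np

rng = np.random.default_rng(9)
n, true_lag = 500, 7
precip = rng.normal(size=n + true_lag)
level = precip[:-true_lag] + 0.2 * rng.normal(size=n)  # lagged, noisy response
precip = precip[true_lag:]                             # align series lengths

max_lag = 30
lags = np.arange(-max_lag, max_lag + 1)
# Correlate precip[t] against level[t + L] for each candidate lag L.
cc = [np.corrcoef(precip[max(0, -L):n - max(0, L)],
                  level[max(0, L):n - max(0, -L)])[0, 1] for L in lags]
est_lag = int(lags[int(np.argmax(cc))])
print("estimated lag:", est_lag)
```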

This is a companion paper to Yarkoni and Westfall (2017), which describes the Python package Bambi for estimating Bayesian generalized linear mixed models using a simple interface. Here I give the statistical details underlying the default, weakly informative priors used in all models when the user does not specify the priors. Our approach is to first deduce what the variances of the slopes would be if we were instead to have defined the priors on the partial correlation scale, and then to set independent Normal priors on the slopes with variances equal to these implied variances. Read More

We propose a curve-based Riemannian-geometric approach for general shape-based statistical analyses of tumors obtained from radiologic images. A key component of the framework is a suitable metric that (1) enables comparisons of tumor shapes, (2) provides tools for computing descriptive statistics and implementing principal component analysis on the space of tumor shapes, and (3) allows for a rich class of continuous deformations of a tumor shape. The utility of the framework is illustrated through specific statistical tasks on a dataset of radiologic images of patients diagnosed with glioblastoma multiforme, a malignant brain tumor with poor prognosis. Read More