Statistics - Applications Publications (50)



In this paper, we generalize the metric-based permutation test for the equality of covariance operators proposed by Pigoli et al. (2014) to the case of multiple samples of functional data. To this end, the non-parametric combination methodology of Pesarin and Salmaso (2010) is used to combine all the pairwise comparisons between samples into a global test. Read More
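The pairwise building block can be illustrated with a small sketch: a two-sample permutation test on the distance between sample covariance matrices, a finite-dimensional stand-in for the covariance operators and metric of Pigoli et al. (function names and data here are hypothetical, not the authors' implementation).

```python
import random

def cov_matrix(sample):
    """Sample covariance matrix of a list of equal-length vectors."""
    n, p = len(sample), len(sample[0])
    means = [sum(x[j] for x in sample) / n for j in range(p)]
    return [[sum((x[j] - means[j]) * (x[k] - means[k]) for x in sample) / (n - 1)
             for k in range(p)] for j in range(p)]

def cov_distance(a, b):
    """Squared Frobenius distance between two covariance matrices."""
    return sum((a[j][k] - b[j][k]) ** 2
               for j in range(len(a)) for k in range(len(a)))

def perm_test_cov(sample1, sample2, n_perm=999, seed=0):
    """Permutation p-value for equality of two covariance matrices."""
    rng = random.Random(seed)
    observed = cov_distance(cov_matrix(sample1), cov_matrix(sample2))
    pooled = sample1 + sample2
    n1 = len(sample1)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        stat = cov_distance(cov_matrix(pooled[:n1]), cov_matrix(pooled[n1:]))
        if stat >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

The paper's multi-sample extension combines many such pairwise p-values into one global test via non-parametric combination.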

Origin-destination problems have a substantial literature at this point. Applications have included labor flows, traffic flows, and mail flows. Viewed from a spatial perspective, the data consist of a pair of spatial locations, the origin and the destination with an associated measurement. Read More

Confidence intervals are a popular way to visualize and analyze data distributions. Unlike p-values, they can convey information both about statistical significance as well as effect size. However, very little work exists on applying confidence intervals to multivariate data. Read More

Mechanistic modelling of animal movement is often formulated in discrete time despite inevitable problems with scale invariance, such as handling irregularly timed observations. A natural solution is to define movement in continuous time, yet the uptake of such modelling has been slower than that of its discrete counterparts. This lack of implementation is often excused by a difficulty in interpretation. Read More

Traffic accident data are usually noisy, heterogeneous, and contain missing values. Selecting the most important variables to improve real-time traffic accident risk prediction has become a concern of many recent studies. This paper proposes a novel variable selection method based on the Frequent Pattern tree (FP tree) algorithm. Read More
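To illustrate the frequent-pattern idea (using naive enumeration rather than the compressed FP-tree structure itself; the accident records below are hypothetical), one can count how often discretized conditions co-occur across records:

```python
from itertools import combinations
from collections import Counter

def frequent_patterns(transactions, min_support, max_size=2):
    """Count itemsets appearing in at least `min_support` transactions.

    A naive enumeration; an FP tree reaches the same result by inserting
    each transaction into a compressed prefix tree and mining it recursively.
    """
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    return {itemset: c for itemset, c in counts.items() if c >= min_support}

# Hypothetical records: each transaction lists the discretized conditions
# present when an accident occurred.
records = [
    ("rain", "night", "high_speed"),
    ("rain", "night"),
    ("rain", "high_speed"),
    ("clear", "day"),
]
patterns = frequent_patterns(records, min_support=2)
```

Variables that appear in many frequent patterns are then candidates for the risk-prediction model.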

We propose an objective Bayesian approach to estimate the number of degrees of freedom for the multivariate $t$ distribution and for the $t$-copula, when the parameter is considered discrete. Inference on this parameter has been problematic, as shown by the scarce literature for the multivariate $t$ and, more importantly, by the absence of any method for the $t$-copula. We employ an objective criterion based on loss functions which allows us to overcome the issue of defining objective probabilities directly. Read More

In this article, we propose a new algorithm for supervised learning methods, by which one can both capture the non-linearity in data and also find the best subset model. To produce an enhanced subset of the original variables, an ideal selection method should have the potential of adding a supplementary level of regression analysis that would capture complex relationships in the data via mathematical transformation of the predictors and exploration of synergistic effects of combined variables. The method that we present here has the potential to produce an optimal subset of variables, rendering the overall process of model selection to be more efficient. Read More

Relational arrays represent interactions or associations between pairs of actors, often over time or in varied contexts. We focus on the case where the elements of a relational array are modeled as a linear function of observable covariates. Due to the inherent dependencies among relations involving the same individual, standard regression methods for quantifying uncertainty in the regression coefficients for independent data are invalid. Read More

This article provides a weighted model confidence set for situations where the underlying model has been misspecified and some part of the support of a random variable $X$ conveys important information about the underlying true model. Applications of such weighted model confidence sets to local and mixture model confidence sets are given. Two simulation studies are conducted to show the practical application of our findings. Read More

Expression quantitative trait loci (eQTL) analysis identifies genetic markers associated with the expression of a gene. Most existing eQTL analyses and methods investigate association in a single, readily available tissue, such as blood. Joint analysis of eQTL in multiple tissues has the potential to improve, and expand the scope of, single-tissue analyses. Read More

Longitudinal social network studies may easily suffer from a lack of statistical power. This is the case in particular for studies that simultaneously investigate change of network ties and change of nodal attributes. Such selection and influence studies have become increasingly popular due to the introduction of stochastic actor-oriented models (SAOMs). Read More

The brain is a paradigmatic example of a complex system, as its functionality emerges as a global property of local mesoscopic and microscopic interactions. Complex network theory allows us to elicit the functional architecture of the brain in terms of links (correlations) between nodes (grey matter regions) and to extract information out of the noise. Here we present the analysis of functional magnetic resonance imaging data from forty healthy humans during the resting condition for the investigation of the basal scaffold of the functional brain network organization. Read More

There is a widespread notion in the cricketing world that a bowler's performance improves with increasing pace. Additionally, many commentators believe lower-order batters to be more vulnerable to pace. The present study puts these two ubiquitous notions to the test by statistically analysing the differences in performance of bowlers from three subpopulations based on average release velocities. Read More

We extend the model used in Gardiner et al. (2002) and Polverejan et al. (2003) through deriving an explicit expression for the joint probability density function of hospital charge and length of stay (LOS) under a general class of conditions. Read More

We call the identification of changes in the probabilistic behavior of a sequence of observations the change-point problem (CPP). Solving the CPP involves detecting the number and position of such changes. In genetics, the study of how and which characteristics of an individual's genetic content might contribute to the occurrence and evolution of cancer is of fundamental importance in the diagnosis and treatment of such diseases, and can be formulated in the framework of change-point analysis. Read More

Several genetic alterations are involved in the genesis and development of cancers. Determining whether and how each genetic alteration contributes to cancer development is fundamental for a complete understanding of human cancer etiology. Loss of heterozygosity (LOH) is one such genetic phenomenon, linked to a variety of diseases and characterized by the change from heterozygosity (the presence of both alleles of a gene) to homozygosity (the presence of only one type of allele) at a particular DNA locus. Read More

Drawing causal inference with observational studies is the central pillar of many disciplines. One sufficient condition for identifying the causal effect is that the treatment-outcome relationship is unconfounded conditional on the observed covariates. It is often believed that the more covariates we condition on, the more plausible this unconfoundedness assumption is. Read More

Over- and undertreatment harm patients and society and confound other healthcare quality measures. Despite a growing body of research covering specific conditions, we lack tools to systematically detect and measure over- and undertreatment in hospitals. We demonstrate a test used to monitor over- and undertreatment in Dutch hospitals, and illustrate its results applied to the aggregated administrative treatment data of 1,836,349 patients at 89 hospitals in 2013. Read More

This study investigates travel behavior determinants based on a multiday travel survey conducted in the region of Ghent, Belgium. Due to the limited reliability of the data sample and the influence outliers exert on classical principal component analysis, robust principal component analysis (ROBPCA) is employed to reveal the explanatory variables responsible for most of the variability. Interpretation of the results is eased by utilizing ROSPCA. Read More

In the current paper, the estimation of the probability density function and the cumulative distribution function of the Topp-Leone distribution is considered. We derive the following estimators: maximum likelihood estimator, uniformly minimum variance unbiased estimator, percentile estimator, least squares estimator and weighted least squares estimator. A simulation study shows that the maximum likelihood estimator is more efficient than the other estimators. Read More
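For reference, the Topp-Leone CDF on $(0,1)$ is $F(x;\nu) = (x(2-x))^\nu$, which gives the maximum likelihood estimator in closed form: $\hat\nu = -n / \sum_i \log(x_i(2-x_i))$. A small sketch (function names hypothetical) with inversion sampling:

```python
import math
import random

def topp_leone_sample(nu, n, seed=0):
    """Draw n variates by inverting F(x) = (x(2-x))**nu on (0, 1):
    solving x(2-x) = u**(1/nu) gives x = 1 - sqrt(1 - u**(1/nu))."""
    rng = random.Random(seed)
    return [1.0 - math.sqrt(1.0 - rng.random() ** (1.0 / nu)) for _ in range(n)]

def topp_leone_mle(data):
    """Closed-form maximum likelihood estimate of the shape parameter nu."""
    return -len(data) / sum(math.log(x * (2.0 - x)) for x in data)
```

Since $x(2-x) \in (0,1)$ on the support, every log term is negative and the estimator is positive.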

Recent work has shown that a country's productive structure constrains its level of economic growth and income inequality. Here, we compare the productive structure of countries in Latin America and the Caribbean (LAC) with that of China and other High-Performing Asian Economies (HPAE) to expose the increasing gap in their productive capabilities. Moreover, we use the product space and the Product Gini Index to reveal the structural constraints on income inequality. Read More

In this paper we develop a bivariate discrete generalized exponential distribution, whose marginals are discrete generalized exponential distribution as proposed by Nekoukhou, Alamatsaz and Bidram ("Discrete generalized exponential distribution of a second type", Statistics, 47, 876 - 887, 2013). It is observed that the proposed bivariate distribution is a very flexible distribution and the bivariate geometric distribution can be obtained as a special case of this distribution. The proposed distribution can be seen as a natural discrete analogue of the bivariate generalized exponential distribution proposed by Kundu and Gupta ("Bivariate generalized exponential distribution", Journal of Multivariate Analysis, 100, 581 - 593, 2009). Read More

The focus in this paper is Bayesian system identification based on noisy incomplete modal data where we can impose spatially-sparse stiffness changes when updating a structural model. To this end, based on a similar hierarchical sparse Bayesian learning model from our previous work, we propose two Gibbs sampling algorithms. The algorithms differ in their strategies to deal with the posterior uncertainty of the equation-error precision parameter, but both sample from the conditional posterior probability density functions (PDFs) for the structural stiffness parameters and system modal parameters. Read More

Estimating treatment effects for subgroups defined by post-treatment behavior (i.e., estimating causal effects in a principal stratification framework) can be technically challenging and heavily reliant on strong assumptions. Read More

Next generation sequencing allows the identification of genes consisting of differentially expressed transcripts, a term which usually refers to changes in the overall expression level. A specific type of differential expression is differential transcript usage (DTU) and targets changes in the relative within gene expression of a transcript. The contribution of this paper is to: (a) extend the use of cjBitSeq to the DTU context, a previously introduced Bayesian model which is originally designed for identifying changes in overall expression levels and (b) propose a Bayesian version of DRIMSeq, a frequentist model for inferring DTU. Read More

There is considerable interest in studying how the distribution of an outcome varies with a predictor. We are motivated by environmental applications in which the predictor is the dose of an exposure and the response is a health outcome. A fundamental focus in these studies is inference on dose levels associated with a particular increase in risk relative to a baseline. Read More

A recent Editorial by Slotnick (2017) reconsiders the findings of our paper on the accuracy of false positive rate control with cluster inference in fMRI (Eklund et al, 2016), in particular criticising our use of resting state fMRI data as a source for null data in the evaluation of task fMRI methods. We defend this use of resting fMRI data: while there is much structure in this data, we argue it is representative of task-data noise, and as such analysis software should be able to accommodate it. We also discuss a potential problem with Slotnick's own method. Read More

Discrete-time hidden Markov models are a broadly useful class of latent-variable models with applications in areas such as speech recognition, bioinformatics, and climate data analysis. It is common in practice to introduce temporal non-homogeneity into such models by making the transition probabilities dependent on time-varying exogenous input variables via a multinomial logistic parametrization. We extend such models to introduce additional non-homogeneity into the emission distribution using a generalized linear model (GLM), with data augmentation for sampling-based inference. Read More
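The multinomial logistic parametrization of the transition probabilities can be sketched as follows (a hypothetical two-state example; the emission-side GLM extension and the data-augmentation sampler are omitted):

```python
import math

def softmax(scores):
    """Numerically stable softmax of a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def transition_matrix(betas, x):
    """Time-varying transition matrix for one time step.

    betas[i][j] is the coefficient vector linking the exogenous input x
    to the score of the i -> j transition; each row is normalized by softmax.
    """
    return [softmax([sum(b * xi for b, xi in zip(bij, x)) for bij in row])
            for row in betas]
```

Evaluating this at each time step yields a non-homogeneous transition matrix driven by the exogenous inputs.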

A typical neuroimaging study will produce a 3D brain statistic image that summarises the evidence for activation during the experiment. However, for practical reasons those images are rarely published; instead, authors only report the (x,y,z) locations of local maxima in the statistic image. Neuroimaging meta-analyses use these foci from multiple studies to find areas of consistent activation across the human brain. Read More

Deriving the optimal safety stock quantity needed to meet customer satisfaction is one of the most important topics in stock management. However, it is difficult to control the stock of correlated marketable merchandise when using an inventory control method developed under the assumption that demands are uncorrelated. To this end, we propose a deterministic approach that uses a probability inequality to derive a reasonable safety stock for the case in which we know the correlation between various commodities. Read More
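One distribution-free route, sketched here assuming Cantelli's one-sided inequality as the probability inequality (not necessarily the authors' exact choice), caps the stockout probability at a level $\epsilon$ using only the mean and the covariance of the correlated demands:

```python
import math

def portfolio_demand_std(cov, weights):
    """Standard deviation of total demand, sqrt(w' Sigma w), for correlated items."""
    var = sum(weights[i] * cov[i][j] * weights[j]
              for i in range(len(cov)) for j in range(len(cov)))
    return math.sqrt(var)

def cantelli_safety_stock(std_total, epsilon):
    """Buffer s with P(D > mean + s) <= epsilon by Cantelli's inequality:
    P(D > mu + k*sigma) <= 1/(1 + k**2), so k = sqrt(1/epsilon - 1) suffices."""
    return math.sqrt(1.0 / epsilon - 1.0) * std_total
```

Positive correlation between items inflates the total-demand standard deviation, and hence the safety stock, relative to the independent case.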

The Intelligent Transportation System (ITS) aims at a coordinated traffic system by applying advanced wireless communication technologies to road traffic scheduling. For accurate road traffic control, short-term traffic forecasting, which predicts the road traffic at a particular site over a short period, is often useful and important. In existing works, the Seasonal Autoregressive Integrated Moving Average (SARIMA) model is a popular approach. Read More

We consider the problem of probabilistic projection of the total fertility rate (TFR) for subnational regions. We seek a method that is consistent with the UN's recently adopted Bayesian method for probabilistic TFR projections for all countries, and works well for all countries. We assess various possible methods using subnational TFR data for 47 countries. Read More

The development of statistical approaches for the joint modelling of the temporal changes of imaging, biochemical, and clinical biomarkers is of paramount importance for improving the understanding of neurodegenerative disorders, and for providing a reference for the prediction and quantification of the pathology in unseen individuals. Nonetheless, the use of disease progression models for probabilistic predictions still requires investigation, for example for accounting for missing observations in clinical data, and for accurate uncertainty quantification. We tackle this problem by proposing a novel Gaussian process-based method for the joint modeling of imaging and clinical biomarker progressions from time series of individual observations. Read More

Penetrance, which plays a key role in genetic research, is defined as the proportion of individuals with the genetic variants (i.e., {genotype}) that cause a particular trait and who have clinical symptoms of the trait (i. Read More

Mixed-effects models have emerged as the gold standard of statistical analysis in different sub-fields of linguistics (Baayen, Davidson & Bates, 2008; Johnson, 2009; Barr, et al, 2013; Gries, 2015). One problematic feature of these models is their failure to converge under maximal (or even near-maximal) random effects structures. The lack of convergence is relatively unaddressed in linguistics and when it is addressed has resulted in statistical practices (e. Read More

Detection with high dimensional multimodal data is a challenging problem when there are complex inter- and intra-modal dependencies. While several approaches have been proposed for dependent data fusion (e.g. Read More

Current theories hold that brain function is highly related to long-range physical connections through axonal bundles, namely extrinsic connectivity. However, obtaining a groupwise cortical parcellation based on extrinsic connectivity remains challenging. Current parcellation methods are computationally expensive, need tuning of several parameters, or rely on ad-hoc constraints. Read More

Extreme phenotype sampling is a selective genotyping design for genetic association studies where only individuals with extreme values of a continuous trait are genotyped for a set of genetic variants. Under financial or other limitations, this design is assumed to improve the power to detect associations between genetic variants and the trait, compared to randomly selecting the same number of individuals for genotyping. Here we present extensions of likelihood models that can be used for inference when the data are sampled according to the extreme phenotype sampling design. Read More

It has been demonstrated that the statistical power of many neuroscience studies is very low, so that the results are unlikely to be robustly reproducible. How are neuroscientists and the journals in which they publish responding to this problem? Here I review the sample size justifications provided for all 15 papers published in one recent issue of the leading journal Nature Neuroscience. Of these, only one claimed it was adequately powered. Read More

This paper shows how to carry out efficient asymptotic variance reduction when estimating volatility in the presence of stochastic volatility and microstructure noise with the realized kernels (RK) from [Barndorff-Nielsen et al., 2008] and the quasi-maximum likelihood estimator (QMLE) studied in [Xiu, 2010]. To obtain such a reduction, we chop the data into B blocks, compute the RK (or QMLE) on each block, and aggregate the block estimates. Read More

We study how to meta-analyze a large collection of randomized experiments (e.g., those done during routine improvements of an online service) to learn general causal relationships. We focus on the case where the number of tests is large, the analyst has no metadata about the context of the tests, and only has access to summary statistics (and not the raw data). Read More

In this paper, we consider a soft measure of block sparsity, $k_\alpha(\mathbf{x})=\left(\lVert\mathbf{x}\rVert_{2,\alpha}/\lVert\mathbf{x}\rVert_{2,1}\right)^{\frac{\alpha}{1-\alpha}},\alpha\in[0,\infty]$ and propose a procedure to estimate it by using multivariate isotropic symmetric $\alpha$-stable random projections without sparsity or block sparsity assumptions. The limiting distribution of the estimator is given. Some simulations are conducted to illustrate our theoretical results. Read More
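The measure itself is easy to compute directly once the block 2-norms are in hand (a plain sketch of the definition, not of the paper's $\alpha$-stable random projection estimator):

```python
import math

def block_sparsity(x, block_size, alpha):
    """Soft block sparsity k_alpha = (||x||_{2,alpha} / ||x||_{2,1})**(alpha/(1-alpha)).

    Blocks are consecutive chunks of length block_size; alpha in (0,1) or (1,inf).
    For k nonzero blocks of equal norm, the measure equals k for any such alpha.
    """
    norms = [math.sqrt(sum(v * v for v in x[i:i + block_size]))
             for i in range(0, len(x), block_size)]
    mixed = sum(b ** alpha for b in norms) ** (1.0 / alpha)  # ||x||_{2,alpha}
    l21 = sum(norms)                                         # ||x||_{2,1}
    return (mixed / l21) ** (alpha / (1.0 - alpha))
```

The projection-based estimator in the paper recovers this quantity without ever forming the block norms explicitly.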

Maximizing product use is a central goal of many businesses, which makes retention and monetization two central analytics metrics in games. Player retention may refer to various duration variables quantifying product use: total playtime or session playtime are popular research targets, and active playtime is well-suited for subscription games. Such research often has the goal of increasing player retention or conversely decreasing player churn. Read More

Cooperative geolocation has attracted significant research interest in recent years. A large number of localization algorithms rely on the availability of statistical knowledge of measurement errors, which is often difficult to obtain in practice. Compared with the statistical knowledge of measurement errors, it can often be easier to obtain the measurement error bound. Read More

We propose a bio-inspired, agent-based approach to describe the natural phenomenon of group chasing in both two and three dimensions. Using a set of local interaction rules we created a continuous-space and discrete-time model with time delay, external noise and limited acceleration. We implemented a unique collective chasing strategy, optimized its parameters and studied its properties when chasing a much faster, erratic escaper. Read More

Local climate conditions play a major role in the development of the mosquito population responsible for transmitting Dengue Fever. Since the {\em Aedes Aegypti} mosquito is also a primary vector for the recent Zika and Chikungunya epidemics across the Americas, a detailed monitoring of periods with favorable climate conditions for mosquito profusion may improve the timing of vector-control efforts and other urgent public health strategies. We apply dimensionality reduction techniques and machine-learning algorithms to climate time series data and analyze their connection to the occurrence of Dengue outbreaks for seven major cities in Brazil. Read More

We are in the middle of a remarkable rise in the use and capability of artificial intelligence. Much of this growth has been fueled by the success of deep learning architectures: models that map from observables to outputs via multiple layers of latent representations. These deep learning algorithms are effective tools for unstructured prediction, and they can be combined in AI systems to solve complex automated reasoning problems. Read More

Transformed Generalized Autoregressive Moving Average (TGARMA) models were recently proposed to deal with non-additivity, non-normality and heteroscedasticity in real time series data. In this paper, a Bayesian approach is proposed for TGARMA models, thus extending the original model. We conducted a simulation study to investigate the performance of Bayesian estimation and Bayesian model selection criteria. Read More

Damage detection of mechanical structures such as bridges is an important research problem in civil engineering. Using spatially distributed sensor time series data collected from a recent experiment on a local bridge in upper state New York, we study noninvasive damage detection using information-theoretical methods. Several findings are in order. Read More
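As one concrete instance of such information-theoretical machinery (an illustrative plug-in estimator, not necessarily the authors' exact method), mutual information between two sensor time series can be estimated from a 2-D histogram:

```python
import math
from collections import Counter

def mutual_information(xs, ys, n_bins=8):
    """Plug-in mutual information (in bits) from a 2-D histogram of two series."""
    def binned(vals):
        lo, hi = min(vals), max(vals)
        width = (hi - lo) / n_bins or 1.0  # guard against constant series
        return [min(int((v - lo) / width), n_bins - 1) for v in vals]
    bx, by = binned(xs), binned(ys)
    n = len(xs)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    # MI = sum p(a,b) * log2( p(a,b) / (p(a) p(b)) )
    return sum(c / n * math.log2(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())
```

A drop in mutual information between sensors on either side of a structural element is the kind of signal such damage-detection analyses look for.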