Statistics - Applications Publications (50)



In the study of complex physical and biological systems represented by multivariate stochastic processes, an issue of great relevance is the description of the system dynamics spanning multiple temporal scales. While methods to assess the dynamic complexity of individual processes at different time scales are well-established, the multiscale evaluation of directed interactions between processes is complicated by theoretical and practical issues such as filtering and downsampling. Here we extend Granger causality (GC), a prominent tool for assessing directed lagged interactions between joint processes, to quantify information transfer across multiple time scales.
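A minimal sketch of the single-scale GC idea this abstract builds on (not the authors' multiscale extension): GC from X to Y is the log ratio of residual variances of a restricted model (Y on its own past) versus a full model (Y on the past of both series). The two coupled AR(1) series below are a hypothetical example, not data from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-variable example: X drives Y with a one-step lag.
n = 2000
x = np.zeros(n)
y = np.zeros(n)
for t in range(1, n):
    x[t] = 0.5 * x[t - 1] + rng.normal()
    y[t] = 0.5 * y[t - 1] + 0.4 * x[t - 1] + rng.normal()

def resid_var(target, regressors):
    """Residual variance of a least-squares fit of target on regressors."""
    beta, *_ = np.linalg.lstsq(regressors, target, rcond=None)
    return np.var(target - regressors @ beta)

# GC(X -> Y): log ratio of residual variances of the restricted model
# (Y on its own past) and the full model (Y on the past of X and Y).
gc_x_to_y = np.log(resid_var(y[1:], y[:-1, None]) /
                   resid_var(y[1:], np.column_stack([y[:-1], x[:-1]])))
gc_y_to_x = np.log(resid_var(x[1:], x[:-1, None]) /
                   resid_var(x[1:], np.column_stack([x[:-1], y[:-1]])))
```

With this coupling, gc_x_to_y is clearly positive while gc_y_to_x stays near zero; an intercept is omitted since both series are zero-mean.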

Spatio-temporal hierarchical modeling is an extremely attractive way to model the spread of crime or terrorism data over a given region, especially when the observations are counts and must be modeled discretely. The spatio-temporal diffusion is placed, as a matter of convenience, in the process model, allowing for straightforward estimation of the diffusion parameters through Bayesian techniques. However, this method of modeling does not allow for the existence of self-excitation, or a temporal data-model dependency, that has been shown to exist in criminal and terrorism data.

Feature-aided tracking can often yield improved tracking performance over standard multiple target tracking (MTT) algorithms that use only kinematic measurements. However, in many applications the feature signal of the targets consists of sparse Fourier-domain signals: it changes quickly and nonlinearly in the time domain, and the feature measurements are corrupted by missed detections and mis-associations.

Affiliations: 1-8: and the MAVAN team

Motivated by the goal of expanding currently existing genotype x environment interaction (GxE) models to simultaneously include multiple genetic variants and environmental exposures in a parsimonious way, we developed a novel method to estimate the parameters in a GxE model, where G is a weighted sum of genetic variants (genetic score) and E is a weighted sum of environments (environmental score). The approach uses alternating optimization to estimate the parameters of the GxE model. This is an iterative process in which the genetic score weights, the environmental score weights, and the main model parameters are estimated in turn, holding the other parameters constant.
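The alternating scheme can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: the model y = b0 + bG*G + bE*E + bGE*G*E is linear in each block of parameters when the others are held fixed, so each step is a least-squares solve; the sum-to-one weight normalization is an assumed identifiability constraint.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated toy data (hypothetical): G = Xg @ wg, E = Xe @ we.
n, pg, pe = 500, 3, 2
Xg = rng.normal(size=(n, pg))
Xe = rng.normal(size=(n, pe))
wg_true = np.array([0.6, 0.3, 0.1])
we_true = np.array([0.7, 0.3])
G_true, E_true = Xg @ wg_true, Xe @ we_true
y = 1.0 + 0.5*G_true + 0.4*E_true + 0.8*G_true*E_true + 0.1*rng.normal(size=n)

wg = np.ones(pg) / pg
we = np.ones(pe) / pe
for _ in range(20):
    G, E = Xg @ wg, Xe @ we
    # Step 1: main model parameters by least squares, scores held fixed.
    D = np.column_stack([np.ones(n), G, E, G * E])
    b0, bG, bE, bGE = np.linalg.lstsq(D, y, rcond=None)[0]
    # Step 2: genetic weights; the model is linear in wg given we and betas:
    # y - b0 - bE*E = (bG + bGE*E) * (Xg @ wg).
    Zg = Xg * (bG + bGE * E)[:, None]
    wg = np.linalg.lstsq(Zg, y - b0 - bE * E, rcond=None)[0]
    wg = wg / wg.sum()          # assumed identifiability constraint
    # Step 3: environmental weights, symmetrically.
    G = Xg @ wg
    Ze = Xe * (bE + bGE * G)[:, None]
    we = np.linalg.lstsq(Ze, y - b0 - bG * G, rcond=None)[0]
    we = we / we.sum()
```

On this toy problem the iterations recover the generating weights and interaction coefficient closely.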

It is widely recognized that citation counts for papers from different fields cannot be directly compared because different scientific fields adopt different citation practices. Citation counts are also strongly biased by paper age, since older papers have had more time to attract citations. Various procedures aim to suppress these biases and give rise to new normalized indicators, such as the relative citation count.
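One common normalization of this kind divides a paper's raw count by the mean count of papers from the same field and year, putting papers from differently citing fields on a comparable scale. A toy sketch with hypothetical data (the exact indicator studied in the paper may differ):

```python
from collections import defaultdict

# Hypothetical toy records: (paper, field, year, citation count).
papers = [
    ("a", "bio", 2010, 100), ("b", "bio", 2010, 20),
    ("c", "math", 2010, 10), ("d", "math", 2010, 2),
]

# Relative citation count: raw citations divided by the mean count of
# papers published in the same field and year.
groups = defaultdict(list)
for name, field, year, cites in papers:
    groups[(field, year)].append(cites)
means = {key: sum(v) / len(v) for key, v in groups.items()}
rcc = {name: cites / means[(field, year)]
       for name, field, year, cites in papers}
```

Paper "a" (a well-cited biology paper) and paper "c" (a well-cited math paper) end up with the same relative score even though their raw counts differ tenfold.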

This paper considers a new method for the binary asteroid orbit determination problem. The method is based on a Bayesian approach with a global optimisation algorithm. The orbital parameters to be determined are modelled through their posterior distribution, comprising prior and likelihood terms.

In online discussion communities, users can interact and share information and opinions on a wide variety of topics. However, some users may create multiple identities, or sockpuppets, and engage in undesired behavior by deceiving others or manipulating discussions. In this work, we study sockpuppetry across nine discussion communities, and show that sockpuppets differ from ordinary users in terms of their posting behavior, linguistic traits, as well as social network structure.

Many interesting natural phenomena are sparsely distributed and discrete. Locating the hotspots of such sparsely distributed phenomena is often difficult because their density gradient is likely to be very noisy. We present a novel approach to this search problem, where we model the co-occurrence relations between a robot's observations with a Bayesian nonparametric topic model.

Current astrophysical models of the interstellar medium assume that small scale variation and noise can be modelled as Gaussian random fields or simple transformations thereof, such as lognormal. We use topological methods to investigate this assumption for three regions of the southern sky. We consider Gaussian random fields on two-dimensional lattices and investigate the expected distribution of topological structures quantified through Betti numbers.

The most common reason for spinal surgery in elderly patients is lumbar spinal stenosis (LSS). For LSS, treatment decisions based on clinical and radiological information, as well as the personal experience of the surgeon, show large variance. Thus, a standardized support system would be of high value for a more objective and reproducible decision.

I address the difficult challenge of measuring the relative influence of competing basketball game strategies, and I apply my analysis to plays resulting in three-point shots. I use a glut of SportVU player tracking data from over 600 NBA games to derive custom position-based features that capture tangible game strategies from game-play data, such as teamwork, player matchups, and on-ball defender distances. Then, I demonstrate statistical methods for measuring the relative importance of any given basketball strategy.

Prior information is often incorporated informally when planning a clinical trial. Here, we present an approach for incorporating prior information, such as data from historical clinical trials, into nuisance-parameter-based sample size re-estimation in a design with an internal pilot study. We focus on trials with continuous endpoints in which the outcome variance is the nuisance parameter.

In the past few years, new technologies in the field of neuroscience have made it possible to simultaneously image activity in large populations of neurons at cellular resolution in behaving animals. In mid-2016, a huge repository of this so-called "calcium imaging" data was made publicly available. The availability of this large-scale data resource opens the door to a host of scientific questions, for which new statistical methods must be developed.

The popularity of online surveys has increased the prominence of sampling weights in claims of representativeness. Yet, much uncertainty remains regarding how these weights should be employed in the analysis of survey experiments: Should they be used or ignored? If they are used, which estimators are preferred? We offer practical advice, rooted in the Neyman-Rubin model, for researchers producing and working with survey experimental data. We examine simple, efficient estimators (Horvitz-Thompson, Hájek, "double-Hájek", and post-stratification) for analyzing these data, along with formulae for biases and variances.
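As a toy sketch of one of these estimators, the Hájek-style weighted difference in means normalizes the inverse-probability weights within each treatment arm. The inclusion probabilities and outcomes below are simulated (hypothetical), not from the paper; the true treatment effect is set to 2.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical survey experiment: known inclusion probabilities pi,
# random binary treatment t, outcome y with a constant effect of 2.
n = 4000
pi = rng.uniform(0.2, 0.9, size=n)
t = rng.integers(0, 2, size=n)
y = 1.0 + 2.0 * t + rng.normal(size=n)
w = 1.0 / pi                       # sampling weights

# Hajek estimator: weighted means with weights normalized per arm.
tau_hajek = ((w * t * y).sum() / (w * t).sum()
             - (w * (1 - t) * y).sum() / (w * (1 - t)).sum())
```

Because the weights sum out in each arm, the estimator is invariant to rescaling the weights, one reason the Hájek form is often preferred over raw Horvitz-Thompson totals when the population size is unknown.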

We introduce a method for decomposition of trend, cycle and seasonal components in spatio-temporal models and apply it to investigate the existence of climate changes in temperature and rainfall series. The method incorporates critical features in the analysis of climatic problems - the importance of spatial heterogeneity, information from a large number of weather stations, and the presence of missing data. The spatial component is based on continuous projections of spatial covariance functions, allowing us to model the complex patterns of dependence observed in climatic data.

Realistic maps of past land cover are needed to investigate prehistoric environmental changes and anthropogenic impacts. However, observation-based reconstructions of past land cover are rare. Recently, Pirzamanbein et al.

Affiliations: 1, 3, 4: Charite, FU, HU, BIH, BCCN, BCAN, Neurocure, Berlin; 2: University Medical Center Hamburg-Eppendorf

Standard neuroimaging data analysis based on traditional principles of experimental design, modelling, and statistical inference is increasingly complemented by novel analysis methods, driven, e.g., by machine learning.

Every network scientist knows that preferential attachment combines with growth to produce networks with power-law in-degree distributions. So how, then, is it possible for the network of American Physical Society journal collection citations to enjoy a log-normal citation distribution when it was found to have grown in accordance with preferential attachment? This anomalous result, which we exalt as the preferential attachment paradox, has remained unexplained since the physicist Sidney Redner first called attention to it over a decade ago. In this paper we propose a resolution to the paradox.
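The first half of the paradox is easy to reproduce. A minimal preferential-attachment growth process (a hypothetical illustration, not the APS data): each new node cites one existing node chosen with probability proportional to its in-degree plus one, which yields the familiar heavy right tail.

```python
import random
from collections import Counter

random.seed(0)

# Minimal preferential attachment: each new paper cites one existing paper
# chosen with probability proportional to (in-degree + 1).
targets = [0]                # multiset: node i appears (in-degree + 1) times
indeg = Counter({0: 0})
for new in range(1, 5000):
    cited = random.choice(targets)
    indeg[cited] += 1
    targets.append(cited)    # the cited node's attractiveness grows
    targets.append(new)      # the new node enters with in-degree 0
    indeg[new] = 0

max_indeg = max(indeg.values())
```

With 5000 nodes the mean in-degree is about 1, while the maximum is an order of magnitude larger, the signature of a heavy-tailed in-degree distribution.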

In an efficient stock market, the returns and their time-dependent volatility are often jointly modeled by stochastic volatility models (SVMs). Over the last few decades several SVMs have been proposed to adequately capture the defining features of the relationship between the return and its volatility. In one of the earliest SVMs, Taylor (1982) proposed a hierarchical model in which the current return is a function of the current latent volatility, which is in turn modeled as an auto-regressive process.
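A simulation sketch of this Taylor-style hierarchy (parameter values are assumed for illustration): log-volatility follows an AR(1), returns are Gaussian scaled by the latent volatility, and the result shows volatility clustering — serially uncorrelated returns but autocorrelated squared returns.

```python
import numpy as np

rng = np.random.default_rng(3)

# Taylor (1982)-style stochastic volatility model:
#   h_t = mu + phi*(h_{t-1} - mu) + sigma_eta * eta_t   (latent log-volatility)
#   r_t = exp(h_t / 2) * eps_t                          (return)
mu, phi, sigma_eta, n = -1.0, 0.95, 0.2, 20000
h = np.empty(n)
h[0] = mu
for t in range(1, n):
    h[t] = mu + phi * (h[t - 1] - mu) + sigma_eta * rng.normal()
r = np.exp(h / 2) * rng.normal(size=n)

def acf1(x):
    """Lag-1 autocorrelation."""
    x = x - x.mean()
    return (x[:-1] * x[1:]).mean() / (x * x).mean()

acf_r, acf_r2 = acf1(r), acf1(r ** 2)
```

The returns themselves are nearly white noise, yet the squared returns inherit persistence from the AR(1) log-volatility.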

We consider the problem of model selection and estimation in sparse high dimensional linear regression models with strongly correlated variables. First, we study the theoretical properties of the dual Lasso solution, and we show that joint consideration of the Lasso primal and its dual solutions are useful for selecting correlated active variables. Second, we argue that correlations among active predictors are not problematic, and we derive a new weaker condition on the design matrix, called Pseudo Irrepresentable Condition (PIC).

Load forecasting at distribution networks is more challenging than at transmission networks because the load pattern is more stochastic and unpredictable. To plan sufficient resources and estimate DER hosting capacity, it is invaluable for a distribution network planner to obtain the probabilistic distribution of daily peak load under a feeder over the long term. In this paper, we model the probabilistic distribution functions of daily peak load under a feeder using power-law distributions, which are tested by an improved Kolmogorov-Smirnov test enhanced by a Monte Carlo simulation approach.
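The basic recipe — fit a power law by maximum likelihood, then judge the fit with a Monte Carlo Kolmogorov-Smirnov test — can be sketched as follows. This is a generic hypothetical sketch (continuous power law, known xmin), not the paper's improved test.

```python
import numpy as np

rng = np.random.default_rng(4)

xmin = 1.0  # assumed known lower cutoff

def fit_alpha(data):
    """Maximum-likelihood exponent for a continuous power law above xmin."""
    return 1.0 + len(data) / np.log(data / xmin).sum()

def ks_stat(data, alpha):
    """Kolmogorov-Smirnov distance between data and the fitted power law."""
    data = np.sort(data)
    cdf_model = 1.0 - (data / xmin) ** (1.0 - alpha)
    cdf_emp = np.arange(1, len(data) + 1) / len(data)
    return np.abs(cdf_emp - cdf_model).max()

def sample_power_law(alpha, size):
    """Inverse-CDF sampling from the power law."""
    return xmin * (1.0 - rng.uniform(size=size)) ** (-1.0 / (alpha - 1.0))

data = sample_power_law(2.5, 500)        # synthetic stand-in for peak loads
alpha_hat = fit_alpha(data)
d_obs = ks_stat(data, alpha_hat)

# Monte Carlo p-value: refit and recompute KS on synthetic data sets.
d_sim = [ks_stat(s, fit_alpha(s))
         for s in (sample_power_law(alpha_hat, len(data)) for _ in range(200))]
p_value = float(np.mean([d >= d_obs for d in d_sim]))
```

Refitting the exponent on each synthetic data set, rather than reusing alpha_hat, is what keeps the p-value honest; comparing d_obs against tabulated KS critical values would be anti-conservative here.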

This paper proposes a new objective function and quantile regression (QR) algorithm for load forecasting (LF). In LF, positive forecasting errors often have a different economic impact than negative forecasting errors. Considering this difference, a new objective function is proposed that puts different prices on positive and negative forecasting errors.
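The standard QR building block is the pinball loss; pricing the two error signs differently can be mimicked (hypothetically — the paper's actual objective may differ) by scaling its two branches with separate prices. Minimizing the plain pinball loss over a constant forecast recovers the sample quantile, and raising the price on positive errors pushes the optimal forecast upward.

```python
import numpy as np

def priced_pinball(y, q, tau, c_pos=1.0, c_neg=1.0):
    """Pinball loss at level tau with separate prices on under-/over-forecasts.
    c_pos = c_neg = 1 recovers the standard quantile loss."""
    e = y - q
    return np.where(e >= 0, c_pos * tau * e, c_neg * (tau - 1.0) * e).mean()

rng = np.random.default_rng(5)
y = rng.normal(size=10000)              # synthetic stand-in for load errors

grid = np.linspace(-3, 3, 601)
q_hat = grid[int(np.argmin([priced_pinball(y, q, tau=0.9) for q in grid]))]
# Doubling the price on positive (under-forecast) errors raises the forecast.
q_priced = grid[int(np.argmin([priced_pinball(y, q, tau=0.9, c_pos=2.0)
                               for q in grid]))]
```

With equal prices the minimizer sits at the sample 0.9-quantile; with c_pos = 2 it moves to an effectively higher quantile level.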

This paper develops an approximate Bayesian doubly-robust (DR) estimation method to quantify the causal effect of speed cameras on road traffic accidents. Previous empirical work on this topic, which shows a diverse range of estimated effects, is based largely on outcome regression (OR) models using the Empirical Bayes approach or on simple before and after comparisons. Issues of causality and confounding have received little formal attention.

Variance based sensitivity indices have established themselves as a reference among practitioners of sensitivity analysis of model output. It is not unusual to consider a variance based sensitivity analysis as informative if it produces at least the first order sensitivity indices Sj and the so-called total-effect sensitivity indices STj or Tj for all the uncertain factors of the mathematical model under analysis. Computational economy is critical in sensitivity analysis. Read More
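For readers unfamiliar with these indices, here is a standard pick-freeze Monte Carlo sketch of Sj and STj for a toy model f(x) = x1 + 2*x2 (a hypothetical example; the estimators shown are the common Saltelli/Jansen forms, not necessarily the ones this paper studies). For this additive model the exact values are S1 = ST1 = 0.2 and S2 = ST2 = 0.8.

```python
import numpy as np

rng = np.random.default_rng(6)

def f(X):
    """Toy additive model: variance splits 1:4 between the two inputs."""
    return X[:, 0] + 2.0 * X[:, 1]

n, k = 100000, 2
A = rng.normal(size=(n, k))
B = rng.normal(size=(n, k))
fA, fB = f(A), f(B)
var = np.var(np.concatenate([fA, fB]))

S, ST = [], []
for j in range(k):
    ABj = A.copy()
    ABj[:, j] = B[:, j]            # "freeze" all inputs except x_j
    fABj = f(ABj)
    S.append(np.mean(fB * (fABj - fA)) / var)        # first-order index Sj
    ST.append(0.5 * np.mean((fA - fABj) ** 2) / var)  # total-effect index STj
```

The computational-economy point in the abstract is visible here: estimating all Sj and STj costs n*(k + 2) model runs, which grows quickly for expensive models with many factors.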

Accurately predicting the future capacity and remaining useful life of batteries is necessary to ensure reliable system operation and to minimise maintenance costs. The complex nature of battery degradation has meant that mechanistic modelling of capacity fade has thus far remained intractable; however, with the advent of cloud-connected devices, data from cells in various applications is becoming increasingly available, and the feasibility of data-driven methods for battery prognostics is increasing.

Modern social media platforms facilitate the rapid spread of information online. Modelling phenomena such as social contagion and information diffusion is contingent upon a detailed understanding of the information-sharing processes. On Twitter, an important aspect of this occurs with retweets, where users rebroadcast the tweets of other users.

We consider the problem of clustering gamma-ray bursts (from the BATSE catalogue) through kernel principal component analysis, in which our proposed kernel outperforms other competent kernels in terms of clustering accuracy, and we obtain three physically interpretable groups of gamma-ray bursts. The effectiveness of the suggested kernel, in combination with kernel principal component analysis, in revealing natural clusters in noisy and nonlinear data while reducing the dimension of the data is also explored in two simulated data sets.

Steganography is a collection of methods for hiding secret information ("payload") within non-secret information ("container"). Its counterpart, steganalysis, is the practice of determining whether a message contains a hidden payload, and recovering it if possible. The presence of hidden payloads is typically detected by a binary classifier.

This is a hands-on introduction to Generalised Additive Mixed Models (GAMMs) in the context of linguistics, with a particular focus on dynamic speech analysis (e.g., formant contours, pitch tracks, diachronic change, etc.).

Brain mapping is an increasingly important tool in neurology and psychiatry research for realizing data-driven personalized medicine in the big data era; it learns the statistical links between brain images and subject-level features. Taking images as responses, the task raises many challenges due to the high dimensionality of the images relative to the small number of samples, as well as the noisiness of measurements in medical images. In this paper we propose a novel method, {\it Smooth Image-on-scalar Regression} (SIR), for recovering the true association between an image outcome and scalar predictors.

Adaptive designs for multi-armed clinical trials have become increasingly popular in many areas of medical research because of their potential to shorten development times and to increase patient response. However, developing response-adaptive trial designs that offer patient benefit while ensuring the resulting trial avoids bias and provides a statistically rigorous comparison of the different treatments included is highly challenging. In this paper, the theory of Multi-Armed Bandit Problems is used to define a family of near optimal adaptive designs in the context of a clinical trial with a normally distributed endpoint with known variance.
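To illustrate the bandit framing (not the paper's near-optimal designs, which are derived differently), here is a hypothetical Thompson-sampling allocation rule for a two-arm trial with normal endpoints and known variance: each patient is assigned to the arm whose sampled posterior mean is largest, so allocation drifts toward the better-performing treatment.

```python
import numpy as np

rng = np.random.default_rng(9)

# Hypothetical two-arm trial: arm 1 has the higher true mean response.
true_means = [0.0, 0.5]
sigma = 1.0                     # known outcome standard deviation
n_arms = 2
sums = np.zeros(n_arms)
counts = np.zeros(n_arms)

for patient in range(400):
    draws = []
    for a in range(n_arms):
        if counts[a] == 0:
            draws.append(np.inf)          # force at least one assignment
        else:
            # Flat-prior posterior for the arm mean: N(xbar, sigma^2 / n).
            draws.append(rng.normal(sums[a] / counts[a],
                                    sigma / np.sqrt(counts[a])))
    arm = int(np.argmax(draws))
    outcome = rng.normal(true_means[arm], sigma)
    sums[arm] += outcome
    counts[arm] += 1
```

Over the course of the trial the better arm accumulates the majority of patients — the patient-benefit property that motivates bandit-based designs, and also the source of the inferential complications the abstract mentions.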

Pay-for-performance approaches have been widely adopted in order to drive improvements in the quality of healthcare provision. Previous studies evaluating the impact of these programs are either limited by the number of health outcomes or of medical conditions considered. In this paper, we evaluate the effectiveness of a pay-for-performance program on the basis of five health outcomes and across a wide range of medical conditions.

Diffuse reflectance spectroscopy is a powerful technique to predict soil properties. It can be used in situ to provide data inexpensively and rapidly compared to the standard laboratory measurements. Because most spectral databases contain air-dried samples scanned in the laboratory, field spectra acquired in situ are either absent or rare in calibration data sets.

Predictive modeling is increasingly being employed to assist human decision-makers. One purported advantage of replacing or augmenting human judgment with computer models in high stakes settings-- such as sentencing, hiring, policing, college admissions, and parole decisions-- is the perceived "neutrality" of computers. It is argued that because computer models do not hold personal prejudice, the predictions they produce will be equally free from prejudice.

In this paper we present an alternative representation of the Negative Binomial--Lindley distribution recently proposed by Zamani and Ismail (2010) which shows some advantages over the latter model. This new formulation provides a tractable model with attractive properties which makes it suitable for application not only in insurance settings but also in other fields where overdispersion is observed. Basic properties of the new distribution are studied.

The main limitation constraining fast and comprehensive application of Wireless Local Area Network (WLAN) based indoor localization systems with Received Signal Strength (RSS) positioning algorithms is the building of the fingerprinting radio map, which is time-consuming, especially when the indoor environment is large and/or changes frequently. Different approaches have been proposed to reduce the workload, including fingerprinting deployment and update efforts, but performance degrades greatly when the workload is reduced below a certain level. In this paper, we propose an indoor localization scenario that applies metric learning and manifold alignment to realize direct mapping localization (DML) using a low-resolution radio map with a single sample of RSS, reducing the fingerprinting workload by up to 87\%.

Procrustes analysis is a morphometric method, based on configurations of landmarks, that estimates the superimposition parameters by least squares; for this reason, the procedure is very sensitive to outliers. In the first part of the paper we robustify this technique to classify individuals from a descriptive point of view. In the literature there are also classical results, based on the normality of the observations, for testing whether there are significant differences between individuals.

Response-adaptive randomisation (RAR) can considerably improve the chances of a successful treatment outcome for patients in a clinical trial by skewing the allocation probability towards better performing treatments as data accumulates. There is considerable interest in using RAR designs in drug development for rare diseases, where traditional designs are not feasible or ethically objectionable. In this paper we discuss and address a major criticism of RAR: the undesirable type I error inflation due to unknown time trends in the trial.

Facing increasing domestic energy consumption from population growth and industrialization, Saudi Arabia is aiming to reduce its reliance on fossil fuels and to broaden its energy mix by expanding investment in renewable energy sources, including wind energy. A preliminary task in the development of wind energy infrastructure is the assessment of wind energy potential, a key aspect of which is the characterization of its spatio-temporal behavior. In this study we examine the impact of internal climate variability on seasonal wind power density fluctuations using 30 simulations from the Large Ensemble Project (LENS) developed at the National Center for Atmospheric Research.

The ungrammatical sentence "The key to the cabinets are on the table" is known to lead to an illusion of grammaticality. As discussed in the meta-analysis by Jaeger et al. (2017), faster reading times are observed at the verb "are" in the agreement-attraction sentence above compared to the equally ungrammatical sentence "The key to the cabinet are on the table".

In recent years, there has been strong interest in neuroscience studies to investigate brain organization through networks of brain regions that demonstrate strong functional connectivity (FC). These networks are extracted from observed fMRI using data-driven analytic methods such as independent component analysis (ICA). A notable limitation of these FC methods is that they do not provide any information on the underlying structural connectivity (SC), which is believed to serve as the basis for interregional interactions in brain activity.

Feature extraction and dimension reduction for networks is critical in a wide variety of domains. Efficiently and accurately learning features for multiple graphs has important applications in statistical inference on graphs. We propose a method to jointly embed multiple undirected graphs.

We present a latent feature allocation model to reconstruct tumor subclones subject to phylogenetic evolution that mimics tumor evolution. Similar to most current methods, we consider data from next-generation sequencing. Unlike most methods that use information in short reads mapped to single nucleotide variants (SNVs), we consider subclone reconstruction using pairs of two proximal SNVs that can be mapped by the same short reads.

In temperate climates, mortality is seasonal with a winter-dominant pattern, due in part to pneumonia and influenza. Cardiac causes, which are the leading cause of death in the United States, are also winter-seasonal although it is not clear why. Interactions between circulating respiratory viruses (f.

The five-year post-transplant survival rate is an important indicator of the quality of care delivered by kidney transplant centers in the United States. To provide a fair assessment of each transplant center, an effect representing center-specific care quality, along with patient-level risk factors, is often included in the risk-adjustment model. In the past, center effects have been modeled as either fixed effects or Gaussian random effects, with various pros and cons.

In this paper, adaptive non-uniform compressive sampling (ANCS) of time-varying signals, which are sparse in a proper basis, is introduced. ANCS employs the measurements of previous time steps to distribute the sensing energy among coefficients more intelligently. To this end, a Bayesian inference method is proposed that does not require any prior knowledge of importance levels of coefficients or sparsity of the signal.

In the point-process context, kernel intensity estimation has been mainly restricted to exploratory analysis due to its lack of consistency. However, the use of covariates has allowed the design of consistent alternatives under some restrictive assumptions. In this paper we focus our attention on defining an appropriate framework to derive a consistent kernel intensity estimator using covariates, as well as a consistent smooth bootstrap procedure.

Researchers interested in statistically modeling network data have a well-established and quickly growing set of approaches from which to choose. Several of these methods have been regularly applied in research on political networks, while others have yet to permeate the field. Here, we review the most prominent methods of inferential network analysis---for both cross-sectionally and longitudinally observed networks---including (temporal) exponential random graph models, latent space models, the quadratic assignment procedure, and stochastic actor-oriented models.

Blind Source Separation (BSS) is a challenging matrix factorization problem that plays a central role in multichannel imaging science. In a large number of applications, such as astrophysics, current unmixing methods are limited since real-world mixtures are generally affected by extra instrumental effects like blurring. Therefore, BSS has to be solved jointly with a deconvolution problem, which requires tackling a new inverse problem: deconvolution BSS (DBSS).

The development of the Smart Grid in Norway specifically, and in Europe and the US in general, will shortly lead to the availability of massive amounts of fine-grained spatio-temporal consumption data from domestic households. This enables the application of data mining techniques to traditional problems in power systems. Clustering customers into appropriate groups is extremely useful for operators or retailers to address each group differently through dedicated tariffs or customer-tailored services.