# Using Aggregated Relational Data to feasibly identify network structure without network data

Social and economic network data can be useful to both researchers and policymakers but are often impractical to collect. We propose collecting Aggregated Relational Data (ARD) using questions that are simple and easy to add to any survey. These questions are of the form "How many of your friends in the village have trait k?" We show that by collecting ARD from even a small share of the population, researchers can recover the likely distribution of statistics of the underlying network. We provide three empirical examples. We first apply the technique to the 75 village networks in Karnataka, India, where Banerjee et al. (2016b) collected near-complete network data. We show that with ARD alone, on even a 29% sample, we can accurately estimate both node-level features (such as eigenvector centrality and clustering) and network-level features (such as the maximum eigenvalue and average path length). To further demonstrate the power of the approach, we apply the technique to two settings previously analyzed by the authors. We show that ARD could have been used to predict how to assign monitors to savers to increase savings in rural villages (Breza and Chandrasekhar, 2016); ARD would have led to the same conclusions the authors reached using expensive near-complete network data. We then turn to a setting where survey ARD was collected alongside partial network data, and demonstrate that the same conclusions would have been drawn using only the ARD, and that with the ARD the researchers could measure more generally the impact of microfinance exposure on social capital in urban slums (Banerjee et al., 2016a).
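
The ARD idea can be illustrated with a small simulation: the answers to "how many of your friends have trait k?" pin down each respondent's degree via a scale-up ratio in the spirit of the network scale-up literature. This is a minimal sketch with a simulated Erdős–Rényi network, traits assigned independently of the network, and hypothetical trait shares; it is not the paper's full latent-space estimator:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p_edge, K = 200, 0.05, 8

# Simulate an undirected network and K binary traits per node
# (traits independent of network position, for simplicity).
A = (rng.random((N, N)) < p_edge).astype(int)
A = np.triu(A, 1)
A = A + A.T
traits = (rng.random((N, K)) < 0.3).astype(int)

# ARD responses: "how many of your links have trait k?"
Y = A @ traits                              # N x K matrix of counts

# Scale-up estimate of each node's degree: since traits are independent
# of links, E[Y_ik] = d_i * (share with trait k), so
# d_hat_i = sum_k Y_ik / sum_k share_k.
share = traits.mean(axis=0)
d_hat = Y.sum(axis=1) / share.sum()
d_true = A.sum(axis=1)
print(np.corrcoef(d_hat, d_true)[0, 1])     # degrees recovered from ARD alone
```

Even this crude ratio estimator tracks true degrees closely; the paper's point is that richer network statistics can be recovered from the same kind of data.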

## Similar Publications

Bayesian inference for complex models is challenging: the posterior can be high-dimensional and multimodal, and standard Monte Carlo samplers can have difficulty exploring it effectively. We introduce a general-purpose, rejection-free ensemble Markov chain Monte Carlo (MCMC) technique to improve on existing poorly mixing samplers. This is achieved by combining parallel tempering with an auxiliary-variable move that exchanges information between the chains.
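
The chain-swapping idea this builds on can be seen in a textbook parallel-tempering sketch (the paper's rejection-free, auxiliary-variable exchange is not reproduced here). Hot chains flatten a bimodal target enough to cross between modes, and swap moves pass those crossings down to the cold chain:

```python
import numpy as np

rng = np.random.default_rng(1)

def log_target(x):
    # Bimodal target: mixture of two well-separated Gaussian-like modes.
    return np.logaddexp(-(x + 4) ** 2, -(x - 4) ** 2)

temps = np.array([1.0, 4.0, 16.0])            # chain temperatures (assumed)
x = np.zeros(len(temps))
samples = []
for it in range(20000):
    # Random-walk Metropolis update for each tempered chain pi^(1/T).
    prop = x + rng.normal(0.0, 1.0, size=len(temps))
    log_acc = (log_target(prop) - log_target(x)) / temps
    x = np.where(np.log(rng.random(len(temps))) < log_acc, prop, x)
    # Swap move between a random adjacent pair of chains.
    i = rng.integers(len(temps) - 1)
    d = (log_target(x[i + 1]) - log_target(x[i])) * (1 / temps[i] - 1 / temps[i + 1])
    if np.log(rng.random()) < d:
        x[i], x[i + 1] = x[i + 1], x[i]
    samples.append(x[0])
samples = np.array(samples)
# The cold chain visits both modes, which plain Metropolis with this
# step size essentially never achieves on this target.
print((samples < 0).mean(), (samples > 0).mean())
```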

The regsem package in R, an implementation of regularized structural equation modeling (RegSEM; Jacobucci, Grimm, and McArdle 2016), was recently developed with the goal of incorporating various forms of penalized likelihood estimation in a broad array of structural equations models. The forms of regularization include both the ridge (Hoerl and Kennard 1970) and the least absolute shrinkage and selection operator (lasso; Tibshirani 1996), along with sparser extensions. RegSEM is particularly useful for structural equation models that have a small parameter to sample size ratio, as the addition of penalties can reduce the complexity, thus reducing the bias of the parameter estimates.

In the study of complex physical and biological systems represented by multivariate stochastic processes, an issue of great relevance is the description of the system dynamics spanning multiple temporal scales. While methods to assess the dynamic complexity of individual processes at different time scales are well-established, the multiscale evaluation of directed interactions between processes is complicated by theoretical and practical issues such as filtering and downsampling. Here we extend the very popular measure of Granger causality (GC), a prominent tool for assessing directed lagged interactions between joint processes, to quantify information transfer across multiple time scales.
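
At a single time scale, GC compares the prediction error of a process with and without the lagged history of the other process. A minimal lag-1 bivariate sketch (the paper's multiscale extension is not reproduced; coefficients are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(6)
T = 2000
x = np.zeros(T)
y = np.zeros(T)
for t in range(1, T):
    x[t] = 0.5 * x[t - 1] + rng.normal()
    y[t] = 0.5 * y[t - 1] + 0.4 * x[t - 1] + rng.normal()   # x drives y

def resid_var(target, regressors):
    # OLS residual variance of target regressed on lag-1 regressors.
    X = np.column_stack([r[:-1] for r in regressors])
    b, *_ = np.linalg.lstsq(X, target[1:], rcond=None)
    return np.var(target[1:] - X @ b)

# GC from x to y: log ratio of restricted vs. full residual variances.
gc_xy = np.log(resid_var(y, [y]) / resid_var(y, [y, x]))
gc_yx = np.log(resid_var(x, [x]) / resid_var(x, [x, y]))
print(gc_xy, gc_yx)   # clearly positive x->y, near zero y->x
```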

We introduce a new type of point process model to describe the incidence of contagious diseases. The model is a variant of the Hawkes self-exciting process and exhibits similar clustering but without the restriction that the component describing the contagion must remain static over time. Instead, our proposed model prescribes that the degree of contagion (or productivity) changes as a function of the conditional intensity; of particular interest is the special case where the productivity is inversely proportional to the conditional intensity.
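
The inverse-productivity special case can be simulated by Ogata thinning: each accepted event is assigned a productivity that shrinks when the conditional intensity at its occurrence time is high. The exponential kernel, constants, and the cap keeping the process subcritical are all assumptions of this sketch, not the paper's specification:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, beta, T = 0.5, 2.0, 200.0    # baseline rate, kernel decay, horizon (assumed)
events, prods = [], []

def intensity(t):
    # Conditional intensity: baseline plus decaying kicks from past events,
    # each scaled by the productivity fixed when that event occurred.
    return mu + sum(k * beta * np.exp(-beta * (t - s))
                    for s, k in zip(events, prods))

# Ogata thinning: between events the intensity only decays, so its current
# value is a valid upper bound for the next proposal.
t = 0.0
while True:
    lam_bar = intensity(t)
    t += rng.exponential(1.0 / lam_bar)
    if t >= T:
        break
    lam_t = intensity(t)
    if rng.random() < lam_t / lam_bar:
        events.append(t)
        # Productivity inversely proportional to the conditional intensity,
        # capped below 1 so the branching ratio stays subcritical.
        prods.append(min(0.9, 0.5 / lam_t))

print(len(events), "events; mean productivity", round(float(np.mean(prods)), 3))
```

Note the self-stabilizing behavior: bursts raise the intensity, which lowers the productivity of events inside the burst.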

Continuous-time multi-state survival models can be used to describe health-related processes over time. In the presence of interval-censored times for transitions between the living states, the likelihood is constructed using transition probabilities. Models can be specified using parametric or semi-parametric shapes for the hazards.


This paper focuses on the multivariate linear mixed-effects model, including all the correlations between the random effects, when the marginal residual terms are assumed uncorrelated and homoscedastic with possibly different standard deviations. The random effects covariance matrix is Cholesky factorized to directly estimate the variance components of these random effects. This strategy enables a consistent estimate of the random effects covariance matrix, which is generally estimated poorly when estimated directly, for example with the EM algorithm.
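
The Cholesky trick amounts to optimizing over an unconstrained triangular factor so that the implied covariance matrix is positive semi-definite by construction. A one-screen illustration with hypothetical parameter values:

```python
import numpy as np

# Free parameters of a 2x2 random-effects covariance (hypothetical values);
# they fill a lower-triangular Cholesky factor L.
theta = np.array([1.0, 0.3, 0.8])
L = np.array([[theta[0], 0.0],
              [theta[1], theta[2]]])

# Sigma = L @ L.T is symmetric positive semi-definite for ANY theta,
# so an optimizer can move freely in theta without leaving the valid set.
Sigma = L @ L.T
print(Sigma)
```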

In this paper, we study a novel approach for the estimation of quantiles when facing potential right censoring of the responses. Contrary to the existing literature on the subject, the strategy adopted in this paper is to tackle censoring at the level of the loss function usually employed for the computation of quantiles, the so-called "check" function. For interpretation purposes, a simple comparison with the latter reveals how censoring is accounted for in the newly proposed loss function.
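
For reference, the standard (uncensored) check function that the paper modifies is rho_tau(u) = u(tau - 1{u < 0}); the tau-th sample quantile minimizes its average. The censored variant proposed in the paper is not reproduced here:

```python
import numpy as np

def check_loss(u, tau):
    # Koenker-Bassett "check" function: rho_tau(u) = u * (tau - 1{u < 0}).
    return u * (tau - (u < 0))

rng = np.random.default_rng(3)
y = rng.exponential(size=5000)
tau = 0.5

# The tau-th quantile minimizes the mean check loss; verify on a grid.
grid = np.linspace(0.0, 5.0, 2001)
losses = [check_loss(y - q, tau).mean() for q in grid]
q_hat = grid[np.argmin(losses)]
print(q_hat, np.quantile(y, tau))   # both near log(2) ~ 0.693 for Exp(1)
```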

Cross-validation is one of the most popular model selection methods in statistics and machine learning. Despite its wide applicability, traditional cross-validation methods tend to select overfitting models, unless the ratio between the training and testing sample sizes is much smaller than conventional choices. We argue that this overfitting tendency of cross-validation stems from ignoring the uncertainty in the testing sample.

Particle filters are a popular and flexible class of numerical algorithms for solving a broad range of nonlinear filtering problems. However, standard particle filters with importance weights have been shown to require a sample size that increases exponentially with the dimension D of the state space in order to achieve a given performance, which precludes their use in very high-dimensional filtering problems. Here, we focus on the dynamic aspect of this curse of dimensionality (COD) in continuous-time filtering, which is caused by the degeneracy of importance weights over time.
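
The static version of this weight degeneracy is easy to demonstrate: for a fixed particle budget, the effective sample size of an importance-weighted ensemble collapses as the dimension grows. A toy sketch with an assumed Gaussian prior and likelihood:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000   # particle budget

def ess(D):
    # Propose particles from a N(0, I_D) prior and weight by a unit-variance
    # Gaussian likelihood centered at the observation y = (1, ..., 1).
    x = rng.normal(size=(n, D))
    logw = -0.5 * ((x - 1.0) ** 2).sum(axis=1)
    w = np.exp(logw - logw.max())      # stabilize before normalizing
    w /= w.sum()
    return 1.0 / (w ** 2).sum()        # Kong's effective sample size

for D in (1, 5, 20, 50):
    print(D, round(ess(D), 1))         # ESS collapses as D grows
```

In a particle filter this compounding happens over time steps as well as dimensions, which is the dynamic aspect the abstract refers to.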

Energy statistics are estimators of the energy distance that depend on the distances between observations. The idea behind energy statistics is to consider a statistical potential energy that would parallel Newton's gravitational potential energy. This statistical potential energy is zero if and only if a certain null hypothesis relating two distributions holds true.
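
The population quantity is E(X, Y) = 2 E|X - Y| - E|X - X'| - E|Y - Y'|, which is zero iff X and Y have the same distribution. A plug-in estimate from two univariate samples (a sketch; the V-statistic here includes the zero diagonal, a negligible bias for illustration):

```python
import numpy as np

def energy_distance(x, y):
    # Plug-in estimate of 2 E|X - Y| - E|X - X'| - E|Y - Y'|
    # from all pairwise absolute differences.
    dxy = np.abs(x[:, None] - y[None, :]).mean()
    dxx = np.abs(x[:, None] - x[None, :]).mean()
    dyy = np.abs(y[:, None] - y[None, :]).mean()
    return 2 * dxy - dxx - dyy

rng = np.random.default_rng(5)
same = energy_distance(rng.normal(size=500), rng.normal(size=500))
diff = energy_distance(rng.normal(size=500), rng.normal(2.0, 1.0, size=500))
print(same, diff)   # near zero for equal distributions, clearly positive otherwise
```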