Using Aggregated Relational Data to feasibly identify network structure without network data

Social and economic network data can be useful for both researchers and policymakers, but are often impractical to collect. We propose collecting Aggregated Relational Data (ARD) using questions that are simple and easy to add to any survey. These questions are of the form "How many of your friends in the village have trait k?" We show that by collecting ARD on even a small share of the population, researchers can recover the likely distribution of statistics from the underlying network. We provide three empirical examples. We first apply the technique to the 75 village networks in Karnataka, India, where Banerjee et al. (2016b) collected near-complete network data. We show that with ARD alone on even a 29% sample, we can accurately estimate both node-level features (such as eigenvector centrality and clustering) and network-level features (such as the maximum eigenvalue and average path length). To further demonstrate the power of the approach, we apply our technique to two settings analyzed previously by the authors. We show that ARD could have been used to predict how to assign monitors to savers to increase savings in rural villages (Breza and Chandrasekhar, 2016); ARD would have led to the same conclusions the authors reached using expensive near-complete network data. We then provide an example where survey ARD was collected along with some partial network data, and demonstrate that the same conclusions would have been drawn using only the ARD, and that with the ARD the researchers could more generally measure the impact of microfinance exposure on social capital in urban slums (Banerjee et al., 2016a).
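To make the form of an ARD question concrete, the following minimal sketch computes the answers each respondent would give if the network were known. It illustrates only what the data look like, not the estimation procedure of the paper, where the network is unobserved. The function name `ard_responses`, the toy adjacency matrix `A`, and the trait matrix `T` are assumptions made up for illustration.

```python
import numpy as np

def ard_responses(adjacency, traits):
    """Return y where y[i, k] is the number of i's friends that have trait k.

    adjacency: (n, n) 0/1 matrix, adjacency[i, j] = 1 if j is a friend of i.
    traits:    (n, K) 0/1 matrix, traits[j, k] = 1 if person j has trait k.
    """
    adjacency = np.asarray(adjacency)
    traits = np.asarray(traits)
    # Summing trait indicators over each person's friends is a matrix product.
    return adjacency @ traits

# Hypothetical toy village with 4 people and undirected friendships.
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
])

# Two hypothetical binary traits per person.
T = np.array([
    [1, 0],
    [0, 1],
    [1, 1],
    [0, 0],
])

y = ard_responses(A, T)

# A survey would only record rows of y for the sampled respondents.
surveyed = [0, 2]
y_sample = y[surveyed]
print(y_sample)
```

In a real survey only `y_sample` and the respondents' own traits would be observed; the matrix `A` is exactly what the researcher cannot afford to collect.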

Similar Publications

We propose a vector generalized additive modeling framework for taking into account the effect of covariates on angular density functions in a multivariate extreme value context. The proposed methods are tailored for settings where the dependence between extreme values may change according to covariates. We devise a maximum penalized log-likelihood estimator, discuss details of the estimation procedure, and derive its consistency and asymptotic normality.

Several attempts have been made in the literature to generalize univariate reliability concepts to bivariate as well as multivariate setups. Here we extend the univariate quantile-based reliability concepts to the bivariate case based on quantile curves. We propose a bivariate hazard rate and a bivariate mean residual life function based on quantile curves, and study their uniqueness properties to determine the underlying quantile curve.

Parametric empirical Bayes (EB) estimators have been widely used in a variety of fields, including small area estimation and disease mapping. Since the EB estimator is constructed by plugging in estimates of the parameters of the prior distributions, it may perform poorly if those estimates are unstable. This can happen when the number of samples is small or moderate.

Many modern big data applications feature large scale in both numbers of responses and predictors. Better statistical efficiency and scientific insights can be enabled by understanding the large-scale response-predictor association network structures via layers of sparse latent factors ranked by importance. Yet sparsity and orthogonality have been two largely incompatible goals.

Dynamic treatment regimes (DTRs) aim to formalize personalized medicine by tailoring treatment decisions to individual patient characteristics. G-estimation for DTR identification targets the parameters of a structural nested mean model known as the blip function from which the optimal DTR is derived. Despite considerable work deriving such estimation methods, there has been little focus on extending G-estimation to the case of non-additive effects, non-continuous outcomes or on model selection.

Missing data are a common problem for both the construction and implementation of a prediction algorithm. Pattern mixture kernel submodels (PMKS) - a series of submodels for every missing data pattern that are fit using only data from that pattern - are a computationally efficient remedy for both stages. Here we show that PMKS yield the most predictive algorithm among all standard missing data strategies.

The areas of model selection and model evaluation for predictive modeling have received extensive treatment in the statistics literature, leading to both theoretical advances and practical methods based on covariance penalties and other approaches. However, the majority of this work, and especially the practical approaches, are based on the "Fixed-X assumption", where covariate values are assumed to be non-random and known. By contrast, in most modern predictive modeling applications, it is more reasonable to take the "Random-X" view, where future prediction points are random and new.

There is a growing demand for nonparametric conditional density estimators (CDEs) in fields such as astronomy and economics. In astronomy, for example, one can dramatically improve estimates of the parameters that dictate the evolution of the Universe by working with full conditional densities instead of regression (i.e.

This note proposes a consistent bootstrap-based distributional approximation for cube root consistent estimators such as the maximum score estimator of Manski (1975) and the isotonic density estimator of Grenander (1956). In both cases, the standard nonparametric bootstrap is known to be inconsistent. Our method restores consistency of the nonparametric bootstrap by altering the shape of the criterion function defining the estimator whose distribution we seek to approximate.

Under the banner of "Big Data", the detection and classification of structure in extremely large, high-dimensional data sets is one of the central statistical challenges of our time. Among the most intriguing approaches to this challenge is "TDA", or "Topological Data Analysis", one of the primary aims of which is providing non-metric, but topologically informative, pre-analyses of data sets which make later, more quantitative analyses feasible. While TDA rests on strong mathematical foundations from topology, in applications it has faced challenges due to an inability to handle issues of statistical reliability and robustness and, most importantly, an inability to make scientific claims with verifiable levels of statistical confidence.