Rebecca C. Steorts

Pub Categories

Statistics - Methodology (11)
Statistics - Applications (9)
Statistics - Machine Learning (5)
Computer Science - Databases (2)
Statistics - Computation (2)
Statistics - Theory (2)
Mathematics - Statistics (2)
Mathematics - Information Theory (1)
Computer Science - Information Theory (1)

Publications Authored By Rebecca C. Steorts

Record linkage involves merging records in large, noisy databases to remove duplicate entities. It has become an important area because of its widespread occurrence in bibliometrics, public health, official statistics production, political science, and beyond. Traditional methods that directly link records to one another are computationally infeasible as the number of records grows, since the number of candidate pairs grows quadratically.

Most generative models for clustering implicitly assume that the number of data points in each cluster grows linearly with the total number of data points. Finite mixture models, Dirichlet process mixture models, and Pitman-Yor process mixture models make this assumption, as do all other infinitely exchangeable clustering models. However, for some applications, this assumption is inappropriate.
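
To see why the linear-growth assumption matters, here is a minimal simulation (not from the paper; the function name and parameter values are illustrative) drawing cluster sizes from a Chinese restaurant process, the clustering distribution induced by a Dirichlet process mixture. The largest cluster occupies a roughly constant fraction of the data as n grows, whereas in entity resolution each latent entity should generate only a handful of records.

```python
import random

def crp_cluster_sizes(n, alpha=1.0, seed=0):
    """Simulate cluster sizes for n points from a Chinese restaurant process."""
    rng = random.Random(seed)
    sizes = []  # sizes[k] = number of points in cluster k
    for i in range(n):
        # New cluster with probability alpha / (i + alpha),
        # else join an existing cluster with probability proportional to its size.
        if rng.random() < alpha / (i + alpha):
            sizes.append(1)
        else:
            r = rng.uniform(0, i)
            acc = 0.0
            for k, s in enumerate(sizes):
                acc += s
                if r < acc:
                    sizes[k] += 1
                    break
    return sizes

for n in [1_000, 10_000, 100_000]:
    sizes = crp_cluster_sizes(n)
    # The largest cluster stays a roughly constant fraction of n: linear growth.
    print(n, max(sizes), max(sizes) / n)
```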

A plethora of networks is being collected in a growing number of fields, including disease transmission, international relations, social interactions, and others. As data streams continue to grow, the complexity associated with these highly multidimensional connectivity data presents novel challenges. In this paper, we focus on the time-varying interconnections among a set of actors in multiple contexts, called layers.

Most generative models for clustering implicitly assume that the number of data points in each cluster grows linearly with the total number of data points. Finite mixture models, Dirichlet process mixture models, and Pitman-Yor process mixture models make this assumption, as do all other infinitely exchangeable clustering models. However, for some tasks, this assumption is undesirable.

Estimation of death counts and associated standard errors is of great importance in armed conflicts such as the ongoing violence in Syria, as well as historical conflicts in Guatemala, Perú, Colombia, Timor-Leste, and Kosovo. For example, statistical estimates of death counts were cited as important evidence in the trial of General Efraín Ríos Montt for acts of genocide in Guatemala. Estimation relies on both record linkage and multiple systems estimation.
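
As a toy illustration of multiple systems estimation, the sketch below computes the simplest two-list capture-recapture estimate (Chapman's variant of Lincoln-Petersen). Real conflict analyses use more than two lists and model dependence between them; the counts here are entirely hypothetical.

```python
def lincoln_petersen(n1, n2, m):
    """Two-list capture-recapture estimate of total population size.

    n1, n2: records on list 1 and list 2 (after record linkage / deduplication)
    m:      records found on both lists (the linked overlap)
    """
    # Chapman's variant is less biased than the raw n1 * n2 / m when m is small.
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1

# Hypothetical counts: 900 deaths documented on list A, 700 on list B, 300 on both.
print(round(lincoln_petersen(900, 700, 300)))  # estimated total, incl. undocumented deaths
```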

We develop constrained Bayesian estimation methods for small area problems: those requiring smoothness with respect to similarity across areas, such as geographic proximity or clustering by covariates, and those requiring benchmarking constraints, where (weighted) means of estimates must agree across levels of aggregation. We develop our constrained estimation methods decision-theoretically and discuss their geometric interpretation. Our constrained estimators are the solutions to tractable optimization problems and have closed forms.
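
A minimal instance of such a closed-form benchmarked estimator is sketched below: adjust area estimates as little as possible (in a weighted squared-error sense) while forcing their weighted mean to equal a benchmark total. This is a generic illustration of a benchmarking constraint, not necessarily the paper's exact estimator; the function name and numbers are illustrative.

```python
import numpy as np

def benchmark_additive(theta_hat, w, t, phi=None):
    """Adjust estimates theta_hat so that sum(w * delta) == t, minimizing
    sum(phi * (delta - theta_hat)^2).

    Lagrangian solution: delta_i = theta_hat_i + (w_i / phi_i) * lam,
    with lam set so the constraint holds exactly.
    """
    theta_hat, w = np.asarray(theta_hat, float), np.asarray(w, float)
    phi = w if phi is None else np.asarray(phi, float)
    lam = (t - w @ theta_hat) / np.sum(w**2 / phi)
    return theta_hat + (w / phi) * lam

theta_hat = np.array([10.0, 12.0, 8.0])  # hypothetical area-level estimates
w = np.array([0.5, 0.3, 0.2])            # aggregation weights
delta = benchmark_additive(theta_hat, w, t=10.5)
print(delta, w @ delta)                  # weighted mean now equals the benchmark 10.5
```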

Bayesian entity resolution merges multiple noisy databases and returns the minimal collection of unique individuals represented, together with their true, latent record values. Bayesian methods allow flexible generative models that share power across databases, as well as principled quantification of uncertainty for queries of the final, resolved database. However, existing Bayesian methods for entity resolution rely on Markov chain Monte Carlo (MCMC) approximations and are too slow to run on modern databases containing millions or billions of records.

Databases often contain corrupted, degraded, and noisy data, with duplicate entries across and within each database. Such problems arise in citations, medical databases, genetics, human rights databases, and a variety of other applied settings. The target of statistical inference can be viewed as an unsupervised problem of determining the edges of a bipartite graph that links the observed records to unobserved latent entities.
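
A minimal way to represent that bipartite graph is an assignment vector mapping each observed record to a latent entity, as in the sketch below (record names and assignments are hypothetical).

```python
from collections import defaultdict

# Each observed record points to exactly one latent entity.
records = ["rec_a1", "rec_a2", "rec_b1", "rec_b2", "rec_b3"]

# lam[i] = index of the latent entity that generated record i.
# Records 0 and 2 co-refer (both link to latent entity 0); the rest are singletons.
lam = [0, 1, 0, 2, 3]

# Recover the clusters (sets of co-referent records) implied by the graph.
clusters = defaultdict(list)
for rec, ent in zip(records, lam):
    clusters[ent].append(rec)
print(dict(clusters))  # {0: ['rec_a1', 'rec_b1'], 1: ['rec_a2'], ...}
```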

Record linkage seeks to merge databases and to remove duplicates when unique identifiers are not available. Most approaches use blocking techniques to reduce the computational complexity associated with record linkage. We review traditional blocking techniques, which typically partition the records according to a set of field attributes, and consider two variants of a method known as locality sensitive hashing, sometimes referred to as "private blocking."
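
For intuition, here is a rough minhash-style blocking sketch, one common form of locality sensitive hashing (this is a generic illustration under my own assumptions, not the specific variants the paper studies). Records whose name shingles overlap heavily tend to receive the same block key, so expensive pairwise comparisons happen only within blocks.

```python
import hashlib

def shingles(s, k=2):
    """Character k-grams of a lowercased string."""
    s = s.lower()
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def minhash_block(name, num_hashes=4):
    """Block key: the minimum hash of the name's shingles under several
    seeded hash functions. Similar strings tend to share the key."""
    key = []
    for seed in range(num_hashes):
        key.append(min(
            hashlib.md5(f"{seed}:{sh}".encode()).hexdigest()
            for sh in shingles(name)
        ))
    return tuple(key)

names = ["Rebecca Steorts", "Rebeca Steorts", "John Smith"]
blocks = {}
for n in names:
    blocks.setdefault(minhash_block(n), []).append(n)
print(list(blocks.values()))  # only names within the same block get compared pairwise
```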

We congratulate the authors for a stimulating and valuable manuscript, providing a careful review of the state-of-the-art in cross-sectional and time-series benchmarking procedures for small area estimation. They develop a novel two-stage benchmarking method for hierarchical time series models, where they evaluate their procedure by estimating monthly total unemployment using data from the U.S.

Discussion of "Estimating the Distribution of Dietary Consumption Patterns" by Raymond J. Carroll [arXiv:1405.4667].

We propose a novel unsupervised approach for linking records across arbitrarily many files, while simultaneously detecting duplicate records within files. Our key innovation is to represent the pattern of links between records as a bipartite graph, in which records are directly linked to latent true individuals, and only indirectly linked to other records. This flexible new representation of the linkage structure naturally allows us to estimate the attributes of the unique observable people in the population, calculate k-way posterior probabilities of matches across records, and propagate the uncertainty of record linkage into later analyses.
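
Given posterior samples of the record-to-entity assignment vector, a k-way match probability is simply the fraction of samples in which all k records share a latent entity. The sketch below illustrates that computation on made-up posterior draws (the data and function name are hypothetical, not the paper's code).

```python
def posterior_match_prob(samples, record_ids):
    """Fraction of posterior samples in which all the given records are
    linked to the same latent entity (a k-way match probability).

    samples: list of assignment vectors; samples[s][i] is the latent
             entity of record i in posterior draw s.
    """
    hits = sum(
        len({draw[i] for i in record_ids}) == 1  # all records share one entity?
        for draw in samples
    )
    return hits / len(samples)

# Three hypothetical posterior draws over five records:
samples = [
    [0, 1, 0, 2, 2],
    [0, 1, 0, 2, 3],
    [0, 0, 0, 2, 2],
]
print(posterior_match_prob(samples, [0, 2]))     # pairwise: 3/3 = 1.0
print(posterior_match_prob(samples, [3, 4]))     # pairwise: 2/3
print(posterior_match_prob(samples, [0, 1, 2]))  # 3-way:    1/3
```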

Functional neuroimaging measures how the brain responds to complex stimuli. However, sample sizes are modest, noise is substantial, and stimuli are high-dimensional. Hence, direct estimates are inherently imprecise and call for regularization.
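
As a generic illustration of why regularization helps when predictors outnumber observations (this is plain ridge regression on synthetic data, not the paper's estimator), the ordinary least-squares problem below is ill-posed, while the ridge penalty yields stable estimates.

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge estimate (X'X + lam*I)^{-1} X'y: shrinks coefficients toward
    zero, trading a little bias for much lower variance when p >= n."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
n, p = 50, 200                       # more stimulus features than observations
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:5] = 2.0   # only a few features truly matter
y = X @ beta + rng.normal(scale=3.0, size=n)

b_hat = ridge(X, y, lam=10.0)        # X'X alone is singular here; lam*I fixes that
print(np.round(b_hat[:5], 2))        # shrunken but informative estimates
```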

We propose an unsupervised approach for linking records across arbitrarily many files, while simultaneously detecting duplicate records within files. Our key innovation involves the representation of the pattern of links between records as a bipartite graph, in which records are directly linked to latent true individuals, and only indirectly linked to other records. This flexible representation of the linkage structure naturally allows us to estimate the attributes of the unique observable people in the population, calculate transitive linkage probabilities across records (and represent this visually), and propagate the uncertainty of record linkage into later analyses.

There has been recent growth in small area estimation due to the need for more precise estimates for small geographic areas, which has led groups such as the U.S. Census Bureau, Google, and the RAND Corporation to adopt small area estimation procedures.

The PITCHf/x database has allowed the statistical analysis of Major League Baseball (MLB) to flourish since its introduction in late 2006. Using PITCHf/x, pitches have been classified by hand, requiring considerable effort, or with neural network clustering and classification, which is often difficult to interpret. To address these issues, we use model-based clustering with a multivariate Gaussian mixture model and an appropriate adjustment factor as an alternative to current methods.
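
A minimal sketch of the core idea, fitting a multivariate Gaussian mixture to pitch features with scikit-learn on synthetic data (the feature choices and values are hypothetical, and the paper's adjustment factor is not implemented here):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Hypothetical PITCHf/x-style features for one pitcher: (speed mph, horizontal break in).
fastballs = rng.normal([93, -4], [1.5, 1.0], size=(200, 2))
changeups = rng.normal([84, -6], [1.5, 1.0], size=(100, 2))
curves    = rng.normal([78,  5], [1.5, 1.5], size=(80, 2))
X = np.vstack([fastballs, changeups, curves])

# Fit a multivariate Gaussian mixture; each component corresponds to a pitch type.
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)
labels = gmm.predict(X)
print(np.round(gmm.means_, 1))  # cluster centers: (speed, break) per pitch type
print(np.bincount(labels))      # number of pitches assigned to each cluster
```

Unlike hand labeling or a neural network, the fitted means and covariances are directly interpretable as the typical speed and movement of each pitch type.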

We consider benchmarked empirical Bayes (EB) estimators under the basic area-level model of Fay and Herriot while requiring the standard benchmarking constraint. In this paper we determine the excess mean squared error (MSE) from constraining the estimates through benchmarking. We show that the increase due to benchmarking is O(m^{-1}), where m is the number of small areas.
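
The sketch below illustrates the two pieces: EB shrinkage under the Fay-Herriot model, followed by a simple additive benchmarking adjustment. It is a minimal illustration under simplifying assumptions (the model variance A is treated as known, and all numbers are made up), not the paper's derivation.

```python
import numpy as np

def fay_herriot_eb(y, D, x, beta, A):
    """EB estimates under the Fay-Herriot model
       y_i = theta_i + e_i,  theta_i = x_i'beta + u_i,
    with e_i ~ N(0, D_i) and u_i ~ N(0, A). Shrinks each direct
    estimate y_i toward the regression fit x_i'beta."""
    B = D / (A + D)                  # shrinkage weights
    return (1 - B) * y + B * (x @ beta)

def benchmark(est, w, t):
    """Smallest (least-squares) adjustment so that w @ est == t."""
    return est + (t - w @ est) / (w @ w) * w

m = 5
y = np.array([12.0, 9.5, 11.2, 8.8, 10.4])    # direct survey estimates
D = np.array([1.0, 2.0, 0.5, 1.5, 1.0])       # known sampling variances
x = np.ones((m, 1)); beta = np.array([10.0])  # intercept-only regression
eb = fay_herriot_eb(y, D, x, beta, A=1.0)

w = np.full(m, 1 / m)                          # equal aggregation weights
bench = benchmark(eb, w, t=y.mean())           # force agreement with the direct mean
print(np.round(eb, 2), np.round(bench, 2))
# The benchmarking shift per area is small, consistent with the O(1/m) excess MSE.
```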