Performance Bounds for Graphical Record Linkage

Record linkage involves merging records in large, noisy databases to remove duplicate entities. It has become an important area because of its widespread occurrence in bibliometrics, public health, official statistics production, political science, and beyond. Traditional linkage methods, which directly link records to one another, become computationally infeasible as the number of records grows. As a result, it is increasingly common for researchers to treat record linkage as a clustering task, in which each latent entity is associated with one or more noisy database records. We critically assess performance bounds using the Kullback-Leibler (KL) divergence under a Bayesian record linkage framework, making connections to Kolchin partition models. We provide an upper bound using the KL divergence and a lower bound on the minimum probability of misclassifying a latent entity. We give insights into when our bounds hold using simulated data and provide practical user guidance.

Comments: 11 pages with supplement; 4 figures and 2 tables; to appear in AISTATS 2017

Similar Publications

Particle filters are a popular and flexible class of numerical algorithms to solve a large class of nonlinear filtering problems. However, standard particle filters with importance weights have been shown to require a sample size that increases exponentially with the dimension D of the state space in order to achieve a certain performance, which precludes their use in very high-dimensional filtering problems. Here, we focus on the dynamic aspect of this curse of dimensionality (COD) in continuous time filtering, which is caused by the degeneracy of importance weights over time.

We consider the statistical inverse problem to recover $f$ from noisy measurements $Y = Tf + \sigma \xi$ where $\xi$ is Gaussian white noise and $T$ a compact operator between Hilbert spaces. Considering general reconstruction methods of the form $\hat f_\alpha = q_\alpha \left(T^*T\right)T^*Y$ with an ordered filter $q_\alpha$, we investigate the choice of the regularization parameter $\alpha$ by minimizing an unbiased estimate of the predictive risk $\mathbb E\left[\Vert Tf - T\hat f_\alpha\Vert^2\right]$. The corresponding parameter $\alpha_{\mathrm{pred}}$ and its usage are well-known in the literature, but oracle inequalities and optimality results in this general setting are unknown.

We study detection methods for multivariable signals under dependent noise. The main focus is on three-dimensional signals, i.e.

We propose an objective prior distribution on correlation kernel parameters for Simple Kriging models in the spirit of reference priors. Because it is proper and defined through its conditional densities, it and its associated posterior distribution lend themselves well to Gibbs sampling, thus making the full-Bayesian procedure tractable. Numerical examples show it has near-optimal frequentist performance in terms of prediction interval coverage.

In this work, nonparametric statistical inference is provided for the continuous-time M/G/1 queueing model from a Bayesian point of view. The inference is based on observations of the inter-arrival and service times. Besides other characteristics of the system, particular interest is in the waiting time distribution, which is not accessible in closed form.

This paper studies the minimum distance estimation problem for the panel data model. We propose minimum distance estimators of the regression parameters of the panel data model and investigate their asymptotic distributions. This paper contains two main contributions.

We consider a compound testing problem within the Gaussian sequence model in which the null and alternative are specified by a pair of closed, convex cones. Such cone testing problems arise in various applications, including detection of treatment effects, trend detection in econometrics, signal detection in radar processing, and shape-constrained inference in non-parametric statistics. We provide a sharp characterization of the GLRT testing radius up to a universal multiplicative constant in terms of the geometric structure of the underlying convex cones.

Principal Component Analysis (PCA) is a classical method for reducing the dimensionality of data by projecting them onto a subspace that captures most of their variation. Effective use of PCA in modern applications requires understanding its performance for data that are both high-dimensional (i.e.

Decision-makers often learn by acquiring information from distinct sources that possibly provide complementary information. We consider a decision-maker who sequentially samples from a finite set of Gaussian signals, and wants to predict a persistent multi-dimensional state at an unknown final period. What signal should he choose to observe in each period? Related problems about optimal experimentation and dynamic learning tend to have solutions that can only be approximated or implicitly characterized.