# David L. Woodruff - University of California Davis

## Contact Details

**Name:** David L. Woodruff

**Affiliation:** University of California Davis

**City:** Davis

**Country:** United States


## Pub Categories

- Computer Science - Data Structures and Algorithms (43)
- Computer Science - Learning (15)
- Computer Science - Computational Complexity (6)
- Mathematics - Numerical Analysis (5)
- Mathematics - Information Theory (5)
- Computer Science - Information Theory (5)
- Statistics - Machine Learning (4)
- Computer Science - Numerical Analysis (3)
- Computer Science - Discrete Mathematics (2)
- Computer Science - Distributed, Parallel, and Cluster Computing (2)
- Computer Science - Computational Geometry (1)
- Statistics - Computation (1)
- Computer Science - Databases (1)

## Publications Authored By David L. Woodruff

We consider the problem of approximating a given matrix by a low-rank matrix so as to minimize the entrywise $\ell_p$-approximation error, for any $p \geq 1$; the case $p = 2$ is the classical SVD problem. We obtain the first provably good approximation algorithms for this version of low-rank approximation that work for every value of $p \geq 1$, including $p = \infty$. Our algorithms are simple, easy to implement, work well in practice, and illustrate interesting tradeoffs between the approximation quality, the running time, and the rank of the approximating matrix. Read More

We study the strong duality of non-convex matrix factorization: we show that, under certain dual conditions, non-convex matrix factorization and its dual have the same optimum. This has been well understood for convex optimization, but little was known for matrix factorization. We formalize the strong duality of matrix factorization through a novel analytical framework, and show that the duality gap is zero for a wide class of matrix factorization problems. Read More

We consider relative error low rank approximation of *tensors* with respect to the Frobenius norm: given an order-$q$ tensor $A \in \mathbb{R}^{\prod_{i=1}^q n_i}$, output a rank-$k$ tensor $B$ for which $\|A-B\|_F^2 \leq (1+\epsilon)\,\mathrm{OPT}$, where $\mathrm{OPT} = \inf_{\textrm{rank-}k~A'} \|A-A'\|_F^2$. Despite the success in obtaining relative error low rank approximations for matrices, no such results were known for tensors. One structural issue is that there may be no rank-$k$ tensor $A_k$ achieving the above infimum. Read More

**Affiliations:** ¹University of California Davis, ²University of Duisburg-Essen

**Category:** Statistics - Computation

In this note we describe experiments on an implementation of two methods proposed in the literature for computing regions that correspond to a notion of order statistics for multidimensional data. Our implementation, which works for any dimension greater than one, is the only one we know of that is publicly available. Experiments run using the software confirm that half-space peeling generally gives better results than directly peeling convex hulls, but at a computational cost. Read More

Understanding the singular value spectrum of a matrix $A \in \mathbb{R}^{n \times n}$ is a fundamental task in countless applications. In matrix multiplication time, it is possible to perform a full SVD and directly compute the singular values $\sigma_1,.. Read More

We show how to compute a relative-error low-rank approximation to any positive semidefinite (PSD) matrix in sublinear time, i.e., for any $n \times n$ PSD matrix $A$, in $\tilde O(n \cdot poly(k/\epsilon))$ time we output a rank-$k$ matrix $B$, in factored form, for which $\|A-B\|_F^2 \leq (1+\epsilon)\|A-A_k\|_F^2$, where $A_k$ is the best rank-$k$ approximation to $A$. Read More

In the communication problem $\mathbf{UR}$ (universal relation) [KRW95], Alice and Bob respectively receive $x, y \in\{0,1\}^n$ with the promise that $x\neq y$. The last player to receive a message must output an index $i$ such that $x_i\neq y_i$. We prove that the randomized one-way communication complexity of this problem in the public coin model is exactly $\Theta(\min\{n,\log(1/\delta)\log^2(\frac n{\log(1/\delta)})\})$ for failure probability $\delta$. Read More

Given an $n \times d$ matrix $A$, its Schatten-$p$ norm, $p \geq 1$, is defined as $\|A\|_p = \left (\sum_{i=1}^{\textrm{rank}(A)}\sigma_i(A)^p \right )^{1/p}$, where $\sigma_i(A)$ is the $i$-th largest singular value of $A$. These norms have been studied in functional analysis in the context of non-commutative $\ell_p$-spaces, and recently in data stream and linear sketching models of computation. Basic questions on the relations between these norms, such as their embeddability, are still open. Read More

Clustering large datasets is a fundamental problem with a number of applications in machine learning. Data is often collected on different sites and clustering needs to be performed in a distributed manner with low communication. We would like the quality of the clustering in the distributed setting to match that in the centralized setting for which all the data resides on a single site. Read More

Kernel Ridge Regression (KRR) is a simple yet powerful technique for non-parametric regression whose computation amounts to solving a linear system. This system is usually dense and highly ill-conditioned. In addition, the dimensions of the matrix are the same as the number of data points, so direct methods are unrealistic for large-scale datasets. Read More

The technique of matrix sketching, such as the use of random projections, has been shown in recent years to be a powerful tool for accelerating many important statistical learning techniques. Research has so far focused largely on using sketching for the "vanilla" un-regularized versions of these techniques. Here we study sketching methods for regularized variants of linear regression, low rank approximations, and canonical correlation analysis. Read More

We study the $\ell_1$-low rank approximation problem, where for a given $n \times d$ matrix $A$ and approximation factor $\alpha \geq 1$, the goal is to output a rank-$k$ matrix $\widehat{A}$ for which $$\|A-\widehat{A}\|_1 \leq \alpha \cdot \min_{\textrm{rank-}k\textrm{ matrices}~A'}\|A-A'\|_1,$$ where for an $n \times d$ matrix $C$, we let $\|C\|_1 = \sum_{i=1}^n \sum_{j=1}^d |C_{i,j}|$. This error measure is known to be more robust than the Frobenius norm in the presence of outliers and is indicated in models where Gaussian assumptions on the noise may not apply. The problem was shown to be NP-hard by Gillis and Vavasis and a number of heuristics have been proposed. Read More
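
To make the robustness remark concrete, the following minimal sketch computes the entrywise $\ell_1$ error $\|A-\widehat{A}\|_1$ defined above next to the squared Frobenius error for the same residual. The matrices and the function names (`entrywise_l1`, `frobenius_sq`, `residual`) are hypothetical and chosen only for illustration; the point is that a single outlier of size 100 contributes 100 to the $\ell_1$ error but 10000 to the squared Frobenius error.

```python
def entrywise_l1(C):
    """Entrywise l1 norm: sum of |C[i][j]| over all entries."""
    return sum(abs(v) for row in C for v in row)


def frobenius_sq(C):
    """Squared Frobenius norm: sum of C[i][j]**2 over all entries."""
    return sum(v * v for row in C for v in row)


def residual(A, B):
    """Entrywise difference A - B of two matrices of the same shape."""
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


# Hypothetical data: a rank-1 matrix whose last entry was corrupted by +100.
A = [[1.0, 2.0], [2.0, 4.0], [3.0, 106.0]]
# A rank-1 candidate that ignores the outlier.
A_hat = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]

R = residual(A, A_hat)
l1_error = entrywise_l1(R)      # the outlier counts linearly
fro_sq_error = frobenius_sq(R)  # the outlier counts quadratically
```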

Have you ever wanted to multiply an $n \times d$ matrix $X$, with $n \gg d$, on the left by an $m \times n$ matrix $\tilde G$ of i.i.d. Read More

For any real number $p > 0$, we nearly completely characterize the space complexity of estimating $\|A\|_p^p = \sum_{i=1}^n \sigma_i^p$ for $n \times n$ matrices $A$ in which each row and each column has $O(1)$ non-zero entries and whose entries are presented one at a time in a data stream model. Here the $\sigma_i$ are the singular values of $A$, and when $p \geq 1$, $\|A\|_p^p$ is the $p$-th power of the Schatten $p$-norm. We show that when $p$ is not an even integer, to obtain a $(1+\epsilon)$-approximation to $\|A\|_p^p$ with constant probability, any $1$-pass algorithm requires $n^{1-g(\epsilon)}$ bits of space, where $g(\epsilon) \rightarrow 0$ as $\epsilon \rightarrow 0$ and $\epsilon > 0$ is a constant independent of $n$. Read More

An old and fundamental problem in databases and data streams is that of finding the heavy hitters, also known as the top-$k$, most popular items, frequent items, elephants, or iceberg queries. There are several variants of this problem, which quantify what it means for an item to be frequent, including what are known as the $\ell_1$-heavy hitters and $\ell_2$-heavy hitters. There are a number of algorithmic solutions for these problems, starting with the work of Misra and Gries, as well as the CountMin and CountSketch data structures, among others. Read More
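
As an illustration of the first algorithm mentioned above, here is a minimal sketch of the Misra-Gries summary for insertion-only streams. The function name `misra_gries` and the example stream are assumptions made for this sketch. With at most $k-1$ counters, each stored count underestimates the true frequency by at most $m/k$, where $m$ is the stream length, so every item occurring more than $m/k$ times is retained.

```python
def misra_gries(stream, k):
    """Misra-Gries summary with at most k - 1 counters.

    For every item, the stored count c satisfies f - m/k <= c <= f,
    where f is the true frequency and m is the stream length
    (a missing item is treated as having count 0).
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop the zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Hypothetical stream: "a" occurs 6 times out of m = 10.
stream = ["a", "b", "a", "c", "a", "d", "a", "e", "a", "a"]
summary = misra_gries(stream, k=3)
```

A second pass over the stream (when one is available) can replace the stored lower bounds with exact counts for the surviving candidates.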

The task of finding heavy hitters is one of the best known and well studied problems in the area of data streams. In sub-polynomial space, the strongest guarantee available is the $\ell_2$ guarantee, which requires finding all items that occur at least $\varepsilon\|f\|_2$ times in the stream, where the $i$th coordinate of the vector $f$ is the number of occurrences of $i$ in the stream. The first algorithm to achieve the $\ell_2$ guarantee was the CountSketch of [CCF04], which for constant $\varepsilon$ requires $O(\log n)$ words of memory and $O(\log n)$ update time, and is known to be space-optimal if the stream allows for deletions. Read More
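
The CountSketch data structure referred to above can be sketched as follows. This is a simplified illustration, not the construction from [CCF04] in full detail: the class name, the parameters `rows` and `width`, and the use of Python's built-in `hash` on salted tuples as a stand-in for pairwise-independent hash functions are all assumptions. Each row hashes an item to one bucket and one random sign, and the estimate is the median over rows. Because updates are linear, deletions are handled by negative updates.

```python
import random
from statistics import median


class CountSketch:
    """Simplified CountSketch: rows x width table of signed counters."""

    def __init__(self, rows, width, seed=0):
        rng = random.Random(seed)
        self.rows = rows
        self.width = width
        # One (bucket salt, sign salt) pair per row; salted built-in hashing
        # stands in for pairwise-independent hash families (an assumption).
        self.salts = [(rng.getrandbits(64), rng.getrandbits(64))
                      for _ in range(rows)]
        self.table = [[0] * width for _ in range(rows)]

    def _bucket(self, r, item):
        return hash((self.salts[r][0], item)) % self.width

    def _sign(self, r, item):
        return 1 if hash((self.salts[r][1], item)) & 1 else -1

    def update(self, item, delta=1):
        """Add delta (possibly negative) to the frequency of item."""
        for r in range(self.rows):
            self.table[r][self._bucket(r, item)] += self._sign(r, item) * delta

    def estimate(self, item):
        """Median over rows of the signed counter that item hashes to."""
        return median(self._sign(r, item) * self.table[r][self._bucket(r, item)]
                      for r in range(self.rows))


# A stream containing a single distinct item is estimated exactly,
# since no other item can collide with it.
cs = CountSketch(rows=5, width=16)
for _ in range(7):
    cs.update(42)
after_inserts = cs.estimate(42)
cs.update(42, -2)  # a deletion
after_deletion = cs.estimate(42)
```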

We give the first optimal bounds for returning the $\ell_1$-heavy hitters in a data stream of insertions, together with their approximate frequencies, closing a long line of work on this problem. For a stream of $m$ items in $\{1, 2, \dots, n\}$ and parameters $0 < \epsilon < \phi \leq 1$, let $f_i$ denote the frequency of item $i$, i.e. Read More

We study distributed low rank approximation in which the matrix to be approximated is only implicitly represented across the different servers. For example, each of $s$ servers may have an $n \times d$ matrix $A^t$, and we may be interested in computing a low rank approximation to $A = f(\sum_{t=1}^s A^t)$, where $f$ is a function which is applied entrywise to the matrix $\sum_{t=1}^s A^t$. We show for a wide class of functions $f$ it is possible to efficiently compute a $d \times d$ rank-$k$ projection matrix $P$ for which $\|A - AP\|_F^2 \leq \|A - [A]_k\|_F^2 + \varepsilon \|A\|_F^2$, where $AP$ denotes the projection of $A$ onto the row span of $P$, and $[A]_k$ denotes the best rank-$k$ approximation to $A$ given by the singular value decomposition. Read More

A central problem in the theory of algorithms for data streams is to determine which functions on a stream can be approximated in sublinear, and especially sub-polynomial or poly-logarithmic, space. Given a function $g$, we study the space complexity of approximating $\sum_{i=1}^n g(|f_i|)$, where $f\in\mathbb{Z}^n$ is the frequency vector of a turnstile stream. This is a generalization of the well-known frequency moments problem, and previous results apply only when $g$ is monotonic or has a special functional form. Read More

We undertake a systematic study of sketching a quadratic form: given an $n \times n$ matrix $A$, create a succinct sketch $\textbf{sk}(A)$ which can produce (without further access to $A$) a multiplicative $(1+\epsilon)$-approximation to $x^T A x$ for any desired query $x \in \mathbb{R}^n$. While a general matrix does not admit non-trivial sketches, positive semi-definite (PSD) matrices admit sketches of size $\Theta(\epsilon^{-2} n)$, via the Johnson-Lindenstrauss lemma, achieving the "for each" guarantee, namely, for each query $x$, with a constant probability the sketch succeeds. (For the stronger "for all" guarantee, where the sketch succeeds for all $x$'s simultaneously, again there are no non-trivial sketches. Read More

Given a stream $p_1, \ldots, p_m$ of items from a universe $\mathcal{U}$, which, without loss of generality we identify with the set of integers $\{1, 2, \ldots, n\}$, we consider the problem of returning all $\ell_2$-heavy hitters, i.e., those items $j$ for which $f_j \geq \epsilon \sqrt{F_2}$, where $f_j$ is the number of occurrences of item $j$ in the stream, and $F_2 = \sum_{i \in [n]} f_i^2$. Read More
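
The definition above can be checked directly with an exact, non-streaming computation. The following sketch is only meant to make the threshold $f_j \geq \epsilon \sqrt{F_2}$ concrete; it stores every frequency and is therefore not a small-space algorithm. The function name and the example stream are hypothetical.

```python
import math
from collections import Counter


def l2_heavy_hitters(stream, eps):
    """Exact l2-heavy hitters: items j with f_j >= eps * sqrt(F_2)."""
    freq = Counter(stream)
    f2 = sum(c * c for c in freq.values())  # second frequency moment F_2
    threshold = eps * math.sqrt(f2)
    return {j for j, c in freq.items() if c >= threshold}


# Hypothetical stream: item 1 occurs 10 times, items 2..101 occur once each.
stream = [1] * 10 + list(range(2, 102))
heavy = l2_heavy_hitters(stream, eps=0.5)

# With the same eps, item 1 is not an l1-heavy hitter:
# its frequency 10 is below eps * m = 55.
is_l1_heavy = 10 >= 0.5 * len(stream)
```

In this example $F_2 = 200$, so the $\ell_2$ threshold is about $7.07$ while the $\ell_1$ threshold $\epsilon m$ is $55$, which illustrates why the $\ell_2$ guarantee can report items that the $\ell_1$ guarantee misses.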

In the subspace approximation problem, we seek a $k$-dimensional subspace $F$ of $\mathbb{R}^d$ that minimizes the sum of $p$-th powers of Euclidean distances to a given set of $n$ points $a_1$, ... Read More

We prove, using the subspace embedding guarantee in a black box way, that one can achieve the spectral norm guarantee for approximate matrix multiplication with a dimensionality-reducing map having $m = O(\tilde{r}/\varepsilon^2)$ rows. Here $\tilde{r}$ is the maximum stable rank, i.e. Read More

We study the tradeoff between the statistical error and communication cost of distributed statistical estimation problems in high dimensions. In the distributed sparse Gaussian mean estimation problem, each of the $m$ machines receives $n$ data points from a $d$-dimensional Gaussian distribution with unknown mean $\theta$ which is promised to be $k$-sparse. The machines communicate by message passing and aim to estimate the mean $\theta$. Read More

Recently, [Bhattacharya et al., STOC 2015] provided the first non-trivial algorithm for the densest subgraph problem in the streaming model with additions and deletions to its edges, i.e. Read More

We study the Principal Component Analysis (PCA) problem in the distributed and streaming models of computation. Given a matrix $A \in R^{m \times n},$ a rank parameter $k < rank(A)$, and an accuracy parameter $0 < \epsilon < 1$, we want to output an $m \times k$ orthonormal matrix $U$ for which $$ || A - U U^T A ||_F^2 \le \left(1 + \epsilon \right) \cdot || A - A_k||_F^2, $$ where $A_k \in R^{m \times n}$ is the best rank-$k$ approximation to $A$. This paper provides improved algorithms for distributed PCA and streaming PCA. Read More

We initiate the study of trade-offs between sparsity and the number of measurements in sparse recovery schemes for generic norms. Specifically, for a norm $\|\cdot\|$, sparsity parameter $k$, approximation factor $K>0$, and probability of failure $P>0$, we ask: what is the minimal value of $m$ so that there is a distribution over $m \times n$ matrices $A$ with the property that for any $x$, given $Ax$, we can recover a $k$-sparse approximation to $x$ in the given norm with probability at least $1-P$? We give a partial answer to this problem, by showing that for norms that admit efficient linear sketches, the optimal number of measurements $m$ is closely related to the doubling dimension of the metric induced by the norm $\|\cdot\|$ on the set of all $k$-sparse vectors. By applying our result to specific norms, we cast known measurement bounds in our general framework (for the $\ell_p$ norms, $p \in [1,2]$) as well as provide new, measurement-efficient schemes (for the Earth-Mover Distance norm). Read More

Kernel Principal Component Analysis (KPCA) is a key machine learning algorithm for extracting nonlinear features from data. In the presence of a large volume of high dimensional data collected in a distributed fashion, it becomes very costly to communicate all of this data to a single data center and then perform kernel PCA. Can we perform kernel PCA on the entire dataset in a distributed and communication efficient fashion while maintaining provable and strong guarantees in solution quality? In this paper, we give an affirmative answer to the question by developing a communication efficient algorithm to perform kernel PCA in the distributed setting. Read More

We describe a new algorithm called Frequent Directions for deterministic matrix sketching in the row-updates model. The algorithm is presented with an arbitrary input matrix $A \in R^{n \times d}$ one row at a time. It performs $O(d \times \ell)$ operations per row and maintains a sketch matrix $B \in R^{\ell \times d}$ such that for any $k < \ell$, $\|A^TA - B^TB \|_2 \leq \|A - A_k\|_F^2 / (\ell-k)$ and $\|A - \pi_{B_k}(A)\|_F^2 \leq \big(1 + \frac{k}{\ell-k}\big) \|A-A_k\|_F^2$. Read More

We study the problem of compressing a weighted graph $G$ on $n$ vertices, building a "sketch" $H$ of $G$, so that given any vector $x \in \mathbb{R}^n$, the value $x^T L_G x$ can be approximated up to a multiplicative $1+\epsilon$ factor from only $H$ and $x$, where $L_G$ denotes the Laplacian of $G$. One solution to this problem is to build a spectral sparsifier $H$ of $G$, which, using the result of Batson, Spielman, and Srivastava, consists of $O(n \epsilon^{-2})$ reweighted edges of $G$ and has the property that simultaneously for all $x \in \mathbb{R}^n$, $x^T L_H x = (1 \pm \epsilon) x^T L_G x$. The $O(n \epsilon^{-2})$ bound is optimal for spectral sparsifiers. Read More
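
The quantity being sketched can be written as a sum over edges, $x^T L_G x = \sum_{(u,v)} w_{uv}(x_u - x_v)^2$, and for a 0/1 indicator vector of a vertex set $S$ it equals the weight of the cut $(S, \bar S)$. The following minimal sketch verifies this identity on a small example; the graph, the function names, and the edge-list representation are assumptions made for illustration.

```python
def laplacian_quadratic_form(edges, x):
    """x^T L_G x computed edge by edge as sum of w * (x[u] - x[v])**2."""
    return sum(w * (x[u] - x[v]) ** 2 for u, v, w in edges)


def laplacian(n, edges):
    """Dense Laplacian L_G = D - W of a weighted undirected graph."""
    L = [[0.0] * n for _ in range(n)]
    for u, v, w in edges:
        L[u][u] += w
        L[v][v] += w
        L[u][v] -= w
        L[v][u] -= w
    return L


def quadratic_form(L, x):
    """x^T L x for a dense matrix L."""
    n = len(x)
    return sum(x[i] * L[i][j] * x[j] for i in range(n) for j in range(n))


# Hypothetical weighted triangle on vertices 0, 1, 2.
edges = [(0, 1, 2.0), (1, 2, 1.0), (0, 2, 3.0)]

x = [1.0, -2.0, 0.5]
via_edges = laplacian_quadratic_form(edges, x)
via_matrix = quadratic_form(laplacian(3, edges), x)

# Indicator vector of S = {0}: the quadratic form is the cut weight 2 + 3.
cut_weight = laplacian_quadratic_form(edges, [1.0, 0.0, 0.0])
```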

This survey highlights the recent advances in algorithms for numerical linear algebra that have come from the technique of linear sketching, whereby given a matrix, one first compresses it to a much smaller matrix by multiplying it by a (usually) random matrix with certain properties. Much of the expensive computation can then be performed on the smaller matrix, thereby accelerating the solution for the original problem. In this survey we consider least squares as well as robust regression problems, low rank approximation, and graph sparsification. Read More

We study the distributed computing setting in which there are multiple servers, each holding a set of points, who wish to compute functions on the union of their point sets. A key task in this setting is Principal Component Analysis (PCA), in which the servers would like to compute a low dimensional subspace capturing as much of the variance of the union of their point sets as possible. Given a procedure for approximate PCA, one can use it to approximately solve $\ell_2$-error fitting problems such as $k$-means clustering and subspace clustering. Read More

We study the communication complexity of linear algebraic problems over finite fields in the multi-player message passing model, proving a number of tight lower bounds. Specifically, for a matrix which is distributed among a number of players, we consider the problem of determining its rank, of computing entries in its inverse, and of solving linear equations. We also consider related problems such as computing the generalized inner product of vectors held on different servers. Read More

The CUR decomposition of an $m \times n$ matrix $A$ finds an $m \times c$ matrix $C$ with a subset of $c < n$ columns of $A,$ together with an $r \times n$ matrix $R$ with a subset of $r < m$ rows of $A,$ as well as a $c \times r$ low-rank matrix $U$ such that the matrix $C U R$ approximates the matrix $A,$ that is, $ || A - CUR ||_F^2 \le (1+\epsilon) || A - A_k||_F^2$, where $||.||_F$ denotes the Frobenius norm and $A_k$ is the best $m \times n$ matrix of rank $k$ constructed via the SVD. We present input-sparsity-time and deterministic algorithms for constructing such a CUR decomposition where $c=O(k/\epsilon)$ and $r=O(k/\epsilon)$ and rank$(U) = k$. Read More

We study the problem of sketching an input graph, so that given the sketch, one can estimate the weight of any cut in the graph within factor $1+\epsilon$. We present lower and upper bounds on the size of a randomized sketch, focusing on the dependence on the accuracy parameter $\epsilon>0$. First, we prove that for every $\epsilon > 1/\sqrt n$, every sketch that succeeds (with constant probability) in estimating the weight of all cuts $(S,\bar S)$ in an $n$-vertex graph (simultaneously), must be of size $\Omega(n/\epsilon^2)$ bits. Read More

Oblivious low-distortion subspace embeddings are a crucial building block for numerical linear algebra problems. We show for any real $p, 1 \leq p < \infty$, given a matrix $M \in \mathbb{R}^{n \times d}$ with $n \gg d$, with constant probability we can choose a matrix $\Pi$ with $\max(1, n^{1-2/p})\, \mathrm{poly}(d)$ rows and $n$ columns so that simultaneously for all $x \in \mathbb{R}^d$, $\|Mx\|_p \leq \|\Pi Mx\|_{\infty} \leq \mathrm{poly}(d) \|Mx\|_p.$ Importantly, $\Pi M$ can be computed in the optimal $O(\mathrm{nnz}(M))$ time, where $\mathrm{nnz}(M)$ is the number of non-zero entries of $M$. Read More

We consider a number of fundamental statistical and graph problems in the message-passing model, where we have $k$ machines (sites), each holding a piece of data, and the machines want to jointly solve a problem defined on the union of the $k$ data sets. The communication is point-to-point, and the goal is to minimize the total communication among the $k$ machines. This model captures all point-to-point distributed computational models with respect to minimizing communication costs. Read More

We consider algorithmic problems in the setting in which the input data has been partitioned arbitrarily on many servers. The goal is to compute a function of all the data, and the bottleneck is the communication used by the algorithm. We present algorithms for two illustrative problems on massive data sets: (1) computing a low-rank approximation of a matrix $A=A^1 + A^2 + \ldots + A^s$, with matrix $A^t$ stored on server $t$ and (2) computing a function of a vector $a_1 + a_2 + \ldots + a_s$, where server $t$ has the vector $a_t$; this includes the well-studied special case of computing frequency moments and separable functions, as well as higher-order correlations such as the number of subgraphs of a specified type occurring in a graph. Read More

The set disjointness problem is one of the most fundamental and well-studied problems in communication complexity. In this problem Alice and Bob hold sets $S, T \subseteq [n]$, respectively, and the goal is to decide if $S \cap T = \emptyset$. Reductions from set disjointness are a canonical way of proving lower bounds in data stream algorithms, data structures, and distributed computation. Read More

Linear sketches are powerful algorithmic tools that turn an n-dimensional input into a concise lower-dimensional representation via a linear transformation. Such sketches have seen a wide range of applications including norm estimation over data streams, compressed sensing, and distributed computing. In almost any realistic setting, however, a linear sketch faces the possibility that its inputs are correlated with previous evaluations of the sketch. Read More

We design a new distribution over $\mathrm{poly}(r \epsilon^{-1}) \times n$ matrices $S$ so that for any fixed $n \times d$ matrix $A$ of rank $r$, with probability at least 9/10, $\|SAx\|_2 = (1 \pm \epsilon)\|Ax\|_2$ simultaneously for all $x \in \mathbb{R}^d$. Such a matrix $S$ is called a *subspace embedding*. Furthermore, $SA$ can be computed in $\mathrm{nnz}(A) + \mathrm{poly}(d \epsilon^{-1})$ time, where $\mathrm{nnz}(A)$ is the number of non-zero entries of $A$. Read More
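
One well-known way to apply a sketching matrix in time proportional to the number of non-zero entries is to give each column of $S$ a single random $\pm 1$ entry, in the style of CountSketch. The abstract does not spell out its distribution, so treating this as the construction is an assumption; the function name `sparse_embed`, the coordinate-list input format, and the parameter `m` (number of rows of $S$) are likewise hypothetical. The sketch shows that $SA$ is then obtained in one pass over the non-zero entries of $A$.

```python
import random


def sparse_embed(entries, n, d, m, seed=0):
    """Compute SA for an m x n matrix S with one random +-1 per column.

    entries: list of (i, j, value) triples, the non-zeros of an n x d matrix A.
    Returns SA as a dense m x d list of lists.
    """
    rng = random.Random(seed)
    bucket = [rng.randrange(m) for _ in range(n)]   # target row of S for column i
    sign = [rng.choice((-1, 1)) for _ in range(n)]  # the +-1 entry in column i
    SA = [[0.0] * d for _ in range(m)]
    # One pass over the non-zeros: each entry of A touches one entry of SA.
    for i, j, v in entries:
        SA[bucket[i]][j] += sign[i] * v
    return SA


# Hypothetical 5 x 2 matrix A whose only non-zero row is row 3: (2, -1).
entries = [(3, 0, 2.0), (3, 1, -1.0)]
SA = sparse_embed(entries, n=5, d=2, m=4)
sketch_fro_sq = sum(v * v for row in SA for v in row)
```

With a single non-zero row there are no collisions, so the sketch preserves the squared Frobenius norm of $A$ (here $5$) exactly; in general the guarantee holds only with the stated probability and for sufficiently many rows.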

We provide fast algorithms for overconstrained $\ell_p$ regression and related problems: for an $n\times d$ input matrix $A$ and vector $b\in\mathbb{R}^n$, in $O(nd\log n)$ time we reduce the problem $\min_{x\in\mathbb{R}^d} \|Ax-b\|_p$ to the same problem with input matrix $\tilde A$ of dimension $s \times d$ and corresponding $\tilde b$ of dimension $s\times 1$. Here, $\tilde A$ and $\tilde b$ are a coreset for the problem, consisting of sampled and rescaled rows of $A$ and $b$; and $s$ is independent of $n$ and polynomial in $d$. Our results improve on the best previous algorithms when $n\gg d$, for all $p\in[1,\infty)$ except $p=2$. Read More

We study classic streaming and sparse recovery problems using deterministic linear sketches, including $\ell_1/\ell_1$ and $\ell_\infty/\ell_1$ sparse recovery problems (the latter also being known as $\ell_1$-heavy hitters), norm estimation, and approximate inner product. We focus on devising a fixed matrix $A \in \mathbb{R}^{m \times n}$ and a deterministic recovery/estimation procedure which work for all possible input vectors simultaneously. Our results improve upon existing work, the following being our main contributions:

* A proof that $\ell_\infty/\ell_1$ sparse recovery and inner product estimation are equivalent, and that incoherent matrices can be used to solve both problems. Read More

We give lower bounds for the problem of stable sparse recovery from *adaptive* linear measurements. In this problem, one would like to estimate a vector $x \in \mathbb{R}^n$ from $m$ linear measurements $A_1x,.. Read More

We resolve several fundamental questions in the area of distributed functional monitoring, initiated by Cormode, Muthukrishnan, and Yi (SODA, 2008). In this model there are $k$ sites, each tracking its input and communicating with a central coordinator that continuously maintains an approximate output of a function $f$ computed over the union of the inputs. The goal is to minimize the communication. Read More

The problem central to sparse recovery and compressive sensing is that of stable sparse recovery: we want a distribution of matrices $A \in \mathbb{R}^{m \times n}$ such that, for any $x \in \mathbb{R}^n$ and with probability at least $2/3$ over $A$, there is an algorithm to recover $x^*$ from $Ax$ with $\|x^* - x\|_p \leq C \min_{k\text{-sparse } x'} \|x - x'\|_p$ for some constant $C > 1$ and norm $p$. The measurement complexity of this problem is well understood for constant $C > 1$. However, in a variety of applications it is important to obtain $C = 1 + \epsilon$ for a small $\epsilon > 0$, and this complexity is not well understood. Read More

The goal of (stable) sparse recovery is to recover a $k$-sparse approximation $x^*$ of a vector $x$ from linear measurements of $x$. Specifically, the goal is to recover $x^*$ such that $\|x-x^*\|_p \leq C \min_{k\text{-sparse } x'} \|x-x'\|_q$ for some constant $C$ and norm parameters $p$ and $q$. It is known that, for $p=q=1$ or $p=q=2$, this task can be accomplished using $m=O(k \log (n/k))$ non-adaptive measurements [CRT06] and that this bound is tight [DIPW10,FPRU10,PW11]. Read More

The statistical leverage scores of a matrix $A$ are the squared row-norms of the matrix containing its (top) left singular vectors and the coherence is the largest leverage score. These quantities are of interest in recently-popular problems such as matrix completion and Nyström-based low-rank matrix approximation as well as in large-scale statistical data analysis applications more generally; moreover, they are of interest since they define the key structural nonuniformity that must be dealt with in developing fast randomized matrix algorithms. Our main result is a randomized algorithm that takes as input an arbitrary $n \times d$ matrix $A$, with $n \gg d$, and that returns as output relative-error approximations to all $n$ of the statistical leverage scores. Read More

We consider the following $k$-sparse recovery problem: design an $m \times n$ matrix $A$, such that for any signal $x$, given $Ax$ we can efficiently recover $x'$ satisfying $\|x-x'\|_1 \leq C \min_{k\text{-sparse } x''} \|x-x''\|_1$. It is known that there exist matrices $A$ with this property that have only $O(k \log (n/k))$ rows. In this paper we show that this bound is tight. Read More
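
The right-hand side of the recovery guarantee has a simple closed form: the best $k$-sparse approximation keeps the $k$ largest-magnitude coordinates, so the minimum $\ell_1$ error is the sum of the remaining magnitudes. The sketch below evaluates the guarantee for a given candidate; it does not perform recovery from $Ax$, and the function names and example vectors are hypothetical.

```python
def best_k_sparse_l1_error(x, k):
    """min over k-sparse z of ||x - z||_1: the sum of all but the k largest |x_i|."""
    magnitudes = sorted((abs(v) for v in x), reverse=True)
    return sum(magnitudes[k:])


def satisfies_l1_guarantee(x, x_rec, k, C):
    """Check ||x - x_rec||_1 <= C * min over k-sparse z of ||x - z||_1."""
    error = sum(abs(a - b) for a, b in zip(x, x_rec))
    return error <= C * best_k_sparse_l1_error(x, k)


# Hypothetical signal: two large coordinates plus a small tail of total mass 1.
x = [10.0, -0.5, 0.25, 7.0, 0.25]
tail = best_k_sparse_l1_error(x, k=2)

exact_top2 = [10.0, 0.0, 0.0, 7.0, 0.0]  # l1 error 1.0
perturbed = [9.0, 0.0, 0.0, 7.0, 0.0]    # l1 error 2.0

ok_exact = satisfies_l1_guarantee(x, exact_top2, k=2, C=1.0)
ok_perturbed_c2 = satisfies_l1_guarantee(x, perturbed, k=2, C=2.0)
ok_perturbed_c15 = satisfies_l1_guarantee(x, perturbed, k=2, C=1.5)
```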

Given a directed graph $G$ and an integer $k \geq 1$, a $k$-transitive-closure-spanner ($k$-TC-spanner) of $G$ is a directed graph $H$ that has (1) the same transitive closure as $G$ and (2) diameter at most $k$. In some applications, the shortcut paths added to the graph in order to obtain small diameter can use Steiner vertices, that is, vertices not in the original graph $G$. The resulting spanner is called a Steiner transitive-closure spanner (Steiner TC-spanner). Read More