Block CUR : Decomposing Large Distributed Matrices

A common problem in large-scale data analysis is to approximate a matrix using a combination of specifically sampled rows and columns, known as CUR decomposition. Unfortunately, in many real-world environments, the ability to sample specific individual rows or columns of the matrix is limited by either system constraints or cost. In this paper, we consider matrix approximation by sampling predefined blocks of columns (or rows) from the matrix. This regime is commonly found when data is distributed across multiple nodes in a compute cluster, where such blocks correspond to columns (or rows) of the matrix stored on the same node, which can be retrieved with much less overhead than retrieving individual columns stored across different nodes. We propose a novel algorithm for sampling useful column blocks and provide guarantees for the quality of the approximation. We demonstrate the practical utility of this algorithm for computing the block CUR decomposition of large matrices in a distributed setting using Apache Spark. Using our proposed block CUR algorithms, we can achieve a significant speed-up compared to a regular CUR decomposition with the same quality of approximation.

Similar Publications

We consider the problem of estimating a consensus community structure by combining information from multiple layers of a multi-layer network or multiple snapshots of a time-varying network. Numerous methods have been proposed in the literature for the more general problem of multi-view clustering in the past decade based on the spectral clustering or a low-rank matrix factorization. As a general theme, these "intermediate fusion" methods involve obtaining a low column rank matrix by optimizing an objective function and then using the columns of the matrix for clustering. Read More

We propose a novel optimization approach for learning a low-rank matrix which is also constrained to be in a given linear subspace. Low-rank constraints are regularly employed in applications such as recommender systems and multi-task learning. In addition, several system identification problems require a learning matrix with both low-rank and linear subspace constraints. Read More

When tracking user-specific online activities, each user's preference is revealed in the form of choices and comparisons. For example, a user's purchase history tracks her choices, i.e. Read More

The scalable calculation of matrix determinants has been a bottleneck to the widespread application of many machine learning methods such as determinantal point processes, Gaussian processes, generalised Markov random fields, graph models and many others. In this work, we estimate log determinants under the framework of maximum entropy, given information in the form of moment constraints from stochastic trace estimation. The estimates demonstrate a significant improvement on state-of-the-art alternative methods, as shown on a wide variety of UFL sparse matrices. Read More

We present here a new model and algorithm which performs an efficient Natural gradient descent for Multilayer Perceptrons. Natural gradient descent was originally proposed from a point of view of information geometry, and it performs the steepest descent updates on manifolds in a Riemannian space. In particular, we extend an approach taken by the "Whitened neural networks" model. Read More

Global constraints and reranking have not been used in cognates detection research to date. We propose methods for using global constraints by performing rescoring of the score matrices produced by state of the art cognates detection systems. Using global constraints to perform rescoring is complementary to state of the art methods for performing cognates detection and results in significant performance improvements beyond current state of the art performance on publicly available datasets with different language pairs and various conditions such as different levels of baseline state of the art performance and different data size conditions, including with more realistic large data size conditions than have been evaluated with in the past. Read More

Current statistical inference problems in areas like astronomy, genomics, and marketing routinely involve the simultaneous testing of thousands -- even millions -- of null hypotheses. For high-dimensional multivariate distributions, these hypotheses may concern a wide range of parameters, with complex and unknown dependence structures among variables. In analyzing such hypothesis testing procedures, gains in efficiency and power can be achieved by performing variable reduction on the set of hypotheses prior to testing. Read More

Variable clustering is one of the most important unsupervised learning methods, ubiquitous in most research areas. In the statistics and computer science literature, most of the clustering methods lead to non-overlapping partitions of the variables. However, in many applications, some variables may belong to multiple groups, yielding clusters with overlap. Read More

Though the deep learning is pushing the machine learning to a new stage, basic theories of machine learning are still limited. The principle of learning, the role of the a prior knowledge, the role of neuron bias, and the basis for choosing neural transfer function and cost function, etc., are still far from clear. Read More

Matrix completion models are among the most common formulations of recommender systems. Recent works have showed a boost of performance of these techniques when introducing the pairwise relationships between users/items in the form of graphs, and imposing smoothness priors on these graphs. However, such techniques do not fully exploit the local stationarity structures of user/item graphs, and the number of parameters to learn is linear w. Read More