Computer Science - Data Structures and Algorithms Publications (50)


Computer Science - Data Structures and Algorithms Publications

We study the problem of estimating the maximum matching size in graphs whose edges are revealed in a streaming manner. We consider both insertion-only streams and dynamic streams and present new upper and lower bound results for both models. On the upper bound front, we show that an $\alpha$-approximate estimate of the matching size can be computed in dynamic streams using $\widetilde{O}({n^2/\alpha^4})$ space, and in insertion-only streams using $\widetilde{O}(n/\alpha^2)$-space. Read More

A sketch is a probabilistic data structure that is used to record frequencies of items in a multi-set. Various types of sketches have been proposed in literature and applied in a variety of fields, such as data stream processing, natural language processing, distributed data sets etc. While several variants of sketches have been proposed in the past, existing sketches still have a significant room for improvement in terms of accuracy. Read More

Monitoring the traffic volumes of elephant flows, including the total byte count per flow, is a fundamental capability for online network measurements. We present an asymptotically optimal algorithm for solving this problem in terms of both space and time complexity. This improves on previous approaches, which can only count the number of packets in constant time. Read More

How many quantum queries are required to determine the coefficients of a degree-$d$ polynomial in $n$ variables? We present and analyze quantum algorithms for this multivariate polynomial interpolation problem over the fields $\mathbb{F}_q$, $\mathbb{R}$, and $\mathbb{C}$. We show that $k_{\mathbb{C}}$ and $2k_{\mathbb{C}}$ queries suffice to achieve probability $1$ for $\mathbb{C}$ and $\mathbb{R}$, respectively, where $k_{\mathbb{C}}=\smash{\lceil\frac{1}{n+1}{n+d\choose d}\rceil}$ except for $d=2$ and four other special cases. For $\mathbb{F}_q$, we show that $\smash{\lceil\frac{d}{n+d}{n+d\choose d}\rceil}$ queries suffice to achieve probability approaching $1$ for large field order $q$. Read More

Mahlmann and Schindelhauer (2005) defined a Markov chain which they called $k$-Flipper, and showed that it is irreducible on the set of all connected regular graphs of a given degree (at least 3). We study the 1-Flipper chain, which we call the flip chain, and prove that the flip chain converges rapidly to the uniform distribution over connected $2r$-regular graphs with $n$ vertices, where $n\geq 8$ and $r = r(n)\geq 2$. Formally, we prove that the distribution of the flip chain will be within $\varepsilon$ of uniform in total variation distance after $\text{poly}(n,r,\log(\varepsilon^{-1}))$ steps. Read More

We present methods for k-means clustering on a stream with a focus on providing fast responses to clustering queries. When compared with the current state-of-the-art, our methods provide a substantial improvement in the time to answer a query for cluster centers, while retaining the desirable properties of provably small approximation error, and low space usage. Our algorithms are based on a novel idea of "coreset caching" that reuses coresets (summaries of data) computed for recent queries in answering the current clustering query. Read More

Super-resolution imaging aims at improving the resolution of an image by enhancing it with other images or data that might have been acquired using different imaging techniques or modalities. In this paper we consider the task of doubling the resolution of tomographic grayscale images of binary objects by fusion with double-resolution tomographic data that has been acquired from two viewing angles. We show that this task is polynomial-time solvable if the gray levels have been reliably determined. Read More

In a recent breakthrough, Paz and Schwartzman [SODA'17] presented a single-pass ($2+\epsilon$)-approximation algorithm for the maximum weight matching problem in the semi-streaming model. Their algorithm uses $O(n\log^2 n)$ bits of space, for any constant $\epsilon>0$. In this note, we present a different analysis, for essentially the same algorithm, that improves the space complexity to the optimal bound of $O(n\log n)$ bits, while also providing a more intuitive explanation of the process. Read More

We study the problem of computing the \textsc{Maxima} of a set of $n$ $d$-dimensional points. For dimensions 2 and 3, there are algorithms to solve the problem with order-oblivious instance-optimal running time. However, in higher dimensions there is still room for improvements. Read More

Sums of independent, bounded random variables concentrate around their expectation approximately as well a Gaussian of the same variance. Well known results of this form include the Bernstein, Hoeffding, and Chernoff inequalities and many others. We present an alternative proof of these tail bounds based on what we call a stability argument, which avoids bounding the moment generating function or higher-order moments of the distribution. Read More

A common method to define a parallel solution for a computational problem consists in finding a way to use the Divide and Conquer paradigm in order to have processors acting on its own data and scheduled in a parallel fashion. MapReduce is a programming model that follows this paradigm, and allows for the definition of efficient solutions by both decomposing a problem into steps on subsets of the input data and combining the results of each step to produce final results. Albeit used for the implementation of a wide variety of computational problems, MapReduce performance can be negatively affected whenever the replication factor grows or the size of the input is larger than the resources available at each processor. Read More

In this paper, we address the problem of sampling from a set and reconstructing a set stored as a Bloom filter. To the best of our knowledge our work is the first to address this question. We introduce a novel hierarchical data structure called BloomSampleTree that helps us design efficient algorithms to extract an almost uniform sample from the set stored in a Bloom filter and also allows us to reconstruct the set efficiently. Read More

In the classical problem of scheduling on unrelated parallel machines, a set of jobs has to be assigned to a set of machines. The jobs have a processing time depending on the machine and the goal is to minimize the makespan, that is the maximum machine load. It is well known that this problem is NP-hard and does not allow polynomial time approximation algorithms with approximation guarantees smaller than $1. Read More

Given $N$ cities and $R < N^2 - N$ directed (unidirectional/one way) roads does there exist a tour of all $N$ cities stopping at each city exactly once using the given roads (a Hamiltonian cycle)? This Hamiltonian cycle problem (HCP) is an NP-complete problem, for which there is no known polynomial time solution algorithm. The HCP has important practical applications, for example, to logistical problems. It was claimed that an adiabatic quantum computer could solve an NP-complete problem faster than classical algorithms, but claim appears to have been debunked. Read More

We revisit the linear search problem where a robot, initially placed at the origin on an infinite line, tries to locate a stationary target placed at an unknown position on the line. Unlike previous studies, in which the robot travels along the line at a constant speed, we consider settings where the robot's speed can depend on the direction of travel along the line, or on the profile of the terrain, e.g. Read More

We present PFDCMSS, a novel message-passing based parallel algorithm for mining time-faded heavy hitters. The algorithm is a parallel version of the recently published FDCMSS sequential algorithm. We formally prove its correctness by showing that the underlying data structure, a sketch augmented with a Space Saving stream summary holding exactly two counters, is mergeable. Read More

We present a general technique for approximating bicriteria minimization problems with positive-valued, polynomially computable objective functions. Given $0<\epsilon\leq1$ and a polynomial-time $\alpha$-approximation algorithm for the corresponding weighted sum problem, we show how to obtain a bicriteria $(\alpha\cdot(1+2\epsilon),\alpha\cdot(1+\frac{2}{\epsilon}))$-approximation algorithm for the budget-constrained problem whose running time is polynomial in the encoding length of the input and linear in $\frac{1}{\epsilon}$. Moreover, we show that our method can be extended to compute an $(\alpha\cdot(1+2\epsilon),\alpha\cdot(1+\frac{2}{\epsilon}))$-approximate Pareto curve under the same assumptions. Read More

In the Survivable Network Design Problem (SNDP), the input is an edge-weighted (di)graph $G$ and an integer $r_{uv}$ for every pair of vertices $u,v\in V(G)$. The objective is to construct a subgraph $H$ of minimum weight which contains $r_{uv}$ edge-disjoint (or node-disjoint) $u$-$v$ paths. This is a fundamental problem in combinatorial optimization that captures numerous well-studied problems in graph theory and graph algorithms. Read More

Choi et. al (2011) introduced a minimum spanning tree (MST)-based method called CLGrouping, for constructing tree-structured probabilistic graphical models, a statistical framework that is commonly used for inferring phylogenetic trees. While CLGrouping works correctly if there is a unique MST, we observe an indeterminacy in the method in the case that there are multiple MSTs. Read More

Two strings x and y are said to be Abelian equivalent if x is a permutation of y, or vice versa. If a string z satisfies z = xy with x and y being Abelian equivalent, then z is said to be an Abelian square. If a string w can be factorized into a sequence v_1,. Read More

Let $M$ be a real $r\times c$ matrix and let $k$ be a positive integer. In the column subset selection problem (CSSP), we need to minimize the quantity $\|M-SA\|$, where $A$ can be an arbitrary $k\times c$ matrix, and $S$ runs over all $r\times k$ submatrices of $M$. This problem and its applications in numerical linear algebra are being discussed for several decades, but its algorithmic complexity remained an open issue. Read More

Social media users and microbloggers post about a wide variety of (off-line) collective social activities as they participate in them, ranging from concerts and sporting events to political rallies and civil protests. In this context, people who take part in the same collective social activity often post closely related content from nearby locations at similar times, resulting in distinctive spatiotemporal patterns. Can we automatically detect these patterns and thus provide insights into the associated activities? In this paper, we propose a modeling framework for clustering streaming spatiotemporal data, the Spatial Dirichlet Hawkes Process (SDHP), which allows us to automatically uncover a wide variety of spatiotemporal patterns of collective social activity from geolocated online traces. Read More

In parallel computing, a valid graph coloring yields a lock-free processing of the colored tasks, data points, etc., without expensive synchronization mechanisms. However, coloring is not free and the overhead can be significant. Read More

The Cell Formation Problem has been studied as an optimization problem in manufacturing for more than 90 years. It consists of grouping machines and parts into manufacturing cells in order to maximize loading of cells and minimize movement of parts from one cell to another. Many heuristic algorithms have been proposed which are doing well even for large-sized instances. Read More

We consider the problem of finding a homomorphism from an input digraph G to a fixed digraph H. We show that if H admits a weak-near-unanimity polymorphism $\phi$ then deciding whether G admits a homomorphism to H (HOM(H)) is polynomial time solvable. This confirms the conjecture of Bulatov, Jeavons, and Krokhin, in the form postulated by Maroti and McKenzie, and consequently implies the validity of the celebrated dichotomy conjecture due to Feder and Vardi. Read More

We present a new algorithm which detects the maximal number of matched disjoint pairs satisfying a given caliper when the matching is done with respect to a scalar index (e.g., propensity score), and constructs a corresponding matching. Read More

We consider a multi-level aggregation problem in a weighted rooted tree, studied recently by Bienkowski et al. (2015). In this problem requests arrive over time at the nodes of the tree, and each request specifies a deadline. Read More

Matrix multiplicative weight update (MMWU) is an extremely powerful algorithmic tool for computer science and related fields. However, it comes with a slow running time due to the matrix exponential and eigendecomposition computations. For this reason, many researchers studied the followed-the-perturbed-leader (FTPL) framework which is faster, but a factor $\sqrt{d}$ worse than the optimal regret of MMWU for dimension-$d$ matrices. Read More

Evaluating influence spread in social networks is a fundamental procedure to estimate the word-of-mouth effect in viral marketing. There are enormous studies about this topic; however, under the standard stochastic cascade models, the exact computation of influence spread is known to be #P-hard. Thus, the existing studies have used Monte-Carlo simulation-based approximations to avoid exact computation. Read More

In data centers, data replication is the primary method used to ensure availability of customer data. To avoid correlated failure, cloud storage infrastructure providers model hierarchical failure domains using a tree, and avoid placing a large number of data replicas within the same failure domain (i.e. Read More

In recent years, several convex programming relaxations have been proposed to estimate the permanent of a non-negative matrix, notably in the works of Gurvits and Samorodnitsky. However, the origins of these relaxations and their relationships to each other have remained somewhat mysterious. We present a conceptual framework, implicit in the belief propagation literature, to systematically arrive at these convex programming relaxations for estimating the permanent -- as approximations to an exponential-sized max-entropy convex program for computing the permanent. Read More

Classical spectral clustering is based on a spectral decomposition of a graph Laplacian, obtained from a graph adjacency matrix representing positive graph edge weights describing similarities of graph vertices. In signed graphs, the graph edge weights can be negative to describe disparities of graph vertices, for example, negative correlations in the data. Negative weights lead to possible negative spectrum of the standard graph Laplacian, which is cured by defining a signed Laplacian. Read More

Spectral based heuristics belong to well-known commonly used methods for finding a minimum-size bisection in a graph. The heuristics are usually easy to implement and they work well for several practice-relevant classes of graphs. However, only a few research efforts are focused on providing rigorous analysis of such heuristics and often they lack of proven optimality or approximation quality. Read More

Data is continuously generated by modern data sources, and a recent challenge in machine learning has been to develop techniques that perform well in an incremental (streaming) setting. In this paper, we investigate the problem of private machine learning, where as common in practice, the data is not given at once, but rather arrives incrementally over time. We introduce the problems of private incremental ERM and private incremental regression where the general goal is to always maintain a good empirical risk minimizer for the history observed under differential privacy. Read More

Intruders detection in computer networks has some deficiencies from machine learning approach, given by the nature of the application. The principal problem is the modest display of detection systems based on learning algorithms under the constraints imposed by real environments. This article focuses on the machine learning approach for network intrusion detection in batch and data stream environments. Read More

In this paper we will present a general agglomeration law for sorting networks. Agglomeration is a common technique when designing parallel programmes to control the granularity of the computation thereby finding a better fit between the algorithm and the machine on which the algorithm runs. Usually this is done by grouping smaller tasks and computing them en bloc within one parallel process. Read More

The discrete Heisenberg group $\mathbb{H}_{\mathbb{Z}}^{2k+1}$ is the group generated by $a_1,b_1,\ldots,a_k,b_k,c$, subject to the relations $[a_1,b_1]=\ldots=[a_k,b_k]=c$ and $[a_i,a_j]=[b_i,b_j]=[a_i,b_j]=[a_i,c]=[b_i,c]=1$ for every distinct $i,j\in \{1,\ldots,k\}$. Denote $S=\{a_1^{\pm 1},b_1^{\pm 1},\ldots,a_k^{\pm 1},b_k^{\pm 1}\}$. The horizontal boundary of $\Omega\subset \mathbb{H}_{\mathbb{Z}}^{2k+1}$, denoted $\partial_{h}\Omega$, is the set of all $(x,y)\in \Omega\times (\mathbb{H}_{\mathbb{Z}}^{2k+1}\setminus \Omega)$ such that $x^{-1}y\in S$. Read More

In this paper, we first propose a novel generalized power iteration method (GPI) to solve the quadratic problem on the Stiefel manifold (QPSM) as min_{W^TW=I}Tr(W^TAW-2W^TB) along with the theoretical analysis. Accordingly, its special case known as the orthogonal least square regression (OLSR) is under further investigation. Based on the aforementioned studies, we then cast major focus on solving the unbalanced orthogonal procrustes problem (UOPP). Read More

We consider the three graph search algorithm LexDFS, LexUP and LexDOWN. We show that LexUP orderings can be computed in linear time by an algorithm similar to the one which compute LexBFS. Furthermore, LexDOWN orderings and LexDFS orderings can be computed in time $\left(n+m\log m\right)$ where $n$ is the number of vertices and $m$ the number of edges. Read More

Generating random variates from high-dimensional distributions is often done approximately using Markov chain Monte Carlo. In certain cases, perfect simulation algorithms exist that allow one to draw exactly from the stationary distribution, but most require $O(n \ln(n))$ time, where $n$ measures the size of the input. In this work a new protocol for creating perfect simulation algorithms that runs in $O(n)$ time for a wider range of parameters on several models (such as Strauss, Ising, and random cluster) than was known previously. Read More

We consider the non-adaptive bit-probe complexity of the set membership problem, where a set S of size at most n from a universe of size m is to be represented as a short bit vector in order to answer membership queries of the form "Is x in S?" by non-adaptively probing the bit vector at t places. Let s_N(m,n,t) be the minimum number of bits of storage needed for such a scheme. In this work, we show existence of non-adaptive and adaptive schemes for a range of t that improves an upper bound of Buhrman, Miltersen, Radhakrishnan and Srinivasan (2002) on s_N(m,n,t). Read More

This paper initiates the studies of parallel algorithms for core maintenance in dynamic graphs. The core number is a fundamental index reflecting the cohesiveness of a graph, which are widely used in large-scale graph analytics. The core maintenance problem requires to update the core numbers of vertices after a set of edges and vertices are inserted into or deleted from the graph. Read More

The online search problem is a fundamental problem in finance. The numerous direct applications include searching for optimal prices for commodity trading and trading foreign currencies. In this paper, we analyze the advice complexity of this problem. Read More

This paper concerns asynchrony in iterative processes, focusing on gradient descent and tatonnement, a fundamental price dynamic. Gradient descent is an important class of iterative algorithms for minimizing convex functions. Classically, gradient descent has been a sequential and synchronous process, although distributed and asynchronous variants have been studied since the 1980s. Read More

Comparison of two numbers in RNS systems is a challenging task. In this paper, a new algorithm to compare the magnitude of two RNS numbers, using a clustering method has been proposed. In the clustering process, each inputted number is assigned to a cluster. Read More

We consider a dynamical process in a network which distributes all tokens located at a node among its neighbors, in a round-robin manner. We show that in the recurrent state of this dynamics, the number of particles located on a given edge, averaged over an interval of time, is tightly concentrated around the average particle density in the system. Formally, for a system of $k$ particles in a graph of $m$ edges, during any interval of length $T$, this time-averaged value is $k/m \pm \widetilde{O}(1/T)$, whenever $\gcd(m,k) = \widetilde{O}(1)$ (and so, e. Read More

There are several data structures which can calculate the prefix sums of an array efficiently, while handling point updates on the array, such as Segment Trees and Binary Indexed Trees (BIT). Both these data structures can handle the these two operations (query and update) in $O(\log{n})$ time. In this paper, we present a data structure similar to the BIT, but with an even smaller constant. Read More

We give a quantum algorithm for finding a marked element on the grid when there are multiple marked elements. Our algorithm uses quadratically fewer steps than a random walk on the grid, ignoring logarithmic factors. This is the first known quantum walk that finds a marked element in a number of steps less than the square-root of the extended hitting time. Read More

A seminal result of Bulow and Klemperer [1989] demonstrates the power of competition for extracting revenue: when selling a single item to $n$ bidders whose values are drawn i.i.d. Read More

Many machine learning applications use latent variable models to explain structure in data, whereby visible variables (= coordinates of the given datapoint) are explained as a probabilistic function of some hidden variables. Finding parameters with the maximum likelihood is NP-hard even in very simple settings. In recent years, provably efficient algorithms were nevertheless developed for models with linear structures: topic models, mixture models, hidden markov models, etc. Read More