Anish Das Sarma - Stanford University

Contact Details

Anish Das Sarma
Stanford University
United States

Pub Categories

Computer Science - Databases (9)
Computer Science - Data Structures and Algorithms (4)
Computer Science - Distributed, Parallel, and Cluster Computing (2)
Computer Science - Learning (1)
Computer Science - Computer Science and Game Theory (1)

Publications Authored By Anish Das Sarma

Many applications rely on Web data and extraction systems to accomplish knowledge-driven tasks. Web information is not curated, so many sources provide inaccurate or conflicting information. Moreover, extraction systems introduce additional noise to the data.

In this paper we study the tradeoff between parallelism and communication cost in a map-reduce computation. For any problem that is not "embarrassingly parallel," the finer we partition the work of the reducers so that more parallelism can be extracted, the greater will be the total communication between mappers and reducers. We introduce a model of problems that can be solved in a single round of map-reduce computation.
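The tradeoff described above can be illustrated with a small sketch. It is not the model from the paper: it assumes a hypothetical all-pairs comparison task, splits the items into `n_groups` groups, and creates one reducer per unordered pair of groups. The function names are made up for this illustration.

```python
from itertools import combinations


def assign_to_reducers(items, n_groups):
    """Hypothetical scheme: one reducer per unordered pair of groups."""
    if n_groups < 2:
        raise ValueError("need at least two groups")
    # Round-robin split of the items into n_groups groups.
    groups = [items[i::n_groups] for i in range(n_groups)]
    # Reducer (a, b) receives every item of group a and of group b.
    return {(a, b): groups[a] + groups[b]
            for a, b in combinations(range(n_groups), 2)}


def cost(plan):
    """Summarize parallelism and communication of a reducer plan."""
    sizes = [len(inputs) for inputs in plan.values()]
    return {
        "reducers": len(sizes),           # available parallelism
        "max_reducer_input": max(sizes),  # work per reducer
        "communication": sum(sizes),      # items sent mapper -> reducer
    }


def covers_all_pairs(items, plan):
    """Check that every pair of items meets at some reducer."""
    seen = set()
    for inputs in plan.values():
        seen.update(combinations(sorted(inputs), 2))
    return seen == set(combinations(sorted(items), 2))


items = list(range(12))
coarse = cost(assign_to_reducers(items, 2))
fine = cost(assign_to_reducers(items, 4))
```

With 12 items, two groups give a single reducer that receives all 12 items, while four groups give six reducers of 6 items each at a total communication of 36 items: smaller reducers and more parallelism, paid for with three times the communication.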

A significant amount of recent research work has addressed the problem of solving various data management problems in the cloud. The major algorithmic challenges in map-reduce computations involve balancing a multitude of factors such as the number of machines available for mappers/reducers, their memory requirements, and communication cost (total amount of data sent from mappers to reducers). Most past work provides custom solutions to specific problems, e.

Given a large graph G = (V,E) with millions of nodes and edges, how do we compute its connected components efficiently? Recent work addresses this problem in map-reduce, where a fundamental trade-off exists between the number of map-reduce rounds and the communication of each round. Denoting by d the diameter of the graph, and by n the number of nodes in the largest component, all prior map-reduce techniques either require d rounds, or require about n|V| + |E| communication per round. We propose two randomized map-reduce algorithms: (i) Hash-Greater-To-Min, which provably requires at most 3log(n) rounds with high probability, and at most 2(|V| + |E|) communication per round, and (ii) Hash-to-Min, which has a worse theoretical complexity, but in practice completes in at most 2log(d) rounds and 3(|V| + |E|) communication per round.
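For context, the d-round behavior that the abstract attributes to prior techniques can be seen in a simple min-label propagation baseline. The sketch below is a single-machine simulation of that baseline only; it is not Hash-Greater-To-Min or Hash-to-Min, whose details are not given here, and the function name is an assumption.

```python
def connected_components_min_label(nodes, edges):
    """Baseline: every node repeatedly adopts the smallest label it sees.

    Returns (label per node, number of rounds in which a label changed).
    """
    label = {v: v for v in nodes}
    rounds = 0
    while True:
        # "Map" step: each edge offers each endpoint's label to the other.
        proposals = {v: [label[v]] for v in nodes}
        for u, v in edges:
            proposals[u].append(label[v])
            proposals[v].append(label[u])
        # "Reduce" step: each node keeps the smallest label offered.
        new_label = {v: min(offered) for v, offered in proposals.items()}
        if new_label == label:
            return label, rounds
        label = new_label
        rounds += 1


# A 5-node path (diameter 4) plus a separate 2-node component.
nodes = [1, 2, 3, 4, 5, 6, 7]
edges = [(1, 2), (2, 3), (3, 4), (4, 5), (6, 7)]
labels, rounds = connected_components_min_label(nodes, edges)
```

On this input the smallest label has to travel the full length of the path, so the labels keep changing for 4 rounds, matching the diameter of the largest component.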

Knowledge bases of entities and relations (either constructed manually or automatically) are behind many real-world search engines, including those at Yahoo!, Microsoft, and Google. Those knowledge bases can be viewed as graphs with nodes representing entities and edges representing (primary) relationships, and various studies have been conducted on how to leverage them to answer entity-seeking queries. Meanwhile, in a complementary direction, analyses over the query logs have enabled researchers to identify entity pairs that are statistically correlated.

De-duplication, the identification of distinct records referring to the same real-world entity, is a well-known challenge in data integration. Since very large datasets prohibit the comparison of every pair of records, blocking has been identified as a technique of dividing the dataset for pairwise comparisons, thereby trading off recall of identified duplicates for efficiency. Traditional de-duplication tasks, while challenging, typically involved a fixed schema such as Census data or medical records.
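A minimal sketch of blocking, under assumptions: the records, the `surname` field, and the three-letter blocking key are all invented for illustration and are not taken from the paper. Records are grouped by key, and only pairs inside the same block are compared.

```python
from collections import defaultdict
from itertools import combinations


def block_by_key(records, key_fn):
    """Group record ids by a blocking key (key_fn is a hypothetical choice)."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[key_fn(rec)].append(rec_id)
    return blocks


def candidate_pairs(blocks):
    """Only records that share a block are compared with each other."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Toy data: suppose records 1, 2 and 3 all describe the same person.
records = {
    1: {"name": "John", "surname": "Smith"},
    2: {"name": "Jon", "surname": "Smith"},
    3: {"name": "John", "surname": "Smyth"},
    4: {"name": "Jane", "surname": "Doe"},
}
blocks = block_by_key(records, lambda rec: rec["surname"][:3].lower())
pairs = candidate_pairs(blocks)
all_pairs = set(combinations(sorted(records), 2))
```

Here blocking cuts the six possible comparisons down to one, the pair (1, 2). The misspelled record 3 lands in a different block, so the duplicate pairs involving it are never compared: this is the recall given up in exchange for efficiency.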

Affiliations: 1Stanford University, 2Yahoo! Research, 3Stanford University, 4UC Santa Cruz, 5Stanford University

We consider the problem of human-assisted graph search: given a directed acyclic graph with some (unknown) target node(s), find the target node(s) by asking an omniscient human questions of the form "Is there a target node that is reachable from the current node?". This general problem has applications in many domains that can utilize human intelligence, including curation of hierarchies, debugging workflows, image segmentation and categorization, interactive search and filter synthesis. To our knowledge, this work provides the first formal algorithmic study of the optimization of human computation for this problem.
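The question format above can be made concrete with a naive top-down strategy. This is a sketch under stated assumptions, not the optimized algorithms studied in the paper: it assumes a single target, a known root from which the target is reachable, and a human simulated by a reachability check. The example hierarchy and all function names are hypothetical.

```python
def reaches(dag, start, target):
    """Simulated human answer: is target reachable from start (inclusive)?"""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(dag.get(node, []))
    return False


def find_target(dag, root, ask):
    """Walk down from root, asking one reachability question per child."""
    current, questions = root, 0
    while True:
        for child in dag.get(current, []):
            questions += 1
            if ask(child):
                current = child  # the target lies at or below this child
                break
        else:
            # No child reaches the target, so the current node is the target.
            return current, questions


# Hypothetical category hierarchy; the hidden target is "suv".
dag = {
    "vehicle": ["car", "bike"],
    "car": ["sedan", "suv"],
    "bike": [],
    "sedan": [],
    "suv": [],
}
found, asked = find_target(dag, "vehicle", lambda node: reaches(dag, node, "suv"))
```

In this run the strategy asks about "car" (yes), "sedan" (no), and "suv" (yes), then stops at "suv" because it has no children. The number of questions depends on the order in which children are tried, which is the kind of cost an optimized strategy would aim to reduce.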

We present a formal model for studying fashion trends, in terms of three parameters of fashionable items: (1) their innate utility; (2) individual boredom associated with repeated usage of an item; and (3) social influences associated with the preferences of other people. While there are several works that emphasize the effect of social influence in understanding fashion trends, in this paper we show how boredom plays a strong role in both individual and social choices. We show how boredom can be used to explain the cyclic choices in several scenarios such as an individual who has to pick a restaurant to visit every day, or a society that has to repeatedly 'vote' on a single fashion style from a collection.

Complex information extraction (IE) pipelines assembled by plumbing together off-the-shelf operators, specially customized operators, and operators re-used from other text processing pipelines are becoming an integral component of most text processing frameworks. A critical task faced by the IE pipeline user is to run a post-mortem analysis on the output. Due to the diverse nature of extraction operators (often implemented by independent groups), it is time-consuming and error-prone to describe operator semantics formally or operationally to a provenance system.

In this paper, we identify a fundamental algorithmic problem that we term succinct dynamic covering (SDC), arising in many modern-day web applications, including ad-serving and online recommendation systems at eBay and Netflix. Roughly speaking, SDC applies two restrictions to the well-studied Max-Coverage problem: Given an integer k, X={1,2,..

Affiliations: 1Universite de Rennes 1, 2Stanford University, 3Luna, 4AT&T Labs-Research, 5Rutgers University, 6AT&T Labs-Research

The Web has made a huge amount of useful information available, but it has also made it easier to spread false information and rumors across multiple sources, making it hard to distinguish between what is true and what is not. Recent examples include the premature Steve Jobs obituary, the second bankruptcy of United Airlines, the creation of black holes by the operation of the Large Hadron Collider, etc. Since it is important to permit the expression of dissenting and conflicting opinions, it would be a fallacy to try to ensure that the Web provides only consistent information.