Constructing a Consensus Phylogeny from a Leaf-Removal Distance

Understanding the evolution of a set of genes or species is a fundamental problem in evolutionary biology. The problem we study here takes as input a set of trees describing {possibly discordant} evolutionary scenarios for a given set of genes or species, and aims at finding a single tree that minimizes the leaf-removal distance to the input trees. This problem is a specific instance of the general consensus/supertree problem, widely used to combine or summarize discordant evolutionary trees. The problem we introduce is specifically tailored to address the case of discrepancies between the input trees due to the misplacement of individual taxa. Most supertree or consensus tree problems are computationally intractable, and we show that the problem we introduce is also NP-hard. We provide tractability results in form of a 2-approximation algorithm and a parameterized algorithm with respect to the number of removed leaves. We also introduce a variant that minimizes the maximum number $d$ of leaves that are removed from any input tree, and provide a parameterized algorithm for this problem with parameter $d$.

Similar Publications

Let $G$ be a graph such that each vertex has its list of available colors, and assume that each list is a subset of the common set consisting of $k$ colors. For two given list colorings of $G$, we study the problem of transforming one into the other by changing only one vertex color assignment at a time, while at all times maintaining a list coloring. This problem is known to be PSPACE-complete even for bounded bandwidth graphs and a fixed constant $k$. Read More

The widely-studied radio network model [Chlamtac and Kutten, 1985] is a graph-based description that captures the inherent impact of collisions in wireless communication. In this model, the strong assumption is made that node $v$ receives a message from a neighbor if and only if exactly one of its neighbors broadcasts. We relax this assumption by introducing a new noisy radio network model in which random faults occur at senders or receivers. Read More

The Arrow protocol is a simple and elegant protocol to coordinate exclusive access to a shared object in a network. The protocol solves the underlying distributed queueing problem by using path reversal on a pre-computed spanning tree (or any other tree topology simulated on top of the given network). It is known that the Arrow protocol solves the problem with a competitive ratio of O(log D) on trees of diameter D. Read More

Longest Common Subsequence ($LCS$) deals with the problem of measuring similarity of two strings. While this problem has been analyzed for decades, the recent interest stems from a practical observation that considering single characters is often too simplistic. Therefore, recent works introduce the variants of $LCS$ based on shared substrings of length exactly or at least $k$ ($LCS_{k}$ and $LCS_{k+}$ respectively). Read More

In many industrial applications of big data, the Jaccard Similarity Computation has been widely used to measure the distance between two profiles or sets respectively owned by two users. Yet, one semi-honest user with unpredictable knowledge may also deduce the private or sensitive information (e.g. Read More

Recently, there has been substantial interest in clustering research that takes a beyond worst-case approach to the analysis of algorithms. The typical idea is to design a clustering algorithm that outputs a near-optimal solution, provided the data satisfy a natural stability notion. For example, Bilu and Linial (2010) and Awasthi et al. Read More

In this work, we present CacheShuffle, an algorithm for obliviously shuffling data on an untrusted server. Our methods improve the previously best known results by reducing the number of block accesses to (4 + \epsilon)N which is close to the lower bound of 2N. Experimentation shows that for practical sizes of N, there is a 4x improvement over the previous best result. Read More

In Packet Scheduling with Adversarial Jamming packets of arbitrary sizes arrive over time to be transmitted over a channel in which instantaneous jamming errors occur at times chosen by the adversary and not known to the algorithm. The transmission taking place at the time of jamming is corrupt, and the algorithm learns this fact immediately. An online algorithm maximizes the total size of packets it successfully transmits and the goal is to develop an algorithm with the lowest possible asymptotic competitive ratio, where the additive constant may depend on packet sizes. Read More

Estimating frequencies of items over data streams is a common building block in streaming data measurement and analysis. Misra and Gries introduced their seminal algorithm for the problem in 1982, and the problem has since been revisited many times due its practicality and applicability. We describe a highly optimized version of Misra and Gries' algorithm that is suitable for deployment in industrial settings. Read More

Indexing very large collections of strings, such as those produced by the widespread next generation sequencing technologies, heavily relies on multi-string generalization of the Burrows-Wheeler Transform (BWT): recent developments in this field have resulted in external memory algorithms, motivated by the large requirements of in-memory approaches. The related problem of computing the Longest Common Prefix (LCP) array of a set of strings is often instrumental in several algorithms: for example, to compute the suffix-prefix overlaps among strings, which is an essential step for many genome assembly algorithms. In this paper we propose a new external memory method to simultaneously build the BWT and the LCP array on a collection of $m$ strings of length $k$ with $O(mkl)$ time and I/O complexity, using $O(k + m)$ main memory, where $l$ is the maximum value in the LCP array. Read More