# Replicable Parallel Branch and Bound Search

Branch and bound searches are a common technique for solving global optimisation and decision problems, yet their irregularity, search order dependence, and the need to share bound information globally make it challenging to implement them in parallel and to reason about their parallel performance. We identify three key parallel search properties for replicable branch and bound implementations: Sequential Lower Bound, Non-increasing Runtimes, and Repeatability. We define a formal model for parallel branch and bound search problems and show its generality by using it to define three benchmarks: finding a Maximum Clique in a graph, 0/1 Knapsack, and Travelling Salesperson (TSP). We present a Generic Branch and Bound search API that conforms to the model. For reusability, we encapsulate the search behaviours as a pair of algorithmic skeletons in a distributed-memory parallel Haskell. Crucially, the Ordered skeleton is designed to guarantee the parallel search properties, potentially at a performance cost compared with the Unordered skeleton. We compare the sequential performance of the skeletons with a class-leading C++ search implementation. We then use relative speedups to evaluate the skeletons for 40 benchmark instances on a cluster using 200 workers. The Ordered skeleton preserves the Sequential Lower Bound for all benchmark instances, while the Unordered skeleton violates the property for 5 TSP instances. The Ordered skeleton preserves Non-increasing Runtimes for all benchmark instances, while the Unordered skeleton violates the property for many instances of all three benchmarks. The Ordered skeleton delivers far more repeatable performance than the Unordered skeleton (Repeatability property), with a median relative standard deviation (RSD) of 1.78% vs 5.56%, 1.83% vs 87.56%, and 1.96% vs 8.61% for all Maximum Clique, Knapsack, and TSP instances respectively.

**Comments:** 38 pages, 12 figures, submitted to the Journal of Parallel and Distributed Computing
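To make the abstract's terminology concrete, here is a minimal sequential sketch of branch and bound for the 0/1 Knapsack benchmark. It is an illustration only, not the paper's Generic Branch and Bound API or either of its Haskell skeletons; the function name `knapsack_bnb`, the fractional-relaxation bound, and the profit/weight search order are all assumptions chosen for the example. The variable `best` plays the role of the bound information that a parallel implementation would have to share between workers.

```python
from typing import List, Tuple


def knapsack_bnb(items: List[Tuple[int, int]], capacity: int) -> Tuple[int, int]:
    """Depth-first branch and bound for 0/1 Knapsack (illustrative sketch).

    items: (profit, weight) pairs with positive weights.
    Returns (best_profit, nodes_expanded).
    """
    # Assumed search-order heuristic: best profit/weight ratio first.
    order = sorted(items, key=lambda pw: pw[0] / pw[1], reverse=True)
    best = 0   # incumbent: the bound a parallel search would share globally
    nodes = 0  # number of search-tree nodes visited

    def upper_bound(i: int, profit: int, room: int) -> float:
        # Fractional relaxation: fill the remaining room greedily,
        # allowing a fraction of the first item that does not fit.
        bound = float(profit)
        for p, w in order[i:]:
            if w <= room:
                room -= w
                bound += p
            else:
                bound += p * room / w
                break
        return bound

    def search(i: int, profit: int, room: int) -> None:
        nonlocal best, nodes
        nodes += 1
        if profit > best:
            best = profit  # found a better incumbent
        # Prune when no items remain or the subtree cannot beat the incumbent.
        if i == len(order) or upper_bound(i, profit, room) <= best:
            return
        p, w = order[i]
        if w <= room:
            search(i + 1, profit + p, room - w)  # branch: take item i
        search(i + 1, profit, room)              # branch: skip item i

    search(0, 0, capacity)
    return best, nodes


if __name__ == "__main__":
    profit, visited = knapsack_bnb([(60, 10), (100, 20), (120, 30)], 50)
    print(profit, visited)
```

How many nodes get pruned depends on how early a good incumbent is found, which is why the order in which subtrees are explored matters. In a parallel search that order can change from run to run, and this is the source of the search order dependence the abstract describes.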

## Similar Publications

Consider a set of agents in a peer-to-peer communication network, where each agent has a personal dataset and a personal learning objective. The main question addressed in this paper is: how can agents collaborate to improve upon their locally learned model without leaking sensitive information about their data? Our first contribution is to reformulate this problem so that it can be solved by a block coordinate descent algorithm. We obtain an efficient and fully decentralized protocol working in an asynchronous fashion.

We analyze the caching overhead incurred by a class of multithreaded algorithms when scheduled by an arbitrary scheduler. We obtain bounds that match or improve upon the well-known $O(Q+S \cdot (M/B))$ caching cost for the randomized work stealing (RWS) scheduler, where $S$ is the number of steals, $Q$ is the sequential caching cost, and $M$ and $B$ are the cache size and block (or cache line) size respectively.

A common approach for designing scalable algorithms for massive data sets is to distribute the computation across, say $k$, machines and process the data using limited communication between them. A particularly appealing framework here is the simultaneous communication model whereby each machine constructs a small representative summary of its own data and one obtains an approximate/exact solution from the union of the representative summaries. If the representative summaries needed for a problem are small, then this results in a communication-efficient and round-optimal protocol.

The massive quantities of genomic data being made available through gene sequencing techniques are enabling breakthroughs in genomic science in many areas such as medical advances in the diagnosis and treatment of diseases. Analyzing this data, however, is a computational challenge insofar as the computational costs of the relevant algorithms can grow with quadratic, cubic or higher complexity, leading to the need for leadership scale computing. In this paper we describe a new approach to calculations of the Custom Correlation Coefficient (CCC) between Single Nucleotide Polymorphisms (SNPs) across a population, suitable for parallel systems equipped with graphics processing units (GPUs) or Intel Xeon Phi processors.

The surge in availability of genomic data holds promise for enabling determination of genetic causes of observed individual traits, with applications to problems such as discovery of the genetic roots of phenotypes, be they molecular phenotypes such as gene expression or metabolite concentrations, or complex phenotypes such as diseases. However, the growing sizes of these datasets and the quadratic, cubic or higher scaling characteristics of the relevant algorithms pose a serious computational challenge necessitating use of leadership scale computing. In this paper we describe a new approach to performing vector similarity metrics calculations, suitable for parallel systems equipped with graphics processing units (GPUs) or Intel Xeon Phi processors.

We study the problem of testing conductance in the distributed computing model and give a two-sided tester that takes $\mathcal{O}(\log n)$ rounds to decide if a graph has conductance at least $\Phi$ or is $\epsilon$-far from having conductance at least $\Theta(\Phi^2)$ in the distributed CONGEST model. We also show that $\Omega(\log n)$ rounds are necessary for testing conductance even in the LOCAL model. In the case of a connected graph, we show that we can perform the test even when the number of vertices in the graph is not known a priori.

New cloud programming and deployment models pose challenges to software application engineers who are looking, often in vain, for tools to automate any necessary code adaptation and transformation. Function-as-a-Service interfaces are particularly non-trivial targets when considering that most cloud applications are implemented in non-functional languages. Among the most widely used of these languages is Python.

A liquid system provides durable object storage based on spreading redundantly generated data across a network of hundreds to thousands of potentially unreliable storage nodes. A liquid system uses a combination of a large code, lazy repair, and a flow storage organization. We show that a liquid system can be operated to enable flexible and essentially optimal combinations of storage durability, storage overhead, repair bandwidth usage, and access performance.

High network communication cost for synchronizing gradients and parameters is the well-known bottleneck of distributed training. In this work, we propose TernGrad that uses ternary gradients to accelerate distributed deep learning in data parallelism. Our approach requires only three numerical levels {-1,0,1} which can aggressively reduce the communication time.

We study local symmetry breaking problems in the CONGEST model, focusing on ruling set problems, which generalize the fundamental Maximal Independent Set (MIS) problem. A $\beta$-ruling set is an independent set such that every node in the graph is at most $\beta$ hops from a node in the independent set. Our work is motivated by the following central question: can we break the $\Theta(\log n)$ time complexity barrier and the $\Theta(m)$ message complexity barrier in the CONGEST model for MIS or closely-related symmetry breaking problems? We present the following results: - Time Complexity: We show that we can break the $O(\log n)$ "barrier" for 2- and 3-ruling sets.