Computer Science - Performance Publications (50)


Computer Science - Performance Publications

The 2-D discrete wavelet transform (DWT) can be found in the heart of many image-processing algorithms. Until recently, several studies have compared the performance of such transform on various shared-memory parallel architectures, especially on graphics processing units (GPUs). All these studies, however, considered only separable calculation schemes. Read More

Efficient Nearest Neighbor (NN) search in high-dimensional spaces is a foundation of many multimedia retrieval systems. Because it offers low responses times, Product Quantization (PQ) is a popular solution. PQ compresses high-dimensional vectors into short codes using several sub-quantizers, which enables in-RAM storage of large databases. Read More

Mission critical data dissemination in massive Internet of things (IoT) networks imposes constraints on the message transfer delay between devices. Due to low power and communication range of IoT devices, data is foreseen to be relayed over multiple device-to-device (D2D) links before reaching the destination. The coexistence of a massive number of IoT devices poses a challenge in maximizing the successful transmission capacity of the overall network alongside reducing the multi-hop transmission delay in order to support mission critical applications. Read More

Integrating a product of linear forms over the unit simplex can be done in polynomial time if the number of variables n is fixed (V. Baldoni et al., 2011). Read More

The main goal for this article is to compare performance penalties when using KVM virtualization and Docker containers for creating isolated environments for HPC applications. The article provides both data obtained using commonly accepted synthetic tests (High Performance Linpack) and real life applications (OpenFOAM). The article highlights the influence on resulting application performance of major infrastructure configuration options: CPU type presented to VM, networking connection type used. Read More

Many modern parallel computing systems are heterogeneous at their node level. Such nodes may comprise general purpose CPUs and accelerators (such as, GPU, or Intel Xeon Phi) that provide high performance with suitable energy-consumption characteristics. However, exploiting the available performance of heterogeneous architectures may be challenging. Read More

It has been shown that it is impossible to achieve both stringent end-to-end deadline and reliability guarantees in a large network without having complete information of all future packet arrivals. In order to maintain desirable performance in the presence of uncertainty of future packet arrivals, common practice is to add redundancy by increasing link capacities. This paper studies the amount of capacity needed to provide stringent performance guarantees. Read More

An large number of different caching mechanisms have been previously proposed [1], [3], exploring different insertion and eviction policies, and the performance of networks of them have been analyzed in different ways. We review stationary Markovian models of caching networks with independent incident (data) object demand processes and independent storing decisions (object insertion) and forwarding decisions (owing to cache misses) by the caches. We obtain a novel closed-form stationary invariant distribution for LRU and MRU caching nodes, and numerically compare with a simple "Incremental Rank Progress" caching mechanism where cache hits result in slower progress through the ranks than LRU. Read More

Recently we presented TTC, a domain-specific compiler for tensor transpositions. Despite the fact that the performance of the generated code is nearly optimal, due to its offline nature, TTC cannot be utilized in all the application codes in which the tensor sizes and the necessary tensor permutations are determined at runtime. To overcome this limitation, we introduce the open-source C++ library High-Performance Tensor Transposition (HPTT). Read More

This paper considers a general data-fitting problem over a networked system, in which many computing nodes are connected by an undirected graph. This kind of problem can find many real-world applications and has been studied extensively in the literature. However, existing solutions either need a central controller for information sharing or requires slot synchronization among different nodes, which increases the difficulty of practical implementations, especially for a very large and heterogeneous system. Read More

Persistent spread measurement is to count the number of distinct elements that persist in each network flow for predefined time periods. It has many practical applications, including detecting long-term stealthy network activities in the background of normal-user activities, such as stealthy DDoS attack, stealthy network scan, or faked network trend, which cannot be detected by traditional flow cardinality measurement. With big network data, one challenge is to measure the persistent spreads of a massive number of flows without incurring too much memory overhead as such measurement may be performed at the line speed by network processors with fast but small on-chip memory. Read More

Energy consumption is a major concern in multicore systems. Perhaps the simplest strategy for reducing energy costs is to use only as many cores as necessary while still being able to deliver a desired quality of service. Motivated by earlier work on a dynamic (heterogeneous) core allocation scheme for H. Read More

R is a popular language and programming environment for data scientists. It is increasingly co-packaged with both relational and Hadoop-based data platforms and can often be the most dominant computational component in data analytics pipelines. Recent work has highlighted inefficiencies in executing R programs, both in terms of execution time and memory requirements, which in practice limit the size of data that can be analyzed by R. Read More

Cloud computing has allowed applications to allocate and elastically utilize massive amounts of resources of different types, leading to an exponential growth of the applications' configuration space and increased difficulty in predicting their performance. In this work, we describe a novel, automated profiling methodology that makes no assumptions on application structure. Our approach utilizes oblique Decision Trees in order to recursively partition an application's configuration space in disjoint regions, choose a set of representative samples from each subregion according to a defined policy and returns a model for the entire configuration space as a composition of linear models over each subregion. Read More

Motivated by emerging vision-based intelligent services, we consider the problem of rate adaptation for high quality and low delay visual information delivery over wireless networks using scalable video coding. Rate adaptation in this setting is inherently challenging due to the interplay between the variability of the wireless channels, the queuing at the network nodes and the frame-based decoding and playback of the video content at the receiver at very short time scales. To address the problem, we propose a low-complexity, model-based rate adaptation algorithm for scalable video streaming systems, building on a novel performance model based on stochastic network calculus. Read More

The rapidly growing number of large network analysis problems has led to the emergence of many parallel and distributed graph processing systems---one survey in 2014 identified over 80. Since then, the landscape has evolved; some packages have become inactive while more are being developed. Determining the best approach for a given problem is infeasible for most developers. Read More

Convolutional neural networks (CNNs) have emerged as one of the most successful machine learning technologies for image and video processing. The most computationally intensive parts of CNNs are the convolutional layers, which convolve multi-channel images with multiple kernels. A common approach to implementing convolutional layers is to expand the image into a column matrix (im2col) and perform Multiple Channel Multiple Kernel (MCMK) convolution using an existing parallel General Matrix Multiplication (GEMM) library. Read More

Consider a single server queue serving a multiclass population. Some popular scheduling policies for such a system (and of interest in this paper) are the discriminatory processor sharing (DPS), discriminatory random order service (DROS), generalized processor sharing (GPS) and weighted fair queueing (WFQ). The aim of this paper is to show a certain equivalence between these scheduling policies for the special case when the multiclass population have identical and exponential service requirements. Read More

Stencil computations occur in a multitude of scientific simulations and therefore have been the subject of many domain-specific languages including the OPS (Oxford Parallel library for Structured meshes) DSL embedded in C/C++/Fortran. OPS is currently used in several large partial differential equations (PDE) applications, and has been used as a vehicle to experiment with, and deploy performance improving optimisations. The key common bottleneck in most stencil codes is data movement, and other research has shown that improving data locality through optimisations that schedule across loops do particularly well. Read More

The use of synthetic graph generators is a common practice among graph-oriented benchmark designers, as it allows obtaining graphs with the required scale and characteristics. However, finding a graph generator that accurately fits the needs of a given benchmark is very difficult, thus practitioners end up creating ad-hoc ones. Such a task is usually time-consuming, and often leads to reinventing the wheel. Read More

The Breadth First Search (BFS) algorithm is the foundation and building block of many higher graph-based operations such as spanning trees, shortest paths and betweenness centrality. The importance of this algorithm increases each day due to it is a key requirement for many data structures which are becoming popular nowadays. When the BFS algorithm is parallelized by distributing the graph between several processors the interconnection network limits the performance. Read More

Supermarket models are a class of interesting parallel queueing networks with dynamic randomized load balancing and real-time resource management. When the parallel servers are subject to breakdowns and repairs, analysis of such a supermarket model becomes more difficult and challenging. In this paper, we apply the mean-field theory to studying four interrelated supermarket models with repairable servers, and numerically indicate impact of the different repairman groups on performance of the systems. Read More

A fundamental challenge in large-scale cloud networks and data centers is to achieve highly efficient server utilization and limit energy consumption, while providing excellent user-perceived performance in the presence of uncertain and time-varying demand patterns. Auto-scaling provides a popular paradigm for automatically adjusting service capacity in response to demand while meeting performance targets, and queue-driven auto-scaling techniques have been widely investigated in the literature. In typical data center architectures and cloud environments however, no centralized queue is maintained, and load balancing algorithms immediately distribute incoming tasks among parallel queues. Read More

The need for modern data analytics to combine relational, procedural, and map-reduce-style functional processing is widely recognized. State-of-the-art systems like Spark have added SQL front-ends and relational query optimization, which promise an increase in expressiveness and performance. But how good are these extensions at extracting high performance from modern hardware platforms? While Spark has made impressive progress, we show that for relational workloads, there is still a significant gap compared with best-of-breed query engines. Read More

This work presents CLTune, an auto-tuner for OpenCL kernels. It evaluates and tunes kernel performance of a generic, user-defined search space of possible parameter-value combinations. Example parameters include the OpenCL workgroup size, vector data-types, tile sizes, and loop unrolling factors. Read More

In Operation Research, practical evaluation is essential to validate the efficacy of optimization approaches. This paper promotes the usage of performance profiles as a standard practice to visualize and analyze experimental results. It introduces a Web tool to construct and export performance profiles as SVG or HTML files. Read More

Energy efficiency is becoming increasingly important for computing systems, in particular for large scale HPC facilities. In this work we evaluate, from an user perspective, the use of Dynamic Voltage and Frequency Scaling (DVFS) techniques, assisted by the power and energy monitoring capabilities of modern processors in order to tune applications for energy efficiency. We run selected kernels and a full HPC application on two high-end processors widely used in the HPC context, namely an NVIDIA K80 GPU and an Intel Haswell CPU. Read More

Software tracing techniques are well-established and used by instrumentation tools to extract run-time information for program analysis and debugging. Dynamic binary instrumentation as one tool instruments program binaries to extract information. Unfortunately, instrumentation causes perturbation that is unacceptable for time-sensitive applications. Read More

Base station cooperation is a promising scheme to improve network performance for next generation cellular networks. Up to this point research has focused on station grouping criteria based solely on geographic proximity. However, for the cooperation to be meaningful, each station participating in a group should have sufficient available resources to share with others. Read More

This paper presents a survey of architectural features among four generations of Intel server processors (Sandy Bridge, Ivy Bridge, Haswell, and Broad- well) with a focus on performance with floating point workloads. Starting on the core level and going down the memory hierarchy we cover instruction throughput for floating-point instructions, L1 cache, address generation capabilities, core clock speed and its limitations, L2 and L3 cache bandwidth and latency, the impact of Cluster on Die (CoD) and cache snoop modes, and the Uncore clock speed. Using microbenchmarks we study the influence of these factors on code performance. Read More

This paper introduces a novel method for automatically tuning the selection of compiler flags to optimize the performance of software intended to run on embedded hardware platforms. We begin by developing our approach on code compiled by the GNU C Compiler (GCC) for the ARM Cortex-M3 (CM3) processor; and we show how our method outperforms the industry standard -O3 optimization level across a diverse embedded benchmark suite. First we quantify the potential gains by using existing iterative compilation approaches that time-intensively search for optimal configurations for each benchmark. Read More

The well-known Smith-Waterman (SW) algorithm is the most commonly used method for local sequence alignments. However, SW is very computationally demanding for large protein databases. There exist several implementations that take advantage of computing parallelization on many-cores, FPGAs or GPUs, in order to increase the alignment throughtput. Read More

The aim of this study is the characterization of the computing resources used by researchers at the "Instituto de Astrof\'isica de Canarias" (IAC). Since there is a huge demand of computing time and we use tools such as HTCondor to implement High Throughput Computing (HTC) across all available PCs, it is essential for us to assess in a quantitative way, using objective parameters, the performances of our computing nodes. In order to achieve that, we have run a set of benchmark tests on a number of different desktop and laptop PC models among those used in our institution. Read More

Molecular Dynamics is an important tool for computational biologists, chemists, and materials scientists, consuming a sizable amount of supercomputing resources. Many of the investigated systems contain charged particles, which can only be simulated accurately using a long-range solver, such as PPPM. We extend the popular LAMMPS molecular dynamics code with an implementation of PPPM particularly suitable for the second generation Intel Xeon Phi. Read More

We consider an infinite-buffer single-server queue where inter-arrival times are phase-type ($PH$), the service is provided according to Markovian service process $(MSP)$, and the server may take single, exponentially distributed vacations when the queue is empty. The proposed analysis is based on roots of the associated characteristic equation of the vector-generating function (VGF) of system-length distribution at a pre-arrival epoch. Also, we obtain the steady-state system-length distribution at an arbitrary epoch along with some important performance measures such as the mean number of customers in the system and the mean system sojourn time of a customer. Read More

We present a comparative analysis of the maximum performance achieved by the Linpack benchmark on compute intensive hardware publicly available from multiple cloud providers. We study both performance within a single compute node, and speedup for distributed memory calculations with up to 32 nodes or at least 512 computing cores. We distinguish between hyper-threaded and non-hyper-threaded scenarios and estimate the performance per single computing core. Read More

Estimating multiple sparse Gaussian Graphical Models (sGGMs) jointly for many related tasks (large $K$) under a high-dimensional (large $p$) situation is an important task. Most previous studies for the joint estimation of multiple sGGMs rely on penalized log-likelihood estimators that involve expensive and difficult non-smooth optimizations. We propose a novel approach, FASJEM for \underline{fa}st and \underline{s}calable \underline{j}oint structure-\underline{e}stimation of \underline{m}ultiple sGGMs at a large scale. Read More

Fast Fourier Transforms (FFTs) are exploited in a wide variety of fields ranging from computer science to natural sciences and engineering. With the rising data production bandwidths of modern FFT applications, judging best which algorithmic tool to apply, can be vital to any scientific endeavor. As tailored FFT implementations exist for an ever increasing variety of high performance computer hardware, choosing the best performing FFT implementation has strong implications for future hardware purchase decisions, for resources FFTs consume and for possibly decisive financial and time savings ahead of the competition. Read More

In this paper we investigate an emerging application, 3D scene understanding, likely to be significant in the mobile space in the near future. The goal of this exploration is to reduce execution time while meeting our quality of result objectives. In previous work we showed for the first time that it is possible to map this application to power constrained embedded systems, highlighting that decision choices made at the algorithmic design-level have the most impact. Read More

Optimizing the performance of GPU kernels is challenging for both human programmers and code generators. For example, CUDA programmers must set thread and block parameters for a kernel, but might not have the intuition to make a good choice. Similarly, compilers can generate working code, but may miss tuning opportunities by not targeting GPU models or performing code transformations. Read More

This is an annotated bibliography on estimation and inference results for queues and related stochastic models. The purpose of this document is to collect and categorise works in the field, allowing for researchers and practitioners to explore the various types of results that exist. This bibliography attempts to include all known works that satisfy both of these requirements: -Works that deal with queueing models. Read More

We consider a multi-server queueing system under the power-of-two policy with Poisson job arrivals, heterogeneous servers and a general job requirement distribution; each server operates under the first-come first-serve policy and there are no buffer constraints. We analyze the performance of this system in light traffic by evaluating the first two light traffic derivatives of the average job response time. These expressions point to several interesting structural features associated with server heterogeneity in light traffic: For unequal capacities, the average job response time is seen to decrease for small values of the arrival rate, and the more diverse the server speeds, the greater the gain in performance. Read More

Continuous Time Markov Chain (CMTC) is widely used to describe and analyze systems in several knowledge areas. Steady state availability is one important analysis that can be made through Markov chain formalism that allows researchers generate equations for several purposes, such as channel capacity estimation in wireless networks as well as system performance estimations. The problem with this kind of analysis is the complex process to generating these equations. Read More

Graphics Processing Units (GPUs) support dynamic voltage and frequency scaling (DVFS) in order to balance computational performance and energy consumption. However, there still lacks simple and accurate performance estimation of a given GPU kernel under different frequency settings on real hardware, which is important to decide best frequency configuration for energy saving. This paper reveals a fine-grained model to estimate the execution time of GPU kernels with both core and memory frequency scaling. Read More

Artificial Neural Networks (ANNs) have received increasing attention in recent years with applications that span a wide range of disciplines including vital domains such as medicine, network security and autonomous transportation. However, neural network architectures are becoming increasingly complex and with an increasing need to obtain real-time results from such models, it has become pivotal to use parallelization as a mechanism for speeding up network training and deployment. In this work we propose an implementation of Network Parallel Training through Cannon's Algorithm for matrix multiplication. Read More

Achieving optimal program performance requires deep insight into the interaction between hardware and software. For software developers without an in-depth background in computer architecture, understanding and fully utilizing modern architectures is close to impossible. Analytic loop performance modeling is a useful way to understand the relevant bottlenecks of code execution based on simple machine models. Read More

We study the scheduling polices for asymptotically optimal delay in queueing systems with switching overhead. Such systems consist of a single server that serves multiple queues, and some capacity is lost whenever the server switches to serve a different set of queues. The capacity loss due to this switching overhead can be significant in many emerging applications, and needs to be explicitly addressed in the design of scheduling policies. Read More

SRAM-based FPGAs are increasingly popular in the aerospace industry due to their field programmability and low cost. However, they suffer from cosmic radiation induced Single Event Upsets (SEUs). In safety-critical applications, the dependability of the design is a prime concern since failures may have catastrophic consequences. Read More

OpenRuleBench is a large benchmark suite for rule engines, which includes deductive databases. We previously proposed a translation of Datalog to C++ based on a method that "pushes" derived tuples immediately to places where they are used. In this paper, we report performance results of various implementation variants of this method compared to XSB, YAP and DLV. Read More

Convolutions have long been regarded as fundamental to applied mathematics, physics and engineering. Their mathematical elegance allows for common tasks such as numerical differentiation to be computed efficiently on large data sets. Efficient computation of convolutions is critical to artificial intelligence in real-time applications, like machine vision, where convolutions must be continuously and efficiently computed on tens to hundreds of kilobytes per second. Read More