Computer Science - Performance Publications (50)


Computer Science - Performance Publications

The well-known Smith-Waterman (SW) algorithm is the most commonly used method for local sequence alignments. However, SW is very computationally demanding for large protein databases. There exist several implementations that take advantage of computing parallelization on many-cores, FPGAs or GPUs, in order to increase the alignment throughtput. Read More

The aim of this study is the characterization of the computing resources used by researchers at the "Instituto de Astrof\'isica de Canarias" (IAC). Since there is a huge demand of computing time and we use tools such as HTCondor to implement High Throughput Computing (HTC) across all available PCs, it is essential for us to assess in a quantitative way, using objective parameters, the performances of our computing nodes. In order to achieve that, we have run a set of benchmark tests on a number of different desktop and laptop PC models among those used in our institution. Read More

Molecular Dynamics is an important tool for computational biologists, chemists, and materials scientists, consuming a sizable amount of supercomputing resources. Many of the investigated systems contain charged particles, which can only be simulated accurately using a long-range solver, such as PPPM. We extend the popular LAMMPS molecular dynamics code with an implementation of PPPM particularly suitable for the second generation Intel Xeon Phi. Read More

We consider an infinite-buffer single-server queue where inter-arrival times are phase-type ($PH$), the service is provided according to Markovian service process $(MSP)$, and the server may take single, exponentially distributed vacations when the queue is empty. The proposed analysis is based on roots of the associated characteristic equation of the vector-generating function (VGF) of system-length distribution at a pre-arrival epoch. Also, we obtain the steady-state system-length distribution at an arbitrary epoch along with some important performance measures such as the mean number of customers in the system and the mean system sojourn time of a customer. Read More

We present a comparative analysis of the maximum performance achieved by the Linpack benchmark on compute intensive hardware publicly available from multiple cloud providers. We study both performance within a single compute node, and speedup for distributed memory calculations with up to 32 nodes or at least 512 computing cores. We distinguish between hyper-threaded and non-hyper-threaded scenarios and estimate the performance per single computing core. Read More

Fast Fourier Transforms (FFTs) are exploited in a wide variety of fields ranging from computer science to natural sciences and engineering. With the rising data production bandwidths of modern FFT applications, judging best which algorithmic tool to apply, can be vital to any scientific endeavor. As tailored FFT implementations exist for an ever increasing variety of high performance computer hardware, choosing the best performing FFT implementation has strong implications for future hardware purchase decisions, for resources FFTs consume and for possibly decisive financial and time savings ahead of the competition. Read More

In this paper we investigate an emerging application, 3D scene understanding, likely to be significant in the mobile space in the near future. The goal of this exploration is to reduce execution time while meeting our quality of result objectives. In previous work we showed for the first time that it is possible to map this application to power constrained embedded systems, highlighting that decision choices made at the algorithmic design-level have the most impact. Read More

Optimizing the performance of GPU kernels is challenging for both human programmers and code generators. For example, CUDA programmers must set thread and block parameters for a kernel, but might not have the intuition to make a good choice. Similarly, compilers can generate working code, but may miss tuning opportunities by not targeting GPU models or performing code transformations. Read More

This is an annotated bibliography on estimation and inference results for queues and related stochastic models. The purpose of this document is to collect and categorise works in the field, allowing for researchers and practitioners to explore the various types of results that exist. This bibliography attempts to include all known works that satisfy both of these requirements: -Works that deal with queueing models. Read More

We consider a multi-server queueing system under the power-of-two policy with Poisson job arrivals, heterogeneous servers and a general job requirement distribution; each server operates under the first-come first-serve policy and there are no buffer constraints. We analyze the performance of this system in light traffic by evaluating the first two light traffic derivatives of the average job response time. These expressions point to several interesting structural features associated with server heterogeneity in light traffic: For unequal capacities, the average job response time is seen to decrease for small values of the arrival rate, and the more diverse the server speeds, the greater the gain in performance. Read More

Continuous Time Markov Chain (CMTC) is widely used to describe and analyze systems in several knowledge areas. Steady state availability is one important analysis that can be made through Markov chain formalism that allows researchers generate equations for several purposes, such as channel capacity estimation in wireless networks as well as system performance estimations. The problem with this kind of analysis is the complex process to generating these equations. Read More

Graphics Processing Units (GPUs) support dynamic voltage and frequency scaling (DVFS) in order to balance computational performance and energy consumption. However, there still lacks simple and accurate performance estimation of a given GPU kernel under different frequency settings on real hardware, which is important to decide best frequency configuration for energy saving. This paper reveals a fine-grained model to estimate the execution time of GPU kernels with both core and memory frequency scaling. Read More

Artificial Neural Networks (ANNs) have received increasing attention in recent years with applications that span a wide range of disciplines including vital domains such as medicine, network security and autonomous transportation. However, neural network architectures are becoming increasingly complex and with an increasing need to obtain real-time results from such models, it has become pivotal to use parallelization as a mechanism for speeding up network training and deployment. In this work we propose an implementation of Network Parallel Training through Cannon's Algorithm for matrix multiplication. Read More

Achieving optimal program performance requires deep insight into the interaction between hardware and software. For software developers without an in-depth background in computer architecture, understanding and fully utilizing modern architectures is close to impossible. Analytic loop performance modeling is a useful way to understand the relevant bottlenecks of code execution based on simple machine models. Read More

We study the scheduling polices for asymptotically optimal delay in queueing systems with switching overhead. Such systems consist of a single server that serves multiple queues, and some capacity is lost whenever the server switches to serve a different set of queues. The capacity loss due to this switching overhead can be significant in many emerging applications, and needs to be explicitly addressed in the design of scheduling policies. Read More

SRAM-based FPGAs are increasingly popular in the aerospace industry due to their field programmability and low cost. However, they suffer from cosmic radiation induced Single Event Upsets (SEUs). In safety-critical applications, the dependability of the design is a prime concern since failures may have catastrophic consequences. Read More

OpenRuleBench is a large benchmark suite for rule engines, which includes deductive databases. We previously proposed a translation of Datalog to C++ based on a method that "pushes" derived tuples immediately to places where they are used. In this paper, we report performance results of various implementation variants of this method compared to XSB, YAP and DLV. Read More

Convolutions have long been regarded as fundamental to applied mathematics, physics and engineering. Their mathematical elegance allows for common tasks such as numerical differentiation to be computed efficiently on large data sets. Efficient computation of convolutions is critical to artificial intelligence in real-time applications, like machine vision, where convolutions must be continuously and efficiently computed on tens to hundreds of kilobytes per second. Read More

Flexibility at hardware level is the main driving force behind adaptive systems whose aim is to realise microarhitecture deconfiguration 'online'. This feature allows the software/hardware stack to tolerate drastic changes of the workload in data centres. With emerge of FPGA reconfigurablity this technology is becoming a mainstream computing paradigm. Read More

Frameworks, such as MapReduce and Hadoop are abundant nowadays. They seek to reap benefits of parallelization, albeit subject to a synchronization constraint at the output. Fork-Join (FJ) queuing models are used to analyze such systems. Read More

Fork-Join (FJ) queueing models capture the dynamics of system parallelization under synchronization constraints, for example, for applications such as MapReduce, multipath transmission and RAID systems. Arriving jobs are first split into tasks and mapped to servers for execution, such that a job can only leave the system when all of its tasks are executed. In this paper, we provide computable stochastic bounds for the waiting and response time distributions for heterogeneous FJ systems under general parallelization benefit. Read More

We present efficient realization of Householder Transform (HT) based QR factorization through algorithm-architecture co-design where we achieve performance improvement of 3-90x in-terms of Gflops/watt over state-of-the-art multicore, General Purpose Graphics Processing Units (GPGPUs), Field Programmable Gate Arrays (FPGAs), and ClearSpeed CSX700. Theoretical and experimental analysis of classical HT is performed for opportunities to exhibit higher degree of parallelism where parallelism is quantified as a number of parallel operations per level in the Directed Acyclic Graph (DAG) of the transform. Based on theoretical analysis of classical HT, an opportunity re-arrange computations in the classical HT is identified that results in Modified HT (MHT) where it is shown that MHT exhibits 1. Read More

This article introduces a novel family of decentralised caching policies, applicable to wireless networks with finite storage at the edge-nodes (stations). These policies, that are based on the Least-Recently-Used replacement principle, are here referred to as spatial multi-LRU. They update cache inventories in a way that provides content diversity to users who are covered by, and thus have access to, more than one station. Read More

Demand-Response (DR) programs, whereby users of an electricity network are encouraged by economic incentives to rearrange their consumption in order to reduce production costs, are envisioned to be a key feature of the smart grid paradigm. Several recent works proposed DR mechanisms and used analytical models to derive optimal incentives. Most of these works, however, rely on a macroscopic description of the population that does not model individual choices of users. Read More

"Geographic Load Balancing" is a strategy for reducing the energy cost of data centers spreading across different terrestrial locations. In this paper, we focus on load balancing among micro-datacenters powered by renewable energy sources. We model via a Markov Chain the problem of scheduling jobs by prioritizing datacenters where renewable energy is currently available. Read More

The paper introduces RADULS, a new parallel sorter based on radix sort algorithm, intended to organize ultra-large data sets efficiently. For example 4G 16-byte records can be sorted with 16 threads in less than 15 seconds on Intel Xeon-based workstation. The implementation of RADULS is not only highly optimized to gain such an excellent performance, but also parallelized in a cache friendly manner to make the most of modern multicore architectures. Read More

We consider a Markovian many server queueing system in which customers are preemptively scheduled according to exogenously assigned priority levels. The priority levels are randomly assigned from a continuous probability measure rather than a discrete one and hence, the queue is modeled by an infinite dimensional stochastic process. We analyze the equilibrium behavior of the system and provide several results. Read More

The rapid advances in sensors and ultra-low power wireless communication has enabled a new generation of wireless sensor networks: Wireless Body Area Networks (WBAN). To the best of our knowledge the current paper is the first to address broadcast in WBAN. We first analyze several broadcast strategies inspired from the area of Delay Tolerant Networks (DTN). Read More

Nowadays Big Data are becoming more and more important. Many sectors of our economy are now guided by data-driven decision processes. Big Data and business intelligence applications are facilitated by the MapReduce programming model while, at infrastructural layer, cloud computing provides flexible and cost effective solutions for allocating on demand large clusters. Read More

In this paper, we present a stochastic queuing model for the road traffic, which captures the stationary density-flow relationships in both uncongested and congestion conditions. The proposed model is based on the $M/g/c/c$ state dependent queuing model of Jain and Smith, and is inspired from the deterministic Godunov scheme for the road traffic simulation. We first propose a reformulation of the $M/g/c/c$ state dependent model that works with density-flow fundamental diagrams rather than density-speed relationships. Read More

We present a modular framework, the Workload Characterisation Framework (WCF), that is developed to reproducibly obtain, store and compare key characteristics of radio astronomy processing software. As a demonstration, we discuss the experiences using the framework to characterise a LOFAR calibration and imaging pipeline. Read More

The IPv4 addresses exhaustion demands a protocol transition from IPv4 to IPv6. The original transition technique, the dual stack, is not widely deployed yet and it demanded the creation of new transition techniques to extend the transition period. This work makes an experimental comparison of techniques that use dual stack with a limited IPv4 address. Read More

Demand response is a crucial technology to allow large-scale penetration of intermittent renewable energy sources in the electric grid. This paper is based on the thesis that datacenters represent especially attractive candidates for providing flexible, real-time demand response services to the grid; they are capable of finely-controllable power consumption, fast power ramp-rates, and large dynamic range. This paper makes two main contributions: (a) it provides detailed experimental evidence justifying this thesis, and (b) it presents a comparative investigation of three candidate software interfaces for power control within the servers. Read More

DESP-C++ is a C++ discrete-event random simulation engine that has been designed to be fast, very easy to use and expand, and valid. DESP-C++ is based on the resource view. Its complete architecture is presented in detail, as well as a short " user manual ". Read More

High speed wireless access on 60 GHz spectrum relies on high-gain directional antennas to overcome the severe signal attenuation. However, perfect alignment between transmitting and receiving antenna beams is rare in practice and overheard signals from concurrent transmissions may cause significant interference. In this paper we analyze the impact of antenna beam misalignment on the system performance of 60 GHz wireless access. Read More

There exist multitudes of cloud performance metrics, including workload performance, application placement, software/hardware optimization, scalability, capacity, reliability, agility and so on. In this paper, we consider jointly optimizing the performance of the software applications in the cloud. The challenges lie in bringing a diversity of raw data into tidy data format, unifying performance data from multiple systems based on timestamps, and assessing the quality of the processed performance data. Read More

The term "performance portability" has been informally used in computing to refer to a variety of notions which generally include: 1) the ability to run one application across multiple hardware platforms; and 2) achieving some notional level of performance on these platforms. However, there has been a noticeable lack of consensus on the precise meaning of the term, and authors' conclusions regarding their success (or failure) to achieve performance portability have thus been subjective. Comparing one approach to performance portability with another has generally been marked with vague claims and verbose, qualitative explanation of the comparison. Read More

We propose two novel techniques for overcoming load-imbalance encountered when implementing so-called look-ahead mechanisms in relevant dense matrix factorizations for the solution of linear systems. Both techniques target the scenario where two thread teams are created/activated during the factorization, with each team in charge of performing an independent task/branch of execution. The first technique promotes worker sharing (WS) between the two tasks, allowing the threads of the task that completes first to be reallocated for use by the costlier task. Read More

A fundamental building block for supporting better utilization of radio spectrum involves predicting the impact that an emitter will have at different geographic locations. To this end, fixed sensors can be deployed to spatially sample the RF environment over an area of interest, with interpolation methods used to infer received power at locations between sensors. This paper describes a radio map interpolation method that exploits the known properties of most path loss models, with the aim of minimizing the RMS errors in predicted dB-power. Read More

We describe a high-performance implementation of the lattice Boltzmann method (LBM) for sparse 3D geometries on graphic processors (GPU). The main contribution of this work is a data layout that allows to minimise the number of redundant memory transactions during the propagation step of LBM. We show that by using a uniform mesh of small three-dimensional tiles and a careful data placement it is possible to utilise more than 70% of maximum theoretical GPU memory bandwidth for D3Q19 lattice and double precision numbers. Read More

In this paper, a methodology is presented and employed for simulating the Internet of Things (IoT). The requirement for scalability, due to the possibly huge amount of involved sensors and devices, and the heterogeneous scenarios that might occur, impose resorting to sophisticated modeling and simulation techniques. In particular, multi-level simulation is regarded as a main framework that allows simulating large-scale IoT environments while keeping high levels of detail, when it is needed. Read More

We consider an input queued switch operating under the MaxWeight scheduling algorithm. This system is interesting to study because it is a model for Internet routers and data center networks. Recently, it was shown that the MaxWeight algorithm has optimal heavy-traffic queue length scaling when all ports are uniformly saturated. Read More

In this paper we focus on the integration of high-performance numerical libraries in ab initio codes and the portability of performance and scalability. The target of our work is FLEUR, a software for electronic structure calculations developed in the Forschungszentrum J\"ulich over the course of two decades. The presented work follows up on a previous effort to modernize legacy code by re-engineering and rewriting it in terms of highly optimized libraries. Read More

Nowadays, more and more increasingly hard computations are performed in challenging fields like weather forecasting, oil and gas exploration, and cryptanalysis. Many of such computations can be implemented using a computer cluster with a large number of servers. Incoming computation requests are then, via a so-called load balancing policy, distributed over the servers to ensure optimal performance. Read More

Affiliations: 1University of Trieste, CNR-ISTI, 2University of Edinburgh

In this paper we focus on spatial Markov population models, describing the stochastic evolution of populations of agents, explicitly modelling their spatial distribution, representing space as a discrete, finite graph. More specifically, we present a heuristic approach to aggregating spatial locations, which is designed to preserve the dynamical behaviour of the model whilst reducing the computational cost of analysis. Our approach combines stochastic approximation ideas (moment closure, linear noise), with computational statistics (spectral clustering) to obtain an efficient aggregation, which is experimentally shown to be reasonably accurate on two case studies: an instance of epidemic spreading and a London bike sharing scenario. Read More

Polyhedral compilers perform optimizations such as tiling and parallelization; when doing both, they usually generate code that executes "barrier-synchronized wavefronts" of tiles. We present a system to express and generate code for hybrid schedules, where some constraints are automatically satisfied through the structure of the code, and the remainder are dynamically enforced at run-time with data flow mechanisms. We prove bounds on the added overheads that are better, by at least one polynomial degree, than those of previous techniques. Read More

Multi-server systems have received increasing attention with important implementations such as Google MapReduce, Hadoop, and Spark. Common to these systems are a fork operation, where jobs are first divided into tasks that are processed in parallel, and a later join operation, where completed tasks wait until the results of all tasks of a job can be combined and the job leaves the system. The synchronization constraint of the join operation makes the analysis of fork-join systems challenging and few explicit results are known. Read More

A breakdown of a benchmark score is how much each aspect of the system performance affects the score. Existing methods require internal analysis on the benchmarking program and then involve the following problems: (1) require a certain amount of labor for code analysis, profiling, simulation, and so on and (2) require the benchmarking program itself. In this paper, we present a method for breaking down a benchmark score without internal analysis of the benchmarking program. Read More

Device-to-Device (D2D) communication, which enables direct communication between nearby mobile devices, is an attractive add-on component to improve spectrum efficiency and user experience by reusing licensed cellular spectrum. Nowadays, LTE-unlicensed (LTE-U) emerges to extend the cellular network to the unlicensed spectrum to alleviate the spectrum scarcity issue. In this paper, we propose to enable D2D communication in unlicensed spectrum (D2D-U) as an underlay of the uplink cellular network for further booming the network capacity. Read More

We consider a mathematical model for streaming media packets (as the motivating key example) from a transmitter buffer to a receiver over a wireless link while controlling the transmitter power (hence, the packet/job processing rate). When each packet comes to the head-of-line (HOL) in the buffer, it is given a deadline $D$ which is the maximum number of times the transmitter can attempt retransmission in order to successfully transmit the packet. If this number of transmission attempts is exhausted, the packet is ejected from the buffer and the next packet comes to the HOL. Read More