Evaluation of DVFS techniques on modern HPC processors and accelerators for energy-aware applications

Energy efficiency is becoming increasingly important for computing systems, in particular for large scale HPC facilities. In this work we evaluate, from an user perspective, the use of Dynamic Voltage and Frequency Scaling (DVFS) techniques, assisted by the power and energy monitoring capabilities of modern processors in order to tune applications for energy efficiency. We run selected kernels and a full HPC application on two high-end processors widely used in the HPC context, namely an NVIDIA K80 GPU and an Intel Haswell CPU. We evaluate the available trade-offs between energy-to-solution and time-to-solution, attempting a function-by-function frequency tuning. We finally estimate the benefits obtainable running the full code on a HPC multi-GPU node, with respect to default clock frequency governors. We instrument our code to accurately monitor power consumption and execution time without the need of any additional hardware, and we enable it to change CPUs and GPUs clock frequencies while running. We analyze our results on the different architectures using a simple energy-performance model, and derive a number of energy saving strategies which can be easily adopted on recent high-end HPC systems for generic applications.


Similar Publications

With the surge of multi- and manycores, much research has focused on algorithms for mapping and scheduling on these complex platforms. Large classes of these algorithms face scalability problems. This is why diverse methods are commonly used for reducing the search space. Read More


In asynchronous distributed systems it is very hard to assess if one of the processes taking part in a computation is operating correctly or has failed. To overcome this problem, distributed algorithms are created using unreliable failure detectors that capture in an abstract way timing assumptions necessary to assess the operating status of a process. One particular type of failure detector is a leader election, that indicates a single process that has not failed. Read More


The celebrated Time Hierarchy Theorem for Turing machines states, informally, that more problems can be solved given more time. The extent to which a time hierarchy-type theorem holds in the distributed LOCAL model has been open for many years. It is consistent with previous results that all natural problems in the LOCAL model can be classified according to a small constant number of complexities, such as $O(1),O(\log^* n), O(\log n), 2^{O(\sqrt{\log n})}$, etc. Read More


Boolean networks is a well-established formalism for modelling biological systems. A vital challenge for analysing a Boolean network is to identify all the attractors. This becomes more challenging for large asynchronous Boolean networks, due to the asynchronous updating scheme. Read More


Grids allow users flexible on-demand usage of computing resources through remote communication networks. A remarkable example of a Grid in High Energy Physics (HEP) research is used in the ALICE experiment at European Organization for Nuclear Research CERN. Physicists can submit jobs used to process the huge amount of particle collision data produced by the Large Hadron Collider (LHC). Read More


The $CONGEST$ model for distributed network computing is well suited for analyzing the impact of limiting the throughput of a network on its capacity to solve tasks efficiently. For many "global" problems there exists a lower bound of $\Omega(D + \sqrt{n/B})$, where $B$ is the amount of bits that can be exchanged between two nodes in one round of communication, $n$ is the number of nodes and $D$ is the diameter of the graph. Typically, upper bounds are given only for the case $B=O(\log n)$, or for the case $B = +\infty$. Read More


We consider the problem of routing in presence of faults in undirected weighted graphs. More specifically, we focus on the design of compact name-independent fault-tolerant routing schemes, where the designer of the scheme is not allowed to assign names to nodes, i.e. Read More


On the one hand, the correctness of routing protocols in networks is an issue of utmost importance for guaranteeing the delivery of messages from any source to any target. On the other hand, a large collection of routing schemes have been proposed during the last two decades, with the objective of transmitting messages along short routes, while keeping the routing tables small. Regrettably, all these schemes share the property that an adversary may modify the content of the routing tables with the objective of, e. Read More


The paper is devoted to an analytical study of the "master-worker" framework scalability on multiprocessors with distributed memory. A new model of parallel computations called BSF is proposed. The BSF model is based on BSP and SPMD models. Read More


In the context of distributed synchronous computing, processors perform in rounds, and the time-complexity of a distributed algorithm is classically defined as the number of rounds before all computing nodes have output. Hence, this complexity measure captures the running time of the slowest node(s). In this paper, we are interested in the running time of the ordinary nodes, to be compared with the running time of the slowest nodes. Read More