A Hierarchical Framework of Cloud Resource Allocation and Power Management Using Deep Reinforcement Learning

Automatic decision-making approaches, such as reinforcement learning (RL), have been applied to (partially) solve the resource allocation problem adaptively in the cloud computing system. However, a complete cloud resource allocation framework exhibits high dimensions in state and action spaces, which prohibit the usefulness of traditional RL techniques. In addition, high power consumption has become one of the critical concerns in design and control of cloud computing systems, which degrades system reliability and increases cooling cost. An effective dynamic power management (DPM) policy should minimize power consumption while maintaining performance degradation within an acceptable level. Thus, a joint virtual machine (VM) resource allocation and power management framework is critical to the overall cloud computing system. Moreover, novel solution framework is necessary to address the even higher dimensions in state and action spaces. In this paper, we propose a novel hierarchical framework for solving the overall resource allocation and power management problem in cloud computing systems. The proposed hierarchical framework comprises a global tier for VM resource allocation to the servers and a local tier for distributed power management of local servers. The emerging deep reinforcement learning (DRL) technique, which can deal with complicated control problems with large state space, is adopted to solve the global tier problem. Furthermore, an autoencoder and a novel weight sharing structure are adopted to handle the high-dimensional state space and accelerate the convergence speed. On the other hand, the local tier of distributed server power managements comprises an LSTM based workload predictor and a model-free RL based power manager, operating in a distributed manner.

Comments: accepted by 37th IEEE International Conference on Distributed Computing (ICDCS 2017)

Similar Publications

With the surge of multi- and manycores, much research has focused on algorithms for mapping and scheduling on these complex platforms. Large classes of these algorithms face scalability problems. This is why diverse methods are commonly used for reducing the search space. Read More


In asynchronous distributed systems it is very hard to assess if one of the processes taking part in a computation is operating correctly or has failed. To overcome this problem, distributed algorithms are created using unreliable failure detectors that capture in an abstract way timing assumptions necessary to assess the operating status of a process. One particular type of failure detector is a leader election, that indicates a single process that has not failed. Read More


The celebrated Time Hierarchy Theorem for Turing machines states, informally, that more problems can be solved given more time. The extent to which a time hierarchy-type theorem holds in the distributed LOCAL model has been open for many years. It is consistent with previous results that all natural problems in the LOCAL model can be classified according to a small constant number of complexities, such as $O(1),O(\log^* n), O(\log n), 2^{O(\sqrt{\log n})}$, etc. Read More


Boolean networks is a well-established formalism for modelling biological systems. A vital challenge for analysing a Boolean network is to identify all the attractors. This becomes more challenging for large asynchronous Boolean networks, due to the asynchronous updating scheme. Read More


Grids allow users flexible on-demand usage of computing resources through remote communication networks. A remarkable example of a Grid in High Energy Physics (HEP) research is used in the ALICE experiment at European Organization for Nuclear Research CERN. Physicists can submit jobs used to process the huge amount of particle collision data produced by the Large Hadron Collider (LHC). Read More


The $CONGEST$ model for distributed network computing is well suited for analyzing the impact of limiting the throughput of a network on its capacity to solve tasks efficiently. For many "global" problems there exists a lower bound of $\Omega(D + \sqrt{n/B})$, where $B$ is the amount of bits that can be exchanged between two nodes in one round of communication, $n$ is the number of nodes and $D$ is the diameter of the graph. Typically, upper bounds are given only for the case $B=O(\log n)$, or for the case $B = +\infty$. Read More


We consider the problem of routing in presence of faults in undirected weighted graphs. More specifically, we focus on the design of compact name-independent fault-tolerant routing schemes, where the designer of the scheme is not allowed to assign names to nodes, i.e. Read More


On the one hand, the correctness of routing protocols in networks is an issue of utmost importance for guaranteeing the delivery of messages from any source to any target. On the other hand, a large collection of routing schemes have been proposed during the last two decades, with the objective of transmitting messages along short routes, while keeping the routing tables small. Regrettably, all these schemes share the property that an adversary may modify the content of the routing tables with the objective of, e. Read More


The paper is devoted to an analytical study of the "master-worker" framework scalability on multiprocessors with distributed memory. A new model of parallel computations called BSF is proposed. The BSF model is based on BSP and SPMD models. Read More


In the context of distributed synchronous computing, processors perform in rounds, and the time-complexity of a distributed algorithm is classically defined as the number of rounds before all computing nodes have output. Hence, this complexity measure captures the running time of the slowest node(s). In this paper, we are interested in the running time of the ordinary nodes, to be compared with the running time of the slowest nodes. Read More