Brownout: Building More Robust Cloud Applications

Cristian Klein, Umeå University

Abstract:

Resource allocation in clouds is mostly done assuming hard requirements: applications either receive the requested resources or fail. Given the dynamic nature of workloads and the risk of cascading failures, guaranteeing on-demand allocations requires large spare capacity. Hence, one cannot have a system that is both reliable and efficient.
To solve this issue, we introduce brownout, a new paradigm to improve the robustness of replicated cloud applications. Brownout applications contain some optional code that can be dynamically deactivated as needed. Although this idea is simple and fairly non-intrusive to application code, properly supporting it required changes in several components. First, at the replica level, we synthesize a replica controller to decide when to execute the optional code and when to skip it. Second, we propose a resource manager to decide allocations among multiple brownout applications in a fair manner. Third, we propose two novel load-balancing algorithms, specifically designed for brownout replicas, to maximize the amount of optional content served. We theoretically prove properties of the overall system using control and game theory.
To show the practical applicability, we implemented brownout versions of RUBiS and RUBBoS with fewer than 170 lines of code. The load-balancing algorithms were implemented on top of lighttpd with fewer than 180 lines of code. Experiments show that brownout may enable considerable improvements in withstanding flash crowds or hardware failures. Brownout opens up more flexibility in cloud resource management, which is why we encourage further research by publishing all source code.
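
The replica-level idea in the abstract (optional code that a controller switches on or off as load changes) can be illustrated with a minimal sketch. This is not the controller synthesized in the work presented. It is a simple clamped integral-style update, written here only for illustration. The names `DimmerController`, `setpoint_s`, `gain`, `dimmer`, and `handle_request` are hypothetical.

```python
import random


class DimmerController:
    """Hypothetical replica controller (an assumption, not the talk's design).

    `dimmer` is the probability of running the optional code for a request.
    """

    def __init__(self, setpoint_s, gain=0.5, dimmer=1.0):
        self.setpoint_s = setpoint_s  # target response time, in seconds
        self.gain = gain              # step size of the adjustment
        self.dimmer = dimmer          # start by always serving optional content

    def update(self, measured_latency_s):
        # Positive error: replies are faster than the target, so serve more
        # optional content. Negative error: serve less.
        error = self.setpoint_s - measured_latency_s
        self.dimmer = min(1.0, max(0.0, self.dimmer + self.gain * error))
        return self.dimmer


def handle_request(controller, rng, mandatory, optional):
    """Always run the mandatory part; run the optional part with
    probability `controller.dimmer`."""
    response = mandatory()
    if rng.random() < controller.dimmer:
        response += optional()
    return response


if __name__ == "__main__":
    controller = DimmerController(setpoint_s=1.0)
    rng = random.Random(42)
    # Simulated latency measurements: an overload, then recovery.
    for latency in [0.8, 2.0, 3.0, 0.5, 0.5]:
        print(latency, controller.update(latency))
    print(handle_request(controller, rng, lambda: "item page", lambda: " + extras"))
```

In such a sketch, the mandatory part would be the core page and the optional part something like a hypothetical extras panel. Under overload the dimmer falls toward 0 and replies carry only the mandatory content, rather than the request failing.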

Biography: Cristian Klein is a post-doctoral researcher in the Distributed Systems group at Umeå University, Sweden. Cristian holds a Ph.D. degree in Computing Science from École Normale Supérieure de Lyon / INRIA, France. His current research interests include resource management in parallel and distributed systems, such as Grids and Clouds, with a focus on dealing with capacity shortages.