Deadline scheduling for big-data jobs and fault-tolerance of datacenter applications

Peter Bodik, Microsoft Research

Abstract:

I will describe two projects in the space of resource allocation in large-scale datacenters: Jockey, a framework for scheduling big-data jobs to meet latency deadlines, and an approach to placing applications in datacenters so that they survive large-scale hardware failures.

Many big-data jobs, running in Hadoop MapReduce or Microsoft’s Cosmos, require completion by a certain deadline. Missing a deadline might lead to reduced productivity of data analysts, stale content presented by a search engine, or even a financial penalty. However, today’s cluster schedulers do not support specifying a deadline for a job and provide no guarantees on job completion. I will describe Jockey, a framework for providing deadline guarantees for big-data jobs. Offline, Jockey builds a performance model of a job from its past executions; during execution, it uses the model in a control loop to adjust the job’s resource allocation so that it meets the specified deadline.
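To make the control loop concrete, the following minimal Python sketch shows how such a controller might work; the model format, function names, and numbers are illustrative assumptions and do not reflect Jockey's actual implementation or API.

# Illustrative sketch (not Jockey's actual interface): an offline model maps
# (job progress, allocated slots) to predicted remaining run time; a periodic
# controller picks the smallest allocation predicted to finish in time.

def predicted_remaining_time(model, progress, slots):
    """Expected seconds left at this progress level when running on `slots` slots."""
    return model[(round(progress, 1), slots)]

def control_step(model, progress, now, deadline, slot_options):
    """Return the cheapest allocation still predicted to meet the deadline,
    or the largest available one as a best effort."""
    time_left = deadline - now
    for slots in sorted(slot_options):                    # try smallest first
        if predicted_remaining_time(model, progress, slots) <= time_left:
            return slots
    return max(slot_options)

# Toy model built from past runs of the same job (hypothetical numbers).
model = {(0.5, 10): 900.0, (0.5, 20): 500.0, (0.5, 40): 280.0}
print(control_step(model, progress=0.5, now=0.0, deadline=600.0,
                   slot_options=[10, 20, 40]))            # -> 20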

In the second half of the talk, I will discuss improving the fault tolerance of applications deployed in datacenters. Datacenter networks have been designed to tolerate failures of network equipment and provide sufficient bandwidth. In practice, however, failures and maintenance of networking and power equipment often make tens to thousands of servers unavailable, and network congestion can increase service latency. Unfortunately, there is an inherent tradeoff between achieving high fault tolerance for applications deployed in a datacenter and reducing bandwidth usage in the network core. Spreading servers across fault domains improves fault tolerance but requires additional bandwidth, while deploying servers close together reduces bandwidth usage but also decreases fault tolerance. We present a detailed analysis of a large-scale Web application and its communication patterns. Based on that analysis, we propose and evaluate a novel optimization framework that achieves high fault tolerance and significantly reduces bandwidth usage in the network core by exploiting the skewness in the observed communication patterns.
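The following minimal Python sketch illustrates the tradeoff under stated assumptions; the components, fault domains, traffic matrix, and scoring functions are hypothetical simplifications, not the optimization framework presented in the talk.

# Illustrative sketch (a simplification, not the actual framework): score a
# placement by the worst-case loss when one fault domain fails and by the
# traffic that must cross the network core.

def worst_case_loss(placement, sizes):
    """Largest fraction of servers lost if any single fault domain fails."""
    total = sum(sizes.values())
    per_domain = {}
    for component, domain in placement.items():
        per_domain[domain] = per_domain.get(domain, 0) + sizes[component]
    return max(per_domain.values()) / total

def core_traffic(placement, traffic):
    """Traffic between components placed in different fault domains."""
    return sum(rate for (a, b), rate in traffic.items()
               if placement[a] != placement[b])

# Hypothetical application with a skewed communication matrix: one pair of
# components exchanges most of the traffic.
sizes   = {"web": 40, "cache": 20, "db": 10}          # servers per component
traffic = {("web", "cache"): 100.0, ("web", "db"): 5.0, ("cache", "db"): 2.0}

spread    = {"web": "d1", "cache": "d2", "db": "d3"}  # maximize fault tolerance
colocated = {"web": "d1", "cache": "d1", "db": "d2"}  # keep the heavy pair local
for name, placement in [("spread", spread), ("colocated", colocated)]:
    print(name, worst_case_loss(placement, sizes), core_traffic(placement, traffic))

In this toy example, spreading the components loses a smaller fraction of the application when a single fault domain fails but pushes the heavy web-cache traffic through the core, whereas co-locating that pair keeps most traffic local at the cost of fault tolerance.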

Biography: Peter Bodik is a Researcher at Microsoft Research in the Mobility and Networking Research group. He obtained his PhD from UC Berkeley under the supervision of Dave Patterson, Armando Fox, and Michael Jordan, working on applying machine learning techniques to the diagnosis and automatic scaling of large-scale distributed systems. At MSR, he works on many aspects of distributed systems and networking, such as resource allocation and scheduling in low-latency and batch-processing clusters, and performance optimization and isolation in multi-tenant systems. He has published papers in top-tier conferences such as SIGCOMM, EuroSys, and FAST.