KubeCon Proposal

QI JUN edited this page Feb 4, 2020 · 6 revisions

Session Details

Session Title (75 character max)

ElasticDL: a Kubernetes-native Deep Learning Framework

Topic

Kubernetes and Machine Learning

Session Description (900 character max)

Tom will introduce ElasticDL, a Kubernetes-native deep learning framework that supports fault tolerance and elastic scheduling.

Benefits to the Ecosystem (1500 character max)

In large-scale training, a job may fail because of a hardware failure or because it gets preempted by the job scheduler. ElasticDL supports fault tolerance so that the job continues rather than crashing.

Fault tolerance allows ElasticDL to work with the priority-based preemption of Kubernetes to achieve elastic scheduling. When Kubernetes kills some processes of a job to free resources for newly arrived jobs with higher priority, the current job does not fail but continues with fewer resources.
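The behavior above can be sketched from the job's point of view. This is a minimal, hypothetical illustration, not ElasticDL's actual API: the set of live workers may shrink when Kubernetes preempts pods, and the job simply re-queues work instead of failing.

```python
# Illustrative sketch only; function and parameter names are assumptions,
# not ElasticDL identifiers.

def train_elastically(tasks, get_live_workers):
    """Dispatch tasks to whichever workers are currently alive."""
    pending = list(tasks)
    done = []
    while pending:
        # May return fewer workers after a preemption than before it.
        workers = get_live_workers()
        if not workers:
            raise RuntimeError("all workers preempted; job must wait")
        # Assign one task per live worker; if the worker pool shrinks,
        # the remaining tasks stay queued instead of failing the job.
        batch, pending = pending[:len(workers)], pending[len(workers):]
        done.extend(batch)
    return done
```

The key design point is that losing a worker only slows the job down; no single worker's death is fatal to the training process.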

Elastic scheduling could significantly improve the overall utilization of a cluster. Suppose that a cluster has N GPUs, and a job is using one of them. Without elastic scheduling, a new job claiming N GPUs would have to wait for the first job to complete before starting. This pending time could be hours, days, or even weeks, during which the utilization of the cluster is only 1/N. With elastic scheduling, the new job could start running immediately with N-1 GPUs, and Kubernetes might increase its GPU consumption by 1 after the first job completes. In this case, the overall utilization is 100%.
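The arithmetic in this example can be made concrete. Below is a back-of-the-envelope comparison for an assumed cluster size of N = 8, with job A holding 1 GPU and job B requesting all N:

```python
# Cluster utilization with and without elastic scheduling, following
# the N-GPU example above. N = 8 is an arbitrary illustrative choice.
N = 8

# Without elastic scheduling, job B waits, so only job A's 1 GPU is busy.
util_static = 1 / N

# With elastic scheduling, job B starts at once on the remaining N-1 GPUs,
# so all N GPUs are in use.
util_elastic = (1 + (N - 1)) / N

print(util_static)   # 0.125
print(util_elastic)  # 1.0
```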

We provide a distributed communication package that implements two mainstream communication strategies: parameter server and AllReduce. Both strategies support fault tolerance and elastic scheduling.
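The two strategies differ in where gradients are aggregated. The toy sketch below (not ElasticDL's implementation; names are illustrative) shows the contrast: a parameter server averages gradients at a central node, while AllReduce leaves every worker holding the same averaged result without a central server.

```python
# Toy gradient aggregation across workers; each element of `grads` is
# one worker's local gradient vector.

def parameter_server(grads):
    """Workers push gradients to a central server, which averages them."""
    return [sum(g) / len(grads) for g in zip(*grads)]

def allreduce(grads):
    """A collective operation: every worker ends up with the same
    averaged gradient, with no central server (simulated here)."""
    avg = [sum(g) / len(grads) for g in zip(*grads)]
    return [list(avg) for _ in grads]  # each worker's local copy
```

Usage: with two workers reporting gradients `[1.0, 2.0]` and `[3.0, 4.0]`, both strategies yield the average `[2.0, 3.0]`; the difference is whether that result lives on a server or on every worker.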

Session Format

Dual Presentation: 35 minutes, 2 speakers presenting on a topic