KubeCon Proposal

QI JUN edited this page Feb 17, 2020 · 6 revisions

Session Details

Session Title (75 character max)

ElasticDL: a Kubernetes-native Deep Learning Framework

Topic

Kubernetes and Machine Learning

Session Description (900 character max)

Tom will introduce ElasticDL, a Kubernetes-native deep learning framework built on top of TensorFlow 2.0. Thanks to its Kubernetes-native design, ElasticDL supports fault tolerance and works with the priority-based preemption of Kubernetes to achieve elastic scheduling of deep learning jobs.

Benefits to the Ecosystem (1500 character max)

Industrial companies and research labs often run deep learning jobs in an exclusive GPU cluster managed by MPI. This setup has drawbacks. Suppose that a cluster has N GPUs, and a job is using one of them. A new job requesting N GPUs would have to wait for the first job to complete before starting. With elastic scheduling, the new job could start running immediately with N-1 GPUs, and Kubernetes could increase its GPU allocation by 1 after the first job completes. In this case, the overall utilization is 100%.
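
The arithmetic in the example above can be checked with a small sketch. It is only an illustration: the function names and the cluster size N = 8 are hypothetical and are not part of ElasticDL or Kubernetes.

```python
def utilization(total_gpus: int, gpus_in_use: int) -> float:
    """Fraction of the cluster's GPUs that are busy."""
    return gpus_in_use / total_gpus


def gang_allocation(free_gpus: int, requested: int) -> int:
    """All-or-nothing: the job starts only if every requested GPU is free."""
    return requested if requested <= free_gpus else 0


def elastic_allocation(free_gpus: int, requested: int) -> int:
    """Elastic: the job starts with whatever is free, up to its request."""
    return min(free_gpus, requested)


N = 8                    # cluster size, chosen only for illustration
first_job = 1            # the first job holds one GPU
free = N - first_job

gang = gang_allocation(free, requested=N)        # new job must wait
elastic = elastic_allocation(free, requested=N)  # new job starts on N-1 GPUs

print(utilization(N, first_job + gang))     # 0.125
print(utilization(N, first_job + elastic))  # 1.0
```

With all-or-nothing allocation, only the first job's GPU is busy while the new job waits. With elastic allocation, every GPU is in use from the start.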

In addition, in large-scale training, a job is likely to fail because of hardware failure or because the job scheduling mechanism of a multi-tenant cluster preempts it. For example, a deep learning training job could be preempted by a high-priority data processing job. ElasticDL supports fault tolerance so that the training job continues rather than crashing. Fault tolerance lets ElasticDL work with the priority-based preemption of Kubernetes to achieve elastic scheduling. When Kubernetes kills some processes of a job to free resources for incoming jobs with higher priority, the current job doesn't fail but continues with fewer resources.

ElasticDL bridges one of the most popular deep learning frameworks, TensorFlow, with Kubernetes. It brings significant value to both the deep learning community and the cloud-native community.

Session Format

Dual Presentation: 35 minutes, 2 speakers presenting on a topic
