Job Recommendation Service for GPU Sharing in Kubernetes
Cloud infrastructures encourage multi-tenancy of hardware resources, and user-defined Machine Learning (ML) training jobs are offloaded to the cloud for efficient training. State-of-the-art resource schedulers compromise user privacy by accessing sensitive metadata of the user-defined training workloads. We present the design of a fine-grained, online, privacy-preserving job scheduler built on top of the Kubernetes platform in combination with Argo Workflows. We characterize ML training workloads on standard benchmark architectures and datasets across sixty-six features, cluster them based on exploratory data analysis, and analyze inter- and intra-cluster task interference. We assume only black-box access to the user-defined ML training jobs and refrain from accessing sensitive metadata. We define three scheduler-level objectives that maximize gains from both the users' and the cloud provider's perspectives. Our scheduler promotes multi-tenancy by intelligently selecting competitor jobs for concurrent execution in each pod while abiding by these objectives.
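To make the feature-based clustering step concrete, the sketch below groups profiled jobs by their feature vectors with k-means. This is an illustrative assumption rather than the paper's implementation: the feature matrix is placeholder data, and the cluster count and choice of scikit-learn are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical per-job feature matrix: one row per profiled training job,
# one column per collected feature (sixty-six in the paper's setup),
# e.g., GPU utilization, memory footprint, batch size, iteration time.
job_features = np.random.rand(120, 66)  # placeholder data for illustration

# Normalize features so no single metric dominates the distance measure.
scaled = StandardScaler().fit_transform(job_features)

# Group jobs into workload categories; the cluster count here is illustrative.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = kmeans.fit_predict(scaled)

# Co-location candidates for a new job can then be drawn from clusters whose
# members exhibited low mutual interference during profiling.
print({c: int((labels == c).sum()) for c in range(5)})
```

A scheduler built along these lines would only consume such externally observable profiling features, keeping the training job itself a black box.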