Overall Satisfaction with Kubernetes
Kubernetes is currently used at LinkedIn as an experimental product for building and managing Machine Learning (ML) pipelines. Only a few teams use it so far, mainly to access GPU clusters. With its robust CLI, Kubernetes makes it simple to deploy training and monitoring workloads on clusters. The learning curve is gentle because it is mainly driven by config files.
- Complex cluster management can be done with simple commands, backed by strong authentication and authorization schemes
- Exhaustive documentation and an open community smooth the learning process
- As a user, you can go a long way with just a few concepts such as pod, deployment, and service
- We had several problems with the NFS storage in our setup, which was responsible for syncing the code across the cluster
- On several occasions pods went into the UNKNOWN state, in which case restarting the entire node was the only solution
- As a user of the existing setup given to me, I wasn't able to allocate only some of the CPU cores on a single host. It was either all or none, which made cluster utilization sub-optimal
- It enabled us to move faster with our experimental ML pipeline
- Because the setup was experimental, we faced several hiccups during deployments, such as pods going into the UNKNOWN state and demanding immediate attention
- Though it had rough edges, NFS was quite useful for distributed Machine Learning training. It made development very simple
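
For context on the CPU allocation issue mentioned above: stock Kubernetes lets a container request only part of a host's CPU through `resources.requests` and `resources.limits` in the pod spec. Whether the setup described in this review exposed that option is not known, so the following is only a minimal sketch of what such a config could look like. The helper `pod_manifest`, the pod name, the image, and the memory value are all hypothetical.

```python
import json


def pod_manifest(name, image, cpu_cores, memory="4Gi"):
    """Build a Pod manifest that asks for only `cpu_cores` CPUs on a host.

    Hypothetical helper for illustration; not part of any Kubernetes library.
    """
    cpu = str(cpu_cores)  # Kubernetes accepts CPU quantities such as "4" or "500m"
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    "resources": {
                        # requests: what the scheduler reserves for the container
                        "requests": {"cpu": cpu, "memory": memory},
                        # limits: the ceiling the container may not exceed
                        "limits": {"cpu": cpu, "memory": memory},
                    },
                }
            ],
        },
    }


# Hypothetical name and image, assumed only for this example.
manifest = pod_manifest("trainer", "example.com/ml/trainer:latest", cpu_cores=4)
print(json.dumps(manifest, indent=2))
```

Saving the printed JSON to a file and passing it to `kubectl apply -f` would submit the pod, since kubectl accepts JSON manifests as well as YAML.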