Metadata
- Author:: Eric Sigler
- Full Title:: Scaling Kubernetes to 7,500 Nodes
- Category:: 🗞️Articles
- URL:: https://openai.com/research/scaling-kubernetes-to-7500-nodes
- Finished date:: 2023-03-17
Highlights
A large machine learning job spans many nodes and runs most efficiently when it has access to all of the hardware resources on each node. This allows GPUs to cross-communicate directly using NVLink, or GPUs to communicate directly with the NIC using GPUDirect. So for many of our workloads, a single pod occupies the entire node. NUMA, CPU, and PCIe resource contention aren't factors for scheduling, and bin-packing and fragmentation are not common problems. Our current clusters have full bisection bandwidth, so we also don't make any rack or network topology considerations. All of this means that, while we have many nodes, there's relatively low strain on the scheduler.
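The one-pod-per-node pattern described above can be sketched as a pod spec whose resource limits claim every GPU on the node, so no other GPU workload can be co-scheduled there. The GPU count, image, and names below are illustrative assumptions, not details from the article:

```yaml
# Illustrative sketch (not OpenAI's actual manifest): a pod that requests
# all 8 GPUs on a node via the NVIDIA device plugin resource name, so the
# scheduler effectively hands the pod the whole machine.
apiVersion: v1
kind: Pod
metadata:
  name: training-worker   # hypothetical name
spec:
  containers:
    - name: trainer
      image: example.com/training-image:latest  # hypothetical image
      resources:
        limits:
          nvidia.com/gpu: 8  # assumes 8 GPUs per node; claims them all
```

Because extended resources like `nvidia.com/gpu` are integer-counted and not overcommittable, requesting the node's full GPU count is enough to exclude other GPU pods, which is one way the contention and bin-packing concerns above drop out.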