Metadata
- Author:: Eric Sigler
- Full Title:: Scaling Kubernetes to 7,500 Nodes
- Category:: 🗞️Articles
- URL:: https://openai.com/research/scaling-kubernetes-to-7500-nodes
- Finished date:: 2023-03-17
Highlights
A large machine learning job spans many nodes and runs most efficiently when it has access to all of the hardware resources on each node. This allows GPUs to cross-communicate directly using NVLink, or GPUs to communicate directly with the NIC using GPUDirect. So for many of our workloads, a single pod occupies the entire node. NUMA, CPU, and PCIe resource contention aren't factors for scheduling, and bin-packing and fragmentation are not common problems. Our current clusters have full bisection bandwidth, so we also don't make any rack or network topology considerations. All of this means that, while we have many nodes, there's relatively low strain on the scheduler.
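The one-pod-per-node pattern described above can be sketched as a pod spec whose resource limits claim every GPU on the node, so no other GPU workload can be co-scheduled there. The GPU count, image, and names below are illustrative assumptions, not details from the article:

```yaml
# Illustrative sketch (not OpenAI's actual manifest): a pod that requests
# all 8 GPUs on a node via the NVIDIA device plugin resource name, so the
# scheduler effectively hands the pod the whole machine.
apiVersion: v1
kind: Pod
metadata:
  name: training-worker   # hypothetical name
spec:
  containers:
    - name: trainer
      image: example.com/training-image:latest  # hypothetical image
      resources:
        limits:
          nvidia.com/gpu: 8  # assumes 8 GPUs per node; claims them all
```

Because extended resources like `nvidia.com/gpu` are integer-counted and not overcommittable, requesting the node's full GPU count is enough to exclude other GPU pods, which is one way the contention and bin-packing concerns above drop out.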