An H100 GPU Datacenter — A Simple Guide to Topology and Bandwidth
Cold Open
It's 2 a.m. Your training job, which has been running for 7 days, is using 32 servers, each with 8 H100 GPUs. As each batch is processed and the step progresses, the GPUs inside one server swap activations and gradients over NVLink through a chip called NVSwitch. None of this traffic goes through the CPUs, because the transfers are direct memory access (DMA) between GPUs. The same thing is happening on every other server. These servers share and combine results over InfiniBand links running at 400 gigabits per second each. With a few links per server, the whole system can move hundreds of gigabytes per second between servers when things are set up well.
If training is slow, half a dozen reasons come to mind. From a network standpoint, the culprit is usually small messages, unlucky settings, or contention, not the raw link speed.
Prerequisites - important terms we will use later
- Node: one server with 8 H100 GPUs. This is the basic unit you schedule.
- NVLink: fast "roads" that let GPUs inside a server talk to each other directly, without going through the CPU. Much faster than PCIe.
- NVSwitch: a switch chip that connects all NVLink roads so any GPU can reach any other GPU inside the server.
- Collectives / reductions: group operations (like all‑reduce) where many GPUs send and combine data. These are old, mature concepts from parallel programming; libraries like NCCL implement and optimize them on GPUs. (See the sketch right after this list.)
- InfiniBand (IB): the fast network between servers. In modern clusters you'll see NDR 400 Gb/s per port. Look for NET/IB in logs.
- Leaf–spine network: a common topology where servers plug into leaf switches and leaves connect to spine switches, approximating an any-to-any network. More on this below.
- Bisection bandwidth: split the cluster into two halves and add up the bandwidth of every link crossing the cut; the worst-case total is the bisection bandwidth. It tells you how much traffic can flow between the two halves at the same time.
- Line rate: the top speed of a link under perfect conditions, 400 Gb/s here. You will almost never hit it, but it is the yardstick for judging how well you are doing.
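To make "collective" concrete, here is a minimal all-reduce sketch using PyTorch's torch.distributed with the NCCL backend. The script name and the torchrun launch line are assumptions; any standard single-node launch works.

```python
# Minimal all-reduce sketch.
# Assumed launch: torchrun --nproc_per_node=8 allreduce_demo.py
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR/PORT for us.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each GPU contributes a tensor; all-reduce sums them in place on every rank.
    x = torch.full((1024, 1024), float(dist.get_rank()), device="cuda")
    dist.all_reduce(x, op=dist.ReduceOp.SUM)

    # Every rank now holds the same result: 0 + 1 + ... + (world_size - 1).
    print(f"rank {dist.get_rank()}: x[0,0] = {x[0,0].item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Inside one node, NCCL will route this over NVLink/NVSwitch; across nodes it falls back to the InfiniBand fabric discussed later.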
1) What is a "node" here?
When people say a datacenter is "full of H100s," they usually mean racks of 8‑GPU DGX/HGX servers. Each node includes:
- 8× H100 GPUs connected by NVLink through NVSwitch
- Several 400 Gb/s network ports (InfiniBand for compute, Ethernet for management/services)
- CPUs and PCIe Gen5 for I/O and storage
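A quick way to sanity-check that topology from Python (assuming PyTorch with CUDA is installed): on an NVSwitch-based node, every GPU pair should report peer access. The command-line equivalent is nvidia-smi topo -m.

```python
import torch

n = torch.cuda.device_count()  # expect 8 on an HGX/DGX H100 node
print(f"visible GPUs: {n}")

# On an NVSwitch system, every GPU should be able to reach every other GPU directly.
for i in range(n):
    peers = [j for j in range(n) if j != i and torch.cuda.can_device_access_peer(i, j)]
    print(f"GPU {i} has peer access to: {peers}")
```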
2) Inside the node: NVLink/NVSwitch (paper vs. practice)

Typical H100 chassis with 8 GPUs in it. The GPUs sit deep inside the chassis, mounted on a metal plate that is cooled by water flowing over it. The black pipes circulate the water, and fans behind the chassis push out air. Eight such chassis are coupled together to create 64 GPUs.
Theoretical numbers given in roadshows and flyers:
- Per‑GPU NVLink capacity: up to ~900 GB/s inside the node
- 8‑GPU bisection (via NVSwitch): about ~3.6 TB/s across the node
- Switch help for reductions: about ~450 GB/s for group combine operations
In practice, our observations are slightly different
These are the speeds you'll actually see when running benchmarks or real workloads, not the theoretical maximums from well-tuned tests and marketing flyers:
Inside One Node Performance (What You Actually See)
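One way to get a feel for real numbers on your own box is a crude micro-benchmark: time plain device-to-device copies between two GPUs. This is a sketch under assumed defaults, not a proper benchmark; a single naive copy stream will land well below the 900 GB/s headline figure, which is part of the point.

```python
import time
import torch

def p2p_bandwidth_GBps(src=0, dst=1, size_mb=256, iters=20):
    """Rough device-to-device copy bandwidth between two GPUs, in GB/s."""
    nbytes = size_mb * 1024 * 1024
    a = torch.empty(nbytes, dtype=torch.uint8, device=f"cuda:{src}")
    b = torch.empty(nbytes, dtype=torch.uint8, device=f"cuda:{dst}")

    for _ in range(3):          # warm-up so we don't time lazy initialization
        b.copy_(a)
    torch.cuda.synchronize(src)
    torch.cuda.synchronize(dst)

    t0 = time.perf_counter()
    for _ in range(iters):
        b.copy_(a)
    torch.cuda.synchronize(src)
    torch.cuda.synchronize(dst)
    t1 = time.perf_counter()

    return nbytes * iters / (t1 - t0) / 1e9

if torch.cuda.device_count() >= 2:
    print(f"GPU0 -> GPU1 copy: ~{p2p_bandwidth_GBps():.1f} GB/s")
```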
Takeaway: if you are operating within one node (8 GPUs), bandwidth is hardly ever the concern. There are bigger fish to fry, like latency and message size. Overlapping compute with data transfer also helps a lot; this is where pinned memory transfers, concurrent CUDA streams, and friends come into the picture (a minimal sketch follows; we will dive deeper in a separate post).
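Here is a minimal sketch of that overlap idea, assuming PyTorch: a pinned host buffer is copied to the GPU on a side stream while a matmul runs on the default stream. Production code needs more careful stream and memory ordering than this.

```python
import torch

device = torch.device("cuda:0")
copy_stream = torch.cuda.Stream(device)           # side stream used only for transfers

# Pinned (page-locked) host memory allows truly asynchronous host-to-device copies.
host_batch = torch.randn(64, 3, 224, 224).pin_memory()
gpu_batch = torch.empty_like(host_batch, device=device)

weight = torch.randn(4096, 4096, device=device)
activations = torch.randn(4096, 4096, device=device)

# Launch the copy on the side stream while the default stream keeps computing.
with torch.cuda.stream(copy_stream):
    gpu_batch.copy_(host_batch, non_blocking=True)

out = activations @ weight                        # runs on the default stream, overlaps with the copy

# Before touching gpu_batch on the default stream, wait for the copy to finish.
torch.cuda.current_stream(device).wait_stream(copy_stream)
print(out.shape, gpu_batch.device)
```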
3) Between nodes: InfiniBand for compute, Ethernet for everything else
Most H100 clusters use InfiniBand NDR 400 for fast compute traffic and Ethernet for management, logging, and storage gateways. Goal: any server can reach any other at close to full speed.
- Per‑port peak: 400 Gb/s ≈ 50 GB/s ≈ 46.6 GiB/s
- Per‑node aggregate: many 8‑GPU servers expose up to 8× 400 Gb/s. That's ~3.2 Tb/s ≈ 400 GB/s if fully cabled and used.
Ethernet is still used for coordination: the master node handing out and managing work for the worker nodes, logging, and so on.
What does a single 400G link buy you in practice? Imagine you're training a large language model and need to exchange gradient updates between servers. The back-of-envelope sketch below gives a rough sense of what such a setup can achieve:
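This is a hedged, made-up calculation: the model size, node count, and achievable efficiency are assumptions for illustration, and the ring all-reduce cost model (each rank moves roughly 2 × (N−1)/N of the payload) is the standard textbook one, not a measurement of any particular cluster.

```python
# Back-of-envelope: how long does one gradient all-reduce take between nodes?
# All numbers below are assumptions for illustration, not measurements.

params = 7e9                    # 7B-parameter model (assumed)
bytes_per_grad = 2              # fp16/bf16 gradients
payload_gb = params * bytes_per_grad / 1e9       # ~14 GB of gradients

nodes = 32
links_per_node = 8              # 8 x 400 Gb/s NDR ports per node
link_GBps = 400 / 8             # 400 Gb/s = 50 GB/s per port at line rate
efficiency = 0.7                # assume ~70% of line rate is achievable

node_bw_GBps = links_per_node * link_GBps * efficiency

# Ring all-reduce moves roughly 2 * (N-1)/N * payload through each node's links.
traffic_gb = 2 * (nodes - 1) / nodes * payload_gb
seconds = traffic_gb / node_bw_GBps

print(f"payload: {payload_gb:.0f} GB, per-node traffic: {traffic_gb:.0f} GB")
print(f"assumed per-node bandwidth: {node_bw_GBps:.0f} GB/s")
print(f"rough all-reduce time: {seconds * 1000:.0f} ms per step")
```

On these assumptions a single gradient all-reduce lands in the ~100 ms range, which is why people work so hard to overlap it with the backward pass.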
"One‑to‑all" without a cable mess
Large clusters are not wired as a full mesh. They use leaf–spine networks so traffic can flow any‑to‑any at high speed with just 1–2 switch hops. To you, it feels like every node can talk to every other directly.
4) How it looks
Logical any‑to‑any (small 4‑node example)
In a logical view, every node can communicate directly with every other node:
Node 1 ←→ Node 2
Node 1 ←→ Node 3
Node 1 ←→ Node 4
Node 2 ←→ Node 3
Node 2 ←→ Node 4
Node 3 ←→ Node 4
Typical physical fabric (leaf–spine)
The actual physical implementation uses a leaf-spine topology:
      Spine 1           Spine 2
        |    \           /    |
        |     \         /     |
        |      \       /      |
        |       \     /       |
        |        \   /        |
        |         \ /         |
        |          X          |
        |         / \         |
        |        /   \        |
        |       /     \       |
      Leaf A --'       '-- Leaf B
        |                     |
     DGX #1-2             DGX #3-4

(each leaf switch has an uplink to every spine switch; each DGX node plugs into one leaf)
Practitioner view (logical diagram)
Here is the logical picture many ML folks use. It's not the exact cabling, but it shows how the system feels during training: fast paths inside a node via NVLink/NVSwitch, and fast paths between nodes via InfiniBand.
Logical any‑to‑any: NVLink/NVSwitch inside the node; InfiniBand across nodes. The runtime chooses the routes; you see uniform reachability.
This design provides any-to-any connectivity with just 1-2 switch hops while minimizing cable complexity.
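To see why the cabling stays manageable, here is a rough sizing sketch under assumed numbers (64-port switches, 32 nodes with 8 ports each, a non-blocking design); none of these figures describe a specific product.

```python
# Rough leaf-spine sizing for a non-blocking fabric (illustrative assumptions only).
nodes = 32
ports_per_node = 8          # 8 x 400G per DGX/HGX node
switch_radix = 64           # 64-port switches assumed

node_ports = nodes * ports_per_node                 # 256 endpoint ports

down_per_leaf = switch_radix // 2                   # half the ports go down to nodes...
up_per_leaf = switch_radix - down_per_leaf          # ...half go up to spines (non-blocking)

leaves = -(-node_ports // down_per_leaf)            # ceil division -> 8 leaf switches
spines = -(-(leaves * up_per_leaf) // switch_radix) # enough spine ports for all uplinks -> 4 spines

print(f"{nodes} nodes x {ports_per_node} ports = {node_ports} endpoint ports")
print(f"leaf switches: {leaves}, spine switches: {spines}")
print("any two nodes reach each other through at most one spine")
```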
5) Units and quick conversions
Keep straight what's bits and what's bytes, and what's decimal vs. binary prefixes. I mix these up often enough that I force myself to write the units out explicitly.
- Gb/s (Gbps) — gigabits per second (decimal). 1 Gb/s = 10^9 bits/s
- GB/s — gigabytes per second (decimal). 1 GB/s = 10^9 bytes/s
- Gib/s — gibibits per second (binary). 1 Gib/s = 2^30 bits/s
- GiB/s — gibibytes per second (binary). 1 GiB/s = 2^30 bytes/s
To summarize:
- 400 Gb/s ≈ 50 GB/s ≈ 46.6 GiB/s
- Gbps → GB/s: divide by 8
- GB/s → GiB/s: divide by 1.073741824
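A couple of throwaway helpers if you'd rather not redo the arithmetic by hand:

```python
# Tiny helpers for the conversions above (decimal Gb/s and GB/s, binary GiB/s).
def gbps_to_GBps(gbps: float) -> float:
    """Gigabits per second -> gigabytes per second (decimal)."""
    return gbps / 8

def GBps_to_GiBps(gb_per_s: float) -> float:
    """Decimal gigabytes per second -> binary gibibytes per second."""
    return gb_per_s * 1e9 / 2**30

line_rate = 400                                          # Gb/s, one NDR port
print(gbps_to_GBps(line_rate))                           # 50.0 GB/s
print(round(GBps_to_GiBps(gbps_to_GBps(line_rate)), 1))  # ~46.6 GiB/s
```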
6) How this looks on AWS, GCP, Azure
The concepts above assume you're running on dedicated hardware (like CoreWeave, Lambda Labs, or on-premises). Here's how the major cloud providers handle H100 clusters and what changes:
AWS (Amazon EC2 P5 instances)
- p5.48xlarge: 8× H100 GPUs per instance, same NVLink/NVSwitch setup as above
- EFA (Elastic Fabric Adapter): AWS's OS-bypass network interface, which fills the role InfiniBand plays elsewhere and hides the low-level fabric details from you
- Placement groups: a way to ask AWS to put your instances physically close together so inter-node traffic stays fast
- AWS also offers custom setups with optimized configs when you lease a large cluster
Takeaway: AWS abstracts away a lot of the low-level networking and mixes in its own scheduling and routing machinery. I am not sure about pricing. You get performance in the same ballpark as a bare-metal cluster, at the cost of tight control over the network.
Google Cloud Platform (A3 instances)
- a3-highgpu-8g: 8× H100 GPUs, Google's custom interconnect
- GPUDirect-TCPX: Google's optimized networking stack for GPU communication
- TPU ecosystem: tightly integrated with Google's TPU tooling and infrastructure
Takeaway: Competitive with dedicated setups, with Google handling most network optimization automatically.
Something I find completely fascinating about Google's ecosystem is that their models are written in JAX, which is not as popular as PyTorch; trained on TPUs, which have far less public literature than NVIDIA GPUs; and run over an in-house interconnect that is not elaborately documented. The result is models that come from a slightly different world, with trade secrets that only the folks working on them fully understand. Not sure if that is a good thing or not. Just my opinion!
Microsoft Azure (ND H100 v5)
- Standard_ND96isr_H100_v5: 8× H100 GPUs per VM
- InfiniBand HDR/NDR: Similar to dedicated hardware
- CycleCloud: Tools for managing large-scale training clusters
Takeaway: the most similar to a standard datacenter cluster, with more knobs and dials to control network behavior.
When to Choose What
Use cloud providers when:
- You need to scale up/down frequently
- You don't want to manage infrastructure
- You're experimenting or doing research, where jobs are intermittent and can tolerate being stopped or preempted
- Your training jobs are seasonal (or dependent on paper deadlines :D)
Use dedicated hardware when:
- You're training continuously
- You need maximum performance and control
- You have predictable, long-term workloads
- You need a custom setup shared across many teams
7) Conclusion
We covered the topology of GPU clusters, a few diagrams, and the basics of NVLink and InfiniBand. We will do a deep dive on the numbers in a future post. I am also interested in writing a post on storage and data-access patterns in such a cluster. Stay tuned for that as well!