Infrastructure Economics

The True Cost of a 10,000 GPU Cluster

A complete financial breakdown of building and operating a hyperscale AI training facility capable of training frontier models.

Key Assumptions

10,000 NVIDIA H100 GPUs
1,250 HGX Compute Nodes
18 MW Power Capacity
1.2 Power Usage Effectiveness
$0.07 Per kWh (blended)
3-Year Depreciation Cycle

Where Every Dollar Goes

A $732 million capital investment broken down into every component.
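The headline figure is simply the roll-up of the five category totals detailed below. A minimal Python sketch that reproduces the total and the percentage shares; the values are this breakdown's category totals, rounded to the nearest $1M:

# CAPEX roll-up behind the $732M headline (all values in $M)
capex = {
    "Compute Hardware": 400,
    "Facility Construction": 270,
    "Networking Fabric": 45,
    "Site Infrastructure": 12,
    "Storage": 5,
}

total = sum(capex.values())
print(f"Total CAPEX: ${total}M")                          # -> $732M
for name, cost in capex.items():
    print(f"  {name:<22} ${cost:>3}M  ({cost / total:.1%})")
# -> 54.6%, 36.9%, 6.1%, 1.6%, 0.7% -- the shares quoted per category below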

Detailed Breakdown by Category

Compute Hardware

$400M
54.6%
GPU Complex (8× H100 SXM5/node) $275,000,000

1,250 nodes × $220k avg. Includes NVLink switch fabric on baseboard. HBM3 memory is supply-constrained.

OEM Integration & Markup $50,000,000

Dell, Supermicro, HPE markup for assembly, testing, warranty. ~$40k per node.

Network Interface Cards $37,500,000

10× ConnectX-7 (400G) per node: 8 compute + 2 management NICs @ ~$3k avg (~$30k/node).

CPUs, Memory & Local Storage $37,500,000

Dual Xeon/EPYC + 2TB DDR5 + 30TB Gen5 NVMe per node. $30k combined.
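The $400M compute line is easiest to sanity-check per node. A rough sketch using the per-node figures above; the line-item split is this article's, and the prices are rounded averages rather than vendor quotes:

# Per-node bill of materials, scaled to 1,250 HGX nodes
NODES = 1_250

node_cost = {
    "8x H100 SXM5 + NVLink baseboard": 220_000,
    "OEM integration, test, warranty":  40_000,
    "10x ConnectX-7 400G NICs":         30_000,   # 8 compute + 2 management @ ~$3k avg
    "CPUs, 2 TB DDR5, 30 TB NVMe":      30_000,
}

per_node = sum(node_cost.values())
print(f"Per node: ${per_node:,}")             # -> $320,000
print(f"Cluster:  ${per_node * NODES:,}")     # -> $400,000,000 of compute hardware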

Facility Construction

$270M
36.9%

AI-optimized construction runs $15-20M per MW, putting the facility at roughly 70% of the silicon spend. The facility, not the hardware, is now the bottleneck.

Shell & MEP Systems $200,000,000

18MW data hall. Reinforced floors (3,000+ lb racks), 40ft ceilings, specialized piping for CDUs.

Direct-to-Chip Liquid Cooling $40,000,000

CDUs, manifolds, cold plates. Captures 70-80% of heat. Enables 40-100kW per rack.

Electrical Distribution $20,000,000

Busways, PDUs, panels. Medium voltage distribution throughout facility.

Security & Access $10,000,000

Cameras, biometrics, mantraps. Physical security for billion-dollar assets.
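Two ratios make the facility number concrete: cost per megawatt, and facility spend relative to silicon spend. A quick check, assuming the full 18 MW capacity and the category totals above:

# Facility cost per MW and facility-to-silicon ratio
facility_capex = 270e6       # shell/MEP + liquid cooling + electrical + security
compute_capex = 400e6
power_capacity_mw = 18

print(f"Facility $/MW: ${facility_capex / power_capacity_mw / 1e6:.1f}M")  # -> $15.0M per MW
print(f"Facility vs silicon: {facility_capex / compute_capex:.0%}")        # -> 68%, i.e. roughly 70%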

Networking Fabric

$45M
6.1%

The network is part of the computer. NVLink supplies 900 GB/s GPU-to-GPU inside a node; between nodes, a non-blocking 400G InfiniBand fabric is effectively mandatory.

Quantum-2 IB Switches $25,000,000

600-800 QM9700 switches @ $35k avg. 3-tier fat-tree topology for full bisection bandwidth.

Cabling & Optics $20,000,000

20,000+ links. OSFP transceivers ($500-1000 each), AOC/DAC cables. Often underestimated.
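The 600-800 switch count falls out of fat-tree arithmetic: a full three-tier fat-tree of radix k uses 5k^2/4 switches for k^3/4 endpoints, i.e. about five switch ports per endpoint port, so the count is roughly 5N/k. A back-of-envelope sketch; the radix is the QM9700's 64 ports, and the rest is approximation rather than a cabling plan:

import math

# Rough switch count for a non-blocking 3-tier fat-tree
RADIX = 64             # 400G ports per Quantum-2 QM9700
ENDPOINTS = 10_000     # one 400G fabric port per GPU

switches = math.ceil(5 * ENDPOINTS / RADIX)
print(f"~{switches} switches")                          # -> ~782, inside the 600-800 range
print(f"~${switches * 35_000 / 1e6:.1f}M @ $35k avg")   # -> ~$27.4M, near the $25M line item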

Site Infrastructure

$12M
1.6%

Power availability—not land—is the constraint. Substation queue times dictate project timelines.

50 MVA Substation $8,000,000

Dedicated 69kV→12.47kV. 18-36 month utility queue for >10MW loads. The real project bottleneck.

Land Acquisition $2,000,000

20 acres of industrial land in Phoenix (~$100k/acre). Near power and fiber routes, zoned industrial.

Fiber Connectivity $2,000,000

Diverse paths to two carrier hotels. 400G uplinks. Meet-me room build.

Storage

$5M
0.7%

Storage is optimized for speed (TB/s), not capacity (PB). If checkpointing is slow, GPUs sit idle.

10 PB NVMe Parallel FS $4,000,000

Lustre/GPFS @ ~$400k/PB all-in. QLC flash with software-managed endurance. Checkpoints 10 TB in ~30 seconds.

Metadata & Caching Tier $1,000,000

High-IOPS SSD tier for small files. Critical for dataset loading and model weights.
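The "10 TB in ~30 seconds" claim is a bandwidth requirement in disguise, and the $/PB follows from the line items above. A minimal sketch; the checkpoint size is illustrative rather than a measured model state:

# Implied write bandwidth for fast checkpointing, plus $/PB
checkpoint_tb = 10          # weights + optimizer state, illustrative
target_seconds = 30

print(f"Aggregate write bandwidth: ~{checkpoint_tb * 1_000 / target_seconds:.0f} GB/s")  # -> ~333 GB/s
print(f"Parallel FS cost: ${4_000_000 / 10:,.0f}/PB")                                    # -> $400,000/PB all-in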

Hidden CAPEX Multipliers

Not in $732M
+15-25%
Cost of Capital $6,000,000+

50% deposit ($200M) locked for 6mo during chip lead time. At 6% cost of capital = $6M lost.

Pre-Payment Requirements $200,000,000

30-50% deposit required on hardware orders. Significant working capital drag.

Rapid Depreciation $133M/year

3-year straight-line on the $400M of compute (not the traditional 5 years). Blackwell eclipses Hopper in 2025-26, crushing resale values.

Backup Power $4,000,000

10× Cat 3516 diesel generators (2MW each) @ $400k. N+1 redundancy required.
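Two of the items above are simple time-value and straight-line calculations. A sketch, assuming a 6% annual cost of capital, a six-month lead time, and depreciation applied to the compute hardware only:

# Carrying cost of the hardware deposit and the annual depreciation charge
deposit = 0.50 * 400e6        # 50% down on the $400M compute order
lead_time_years = 0.5         # ~6-month chip lead time
cost_of_capital = 0.06

carry = deposit * cost_of_capital * lead_time_years
print(f"Deposit locked up:    ${deposit / 1e6:.0f}M")        # -> $200M
print(f"Carrying cost:        ${carry / 1e6:.0f}M")          # -> $6M over the lead time

depreciation = 400e6 / 3      # 3-year straight line on compute hardware
print(f"Annual depreciation:  ${depreciation / 1e6:.0f}M")   # -> ~$133M/year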

Annual Operating Expenses

The recurring burn rate. Software licensing exceeds electricity by roughly 3.5×, the largest OpEx surprise.

Software Licensing
$37,500,000
NVIDIA AI Enterprise @ $4,500/GPU/year × 10,000 (with 17% volume discount). The "silent killer" of AI economics.
Electricity
$10,500,000
17 MW continuous × 8,760 hrs × $0.07/kWh blended (SRP E-65/E-66 tariff). Interruptible rates can save 10-20%.
Personnel
$5,000,000
25-30 FTEs: 24/7 NOC, critical facilities, security, IT ops. Fully loaded cost.
Other Ops
$2,000,000
H100 failure rate <1%/year (~100 GPUs). Maintenance, insurance, property tax, water.
Total Annual OpEx ~$55,000,000
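A sketch that reproduces the OpEx table, including the licensing discount and the blended electricity rate; the 17% discount and 17 MW continuous draw are the figures quoted above, not vendor or utility terms:

# Annual OpEx roll-up
gpus = 10_000
licensing = gpus * 4_500 * (1 - 0.17)       # NVIDIA AI Enterprise, 17% volume discount -> ~$37.3M
electricity = 17_000 * 8_760 * 0.07         # 17 MW continuous at $0.07/kWh blended     -> ~$10.4M
personnel = 5_000_000
other_ops = 2_000_000

total = licensing + electricity + personnel + other_ops
print(f"Total OpEx: ${total / 1e6:.1f}M/year")                      # -> ~$54.8M, i.e. the ~$55M total
print(f"Licensing vs electricity: {licensing / electricity:.1f}x")  # -> ~3.6x, roughly the 3.5x noted above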