Fault Tolerant Llama training
JUNE 23, 2025
Cluster Setup
Crusoe graciously lent us a cluster of 300 L40S GPUs, split across 30 hosts with 10 NVIDIA L40S GPUs each. torchft can have many hosts in each replica group, but for this cluster a single host (10 GPUs) per replica group gave the best performance due to limited network bandwidth.
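To make the layout concrete, here is a minimal sketch of the arithmetic, assuming one replica group per host as described above. It is plain Python and does not use the torchft API. The helper name `replica_group_layout` and the contiguous global GPU numbering are illustrative assumptions, not something taken from the actual training setup.

```python
# Illustrative only: the host and GPU counts come from the text above;
# the helper name and the global GPU numbering are hypothetical.
NUM_HOSTS = 30
GPUS_PER_HOST = 10


def replica_group_layout(num_hosts: int, gpus_per_host: int) -> dict[int, list[int]]:
    """Map each replica group id (one per host) to the global GPU indices it owns."""
    return {
        host: list(range(host * gpus_per_host, (host + 1) * gpus_per_host))
        for host in range(num_hosts)
    }


layout = replica_group_layout(NUM_HOSTS, GPUS_PER_HOST)
total_gpus = sum(len(gpus) for gpus in layout.values())
print(f"{len(layout)} replica groups, {total_gpus} GPUs in total")
```

With this grouping, the traffic inside a replica group stays on a single host, and only the traffic between replica groups has to cross the bandwidth-limited network.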