Scaling distributed training with AWS Trainium and Amazon EKS
AWS Machine Learning Blog
FEBRUARY 1, 2023
TorchX has two important dependencies: the Volcano batch scheduler and the etcd server. Volcano handles the scheduling and queuing of training jobs, while the etcd server is a key-value store used by TorchElastic for synchronization and peer discovery during job startup.
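To make the etcd role concrete, here is a minimal sketch of how a worker launch command might point TorchElastic at an etcd rendezvous endpoint. It only assembles a `torchrun` argument list and runs nothing. The helper name `build_torchrun_command`, the endpoint `etcd-server:2379`, the job ID, the script name, and the node and process counts are all illustrative assumptions, not values taken from the article. In a TorchX deployment these arguments would typically be generated for you rather than written by hand.

```python
import shlex


def build_torchrun_command(script, nnodes, nproc_per_node, job_id,
                           etcd_endpoint="etcd-server:2379"):
    """Assemble a torchrun command line that uses an etcd rendezvous.

    Hypothetical helper for illustration only. The default endpoint is an
    assumed in-cluster service name and port, not one confirmed by the article.
    """
    if nnodes < 1 or nproc_per_node < 1:
        raise ValueError("nnodes and nproc_per_node must be positive")
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={nproc_per_node}",
        # Every worker contacts the same etcd endpoint to discover its peers.
        "--rdzv_backend=etcd-v2",
        f"--rdzv_endpoint={etcd_endpoint}",
        # Workers that share an rdzv_id join the same training job.
        f"--rdzv_id={job_id}",
        script,
    ]


# Placeholder values: two nodes with two worker processes each.
cmd = build_torchrun_command("train.py", nnodes=2, nproc_per_node=2,
                             job_id="demo-job")
print(shlex.join(cmd))
```

Because every worker receives the same endpoint and job ID, the workers can find one another through the key-value store at startup, while Volcano decides when and where the pods that run this command are scheduled.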