Ray Clusters on Amazon EC2

Ray Cluster in Nutshell

Ray is a distributed computing platform that can be used to scale Python applications with minimal effort. It provides a unified way to scale Python and AI applications from a laptop to a cluster. It is designed to be general-purpose and it can run any kind of workloads. Ray consists of a core distributed runtime and a toolkit of libraries (Ray AIR) for simplifying ML compute.

ray-cluster-arch

What you will do in this part of the lab

In this workshop, you will learn how to set up a Ray cluster on Amazon EC2, and train a PyTorch model. The workshop includes the following steps:

  • Create IAM roles to be used by the head and worker nodes in the cluster
  • Set up security groups
  • Create an AMI to be used by head and worker nodes in the cluster
  • Set up Amazon FSx for Luster (FSxL) filesystem
  • Set up Amazon CloudWatch agent for resource monitoring
  • Spin up Ray cluster
  • Train PyTorch ResNet18 model on Tiny ImageNet dataset