h. Prepare Training Data

For the model training part of the workshop, we will use the Tiny ImageNet dataset, which contains 100,000 training images across 200 classes. We can download the data directly to the FSx for Lustre (FSxL) filesystem mounted on all the nodes in the cluster. We created the mount directory (/fsx) when creating the AMI in step e. The following command runs wget on the head node to download the data to the /fsx directory:

ray exec cluster.yaml 'wget http://cs231n.stanford.edu/tiny-imagenet-200.zip -P /fsx/'

Next, unzip the data file residing in the /fsx directory:

ray exec cluster.yaml 'unzip -d /fsx /fsx/tiny-imagenet-200.zip && rm /fsx/tiny-imagenet-200.zip'

We can check the contents of the data directory by running ls on the head node:

ray exec cluster.yaml 'ls /fsx/tiny-imagenet-200'

In our training code, we will use the ImageFolder class from torchvision (part of the PyTorch ecosystem) to ingest this dataset. ImageFolder expects the images of each class to be stored in a separate folder named after that class. The structure should look like this:

.
|-- train
|   |-- class1
|   |   |-- image1.jpeg
|   |   |-- image2.jpeg
|   |   |-- image3.jpeg
|   |   ...
|   |-- class2
|   |   |-- image1.jpeg
|   |   |-- image2.jpeg
|   |   |-- image3.jpeg
|   |   ...
|   ...
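
To check whether a directory follows this one-folder-per-class layout, a small standard-library sketch like the one below can help. It is not part of the workshop code: the function name count_images_per_class and the extension list are assumptions made for illustration, and it only approximates what ImageFolder accepts (it counts image files recursively under each class folder).

```python
import os

# Assumption: a small subset of the image extensions a loader might accept.
IMAGE_EXTENSIONS = (".jpeg", ".jpg", ".png")


def count_images_per_class(split_dir):
    """Return {class_name: image_count} for each immediate subfolder of split_dir.

    Files that sit directly in split_dir (not inside a class folder) are ignored,
    which is a quick way to spot a split that has not been arranged by class.
    """
    counts = {}
    with os.scandir(split_dir) as entries:
        class_dirs = sorted(e.path for e in entries if e.is_dir())
    for class_dir in class_dirs:
        total = 0
        # Walk recursively so nested folders inside a class folder are counted too.
        for _root, _dirs, files in os.walk(class_dir):
            total += sum(1 for name in files if name.lower().endswith(IMAGE_EXTENSIONS))
        counts[os.path.basename(class_dir)] = total
    return counts
```

For example, calling count_images_per_class('/fsx/tiny-imagenet-200/train') on the head node should return one entry per class, while a split that has not been arranged by class will show a single folder holding every image.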

The val folder in the Tiny ImageNet dataset does not have this structure, so we have to rearrange the images in the val directory. This can be done by running a simple Python script. Copy the following code into a file named data-prep.py:

import os
import ray

def main():
    ray.init(address="auto")

    root_dir = '/fsx/tiny-imagenet-200/val/'
    annotation_file = 'val_annotations.txt'
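    # Each line of val_annotations.txt is tab-separated: file name, class label,
    # then four numeric fields (bounding-box coordinates), for example:
    #   val_0.JPEG      n03444034       0       32      44      62
    #   val_1.JPEG      n04067472       52      55      57      59
    #   val_2.JPEG      n04070727       4       0       60      55
    with open(root_dir + annotation_file) as f:
        # splitlines() plus the filter drops the empty line at the end of the file
        lines = [line for line in f.read().splitlines() if line.strip()]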

    data = {}
    for line in lines:
        file, label = line.split('\t')[:2]
        data[file] = label

    # create the directories. labels are the directory names
    labels = set(data.values())
    for label in labels:
        os.makedirs(root_dir + label, exist_ok=True)  # no error if the directory already exists

    # move files from images folder to the new directories
    for file in data:
        src = root_dir + 'images/' + file
        dst = root_dir + data[file] + '/' + file  # root_dir already ends with '/'
        os.replace(src, dst)

    os.rmdir(root_dir + 'images')
    os.remove(root_dir + annotation_file)

if __name__ == "__main__":
    main()
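
Before running the script against /fsx, the same rearrangement logic can be rehearsed on a throwaway directory. The sketch below is an illustration rather than part of the workshop: rearrange_val, make_mock_val, and rehearse are hypothetical helper names, the mock file names and labels are made up, and Ray is left out because the file moves themselves do not need it.

```python
import os
import tempfile


def rearrange_val(root_dir, annotation_file="val_annotations.txt"):
    """Move root_dir/images/<file> to root_dir/<label>/<file> using the annotations."""
    with open(os.path.join(root_dir, annotation_file)) as f:
        rows = [line.split("\t") for line in f.read().splitlines() if line.strip()]
    data = {row[0]: row[1] for row in rows}  # file name -> class label

    for label in set(data.values()):
        os.makedirs(os.path.join(root_dir, label), exist_ok=True)
    for file, label in data.items():
        os.replace(os.path.join(root_dir, "images", file),
                   os.path.join(root_dir, label, file))

    os.rmdir(os.path.join(root_dir, "images"))  # succeeds only if every image was moved
    os.remove(os.path.join(root_dir, annotation_file))


def make_mock_val(root_dir):
    """Create a three-image stand-in for the val folder (names and labels are made up)."""
    os.makedirs(os.path.join(root_dir, "images"))
    rows = [("val_0.JPEG", "n001"), ("val_1.JPEG", "n002"), ("val_2.JPEG", "n001")]
    with open(os.path.join(root_dir, "val_annotations.txt"), "w") as f:
        for name, label in rows:
            f.write(f"{name}\t{label}\t0\t0\t1\t1\n")
            open(os.path.join(root_dir, "images", name), "wb").close()


def rehearse():
    """Run the rearrangement on a temporary mock folder and return the resulting layout."""
    with tempfile.TemporaryDirectory() as tmp:
        make_mock_val(tmp)
        rearrange_val(tmp)
        return {d: sorted(os.listdir(os.path.join(tmp, d))) for d in sorted(os.listdir(tmp))}
```

Calling rehearse() returns a mapping from each label folder to the files moved into it, which makes it easy to confirm that the images folder and the annotation file are gone and every image landed under its label.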

Finally, run this script on the Ray cluster:

ray submit cluster.yaml data-prep.py