CIRA

I have recently transitioned to a research scientist role at the Cooperative Institute for Research in the Atmosphere (CIRA), located at Colorado State University. Every new position comes with new HPC resources, and this page is a place to document the ones at CIRA.

Server List

There is no central node that talks to the other computers at CIRA, but there are several compute servers available for folks at CIRA to use.

Name	Primary Use	# of CPU Threads	# of GPUs	GPU Type(s)	GPU RAM per Card
shannon	ML Core	32	2/4	RTX6000/RTXA6000	24GB/46GB
dorian	TC Group	76	4	RTX6000	24GB
overcast1	OVERCAST	64	2	RTXA6000	46GB
cloudy1	OVERCAST	112	4	RTXA6000	46GB
cloudy2	OVERCAST	112	4	RTXA6000	46GB
locust	OVERCAST/ML Core	72	1	GH200	100GB
cicada	OVERCAST/ML Core	72	1	GH200	100GB

GH200 How To

GH200s are a pre-production hardware model from NVIDIA. Because of this, the standard way of installing PyTorch doesn't work. To get PyTorch to talk to the GPU, we have to use something called an NVIDIA toolkit, which provides precompiled images of PyTorch. To run this NVIDIA toolkit, we need to use Docker. Don’t worry, with the help of ChatGPT I was able to get this going easily. Do the following:

Create Dockerfile

Make a Dockerfile that you will build. The file has several lines in it; here is an example for my diffusion environment:

# Start from a specific PyTorch image from the NVIDIA toolkits
FROM nvcr.io/nvidia/pytorch:24.04-py3

# Install PyTorch-based libraries diffusers and transformers, plus plotting and training utilities
RUN pip install "diffusers[torch]" transformers matplotlib tensorboard accelerate

# Expose the desired port (XXXX in this case)
EXPOSE XXXX

# Create a Jupyter configuration file with the specified port
RUN jupyter lab --generate-config && \
    echo "c.ServerApp.port = XXXX" >> ~/.jupyter/jupyter_lab_config.py && \
    echo "c.ServerApp.ip = '0.0.0.0'" >> ~/.jupyter/jupyter_lab_config.py && \
    echo "c.ServerApp.open_browser = False" >> ~/.jupyter/jupyter_lab_config.py && \
    echo "c.ServerApp.allow_root = True" >> ~/.jupyter/jupyter_lab_config.py

WORKDIR /workspace

CMD ["bash"]

If you don’t use Jupyter Lab, you can delete the EXPOSE line and the RUN jupyter line.

If you want to use TensorFlow, I was able to get the GPU to work with the following Dockerfile. Fill in XXXX with the port you use to tunnel into the machine:

# Start from a specific TensorFlow image from the NVIDIA toolkits
FROM nvcr.io/nvidia/tensorflow:24.04-tf2-py3

# Install plotting and monitoring libraries
RUN pip install matplotlib tensorboard

# Expose the desired port (XXXX in this case)
EXPOSE XXXX

# Create a Jupyter configuration file with the specified port
RUN jupyter lab --generate-config && \
    echo "c.ServerApp.port = XXXX" >> ~/.jupyter/jupyter_lab_config.py && \
    echo "c.ServerApp.ip = '0.0.0.0'" >> ~/.jupyter/jupyter_lab_config.py && \
    echo "c.ServerApp.open_browser = False" >> ~/.jupyter/jupyter_lab_config.py && \
    echo "c.ServerApp.allow_root = True" >> ~/.jupyter/jupyter_lab_config.py

WORKDIR /workspace

CMD ["bash"]

Name your file Dockerfile. This is important because docker build looks for a file with exactly that name by default.
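
Both Dockerfiles above use XXXX as a stand-in for your port. If you would rather not edit the file by hand, the substitution can be scripted. The following is a minimal Python sketch, not part of the original workflow: the function name fill_port, the template file name Dockerfile.template, and the example port 8890 are all hypothetical, and it assumes XXXX appears only where the port belongs.

```python
from pathlib import Path


def fill_port(dockerfile_text, port):
    """Replace every XXXX placeholder with the given port number."""
    port = int(port)
    if not 1 <= port <= 65535:
        raise ValueError(f"port must be between 1 and 65535, got {port}")
    return dockerfile_text.replace("XXXX", str(port))


if __name__ == "__main__":
    # Hypothetical file names: keep the XXXX version as a template and
    # write the filled-in copy to the file that docker build will read.
    template = Path("Dockerfile.template")
    if template.exists():
        # 8890 is only an example; use the port you tunnel with.
        Path("Dockerfile").write_text(fill_port(template.read_text(), 8890))
```

Keeping the placeholder version as a separate template means the same file can be reused whenever you change ports.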

Compile Dockerfile

Go ahead and build your new Docker image:

$ docker build -t my_custom_pytorch_image .

This will take a few minutes the first time because it needs to download the PyTorch base image and install the extra packages.

Launch Screen

Before you start a long training job, you will want to launch a screen session so that the job keeps running if you get disconnected. This is a helpful tip in general for Linux HPC systems.

$ screen -S training

This launches a session named training. If you get disconnected, you can reconnect to it with:

$ screen -r training

Run Docker

You are now ready to run your Docker image, so go ahead and call it:

$ docker run --gpus all -it --rm -v /mnt/data1:/mnt/data1/ my_custom_pytorch_image

You will want to mount your data directory into the container. That is what the -v /mnt/data1:/mnt/data1/ option does: the format is source_dir_on_the_host_machine:where_you_want_the_dir_in_the_container. For this example I just map it to the same directory path.
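
If you end up mounting several directories, or switching between the plain and Jupyter variants of the call, it can help to assemble the command programmatically. The sketch below is only an illustration: build_docker_run is a hypothetical helper (not part of Docker), and it uses only the flags that already appear on this page.

```python
import shlex


def build_docker_run(image, mounts=None, port=None, gpus="all"):
    """Assemble a `docker run` command as a list of arguments.

    mounts maps host directories to directories inside the container.
    port, if given, is forwarded to the same port number in the container.
    """
    cmd = ["docker", "run", "--gpus", gpus, "-it", "--rm"]
    if port is not None:
        cmd += ["-p", f"{port}:{port}"]
    for host_dir, container_dir in (mounts or {}).items():
        cmd += ["-v", f"{host_dir}:{container_dir}"]
    cmd.append(image)
    return cmd


if __name__ == "__main__":
    cmd = build_docker_run("my_custom_pytorch_image", {"/mnt/data1": "/mnt/data1"})
    # Print the command so it can be inspected or pasted into the shell.
    print(shlex.join(cmd))
```

Printing the command rather than executing it keeps the helper safe to experiment with; paste the output into your screen session once it looks right.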

Launch training

You should now be able to run your PyTorch code inside the container. To check, you can launch a quick Python session:

$ python
>>> import torch
>>> torch.cuda.is_available()

It should say True. Exit the Python session with exit() and run your own Python code. Here is an example of me launching my diffusion training:

$ accelerate launch train_diffusion_model.py

Then, to leave the screen session while the job keeps running in the background, press Ctrl+a and then d. This will ‘detach’ the screen so you can check nvidia-smi or run tensorboard (to monitor progress).
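
As a non-interactive alternative to the quick Python session above, the GPU check can be wrapped in a small script and run before each training job. This is a minimal sketch: gpu_status is a hypothetical helper name, and the script falls back to a message instead of crashing when PyTorch is missing, so it can also be run outside the container.

```python
def gpu_status():
    """Return a short description of what PyTorch can see."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment"
    if not torch.cuda.is_available():
        return "PyTorch is installed but no GPU is visible"
    # List every visible device by name.
    names = [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())]
    return "GPU(s) visible: " + ", ".join(names)


if __name__ == "__main__":
    print(gpu_status())
```

If the script reports that no GPU is visible inside the container, one thing to check is whether the docker run call included --gpus all.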

Optional, if you want Jupyter

If you like using Jupyter, just add a port forward to your docker run call (make sure it's the same port you tunneled with):

$ docker run --gpus all -it --rm -p XXXX:XXXX -v /mnt/data1:/mnt/data1 my_custom_pytorch_image

Now, inside the Docker session, run Jupyter Lab with your port:

$ jupyter lab --port=XXXX

It will output something like this:

http://hostname:XXXX/?token=LETTERandNUMBERS

Copy just the token, then go to your browser and type in localhost:XXXX. Insert the token where it asks for it.

Then you should be good to go.
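
If copying the token by hand gets tedious, it can be pulled out of the printed URL with the Python standard library. This is a small sketch under the assumption that the URL has the ?token=... form shown above; extract_token is a hypothetical helper name, and the URL in the example is made up.

```python
from urllib.parse import parse_qs, urlparse


def extract_token(url):
    """Return the value of the token query parameter, or None if absent."""
    query = parse_qs(urlparse(url).query)
    tokens = query.get("token")
    return tokens[0] if tokens else None


if __name__ == "__main__":
    # Made-up URL in the same shape as the Jupyter output.
    print(extract_token("http://hostname:8888/?token=abc123"))
```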