GPU Cluster

Info about the DTU Compute GPU Cluster


Credit

Based on the original cheat sheet by:

Rasmus R. Paulsen (rapa@dtu.dk)
Thilde Marie Haspang (tmahas@dtu.dk)
Nicki Skafte Detlefsen (nsde@dtu.dk)

Cluster Machines

Hostname           | a.k.a.  | # of GPUs | GPU type   | GPU RAM | GPU Architecture | Host CPU                               | Host RAM | /scratch
mnemosyne          | titan01 | 7         | Titan V    | 12 GB   | Volta            | 2 x Xeon E5-2620 v4, 8c/16t - 2.1 GHz  | 256 GB   | ~1 TB
theia              | titan03 | 8         | Titan X    | 12 GB   | Maxwell          | 2 x Xeon E5-2620 v4, 8c/16t - 2.1 GHz  | 256 GB   | ~1 TB
phoebe             | titan04 | 8         | GTX 1080ti | 11 GB   | Pascal           | 2 x Xeon E5-2620 v4, 8c/16t - 2.1 GHz  | 256 GB   | ~1 TB
themis (reserved)  | titan06 | 8         | RTX 2080ti | 11 GB   | Turing           | 2 x Xeon E5-2620 v4, 8c/16t - 2.1 GHz  | 256 GB   | ~1 TB
oceanus            | titan07 | 7         | Titan Xp   | 12 GB   | Pascal           | 2 x Xeon E5-2620 v4, 8c/16t - 2.1 GHz  | 256 GB   | ~1 TB
hyperion           | titan08 | 8         | Titan X    | 12 GB   | Pascal           | 2 x Xeon E5-2620 v4, 8c/16t - 2.1 GHz  | 256 GB   | ~1 TB
cronus             | titan10 | 8         | Titan X    | 12 GB   | Maxwell          | 2 x Xeon E5-2620 v4, 8c/16t - 2.1 GHz  | 256 GB   | ~1 TB
crius              | titan11 | 8         | Titan X    | 12 GB   | Maxwell          | 2 x Xeon E5-2620 v4, 8c/16t - 2.1 GHz  | 256 GB   | ~1 TB
iapetus            | titan12 | 8         | GTX 1080ti | 11 GB   | Pascal           | 2 x Xeon E5-2620 v4, 8c/16t - 2.1 GHz  | 256 GB   | ~1 TB

Machines marked (reserved) are not generally available.

Overview

The following is an explanation of basic usage and of where to store data.

Cluster usage

Currently the cluster works more like a group of workstations with remote access, so some manual discovery is needed by the user, e.g. of available resources. On a server/node, GPUs are selected with export CUDA_VISIBLE_DEVICES=... (see below), and a conda environment with the libraries required by your code should be active. Please note that the CUDA toolkit is often included as a dependency when installing ML/DL frameworks, in the version the framework is compiled against. In that case no further action is needed, and no CUDA module needs to be loaded.
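
A minimal session on a node might look like this (myenv and GPU 0 are just examples; find a free GPU with gpustat first, as described below):

conda activate myenv
export CUDA_VISIBLE_DEVICES=0    # make only GPU 0 visible to your program
python yourscript.py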

Access to a cluster machine

A user needs to be whitelisted for the GPU Cluster to be able to log in to the cluster machines. This is currently handled by DTU Compute IT Support.
Feel free to contact Ejner Fergo directly if you have been authorized to join and have general questions about the cluster.

Local Access

Within DTU Compute the cluster machines can be reached directly using PuTTY or a similar SSH tool.
To connect from Windows, use the Fully Qualified Domain Name:
hostname.compute.dtu.dk
From Linux/macOS just the hostname is fine, e.g.:
ssh themis or ssh titan06

Remote Access

When not on the DTU Compute network, use your SSH client to connect to user@thinlinc.compute.dtu.dk, and from there use SSH again to log in to one of the above cluster machines.
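
If your local SSH client is OpenSSH (version 7.3 or newer), the hop through thinlinc can be done in one command with the -J (jump host) flag; themis here is just an example destination:

ssh -J user@thinlinc.compute.dtu.dk user@themis.compute.dtu.dk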

An alternative is to set up a VPN connection.


Filesystem

/home

The user's $HOME (/home/user) is shared between the cluster machines, so configuration files and software such as anaconda environments and git repos only need to be set up once. Please note that this shared home-dir is specific to the GPU cluster and is not related to the DTU Compute unix-home, so some files may still need to be copied over.
By default there is a 100 GB quota per $HOME, so please don't keep overly large data here; make use of the directories below instead.
Also be aware that the quota is by default limited to 100,000 files. Anaconda environments use a lot of files, so this limit will most likely be hit before your disk space runs out.
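
To see how close you are to either limit, disk usage and file count of your home can be checked with standard tools:

du -sh ~                  # total disk usage of your $HOME
find ~ -type f | wc -l    # number of files, to compare against the 100,000-file limit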

/scratch

The scratch disk is a local 1 TB SSD on each server, meant as a location for datasets and other transient data.
Data older than 45 days is deleted automatically, unless it is modified in the meantime (e.g. using the touch command), for users who need to work on it longer than that. To touch all files in a directory, use this command:

find /scratch/mydir -type f -exec touch {} +

Even though automatic cleanup is set up, users should ideally keep track of their files and delete unnecessary data when they are done using it.

/nobackup

Shared network storage.
Avoid loading datasets from here; instead, copy large data to /scratch.

/dtu-compute

Shared network storage.
Avoid loading datasets from here; instead, copy large data to /scratch.

/project

Shared network storage.
Avoid loading datasets from here; instead, copy large data to /scratch.

Helpful commands

module

Use this command to load specific library versions, primarily CUDA.

efer@themis:~$ module avail
---------------------------------------------------------------- /opt/cuda/modules ----------------------------------------------------------------
CUDA/7.5  CUDA/9.0  CUDA/10.0  CUDA/10.2  CUDNN/5.0  CUDNN/7.4  CUDNN/7.6  CUPTI     NCCL/2.7  
CUDA/8.0  CUDA/9.2  CUDA/10.1  CUDA/11.1  CUDNN/7.1  CUDNN/7.5  CUDNN/8.0  NCCL/2.4

To use CUDNN, load CUDA first:

efer@themis:~$ module load CUDA/10.2 CUDNN/7.6

Please NOTE: This is only necessary when compiling CUDA software yourself, not when running CUDA software in a conda environment.
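
As an example of when loading modules is needed: compiling a small CUDA program by hand. This is only a sketch; test.cu stands in for your own source file:

efer@themis:~$ module load CUDA/10.2
efer@themis:~$ nvcc -o test test.cu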

conda

This command is part of the Anaconda Distribution installed on the cluster, and is used to manage your software/Python environment.
Since users don't install and maintain their own Anaconda Distribution, it is important to work inside conda-created environments.
After logging in on a server/node for the first time, conda needs to be initialized for the user, which is done with this command:

conda init

This will add some lines to your ~/.bashrc (check that they don't conflict with an old setup). Creating an environment called "myjob" using Python 3.7, activating it and then using pip is done with these commands:

efer@themis:~$ conda create -n myjob python=3.7
efer@themis:~$ conda activate myjob
(myjob) efer@themis:~$ pip install something

Please pay attention to python=3.7 in the command above. It is necessary to specify a Python version for your environment; if you want the latest default version from Anaconda 3 (which as of this writing is 3.8.x), specify just python. By specifying this, a Python environment will be set up in your home, and pip install ... will install to this local site-packages instead of trying to write to /opt, where it would fail.
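
As mentioned under Cluster usage, ML/DL frameworks typically pull in a matching CUDA toolkit as a dependency. As an illustration (check the framework's own install instructions for the exact channels and version spec), installing PyTorch with a bundled CUDA toolkit could look like:

(myjob) efer@themis:~$ conda install pytorch torchvision cudatoolkit=10.2 -c pytorch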

gpustat

Find free GPUs with:

efer@themis:~$ gpustat
themis                  Tue Nov 10 14:13:05 2020  450.80.02
[0] GeForce RTX 2080 Ti | 23'C,   0 % |     1 / 11019 MB |
[1] GeForce RTX 2080 Ti | 25'C,   0 % |     1 / 11019 MB |
[2] GeForce RTX 2080 Ti | 26'C,   0 % |     1 / 11019 MB |
[3] GeForce RTX 2080 Ti | 25'C,   0 % |     1 / 11019 MB |
[4] GeForce RTX 2080 Ti | 23'C,   0 % |     1 / 11019 MB |
[5] GeForce RTX 2080 Ti | 24'C,   0 % |     1 / 11019 MB |
[6] GeForce RTX 2080 Ti | 21'C,   0 % |     1 / 11019 MB |
[7] GeForce RTX 2080 Ti | 24'C,   0 % |     1 / 11019 MB |

The number to the left is the GPU ID used when assigning GPUs with CUDA_VISIBLE_DEVICES.

screen / tmux

These programs are terminal multiplexers, which means you can have access to as many shells (called "windows") as you'd like, using only a single shell. This is especially useful over an SSH login.
Another great thing about terminal multiplexers is that you can detach and reattach a running session. This way you can set up and run multiple jobs, detach the session and log out, and later log in and reattach the session to check status, etc.

tmux is a more modern version of screen, and shares most commands and functionality with it. By default the main difference is the modifier key (from now on called <mod>) which is used to control the program:

Screen: Ctrl + a
Tmux:   Ctrl + b

Also, tmux makes it obvious that you are running in a session, while screen by default does not.
For long-time screen users, this default tmux <mod> combo can of course be changed, or just keep using screen. Following are a few basic commands:

Create Window <mod> + c
Next Window <mod> + n
Previous Window <mod> + p
Detach Session <mod> + d

The command to (re)attach a session can be expressed in a few ways, here shown with abbreviated flags:

Screen: screen -r
Tmux:   tmux a

screen & tmux are powerful tools that can do much more than what is shown above, and familiarizing yourself with one of them can help your workflow a lot.
Check the links to read more about screen and tmux.
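
A typical detach/reattach workflow with a named tmux session could look like this (myjob is just an example name):

tmux new -s myjob        # create a session called myjob and start your job in it
# detach with <mod> + d, log out, log back in later...
tmux attach -t myjob     # reattach to check on the job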

df / du

Disk Free - prints out disk space usage per mount point. Typically run like df -h (-h = human readable). Can also list a single mountpoint:

efer@hyperion:~$ df -h /scratch
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1       216G   60M  205G   1% /scratch

Disk Usage - prints file/folder size. Typically run like du -sh (-h = human readable, -s = summarize). Works with wildcard (*):

efer@hyperion:~$ du -sh /scratch/efer/*
133M	/scratch/efer/data1.zip
22G	/scratch/efer/data2

top / htop

Process monitor & machine load/status
htop is more "visual" than classic top.

git

To get a DTU Compute GIT account follow the instructions found here.

git clone git@lab.compute.dtu.dk:repo_owner/project_repo

In the above line, repo_owner refers to your DTU login name, or the user who has given you access to a repository.

man / help

A useful command to learn more about a command, or to quickly look up which flags to use, is man. Many *nix command-line tools have a manual page associated with them:

man nvidia-smi

If man feels like "too much information", or a command doesn't have a manual page, an alternative that is available on all respectable tools is the option --help (also abbreviated as -h, if that flag isn't used by another option).

df --help
gpustat -h

Examples & Howtos

Selecting which GPUs to use

First use either the gpustat or nvidia-smi command to find a free GPU. Scripts need to be started with, for example:

CUDA_VISIBLE_DEVICES=NUMBER python yourscriptornotebook.py

where NUMBER is for example 2 to run on GPU number 2, or 1,2 to run on GPUs 1 and 2. This is to avoid running on, and occupying, all GPUs.
This can also be incorporated into the script you want to run by using the os package in Python, adding the following two lines:

import os
os.environ["CUDA_VISIBLE_DEVICES"]="NUMBER"

before importing your favorite deep learning framework. "NUMBER" is for example 2 to run on GPU number 2, or 0,2 to run on GPUs 0 and 2. Then the script/notebook is executed as normal:

(myenv) user@host:~$ python yourscript_or_notebook.py
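
To verify that the restriction took effect, you can ask the framework how many devices it sees. Assuming PyTorch is the framework in your environment, a quick check is:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,2"  # example: expose GPUs 0 and 2

import torch
print(torch.cuda.device_count())  # prints 2; inside the program they are renumbered 0 and 1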

Transferring Data

Windows

On a Windows PC you can assign a drive letter to a shared drive by starting a command prompt and issuing the command:

net use v: \\comp-nas3.compute.dtu.dk\nobackup\user

The drive letter does not have to be v:
If Windows asks for a username & password, you may have to prepend WIN\ to the username, like so: WIN\user

WinSCP is also an option.

Linux & macOS

To transfer files from a Linux or Mac PC directly to a specific cluster machine, you can use scp:

scp -r my_data_dir user@tethys:/path/to/place/

The -r flag means "recursive", so it makes sense when copying a directory with content rather than a single file, though you can leave it on in any case.
Specifying user@ before the hostname is only necessary if your local user differs from the DTU/cluster user.
The example uses an absolute path to copy to, but if you leave out the path and only write tethys:, it means the root of your $HOME.

If the data is multiple gigabytes, it is better to use rsync, as it can resume copying rather than start over if the transfer is interrupted.

rsync -av my_data_dir user@tethys:/path/to/place/

The syntax is much like scp's. Here -a means "archive", which is a combination of other flags, and -v means "verbose", so you can see what is currently being copied.
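
For large single files it can also help to keep partially transferred files around and show per-file progress, so an interrupted transfer resumes where it left off; these are standard rsync flags:

rsync -av --partial --progress my_data_dir user@tethys:/path/to/place/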

Using scp and rsync to copy to a machine directly, as shown in the examples above, makes sense when the receiving directory is local, such as /scratch.
Data copied to a shared drive is of course available on all machines mounting that drive.

Alternatively on Linux you can use sshfs to mount a remote directory on your machine:

sshfs remotehost:directory relative/dir/

Here a directory located in $HOME on remotehost is mounted on a sub-dir relative to where you currently are (probably /home/user/; note that neither path has a leading /, which is what makes them relative).
You can now use a graphical filemanager to work with the files if you wish.
To unmount:

fusermount -u relative/dir

Q & A

GPUs running but fans @ 0%

Yes, they are running, but not at 100%, so they don't get hot enough to start the fans. It looks like the fans start spinning at around 60°C. With nvidia-smi you can see in the GPU-Util column that the GPUs use around 20% of the processing capacity. This happens when there is some other bottleneck in the program, e.g. input/output, so the program spends more time writing to disk than actually running things on the GPUs.

"broken pipe" message

A broken pipe happens when you lose the connection to the server for a while. When that happens your SSH session closes, and since the Python program is a child process of the SSH session, it closes as well. This is what screen is for: the screen process is detached from your SSH session and keeps running if you lose the connection, and you can log back in and reattach your screen session afterwards.

Scary ssh WARNING

When connecting to a machine via SSH for the first time, you enter 'yes' to confirm your intent, and the hostname + IP + key of the machine are saved in ~/.ssh/known_hosts. If that machine gets reinstalled, the hostname + IP stay the same but its key is different, and that is what the "REMOTE HOST IDENTIFICATION HAS CHANGED" warning message is about.
It is very easy to fix: simply open ~/.ssh/known_hosts in your favorite editor, find the hostname, and delete that line. Alternatively this command also works:

ssh-keygen -R hostname

Note: This is OK to do here, on these machines with planned maintenance. If you see this warning when connecting to machines you manage yourself, or third-party systems such as VPSs, you are right to be suspicious.