Info about DTU Compute GPU Cluster<br />

<div style="background-color: #FFFF00; border-style: dotted;"><br />'''Below documentation is deprecated. Go to https://titans.compute.dtu.dk for current information.'''<br /><br /></div>
  
<br />
 
 
== Credit ==
 
 
Based on original cheat sheet by:
 
 
Rasmus R. Paulsen ([mailto:rapa@dtu.dk rapa@dtu.dk])<br />
 
Thilde Marie Haspang ([mailto:tmahas@dtu.dk tmahas@dtu.dk])<br />
 
Nicki Skafte Detlefsen ([mailto:nsde@dtu.dk nsde@dtu.dk])
 
 
== Cluster Machines ==
 
 
{| class="wikitable" width="50%" align="left"
 
! scope="row" style="text-align: left" | Hostname
 
! scope="row" style="text-align: left" | a.k.a.
 
! scope="row" style="text-align: left" | # of GPUs
 
! scope="row" style="text-align: left" | GPU type
 
! scope="row" style="text-align: left" | GPU RAM
 
! scope="row" style="text-align: left" | GPU Architecture
 
! scope="row" style="text-align: left" | Host CPU
 
! scope="row" style="text-align: left" | Host RAM
 
! scope="row" style="text-align: left" | /scratch
 
|-
 
| mnemosyne
 
| titan01
 
| align="center" | 7
 
| Titan V
 
| 12 GB
 
| align="center" | Volta
 
| 2 x Xeon E5-2620 v4<br />8c/16t - 2.1 GHz
 
| 256 GB
 
| ~1 TB
 
|-
 
| theia
 
| titan03
 
| align="center" | 8
 
| Titan X
 
| 12 GB
 
| align="center" | Maxwell
 
| 2 x Xeon E5-2620 v4<br />8c/16t - 2.1 GHz
 
| 256 GB
 
| ~1 TB
 
|-
 
| phoebe
 
| titan04
 
| align="center" | 8
 
| GTX 1080ti
 
| 11 GB
 
| align="center" | Pascal
 
| 2 x Xeon E5-2620 v4<br />8c/16t - 2.1 GHz
 
| 256 GB
 
| ~1 TB
 
|-
 
| oceanus
 
| titan07
 
| align="center" | 7
 
| Titan Xp
 
| 12 GB
 
| align="center" | Kepler
 
Pascal
 
| 2 x Xeon E5-2620 v4<br />8c/16t - 2.1 GHz
 
| 256 GB
 
| ~1 TB
 
|-
 
| hyperion
 
| titan08
 
| align="center" | 8
 
| Titan X
 
| 12 GB
 
| align="center" | Pascal
 
| 2 x Xeon E5-2620 v4<br />8c/16t - 2.1 GHz
 
| 256 GB
 
| ~1 TB
 
|-
 
| cronus
 
| titan10
 
| align="center" | 8
 
| Titan X
 
| 12 GB
 
| align="center" | Maxwell
 
| 2 x Xeon E5-2620 v4<br />8c/16t - 2.1 GHz
 
| 256 GB
 
| ~1 TB
 
|-
 
| crius
 
| titan11
 
| align="center" | 8
 
| Titan X
 
| 12 GB
 
| align="center" | Maxwell
 
| 2 x Xeon E5-2620 v4<br />8c/16t - 2.1 GHz
 
| 256 GB
 
| ~1 TB
 
|-
 
| iapetus
 
| titan12
 
| align="center" | 8
 
| GTX 1080ti
 
| 11 GB
 
| align="center" | Pascal
 
| 2 x Xeon E5-2620 v4<br />8c/16t - 2.1 GHz
 
| 256 GB
 
| ~1 TB
 
|}
 
<div style="clear:both;">
 
 
{| class="wikitable" width="50%" align="left"
 
| style="background: #ccc;" | Reserved
 
|}
 
<div style="clear:both;">
 
 
== Overview ==
 
 
The following is an explanation of basic usage and of where to store data.
 
 
=== Cluster usage ===
 
 
Currently the cluster works more like a group of workstations with remote access, so some manual discovery by the user is needed, e.g. of available resources. On a server/node, GPUs are selected with <code>export CUDA_VISIBLE_DEVICES=</code> (see [[#Selecting_which_GPUs_to_use|here]]), and a conda environment with the libraries required by your code should be active. Please note that the CUDA toolkit is often included as a dependency when installing ML/DL frameworks, in the version the framework is compiled against, so no further action is needed and no CUDA module has to be loaded.
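
As a minimal sketch of a typical session (assuming an environment named ''myjob'' already exists, GPU 0 is free, and ''train.py'' is your own script):
<pre style="background-color: #374048; color: white; border-style: none; padding: 5px; width: 75%;">
# check which GPUs are free, then restrict your job to one of them
gpustat
export CUDA_VISIBLE_DEVICES=0

# activate your conda environment and run your code
conda activate myjob
python train.py
</pre>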
 
 
=== Access to a cluster machine ===
 
 
A user needs to be whitelisted for the GPU Cluster to be able to log in to the cluster machines. This is
 
currently handled by [[About_ITS_@_DTU_Compute|DTU Compute IT Support]].<br />
 
Feel free to contact [mailto:efer@dtu.dk Ejner Fergo] directly if you have been authorized to join and have general questions about the cluster.
 
 
==== Local Access ====
 
 
Within DTU Compute the cluster machines can be reached directly using PuTTY or a similar SSH
 
tool.<br />
 
To connect from Windows use a Fully Qualified Domain Name:<br />
 
'''hostname.compute.dtu.dk'''<br />
 
From Linux/macOS using just the hostname is fine, e.g.:<br />
 
'''ssh themis''' or '''ssh titan06'''<br />
 
 
==== Remote Access ====
 
 
When not on the DTU Compute network, use your SSH client to access '''user@thinlinc.compute.dtu.dk''' and ''from there''
 
use SSH again to log in to one of the above cluster machines.<br />
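
If your local SSH client is OpenSSH, the ''ProxyJump'' option can do both hops in one command (a sketch; replace ''user'' and the target hostname with your own):
<pre style="background-color: #374048; color: white; border-style: none; padding: 5px; width: 75%;">
ssh -J user@thinlinc.compute.dtu.dk user@hyperion.compute.dtu.dk
</pre>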
 
 
An alternative is to set up a [[OpenVPN|VPN connection]].
 
 
 
=== Filesystem ===
 
 
==== /home ====
 
 
The user's <code>$HOME</code> (/home/user) is shared between the cluster machines, so configuration files and software such as '''anaconda environments''' and [[#git|''git'']] repos only need to be set up once. Please note that this shared home dir is specific to the GPU cluster and is not related to the DTU Compute ''unix-home'', so some files may still need to be copied over if necessary.<br />
 
By default there is a '''100 GB quota''' per <code>$HOME</code>, so please don't keep overly large data here, but make use of the following directories.<br />

Also be aware that the quota by default is limited to 100,000 files. Anaconda environments use a lot of files, so this limit will most likely be hit before your disk space runs out.
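
To get a rough idea of how much of the quota you are using, standard tools are enough (both commands can take a while on a large home directory):
<pre style="background-color: #374048; color: white; border-style: none; padding: 5px; width: 75%;">
# total size of your home directory
du -sh ~

# number of files, relevant for the 100,000-file limit
find ~ -type f | wc -l
</pre>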
 
 
==== /scratch ====
 
 
The scratch disk is a <u>'''local'''</u> 1TB SSD on each server, meant as a location for datasets and other transient data.<br />
 
Data older than 45 days will be deleted automatically, unless it is modified (e.g. using the ''touch'' command) in case users need to work on it longer than that. To ''touch'' all files in a directory, use this command:
 
<pre style="background-color: #374048; color: white; border-style: none; padding: 5px; width: 75%;">
 
find /scratch/mydir -type f -exec touch {} +
 
</pre>
 
Even though there is an "automatic cleanup" set up, users should ideally keep track of their files, and delete unnecessary data when they are done using it.
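
To see which of your files are getting close to the 45-day limit, a ''find'' invocation similar to the one above can list them, here for files not modified within the last 40 days:
<pre style="background-color: #374048; color: white; border-style: none; padding: 5px; width: 75%;">
find /scratch/mydir -type f -mtime +40
</pre>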
 
 
==== /nobackup ====
 
 
Shared network storage.<br />
 
Avoid loading datasets from here - instead copy large data to /scratch<br />
 
 
==== /dtu-compute ====
 
 
Shared network storage.<br />
 
Avoid loading datasets from here - instead copy large data to /scratch
 
 
==== /project ====
 
 
Shared network storage.<br />
 
Avoid loading datasets from here - instead copy large data to /scratch
 
 
== Helpful commands==
 
 
=== module ===
 
Use this command to load specific library versions, primarily CUDA.
 
<pre style="background-color: #374048; color: white; border-style: none; padding: 5px; width: 75%;">
 
efer@themis:~$ module avail
 
---------------------------------------------------------------- /opt/cuda/modules ----------------------------------------------------------------
 
CUDA/7.5  CUDA/9.0  CUDA/10.0  CUDA/10.2  CUDNN/5.0  CUDNN/7.4  CUDNN/7.6  CUPTI    NCCL/2.7 
 
CUDA/8.0  CUDA/9.2  CUDA/10.1  CUDA/11.1  CUDNN/7.1  CUDNN/7.5  CUDNN/8.0  NCCL/2.4
 
</pre>
 
 
To use CUDNN load CUDA first:
 
<pre style="background-color: #374048; color: white; border-style: none; padding: 5px; width: 75%;">
 
efer@themis:~$ module load CUDA/10.2 CUDNN/7.6
 
</pre>
 
 
'''Please NOTE:''' This is only necessary when ''compiling'' CUDA software yourself, not for running CUDA software in a conda environment.
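
If you do compile CUDA code yourself, you can verify which toolkit the loaded module provides (assuming the module puts ''nvcc'' on your PATH, as environment modules typically do):
<pre style="background-color: #374048; color: white; border-style: none; padding: 5px; width: 75%;">
efer@themis:~$ module load CUDA/10.2
efer@themis:~$ nvcc --version
</pre>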
 
 
=== conda ===
 
This command is part of the Anaconda Distribution installed on the cluster, and is used to manage your software/Python environment.<br />

Since users do not install and maintain their own Anaconda Distribution, it is important to work inside conda-created environments.<br />

After logging in on a server node for the first time, conda needs to be initialized for the user, which is done with this command:
 
<pre style="background-color: #374048; color: white; border-style: none; padding: 5px; width: 75%;">
 
conda init
 
</pre>
 
This will add some lines to your ''~/.bashrc'' (check whether they conflict with an old setup).

Creating an environment called "myjob" using Python 3.7, activating it and then using ''pip'' is done with these commands:
 
<pre style="background-color: #374048; color: white; border-style: none; padding: 5px; width: 75%;">
 
efer@themis:~$ conda create -n myjob python=3.7
 
efer@themis:~$ conda activate myjob
 
(myjob) efer@themis:~$ pip install something
 
</pre>
 
Please pay attention to <code>python=3.7</code> in the command above. It is necessary to specify a Python version for your environment, or, if you want to use the latest default version from Anaconda 3 (which as of this writing is <code>3.8.x</code>), just ''python''. By specifying this, a Python environment will be set up in your home, and <code>pip install ...</code> will install into this local site-packages instead of trying to write to /opt, where it would fail.
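
As a sketch of keeping environments reproducible and within the file quota, conda can export an environment definition and remove environments you no longer use (''myjob'' is the example environment from above):
<pre style="background-color: #374048; color: white; border-style: none; padding: 5px; width: 75%;">
# save the environment definition so it can be recreated later
conda env export -n myjob > myjob.yml

# remove an environment you no longer need (frees a lot of small files)
conda env remove -n myjob
</pre>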
 
 
=== gpustat ===
 
Find free GPUs with:
 
<pre style="background-color: #374048; color: white; border-style: none; padding: 5px; width: 75%;">
 
efer@themis:~$ gpustat
 
themis                  Tue Nov 10 14:13:05 2020  450.80.02
 
[0] GeForce RTX 2080 Ti | 23'C,  0 % |    1 / 11019 MB |
 
[1] GeForce RTX 2080 Ti | 25'C,  0 % |    1 / 11019 MB |
 
[2] GeForce RTX 2080 Ti | 26'C,  0 % |    1 / 11019 MB |
 
[3] GeForce RTX 2080 Ti | 25'C,  0 % |    1 / 11019 MB |
 
[4] GeForce RTX 2080 Ti | 23'C,  0 % |    1 / 11019 MB |
 
[5] GeForce RTX 2080 Ti | 24'C,  0 % |    1 / 11019 MB |
 
[6] GeForce RTX 2080 Ti | 21'C,  0 % |    1 / 11019 MB |
 
[7] GeForce RTX 2080 Ti | 24'C,  0 % |    1 / 11019 MB |
 
</pre>
 
The number to the left is the GPU ID used when assigning GPUs with <code>CUDA_VISIBLE_DEVICES</code>.
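
gpustat can also show who is using each GPU; the flags below are standard gpustat options (-c shows the command, -u the user, -p the PID), and wrapping it in ''watch'' refreshes the output every few seconds:
<pre style="background-color: #374048; color: white; border-style: none; padding: 5px; width: 75%;">
gpustat -cpu
watch -n 2 gpustat
</pre>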
 
 
=== screen / tmux ===
 
 
These programs are ''terminal multiplexers'', which means you can have access to as many shells (called "windows") as you'd like, using only a single shell. This is especially useful over an SSH login.<br />

Another great thing about terminal multiplexers is that you can detach and reattach a running session. This way you can set up and run multiple jobs, detach the session and log out, and later log in and reattach the session to check its status, etc.<br />

<br />

'''tmux''' is a more modern version of screen, and shares most commands and functionality with '''screen'''. By default the main difference is the '''''Modifier Key''''' (from now on called <mod>) which is used to control the program:
 
 
{| class="wikitable" width="25%"
 
! Screen
 
! Tmux
 
|-
 
| style="text-align: center" | Ctrl + a
 
| style="text-align: center" | Ctrl + b
 
|}
 
Also, '''tmux''' makes it obvious that you are running in a session, while '''screen''' by default does not.<br />

For long-time screen users, this default tmux <mod> combo can of course be changed, or you can just keep using screen. The following are a few basic commands:
 
{| class="wikitable" width="25%"
 
! Create Window
 
| style="text-align: center" | <mod> + c
 
|-
 
! Next Window
 
| style="text-align: center" | <mod> + n
 
|-
 
! Previous Window
 
| style="text-align: center" | <mod> + p
 
|-
 
! Detach Session
 
| style="text-align: center" | <mod> + d
 
|}
 
The command to (re)attach a session can be expressed in a few ways, here shown with abbreviated flags:
 
{| class="wikitable" width="25%"
 
! Screen
 
! Tmux
 
|-
 
| style="text-align: center; background-color: #374048; color: white;" | screen -r
 
| style="text-align: center; background-color: #374048; color: white;" | tmux a
 
|}
 
'''screen & tmux''' are powerful tools that can do much more than what is shown above, and familiarizing yourself with one of them can help your workflow a lot.<br />

Follow these links to read more about [https://www.gnu.org/software/screen/ screen] and [https://github.com/tmux/tmux/wiki tmux].
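
A typical workflow with named tmux sessions could look like this (''train'' is just an example session name):
<pre style="background-color: #374048; color: white; border-style: none; padding: 5px; width: 75%;">
tmux new -s train        # start a named session and launch your job inside it
                         # detach with Ctrl+b d (tmux default), then log out
tmux ls                  # later, after logging back in: list sessions
tmux attach -t train     # reattach the named session
</pre>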
 
 
=== df / du ===
 
 
Disk Free - prints out disk space usage per mount point. Typically run like '''df -h''' (-h = human readable). Can also list a single mountpoint:
 
<pre style="background-color: #374048; color: white; border-style: none; padding: 5px; width: 75%;">
 
efer@hyperion:~$ df -h /scratch
 
Filesystem      Size  Used Avail Use% Mounted on
 
/dev/sdb1      216G  60M  205G  1% /scratch
 
</pre>
 
Disk Usage - prints file/folder size. Typically run like '''du -sh''' (-h = human readable, -s = summarize). Works with wildcard (*):
 
<pre style="background-color: #374048; color: white; border-style: none; padding: 5px; width: 75%;">
 
efer@hyperion:~$ du -sh /scratch/efer/*
 
133M /scratch/efer/data1.zip
 
22G /scratch/efer/data2
 
</pre>
 
 
=== top / htop ===
 
 
Process monitor & machine load/status<br />
 
'''htop''' is more "visual" than classic '''top'''.
 
 
=== git ===
 
 
To get a DTU Compute GIT account follow the instructions found [[Git|here]].
 
<div style="background-color: #374048; color: white; border-style: none; padding: 5px; width: 75%;">
 
git clone git@lab.compute.dtu.dk:''repo_owner''/''project_repo''
 
</div>
 
In the above line, ''repo_owner'' refers to your DTU login name, or the user who has given you access to a repository.<br />
 
 
=== man / help ===
 
 
A useful command to learn more about a command or quickly look up which flags to use is '''man'''. Many *nix command-line tools have a ''manual page'' associated with them:
 
<pre style="background-color: #374048; color: white; border-style: none; padding: 5px; width: 75%;">
 
man nvidia-smi
 
</pre>
 
If '''man''' feels like "too much information", or a command doesn't have a manual page, an alternative that is available on all respectable tools is the option '''--help''' (also abbreviated as '''-h''', if that flag isn't used for another option).
 
<pre style="background-color: #374048; color: white; border-style: none; padding: 5px; width: 75%;">
 
df --help
 
gpustat -h
 
</pre>
 
 
== Examples & Howtos==
 
 
=== Selecting which GPUs to use ===
 
 
First use either the [[#gpustat|'''gpustat''']] or '''nvidia-smi''' command to find a free GPU. Scripts then need to be started with, for example:
 
<pre style="background-color: #374048; color: white; border-style: none; padding: 5px; width: 75%;">
 
CUDA_VISIBLE_DEVICES=NUMBER python yourscriptornotebook.py
 
</pre>
 
where NUMBER is for example '''2''' to run on GPU number 2, or '''1,2''' to run on GPUs 1 and 2. This is to avoid running on and occupying all GPUs.<br />
 
This can also be incorporated into the script you want to run by using the os package in Python, i.e. by adding the following two lines:
 
<pre style="background-color: #374048; color: white; border-style: none; padding: 5px; width: 75%;">
import os
os.environ["CUDA_VISIBLE_DEVICES"]="NUMBER"
</pre>
 
before importing your favorite deep learning framework. "NUMBER" is for example '''2''' to run on GPU number 2, or '''0,2''' to run on GPUs 0 and 2. Then the script/notebook is executed as normal:
 
<pre style="background-color: #374048; color: white; border-style: none; padding: 5px; width: 75%;">
 
(myenv) user@host:~$ python yourscript_or_notebook.py
 
</pre>
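
To check that the restriction works as intended, you can ask your framework how many devices it sees (this sketch assumes PyTorch is installed in the active environment):
<pre style="background-color: #374048; color: white; border-style: none; padding: 5px; width: 75%;">
(myenv) user@host:~$ CUDA_VISIBLE_DEVICES=0,2 python -c "import torch; print(torch.cuda.device_count())"
</pre>
With two GPUs made visible this should print 2; inside the process the visible GPUs are renumbered starting from 0.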
 
 
=== Transferring Data ===
 
 
==== Windows ====
 
 
On a Windows PC you can assign a drive letter to a shared drive by starting a command prompt and issuing the command:
 
<pre style="background-color: black; color: white; border-style: none; padding: 5px; width: 75%;">
 
net use v: \\comp-nas3.compute.dtu.dk\nobackup\user
 
</pre>
 
The drive letter does not have to be '''v:'''<br />
 
If Windows asks for a username & password, you may have to prepend '''WIN\''' to the username, like so: <code>WIN\user</code><br />
 
<br />
 
[https://winscp.net WinSCP] is also an option
 
 
==== Linux & macOS ====
 
 
To transfer files from a Linux or Mac PC directly to a specific cluster machine, you can use '''scp''':
 
<pre style="background-color: #374048; color: white; border-style: none; padding: 5px; width: 75%;">
 
scp -r my_data_dir user@tethys:/path/to/place/
 
</pre>
 
The '''-r''' flag means "recursive", so it makes sense when copying a directory with contents rather than a single file, though you can leave it on in any case.<br />

Specifying <code>user@</code> before the hostname is only necessary if your local user differs from the DTU/cluster user.<br />

The example uses an absolute path to copy to, but if you leave out the path and only write <code>tethys:</code> it means the root of your <code>$HOME</code>.<br />
 
<br />
 
If the data is multiple gigabytes, it is better to use '''rsync''', as it can resume copying rather than start over if the transfer is interrupted.
 
<pre style="background-color: #374048; color: white; border-style: none; padding: 5px; width: 75%;">
 
rsync -av my_data_dir user@tethys:/path/to/place/
 
</pre>
 
The syntax is much like with scp. Here '''-a''' means "archive" which is a combination of other flags, and '''-v''' means "verbose" so you can see what is currently being copied.<br />
 
<br />
 
Using '''scp''' and '''rsync''' to copy to a machine directly, as shown in the above examples, makes sense when the receiving directory is ''local'', such as '''/scratch'''.<br />
 
Data copied to a shared drive is of course available on all machines mounting that drive.<br />
 
<br />
 
Alternatively on Linux you can use '''sshfs''' to mount a remote directory on your machine:
 
<pre style="background-color: #374048; color: white; border-style: none; padding: 5px; width: 75%;">
 
sshfs remotehost:directory relative/dir/
 
</pre>
 
Here a directory located in <code>$HOME</code> on the remote host is mounted on a sub-dir relative to where you currently are (probably '''/home/user/''' - notice the leading '''/''').<br />
 
You can now use a graphical filemanager to work with the files if you wish.<br />
 
To unmount:
 
<pre style="background-color: #374048; color: white; border-style: none; padding: 5px; width: 75%;">
 
fusermount -u relative/dir
 
</pre>
 
 
== Q & A ==
 
 
=== GPUs running but fans @ 0% ===
 
 
Yes, they are running, but they are not running at 100%, so they don't get hot enough to start the fans. It looks like the fans start spinning at around 60 °C. With '''nvidia-smi''' you can see in the column ''GPU-Util'' that they use around 20% of the processing capacity. This happens because there is some other bottleneck in the program, e.g. input/output, so the program spends more time writing to disk than actually running stuff on the GPUs.
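
To watch utilization over time while your job runs, one option is to refresh '''nvidia-smi''' every few seconds:
<pre style="background-color: #374048; color: white; border-style: none; padding: 5px; width: 75%;">
watch -n 2 nvidia-smi
</pre>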
 
 
=== "broken pipe" message ===
 
 
A broken pipe happens when you lose the connection to the server for a while. When that happens your SSH session closes, and since the Python program is a child process of the SSH session it will close as well. This is what [[#screen_.2F_tmux|screen]] is for. The screen process is detached from your SSH session and will keep running if you lose the connection, and you can log back in and reattach your screen session afterwards.
 
 
=== Scary ssh WARNING ===
 
 
When connecting to a machine via ssh for the first time, you enter 'yes' to confirm your intent, and the hostname + IP + key of the machine is saved in '''~/.ssh/known_hosts'''. If that machine gets reinstalled, while the hostname + IP are the same, its key is different, and that is what the "REMOTE HOST IDENTIFICATION HAS CHANGED" warning message is about.<br />
 
It is very easy to fix: simply open '''~/.ssh/known_hosts''' in your favorite editor, find the hostname, and delete that line. Alternatively, this command also works:
 
<pre style="background-color: #374048; color: white; border-style: none; padding: 5px; width: 75%;">
 
ssh-keygen -R hostname
 
</pre>
 
'''Note:''' This is OK to do here, on these machines with planned maintenance. If you see this warning when connecting to machines you manage yourself, or third-party systems such as VPSes, you are right to be suspicious.
 
  
 
[[Category:IT]]