Difference between revisions of "Cluster"

From ITSwiki

Revision as of 17:08, 19 May 2022

This guide is for users at DTU Compute only


This page describes the cluster facilities at DTU Compute.

Most CPU resources are available at http://www.hpc.dtu.dk (via the "Compute" queue). The resources here are kept for backward compatibility and to allow for more interactive usage.


The cluster is made of:

  • 15 servers (grid01, grid02, ...) running Ubuntu 18.04 64-bit Linux with 2 X5650 6-Core Processor 2.66GHz, 48GB RAM
  • 6 servers (grid21, ..., grid27) running Ubuntu 18.04 64-bit Linux with 2 E5-2660 v3 10 Core CPU 2.60GHz, 128GB RAM
  • 1 server (grid20) running Ubuntu 16.04 64-bit Linux with 2 E5-2660 v3 10 Core CPU 2.60GHz, 128GB RAM (will be upgraded to Ubuntu 18.04 by the end of 2019)
  • 2 servers (hms1 & hms2) running Ubuntu 14.04 64-bit Linux with 8 Quad-Core AMD Opteron(tm) Processor 8356 2.3GHz, 256GB RAM (will be decommissioned by the end of 2019)


The grid is rebooted once a month, generally on the first Wednesday after the 15th. The reboot is announced in the message of the day (motd) on the grid terminals.
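
To plan long jobs around the monthly reboot, the usual date can be computed. The following is only a sketch: the function name reboot_day is hypothetical, it assumes GNU date (as shipped with Ubuntu), and it reads "after the 15th" as strictly after, so a Wednesday falling on the 15th is skipped. The announcement in the motd remains authoritative.

```shell
# Print the first Wednesday strictly after the 15th of a given month.
# Usage: reboot_day YEAR MONTH   (hypothetical helper, assumes GNU date)
reboot_day() {
    local year=$1 month=$((10#$2)) day stamp
    for day in 16 17 18 19 20 21 22; do
        printf -v stamp '%04d-%02d-%02d' "$year" "$month" "$day"
        # %u is the ISO weekday number: 1 = Monday ... 7 = Sunday
        if [ "$(date -d "$stamp" +%u)" -eq 3 ]; then
            echo "$stamp"
            return 0
        fi
    done
    return 1
}
```

For example, reboot_day 2022 5 prints the expected reboot date for May 2022.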


Setup

Access to most of the servers is controlled via SUN gridengine. However, in order to submit jobs you have to log on to grid01-04. The servers hms1 and grid01-04 can be used for development purposes (compiling, testing, etc.) and to submit jobs to the other servers using the qsub command. grid01-04 are available from the linuxterm servers via the gridterm command and through the menu system. hms1 is also available for running interactive jobs.

There are 5 queues defined on the grid.

  1. fast
    This queue is for very short (test) jobs which require at most 10 min. of WALL time. Jobs taking more time will be killed.
    To use it, jobs have to be submitted with qsub -q fast -P fast job.sh. Jobs submitted without arguments (i.e. qsub job.sh) will not run in this queue.
    The queue has access to almost all slots on all machines.
  2. long
    This queue is for long-lasting jobs (more than 12 hours). There is no enforced upper limit.
    The queue can only utilize 4 slots on each host, i.e. jobs submitted to this queue cannot saturate these machines.
  3. himem
    For jobs needing up to 12 hours of WALL time.
  4. himem-long
    Same as long
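
The wall-time limits above can be turned into a small helper that suggests qsub arguments. This is a sketch only: the function name queue_args is hypothetical, the arguments for the fast queue are taken from the description above, while "-q long" and the use of a plain qsub for everything in between are assumptions made by analogy.

```shell
# Suggest qsub queue arguments from the expected WALL time in minutes.
# Usage: queue_args MINUTES   (hypothetical helper)
queue_args() {
    local minutes=$1
    if [ "$minutes" -le 10 ]; then
        echo "-q fast -P fast"   # as stated for the fast queue
    elif [ "$minutes" -gt 720 ]; then
        echo "-q long"           # assumption: selected with -q like the other queues
    else
        echo ""                  # assumption: plain qsub for jobs up to 12 hours
    fi
}
```

A job expected to take 5 minutes could then be submitted with qsub $(queue_args 5) job.sh.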


Each node has a /space directory (about 20GB) to which everybody can write files. However, files older than one week will be deleted automatically.
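
To see which of your files in /space are close to the one-week limit, something like the following can be used. It is a sketch under assumptions: the function name old_files is hypothetical, and it assumes that the automatic cleanup looks at the modification time, which is not confirmed here.

```shell
# List regular files under DIR whose age, rounded down to whole days,
# exceeds DAYS (defaults: /space and 5, i.e. files at least 6 days old).
# Usage: old_files [DIR] [DAYS]   (hypothetical helper)
old_files() {
    local dir=${1:-/space} days=${2:-5}
    find "$dir" -type f -mtime +"$days" 2>/dev/null
}
```

Running old_files /space 5 lists files that are at least six days old, so they can be copied elsewhere in time.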


For instructions on how to use the grid, see the grid howto.

Software

The cluster currently runs Ubuntu 18.04, and with that OS comes a suite of standard utilities such as the gcc compiler suite, emacs, etc. Other software is installed under /appl unless otherwise specified. The interactive versions of the programs are generally available from the menu system on the linuxterm servers (either using a linuxterm hardware client or Thinlinc) or from the command line (text in [] denotes the name of the command).

matlab: [matlab] version 2018b (aka version 9.5). 2016b [matlab91], 2017b [matlab93].
mathematica: [mathematica] version 11.3 and version 10.4 [mathematica10]
maple: [maple/xmaple] version 2018
sas: version 9.4 [sas]
R: [R] newest version
Rstudio: [rstudio]

How to

This page contains howtos for SUN gridengine, for Matlab and Gridengine, and for OpenMPI.

For a general description of the cluster facilities at DTU Compute, please follow this link.

SUN gridengine - a mini howto

This is a one-minute introduction to using the gridengine.

To submit jobs

All jobs must be submitted using a batch file (a binary file cannot be submitted directly). Create a file such as simplesum.sh:

#!/bin/bash
# -- name of the job --
#$ -N simplesum
# -- your mail address (like abc@dtu.dk); replace with your own --
#$ -M abc@dtu.dk
# -- request /bin/bash --
#$ -S /bin/bash
# -- send an email once the job has ended or been aborted --
#$ -m ea
# -- run in the current working (submission) directory --
#$ -cwd
matlab -nodesktop < simplesum.m

The content of simplesum.m is

x=sum(sum(inv(rand(3300))))

Next submit it using:

qsub simplesum.sh
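
When submitting from a script it is often handy to keep the job id for later use with qstat or qdel. The sketch below assumes that qsub confirms a submission with a message of the form Your job 12345 ("simplesum") has been submitted; the function name job_id_from_qsub is hypothetical, and array jobs are not handled.

```shell
# Read qsub's confirmation message on stdin and print the job id,
# i.e. the number following "Your job" (assumed message format).
job_id_from_qsub() {
    sed -n 's/^Your job \([0-9][0-9]*\) .*/\1/p'
}
```

It could be used as job_id=$(qsub simplesum.sh | job_id_from_qsub).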

If it is a memory-demanding job, submit it using

qsub -q himem simplesum.sh


More (much more!) information is available in the man page of qsub.

Job status

After you have submitted your job, you can check the status with the qstat command:

qstat

Without extra arguments, qstat prints only your own jobs. Other useful options:

qstat -u '*'

Show jobs belonging to all users

qstat -u user_name1 [user_name2 ....]

Show only jobs belonging to user(s) user_name1 (user_name2 ...)

qstat -j job_id1 [job_id2 ...]

Show only jobs with the requested job_ids (long listing!). This lists (almost) everything GE knows about the jobs, and the output of this command can be useful for finding out why your job will not start.

For more options see the qstat manual pages:

man qstat 
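
With many jobs in the system, a per-state summary of the qstat listing can be easier to read than the listing itself. The following sketch assumes the default qstat layout, with two header lines and the job state (for example r for running, qw for waiting) in the fifth column; the function name count_job_states is hypothetical.

```shell
# Read a qstat listing on stdin and print "STATE COUNT" lines.
# Assumes two header lines and the state in column 5.
count_job_states() {
    awk 'NR > 2 && NF >= 5 { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}
```

A typical call would be qstat | count_job_states.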

Stop/Delete a job

Usually your jobs will run, finish and disappear from the gridengine system, but sometimes you might want to stop a job or remove a job from the queue that does not start due to wrong submission options:

qdel job_id

For more options see the manual pages for qdel:

man qdel
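
If several jobs were submitted with wrong options and are stuck waiting, their ids can be collected from the qstat listing and handed to qdel in one go. This is a sketch under the same assumptions about the qstat layout (two header lines, job id in column 1, state in column 5, qw meaning waiting); the function name pending_job_ids is hypothetical.

```shell
# Read a qstat listing on stdin and print the ids of waiting (qw) jobs,
# one per line. Assumes the default column layout of qstat.
pending_job_ids() {
    awk 'NR > 2 && $5 == "qw" { print $1 }'
}
```

Check the list first with qstat | pending_job_ids, then delete with qstat | pending_job_ids | xargs -r qdel (the -r option of GNU xargs avoids calling qdel when the list is empty).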