Difference between revisions of "Cluster"

From ITSwiki

Revision as of 12:08, 3 July 2013

This page describes the cluster facilities at DTU Compute.

The cluster is made of:

  • 16 servers (grid01, grid02, ...) running 64-bit Linux with 2 X5650 6-core processors at 2.66GHz, 48GB RAM
  • 2 servers (hms1 & hms2) running 64-bit Linux with 8 Quad-Core AMD Opteron(tm) Processor 8356 at 2.3GHz, 256GB RAM
  • 3 servers (cimbi2-4) running 64-bit Linux with 4 Dual-Core AMD Opteron(tm) Processor 880 at 2.4GHz, 32GB RAM [*]

The grid is rebooted once per month, generally on the first Wednesday after the 15th. The reboot is announced on the front page of this wiki.


Setup

Access to most of the servers is controlled via SUN gridengine. However, in order to submit jobs one has to log on to hms1 or one of grid01-04 (using the normal user account/password, which is used for the SunRay servers as well). The servers hms1 and grid01-04 can be used for development purposes, i.e. compiling, testing etc., and to submit jobs to the other servers using the qsub command. grid01-04 are available from the SunRay servers via the gridterm command and through the menu system. hms1 is also available for running interactive jobs.

There are 5 queues defined on the grid.

  1. fast
    This queue is for very short (test) jobs which require at most 10 min. of WALL time. Jobs taking more time will be killed.
    In order to use it, jobs have to be submitted with qsub -q fast -P fast job.sh. Jobs submitted without arguments (i.e. qsub job.sh) will not run in this queue.
    The queue has access to almost all slots on all machines.
  2. long
    This queue is for long lasting jobs (more than 12 hours) - there is no enforced upper limit.
    The queue can only utilize 4 slots on each of cimbi2-4 and grid05-16, and 16 slots on hms2, i.e. jobs submitted to this queue cannot saturate these machines.
  3. himem
    For jobs needing up to 12 hours of WALL time. They will be executed on either grid05-16, cimbi2-4 or hms1-2.
  4. himem-long
    For jobs needing more than 12 hours of WALL time. They will be executed on either grid05-16, cimbi2-4 or hms1-2.
  5. himem2
    For jobs needing up to 12 hours of WALL time. They will be executed on hms2.
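The queue descriptions above can be condensed into a small helper. This is only a sketch: the function name queue_args is hypothetical, it assumes the fast queue is requested with -q fast -P fast, it ignores himem2, and it assumes that a job fitting none of the named queues is simply submitted without a -q option.

```shell
#!/bin/sh
# Hypothetical helper: print the qsub options suggested by the queue
# descriptions, given the expected WALL time in minutes and whether the
# job is memory demanding ("yes" or "no", default "no").
queue_args() {
    minutes=$1
    himem=${2:-no}
    if [ "$minutes" -le 10 ]; then
        echo "-q fast -P fast"      # very short (test) jobs
    elif [ "$himem" = "yes" ] && [ "$minutes" -le 720 ]; then
        echo "-q himem"             # memory demanding, up to 12 hours
    elif [ "$himem" = "yes" ]; then
        echo "-q himem-long"        # memory demanding, more than 12 hours
    elif [ "$minutes" -gt 720 ]; then
        echo "-q long"              # more than 12 hours
    else
        echo ""                     # assumption: no -q option, default queue
    fi
}

# Print the command lines instead of running them.
echo "qsub $(queue_args 5) job.sh"
echo "qsub $(queue_args 1500 yes) job.sh"
```

Printing the command first makes it easy to check the chosen queue before anything is actually submitted.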

On each node there is a /space directory (about 20GB) to which everybody can write files. However, files older than one week are deleted automatically.
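One possible way to use /space from a job script is to create a private working directory there and remove it when the job ends, rather than relying on the weekly clean-up. The sketch below assumes a hypothetical SCRATCH_ROOT variable (defaulting to /space) and falls back to TMPDIR or /tmp where /space is not writable, so it also runs on machines outside the grid.

```shell
#!/bin/sh
# Create a private scratch directory and print its path.
# SCRATCH_ROOT is a hypothetical override; the default is /space.
make_scratch() {
    root=${SCRATCH_ROOT:-/space}
    if [ ! -d "$root" ] || [ ! -w "$root" ]; then
        root=${TMPDIR:-/tmp}        # fallback when /space is unavailable
    fi
    mktemp -d "$root/job.XXXXXX"
}

workdir=$(make_scratch) || exit 1
# Clean up when the script exits; do not wait for the automatic deletion.
trap 'rm -rf "$workdir"' EXIT

echo "intermediate data" > "$workdir/partial.txt"
ls -l "$workdir"
# ... run the real computation here and copy results home before exiting ...
```

Since /space is local to each node, results that should survive the job have to be copied back to the home directory before the script finishes.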

The grid has support for openMPI.

For instructions on using the grid, submitting matlab jobs in parallel and using OpenMPI, see the grid howto.

Software

The cluster currently runs Ubuntu 10.04, and with that OS comes a suite of standard utilities like the gcc compiler suite, emacs, etc. Other software is installed under /appl unless otherwise specified. The interactive versions of the programs are generally available from the menu system on the SunRay servers (either using a SunRay client or Thinlinc) or from the command line (text in [] denotes the name of the command).

matlab: [matlab] version 2006b (aka version 7.3); 2008b [matlab77] and 2011b [matlab713] are available as well. Make sure to see the examples in the howto for running matlab on the grid.
mathematica: [mathematica] version 6.0, version 7.0 [mathematica7] and version 8.0 [mathematica8]
maple: [maple/xmaple] version 12, version 14 [maple14/xmaple14] and version 15 [maple15/xmaple15]
sas: version 9.2 [sas]
R: [R] newest version (by June 28th, 2012: 2.15.1)
Rstudio: [rstudio]
splus: [Splus] version 8.0.4.
SUN studio 12u1: [sunstudio] installed under /opt/SS12u1/.... .
TotalView debugger: [totalview]
OpenMPI: version 1.3.x. See the howto for usage.
Wine: [wine] version 1.0.1. See special setup instructions.

How to

This page contains howto's for SUN gridengine, Matlab and Gridengine and OpenMPI.

For a general description of the cluster facilities at DTU Compute, please follow this link.

SUN gridengine - a mini howto

This is a 1 minute introduction to how to use the gridengine.

To submit jobs

All jobs must be submitted using a batch file (a binary file cannot be submitted directly). Create a file like this, called simplesum.sh:

#!/bin/bash
# -- our name ---
#$ -N simplesum
# -- request /bin/bash --
#$ -S /bin/bash
# -- send an email once the job has ended or been aborted.
#$ -m ea
# -- run in the current working (submission) directory --
#$ -cwd
matlab -nodesktop < simplesum.m

The content of simplesum.m is

x=sum(sum(inv(rand(3300))))

Next submit it using:

qsub simplesum.sh

If it is a memory demanding job submit it using

qsub -q himem simplesum.sh

To submit it to the himem queue with priority

qsub -P cimbi -q himem simplesum.sh

More (much more!) information in the man page of qsub.
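When submitting from a script it is often handy to keep the job id for later use with qstat or qdel. The sketch below assumes that qsub confirms a submission with a message of the form shown in the comment; the helper name job_id_from_message is hypothetical, and a sample message stands in for qsub so that the snippet runs anywhere.

```shell
#!/bin/sh
# Extract the numeric job id from qsub's confirmation message.
# Assumption: the message looks like
#   Your job 12345 ("simplesum") has been submitted
job_id_from_message() {
    sed -n 's/^Your job \([0-9][0-9]*\) .*/\1/p'
}

# On the grid one would write:
#   job_id=$(qsub simplesum.sh | job_id_from_message)
# Here a sample message replaces the qsub call.
job_id=$(echo 'Your job 12345 ("simplesum") has been submitted' | job_id_from_message)
echo "submitted as job $job_id"
```

If the message format on the installed gridengine version differs, only the sed pattern needs adjusting.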

Job status

After you have submitted your job, you can check the status with the qstat command:

qstat

Without extra arguments, qstat prints only your own jobs. Other useful options:

qstat -u '*'

Show jobs belonging to all users

qstat -u user_name1 [user_name2 ....]

Show only jobs belonging to user(s) user_name1 (user_name2 ...)

qstat -j job_id1 [job_id2 ...]

Show only jobs with the requested job_ids (long listing!). This lists (almost) everything GE knows about the jobs, and the output of this command can be useful for checking why your job will not start.

For more options see the qstat manual pages:

man qstat 
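For a quick overview, the plain qstat listing can be summarised per job state. The sketch below assumes that the listing starts with two header lines and that the state (for example r for running, qw for queued and waiting) is the fifth column; the sample listing and the helper name count_job_states are made up for illustration.

```shell
#!/bin/sh
# Count jobs per state in a qstat listing read from standard input.
# Assumption: two header lines, state in column 5.
count_job_states() {
    awk 'NR > 2 && NF > 0 { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}

# On the grid one would write:  qstat | count_job_states
# Here a made-up listing replaces the qstat call.
count_job_states <<'EOF'
job-ID  prior   name       user   state submit/start at     queue        slots
-------------------------------------------------------------------------------
    231 0.55500 simplesum  abc    r     07/03/2013 12:08:00 long@grid05      1
    232 0.55500 simplesum  abc    r     07/03/2013 12:08:05 long@grid06      1
    233 0.00000 simplesum  abc    qw    07/03/2013 12:09:10                  1
EOF
```

For the sample listing this prints one line per state, "qw 1" followed by "r 2".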

Stop/Delete a job

Usually your jobs will run, finish and disappear from the gridengine system, but sometimes you might want to stop a running job, or remove a job from the queue that does not start because of wrong submission options:

qdel job_id

For more options see the manual pages for qdel:

man qdel
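Removing several waiting jobs at once can be scripted by feeding job ids from qstat to qdel. The following is a dry-run sketch under the same assumptions about the qstat listing (two header lines, state in column 5, qw meaning queued and waiting); the helper name pending_job_ids and the sample listing are hypothetical.

```shell
#!/bin/sh
# Print the ids of all queued-and-waiting jobs in a qstat listing on stdin.
pending_job_ids() {
    awk 'NR > 2 && $5 == "qw" { print $1 }'
}

# Made-up listing; on the grid one would pipe the output of qstat instead.
sample='job-ID prior name user state submit/start at queue slots
-----
231 0.55500 simplesum abc r 07/03/2013 12:08:00 long@grid05 1
233 0.00000 simplesum abc qw 07/03/2013 12:09:10 1
234 0.00000 simplesum abc qw 07/03/2013 12:09:12 1'

printf '%s\n' "$sample" | pending_job_ids | while read -r id; do
    echo "would run: qdel $id"      # replace the echo with qdel "$id" to delete
done
```

Keeping the echo in place until the list of ids looks right avoids deleting the wrong jobs.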

Matlab and gridengine

At the time of writing (18/3-2008) matlab distributed engine is only supported on matlab 2006b and 2007a. Matlab jobs that can be processed in parallel (like a parameter sweep) can benefit from matlab distributed engine. Each subtask should consume several CPU minutes; otherwise the benefit of creating subtasks within matlab is negligible. The method illustrated below for matlab 2007b can be beneficial for smaller jobs than the method for matlab 2006b (i.e. the overhead per job is less). A small mini-example is provided (courtesy of Martin Vester Christensen). Read through the m files; they should give a starting point.

Matlab 2011b

  1. Create a folder ~/matlab713
  2. Download colsum.m to the directory ~/matlab713
  3. Edit the files according to need
  4. Either run paralleltest713.m interactively or using the qsub command as described above.

Matlab 2008b

Quite similar to matlab 2007b.

  1. Create a folder ~/matlab77
  2. Download colsum.m to the directory ~/matlab77
  3. Edit the files according to need
  4. Either run paralleltest77.m interactively or using the qsub command as described above.

Note: in order to run matlab 2008b from a script submitted with qsub, one has to specify ulimit -s 8196 in the script just before calling matlab77.
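A job script following this note might look like the one generated below. This is a sketch only: the script name paralleltest77.sh, the job name and the way matlab77 is invoked (mirroring the simplesum example) are assumptions. The snippet writes the job script to the current directory and prints the submit command rather than running it.

```shell
#!/bin/sh
# Write a job script that raises the stack size limit right before
# starting matlab 2008b. File and job names are illustrative.
cat > paralleltest77.sh <<'EOF'
#!/bin/bash
#$ -N paralleltest77
#$ -S /bin/bash
#$ -cwd
# Set the stack size limit just before calling matlab77.
ulimit -s 8196
matlab77 -nodesktop < paralleltest77.m
EOF

echo "wrote paralleltest77.sh; submit it with: qsub paralleltest77.sh"
```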

Matlab 2006b

  1. Create a folder ~/matlab
  2. Download distributedtest73.m
  3. Edit distributedtest.m according to your need (at least the /path/to/my/homedir)
  4. Either run distributedtest.m interactively or using the qsub command as described above.

Matlab and ksh users

Many users who have /usr/bin/ksh as logon shell (most users created before 2005) have a ~/.profile.linux which includes a line set -u. This line prevents matlab 2007b & 2008b from running the examples above on queues that run Ubuntu 8.10 (as of 1/2-2009: himem2).
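Whether a profile contains such a line can be checked with grep. The sketch below uses a hypothetical helper, has_set_u, which looks for a line consisting of set -u (combined forms such as set -eu are not detected) and defaults to ~/.profile.linux.

```shell
#!/bin/sh
# Print (with line numbers) the lines of a profile file that run "set -u".
# Returns non-zero when the file is missing or has no such line.
has_set_u() {
    file=${1:-$HOME/.profile.linux}
    [ -f "$file" ] || return 1
    grep -n -E '^[[:space:]]*set[[:space:]]+-u([[:space:]]|$)' "$file"
}

if has_set_u; then
    echo "set -u found in ~/.profile.linux"
else
    echo "no set -u line found"
fi
```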

OpenMPI and gridengine

The OpenMPI installation hasn't been tested widely.

2 versions are installed:

OpenMPI from SUNhpc tools

The Open MPI installation from SUN is based on version 1.3.4. There are 2 versions installed: one for compiling with the gcc compiler suite and one for compiling with the SUN Studio compiler suite. They are installed in /appl/SUNWhpc/HPC8.2.1/gnu and /appl/SUNWhpc/HPC8.2.1/sun respectively. Neither of them is in the default PATH, i.e. users must add the relevant bin directory to PATH in their dot files.

It is beyond the scope of this page to explain the usage of OpenMPI. However, to use it, follow the idea in the following example (which uses the "gcc" compiler suite):

  • Download the following hello_world.c example.
  • Compile it using the gcc wrapper mpicc:
export PATH=/appl/SUNWhpc/HPC8.2.1/gnu/bin:$PATH
mpicc hello_world.c -o hello_world
  • create a script file hello_world.sh like this
#!/bin/sh
# -- our name ---
#$ -N hello_world
# -- request /bin/sh --
#$ -S /bin/sh
# -- run in the current working (submission) directory --
#$ -cwd
#$ -m bea
#$ -pe mpi 8
export PATH=/appl/SUNWhpc/HPC8.2.1/gnu/bin:$PATH
mpirun  -np $NSLOTS hello_world

The key line here is #$ -pe mpi 8 which tells SUN grid Engine to use the parallel environment mpi and request 8 slots (change the latter according to your needs).

  • Submit it using qsub hello_world.sh
  • To avoid setting PATH in every script, add the export line to ~/.profile.linux.
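If the export line is placed in a dot file that may be read more than once, PATH can end up with repeated entries. A small guard avoids that. The sketch below is an illustration only: prepend_path is a hypothetical helper name, and the directory is the one used in the example above.

```shell
#!/bin/sh
# Prepend a directory to PATH only if it is not already listed, so the
# snippet can be sourced repeatedly without growing PATH.
prepend_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                # already present: nothing to do
        *) PATH="$1:$PATH" ;;
    esac
}

prepend_path /appl/SUNWhpc/HPC8.2.1/gnu/bin
export PATH
echo "$PATH"
```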

OpenMPI from open-mpi.org

  • Download the following hello_world.c example.
  • Compile it using the gcc wrapper mpicc:
export PATH=/appl/openmpi/1.4.5/bin:$PATH
mpicc hello_world.c -o hello_world
  • create a script file hello_world.sh like this
#!/bin/sh
# -- our name ---
#$ -N hello_world
# -- request /bin/sh --
#$ -S /bin/sh
# -- run in the current working (submission) directory --
#$ -cwd
#$ -m bea
#$ -pe mpi 8
export PATH=/appl/openmpi/1.4.5/bin:$PATH
mpirun  -np $NSLOTS hello_world

The key line here is #$ -pe mpi 8 which tells SUN grid Engine to use the parallel environment mpi and request 8 slots (change the latter according to your needs).

  • Submit it using qsub hello_world.sh
  • To avoid setting PATH in every script, add the export line to ~/.bashrc.