An introduction to using the Bulldogc cluster

 

Note: this document is gradually being superseded by a wiki. For more complete and up-to-date information, please consult it.

Acknowledging use of the Bulldogc cluster

 

When results derived from computations run on the cluster are published, please include an acknowledgment along the lines of the following:

"Yale University Life Sciences Computing Center" and NIH grant: RR19895, which funded the instrumentation.

Access to the Bulldogc cluster

Bulldogc is administered by Nicholas Carriero (carriero@cs.yale.edu) and Robert Bjornson (robert.bjornson@yale.edu).  Questions about the cluster should be directed to them.

 

Bulldogc is open to all Yale investigators, and to as many non-Yale investigators as load allows. If you would like to join the Center, please send your name, department, institution, title, and a paragraph-length description of your project to kenneth.williams@yale.edu.

Overview of hardware

 

Bulldogc is a cluster consisting of a head node (bulldogc.wss.yale.edu) and 130 compute nodes (c1-c130), each containing two 3.2 GHz EM64T Xeon processors. The processors are 64-bit, although they will also run 32-bit applications. All nodes run Red Hat Enterprise Linux 3.

 

Each node has 8 GB of RAM and a small local disk.  There is a large SAN disk array that serves a number of filesystems, including home directories, to every compute node.

 

Warning: officially, nothing is backed up; please be sure to copy all important data to a safe location. A disk-based backup system is currently being tested with the cluster, so contact us if you lose something, but we cannot guarantee backups at this time.

 

Installed Software

 

PBS is installed in /usr/pbs.

 

A number of useful packages are installed in /usr/local/cluster, including LAM MPI, MPICH, R, Java, and the Intel and PGI compilers.

 

R version 2.4 is installed in /home3/njc2/myInstalls/bin/R.  Make sure to add  /usr/local/cluster/intel/mkl/mkl721/lib/em64t to LD_LIBRARY_PATH.
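
One way to make that setting, sketched for a Bourne-style shell (sh or bash). The variable name MKL_LIB is only illustrative, and csh or tcsh users would need setenv instead:

```shell
# Directory named in the text above; MKL_LIB is just a convenience variable.
MKL_LIB=/usr/local/cluster/intel/mkl/mkl721/lib/em64t

# Prepend it, keeping any existing value and avoiding a stray trailing colon.
export LD_LIBRARY_PATH="${MKL_LIB}${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"

# Show the result before starting R.
echo "$LD_LIBRARY_PATH"
```

Placing these lines near the top of a PBS script that runs R should make the setting apply to batch jobs as well.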

 

Many other packages are installed, or can be installed upon request. Please contact us with your requirements.

Using the cluster

 

Compilation and other interactive tasks can be done either on the head node or on one of the compute nodes, by logging in via an interactive PBS shell (see below).

 

All compute-intensive runs, whether parallel or sequential, should be done via PBS on the appropriate queue.  See the following section for information.

 

If you need access to more CPUs than your queue provides, please contact us. We may be able to provide additional nodes on a case-by-case basis.

Introduction to PBS Pro

 

 

PBS (Portable Batch System) is used as a way to manage jobs that are submitted to the cluster.  The utilities you’ll need are installed in /usr/pbs/bin. 

 

PBS organizes the cluster into a number of queues.  Currently, these are the available queues:

 

Queue Name   group_list   Nodes   Description
sandbox      sand         60      General Users
general      (none)       100     Low priority, preempted
eph          eph          20      EPH group only
zhang        zhang        20      Zhang group only
private      private      20      Reserved queue

 

 

The general queue is a low-priority queue available to all users. It opportunistically scavenges unused nodes from other queues. However, jobs running in this queue are automatically aborted if other queues require nodes. We recommend that all jobs run on the general queue be configured to send email notifications, so that you will know if they are aborted. To do this, use the -m and -M flags:

 

-m abe

-M me@yale.edu

 

Note that, at present, a Yale email address must be used; non-Yale emails will silently fail.
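
Because a non-Yale address fails without any warning, it may be worth checking the address before submitting. A minimal sketch, assuming that any address ending in yale.edu (including departmental subdomains) is acceptable. The function name is hypothetical and not part of PBS, and me@example.com is only a placeholder:

```shell
# Hypothetical helper, not a PBS command: succeeds only for yale.edu addresses.
is_yale_address() {
    case "$1" in
        *@yale.edu | *@*.yale.edu) return 0 ;;
        *) return 1 ;;
    esac
}

# Try the check on a few sample addresses.
for addr in me@yale.edu me@cs.yale.edu me@example.com; do
    if is_yale_address "$addr"; then
        echo "$addr: ok for -M"
    else
        echo "$addr: notifications would silently fail"
    fi
done
```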

 

There is no longer a default queue, so all submissions must specify a queue via the -q flag. In addition, all queues other than general require the use of the -W group_list flag. See the example script below.
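
The pairing of queue and group_list from the table above can be captured in a small shell function. This is only a sketch: the function name is hypothetical, and it prints the flags rather than calling qsub, so it can be tried anywhere.

```shell
# Hypothetical helper: print the qsub flags required for a given queue,
# following the queue table in this document.
qsub_flags_for_queue() {
    case "$1" in
        general)           echo "-q general" ;;                     # no group_list needed
        sandbox)           echo "-q sandbox -W group_list=sand" ;;
        eph|zhang|private) echo "-q $1 -W group_list=$1" ;;         # group name matches queue name
        *)                 echo "unknown queue: $1" >&2; return 1 ;;
    esac
}

# Print the flags for two of the queues.
qsub_flags_for_queue sandbox
qsub_flags_for_queue general
```

On the cluster, the output could then be passed along to qsub, for example as qsub $(qsub_flags_for_queue eph) script.pbs.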

 

To submit a job via PBS, you first write a simple shell script that wraps your job.  Here is an example script.pbs that runs a sequential job:

#PBS -q sandbox -W group_list=sand
#PBS -l walltime=4:00:00
#PBS -m abe -M me@yale.edu
cd $PBS_O_WORKDIR
./myprog arg1 arg2 arg3…

 

This script runs myprog on a node in the sandbox queue chosen by PBS, after changing directory to where the user did the submission (default behavior is to run in the home directory).  The user must be a member of the “sand” group.  In this case, the job will be limited to 4 hours.  You can put any number of PBS directives in the script, followed by commands.
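
Since the #PBS lines are ordinary comments as far as the shell is concerned, a script can be checked for shell syntax errors before it is submitted. The sketch below writes the example script to a temporary directory (an assumption made so that the check can be tried anywhere) and runs sh -n on it, which parses the script without executing it. Note that this checks only the shell syntax, not whether the PBS directives themselves are valid.

```shell
# Work in a scratch directory so nothing in the current directory is touched.
workdir=$(mktemp -d)
script="$workdir/script.pbs"

# Write the example script; the quoted 'EOF' keeps $PBS_O_WORKDIR unexpanded.
cat > "$script" <<'EOF'
#PBS -q sandbox -W group_list=sand
#PBS -l walltime=4:00:00
#PBS -m abe -M me@yale.edu
cd $PBS_O_WORKDIR
./myprog arg1 arg2 arg3
EOF

# sh -n parses without running anything; #PBS lines are comments to the shell.
if sh -n "$script"; then
    echo "syntax ok: $(grep -c '^#PBS' "$script") PBS directives found"
fi
```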

 

To actually submit the job, do:

 

$ qsub script.pbs

 

Note that you can specify all flags either in the script or on the command line, with the command line taking precedence.  For example, the previous script could be submitted to the general queue, without change, by doing:

 

$ qsub -q general script.pbs

 

To check on the status of your jobs, do:

 

$ qstat -q sandbox

 

To kill a running job, do:

 

$ qdel <jobid>

 

Output will normally be returned to you after the job completes in files called scriptname.ojobid and scriptname.ejobid for standard output and standard error, respectively. 
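
As an illustration of the naming scheme, the sketch below derives both file names from a job identifier. The identifier 1234.bulldogc is made up, and the sketch assumes that the numeric part before the first dot is what appears in the file names and that the job name defaults to the script name:

```shell
script=script.pbs
jobid=1234.bulldogc        # made-up example of what qsub might print

seq=${jobid%%.*}           # keep the part before the first dot
out="${script}.o${seq}"    # standard output file
err="${script}.e${seq}"    # standard error file

echo "stdout: $out"
echo "stderr: $err"
```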

 

To run an interactive shell rather than a batch job, use the -I flag to qsub:

 

$ qsub -q sandbox -W group_list=sand -I

 

This will assign a free node to you, and put you within a shell on that node.   You can run any number of commands within that shell.  To free the allocated node, exit from the shell.  The environment variable $PBS_NODEFILE is set to the name of a file containing  the names of the node(s) you were allocated.
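
Inside an interactive shell or a batch script, the node file can be summarized with standard tools. A minimal sketch, assuming the file lists one host name per line and repeats a host once per processor allocated on it. Outside PBS it falls back to a made-up sample file so that it can be tried anywhere:

```shell
# Outside PBS, fall back to a small made-up node file.
if [ -z "${PBS_NODEFILE:-}" ]; then
    PBS_NODEFILE=$(mktemp)
    printf 'c1\nc1\nc2\nc2\n' > "$PBS_NODEFILE"
fi

# Total lines = processor slots; distinct lines = distinct nodes.
nprocs=$(wc -l < "$PBS_NODEFILE" | tr -d ' ')
nnodes=$(sort -u "$PBS_NODEFILE" | wc -l | tr -d ' ')

echo "$nprocs processor slot(s) on $nnodes node(s)"
```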

 

Running an MPI parallel job is similar, but requires that you specify the number of nodes in the script. In the examples below, nodes=8:ppn=2 requests eight nodes with two processors on each.

 

Using LAM MPI:

#PBS -l nodes=8:ppn=2
#PBS -q sandbox -W group_list=sand
#PBS -m abe -M me@yale.edu
lamboot ${PBS_NODEFILE}
cd $PBS_O_WORKDIR
/path/to/mpirun C ./myprog arg1 arg2 …
lamhalt

 

Using MPICH:

#PBS -l nodes=8:ppn=2
#PBS -q sandbox -W group_list=sand
#PBS -m abe -M me@yale.edu
cd $PBS_O_WORKDIR
/path/to/mpirun -machinefile ${PBS_NODEFILE} -allcpus ./myprog arg1 arg2 …

 

For more documentation on PBS, see the User's Guide.