How to use the Tesla Cluster

All jobs on Tesla-280 are run in batch mode only. We are currently using the Sun Grid Engine (SGE) to manage all batch jobs. At this time all interaction with the cluster is done remotely through SGE. Jobs can be submitted and controlled from the front-end Linux machine maidroc.fiu.edu. The steps below detail the most typical usage of the Tesla-280 environment.

  1. Tesla-280 is a Linux-only machine, so only Linux-compatible programs will run on the system. If the program source code is available, it should first be compiled and linked on the front-end workstation maidroc.fiu.edu. This workstation can be accessed remotely by secure shell (‘ssh’) and secure FTP (‘sftp’); access to this machine is restricted to specific IP addresses. The 64-bit GNU compilers (gcc, g++) and the commercial Portland Group compilers (pgf77, pgf90, pgcc, pgCC) are available. All of these compilers produce AMD 64-bit binaries by default; consult the Portland Group compiler online manual for complete details. MPI programs should be compiled and linked with the Open MPI compiler wrappers (mpif90, mpicc, mpiCC). We are currently using Open MPI 1.2.7, which supports MPI-1 and most of MPI-2, for compiling and running MPI codes. Programs compiled and linked on maidroc.fiu.edu should run normally on Tesla-280. Sample compile commands are shown after this list.
  2. Before a job is submitted, the user must first decide which network interface is most appropriate for the job. The cluster is composed of 64 nodes, each with a fast ethernet interface and a gigabit ethernet interface. Currently, all 64 nodes are connected by a 48-way fast ethernet switch. Also, nodes 1-32 are connected to one gigabit ethernet switch and nodes 33-64 are connected to another gigabit ethernet switch; there is no connection between these two gigabit switches. This is a temporary configuration that we hope to change in the future. See the diagram below for a graphical depiction of the network layout. Using fast ethernet, a user can run a single MPI job using all 280 processors. A single job using the gigabit ethernet interface, however, can use at most 64 processors; in that case, two separate 64-processor MPI jobs can be run simultaneously over gigabit ethernet. The gigabit interfaces have roughly ten times the bandwidth and one-third the latency of the fast ethernet interfaces, so they should be used for fine-grained parallel algorithms such as those in large-scale simulation codes. Codes using coarse-grained algorithms, such as parallel optimization, can use fast ethernet with little or no impact on performance. An example of selecting an interface at run time is sketched after this list.
  3. A user’s job can be submitted to Sun Grid Engine by using the ‘qsub’ command on the front-end. The usual approach is to use a shell script when submitting a job, i.e. ‘qsub my_job.sh’. Example scripts (a qsub script for job submission and an Open MPI script for a parallel MPI job) are sketched after this list. NOTE: A job should always be submitted from within the user’s home directory and never from a local directory such as /tmp. Otherwise, all files created by the job will be lost.
  4. The status of a job can be checked by using the ‘qstat -f’ command on the front-end. The command ‘qdel’ can be used to kill a job that is pending or executing. More details on the ‘qsub’, ‘qstat’, and ‘qdel’ commands can be found in the manual pages on the front-end machine (i.e. ‘man qsub’). A few common monitoring commands are shown below.
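
The commands below sketch how an MPI or serial code might be compiled on maidroc.fiu.edu, as described in step 1. The source file names, output names, and the -O2 optimization flag are placeholders for illustration, not part of the original instructions.

    # MPI code, compiled and linked with the Open MPI wrapper compilers
    mpif90 -O2 -o my_solver my_solver.f90      # Fortran 90
    mpicc  -O2 -o my_solver my_solver.c        # C

    # Serial code with a Portland Group compiler
    pgf90 -O2 -o my_serial_code my_serial_code.f90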
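
For step 2, the interface an Open MPI job uses can be restricted at run time with the btl_tcp_if_include MCA parameter. The device names eth0 (fast ethernet) and eth1 (gigabit ethernet) below are assumptions for illustration only; verify the actual interface names on the compute nodes before relying on them.

    # Run on the gigabit interface (assumed here to be eth1)
    mpirun --mca btl_tcp_if_include eth1 -np 64 ./my_solver

    # Run on the fast ethernet interface (assumed here to be eth0)
    mpirun --mca btl_tcp_if_include eth0 -np 280 ./my_solver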
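
The original page links to a qsub script and an Open MPI script for step 3; since those files are not reproduced here, the script below is only a generic sketch. The parallel environment name (‘mpi’) and the slot count are assumptions and must match a parallel environment actually configured on the cluster (‘qconf -spl’ lists the available names).

    #!/bin/bash
    # my_job.sh -- submit from your home directory with: qsub my_job.sh
    #$ -S /bin/bash        # interpret the job script with bash
    #$ -N my_mpi_job       # job name shown by qstat
    #$ -cwd                # run the job in the submission directory
    #$ -j y                # merge stdout and stderr into a single file
    #$ -pe mpi 64          # parallel environment and slot count (site-specific)

    # With SGE-aware Open MPI, mpirun uses the hosts allocated by SGE;
    # $NSLOTS is set by SGE to the number of slots granted to the job.
    mpirun -np $NSLOTS ./my_solver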
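
Finally, a few typical monitoring commands for step 4; the job ID 1234 is a placeholder.

    qstat -f          # full listing of queues and running/pending jobs
    qstat -u $USER    # show only your own jobs
    qdel 1234         # kill the job with ID 1234 (pending or executing)
    man qsub          # full documentation on the front-end machine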
