(DEPRECATED) How do I use the EML Linux cluster?

THIS PAGE IS NOW DEPRECATED AND THE INFORMATION HEREIN PRESERVED ONLY FOR HISTORICAL PURPOSES.
AS OF JUNE 29, 2017, BOTH EML LINUX CLUSTERS USE SLURM AND NOT SGE FOR JOB SUBMISSION.
PLEASE SEE THE SLURM CLUSTER WEBPAGE FOR INFORMATION ON SUBMITTING JOBS AND PARALLEL PROCESSING.  

EML Cluster Information

The EML Linux compute cluster has eight nodes, each with two 16-core CPUs available for compute jobs (i.e., 32 cores per node), for a total of 256 cores. Each node has 248GB of dedicated RAM. The cluster is managed by the Sun Grid Engine (SGE) queueing software, which provides a standard batch queueing system through which users submit jobs to the cluster. Jobs are typically submitted via a shell script that executes the user's application code. Users may also query the cluster to see job status. All software running on EML Linux machines is available on the cluster, and users can compile programs on any EML Linux machine and then run them on the cluster.

Below is more detailed information about the cluster and how to use it. We are currently experimenting with the queueing policies, and these are subject to change.

Access and Job Restrictions

The cluster is open to EML users using their EML logon. Currently users may submit jobs on the following submit hosts:

  blundell, frisch, hicks, jorgensen, klein, laffont, logit, marlowe, marshall, 
nerlove, quesnay, radner, theil

The cluster has three job queues: high.q, interactive.q, and low.q. Interactive jobs allow a user to work at the command line of a cluster node; instructions for interactive jobs are provided in a later section of this document. We have configured high.q and interactive.q to have priority over low.q: low.q jobs have lower priority for obtaining system resources, so when the cluster is busy, low.q jobs will run more slowly than other jobs.

A given user can use at most 12 slots in each of high.q and interactive.q at any one time (this could be a single 12-core job, 12 single-core jobs, or any combination in between). Users can submit jobs requiring additional cores to the queue; these will be started when the user's previous job(s) end. The number of jobs and cores per user running in low.q is not restricted, except by the physical limits of the cluster and jobs being run by other users. low.q jobs are limited to 28 days of runtime, while high.q and interactive jobs are restricted to 7 days (but see the section below on "How to Submit Long Jobs" for jobs you expect to take more than 3 days, as jobs lasting longer than 5 days will be killed by default). Jobs on all queues are restricted to 128GB RAM, and threaded/multi-core jobs will run on at most 32 cores. Jobs exceeding their runtime or memory limits will be killed by SGE without warning, so it is important to try to gauge how long a job might run. If no queue is specified, SGE will submit jobs to low.q, and a job's output files will by default be written to the current working directory (CWD) from which the job was submitted.


Queue          Max. # cores per user (running)   Time limit   Max. memory/job   Max. # cores/job
interactive.q  12                                7 days*      128 GB            12
high.q         12                                7 days*      128 GB            12
low.q          256                               28 days*     128 GB            32**

* See the Section on "How to Submit Long Jobs" for jobs you expect to take more than three days.

** If you use MPI (including foreach with doMPI in R), you can run individual jobs on more than 32 cores. See the Section on "How to Submit MPI Jobs" for such cases. You can also use up to 64 cores for a single Matlab job; see the Section on "Multi-core and Threaded Jobs".

We have implemented a "fair share" policy that governs the order in which jobs that are waiting in a given queue start when resources become available. In particular, if two users each have a job sitting in a queue, the job that will start first will be that of the user who has made less use of the cluster recently (measured in terms of CPU time). The measurement of CPU time downweights usage over time, with a half-life of one month, so a job that ran a month ago will count half as much as a job that ran yesterday. Apart from this prioritization based on recent use, all users are treated equally.

How To Submit Jobs

Prepare a shell script containing the instructions you would like the system to execute. For example, a simple script to run the Matlab code in the file 'simulate.m' would contain the single line:

matlab -nodisplay -nodesktop -singleCompThread < simulate.m > simulate.out

Once logged onto a submit host, use the qsub command with the name of the shell script to enter a job into the queue:

theil:~/grid$ qsub grid1.sh
Your job 6057 ("grid1.sh") has been submitted

Here the job "grid1.sh" has been submitted to the default queue (low.q) and assigned job ID 6057. You can specify a different queue with the "-q" option, e.g.

theil:~/grid$ qsub -q high.q grid1.sh
Your job 6058 ("grid1.sh") has been submitted

SGE will create an output file (note the "o" below) and an error file (note the "e") for each job. For job 6058 above these look like:

theil:~/grid$ ls | grep 6058
grid1.sh.e6058
grid1.sh.o6058

Any Matlab jobs submitted in this fashion, without specifying the '-pe smp' flag to qsub, must start Matlab with the -singleCompThread flag. This ensures that the job does not use more than the single core requested from the queueing system.

Similarly, any SAS jobs submitted in this fashion, without specifying the '-pe smp' flag to qsub, must start SAS with the following flags.

sas -nothreads -cpucount 1

Also, if you want to run Stata/MP, the parallel version of Stata, please see the information below under "Multi-core and Threaded Jobs".

Your script will by default be run under your default shell (which will be bash, unless you've requested something else). To be completely safe, include the standard hashbang line as the first line of your script, either #!/bin/bash or #!/bin/tcsh; your script will then be run under that shell.
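
Putting this together, a complete version of the 'grid1.sh' script used in the examples above might look like the following (a minimal sketch, reusing the simulate.m example):

#!/bin/bash
# grid1.sh: a minimal single-core job script (sketch based on the
# simulate.m example above)
matlab -nodisplay -nodesktop -singleCompThread < simulate.m > simulate.out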

How to Submit Long Jobs

To provide SGE with sufficient information about how long jobs will run, so that it can do a better job with reservations (described later in this document), we now distinguish between short-running and long-running jobs. For jobs that you expect to run three days or less, you do not need to do anything when submitting, but be aware that any job running more than 5 days (120 hours) will be killed automatically (the difference between three and five days is to emphasize the usefulness of including a buffer in your estimate of how long a job will run). For jobs you expect to run longer than three days, submit the job with the following additional flag when using low.q. This syntax indicates a time limit of 672 hours = 28 days, the maximum runtime on the system for low.q jobs.

theil:~/grid$ qsub -l h_rt=672:00:00 grid1.sh

For high.q jobs, change 672:00:00 to 168:00:00 since the time limit on high.q is 7 days.

You can specify the job length as less than 672 hours (or 168 for high.q) but unless you are reasonably sure about setting a time that includes an appropriate buffer, we recommend simply setting this to 672 or 168. Do not use values larger than 672 or 168, respectively, as jobs specified in that way will not start.
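
For example, a long-running high.q job would be submitted as:

theil:~/grid$ qsub -q high.q -l h_rt=168:00:00 grid1.sh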

How To Kill a Job

First, find the job ID of the job by typing qstat at the command line of a submit host (see the section "How To Monitor Jobs" at the end of this document).

Then use qdel to delete the job:

qdel <job-id>

To delete all your jobs in the queue:

qdel "*"

Multi-core and Threaded Jobs

Any job that is written to explicitly use multiple cores must be submitted via the SMP 'parallel environment' (or for certain cases, the DCS 'parallel environment' for Matlab - see below). This includes jobs that fork, spawn processes in any fashion, use the multiple core functionality in Matlab, or use the parallel or multicore packages in R (or packages such as boot when used such that they make use of parallel or multicore).

In addition, if you want to take advantage of multiple cores for threaded computation (in Matlab, in your own threaded C/C++/Fortran code written with OpenMP, or via threaded linear algebra in R or direct calls to BLAS or LAPACK), you need to submit your job via the SMP parallel environment.

To submit such jobs (in this case, requesting 4 cores):

theil:~/grid$ qsub -pe smp 4 grid1.sh

where you can specify anywhere from 2 to 32 cores.

In any code that is explicitly parallel code (including parfor in Matlab and foreach in R), you should specify only as many cores as you requested via the -pe smp flag to qsub. Furthermore your code (including all commands contained in your submission script) should not create any more processes than the number of cores that you request.

For Matlab jobs, to ensure that the number of workers matches the number requested you should explicitly invoke parpool() with the number of workers equal to the number of cores requested. The easiest way to do this is by including this line of code before the use of parfor in your Matlab script, as NSLOTS is set by the queueing software to equal the number of cores requested.

parpool(str2num(getenv('NSLOTS')));

In addition, to allow Matlab to use more than 16 cores on a cluster node, you need to run the following code in Matlab once to modify your Matlab profile for all future Matlab work. This code could be run at the start of your cluster job or just in a Linux session on one of the EML servers:

cl = parcluster('local');
cl.NumWorkers = 32;
saveProfile(cl);

Finally, note that Matlab jobs that use parfor do not use threading by default. Also, threading in Matlab uses at most 16 cores, so do not request more than 16 cores for a Matlab job that exploits threaded calculations. For such threaded calculations in Matlab, please submit via "-pe smp" and not "-pe dcs".

To use more than 32 workers (32 cores) in Matlab in a parpool context (or to use cores spread across multiple nodes), you need to use the DCS parallel environment. You can use up to 64 cores, but since we only have licenses for 64 cores, if other users are using some of the license slots, your job may be queued until sufficient license slots become available. To submit such a job (in this case requesting 40 cores):

theil:~/grid$ qsub -pe dcs 40 grid1.sh

As above you should invoke parpool with the number of cores requested. Note that the workers will run as the user "matlabdcs", so if you interactively log into a node with some of the workers on it, you will see MATLAB processes running as this user if you use 'top' or 'ps'.

To make use of threading, please follow these guidelines.

For Matlab jobs, do not start Matlab with the -singleCompThread flag. Rather, at the beginning of your Matlab code, set the number of threads to the number of cores you have requested (by default, this is stored in the NSLOTS environment variable) by including the following line of code:

feature('numThreads', str2num(getenv('NSLOTS')));

For Stata jobs, if you use Stata/MP (by invoking stata-mp or xstata-mp), you can use up to 8 cores based on the EML license for Stata. If you request 8 cores via -pe smp, then you are all set. If you request fewer than 8 cores (i.e., 2-7 cores), you need to set the number of processors to be the number of cores requested at the start of your Stata code, by including the following line, in this case for 4 cores:

set processors 4

To automate this, you can start your Stata job as follows:

stata-mp -b do myStataCode.do $NSLOTS

And at the start of your code file, include these lines:

args ncores
set processors `ncores'

For SAS jobs, please start SAS with the following flags:

sas -threads -cpucount $NSLOTS

A number of SAS procs are set to take advantage of threading, including SORT, SUMMARY, TABULATE, GLM, LOESS, and REG. SAS enables threading by default, with the default number of threads set to four. Starting SAS as above ensures that the number of threads is set to the number of cores you requested. You can check that threading is enabled from within SAS by running the following and looking for the cpucount and threads options in the printout.

Proc Options group=performance; run;

For jobs other than Matlab, SAS, and Stata, if you want to use more than one thread, you need to set the OMP_NUM_THREADS variable in the shell script you submit via qsub (by default it is set to one, so your code will not use multiple threads),

export OMP_NUM_THREADS=X

where X should be an integer that is no more than the number of cores you have requested (which is stored in NSLOTS) divided by the number of processes your code starts. If your code does not start multiple processes, this means you can set the number of threads to NSLOTS.
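
For instance, a submission script for a threaded program that starts a single process might look like the following minimal sketch ('my_openmp_program' is a hypothetical executable, not an EML program):

#!/bin/bash
# sketch: run a hypothetical OpenMP-threaded program, using one thread
# per core requested via '-pe smp' (the code starts a single process)
export OMP_NUM_THREADS=$NSLOTS
./my_openmp_program

You would then submit this script with, e.g., 'qsub -pe smp 8 myscript.sh'.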

If you have any questions about parallel programming, submitting jobs to use more than one core, or are not sure how to follow these rules, please email consult@econ.berkeley.edu.

For additional details, please see these notes on the basics of parallel programming in Matlab, Stata and C/C++. If you are an R user, these notes tailored for the Statistics Department may be helpful. If you're making use of the threaded BLAS, it's worth doing some testing to make sure that threading is giving a non-negligible speedup; see the notes above for more information.

To reserve cores for a multi-core/threaded job

Finally, if the cluster is heavily loaded and you submit a job requesting many cores, jobs requesting fewer cores may keep slipping in front of your job as cores become available, and your job may have to wait a long time before the necessary resources are free. To tell SGE to accumulate cores for your job (this is called a 'reservation'), you can add the flag '-R y' to your submission, e.g.,

theil:~/grid$ qsub -pe smp 20 -R y grid1.sh

This should allow your job to start once it reaches the top of the queue and SGE is able to accumulate enough cores for the job, though it's possible some other jobs might slip ahead of your job because SGE needs to choose a specific node on which to accumulate cores and make some guesses about how long currently-running jobs will take to finish.  You can also email consult@econ.berkeley.edu if you're having trouble getting your jobs to run.

Running MPI jobs

We now allow MPI jobs on the cluster, including jobs that use cores across multiple nodes. This is useful in two ways. First, it allows you to use more than 32 cores for a single job. Second, if you need fewer than 32 cores but the free cores are scattered across the nodes and there are not sufficient cores on any one machine, it allows you to make use of those scattered cores. Note that we ask that you not run long jobs (more than a couple of hours) that use more than around 100 cores.

Here's how you submit the job (in this case requesting 40 cores):

theil:~/grid$ qsub -pe mpi 40 grid1.sh

Then your grid1.sh file must use mpirun. More details are available at these notes and the accompanying template code files (most of the code is also embedded in the notes). One simple way to use MPI is to use the doMPI back-end to foreach in R. In this case you invoke R via mpirun as:

mpirun -machinefile $TMPDIR/machines R CMD BATCH file.R file.out
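
Alternatively, for a compiled MPI program, grid1.sh might look like the following minimal sketch ('my_mpi_program' is a hypothetical executable):

#!/bin/bash
# sketch: launch a hypothetical MPI program on the allocated cores, using
# the machine file that SGE provides in $TMPDIR
mpirun -machinefile $TMPDIR/machines ./my_mpi_program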

Interactive jobs

You can work interactively on a node from the Linux shell command line by starting a job in the interactive queue. Interactive jobs are officially limited to 7 days, but we expect that such jobs should generally be short (say less than an hour or so). Furthermore, please do not forget to close your interactive sessions when you finish your work so the resources are available to other users.

The syntax for requesting an interactive session is:

qrsh -q interactive.q

This will start a shell on one of the cluster nodes. You can then act as you would on any EML Linux compute server. For example, you might use top to assess the status of one of your non-interactive (i.e., batch) cluster jobs. Or you might test some code before running it as a batch job. You can also transfer files to the local disk of the cluster node.

To monitor a job on a specific node or transfer files to the local disk of a specific node, you need to request that your interactive session be started on the node of interest (in this case eml-sm01):

qrsh -q interactive.q -l hostname=eml-sm01

Note that if all 32 cores on that specific node are in use, you will need to wait until resources become available on that node before your interactive session will start. The qstat command (see below in the section on monitoring jobs) will tell you on which node (eml-sm00, eml-sm01, eml-sm02, eml-sm03, eml-sm10, eml-sm11, eml-sm12, eml-sm13) a given job is running.

Finally, you can request multiple cores using -pe smp, as with batch jobs. As with batch jobs, you can change OMP_NUM_THREADS from its default of one, provided you make sure that the total number of cores used (the number of processes your code starts multiplied by the threads per process) does not exceed the number of cores you request. However, unlike batch jobs, the NSLOTS environment variable is not available, so your code should not rely on being able to access it.
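
For example, to use four cores for threaded computation in an interactive session, you could do the following, setting the thread count by hand since NSLOTS is not available (a minimal sketch):

qrsh -q interactive.q -pe smp 4
# then, at the shell prompt of the interactive session:
export OMP_NUM_THREADS=4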

You are limited to 12 slots in the interactive queue at any time. This could be one session with 12 cores, requested by "-pe smp 12" or 12 sessions, each with a single core, although it is hard to imagine that one could effectively use 12 interactive sessions at once.

How To Monitor Jobs

The SGE command qstat provides info on job status:

theil:~/grid$ qstat
job-ID  prior name       user         state submit/start at     queue               slots ja-task-ID 
----------------------------------------------------------------------------------------------------
 6060 0.00000 grid1.sh   rspence      qw    07/09/2012 16:02

where the "qw" indicates the job is waiting to run. This status will be "r" while running. Once a job finishes the qstat command will show no output.

To see how much memory a job is using, use the "-j" flag to qstat with the job ID and look at the output line beginning with 'usage'. In this case that would look like:

theil:~/grid$ qstat -j 6060 | grep ^usage
usage   1:                 cpu=00:03:52, mem=245.09918 GBs, io=0.00418, vmem=1.093G, maxvmem=1.093G

The vmem field tells how much memory is currently being used and the maxvmem field tells the maximum used.

As used above, qstat only shows your jobs. To see how busy the cluster is, you can look at all the jobs amongst all users:

theil:~/grid$ qstat -u "*"

Here are a few other useful commands for assessing usage on the cluster. You could create an alias in your .bashrc file for any of these that you use regularly.
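
For example, you could add an alias such as the following (the name 'qall' is arbitrary) to your .bashrc to see all users' jobs with one command:

alias qall='qstat -u "*"'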

To find out how busy the cluster is in aggregate, try the following. The AVAIL column shows the number of cores available:

theil:~/grid$ qstat -g c

To find out how many cores are available on each node, for a given queue (low.q in this case):

theil:~/grid$ qstat -F slots -q low.q

Alternatively, this will do the same thing:

theil:~/grid$ qcores

If you add this qtop function to your .bashrc:

function qtop { if [ -z "$1" ]; then n=20 ; else n=$1 ; fi; qstat -u '*' | head -$(($n+2)); }

then you can do the following to see the top jobs (in this case 15) on the cluster:

theil:~/grid$ qtop 15

To see all of your jobs that are queued but not running (for running jobs do "jstatus r"):

theil:~/grid$ jstatus

Submitting multiple jobs in an automated fashion

Each individual job must be submitted through qsub; that is, no job submission script should itself execute jobs in parallel (except via the mechanisms discussed in the previous section).

What follows are a couple of ways to submit multiple jobs in an automated fashion. [Of course, you can manually create many individual job submission scripts, each of which calls a different code file.] We've tested these a bit, but email consult@econ.berkeley.edu if you have problems or find a better way to do this.

Here is some example bash shell code (which could be placed in a shell script file) that loops over two variables (one numeric and the other a string):

for ((it = 1; it <= 10; it++)); do
    for mode in short long; do
        qsub job.sh $it $mode
    done
done

You now have a couple of options in terms of how job.sh is specified. The following illustrates things for Matlab jobs, but it shouldn't be too hard to modify for other types of jobs.

Option #1:

#!/bin/bash
# contents of job.sh

echo "it = $1; mode = '$2'; myMatlabCode" > tmp-$1-$2.m
matlab -nodesktop -nodisplay -singleCompThread < tmp-$1-$2.m > tmp-$1-$2.out 2> tmp-$1-$2.err

In this case myMatlabCode.m would use the variables 'it' and 'mode' but not define them.

Option #2:

#!/bin/bash
# contents of job.sh

export it=$1; export mode=$2;
matlab -nodesktop -nodisplay -singleCompThread < myMatlabCode.m > tmp-$1-$2.out 2> tmp-$1-$2.err

In this case you need to insert the following Matlab code at the start of myMatlabCode.m so that Matlab correctly reads the values of 'it' and 'mode' from the UNIX environment variables:

it = str2num(getenv('it'));
mode = getenv('mode');

It's also possible to avoid creating job.sh entirely:

for ((it = 1; it <= 10; it++)); do
    for mode in short long; do
        echo "it = $it; mode = '$mode'; myMatlabCode" > tmp-$it-$mode.m
        qsub -b y "matlab -nodesktop -nodisplay -singleCompThread < tmp-$it-$mode.m > tmp-$it-$mode.out 2> tmp-$it-$mode.err"
    done
done

For Stata jobs, there's an easier mechanism for passing arguments into a batch job. Here's an example that does not require one to create job.sh or to modify the master code file:

for ((it = 1; it <= 20; it++)); do
    for mode in short long; do
        qsub -b y "stata -b do myStataCode.do $it $mode"
    done
done

Your Stata code file, myStataCode.do above, should have as its first line an 'args' statement assigning the input values to variables (in this case I've named them it and mode to match the shell variables, but they can be named differently):

args it mode

Then the remainder of your code can make use of these variables.