Using the SLURM cluster
The EML operates a high-performance Linux-based computing cluster that uses the Slurm queueing software to manage jobs. The cluster has three partitions, which are distinct sets of nodes, with different generations of CPUs.
The high priority (default) partition has eight nodes, divided into two sets of four nodes.
- eml-sm2 nodes: These nodes each have two 14-core CPUs, each core with two hyperthreads (i.e., 56 logical cores per node) available for compute jobs. Each node has 132 GB dedicated RAM.
- eml-sm3 nodes: These nodes each have two 8-core CPUs, each core with two hyperthreads (i.e., 32 logical cores per node) available for compute jobs. Each node has 768 GB dedicated RAM.
In the remainder of this document, we’ll refer to the processing units (the logical cores) as ‘cores’.
The low priority partition has eight nodes, each with 32 physical cores, for a total of 256 cores. Each node has 264 GB dedicated RAM. These nodes have slower cores than those in the high priority partition and are intended for use when the high priority partition is busy or when jobs are not time-critical, thereby freeing up the high partition for other jobs.
The gpu partition has a single node (with 48 logical cores and 264 GB CPU RAM) that hosts an Nvidia A40 GPU.
The partitions are managed by the Slurm queueing software. Slurm provides a standard batch queueing system through which users submit jobs to the cluster. Jobs are submitted to Slurm using a user-defined shell script that executes one’s application code. Interactive use is also an option. Users may also query the cluster to see job status. As currently set up, the cluster is designed for processing single-core and multi-core/threaded jobs, as well as distributed memory jobs that use MPI. All software running on EML Linux machines is available on the cluster. Users can also compile programs on any EML Linux machine and then run that program on the cluster.
Below is more detailed information about how to use the cluster.
Access and Job Restrictions/Time Limits
The cluster is open to a restricted set of Department of Economics faculty, grad students, project account collaborators, and visitors using their EML logon.
Currently users may submit jobs on the following standalone Linux servers (aka ‘submit hosts’):
blundell, frisch, hicks, jorgensen, laffont, logit,
marlowe, marshall, nerlove, radner, theil
Users can also start JupyterHub sessions to get browser-based access to the cluster nodes.
The cluster has three job queues (called partitions by Slurm) called ‘high’ (the default), ‘low’, and ‘gpu’. Interactive jobs can also be run in any queue.
One important note about using the cluster is that your code will not be able to use any more cores than you have requested via Slurm when submitting your job (this is enforced by a Linux tool called ‘cgroups’).
Also note that as indicated above, the cores in the high and gpu partitions use hyperthreading. If you’d like to run multi-core jobs without hyperthreading, please contact us for work-arounds.
At the moment the default time limit on each job is five days of run-time, but users can request a longer limit (up to a max of 28 days) with the -t flag. The scheduling software can better balance usage across multiple users when it has information about how long each job is expected to take, so if possible please indicate a time limit for your job even if it is less than five days. Feel free to be generous in this time limit to avoid having your job killed if it runs longer than expected. Also feel free to set a round-number time limit, such as 1 hour, 4 hours, 1 day, 3 days, 10 days, or 28 days, rather than trying to be more exact.
This table outlines the job restrictions in each partition.
Partition | Max. cores/user | Time Limit | Max. mem/job (GB) | Max. cores/job |
---|---|---|---|---|
high (default) | 352 | 28 days [2] | 132 GB (eml-sm2 nodes) or 768 GB (eml-sm3 nodes, requested with “-C mem768g” [formerly “-C bigmem”]) [4] | 56 (eml-sm2 nodes) or 32 (eml-sm3 nodes) [3] |
low [1] | 256 | 28 days [2] | 264 GB | 32 [3] |
gpu [5] | 48 | 28 days | 256 GB | 48 |
[1] See How to Submit Jobs to the Low Partition.
[2] See Submitting Long Jobs for jobs you expect to take more than three days.
[3] If you use software set up to use multiple nodes, you can run individual jobs on more than 56 (or 32 on the low partition) cores. See Submitting Multi-node Jobs or Submitting MPI Jobs or Submitting MATLAB Parallel Server Jobs.
[4] See Submitting Large Memory Jobs for jobs needing more than 25 GB memory.
[5] See Submitting GPU Jobs for jobs using the GPU.
Basic Slurm Usage
Quick Start Guide
Our sister facility, the Statistical Computing Facility, has a quick start guide to submitting and monitoring jobs that is useful for EML users as well (but note that the machine names shown in the guide are the SCF machines, not EML machines; in particular make sure you ssh to an EML Linux server to submit jobs).
Submitting a Simple Single-Core Job
Prepare a shell script containing the instructions you would like the system to execute. When submitted using the instructions in this section, your code should only use a single core at a time. Jobs that start additional processes will still only have access to one core. In the later sections of this document, we describe how to submit jobs that use a variety of types of parallelization to make use of multiple cores.
For example a simple script to run the Matlab code in the file ‘simulate.m’ would contain these lines:
#!/bin/bash
matlab -nodisplay -nodesktop < simulate.m > simulate.out
Note that the first line, indicating which UNIX shell to use, is required. You can specify tcsh or another shell if you prefer.
Once logged onto a submit host, use the sbatch command with the name of the shell script (assumed to be job.sh here) to enter a job into the queue:
theil:~$ sbatch job.sh
Submitted batch job 380
Here the job is assigned job ID 380. Results that would normally be printed to the screen from your program will be written to a file called simulate.out per the invocation of MATLAB in the job script.
For Stata, if you submit a job without requesting multiple cores, it makes sense to use Stata/SE so that Stata only attempts to use a single core.
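For example, a minimal single-core Stata job script might look like the following. This is only a sketch: it assumes Stata/SE is invoked as ‘stata-se’ on the EML machines and that your code is in a (hypothetical) file called myAnalysis.do.
#!/bin/bash
# run Stata/SE in batch mode on a single core; 'stata-se' and myAnalysis.do are assumptions
stata-se -b do myAnalysis.do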
Slurm provides a number of additional flags (input options) to control what happens; see the man page for sbatch for help with these. Here are some examples, placed in the job script file, where we name the job and specify the output and error files:
#!/bin/bash
#SBATCH --job-name=myAnalysisName
#SBATCH -o myAnalysisName.out #File to which job script's standard output will be written
#SBATCH -e myAnalysisName.err #File to which job script's standard error will be written
matlab -nodisplay -nodesktop -singleCompThread < simulate.m > simulate.out
Any of the sbatch flags can be included in the job script as just above, or given on the command line when you submit the job, just after ‘sbatch’ and before the name of the submission script, for example:
theil:~$ sbatch --job-name=foo --mail-user=blah@berkeley.edu job.sh
Note that Slurm is configured such that single-core jobs will have access to a single physical core (including both hyperthreads on the machines in the high and gpu partitions), so there won’t be any contention between the two threads on a physical core. However, if you have many single-core jobs to run on the high or gpu partitions, you might improve your throughput by modifying your workflow so that you can run one job per hyperthread rather than one job per physical core. You could do this by taking advantage of parallelization strategies in R, Python, or MATLAB to distribute tasks across workers in a single job, or you could use GNU parallel or srun within sbatch (see the sketch below).
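Here is a minimal sketch of the srun-within-sbatch approach; the task scripts task0.sh through task7.sh are hypothetical placeholders for your own single-core commands.
#!/bin/bash
#SBATCH --ntasks 8
# launch one single-core job step per allocated CPU (hyperthread); Slurm places the steps
for i in $(seq 0 7); do
  srun --ntasks 1 --exclusive ./task${i}.sh &
done
wait   # don't let the batch script exit until all job steps finish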
How to Kill a Job
First, find the job-id of the job, by typing ‘squeue’ at the command line of a submit host (see the section on ‘How to Monitor Jobs’ below).
Then use scancel to delete the job (with id 380 in this case):
theil:~$ scancel 380
Submitting a Low-Priority Job
To submit a job to the slower nodes in the low priority partition (e.g., when the default high priority partition is busy), you must include either the ‘--partition=low’ or ‘-p low’ flag. Without this flag, jobs will be run by default in the high partition. For example:
theil:~$ sbatch -p low job.sh
Submitted batch job 380
You can also submit interactive jobs (see next section) to the low partition, by simply adding the flag for the low partition, e.g., ‘-p low’, to the srun command.
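For example, to start an interactive bash shell in the low partition:
srun -p low --pty /bin/bash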
Interactive Jobs
You can work interactively on a node from the Linux shell command line by starting an interactive job (in any of the partitions). Please do not forget to close your interactive sessions when you finish your work so the cores are available to other users.
The syntax for requesting an interactive (bash) shell session is:
srun --pty /bin/bash
This will start a shell on one of the nodes. You can then act as you would on any EML Linux compute server. For example, you might use top to assess the status of one of your non-interactive (i.e., batch) cluster jobs. Or you might test some code before running it as a batch job. You can also transfer files to the local disk of the cluster node.
If you want to run a program that involves a graphical interface (requiring an X11 window), you need to add --x11=first to your srun command. So you could directly run MATLAB, e.g., on a cluster node as follows:
srun --pty --x11=first matlab
or you could add the --x11=first flag when requesting an interactive shell session and then subsequently start a program that has a graphical interface.
Please note that you will only have access to one core in your interactive job unless you specifically request more cores. To run an interactive session in which you would like to use multiple cores, do the following (here we request 4 cores for our use):
srun --pty --cpus-per-task 4 /bin/bash
Note that “-c” is a shorthand for “--cpus-per-task”. More details on jobs that use more than one core can be found below in the section on Submitting Parallel Jobs.
To transfer files to the local disk of a specific node, you need to request that your interactive session be started on the node of interest (in this case eml-sm10):
srun --pty -w eml-sm10 /bin/bash
Note that if that specific node does not have sufficient free cores to run your job, you will need to wait until cores become available on that node before your interactive session will start. The squeue command (see below in the section on How to Monitor Jobs) will tell you on which node a given job is running.
Submitting Long Jobs and Setting Job Time Limits
As mentioned earlier, the default time limit on each job is five days of run-time, but users can request a longer limit (up to a max of 28 days) with the -t flag, as illustrated here to request a 10-day job:
theil:~$ sbatch -t 10-00:00:00 job.sh
The scheduling software can better balance usage across multiple users when it has information about how long each job is expected to take, so if possible please indicate a time limit for your job even if it is less than five days. Feel free to be generous in this time limit to avoid having your job killed if it runs longer than expected. (For this reason, if you expect your job to take more than three days, you may want to increase the limit relative to the five-day default.) Also feel free to set a round-number time limit, such as 1 hour, 4 hours, 1 day, 3 days, 10 days, or 28 days, rather than trying to be more exact.
Here is an example of requesting three hours for a job:
theil:~$ sbatch -t 3:00:00 job.sh
Submitting Large Memory Jobs
The default high priority partition has some nodes with 132 GB memory and some with 768 GB of memory. If you’re submitting a job that needs a lot of memory, you should add the ‘-C mem768g’ flag (formerly ‘bigmem’) to ensure your job runs on a node with 768 GB of memory. In particular, any jobs that use more than 100 GB memory should use this flag, and we recommend that jobs using more than 25 GB memory use this flag (because having multiple such jobs, potentially from different users, could cause the nodes with 132 GB of memory to run out of memory).
theil:~$ sbatch -C mem768g job.sh
You don’t need to, and usually shouldn’t, use the ‘--mem’ flag for sbatch. In general, you’ll have access to all the memory on a node. This is true for all users, so occasionally a job will fail as mentioned above and discussed further here.
If you expect to need all the memory on a node, or need to make sure your job does not die because of memory use by other users’ jobs on the node your job is running on, you can request that your job have exclusive access to a node by adding the ‘--exclusive’ flag.
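For example, to request exclusive access to one of the 768 GB nodes for the job script used earlier:
theil:~$ sbatch --exclusive -C mem768g job.sh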
Submitting GPU Jobs
To submit a job to the GPU node, you must include either the ‘--partition=gpu’ or ‘-p gpu’ flag, as well as the “--gpus=1” flag. For example:
theil:~$ sbatch -p gpu --gpus=1 job.sh
Submitted batch job 380
To use CUDA or cuDNN, you’ll need to load the ‘cuda’ or ‘cudnn’ modules. To use Tensorflow with the GPU, you’ll need to load the ‘tensorflow’ module (without loading the ‘cuda’ or ‘cudnn’ modules, as CUDA and cuDNN will be available to Tensorflow via a different mechanism).
How to Monitor Jobs
The Slurm command squeue provides info on job status:
theil:~$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
381 high job.sh paciorek R 25:28 1 eml-sm20
380 low job.sh paciorek R 25:37 1 eml-sm11
The following will tailor the output to include information on the number of cores (the CPUs column below) being used:
theil:~$ squeue -o "%.7i %.9P %.8j %.8u %.2t %.9M %.5C %.8r %.6D %R"
JOBID PARTITION NAME USER ST TIME CPUS REASON NODES NODELIST(REASON)
381 high job.sh paciorek R 28:00 4 None 1 eml-sm20
380 low job.sh paciorek R 28:09 4 None 1 eml-sm11
The ST field indicates whether a job is running (R), failed (F), or pending (PD). The latter occurs when there are not yet enough resources on the system for your job to run.
If you would like to logon to the node on which your job is running in order to assess CPU or memory use, you can SSH to the node in the context of your existing job. (Otherwise SSH to a cluster node will fail if you don’t have a job running on the node.) You’ll need to determine the node on which your job is running and then ssh to that node from a standalone/login node. Note that if your terminal on the standalone/login node is via JupyterHub, you’ll need this SSH invocation: ssh -F none <name_of_node>.
You can then run ‘top’ and other such tools.
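For example (the node name eml-sm20 here is just an illustration; use the node listed for your job by squeue):
theil:~$ squeue -u $USER       # note the NODELIST column for your job
theil:~$ ssh eml-sm20          # from a JupyterHub terminal use: ssh -F none eml-sm20
eml-sm20:~$ top -u $USER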
To see a history of your jobs (within a time range), including reasons they might have failed:
sacct --starttime=2021-04-01 --endtime=2021-04-30 \
--format JobID,JobName,Partition,Account,AllocCPUS,State%30,ExitCode,Submit,Start,End,NodeList,MaxRSS
How to Monitor Cluster Usage
If you’d like to see how busy each node is (e.g., to choose what partition to submit a job to), you can run the following:
theil:~$ sinfo -N -o "%8P %15N %.5a %6t %C"
PARTITIO NODELIST AVAIL STATE CPUS(A/I/O/T)
low* eml-sm00 up idle 0/32/0/32
low* eml-sm01 up idle 0/32/0/32
low* eml-sm02 up idle 0/32/0/32
low* eml-sm03 up idle 0/32/0/32
low* eml-sm10 up mix 29/3/0/32
low* eml-sm11 up mix 27/5/0/32
low* eml-sm12 up mix 9/23/0/32
low* eml-sm13 up idle 0/32/0/32
high eml-sm20 up idle 0/56/0/56
high eml-sm21 up mix 8/48/0/56
high eml-sm22 up idle 0/56/0/56
high eml-sm23 up idle 0/56/0/56
high eml-sm30 up mix 28/4/0/32
high eml-sm31 up mix 28/4/0/32
high eml-sm32 up idle 0/32/0/32
high eml-sm33 up idle 0/32/0/32
Here the A column indicates the number of allocated (in use) cores, I the number of idle cores, O the number of cores that are otherwise unavailable, and T the total number of cores on the node.
When will my job run or why is it not starting?
The cluster is managed using the Slurm scheduling software. We configure Slurm to try to balance the needs of the various cluster users.
Often there may be enough available CPU cores (aka ‘resources’) on the partition, and your job will start immediately after you submit it.
However, there are various reasons a job may take a while to start. Here are some details of how the scheduler works.
- If there aren’t enough resources in a given partition to run a job when it is submitted, it goes into the queue. The queue is sorted based on how much CPU time you’ve used over the past few weeks, using the ‘fair share’ policy described below. Your jobs will be placed below the jobs of users who have used less CPU time in recent weeks. This happens dynamically, so another user can submit a job well after you have submitted yours, and that job can be placed higher in the queue and start sooner than your job. If this were not the case, imagine a user submitting hundreds of big jobs: everyone submitting after that would have to wait a long time while those jobs run, unless other jobs could be moved above them in the queue.
- When a job at the top of the queue needs multiple CPU cores (or in some cases an entire node), then jobs submitted to that same partition that are lower in the queue will not start even if there are enough CPU cores available for those lower priority jobs. That’s because the scheduler is trying to accumulate resources for the higher priority job(s) and guarantee that the higher priority jobs’ start times wouldn’t be pushed back by running the lower priority jobs.
- In some cases, if the scheduler has enough information about how long jobs will run, it will run lower-priority jobs on available resources when it can do so without affecting when a higher priority job would start. This is called backfill. It can work well on some systems but on the SCF and EML it doesn’t work all that well because we don’t require that users specify their jobs’ time limits carefully. We made that tradeoff to make the cluster more user-friendly at the expense of optimizing throughput.
The ‘fair share’ policy governs the order of jobs that are waiting in the queue for resources to become available. In particular, if two users each have a job sitting in a queue, the job that will start first will be that of the user who has made less use of the cluster recently (measured in terms of CPU time). The measurement of CPU time downweights usage over time, with a half-life of one month, so a job that ran a month ago will count half as much as a job that ran yesterday. Apart from this prioritization based on recent use, all users are treated equally.
Useful Slurm commands
We’ve prepared a set of shortcut commands that wrap around Slurm commands such as srun, squeue, sinfo, and sacct with some commonly-used options:
- slogin: starts an interactive shell on a cluster node
- snodes: prints the current usage of nodes on the cluster
- sjobs: lists running jobs on the cluster
- shist: provides information about completed (including failed) jobs
- sassoc: gives information about user access to cluster partitions
For each of these commands, you can add the -h flag to see how to use them. For example:
theil:~$ slogin -h
Usages:
'slogin' to start an interactive job
'slogin jobid' to start a shell on the node a job is running on
'slogin additional_arguments_to_srun' to start an interactive job with additional arguments to srun
Submitting Parallel Jobs
One can use Slurm to submit a variety of types of parallel code. Here is a set of potentially useful templates that we expect will account for most user needs. If you have a situation that does not fall into these categories or have questions about parallel programming or submitting jobs to use more than one core, please email consult@econ.berkeley.edu.
For additional details, please see the SCF tutorial on the basics of parallel programming in R, Python, MATLAB and C/C++, with some additional details on doing so in the context of a Slurm job. If you’re making use of the threaded BLAS, it’s worth doing some testing to make sure that threading is giving a non-negligible speedup; see the notes above for more information.
Submitting Threaded Jobs
Here’s an example job script to use multiple threads (4 in this case) in R (or with your own openMP-based program):
#!/bin/bash
#SBATCH --cpus-per-task 4
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
R CMD BATCH --no-save simulate.R simulate.Rout
This will allow your R code to use the system’s threaded BLAS and LAPACK routines. [Note that in R you can instead use the omp_set_num_threads() function in the RhpcBLASctl package, again making use of the SLURM_CPUS_PER_TASK environment variable.]
The same syntax in your job script will work if you’ve compiled a C/C++/Fortran program that makes use of openMP for threading. Just replace the R CMD BATCH line with the line calling your program.
Here’s an example job script to use multiple threads (4 in this case) in MATLAB:
#!/bin/bash
#SBATCH --cpus-per-task 4
matlab -nodesktop -nodisplay < simulate.m > simulate.out
At the start of your MATLAB code file you should include this line:
maxNumCompThreads(str2num(getenv('SLURM_CPUS_PER_TASK')));
Here’s an example job script to use multiple threads (4 in this case) in SAS:
#!/bin/bash
#SBATCH --cpus-per-task 4
sas -threads -cpucount $SLURM_CPUS_PER_TASK
A number of SAS procs are set to take advantage of threading, including SORT, SUMMARY, TABULATE, GLM, LOESS, and REG. SAS enables threading by default, with the default number of threads set to four. Starting SAS as above ensures that the number of threads is set to the number of cores you requested. You can check that threading is enabled from within SAS by running the following and looking for the cpucount and threads options in the printout.
Proc Options group=performance; run;
You can use up to eight cores with Stata/MP (limited by the EML license for Stata). Do not request more than 8 cores for a Stata job. If you request eight cores (--cpus-per-task=8) or fewer, you are all set.
It’s possible to explicitly set the number of processors in Stata to be the number you requested in your job submission, but that is unnecessary because Slurm will limit your job to the number of cores requested. That said, one way to do this is to hard-code the following line at the start of your Stata code (in this case assuming you requested four cores):
set processors 4
You can do this in an automated fashion by first invoking Stata in your job script as:
stata-mp -b do myStataCode.do ${SLURM_CPUS_PER_TASK}
and then at the start of your Stata code including these two lines:
args ncores
set processors `ncores'
Submitting Multi-core Jobs
The following example job script files pertain to jobs that need to use multiple cores on a single node in a context that does not fall under threading/openMP. This is relevant for parfor in MATLAB; for IPython parallel (ipyparallel), Dask, Ray, Pool.map and pp.Server in Python; and for parallel code in R that starts multiple R processes (e.g., future’s future_lapply, foreach, mclapply, parLapply).
Here’s an example script that uses multiple cores (4 in this case):
#!/bin/bash
#SBATCH --cpus-per-task 4
R CMD BATCH --no-save simulate.R simulate.Rout
Your R, Python, or any other code won’t be able to use more cores than the number of total cores requested (4 in this case). You can use the SLURM_CPUS_PER_TASK environment variable to programmatically control this, though your software will automatically detect the number of available cores in many cases.
The same syntax for your job script pertains to MATLAB. When using parpool in MATLAB, you can do the following:
parpool(str2num(getenv('SLURM_CPUS_PER_TASK')));
In the high (default) partition, the default maximum number of workers is 28 on the eml-sm2 nodes and 16 on the eml-sm3 nodes, while in the low partition it is also 16. To increase this (up to the maximum number of cores on a machine), run the following code before invoking parpool:
cl = parcluster();
cl.NumWorkers = str2num(getenv('SLURM_CPUS_PER_TASK'));
cl.parpool(str2num(getenv('SLURM_CPUS_PER_TASK')));
To use more than 32 workers (or more than 56 on the eml-sm2 nodes in the high (default) partition) in MATLAB in a parpool context, or to use cores spread across multiple nodes (which can help your job start faster when the cluster is busy), you need to use MATLAB Parallel Server, discussed below.
To use multiple threads per worker in MATLAB, here is an example script for four workers and two threads per worker:
#!/bin/bash
#SBATCH --nodes 1
#SBATCH --ntasks 4
#SBATCH --cpus-per-task 2
matlab -nodesktop -nodisplay < simulate.m > simulate.out
And here is how to start your parpool in your MATLAB code:
cl = parcluster();
cl.NumThreads = str2num(getenv('SLURM_CPUS_PER_TASK'));
cl.parpool(str2num(getenv('SLURM_NTASKS')));
Also note that as indicated above, the cores in the high and gpu partitions use hyperthreading, which could slow multi-core computations in some cases. If you’d like to run multi-core jobs without hyperthreading, please contact us for work-arounds.
Submitting Multi-node Jobs
You can run jobs that use cores across multiple nodes, but only if the software you are using is configured to execute across machines. Some examples include use of MATLAB Parallel Server (discussed below); MPI-based jobs (discussed below); Python code that uses ipyparallel, Dask or Ray with workers running on multiple nodes; or R code using the future package or other multi-node-capable tools. This modality allows you to use more cores than exist on a single node or to gather free cores that are scattered across the nodes when the cluster is heavily used. To run across multiple nodes, you can simply request the total number of cores you want using the --ntasks flag. In cases where the number of cores is greater than the number available on a machine, your job will have access to multiple nodes. Depending on cluster usage, even if you request fewer cores than a machine has, your job may still access cores on multiple nodes.
Submitting MPI Jobs
You can use MPI to run jobs across multiple nodes. This modality allows you to use more cores than exist on a single node or to gather free cores that are scattered across the nodes when the cluster is heavily used.
Here’s an example script that uses multiple processors via MPI (64 in this case):
#!/bin/bash
#SBATCH --ntasks 64
mpirun -np $SLURM_NTASKS myMPIexecutable
Note that “-n” is a shorthand for “--ntasks”.
“myMPIexecutable” could be C/C++/Fortran code you’ve written that uses MPI, or R or Python code that makes use of MPI. More details are available here.
To run an MPI job with each process threaded, your job script would look like the following (here with 14 processes and two threads per process):
#!/bin/bash
#SBATCH --ntasks 14 --cpus-per-task 2
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
mpirun -np $SLURM_NTASKS -x OMP_NUM_THREADS myMPIexecutable
Submitting MATLAB Parallel Server (Multi-node) Jobs
MATLAB Parallel Server (formerly MATLAB DCS) allows you to run a parallel MATLAB job across multiple nodes.
There are two primary advantages to using MATLAB Parallel Server rather than parallelizing your MATLAB code across the cores on a single node: (1) there is no limit on the number of workers (apart from the limit on the number of cores available on the cluster), and (2) your job may not have to wait until the number of cores requested are all available on a single node, but rather can “scavenge” available cores across multiple nodes. (Note that all cores must be within one partition.)
By default each worker will use one core, but it’s possible to use more than one core per worker as discussed below.
There are two ways to use MATLAB Parallel Server.
Option 1a - submit a Slurm job via sbatch that uses MATLAB Parallel Server (new approach for MATLAB R2023a, as well as R2022a)
As a one-time setup step (you only need to do this the first time you use this approach), go to “Environment->Parallel->Create and Manage Cluster Profiles” in the MATLAB Desktop toolbar; this will open the Cluster Profile Manager. Then click on ‘Import’ and navigate to /usr/local/linux/MATLAB/current/toolbox/parallel/dcs.mlsettings (dcsProfile.settings in MATLAB R2022a) and click ‘Open’. Then exit out of the Cluster Profile Manager.
To submit a MATLAB Parallel Server job, you’ll need to specify the number of MATLAB workers (by using the -n or --ntasks flag). Unlike in the past, you should NOT use the ‘-C dcs’ flag. Here’s an example job script that would use 40 cores for 40 workers (making sure to use the ‘-n’ flag when requesting the number of cores):
#!/bin/bash
#SBATCH -n 40
matlab -nodesktop -nodisplay < code.m > code.mout
Then in your MATLAB code, simply invoke parpool as:
pool = parpool('dcs', str2num(getenv('SLURM_NTASKS')));
If you’d like to use multiple threads per worker, please set --cpus-per-task equal to the number of threads per worker you desire and then use this approach in your MATLAB code:
cl = parcluster('dcs');
cl.NumThreads = str2num(getenv('SLURM_CPUS_PER_TASK'));
% Start a pool of multi-core workers on the EML cluster
pool = cl.parpool(str2num(getenv('SLURM_NTASKS')));
% execute code
% ...
% Delete the job to release its cluster resources
delete(pool);
You should also be able to use the “batch” command to execute a script or function across multiple workers as described under Option 2 below. Other uses of MATLAB parallel server functionality may also be possible.
Option 1b - submit a Slurm job via sbatch that uses MATLAB Parallel Server (deprecated approach for MATLAB R2022a and earlier)
To submit a MATLAB Parallel Server job, you’ll need to specify the number of MATLAB workers (by using the -n or --ntasks flag) and indicate that a Parallel Server job will be run (by using ‘-C dcs’). This won’t work with srun, only with sbatch. Here’s an example job script that would use 40 cores for 40 workers (making sure to use the ‘-n’ flag when requesting the number of cores):
#!/bin/bash
#SBATCH -n 40 -C dcs
matlab -nodesktop -nodisplay < code.m > code.mout
Then in your MATLAB code, simply invoke parpool as:
pool = parpool(str2num(getenv('SLURM_NTASKS')));
Note that the workers will run as the user “matlabdcs”, so if you interactively log into a node with some of the workers on it, you will see MATLAB processes running as this user if you use ‘top’ or ‘ps’.
If you’d like to use multiple threads per worker, please set --cpus-per-task equal to the number of threads per worker you desire and then use this approach in your MATLAB code:
cl = parcluster('dcs');
cl.NumThreads = str2num(getenv('SLURM_CPUS_PER_TASK'));
% Start a 20-worker pool of multi-core workers on the EML cluster
pool = cl.parpool(20);
% execute code
% ...
% Delete the job to release its cluster resources
delete(pool);
Option 2 - run your MATLAB code on an EML Linux machine and offload parallel code to the cluster from within MATLAB
This option allows you to run code in MATLAB on a stand-alone Linux machine (not part of the EML cluster) and offload parallel execution of a portion of your code to the cluster without explicitly starting a cluster job via Slurm. Note that you simply log in to an EML Linux machine and start MATLAB; you do NOT use sbatch or srun to start a cluster job under this approach.
Under Option 2, you can either start a pool of workers using “parpool” or you can use the “batch” command to execute code across multiple workers.
To start up a pool of workers (with the only limit on the number of workers being the number of cores in a given partition and the usage of cores by other users’ jobs), you’ll need to use the ‘dcs’ cluster profile.
As a one-time setup step (you only need to do this the first time you use this approach), go to “Environment->Parallel->Create and Manage Cluster Profiles” in the MATLAB Desktop toolbar; this will open the Cluster Profile Manager. Then click on ‘Import’ and navigate to /usr/local/linux/MATLAB/current/toolbox/parallel/dcs.mlsettings (dcsProfile.settings in R2022a and earlier versions) and click ‘Open’. Then exit out of the Cluster Profile Manager.
Now, whenever you want to start up a pool of workers, simply do the following (here for 40 workers):
% Start a 40-worker pool on the EML cluster
pool = parpool('dcs', 40);
% execute code
% ...
% Delete the job to release its cluster resources
delete(pool);
This starts a Slurm job (and prints out the job ID to the screen in case you want to monitor the Slurm job). Once the pool is ready, simply execute your parallel code (such as with a parfor loop). When you are done remember to delete the pool so the cluster job ends and the resources are available to others.
For threaded workers, you can simply do this:
cl = parcluster('dcs');
cl.NumThreads = 2; % however many threads per worker you'd like to use
% Start a 20-worker pool of multi-core workers on the EML cluster
pool = cl.parpool(20);
% execute code
% ...
% Delete the job to release its cluster resources
delete(pool);
If you’d like to modify the flags that are used when the underlying Slurm job is submitted (e.g., to use the ‘low’ partition and set a particular time limit as shown here), you would do it like this:
cl = parcluster('dcs');
cl.AdditionalProperties.AdditionalSubmitArgs='-p low -t 30:00'
% Start a 40-worker pool on EML cluster low partition, 30 min. time limit
pool = cl.parpool(40);
% execute code
% ...
% Delete the job to release its cluster resources
delete(pool);
Alternatively you can use the “batch” command to execute a script or function across multiple workers. Here is one example usage but there are a variety of others discussed in MATLAB’s online documentation for the “batch” command. Suppose you have a file ‘code.m’ that executes a parfor. To run that code on 39 workers (an additional worker will be used to manage the work, for a total of 40 workers), you would do this:
% Sets things up to make use of MATLAB Parallel Server
c = parcluster('dcs');
% Uses 40 workers total, starting a Slurm job on the EML cluster
j = c.batch('code', 'Pool', 39);
wait(j) % Wait for the job to finish
diary(j) % Display logging output
r = fetchOutputs(j); % Get results into a cell array
r{1} % Display results
% Delete the job to release its cluster resources
j.delete()
Automating Submission of Multiple Jobs
Using Job Arrays to Submit Multiple jobs at Once
Job array submissions are a nice way to submit multiple jobs in which you vary a parameter across the different jobs.
Here’s what your job script would look like, in this case to run a total of 5 jobs with parameter values of 0, 1, 2, 5, 7:
#!/bin/bash
#SBATCH -a 0-2,5,7
myExecutable
Your program should then make use of the SLURM_ARRAY_TASK_ID environment variable, which for a given job will contain one of the values from the set given with the -a flag (in this case from {0,1,2,5,7}). You could, for example, read SLURM_ARRAY_TASK_ID into your R, Python, MATLAB, or C code.
Here’s a concrete example where it’s sufficient to use SLURM_ARRAY_TASK_ID to distinguish different input files if you need to run the same command (the bioinformatics program tophat in this case) on multiple input files (in this case, trans0.fq, trans1.fq, …):
#!/bin/bash
#SBATCH -a 0-2,5,7
tophat BowtieIndex trans${SLURM_ARRAY_TASK_ID}.fq
Submitting Data Parallel (SPMD) Code
Here’s how you would set up your job script if you want to run multiple instances (18 in this case) of the same code as part of a single job.
#!/bin/bash
#SBATCH --ntasks 18
srun myExecutable
To have each instance behave differently, you can make use of the SLURM_PROCID environment variable, which will be distinct (and have values 0, 1, 2, …) between the different instances.
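For instance, if myExecutable is itself a shell script, it might use SLURM_PROCID to pick a distinct input; the program name and input files below are hypothetical.
#!/bin/bash
# each of the 18 instances sees a different SLURM_PROCID (0, 1, 2, ...)
./myAnalysis --input data${SLURM_PROCID}.csv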
To have each process be threaded, see the syntax under the MPI section above.
“Manually” Automating Job Submission
The above approaches are more elegant, but you can also use UNIX shell tools to submit multiple Slurm jobs. Here are some approaches and example syntax. We’ve tested these a bit but email consult@econ.berkeley.edu if you have problems or find a better way to do this. (Of course you can also manually create lots of individual job submission scripts, each of which calls a different script.)
First, remember that each individual job should be submitted through sbatch.
Here is some example bash shell code (which could be placed in a shell script file) that loops over two variables (one numeric and the other a string):
for ((it = 1; it <= 10; it++)); do
for mode in short long; do
sbatch job.sh $it $mode
done
done
You now have a couple of options in terms of how job.sh is specified. This illustrates things for MATLAB jobs, but it shouldn’t be too hard to modify for other types of jobs.
Option #1
#!/bin/bash
# contents of job.sh
echo "it = $1; mode = '$2'; myMatlabCode" > tmp-$1-$2.m
matlab -nodesktop -nodisplay -singleCompThread < tmp-$1-$2.m > tmp-$1-$2.out 2> tmp-$1-$2.err
In this case myMatlabCode.m would use the variables ‘it’ and ‘mode’ but not define them.
Option #2
#!/bin/bash
# contents of job.sh
export it=$1; export mode=$2;
matlab -nodesktop -nodisplay -singleCompThread < myMatlabCode.m > tmp-$1-$2.out 2> tmp-$1-$2.err
In this case you need to insert the following MATLAB code at the start of myMatlabCode.m so that MATLAB correctly reads the values of ‘it’ and ‘mode’ from the UNIX environment variables:
it = str2num(getenv('it'));
mode = getenv('mode');
For Stata jobs, there’s an easier mechanism for passing arguments into a batch job. Invoke Stata as follows in job.sh:
stata -b do myStataCode.do $1 $2
and then in the first line of your Stata code file, myStataCode.do above, assign the input values to variables (here named id and mode, but they can be named anything):
args id mode
Then the remainder of your code can make use of these variables.