Updated 2022-08-08
Run a Large Number of Jobs Concurrently with GNU Parallel
Overview
- Many domains and problems require a large number of tests, each using a different set of parameters.
- This guide covers how to run a large number of processes concurrently as a single job using GNU Parallel, rather than executing many runs sequentially, which would take much longer.
- In general, this workflow is helpful for running batches of 10-100 jobs that use similar resources and environment (number of cores, amount of memory, walltime, modules, etc.)
Summary of Procedure
- Create a `<JOBFILE>` containing the commands to run your jobs
    - Each line will be a single "job" to be run on the cluster
- Submit your job by running `pace-gnu-job -G <JOBFILE> -q <QUEUENAME>`, replacing `<JOBFILE>` and `<QUEUENAME>` with the appropriate values
    - Important: If you are using the Phoenix or Firebird clusters, you must include the `-A` flag followed by your account name: `pace-gnu-job -G <JOBFILE> -q <QUEUENAME> -A <ACCOUNTNAME>`
        - Run `pace-quota` to see your available accounts
- Running `pace-gnu-job` will create a PBS script that invokes GNU Parallel to launch the contents of `<JOBFILE>` as a single job submission
    - `-G <JOBFILE>` and `-q <QUEUENAME>` are required arguments
    - To run optimally on the system, `pace-gnu-job` will request 1 node per line in `<JOBFILE>` with the appropriate `<PPN>` per task
        - This takes advantage of the relatively high abundance of small batches of processors throughout the cluster to minimize queue wait times
    - To request more than one core per job, append the request to the jobfile argument: `-G <JOBFILE>,<PPN>`
    - Many of the standard `qsub` options are available for use (a combined example follows this list):
        - `-l` can be used to set resource requests such as walltime, pmem, and pvmem
        - `-W` can be used to set job dependencies
        - `-N` can be used to set a custom name for the job
        - `-j` can be used to set the output flags (e.g. `-j oe`)
        - `-o` can be used to set the output file path
    - To load any modules needed for the tasks, use the `--modules=<MODULELIST>` option
        - `<MODULELIST>` should be a comma-separated list of modules, including versions (e.g. `--modules=intel/17.0,mvapich2/2.3,hdf5/1.8.19,netcdf/4.3.3`)
        - Modules will be loaded in the order listed, so please ensure that dependencies are listed first
- Once completed, the output to `stdout` and `stderr` for each line of `<JOBFILE>` will be collected and concatenated in the PBS output/error files
- Additionally, a GNU Parallel joblog is created that details the hostname, start time, duration, and exit code for each of the commands in `<JOBFILE>`
    - The filename for the joblog is `$PBS_JOBNAME.gnu<JOBID>`
    - The columns of the output log correspond to `Seq, Host, Starttime, Runtime, Send, Receive, Exitval, Signal, Command`
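To make these options concrete, a combined invocation might look like the sketch below. The jobfile name `myjobs.jobs`, the queue, job name, walltime, and module list are illustrative placeholders, not prescribed values:

```bash
# Hypothetical submission: run each line of myjobs.jobs as a task with
# 4 cores (PPN) apiece, a 2-hour walltime, a custom job name, and two
# modules loaded (dependency first) before the tasks start
pace-gnu-job -G myjobs.jobs,4 -q inferno \
    -N mysweep \
    -l walltime=2:00:00 \
    --modules=intel/17.0,mvapich2/2.3
```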
Important

The resource options for `nodes` and `ppn` CANNOT be set using the `-l` argument. `nodes` is not a user-configurable parameter, and is determined by the number of lines in `<JOBFILE>`. `ppn` should only be used for multithreaded jobs, and can be set by appending `<JOBFILE>` with the appropriate request, i.e. `-G <JOBFILE>,<PPN>`.
Warning
As jobs run through pace-gnu-job
share resources, it is best to use pmem
and pvmem
instead of mem
and vmem
to ensure each task has sufficient memory allocated. pace-gnu-job
will warn the user that requests for mem
and vmem
are ignored
and values for pmem
and pvmem
will be used instead (default values are 1gb and 2gb, respectively).
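For instance, a per-task memory request raising those defaults might look like the following (the 4gb and 8gb values are hypothetical):

```bash
# Per-process memory requests; mem/vmem would be ignored with a warning
pace-gnu-job -G myjobs.jobs -q inferno -l pmem=4gb,pvmem=8gb
```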
Single Core Example
- In this example, a parameter sweep of a python script will be run simultaneously as one job
- The python script, `trajectory.py`, accepts 4 arguments (initial speed, initial angle, mass, and ball diameter) and prints the horizontal distance traveled by the ball
- The job file, `ParSweep.jobs`, lists the calls to `trajectory.py` with the parameters to be tested
- Rather than creating 14 separate jobs, each run will be included in a single request
- Once run through `pace-gnu-job`, 3 files will be generated:
    - `ParSweep.o<JOBID>` contains the collected output to stdout from each execution, plus the standard PBS prologue and epilogue
    - `ParSweep.e<JOBID>` contains the collected output to stderr from each execution
    - `ParSweep.gnu<JOBID>` contains the GNU Parallel job log information
- Save the following python script as `trajectory.py`

```python
import sys, math

# Initial position and integration time step (seconds)
x, y, dt = 0, 0, .005

# Command-line arguments: initial speed (m/s), launch angle (degrees),
# mass (kg), and ball diameter (m)
v0 = float(sys.argv[1])
theta = float(sys.argv[2])
m = float(sys.argv[3])
D = float(sys.argv[4])

# Initial velocity components
vx = v0*math.cos(math.radians(theta))
vy = v0*math.sin(math.radians(theta))

# Combined drag constant: 0.5*rho*Cd*A for a sphere
# (rho = 1.225 kg/m^3, Cd = 0.5, A = pi*D^2/4)
c = math.pi*1.225*D**2/16.

print('A ball of mass %0.3f kg and diameter %0.3f m is launched at %0.2f m/s at an angle of %0.1f degrees' %(m,D,v0,theta))

# Euler integration until the ball returns to the ground
while True:
    v = math.sqrt(vx**2+vy**2)
    vx -= c*v*vx*dt/m
    vy -= (9.81+c*v*vy/m)*dt
    x += vx*dt
    y += vy*dt
    if y <= 0:
        break

print("The ball traveled %0.2f meters before hitting the ground." %x)
```
- Save the following lines to `ParSweep.jobs` to list the commands to be run as each task within the job

```
python trajectory.py 30 15 0.145 0.075
python trajectory.py 30 30 0.145 0.075
python trajectory.py 30 45 0.145 0.075
python trajectory.py 30 60 0.145 0.075
python trajectory.py 30 75 0.145 0.075
python trajectory.py 30 90 0.145 0.075
python trajectory.py 20 45 0.145 0.075
python trajectory.py 25 45 0.145 0.075
python trajectory.py 35 45 0.145 0.075
python trajectory.py 40 45 0.145 0.075
python trajectory.py 30 45 0.145 0.060
python trajectory.py 30 45 0.145 0.090
python trajectory.py 30 45 0.125 0.075
python trajectory.py 30 45 0.165 0.075
```
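For larger sweeps, writing the jobfile by hand becomes tedious; it can also be generated with a short shell loop. A minimal sketch reproducing the angle-sweep portion of the file above:

```bash
# Emit one command per line: sweep the launch angle at fixed speed,
# mass, and diameter; each line becomes one task in the job
for angle in 15 30 45 60 75 90; do
    echo "python trajectory.py 30 $angle 0.145 0.075"
done > ParSweep.jobs
```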
- Run the job using `pace-gnu-job`, being sure to include `-G ParSweep.jobs` and `-q <QUEUENAME>`
    - For example, to run this job on inferno, run `pace-gnu-job -G ParSweep.jobs -q inferno`
    - The script will indicate the creation and submission of the PBS file, as well as the `<JOBID>` for the job
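If this example were run on the Phoenix or Firebird clusters, the `-A` account flag described above would also be required; a sketch with a placeholder account name:

```bash
# Phoenix/Firebird variant: -A with your account name is required
# (run pace-quota to see your available accounts)
pace-gnu-job -G ParSweep.jobs -q inferno -A GT-xxxxxx
```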
- Once completed, three files are created to log the run:
    - `ParSweep.o27231865` provides the collected output to stdout from each call to the python script
    - `ParSweep.e27231865` provides the collected output to stderr from each call to the python script (note that the output from `module list` prints to stderr)
    - `ParSweep.gnu27231865` provides the GNU Parallel joblog information for the job (one way to inspect it is shown below)
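Because the joblog's `Exitval` column records each command's exit code, failed tasks can be spotted by filtering on it. A small sketch using the job ID from this example run:

```bash
# Print the joblog header plus any task whose Exitval (column 7) is nonzero
awk 'NR == 1 || $7 != 0' ParSweep.gnu27231865
```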