Updated 2022-08-08

Run a Large Number of Jobs Concurrently with GNU Parallel

Overview

  • Many domains and problems require a large number of tests, each using a different set of parameters.
  • This guide covers how to use GNU Parallel to run a large number of processes concurrently within a single job, rather than executing a large number of runs sequentially, which takes much longer.
  • In general, this approach is helpful for running batches of 10-100 jobs that use similar resources and environment (number of cores, amount of memory, walltime, modules, etc.)

Summary of Procedure

  • Create a <JOBFILE> containing the commands to run your jobs
    • Each line will be a single "job" to be run on the cluster
  • Submit your job by running pace-gnu-job -G <JOBFILE> -q <QUEUENAME>, replacing <JOBFILE> and <QUEUENAME> with the appropriate values
    • Important: If you are using Phoenix or Firebird clusters, you must include the -A flag followed by your account name: pace-gnu-job -G <JOBFILE> -q <QUEUENAME> -A <ACCOUNTNAME>
      • Run pace-quota to see your available accounts
    • pace-gnu-job will create a PBS script that invokes GNU Parallel to launch the contents of <JOBFILE> as a single job submission
    • -G <JOBFILE> and -q <QUEUENAME> are required arguments
    • To run optimally on the system, pace-gnu-job will request 1 node per line in <JOBFILE> with the appropriate <PPN> per task
      • This takes advantage of the relatively high abundance of small batches of processors throughout the cluster to minimize queue wait times
      • To request more than one core per job, append the request to the jobfile: -G <JOBFILE>,<PPN>
    • Many of the standard qsub options are available for use:
      • -l can be used to set resource requests such as walltime, pmem, and pvmem
      • -W can be used to set job dependencies
      • -N can be used to set a custom name for the job
      • -j can be used to set the output flags (e.g. -j oe)
      • -o can be used to set the output file path
    • To load any modules needed for the tasks, use the --modules=<MODULELIST> option
      • <MODULELIST> should be a comma-separated list of modules, including version (e.g. --modules=intel/17.0,mvapich2/2.3,hdf5/1.8.19,netcdf/4.3.3)
      • Modules will be loaded in the order listed, so please ensure that the dependencies are listed first
  • Once the job completes, the stdout and stderr output for each line of <JOBFILE> will be collected and concatenated into the PBS output/error files
  • Additionally, a GNU Parallel joblog is created that details the hostname, start time, duration, and exit code for each of the commands in <JOBFILE>
    • The filename for the joblog is $PBS_JOBNAME.gnu<JOBID>
    • The columns of the output log correspond to Seq, Host, Starttime, Runtime, Send, Receive, Exitval, Signal, Command
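As a concrete illustration, a submission combining several of these options might look like the following (the job file name, job name, walltime, and pmem values here are placeholders; the queue and module versions are taken from the examples in this guide):

pace-gnu-job -G sweep.jobs -q inferno -N sweep -l walltime=4:00:00,pmem=2gb --modules=intel/17.0,hdf5/1.8.19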

Important

The resource options for nodes and ppn CANNOT be set using the -l argument. nodes is not a user-configurable parameter; it is determined by the number of lines in <JOBFILE>. ppn should only be used for multithreaded jobs, and can be set by appending the request to <JOBFILE>, i.e. -G <JOBFILE>,<PPN>
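For example, if each line of the job file runs a program that uses 4 threads (the file and queue names here are placeholders), the request would be:

pace-gnu-job -G threaded.jobs,4 -q inferno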

Warning

Because tasks run through pace-gnu-job share node resources, it is best to use pmem and pvmem instead of mem and vmem to ensure each task has sufficient memory allocated. pace-gnu-job will warn the user that requests for mem and vmem are ignored; the values for pmem and pvmem will be used instead (default values are 1gb and 2gb, respectively).
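For instance, to raise the per-task memory from the 1gb default to 4gb (the job file and queue names here are placeholders), pass the request with -l:

pace-gnu-job -G sweep.jobs -q inferno -l pmem=4gb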

Single Core Example

  • In this example, a parameter sweep of a Python script is run as a single job, with all parameter combinations executing concurrently
    • The Python script, trajectory.py, accepts 4 arguments (initial speed, initial angle, mass, and ball diameter) and prints the horizontal distance traveled by the ball
    • The job file, ParSweep.jobs, lists the calls to trajectory.py with the parameters to be tested
      • Rather than creating 14 separate jobs, each run will be included in a single request
    • Once run through pace-gnu-job, 3 files will be generated:
      • ParSweep.o<JOBID> contains the collected output to stdout from each execution, plus the standard PBS prologue and epilogue
      • ParSweep.e<JOBID> contains the collected output to stderr from each execution
      • ParSweep.gnu<JOBID> contains the GNU Parallel job log information
  • Save the following Python script as trajectory.py
import sys, math

# Position (x, y) in meters and time step (s) for Euler integration
x, y, dt = 0, 0, .005

# Command-line arguments: initial speed (m/s), launch angle (degrees),
# mass (kg), and ball diameter (m)
v0 = float(sys.argv[1])
theta = float(sys.argv[2])
m = float(sys.argv[3])
D = float(sys.argv[4])

# Initial velocity components
vx = v0*math.cos(math.radians(theta))
vy = v0*math.sin(math.radians(theta))

# Quadratic-drag constant: (1/2)*rho*Cd*A with air density rho = 1.225 kg/m^3,
# drag coefficient Cd = 0.5, and cross-sectional area A = pi*D**2/4
c = math.pi*1.225*D**2/16.

print('A ball of mass %0.3f kg and diameter %0.3f m is launched at %0.2f m/s at an angle of %0.1f degrees' % (m, D, v0, theta))

# Step the equations of motion forward until the ball returns to the ground
while True:
    v = math.sqrt(vx**2 + vy**2)
    vx -= c*v*vx*dt/m
    vy -= (9.81 + c*v*vy/m)*dt
    x += vx*dt
    y += vy*dt
    if y <= 0:
        break

print("The ball traveled %0.2f meters before hitting the ground." % x)
  • Save the following lines to ParSweep.jobs to list the commands to be run as each task within the job
python trajectory.py 30 15 0.145 0.075
python trajectory.py 30 30 0.145 0.075
python trajectory.py 30 45 0.145 0.075
python trajectory.py 30 60 0.145 0.075
python trajectory.py 30 75 0.145 0.075
python trajectory.py 30 90 0.145 0.075
python trajectory.py 20 45 0.145 0.075
python trajectory.py 25 45 0.145 0.075
python trajectory.py 35 45 0.145 0.075
python trajectory.py 40 45 0.145 0.075
python trajectory.py 30 45 0.145 0.060
python trajectory.py 30 45 0.145 0.090
python trajectory.py 30 45 0.125 0.075
python trajectory.py 30 45 0.165 0.075
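Typing every combination by hand can be error-prone for larger sweeps. As a sketch, a short helper script like the following (hypothetical, named make_jobs.py here; not part of pace-gnu-job) could generate an equivalent ParSweep.jobs:

# make_jobs.py -- illustrative helper that writes the 14 parameter
# combinations above to ParSweep.jobs (one command per line)
runs = []
for angle in (15, 30, 45, 60, 75, 90):   # sweep launch angle
    runs.append((30, angle, 0.145, 0.075))
for speed in (20, 25, 35, 40):           # sweep initial speed
    runs.append((speed, 45, 0.145, 0.075))
for diameter in (0.060, 0.090):          # sweep ball diameter
    runs.append((30, 45, 0.145, diameter))
for mass in (0.125, 0.165):              # sweep ball mass
    runs.append((30, 45, mass, 0.075))

with open('ParSweep.jobs', 'w') as f:
    for v0, theta, m, D in runs:
        f.write('python trajectory.py %g %g %g %g\n' % (v0, theta, m, D))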
  • Run the job using pace-gnu-job, being sure to include -G ParSweep.jobs and -q <QUEUENAME>
    • For example, to run this job on inferno, run pace-gnu-job -G ParSweep.jobs -q inferno
    • The script will indicate the creation and submission of the PBS file, as well as the <JOBID> for the job:

[Screenshot: pace-gnu-job reports the creation and submission of the PBS script, along with the assigned <JOBID>]

  • Once completed, three files are created to log the run:
    • ParSweep.o27231865 provides the collected output to stdout from each call to the Python script
    • ParSweep.e27231865 provides the collected output to stderr from each call to the Python script (note that the output from module list prints to stderr)
    • ParSweep.gnu27231865 provides the GNU Parallel joblog information for the job
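Since the joblog records the exit status of every command (the Exitval column, i.e. the seventh field), a quick way to spot failed tasks is to filter on it, assuming the first line of the log is the column header:

awk 'NR > 1 && $7 != 0' ParSweep.gnu27231865

Any lines printed correspond to commands that exited with a nonzero status and may need to be rerun.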