Updated 2021-05-17

Submit Multiple Jobs at Once with Job Arrays

Overview

  • Job Arrays are very useful if you want to submit many jobs at once
  • For example:
    • Submitting one job that you want to run with many different inputs

Specify Array Job in PBS Script

  • In your PBS script, where the #PBS directives are, add
  • #PBS -t a-b
  • Where a and b are the first and last array indices. Ex: #PBS -t 1-100 would run 100 jobs
  • You can also run individual jobs. Let's say you ran 100 jobs and a few of them failed, and you would like to rerun just the failed jobs:
  • #PBS -t 23,87,41,5 will rerun only jobs 23, 87, 41, and 5
  • All other directives (walltime, number of nodes and processors, etc.) are the same (a minimal directive block is sketched after this list).
  • Resources specified in the PBS script apply to each individual job, not the entire job array
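  • A minimal directive block for a 100-job array might look like the sketch below (the job name, account, queue, and resource values are placeholders; adjust them for your own job):
#PBS -N myArrayJob
#PBS -A [Account]
#PBS -l nodes=1:ppn=1
#PBS -l walltime=10:00
#PBS -l pmem=2gb
#PBS -q inferno
#PBS -t 1-100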

Limit the Number of Tasks That Run at Once

  • #PBS -t 1-100%10 will produce a 100-job array with only 10 jobs active at a time (shown in context below).
    • The scheduler by default will limit the number of concurrent array jobs based on resource availability
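  • In a script header, the throttled form simply replaces the plain range:
# 100-job array, at most 10 jobs eligible to run at any one time
#PBS -t 1-100%10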

View Job Info and Delete Job Arrays

Tip

If you want to view the output of a single job in your job array while it is running, use the command pace-qpeek followed by the job array ID and the job number enclosed in brackets, like this: pace-qpeek 123456[1] (this views the output of job 1 in job array 123456). If you are using a shell other than Bash, enclose everything after pace-qpeek in quotes to ensure that it is treated as a string literal.

  • After you have submitted the job as normal (using qsub <jobname>), viewing and deleting job arrays is slightly different (example commands are sketched below this list)
  • Check job status: qstat -t 22182721[], where the number followed by [] is the job ID (printed when the job is submitted)
  • To check specific jobs: qstat -t 65-85 will show the status of jobs 65-85
  • To delete an entire job array: qdel 22182721[]. Replace the example number with your job ID.
  • To delete specific jobs: qdel -t 65-85. This example deletes jobs 65-85.
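  • Put together, a typical sequence might look like the sketch below (myJob.pbs and the job ID 22182721 are placeholders; use your own script name and the ID returned by qsub):
qsub myJob.pbs              # submit the job array
qstat -t 22182721[]         # status of every job in the array
qstat -t 65-85              # status of jobs 65 through 85 only
pace-qpeek "22182721[7]"    # running output of job 7 (the quotes keep the brackets literal in any shell)
qdel -t 65-85               # delete jobs 65 through 85 only
qdel 22182721[]             # delete the entire array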

Walkthrough: Use a Job Array

  • This walkthrough will run a Python script 10 times with 10 different input files. Each input file has one line containing two integers, and the script adds them and prints the result. Log on and transfer the files to your account to follow along.

Part 1: The Input Script, Datafile, and PBS Script

  • Python Script: save the code below as arrayTest.py
import sys

def add(a, b):
    # Add the two integers and print the result
    total = a + b
    print("Sum of {} + {} = {}".format(a, b, total))

if __name__ == "__main__":
    # The input file is given as the first command-line argument and
    # contains a single line with two whitespace-separated integers
    with open(sys.argv[1]) as infile:
        a, b = [int(n) for n in infile.readline().split()]
    add(a, b)
  • Input files: run the following commands in the same directory as arrayTest.py to create 10 files of input
#!/bin/sh
# Create 10 input files, each containing one line with two integers
for i in 1 2 3 4 5 6 7 8 9 10
do
    num1=$(($i+1))
    num2=$(($i+2))
    echo $num1 $num2 > file-$i.txt
done
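  • Before submitting, you can sanity-check the script on one of the generated files (you may need to load the Python module first); the expected output is shown as a comment:
module load python/2.7
python arrayTest.py file-1.txt    # prints: Sum of 2 + 3 = 5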
  • PBS Script: save the code below as arrayTest.pbs
#PBS -N arrayTest
#PBS -A [Account] 
#PBS -l nodes=1:ppn=1
#PBS -l walltime=1:00
#PBS -l pmem=2gb
#PBS -q inferno
#PBS -j oe
#PBS -o arrayTest.out
#PBS -t 1-10


cd $PBS_O_WORKDIR
module load python/2.7
python arrayTest.py file-${PBS_ARRAYID}.txt
  • The #PBS directive lines are mostly standard, except you must add the #PBS -t <a-b> directive to run a job array
  • Be sure to specify the memory requirement using -l pmem=2gb; 2 GB is a safe, standard amount for this job.
  • $PBS_O_WORKDIR is a variable holding the directory you were in when you submitted the script. The cd command tells the cluster to enter $PBS_O_WORKDIR and look for the files it needs there.

Warning

The input files and python script must also be in the directory you submitted the PBS script from

  • In the computation section (below the directives), $PBS_ARRAYID holds the array index of the current job, in this case a number from 1 to 10. Since the input files are numbered 1 to 10, it is used to select the input file for each job, as illustrated below.
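  • For example, in the job with array index 3, the last line of the PBS script expands to:
python arrayTest.py file-3.txt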

Part 2: Submit Job and Check Status

  • Make sure you're in the directory that contains the PBS script
  • Submit as normal, with qsub <pbs script name>. In this case qsub arrayTest.pbs (a sample session is sketched below this list)
  • Check job status with qstat -t 22182721[], replacing the number with the job ID returned after running qsub
  • You can delete the job with qdel 22182721[], again replacing the number with the job ID returned after running qsub
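  • A minimal session might look like this (the job ID is illustrative; yours will differ):
qsub arrayTest.pbs     # prints an ID such as 22182721[].shared-sched.pace.gatech.edu
qstat -t 22182721[]    # one status line per job in the array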

Part 3: Collect Results

  • The result should be 10 output files, each printing the sum of two integers. Here is what arrayTest.out-7 should look like:
Job name:   arrayTest-7
Queue:      inferno
End PBS Prologue Mon Aug 27 09:04:06 EDT 2018
---------------------------------------
Sum of 8 + 9 = 17
---------------------------------------
Begin PBS Epilogue Mon Aug 27 09:04:07 EDT 2018
Job ID:     22195011[7].shared-sched.pace.gatech.edu
  • After the result files are produced, you can move them off the cluster; refer to the file transfer guide for help.
  • Congratulations! You successfully ran a job array.

Walkthrough: Submitting Multiple Jobs with Multiple Variables

  • You may wish to submit multiple jobs to test the effect of changing several different input variables.
  • You cannot create multi-dimensional job arrays, but you can achieve the same result by creating a "flat" input file that lists the different combinations of variables.
  • This walkthrough will run a python script 100 times using 2 inputs, each with 10 possible values, specified on each line of an input file.
  • The Python script can be found here
  • The input file creator can be found here
  • PBS script can be found here
  • You can transfer the files to your account on the cluster to follow along. The file transfer guide may be helpful.

Part 1: Creating the Input File

  • You can use a bash script to generate a file containing the different combinations of inputs that you would like.
#!/bin/bash
# Write every combination of i (1-10) and x (1-10) to input.txt, one pair per line
> input.txt
for ((i = 1; i <= 10; i++))
do
        for ((x = 1; x <= 10; x++))
        do
                echo $i $x >> input.txt
        done
done
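  • You can verify the generated file from the command line; line 30, which array job 30 will read, contains 3 10:
head -3 input.txt          # prints: 1 1, 1 2, 1 3 (one pair per line)
sed -n "30 p" input.txt    # prints: 3 10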
  • Rather than using a set of inputs that follows a pattern that is easy to loop through, you could specify the inputs explicitly by replacing for ((i = 1; i <= 10; i++)) with for i in 1 15 24 123 71 63 (or whatever other values you'd like to use).

Part 2: The PBS Script

#PBS -N arrayTest
#PBS -A [Account] 
#PBS -l nodes=1:ppn=1
#PBS -l walltime=1:00
#PBS -l pmem=2gb
#PBS -q inferno
#PBS -j oe
#PBS -o arrayTest.out
#PBS -t 1-100

cd $PBS_O_WORKDIR
module load python/2.7

# Get parameters from the input.txt file, using $PBS_ARRAYID as the line number
params=`sed -n "${PBS_ARRAYID} p" input.txt`
paramsArray=($params)
i=${paramsArray[0]}
x=${paramsArray[1]}

python multiArrayTest.py $i $x

  • The #PBS directive lines are mostly standard, except you must add the #PBS -t <a-b> directive to run a job array
  • Be sure to specify the memory requirement using -l pmem=2gb; 2 GB is a safe, standard amount for this job.
  • $PBS_O_WORKDIR is a variable holding the directory you were in when you submitted the script. The cd command tells the cluster to enter $PBS_O_WORKDIR and look for the files it needs there.
  • module load python/2.7 loads Python 2.7, which is used in this job.
  • params=`sed -n "${PBS_ARRAYID} p" input.txt` stores the line of input.txt whose line number matches the current job's array ID in params.
  • paramsArray=($params) converts the contents of params into a bash array so the inputs can be accessed individually.
  • i=${paramsArray[0]} stores the first input found in paramsArray in i.
  • x=${paramsArray[1]} stores the second input found in paramsArray in x.
  • python multiArrayTest.py $i $x passes the values stored in i and x as arguments to the Python script (a concrete example follows this list).
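  • For example, array job 30 reads line 30 of input.txt, which contains 3 10, so the extraction works out as:
params=`sed -n "30 p" input.txt`    # params is "3 10"
paramsArray=($params)               # paramsArray[0] is 3, paramsArray[1] is 10
i=${paramsArray[0]}                 # i=3
x=${paramsArray[1]}                 # x=10
python multiArrayTest.py 3 10       # produces the "sum of 3 and 10 is 13" line shown below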

Part 3: Submit Job and Check Status

  • Make sure you're in the directory that contains the PBS script
  • Submit as normal, with qsub <pbs script name>. In this case qsub jobArray.pbs
  • Check job status with qstat -t 22182721[], replacing the number with the job id returned after running qsub
  • You can delete the job with qdel 22182721[], again replacing the number with the jobid returned after running qsub

Part 4: Collect Results

  • The result should be 100 output files, each printing the sum of two integers. Here is what arrayTest.out-30 should look like:
---------------------------------------
Begin PBS Prologue Fri Jun 28 13:15:37 EDT 2019
Job ID:     26248936[30].shared-sched.pace.gatech.edu
User ID:    svemuri8
Job name:   arrayTest-30
Queue:      inferno
End PBS Prologue Fri Jun 28 13:15:37 EDT 2019
---------------------------------------
sum of  3  and  10  is  13
---------------------------------------
Begin PBS Epilogue Fri Jun 28 13:15:37 EDT 2019
Job ID:     26248936[30].shared-sched.pace.gatech.edu
User ID:    svemuri8
Job name:   arrayTest-30
Resources:  neednodes=1:ppn=1,nodes=1:ppn=1,pmem=2gb,walltime=00:01:00
Rsrc Used:  cput=00:00:00,energy_used=0,mem=6956kb,vmem=237896kb,walltime=00:00:00
Queue:      inferno
Nodes:
iw-k41-38-r.pace.gatech.edu
End PBS Epilogue Fri Jun 28 13:15:37 EDT 2019
---------------------------------------
  • After the result files are produced, you can move them off the cluster; refer to the file transfer guide for help.
  • Congratulations! You successfully ran a job array with multiple variables using an input file.