Updated 2021-05-17
Chain Jobs / Use Job Dependencies¶
Overview¶
- The scheduler provides many tools to:
- chain multiple jobs together (run one after another)
- set conditions for when the next job will run (ex: only run job 2 if job 1 succeeds)
Process¶
- Chaining jobs together is achieved by writing a bash script that serves as a sort of controller for the jobs
Important
The Bash script is not the same as a PBS script. The Bash script controls the jobs (PBS scripts) and how they are run.
- The scheduler provides many dependency tools, allowing users to define which jobs will run based on whether other jobs succeed or fail
Example Dependency Options¶
- These options control how jobs interact and are run
- Placed in the bash script, in the `qsub -W depend=<dependency_list> <dependent_job>` line
- Example options:
- after: start job after listed jobs have begun
- afterok: start job only after other job(s) have run successfully
- before: job may start any time before specified jobs have started execution
- afternotok: start job only after specified jobs have terminated with errors (i.e., failed); see the example after this list
- A complete list of options can be found here
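For instance, afternotok is useful for submitting a cleanup or debugging job that should run only when another job fails. A minimal sketch, where cleanup.pbs and the Job ID are hypothetical placeholders:

```bash
# Run cleanup.pbs only if job 22182721 terminates with errors (afternotok dependency)
qsub -W depend=afternotok:22182721 cleanup.pbs
```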
Dependency on a Specific Job ID¶
- You can create a dependency on a specific Job ID by putting the Job ID in the dependency list.
- Ex: `qsub -W depend=afterok:22182721 job2.pbs` makes the submission of job2.pbs dependent on the successful completion of job1.pbs, which has the Job ID 22182721.
Dependency on Multiple Jobs¶
- You can create a dependency on multiple jobs by separating them with `:`.
- Ex: `qsub -W depend=afterok:22182721:22182722 job3.pbs` makes the submission of job3.pbs dependent on the successful completion of jobs 22182721 and 22182722.
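In practice the Job IDs are usually not known ahead of time. A minimal sketch that captures them at submission time and builds the colon-separated dependency list (jobA.pbs, jobB.pbs, and job3.pbs are hypothetical file names):

```bash
#!/bin/bash
# Submit two independent jobs and capture their Job IDs
jobA=$(qsub jobA.pbs)
jobB=$(qsub jobB.pbs)

# Submit job3 so that it starts only after both jobs complete successfully
qsub -W depend=afterok:$jobA:$jobB job3.pbs
```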
Walkthrough: Chain Jobs Together¶
- This walkthrough will use two jobs and require that the second job run only if the first job is successful
- Both jobs simply print out the node they were started on
- First job: job1.pbs
- Second job: job2.pbs
- Bash script: jobDepend.sh
- You can transfer the files to your account on the cluster to follow along. The file transfer guide may be helpful.
Part 1: The PBS Scripts¶
#PBS -N job1
#PBS -A GT-gburdell3
#PBS -l nodes=1:ppn=2
#PBS -l walltime=1:00
#PBS -q inferno
#PBS -j oe
#PBS -o job1.out
cd $PBS_O_WORKDIR
echo "Job1 started on `/bin/hostname`"
- job2.pbs is exactly the same, but prints out "Job2 started on ..." instead of "Job1 started on ..." (shown below for reference)
- The #PBS directives are standard, requesting just 1 minute of walltime and 1 node with 2 cores. More on #PBS directives can be found in the PBS guide
- $PBS_O_WORKDIR is simply a variable that represents the directory you submit the PBS script from
- echo prints the phrase to the out file
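For reference, job2.pbs would look like this, assuming it differs only in the job name, output file, and echoed message:

```bash
#PBS -N job2
#PBS -A GT-gburdell3
#PBS -l nodes=1:ppn=2
#PBS -l walltime=1:00
#PBS -q inferno
#PBS -j oe
#PBS -o job2.out
cd $PBS_O_WORKDIR
echo "Job2 started on `/bin/hostname`"
```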
Part 2: Bash Script¶
#!/bin/bash
first=$(qsub job1.pbs)
echo $first
second=$(qsub -W depend=afterok:$first job2.pbs)
echo $second
- Instead of using qsub directly, the bash script serves as the controller and handles all the job submission "logic", i.e. which jobs should run under which conditions (a sketch of a longer chain follows this list)
- This bash script can be used as a template, or you can create your own
- Overview:
    - first is a variable that captures the output of qsub job1.pbs, which submits job1 normally and returns its Job ID
    - second is a variable that captures the output of the command that submits job2 only if job1 completes successfully
    - -W: additional attributes flag for qsub, allows you to specify dependencies
    - depend=<dependency_list>: defines the dependencies between this job and other jobs
    - afterok: option stating the job may be started at any time after all specified jobs have successfully completed
    - job2.pbs will only run after job1.pbs has completed successfully
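The same pattern extends naturally to longer chains. A minimal sketch, assuming a hypothetical third job script job3.pbs:

```bash
#!/bin/bash
# Each qsub call returns a Job ID, which the next submission depends on
first=$(qsub job1.pbs)
echo $first
second=$(qsub -W depend=afterok:$first job2.pbs)
echo $second
third=$(qsub -W depend=afterok:$second job3.pbs)   # job3.pbs is a hypothetical third job
echo $third
```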
Part 3: Submitting the Jobs¶
- To submit the jobs, run the bash script. It will handle qsub and the job dependencies for you
- For the walkthrough, use ./jobDepend.sh to execute the bash script and run the jobs. The bash script will have to be made executable first (see the combined example after this list)
- More information on creating and making bash scripts executable can be found here
- Check job status with qstat -u gtusername3 -n, replacing gtusername3 with your GT username
- You can delete a job with qdel 22182721, replacing the number with the Job ID returned after running qsub
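Putting Part 3 together, the full command-line sequence might look like this (gtusername3 and the Job ID are placeholders):

```bash
chmod +x jobDepend.sh     # make the controller script executable (only needed once)
./jobDepend.sh            # submits job1, then job2 with the afterok dependency

qstat -u gtusername3 -n   # check job status; a dependent job typically shows as held ("H")
                          # until its dependency is satisfied
qdel 22182721             # optionally delete a job, using its Job ID
```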
Part 4: Collecting Results¶
- The results of the jobs will be stored as normal
- In the directory where you submitted the Bash script, you should see job1.out and job2.out files, which contain the results of the jobs. Use cat *.out or open the files in a text editor to take a look.
- job1.out should look like this:
Job name: job1
Queue: inferno
End PBS Prologue Mon Oct 15 11:01:50 EDT 2018
---------------------------------------
Job1 started on iw-c39-29-r.pace.gatech.edu
---------------------------------------
Begin PBS Epilogue Mon Oct 15 11:01:50 EDT 2018
Job ID: 22713426.shared-sched.pace.gatech.edu
- job2.out should look like this:
Job name: job2
Queue: inferno
End PBS Prologue Mon Oct 15 11:02:00 EDT 2018
---------------------------------------
Job2 started on iw-c39-29-r.pace.gatech.edu
---------------------------------------
Begin PBS Epilogue Mon Oct 15 11:02:00 EDT 2018
Job ID: 22713427.shared-sched.pace.gatech.edu
- After the result files are produced, you can move the files off the cluster; refer to the file transfer guide for help.
- Congratulations! You successfully ran multiple jobs with job dependencies on the cluster.