Updated 2019-06-28

Chain Jobs / Use Job Dependencies

Overview

  • The scheduler provides many tools to:
    • chain multiple jobs together (run one after another)
    • set conditions for when the next job will run (e.g., run job 2 only if job 1 succeeds)

Process

  • Chaining jobs together is achieved by writing a bash script that serves as a sort of controller for the jobs

Important

The Bash script is not the same as a PBS script. The Bash script controls the jobs (PBS scripts) and how they are run.

  • The scheduler provides many dependency options, which let users define which jobs run depending on whether other jobs succeed or fail

Example Dependency Options

  • These options control how jobs interact and are run
  • They are placed in the bash script, in the line qsub -W depend=<dependency_list> <dependent_job>
  • Example options:
    • after: start job after listed jobs have begun
    • afterok: start job only after other job(s) have run successfully
    • before: job may start any time before specified jobs have started execution
    • afternotok: job may start at any time after all specified jobs have completed unsuccessfully
  • A complete list of options can be found here

Dependency on a Specific Job ID

  • You can create a dependency on a specific Job ID by putting the Job ID in the dependency list.
    • Ex: qsub -W depend=afterok:22182721 job2.pbs makes job2.pbs run only after the job with Job ID 22182721 (job1.pbs in this example) has completed successfully.

Dependency on Multiple Jobs

  • You can create a dependency on multiple jobs by separating their Job IDs with :.
    • Ex: qsub -W depend=afterok:22182721:22182722 job3.pbs makes job3.pbs run only after jobs 22182721 and 22182722 have both completed successfully.
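
When the Job IDs are held in shell variables, the dependency list can be assembled rather than typed by hand. The following is a minimal sketch of that idea; depend_list is a hypothetical helper name introduced here for illustration and is not part of PBS.

```shell
#!/bin/bash
# depend_list (hypothetical helper): joins a dependency option and any
# number of Job IDs with ":" to form a dependency list for qsub -W depend=
depend_list() {
    local kind=$1 id list
    shift
    list=$kind
    for id in "$@"; do
        list="$list:$id"
    done
    echo "$list"
}

# Prints: afterok:22182721:22182722
depend_list afterok 22182721 22182722
```

The result would then be passed to qsub in the usual way, for example qsub -W depend="$(depend_list afterok 22182721 22182722)" job3.pbs.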

Walkthrough: Chain Jobs Together

  • This walkthrough uses two jobs and requires that the second job run only if the first job is successful
  • Both jobs simply print out the node they were started on
  • First job: job1.pbs
  • Second job: job2.pbs
  • Bash script: jobDepend.sh
  • You can transfer the files to your account on the cluster to follow along. The file transfer guide may be helpful.

Part 1: The PBS Scripts

#PBS -N job1
#PBS -l nodes=1:ppn=2
#PBS -l walltime=1:00
#PBS -q iw-shared-6
#PBS -j oe
#PBS -o job1.out

cd $PBS_O_WORKDIR

echo "Job1 started on `/bin/hostname`"
  • job2.pbs is the same, except that it uses job2 for the job name and output file and prints "Job2 started on ..." instead of "Job1 started on ..."
  • The #PBS directives are standard, requesting just 1 minute of walltime and 1 node with 2 cores. More on #PBS directives can be found in the PBS guide
  • $PBS_O_WORKDIR is simply a variable that represents the directory you submit the PBS script from.
  • echo prints the message, which ends up in the output file (job1.out for the first job)
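
For reference, job2.pbs might look like the sketch below. It is reconstructed from the description above and from the job name and output file that appear in the results later in this guide, so treat it as an assumption rather than the exact file.

```shell
#PBS -N job2
#PBS -l nodes=1:ppn=2
#PBS -l walltime=1:00
#PBS -q iw-shared-6
#PBS -j oe
#PBS -o job2.out

# Move to the directory the job was submitted from
cd $PBS_O_WORKDIR

# Same message as job1.pbs, with the job number changed
echo "Job2 started on $(/bin/hostname)"
```

The $(...) form is equivalent to the backticks used in job1.pbs.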

Part 2: Bash Script

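#!/bin/bash
# Submit job1 and capture the Job ID that qsub prints
first=$(qsub job1.pbs)
echo "$first"
# Submit job2 so that it runs only if job1 finishes successfully
second=$(qsub -W depend=afterok:"$first" job2.pbs)
echo "$second"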
  • Instead of using qsub directly, the bash script will serve as the controller and handle all the job submission "logic", as in what jobs should run depending on what conditions
  • This bash script can be used as a template, or you can create your own
  • Overview: first is a variable that stores the output of the command qsub job1.pbs, which is the Job ID of job1. This command simply submits job1 normally
  • second is a variable that stores the Job ID returned when job2 is submitted with a dependency, so that job2 runs only if job1 completes successfully
    • qsub -W: additional attributes flag, allows you to specify dependencies
    • depend=<dependency_list>: defines the dependencies between this job and other jobs
    • afterok: option stating that the job may be started at any time after all specified jobs have successfully completed. job2.pbs will run only after job1.pbs has completed successfully
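
The same pattern extends to longer chains. The sketch below is one possible generalization, not the guide's own template: submit_chain is a hypothetical helper name, and the sketch assumes that qsub prints only the Job ID on success, as in the script above.

```shell
#!/bin/bash
# submit_chain (hypothetical helper): submits the given PBS scripts in order,
# making each one depend (afterok) on the job submitted just before it.
submit_chain() {
    local prev="" script jobid
    for script in "$@"; do
        if [ -z "$prev" ]; then
            # First job: no dependency
            jobid=$(qsub "$script") || return 1
        else
            # Later jobs: start only if the previous job succeeded
            jobid=$(qsub -W depend=afterok:"$prev" "$script") || return 1
        fi
        # Stop if qsub printed nothing, since there is no ID to depend on
        [ -n "$jobid" ] || return 1
        echo "$script -> $jobid"
        prev=$jobid
    done
}

# Example call: submit_chain job1.pbs job2.pbs
```

Stopping as soon as a submission fails avoids submitting later jobs with an empty or invalid dependency list.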

Part 3: Submitting the Jobs

  • To submit the jobs, run the bash script. It will handle qsub and the job dependencies for you
  • For the walkthrough, use ./jobDepend.sh to execute the bash script and run the jobs. The bash script will have to be made executable first (for example, with chmod +x jobDepend.sh).
  • More information on creating and making bash scripts executable can be found here
  • Check job status with qstat -u gtusername3 -n, replacing gtusername3 with your GT username
  • You can delete a job with qdel 22182721, replacing the number with the Job ID returned after running qsub.
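
If the Job ID you have is in the full form shown in the job output (number followed by the scheduler host name), the numeric part can be extracted with shell parameter expansion. This is a small sketch; the ID below is the one from the sample output in this guide, and the variable names are illustrative.

```shell
#!/bin/bash
# Full Job ID in the form printed in the job epilogue
full_id="22713426.shared-sched.pace.gatech.edu"

# Remove everything from the first "." onward, keeping only the number
short_id=${full_id%%.*}

echo "$short_id"   # prints 22713426
```

The short form can then be used in commands such as qdel "$short_id".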

Part 4: Collecting Results

  • The results of the jobs will be stored as normal
  • In the directory where you submitted the Bash script, you should see job1.out and job2.out files, which contain the results of the job. Use cat *.out or open the files in a text editor to take a look.
  • job1.out should look like this:
Job name:   job1
Queue:      iw-shared-6
End PBS Prologue Mon Oct 15 11:01:50 EDT 2018
---------------------------------------
Job1 started on iw-c39-29-r.pace.gatech.edu
---------------------------------------
Begin PBS Epilogue Mon Oct 15 11:01:50 EDT 2018
Job ID:     22713426.shared-sched.pace.gatech.edu

  • job2.out should look like this:

Job name:   job2
Queue:      iw-shared-6
End PBS Prologue Mon Oct 15 11:02:00 EDT 2018
---------------------------------------
Job2 started on iw-c39-29-r.pace.gatech.edu
---------------------------------------
Begin PBS Epilogue Mon Oct 15 11:02:00 EDT 2018
Job ID:     22713427.shared-sched.pace.gatech.edu
  • After the result files are produced, you can move the files off the cluster. Refer to the file transfer guide for help.
  • Congratulations! You successfully ran multiple jobs with job dependencies on the cluster.