Chain Jobs / Use Job Dependencies¶
- The scheduler provides many tools to:
- chain multiple jobs together (run one after another)
- set conditions for when the next job will run (e.g., only run job 2 if job 1 succeeds)
- Chaining jobs together is achieved by writing a bash script that serves as a sort of controller for the jobs
The Bash script is not the same as a PBS script. The Bash script controls the jobs (PBS scripts) and how they are run.
- The scheduler's dependency options let you define which jobs run depending on whether earlier jobs succeed or fail
Example Dependency Options¶
- These options control how jobs interact and are run
- Placed in the bash script, in the `qsub -W depend=<dependency_list> <dependent_job>` line
- Example options:
after: start job after listed jobs have begun
afterok: start job only after other job(s) have run successfully
before: the listed jobs may start only after this job has begun execution
afternotok: start job only after the listed job(s) have terminated with errors
- A complete list of options can be found here
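To make the options above concrete, the sketch below prints (rather than runs) the `qsub` line that each option would produce, so it can be tried without a scheduler. The helper name `depend_cmd`, the placeholder job ID, and the script names `cleanup.pbs` and `monitor.pbs` are hypothetical and only serve as illustration.

```shell
#!/bin/bash
# depend_cmd is a hypothetical helper: it only PRINTS the qsub line
# for a given dependency type, job ID, and PBS script.
depend_cmd() {
    local kind=$1 jobid=$2 script=$3
    echo "qsub -W depend=${kind}:${jobid} ${script}"
}

jobid=22182721   # placeholder job ID

depend_cmd afterok    "$jobid" job2.pbs      # run only if the job succeeded
depend_cmd afternotok "$jobid" cleanup.pbs   # run only if the job failed
depend_cmd after      "$jobid" monitor.pbs   # run once the job has started
```

Pairing `afterok` with `afternotok` on the same job ID is one way to get a "success path" and a "failure path" from a single first job.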
Dependency on a Specific Job ID¶
- You can create a dependency on a specific Job ID by putting the Job ID in the dependency list.
`qsub -W depend=afterok:22182721 job2.pbs` makes the start of job2.pbs dependent on the successful completion of job1.pbs, which has the Job ID 22182721.
Dependency on Multiple Jobs¶
- You can create a dependency on multiple jobs by separating their Job IDs with a colon (:)
`qsub -W depend=afterok:22182721:22182722 job3.pbs` makes the start of job3.pbs dependent on the successful completion of both listed jobs, 22182721 and 22182722.
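When the job IDs are collected by a script, the colon-separated list can be built instead of typed by hand. The following is a minimal sketch; `join_job_ids` is a hypothetical helper name, the job IDs are the placeholders used above, and the final `qsub` line is printed rather than executed.

```shell
#!/bin/bash
# join_job_ids is a hypothetical helper: it joins all of its
# arguments with ":" by temporarily setting IFS inside the function.
join_job_ids() {
    local IFS=:
    echo "$*"
}

# Placeholder job IDs from the example above.
dep="afterok:$(join_job_ids 22182721 22182722)"

# Printed, not executed, so this can be tried off the cluster.
echo "qsub -W depend=$dep job3.pbs"
```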
Walkthrough: Chain Jobs Together¶
- This walkthrough will use two jobs, and require the second job run only if the first job is successful
- Both jobs simply print out the node they were started on
- First job: job1.pbs
- Second job: job2.pbs
- Bash script: jobDepend.sh
- You can transfer the files to your account on the cluster to follow along. The file transfer guide may be helpful.
Part 1: The PBS Scripts¶
    #PBS -N job1
    #PBS -l nodes=1:ppn=2
    #PBS -l walltime=1:00
    #PBS -q iw-shared-6
    #PBS -j oe
    #PBS -o job1.out

    cd $PBS_O_WORKDIR
    echo "Job1 started on `/bin/hostname`"
- `job2.pbs` is the same apart from its job name and output file (job2 and job2.out), and prints "Job2 started on ..." instead of "Job1 started on ..."
- The `#PBS` directives are standard, requesting just 1 minute of walltime and 1 node with 2 cores. More on `#PBS` directives can be found in the PBS guide
- `$PBS_O_WORKDIR` is simply a variable that represents the directory you submit the PBS script from.
- `echo` prints the phrase to the out file
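Because the two PBS scripts differ only in the job number, the second one can be generated from the first. The sketch below assumes `job1.pbs` is in the current directory; `make_job_copy` is a hypothetical helper name, not part of the scheduler.

```shell
#!/bin/bash
# make_job_copy is a hypothetical helper: it copies a PBS script and
# rewrites every "job1"/"Job1" to "job2"/"Job2", which covers the job
# name, the output file name, and the echoed message.
make_job_copy() {
    local src=$1 dst=$2
    sed -e 's/job1/job2/g' -e 's/Job1/Job2/g' "$src" > "$dst"
}

# Only act when job1.pbs is actually present.
if [ -f job1.pbs ]; then
    make_job_copy job1.pbs job2.pbs
fi
```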
Part 2: Bash Script¶
    #!/bin/bash
    first=$(qsub job1.pbs)
    echo $first
    second=$(qsub -W depend=afterok:$first job2.pbs)
    echo $second
- Instead of calling `qsub` directly, the bash script serves as the controller and handles all the job submission "logic", that is, which jobs should run under which conditions
- This bash script can be used as a template, or you can create your own
- `first` is a variable that captures the output of `qsub job1.pbs`, which is the Job ID of job1. This command simply submits job1 normally
- `second` captures the Job ID returned when job2 is submitted with a dependency, so that job2 runs only if job1 completes successfully
- `qsub -W`: additional attributes flag, allows you to specify dependencies
- `depend=<dependency_list>`: defines the dependencies between this job and other jobs
- `afterok`: the job may be started at any time after all specified jobs have successfully completed
- `job2.pbs` will only run after `job1.pbs` has completed successfully
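The same pattern extends to longer chains. The sketch below is one possible generalization, not the guide's own script: it submits any number of PBS scripts in order, makes each depend on the previous one with `afterok`, and stops if a submission fails. The names `submit_chain` and `SUBMIT` are hypothetical, and the sketch assumes the submit command prints the Job ID on standard output, as `qsub` does in the script above.

```shell
#!/bin/bash
# SUBMIT is a hypothetical variable naming the submit command.
# It defaults to qsub and can be overridden for a dry run.
SUBMIT=${SUBMIT:-qsub}

# submit_chain is a hypothetical helper: submit each script so that it
# runs only after the previous one has completed successfully.
submit_chain() {
    local prev="" jobid script
    for script in "$@"; do
        if [ -z "$prev" ]; then
            # First job: no dependency.
            jobid=$($SUBMIT "$script") || return 1
        else
            # Later jobs: wait for the previous job to succeed.
            jobid=$($SUBMIT -W depend=afterok:"$prev" "$script") || return 1
        fi
        # Stop if the submit command returned no Job ID.
        [ -n "$jobid" ] || return 1
        echo "$jobid"
        prev=$jobid
    done
}
```

On the cluster, `submit_chain job1.pbs job2.pbs` would submit the two walkthrough jobs and print both Job IDs.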
Part 3: Submitting the Jobs¶
- To submit the jobs, run the bash script. It will handle qsub and the job dependencies for you
- For the walkthrough, use `./jobDepend.sh` to execute the bash script and run the jobs. The bash script has to be made executable first (for example, with `chmod +x jobDepend.sh`).
- More information on creating and making bash scripts executable can be found here
- Check job status with `qstat -u gtusername3 -n`, replacing gtusername3 with your GT username
- You can delete a job with `qdel 22182721`, replacing the number with the Job ID returned after running qsub.
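The sample output later in this guide shows full Job IDs of the form `22713426.shared-sched.pace.gatech.edu`, while the `qdel` example uses only the number. If you want just the numeric part, shell parameter expansion can strip the rest. This is a small sketch; `numeric_job_id` is a hypothetical helper name, and the `qdel` line is printed rather than executed.

```shell
#!/bin/bash
# numeric_job_id is a hypothetical helper: it removes everything from
# the first "." onward, leaving only the leading job number.
numeric_job_id() {
    echo "${1%%.*}"
}

# Sample full Job ID taken from the output shown in this guide.
full_id="22713426.shared-sched.pace.gatech.edu"

# Printed, not executed.
echo "qdel $(numeric_job_id "$full_id")"
```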
Part 4: Collecting Results¶
- The results of the jobs will be stored as normal
- In the directory where you submitted the Bash script, you should see `job1.out` and `job2.out` files, which contain the results of the jobs. Use `cat *.out` or open the files in a text editor to take a look.
job1.outshould look like this:
    Job name: job1
    Queue: iw-shared-6
    End PBS Prologue Mon Oct 15 11:01:50 EDT 2018
    ---------------------------------------
    Job1 started on iw-c39-29-r.pace.gatech.edu
    ---------------------------------------
    Begin PBS Epilogue Mon Oct 15 11:01:50 EDT 2018
    Job ID: 22713426.shared-sched.pace.gatech.edu
job2.out should look like this:
    Job name: job2
    Queue: iw-shared-6
    End PBS Prologue Mon Oct 15 11:02:00 EDT 2018
    ---------------------------------------
    Job2 started on iw-c39-29-r.pace.gatech.edu
    ---------------------------------------
    Begin PBS Epilogue Mon Oct 15 11:02:00 EDT 2018
    Job ID: 22713427.shared-sched.pace.gatech.edu
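A quick way to confirm that both jobs ran is to look for the "started on" line in each output file. The sketch below does this with `grep`; `check_started` is a hypothetical helper name, and the check assumes the output files contain the message printed by the walkthrough's PBS scripts.

```shell
#!/bin/bash
# check_started is a hypothetical helper: it succeeds only if every
# given output file exists and contains a "started on" line.
check_started() {
    local f
    for f in "$@"; do
        if ! grep -q "started on" "$f" 2>/dev/null; then
            echo "missing 'started on' line in $f" >&2
            return 1
        fi
    done
}
```

For example, `check_started job1.out job2.out && echo "both jobs ran"` prints the message only when both files contain the expected line.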
- After the result files are produced, you can move the files off the cluster. Refer to the file transfer guide for help.
- Congratulations! You successfully ran multiple jobs with job dependencies on the cluster.