Updated 2023-04-19
Why did my job terminate when I didn't expect it to?¶
Potential issues¶
- If your job terminates when you don't expect it, most likely something went wrong in the program you were trying to run, not in your `#SBATCH` directives.
- For troubleshooting, focus on the computation section of the SBATCH script.
  - The computation section is anything after the `#SBATCH` lines (the Slurm directive lines).
- Make sure the resources requested in the top `#SBATCH` directives section of the SBATCH script match any resources requested in the computation section.
- After ensuring the resources requested in the `#SBATCH` directive and computation parts of the SBATCH script are consistent, identify whether:
  - the computation section of the SBATCH script is broken
    - modules loaded are incorrect or the wrong version
    - wrong directory entered
    - files not stored where they are specified in the computation section of the SBATCH script
  - the script being run by the job is broken
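As a rough sketch of the two sections described above, the script below pairs a `#SBATCH` directive block with a computation section that reads its task count from Slurm instead of hard-coding a second number. The job name, the task count of 4, and `./my_program` are hypothetical placeholders, not values taken from this cluster.

```shell
#!/bin/bash
# --- Slurm directive section ---
#SBATCH --job-name=example-job   # hypothetical job name
#SBATCH --ntasks=4               # hypothetical resource request
#SBATCH -o Report-%j.out         # %j expands to the job ID

# --- Computation section: everything after the #SBATCH lines ---
# Slurm sets SLURM_NTASKS inside a job. The fallback of 4 is an assumption
# that mirrors --ntasks above and lets this sketch run outside Slurm.
NTASKS="${SLURM_NTASKS:-4}"
echo "Computation section will use ${NTASKS} task(s)"

# Hypothetical launch line, commented out because ./my_program is a placeholder:
# srun -n "${NTASKS}" ./my_program
```

Reading the count from `SLURM_NTASKS` means the computation section cannot drift out of step with the directive section when the request is changed later.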
Note
Whatever the case, check the contents of the .out file after the job terminates. It usually shows what went wrong, whether the problem was in the program being run or in the SBATCH script.
Warning
To have errors logged in the .out file, make sure your SBATCH script includes the directive `#SBATCH -o Report-%j.out`.
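One way to skim a finished job's output file for problems is sketched below. The helper name `show_errors` and the search pattern are assumptions for illustration only; adjust the pattern to whatever your own program prints when it fails.

```shell
#!/bin/bash
# show_errors FILE: print lines (with line numbers) that look like errors.
# Hypothetical helper; the pattern list is an assumption and is not exhaustive.
show_errors() {
  local out_file="$1"
  if [ ! -f "$out_file" ]; then
    echo "No such output file: $out_file" >&2
    return 2
  fi
  # -n: show line numbers, -i: ignore case, -E: extended regular expression
  grep -n -i -E 'error|traceback|not found|killed' "$out_file"
}

# Usage (replace <jobid> with the ID Slurm printed at submission):
#   show_errors Report-<jobid>.out
```

If nothing matches, the function prints nothing and returns a non-zero status, in which case reading the whole file is the safer next step.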
Potential Issue 1: Computation Section of the SBATCH Script is broken¶
- One common error is loading the wrong version of the required software. Software is loaded in the form of modules with `module load <module name>`. Loading the wrong modules can lead to conflicts between modules, outdated versions, or missing dependent modules.
  - Solution: Know exactly which version of the software you need to load and which versions of its dependencies it needs.
  - Ex: for mvapich2/2.1 (a version of MPI), you need to load the correct version of mvapich2 and its compiler, gcc, with: `module load gcc/4.9.0 mvapich2/2.1`
  - You can find the available versions of the software you want with `module avail <software name>`. An example is `module avail tensorflow`.
- Another common error is that the files you need to run the job aren't in the correct place.
  - Solution: Make sure that in the computation part of the SBATCH script, you tell the cluster to enter the directory where you store the files you need (including the script to be run and other files). For example, if I had a Python script I wanted to run and the data for the script stored in `~/data/project`, I would tell the cluster to enter that directory with `cd ~/data/project` before I told the cluster to run the script.
  - See the guide on moving files to and from the cluster if this is unclear.
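The directory and file checks above could be gathered into a small preflight step at the top of the computation section. This is only a sketch: the function name `preflight` and the file name `my_script.py` are hypothetical, and `~/data/project` is simply the example directory used above.

```shell
#!/bin/bash
# Hypothetical preflight checks for the computation section.
PROJECT_DIR="${PROJECT_DIR:-$HOME/data/project}"   # example directory from the text
SCRIPT_NAME="${SCRIPT_NAME:-my_script.py}"         # hypothetical file name

preflight() {
  # Enter the directory that holds the script and its data.
  cd "$1" || { echo "Cannot enter directory: $1" >&2; return 1; }
  # Confirm the script to be run is really there.
  [ -f "$2" ] || { echo "Missing file: $1/$2" >&2; return 1; }
  # If the module command is available, record what is loaded so that
  # version mismatches show up in the job's output file.
  if command -v module >/dev/null 2>&1; then
    module list
  fi
  return 0
}

# In a job script this would typically come right before the run command:
#   preflight "$PROJECT_DIR" "$SCRIPT_NAME" || exit 1
```

Failing early with a clear message makes a wrong directory or missing file obvious in the output file, instead of surfacing later as a less readable error from the program itself.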
Potential Issue 2: The User's Program is broken¶
- If the error you get (found in the .out file) has nothing to do with module versions, missing software or files, the `SBATCH` script, or anything on the cluster side, the script you are trying to run may have syntax or other runtime errors.
  - Solution: try compiling or running the program, or a simplified version of it, in a controlled environment like your laptop. This way you can check for syntax and other runtime / compilation errors.
  - If it runs fine in your personal environment, then you will probably have to focus on cluster-related troubleshooting strategies.
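A quick local syntax check along the lines suggested above might look like the following. The helper name `check_syntax` is hypothetical, and the sketch assumes the program is either a Python or a shell script. It catches syntax errors only, so runtime errors still need an actual test run.

```shell
#!/bin/bash
# check_syntax FILE: parse a script without running it.
# Hypothetical helper; only .py and .sh files are handled in this sketch.
check_syntax() {
  case "$1" in
    *.py) python3 -m py_compile "$1" ;;   # byte-compiles, reports syntax errors
    *.sh) bash -n "$1" ;;                 # reads commands but does not execute them
    *)    echo "Unknown file type: $1" >&2; return 2 ;;
  esac
}

# Usage on your laptop before submitting the job:
#   check_syntax my_script.py && echo "syntax OK"
```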