Updated 2019-02-22

Why did my job terminate when I didn't expect it to?

Potential issues

  • If your job terminates unexpectedly, the cause is most likely in the program you were trying to run, not in your #PBS directives.
  • For troubleshooting, focus on the computation section of the PBS script.
  • The computation section is anything after the #PBS lines (the PBS directive lines).
  • Make sure the resources requested in the #PBS directives at the top of the PBS script (for example, nodes and cores) match the resources the computation section tries to use.
  • Once the #PBS directives and the computation section are consistent, identify which of the following applies:
    • The computation section of the PBS script is broken:
      • the modules loaded are incorrect or the wrong version
      • the wrong directory is entered
      • files are not stored where the computation section of the PBS script says they are
    • The script being run by the job is broken
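One way to keep the two sections consistent is to derive the process count from the allocation instead of typing it twice. The following is only a sketch: it assumes Torque-style nodes/ppn syntax, assumes the file named by $PBS_NODEFILE holds one line per allocated slot (this differs between PBS flavors), and uses a hypothetical MPI executable ./my_program and an arbitrary walltime.

```shell
#!/bin/bash
#PBS -l nodes=2:ppn=4
#PBS -l walltime=00:30:00
#PBS -j oe

# Count the lines in a node file (assumption: one line per allocated slot).
count_slots() {
    wc -l < "$1" | tr -d ' '
}

# PBS sets PBS_NODEFILE inside a running job; outside a job this block is skipped.
if [ -n "${PBS_NODEFILE:-}" ]; then
    NP=$(count_slots "$PBS_NODEFILE")
    echo "Allocated slots: $NP"      # compare this with what the directives requested
    mpirun -np "$NP" ./my_program    # hypothetical MPI executable
fi
```

If the printed slot count differs from what you expected from the #PBS -l line, the mismatch is in the directives rather than in the program.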

Note

Whatever the case, check the contents of the .out file after the job terminates. It usually shows what went wrong, whether the problem was in the program being run or in the PBS script.

Warning

For errors to be logged in the .out file, make sure your PBS script includes the directive #PBS -j oe, which merges the job's standard error into its standard output.
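For a long .out file, it can help to pull out only the lines that look like errors. This is a minimal sketch: the keyword list is an assumption and will not match every program's messages, and the file name in the usage comment is hypothetical.

```shell
# Print lines of a job output file that look like errors, prefixed with line numbers.
# The keyword list is a guess at common failure messages; adjust it for your program.
scan_job_output() {
    grep -n -i -E 'error|not found|no such file|killed|exceeded' "$1"
}

# Example (hypothetical file name):
# scan_job_output myjob.o12345
```

grep exits with a non-zero status when nothing matches, so an empty result only means that none of these keywords appeared, not that the job succeeded.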

Potential Issue 1: Computation Section of the PBS Script is broken

  • One common error is loading the wrong version of the required software. Software is loaded as modules with module load <module name>. Loading the wrong modules can lead to conflicts between modules, outdated versions, or missing dependent modules.
  • Solution: Know exactly which version of the software you need to load and which versions of its dependencies it requires.
  • Ex: for mvapich2/2.1 (a version of MPI), you need to load the correct version of mvapich2 and its compiler, gcc, with:
module load gcc/4.9.0 mvapich2/2.1
  • You can find the available versions of the software you want with module avail <software name>. An example is module avail tensorflow.
  • Another common error is that the files needed to run the job aren't in the correct place.
  • Solution: Make sure the computation part of the PBS script tells the cluster to enter the directory where you store the files you need (including the script to be run and any other files). For example, if a Python script and its data are stored in ~/data/project, tell the cluster to enter that directory with cd ~/data/project before telling it to run the script.
  • If it is unclear how to get your files there, see the helpful guide on moving files to and from the cluster.
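Both checks above (modules and file locations) can be done at the top of the computation section so the job stops early with a clear message in the .out file. This is a sketch under assumptions: the function name prepare_job and the file names in the usage comment are hypothetical, and some older module implementations may not return a non-zero status when a load fails, so the module load check may not catch every failure.

```shell
# Fail-fast preamble for the computation section (a sketch, not a site-specific recipe).
# Usage: prepare_job <directory> [required files...]
prepare_job() {
    workdir="$1"; shift

    # Load the exact versions the program needs (versions from the mvapich2 example above).
    module purge
    module load gcc/4.9.0 mvapich2/2.1 || { echo "module load failed" >&2; return 1; }

    # Enter the directory that holds the script and its data.
    cd "$workdir" || { echo "cannot enter $workdir" >&2; return 1; }

    # Confirm every required file exists before starting the real work.
    for f in "$@"; do
        [ -e "$f" ] || { echo "missing file: $workdir/$f" >&2; return 1; }
    done
}

# Example use inside the PBS script (hypothetical file names):
# prepare_job ~/data/project analysis.py input.csv && python analysis.py
```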

Potential Issue 2: The User's program to be run is broken

  • If the error you get (found in the .out file) has nothing to do with module versions, missing software or files, the PBS script, or anything else on the cluster side, the script you are trying to run may have syntax or other runtime errors.
  • Solution: Try compiling or running the program, or a simplified version of it, in a controlled environment such as your laptop. This way you can check for syntax and other runtime or compilation errors.
  • If it runs fine in your personal environment, then you will probably have to focus on cluster-related troubleshooting strategies.
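A quick local syntax check can be done without running the program at all. This sketch covers only Python and shell scripts, chosen here as an assumption about what the job runs; the function name check_syntax is hypothetical, and the Python branch assumes python3 is installed.

```shell
# Check a script for syntax errors without executing it.
check_syntax() {
    case "$1" in
        *.py) python3 -m py_compile "$1" ;;   # byte-compiles the file, does not run it
        *.sh) bash -n "$1" ;;                 # parses the file, does not run it
        *)    echo "no syntax check known for $1" >&2; return 2 ;;
    esac
}

# Example (hypothetical file name):
# check_syntax analysis.py && echo "syntax looks fine"
```

A clean result rules out syntax errors only; runtime errors still require actually running the program, ideally on a small input.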