Updated 2021-01-11

Why is my job stuck in the Queue (won't run)?

Potential issues

  • Having a job stuck in the queue is one of the most common and frustrating problems.
  • However, it is important to identify whether:
    • the PBS script is broken, so the job cannot run
    • the job requests resources that don't exist (an issue with the PBS script)
    • the job is simply very intensive, so it has to wait in the queue longer. This long wait time is mistaken for the job being stuck in the queue.

Potential Issue 1: PBS Script is broken

  • A job will never run on a compute node if there is a critical error in the PBS script: syntax errors, typos, or missing key #PBS directives. In fact, a job with critical syntax errors will be rejected when it is submitted to the scheduler with qsub <jobname>. If this is the case, look for typos and syntax errors in the PBS script, and make sure you have all the necessary directives. For example, if you were missing the walltime directive, `#PBS -l walltime=hh:mm:ss`, the job submission would fail because the scheduler doesn't know how much compute time to allocate to your job.
  • If the PBS script does submit, it may still have logical errors, which are harder to catch.
  • A common logical error in a PBS script is requesting resources that don't exist.
    • For example, let's say I request a GPU for my job and submit to the inferno queue (instead of force-gpu). Syntactically, there is nothing wrong with my PBS script, but the inferno queue has no GPU nodes. As a result, my job will stay in the queue forever, because the scheduler will be looking for nodes that have GPUs and will never find any.
    • The solution is to make sure the resources you request in the PBS script are available in the queue you submit to. This includes the number of processors, memory, GPUs, and any other resource you can request. Similarly, if you request more processors than a node physically has, the job will never run.
  • Use pace-check-queue <queue name> to get info about the state of the queue. This command will show you which nodes are available, and how many processors and how much memory each node has.
  • In general, use pace-why-inqueue <jobID> to determine why your job is stuck in the queue.
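As a concrete reference, a minimal PBS script with the commonly required directives might look like the following sketch. The job name, resource amounts, and queue name are placeholders; check what your queue actually offers with pace-check-queue before submitting:

```shell
#!/bin/bash
#PBS -N my_job              # job name (placeholder)
#PBS -l nodes=1:ppn=4       # 1 node, 4 processors per node
#PBS -l mem=8gb             # total memory request
#PBS -l walltime=01:00:00   # hh:mm:ss; omitting this can cause the submission to fail
#PBS -q inferno             # queue to submit to (must actually offer these resources)
#PBS -j oe                  # merge stdout and stderr into one log file

cd "${PBS_O_WORKDIR:-.}"    # PBS sets this to the directory you ran qsub from
echo "Running on $(hostname) with $(nproc) cores visible"
```

Submit it with qsub <scriptname>; if qsub rejects it immediately, re-check the #PBS lines for typos before anything else.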

Potential Issue 2: Job is very intensive, and naturally has a long wait time

  • If the job requests a lot of resources and has a long walltime, it might have to wait a long time in the queue to run. This long wait may be mistaken for "my job is stuck in the queue".
  • Use pace-check-queue <queue name> when running a job to see what resources are currently available. You can judge how long a realistic wait is based on what is open. If two nodes are completely open and your job is not too large, it makes sense that your job will start right away.
  • showstart <jobid> is also very useful for displaying an estimated start time. If it can't determine a start time, and the job isn't starting even though reasonable resources are available, the submission script might be broken. Note: showstart will not work immediately after you submit a job; wait a bit before running it.
  • Overall, jobs that require many nodes and a very long walltime will wait in the queue longer than jobs that request a shorter walltime and fewer resources overall.
  • The reason for this is that the scheduler runs jobs in the queue as efficiently as possible.
  • Metaphor for explanation: Imagine you had a glass jar and had to fill it with rocks of different sizes, ranging from large stones to grains of sand. The glass jar is the cluster, and the stones are the jobs you have to run on it. Much like large jobs that have long runtimes and request many nodes, you could only fit a couple of large stones in the jar; it would be much easier to pour in the smaller stones and sand (the smaller jobs). The point of this metaphor is that the fewer resources you request and the shorter your walltime is, the less time your job will sit in the queue and the faster it will run. It is much easier for the scheduler to slot these small jobs in wherever there is space (enough nodes and enough time available), while it must wait a long time to find enough space for the large jobs. If there isn't enough room in the jar, you can't just jam in a large stone. As a result, very intensive jobs will wait longer in the queue.
  • Solution: be as efficient as possible when requesting resources. Always go for the bare minimum in terms of walltime and nodes, so your job spends as little time as possible in the queue.
  • You can always purchase a private node if you want to never wait in a queue.
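Putting the diagnostic commands from this page together, a typical troubleshooting session on a login node might look like the following sketch. The job ID (123456) and queue name (inferno) are placeholders; substitute your own:

```shell
# Placeholder job ID and queue name; substitute your own values.
qstat -u $USER            # list your jobs and their states (Q = queued, R = running)
pace-check-queue inferno  # free nodes, processors, and memory in the queue
showstart 123456          # scheduler's estimated start time (wait a bit after qsub)
pace-why-inqueue 123456   # explanation of why the job is still in the queue
```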