Updated 2023-04-19

Why is my job stuck in the Queue (won't run)?

Potential issues

  • A job stuck in the queue is one of the most common and frustrating problems.
  • Before assuming something is wrong, identify which of these applies:
    • The SBATCH script is broken, so the job cannot run.
    • The job requests resources that don't exist (an error in the SBATCH script).
    • The job is simply very intensive, so it has to wait in the queue longer. This long wait is mistaken for the job being stuck in the queue.

Potential Issue 1: SBATCH Script is broken

  • A job will be rejected if there is a critical error in the SBATCH script, such as a syntax error, a typo, or a missing key #SBATCH directive. A job with critical errors never reaches a compute node: it is rejected as soon as it is submitted to the scheduler with sbatch jobName.sbatch. If this happens, look for typos and syntax errors in the SBATCH script, and make sure all the necessary directives are present. For example, if the charge account directive, #SBATCH --account=gts-exampleAccount_name, is missing or names an incorrect account, the submission fails because the scheduler doesn't know which account to charge for your job.
  • If the SBATCH script does submit, it may still contain logical errors, such as a request for resources that don't exist. These are harder to catch.
  • In general, use squeue -j <jobID> or pace-why-inqueue <jobID> to determine why your job is stuck in the queue.
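
The directive check described above can be partly automated before submission. The following is a minimal sketch, not a PACE-provided tool: the function name check_sbatch_directives is hypothetical, and the default list of required directives (account, time, nodes) is an assumption that should be adjusted to whatever your cluster requires. It only recognizes the long form of each directive (for example --account, not -A).

```shell
#!/bin/sh
# Hypothetical pre-submission check: list long-form #SBATCH directives
# that are missing from a submission script.
# Usage: check_sbatch_directives <script> [directive ...]
check_sbatch_directives() {
    if [ "$#" -lt 1 ]; then
        echo "usage: check_sbatch_directives <script> [directive ...]" >&2
        return 2
    fi
    script="$1"
    shift

    # Assumed default list; pass directive names explicitly to override it.
    if [ "$#" -eq 0 ]; then
        set -- account time nodes
    fi

    if [ ! -r "$script" ]; then
        echo "cannot read: $script" >&2
        return 2
    fi

    missing=0
    for name in "$@"; do
        # Accept "--name=value" or "--name value" anywhere on a #SBATCH line.
        if ! grep -Eq "^#SBATCH[[:space:]]+(.*[[:space:]])?--${name}([=[:space:]]|\$)" "$script"; then
            echo "missing directive: --${name}"
            missing=1
        fi
    done
    return "$missing"
}
```

One way to use it is check_sbatch_directives jobName.sbatch && sbatch jobName.sbatch, so the job is submitted only when none of the listed directives is missing. This catches omissions only; it does not verify that the account name or the requested resources are valid.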

Potential Issue 2: Job is very intensive, and naturally has a long wait time

  • If the job requests a lot of resources and has a long walltime, it might have to wait a long time in the queue to run. This long wait may be mistaken for "my job is stuck in the queue".
  • Use pace-check-queue <queue name> to see what resources are currently available, and judge from what is open how long a wait is realistic. If 2 nodes are completely open and your job is not too large, it is reasonable to expect your job to start right away.
  • squeue --start is also very useful: it displays the estimated start time of pending jobs. If it can't determine a start time, and the job isn't starting even though enough resources appear to be available, the submission script might be broken.
  • Overall, jobs that require many nodes and a very long walltime will wait in the queue longer than jobs that request a shorter walltime and fewer resources.
  • The reason is that the scheduler tries to run the jobs in the queue as efficiently as possible.
  • Metaphor for explanation: Imagine you have a glass jar and must fill it with rocks of different sizes, ranging from large stones to grains of sand. The glass jar is the cluster, and the stones are the jobs you want to run on it. Large jobs, with long runtimes and many nodes, are the large stones: only a couple of them fit in the jar. The smaller stones and the sand (the smaller jobs) are much easier to pour in. The point of the metaphor is that the fewer resources and the shorter the walltime a job requests, the less time it sits in the queue and the sooner it runs. It is much easier for the scheduler to slot small jobs in wherever there is space (enough nodes and enough time available), while it must wait a long time to find enough space for a large job. If there isn't enough space in the jar, you can't just jam in a large stone. As a result, very intensive jobs wait longer in the queue.
  • Solution: be as efficient as possible when requesting resources. Request only the walltime and nodes your job actually needs, so that it spends as little time in the queue as possible.
  • You can always purchase a private node if you want to never wait in a queue.
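
To tell a long but normal wait from a real problem, it can help to look at a job's state, pending reason, and estimated start time together. The following is a minimal sketch that wraps the standard Slurm squeue command; the function name job_wait_info is hypothetical, and the exact reasons and start times reported depend on how the scheduler is configured.

```shell
#!/bin/sh
# Hypothetical helper: show state, pending reason, and expected start
# time for one job.
# Usage: job_wait_info <jobID>
job_wait_info() {
    jobid="$1"
    if [ -z "$jobid" ]; then
        echo "usage: job_wait_info <jobID>" >&2
        return 2
    fi
    # squeue format fields:
    #   %i job ID, %T job state, %r reason the job is in that state,
    #   %S actual or expected start time
    squeue -j "$jobid" -o "%i %T %r %S"
}
```

A pending job whose reason is Resources or Priority is typically just waiting its turn, which matches the long-wait case described above. Running squeue --start -j <jobID> shows the scheduler's start-time estimate on its own.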