Updated 2019-08-27

Using PBS

How can I ask for a particular node and/or property?

  • In your PBS script, you can request a specific core count and/or specific nodes.
  • To request a specific core count:
    • Add the core<x> tag to the node-request directive in your PBS script. Example: #PBS -l nodes=1:ppn=1:core24
  • To run a job on specific nodes:
    • Add the -l nodes=<names> to the directives part of your PBS script. Example: #PBS -l nodes=rich133-p32-33-l.pace.gatech.edu+rich133-p32-33-r.pace.gatech.edu would request 2 specific nodes.
  • To request a node with a specific property (a GPU, a certain core count, SSE capabilities, etc.), you can use tags. Below is an example:
#PBS -l nodes=2:ppn=4:nvidiagpu
  • Other common tags include:
    • core8 (8 core node)
    • core24 (24 core node)
    • core48 (48 core node)
    • core64 (64 core node)
    • intel
    • amd
    • nvidiagpu
    • amdgpu
    • localdisk (local disk)
    • ib (infiniband network)
    • ssd (solid state local disk)
    • sse3
    • sse41
    • fma4 (FMA4 capability)
  • To request nodes with unique resources, you can combine the tags above.
  • Here are a couple of examples that you might find helpful:

    • Requesting one node with 4 cores and another node with 6 cores + 2 GPUs:

    #PBS -l nodes=core4+core6:gpus=2

    • Requesting 4 cores + 1 GPU on a specific node (rich133-k40-12) and 6 cores + 2 GPUs on another specific node (rich133-k40-15):

    #PBS -l nodes=rich133-k40-12:core4:gpus=1+rich133-k40-15:core6:gpus=2
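  • Putting these pieces together, a minimal job script might look like the sketch below. The job name, walltime, and commented-out queue line are illustrative placeholders, not values taken from this FAQ:

```shell
#!/bin/bash
#PBS -N tag_example              # illustrative job name
#PBS -l nodes=2:ppn=4:nvidiagpu  # two 4-core slots on GPU-capable nodes
#PBS -l walltime=01:00:00        # illustrative walltime
# #PBS -q <queue_name>           # uncomment and fill in your queue

# PBS sets PBS_O_WORKDIR on the compute node; fall back to $PWD so the
# script can also be exercised outside the batch system.
cd "${PBS_O_WORKDIR:-$PWD}"
msg="Running on $(hostname)"
echo "$msg"
```

  • Submit with qsub <script_name> as usual; the #PBS lines are ordinary comments to the shell, so the script body can be tested locally before submission.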

Why can PBS not allocate resources, no matter what I ask for?

  • We have seen this issue when the PBS script includes a memory request on the same line as the processor request. For example:
#PBS -l nodes=12:ppn=4:mem=8GB
  • Please try placing the memory request on a separate line:
#PBS -l nodes=12:ppn=4
#PBS -l mem=8GB
  • This issue reveals itself with the following error message:
qsub: Job exceeds queue resource limits MSG=cannot locate feasible nodes

Can I submit multiple PBS jobs? If so, how?
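  • Yes. You can simply run qsub once per job script, or submit many related jobs at once with a job array via the -t flag. Below is a sketch; the range and job name are illustrative, and -t / PBS_ARRAYID are standard TORQUE features (check your cluster's documentation):

```shell
#!/bin/bash
#PBS -N array_example    # illustrative job name
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:30:00
#PBS -t 1-10             # submit 10 copies of this job, indexed 1..10

# TORQUE exposes each array member's index as PBS_ARRAYID; default to 0
# so the script can also be exercised outside the batch system.
idx="${PBS_ARRAYID:-0}"
echo "Array task $idx"
```

  • A single qsub of this script creates all ten jobs; each appears in qstat with an array suffix, similar to the 139359-69 style job IDs shown in the node_status output later in this FAQ.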

How can I use nodes with GPUs?

  • You can request GPU nodes as follows (please make sure that the queue you are using actually has GPU nodes).
#PBS -q <queue_name>
#PBS -l nodes=1:ppn=8:gpus=1:exclusive_process
  • Note that this example request is for a single node with eight cores and one GPU, so please modify the request according to your needs. Please keep in mind that some (not all) GPU nodes may have multiple GPUs on them.
  • Similarly, you can also request an interactive session to run your GPU code as follows:
qsub -I -q <queue_name> -l nodes=1:ppn=8:gpus=1:exclusive_process
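  • As a sketch, a batch version of the same request might look like the script below. PBS_GPUFILE (the file in which TORQUE lists the GPUs assigned to a job, one per line) is standard TORQUE behavior, but the walltime and commented-out queue are illustrative:

```shell
#!/bin/bash
#PBS -l nodes=1:ppn=8:gpus=1:exclusive_process
#PBS -l walltime=02:00:00    # illustrative walltime
# #PBS -q <queue_name>       # fill in a queue that actually has GPU nodes

# TORQUE lists the GPUs assigned to this job in the file named by
# PBS_GPUFILE; fall back to /dev/null for local testing, where it is unset.
ngpus=$(wc -l < "${PBS_GPUFILE:-/dev/null}" | tr -d ' ')
echo "Assigned $ngpus GPU(s)"
```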

How do I check the status of a job?

  • Once we have a job id (which can be obtained using qstat), we can find out more information about the job like this:
[user@force ~]checkjob 137500
job 137500

AName: CJV.data.970.20
State: Running 
Creds:  user:myusername  group:mygroupname  class:force
WallTime:   22:56:44 of 8:08:00:00
SubmitTime: Mon Nov 22 10:37:03
  (Time Queued  Total: 1:24:04  Eligible: 1:22:20)

StartTime: Mon Nov 22 12:01:07
Total Requested Tasks: 1

Req[0]  TaskCount: 1  Partition: repace  
Memory >= 5200M  Disk >= 0  Swap >= 0
Opsys: ---  Arch: ---  Features: ---
Dedicated Resources Per Task: PROCS: 1  MEM: 5200M
NodeCount:  1

Allocated Nodes:

StartCount:     1
Partition List: repace
StartPriority:  1832
Reservation '137500' (-22:56:57 -> 7:09:03:03  Duration: 8:08:00:00)
  • Here's what the items in this output mean:
    • AName - the name of the job supplied by the user in their job submission file
    • State - valid states are Running, Idle, UserHold, SystemHold, Deferred, BatchHold and NotQueued
      • Running - the job has been issued to a compute node and is executing
      • Idle - the job is awaiting resources to become available before it can execute
      • UserHold - the job is not being considered for execution at the request of the user
      • SystemHold - like UserHold, except the hold has been placed by an administrator
      • Deferred - a temporary hold, used by the scheduler when it has been unable to start the job after a number of attempts. Usually, this results from some policy violation, such as exceeding the maximum allowed wall clock time. See "What does it mean when my job is "Deferred"?" below for more information.
      • BatchHold - the scheduler has failed in multiple attempts to start the job. Usually, this is the permanent state Deferred jobs transition into.
      • NotQueued - something went seriously wrong with this job, and it has only partially been entered into the scheduler. Contact support for assistance.
    • Creds: credentials of the job
      • user - the username of the account that submitted the job, in this case "myusername"
      • group - the primary unix group of the account that submitted the job, in this case "mygroupname"
      • class - the queue into which the job has been submitted
    • WallTime: elapsed and maximum time this job is allowed to run. In this case, the job has been executing for 22 hours, 56 minutes, 44 seconds. If it doesn't complete before a total of 8 days, 8 hours, it will be terminated.
    • SubmitTime: when the job was submitted to the queue
    • Time Queued: how long the job waited in the queue for resources to become available
    • StartTime: when the job began execution
    • Total Requested Tasks: the total number of CPUs the job has requested
    • Allocated Nodes: a list of the compute nodes this job is utilizing (truncated in the sample output above). In this case, one processor on node iw-h41-17a.
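  • These fields can be pulled out of checkjob output with ordinary text tools. The sketch below runs against an abridged copy of the sample output above; in practice you would pipe checkjob <jobid> into the same awk commands:

```shell
# Abridged checkjob output, copied from the example above
sample='AName: CJV.data.970.20
State: Running
WallTime:   22:56:44 of 8:08:00:00'

# Pull out the state, elapsed walltime, and walltime limit
state=$(printf '%s\n' "$sample" | awk '/^State:/ {print $2}')
elapsed=$(printf '%s\n' "$sample" | awk '/^WallTime:/ {print $2}')
limit=$(printf '%s\n' "$sample" | awk '/^WallTime:/ {print $4}')
echo "state=$state elapsed=$elapsed limit=$limit"
```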

How do I tell what's running on this node?

  • This is a two-step process.
    • First, run a command that shows all of the jobs running on a given node; then, run a command that shows the details of a given job.
    • For the second part, see our FAQ entry above on checking the status of a job.
  • For instance, to find out about the node iw-h41-17a, use the node_status command (output formatted for better web display):
[user@force ~]/usr/local/sbin/node_status iw-h41-17a
     state = job-exclusive
     np = 24
     ntype = cluster
     jobs = 0/137499.repace.pace.gatech.edu, 1/137500.repace.pace.gatech.edu,
            2/138400.repace.pace.gatech.edu, 3/138400.repace.pace.gatech.edu,
            4/138400.repace.pace.gatech.edu, 5/138400.repace.pace.gatech.edu,
            6/138400.repace.pace.gatech.edu, 7/138400.repace.pace.gatech.edu,
            8/138400.repace.pace.gatech.edu, 9/138400.repace.pace.gatech.edu,
            10/139359-69.repace.pace.gatech.edu, 11/139359-70.repace.pace.gatech.edu,
            12/139359-71.repace.pace.gatech.edu, 13/139359-72.repace.pace.gatech.edu,
            14/139359-73.repace.pace.gatech.edu, 15/139359-74.repace.pace.gatech.edu,
            16/139359-75.repace.pace.gatech.edu, 17/139359-76.repace.pace.gatech.edu,
            18/139359-77.repace.pace.gatech.edu, 19/139359-78.repace.pace.gatech.edu,
            20/139359-79.repace.pace.gatech.edu, 21/139359-80.repace.pace.gatech.edu,
            22/139359-81.repace.pace.gatech.edu, 23/139359-82.repace.pace.gatech.edu
     status = opsys=linux,uname=Linux iw-h41-17a.pace.gatech.edu 2.6.18-194.el5 #1 
            SMP Tue Mar 16 21:52:39 EDT 2010 x86_64,sessions=943 997 6286 7694
            16223 16301 16554 16804 17044 17270 17526 17766 18019 18246 18496 18737
            18982 19343 25479 29192,nsessions=20,nusers=6,idletime=1178489,
            jobs=137499.repace.pace.gatech.edu 137500.repace.pace.gatech.edu
            138400.repace.pace.gatech.edu 139359-69.repace.pace.gatech.edu
            139359-70.repace.pace.gatech.edu 139359-71.repace.pace.gatech.edu
            139359-72.repace.pace.gatech.edu 139359-73.repace.pace.gatech.edu
            139359-74.repace.pace.gatech.edu 139359-75.repace.pace.gatech.edu
            139359-76.repace.pace.gatech.edu 139359-77.repace.pace.gatech.edu
            139359-78.repace.pace.gatech.edu 139359-79.repace.pace.gatech.edu
            139359-80.repace.pace.gatech.edu 139359-81.repace.pace.gatech.edu
  • Here's what this output means
    • state - valid states are down, offline, free and job-exclusive
      • job-exclusive - the node is completely filled with jobs.
      • free - slightly misleading. Some jobs may be executing, but not all processors are in use.
      • offline - administratively down, usually means that PACE staff are making repairs.
      • down - the scheduling services are not running on this node; this usually means the node is powered off.
    • np - number of processors. In the example above, iw-h41-17a has 24 processors.
    • ntype - it will always be "cluster" in our environment. Move on, nothing to see here.
    • jobs - a list of job IDs and the logical processor to which each is assigned.
    • status - lots of status information, including job id's, all on a single line to facilitate easy parsing with unix commands such as sed and awk.
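  • Since the jobs field maps each logical processor to a job ID, one quick way to see how many distinct jobs occupy a node is to strip the "<core>/" prefixes and deduplicate. The sketch below runs against a shortened copy of the jobs line from the example above:

```shell
# Shortened "jobs =" line from the node_status example above
jobs_line='jobs = 0/137499.repace.pace.gatech.edu, 1/137500.repace.pace.gatech.edu, 2/138400.repace.pace.gatech.edu, 3/138400.repace.pace.gatech.edu'

# Drop the "jobs = " prefix, split on commas, keep only the job ID after
# each "core/" prefix, then count the unique IDs.
njobs=$(printf '%s\n' "$jobs_line" \
  | sed 's/^jobs = //' | tr -d ' ' | tr ',' '\n' \
  | cut -d/ -f2 | sort -u | wc -l | tr -d ' ')
echo "$njobs distinct jobs on this node"
```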

What queues are available for job submission?

  • To find some basic queue information about a system, use the following command:
qstat -Q
  • This command will show a current utilization summary of every queue defined.

How do I delete jobs after they have been submitted?

  • First, identify the JobID of the job you want to cancel. In the example below, the '-w' argument to showq is used to only list jobs in the force queue. You can run showq without '-w' to show all jobs.
[user@force ~]showq -w class=force
active jobs------------------------
137077              user       Running    16  1:02:35:30  Sun Nov 21 23:32:35
1 active job            16 of 5216 processors in use by local jobs (0.31%)
                        137 of 255 nodes active      (53.73%)
eligible jobs----------------------
0 eligible jobs   
blocked jobs-----------------------
0 blocked jobs   
Total job:  1
  • After identifying which jobs you want to cancel, use qdel <jobid> to delete the job.

What is "wallclock limit" and why are some jobs killed because of it?

  • Wallclock is the "maximum amount of real time during which the job can be in the running state."
  • Your wall clock limit is set in your pbs script with the walltime option:
#PBS -l walltime=HH:MM:SS
  • If your walltime request is too small, your job will be killed before it completes; if it is too large, the scheduler may leave it waiting in the 'Deferred' state longer than necessary (or eventually move it to the 'BatchHold' state).
  • So, in order to stop your jobs from being killed, do your best to accurately estimate how long your job will take to run, and then use that value in your walltime request.
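  • One simple way to pick a walltime is to time a representative test run and add a safety margin. The sketch below (the 9000-second measurement and the ~20% margin are illustrative numbers, not site policy) converts the padded estimate into the HH:MM:SS form PBS expects:

```shell
est=9000                      # measured runtime of a test run, in seconds (illustrative)
padded=$(( est + est / 5 ))   # add a ~20% safety margin

# Format the padded estimate as HH:MM:SS for the walltime directive
walltime=$(printf '%02d:%02d:%02d' \
  $(( padded / 3600 )) $(( padded % 3600 / 60 )) $(( padded % 60 )))
echo "#PBS -l walltime=$walltime"
```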

What does it mean when my job is "Deferred"?

  • Deferred is a temporary hold placed by the scheduler. See the job state descriptions under "How do I check the status of a job?" above for this and the other hold types.

Can my disk quota be increased?

  • Disk quota on home directories is fixed at 5GB per user. Space in project directories (~/data) is funded by faculty members, who also set quota policy for their users. Project directory space on the FoRCE cluster is subject to the space approved in your access proposal. We will do our best to accommodate requests for additional space for FoRCE users as resources allow. Check the Storage Guide in the Storage and File Transfer section for more specific information.

Why do my jobs have to wait so long in the queue?

  • Here are some possible scenarios:
    • No idle resources that match your requirements:
      • The cluster you are submitting to could be full. You might want to check the status of the queues using pace-check-queue <queue_name> and decide which queue to submit to.
    • Unrealistically long walltime:
      • It is good practice to request only as much walltime as your job needs, not more (if you have a means to predict that).
      • The PBS scheduler does its best to keep the utilization rate high; one strategy is to allocate large jobs first, then fill in the remaining slots with smaller jobs.
  • You may have a job that gets held up in the queue even though it seems as though there are many open processors when you use showq.
  • The cluster statistics you get from the 'showq' command may include some nodes that you do not have access to.
  • For example, here is a current listing of nodes (and processors) for the 'joe' queue:
pace-check-queue joe

=== joe Queue Summary: ====
    Last Update                            : 03/15/2018 13:15:05
    Number of Nodes (Accepting Jobs/Total) : 33/72 (45.83%)
    Number of Cores (Used/Total)           : 1601/1792 (89.34%)
    Amount of Memory (Used/Total) (MB)     : 1389523/7825469 (17.76%)
  Hostname       tasks/np Cpu%  loadav%  used/totmem(MB)   Mem%   Accepting Jobs?
iw-p31-3-l        20/20  100.0   101.2     12463/66568     18.7    No  (all cores in use)
iw-p31-3-r        18/20   90.0    86.4     26774/66568     40.2    Yes (free)
iw-p31-4-l        20/20  100.0   100.2      7079/66568     10.6    No  (all cores in use)
iw-p31-4-r        20/20  100.0   102.2     28618/66568     43.0    No  (all cores in use)
iw-p31-5-l        16/20   80.0    94.9     20729/66568     31.1    Yes (free)
iw-p31-5-r        20/20  100.0    90.3      8147/66568     12.2    No  (all cores in use)
iw-p31-6-l        20/20  100.0   103.2     33393/66568     50.2    No  (all cores in use)
  • The next time your jobs get held up in the queue, you could try cancelling them with qdel <jobid> and resubmitting them with slightly altered resource requirements in your PBS submit script.
  • For example, instead of
    • #PBS -l nodes=5:ppn=24 (asking for 5 nodes with at least 24 free processors each, which is not available in the joe queue as shown above), you might try
    • #PBS -l nodes=6:ppn=20 (asking for 6 nodes with 20 processors each).
  • Depending on the cluster's current usage, a change like the one above could result in far less waiting time.
  • If no one else is currently using the nodes in the 'joe' queue, you are encouraged to use all the nodes and processors available. When others are also making resource requests, however, the more you ask for, the longer you generally have to wait.

How do I check my quota of filesystem space?

  • Use pace-quota to check how much of each storage quota you have used and how much is available in total for you to use.