Updated 2023-04-28

Technical Questions

How can I learn more about PACE clusters and users/groups/queues?

  • PACE offers the Ganglia tool for inspecting the performance, utilization, and status of PACE-managed clusters.

What are those "data" and "scratch" folders in my home directory?

  • Please refer to the Storage Guide for the most relevant information.
  • On the Phoenix Cluster there are project and scratch directories. The project directory is for long-term storage, while scratch is for short-term, high-performance storage. The Phoenix Storage Guide has more information.
  • On the Hive Cluster, there are two symbolic links in your home directory: data and scratch. The data symbolic link points to your project directory space for long-term storage of data sets, and the scratch symbolic link points to your space on the high-performance scratch storage.
  • As part of your job submission file, you can create additional directories within your scratch space and copy input files from your project directory into the newly created sub-directory on scratch. During the execution of your job, operate on the copy within the scratch space. When your calculations are complete, copy the files you need back to your project directory and remove the remaining files from the scratch space (a minimal example is sketched after this list).
  • Remember, the scratch space is limited and not intended to hold data for the long term. We automatically remove "old" files (more than 60 days old) from the scratch space each week. In addition, we apply a 7 TB hard quota and a limit of 1 million files per user. We do not back up the scratch storage, but we do back up the project directory and home directory storage.
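  • A minimal sketch of that workflow is shown below; the project path, input/output file names, and the solver command are placeholders, and the data/scratch symbolic links are assumed to exist in your home directory (adjust to your cluster's layout):
# Create a working directory in scratch (all names below are placeholders).
mkdir -p ~/scratch/my_run
# Copy inputs from long-term project storage into scratch.
cp ~/data/my_project/input.dat ~/scratch/my_run/
cd ~/scratch/my_run
# Operate on the copy in scratch while the job runs.
./my_solver input.dat > output.dat
# Copy the results you need back to project space, then clean up scratch.
cp output.dat ~/data/my_project/
rm -rf ~/scratch/my_run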

How can I get information about CPUs on a particular node?

  • Use this shell command from your home directory:
cat /proc/cpuinfo
  • The output should look something like this:
foo@joe98 ~> cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 65
model name      : Dual-Core AMD Opteron(tm) Processor 2222
stepping        : 3
cpu MHz         : 3015.524
cache size      : 1024 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy
bogomips        : 6038.61
TLB size        : 1088 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc

processor       : 1
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 65
model name      : Dual-Core AMD Opteron(tm) Processor 2222
stepping        : 3
cpu MHz         : 3015.524
cache size      : 1024 KB
physical id     : 0
siblings        : 2
core id         : 1
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy
bogomips        : 6030.07
TLB size        : 1088 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc

processor       : 2
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 65
model name      : Dual-Core AMD Opteron(tm) Processor 2222
stepping        : 3
cpu MHz         : 3015.524
cache size      : 1024 KB
physical id     : 1
siblings        : 2
core id         : 0
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy
bogomips        : 6030.07
TLB size        : 1088 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc

processor       : 3
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 65
model name      : Dual-Core AMD Opteron(tm) Processor 2222
stepping        : 3
cpu MHz         : 3015.524
cache size      : 1024 KB
physical id     : 1
siblings        : 2
core id         : 1
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy
bogomips        : 6030.07
TLB size        : 1088 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc

foo@joe98 ~>
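  • For a condensed summary of the same information (core counts, sockets, cache sizes), you can also run lscpu, which is part of util-linux and available on most Linux systems:
lscpu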

How can I find out how much memory a particular node includes?

  • Use this shell command from your home directory:
cat /proc/meminfo
  • The output should look something like this:
foo@joe98 ~> cat /proc/meminfo
MemTotal:     16419200 kB
MemFree:      15035800 kB
Buffers:        239928 kB
Cached:         283524 kB
SwapCached:       2120 kB
Active:         706044 kB
Inactive:       114520 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:     16419200 kB
LowFree:      15035800 kB
SwapTotal:     2096440 kB
SwapFree:      1899200 kB
Dirty:             316 kB
Writeback:           0 kB
Mapped:         303940 kB
Slab:           540140 kB
CommitLimit:  10306040 kB
Committed_AS:   504724 kB
PageTables:       2516 kB
VmallocTotal: 536870911 kB
VmallocUsed:      1320 kB
VmallocChunk: 536869559 kB
HugePages_Total:     0
HugePages_Free:      0
Hugepagesize:     2048 kB
foo@joe98 ~>
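  • For a shorter, human-readable summary of the same information, you can also run free with the -h flag (supported by the procps version of free found on most modern Linux systems):
free -h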

What's eating up all my disk space?

  • Use du -sh * to show the disk usage of each file/dir in the current directory.
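  • For example, to list the largest items last so they are easy to spot (this assumes GNU coreutils sort, which supports -h for human-readable sizes):
du -sh * | sort -h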

Is there a way to allow my interactive sessions to persist as I travel between my home and lab computers?

  • Yes, you can use Screen as outlined on the Screen page of the Software Guides section.
  • Please refer to the following link for more information: GNU Screen
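  • For example, typical GNU Screen usage looks like the sketch below (a minimal outline of standard Screen commands; the session name is a placeholder):
screen -S mysession      # start a named session
# ... run your interactive work, then detach with Ctrl-a d ...
screen -r mysession      # reattach later, e.g. from your lab computer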

How do I get system email sent elsewhere?

  • If you are interested in PBS-issued emails only, you can specify your email address in the PBS submit script using the "#PBS -M" directive:
#PBS -M your_email_address
  • If you would like all system emails forwarded, then you create a .forward file in your home directory:
foo@pacemaker ~> cd
foo@pacemaker ~> echo your_email_address > .forward
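  • For example, a submit script can also control when mail is sent by combining the address with the standard PBS -m option (abe requests mail on abort, begin, and end):
#PBS -M your_email_address
#PBS -m abe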

What do I do if I have trouble transferring files?

  • Please refer to the Storage and File Transfer section and make sure that you have followed all the steps correctly. If you have and the problem persists, please try another transfer method from that section.
  • If you are still unable to transfer files off of the cluster, run the following command from your home directory:
pace-support.sh

What do I do if I'm having problems with my password?

  • PACE clusters use the standard "GT Account" provided to all GT faculty, staff, and students. For external collaborators, guest accounts can be provisioned by their GT sponsor.
  • Guest accounts and password resets can be handled using Passport.
  • If you still have problems with your password, please see your local Computer Support Representative (CSR), or visit the Technology Support Center.

How can I get general information about all clusters?

  • OIT offers Ganglia to provide users with utilization and status information about the clusters. The webpages can only be browsed on campus or via VPN.
  • About Ganglia:
    • The main page of Ganglia provides two graphs for CPU and Memory utilization for the past hour, for each cluster.
    • You can get historical information for up to a year from the menu titled "Last" (see figure below).
    • To get more detailed (i.e., per-node) information, you can click the cluster title or any of the graphs.
    • The workload on each node is color coded; e.g., nodes using almost 100% of their CPUs will appear red.
    • If you submitted a job to a cluster and it has not been allocated for a long time, you can check the cluster utilization from this webpage to see how many nodes are busy/idle.
    • If the cluster looks idle and your job is still not being allocated, then please check your PBS parameters for typos.

What does the "Disk quota exceeded" error mean?

  • This means that whichever storage is being used (home, project, scratch) has hit its quota. This issue is very common with the home directory. If you run pace-quota you will get a detailed view of your personal user quota and your research group's quota.
  • Rather than letting your jobs write output at run time to the home directory (which is limited to 5-10GB per user, depending on the cluster), you can redirect your processing to use the project and scratch space on Phoenix, or the project and scratch directories on Hive, depending on your needs.

How do I convert mem-per-cpu?

  • Depending on the node, you may have 24 cores, although there are now some nodes with 32 cores.
  • For example, 750 GB / 24 cores ≈ 31 GB mem-per-cpu (see the sbatch sketch after this list).
  • Additional information regarding Phoenix Cluster computing nodes and resources can be found here.
  • Additional information regarding Hive Cluster computing nodes and resources can be found here.
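  • A minimal sketch of requesting memory per core in an sbatch script is shown below; the core count and memory values are illustrative, not a recommendation for any particular node type:
#SBATCH -N 1                     # one node
#SBATCH --ntasks-per-node=24     # 24 tasks (cores) on that node
#SBATCH --mem-per-cpu=31G        # ~31 GB per core, roughly 750 GB total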

Why won't my job start?

  • There are many reasons why a job may not start, so it is important to take time analyzing the problem.
  • One common reason for a job being stuck in the queue is that a user is out of funds. To check available funds, run pace-quota to get a detailed view of accounts and quota.
  • The squeue command allows you to check individual job details and view information on why a job won't start (see the example after this list). An 'AssocGrpBillingMinutes' message means you have requested more credits than you currently have. More information regarding paid accounts and increasing your credits can be found here.
  • Another reason for a job not starting is that the cluster is facing an outage or is busy. When the cluster is experiencing an outage, PACE will send an official email update to users, as well as post about the issue on the blog. You can check the availability of compute nodes during these periods with the pace-check-queue script and a specification of your partition.
  • Information on troubleshooting a job stuck in the queue, a job terminating/failing is available in the Troubleshooting section.
  • Information on interactive/gui jobs is available in the Open OnDemand section.
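  • To see why a specific pending job has not started, you can ask squeue to print the job's state and reason (replace <jobid> with your job's ID; the %r field is Slurm's reason code, e.g. AssocGrpBillingMinutes or Resources):
squeue -j <jobid> -o "%.18i %.9P %.8T %r"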

What does the "Requested node configuration is not available" error mean?

  • A common logical error in an SBATCH script is requesting resources that don't exist.
  • The solution is to make sure the resources you request in the SBATCH script are available in the queue you submit to. This includes the number of processors, memory, GPUs, and any other type of resource you can request. Similarly, if you request more processors than a node physically has, the job will never run (the sinfo sketch after this list shows a quick way to check what each partition offers).
  • The resulting error will look like salloc: error: Job submit/allocate failed: Requested node configuration is not available.
  • For Phoenix users, partitions are chosen automatically, so you only need to choose hardware that exists on the Phoenix cluster. More details on Phoenix resources can be found here.
  • For Hive users, choosing the wrong partition or hardware that exists in a different partition will result in this error. More details on Hive resources can be found here.
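  • A quick way to see what each partition actually offers (CPUs and memory per node, plus any GPUs) is sinfo's output-format option; this is standard Slurm, not PACE-specific:
sinfo -o "%P %c %m %G"     # partition, CPUs per node, memory (MB) per node, GRES (e.g. GPUs)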

Why do my Interactive Desktop Sessions immediately terminate after they start?

  • When you start a job in OnDemand, a Python script runs in the background to set everything up. The script uses Anaconda, so if you have configured Anaconda for your personal use and have run the command conda init, your Anaconda environment will run and interfere with the Python script's Anaconda environment.
  • For this reason, we do not suggest running conda init on PACE.
  • There are two possible solutions that you can try:
    • You can clean up the ~/.bashrc file in your home directory, which may contain lines added by Anaconda that conflict with the launch script. To do this, either delete or comment out everything in the file relating to Anaconda.
    • You may see a directory in your home directory called ~/.lmod.d. You can either move it or delete it; deleting it should not affect any functionality. There is also the option of renaming it to ~/.lmod.d.bak, which keeps a backup of the directory.
  • Either solution should work; afterwards, your sessions should start normally (a sketch of both fixes is shown below).
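  • The sketch below shows one way to apply both fixes from the shell; it assumes your shell is bash and that the Anaconda lines in ~/.bashrc are the standard block written by conda init (delimited by "# >>> conda initialize >>>" and "# <<< conda initialize <<<"):
# Comment out the conda-init block in ~/.bashrc (keeps a backup as ~/.bashrc.bak).
sed -i.bak '/# >>> conda initialize >>>/,/# <<< conda initialize <<</s/^/#/' ~/.bashrc
# Rename ~/.lmod.d so it no longer interferes, while keeping a backup copy.
mv ~/.lmod.d ~/.lmod.d.bak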