Overview of Clusters
- PACE implements a federated model that allows consolidation of common components such as physical hosting, systems management infrastructure, high-speed scratch storage, home directory space, and commodity networking.
- This federation shares common infrastructure components across several clusters including:
- The FoRCE Research Computing Environment - a community resource that includes a mixture of compute nodes, some with attached GPUs, some with large memory capacity and some with local storage.
- Multiple clusters that share job assignments with the FoRCE. When one of these clusters has available processors, jobs from the others may "gust" onto the otherwise idle processors. This capability gives faculty access to resources in excess of their contribution.
- Multiple clusters whose compute nodes are dedicated for exclusive use by PACE participants across many Colleges/Schools.
- The Institute also leases data center space off campus to host MYRIAD, a standalone cluster used for systems biology research.
- Some non-federated legacy clusters are hosted in the PACE HPC environment. The intent over time is to leverage the federated model to the greatest extent possible wherever it appropriately meets research needs.
- The PACE federation implements a multi-tiered data storage strategy. Each user is provided with a small home directory intended to be used for basic login tasks, source code, etc. The next level, project directory storage, is intended for storage of large data sets.
- All clusters participating in the federation also have access to a high-performance, high-capacity scratch filesystem, intended for use by currently executing jobs. A common workflow is to stage datasets from project directory storage to scratch storage at the beginning of job execution, use the performance of scratch storage during the job, and then copy results back to the project directory.
- PACE provides backups of home and project directory storage. Of the four tiers (home, project, scratch and backup), faculty bear the cost of project directory storage only.
- The PACE federation has various common HPC software available. Our systems leverage the campus site licenses for Red Hat Enterprise Linux and for scientific and engineering applications such as MATLAB, Mathematica, Fluent, etc. We also have various applications traditional to HPC environments such as NAMD, MPICH, and PETSc, as well as compilers from GNU, Intel and Portland Group. All installed applications are available on all clusters.
- We are not currently funded to provide specialized software, but please contact us if you are interested in a software package that is not currently available. We are interested in bringing people together and leveraging group purchases.
- All software and systems must be used in compliance with the Georgia Tech Cyber Security Policies. Please see our FAQ entries regarding cluster software restrictions for further details on these restrictions and on currently installed software.
- Individual clusters within the PACE federation are connected via non-blocking gigabit Ethernet. Connections between clusters and to the server and storage infrastructure use multiple 10-gigabit links. We have a dual 10-gigabit connection to the campus backbone network, and gigabit and 10-gigabit connections from campus to networks such as NLR, Internet2, SoX and other research networks in the southeast, nationally and internationally. Please see the OIT page on the campus network for further details.
- The GT facilities are designed to meet the demanding requirements of modern HPC systems, maximizing their availability with power, cooling and storage redundancy measures.
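The staging workflow described above for scratch storage (copy in from project storage, compute on scratch, copy results back) could be sketched in Python roughly as follows. This is an illustration only, not PACE tooling: the function names `run_staged_job` and `count_lines`, the `compute` callback, and the `results/` directory layout are all hypothetical, and a real job would use the project and scratch paths assigned on the cluster.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def run_staged_job(
    project_dir: Path,
    scratch_root: Path,
    dataset: str,
    compute: Callable[[Path, Path], None],
) -> Path:
    """Stage a dataset to scratch, run `compute`, and copy results back.

    All names and paths here are hypothetical placeholders.
    """
    # Create a private working area on scratch so concurrent jobs do not collide.
    work_dir = Path(tempfile.mkdtemp(prefix="job_", dir=scratch_root))
    scratch_input = work_dir / "input"
    scratch_output = work_dir / "output"
    scratch_output.mkdir()

    # 1. Stage in: project directory storage -> scratch storage.
    shutil.copytree(project_dir / dataset, scratch_input)

    # 2. Compute, reading and writing only on scratch.
    compute(scratch_input, scratch_output)

    # 3. Stage out: scratch storage -> project directory storage.
    results_dir = project_dir / "results" / work_dir.name
    shutil.copytree(scratch_output, results_dir)

    # 4. Clean up the scratch working area.
    shutil.rmtree(work_dir)
    return results_dir


def count_lines(input_dir: Path, output_dir: Path) -> None:
    """Toy stand-in for a real computation: count lines in all input files."""
    total = sum(
        len(p.read_text().splitlines())
        for p in input_dir.rglob("*")
        if p.is_file()
    )
    (output_dir / "line_count.txt").write_text(f"{total}\n")
```

A call such as `run_staged_job(project, scratch, "dataset", count_lines)` returns the directory under project storage that holds the copied results. Copying results back before removing the scratch area means a failure during the computation leaves the scratch copy in place for inspection.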
Power and Cooling
- The Rich Computer Center has a total power capacity of 1.2 MW. The BCDC has 2N redundant 270 kW capacity. Both centers provide a high (> 0.97) power factor. These facilities have not suffered any outages in the past six years.
- The Rich Computer Center is backed up by five uninterruptible power supply (UPS) systems, and the BCDC is supported by two UPS systems with 2N redundancy. One of the rooms in the Rich Center has a low-density generator with 285 kW capacity that serves all of the critical storage and server units. The compute nodes are not on generators, but are connected to the UPS systems, which allow for a graceful shutdown. The BCDC has a generator to provide power for all of the systems in the facility. The storage systems are physically distributed between these two facilities to prevent data loss in a catastrophic event, with each facility holding the backups of the data hosted by the other.
- Cooling in the Rich Center is provided by a 3N redundant 450-ton chilled-water system. The BCDC is equipped with an N+1 redundant 200-ton chilled-water system. Both facilities feature raised floors that allow full coverage of cooling systems to all racks, and both are equipped with chilled-water leak detection systems.
- The GT police department (GTPD), a division of the Georgia State Patrol, provides general security on campus. GTPD conducts campus patrols and mobile camera monitoring, and maintains a SWAT response team for emergency preparedness and crime prevention.
- GT datacenters have badge-level access control and camera coverage that includes the building vicinity. Motion sensor alarms are configured to alert GTPD. All systems are monitored 24/7 by an operations team, located in the Rich Computer Center, which responds to emergencies and potential hazards such as rising temperatures or chilled-water leaks.
- Both datacenters link through GTPD to the Atlanta Fire Department, which is located approximately 2.5 miles from campus.
Network and Connectivity
- GT has a unique advantage for connectivity as a founding member of Internet2 (I2) and National Lambda Rail (NLR). GT’s Office of Information Technology (OIT) manages and operates the Southern Crossroads (SoX), which is the regional GigaPOP for I2, and Southern Light Rail (SLR), the regional aggregation. This strategic position will allow the proposed systems to be connected at multiple gigabits per second (Gb/s) to leading universities and national labs, with a 10GbE link to Oak Ridge National Lab (ORNL) in particular.
- The facilities provide 1 Gb/s and 10 Gb/s connections to all servers and HPC systems. The Rich Building is equipped with a QDR QLogic InfiniBand switch with uplinks that connect two computer rooms. FEX switches are distributed throughout the data center, generally two per rack, to provide N+1 redundancy.
- Academic and administrative leadership are seriously considering various options to meet the anticipated future demand for HPC resources. This underscores Georgia Tech's long-term commitment to supporting research that requires highly scalable computation and simulation.
Vendor documentation for the Moab Workload Manager and Torque resource manager in use on PACE clusters