Updated 2021-07-22

Storage and File Transfer on Hive with Globus

Access to PACE Data

Screenshot

Warning

For accessing your PACE data on Hive, please use Globus as documented in sections below.

  • When you log into Hive, you will not have access to the PACE data that you reach from other systems, e.g., the Shared clusters (login-s) or Dedicated clusters (login-d).
  • The above image illustrates how Globus can be used to transfer your data from the shared/dedicated clusters (login-s or login-d) to Hive (login-hive).

Storage

Overview

Hive Storage

Important

Users have separate home, data, and scratch directories on Hive compared to other PACE systems.

  • Home

    • This is your home directory: the directory you are in when you log in, and your main directory. You can create whatever files and directories you want in ~
    • Its full path will be something like /storage/home/hhive1/someuser3, but you can access it as ~
    • Home space is backed up (this includes any files or directories you create in the home directory)
    • It is limited by a 5 GB quota (there is no limit on the number of files)
  • Data

    • ~/data is the place for any large files that need to be stored long term (data sets, etc.). The amount of storage depends on which cluster you are on and what type of user you are. You can find the amount by running pace-quota (which will be covered later) and looking for the 'hard limit' line under "data" (it should be upwards of 100 GB).

    Note

    • ~/data does not have a user block quota, however, there is a 2 million file inode quota per user.
    • Each research group has a quota shared across all of its members. On the Hive cluster this is 50 TB per PI, unless additional funding is specifically provided by the NSF grant funding Hive.
  • Scratch

    • ~/scratch is for short term data
    • Great for a working environment, such as moving files during a job, storing data to be used in a job that doesn't need to be on the cluster long term, or as place to store generated files from a job.
    • Common workflow looks like this:
      • Using a file transfer service like Globus, copy scripts and dataset into ~/scratch folder
      • When job is executed, the data remains in ~/scratch
      • Output and resulting generated files will show up in ~/scratch. Then, move the important results data to your data directory (see above), or transfer them off the cluster if needed. Then empty ~/scratch and remove unneeded temporary files.
    • Each week, files older than 60 days are automatically deleted from ~/scratch

    Note

    • The ~/scratch file limit is 1 million files
    • The ~/scratch storage limit is 7 TB
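
Because of the weekly purge and the file-count limit, it can be useful to check ~/scratch before files disappear or the limit is reached. The following is a minimal sketch, assuming GNU find as found on typical Linux clusters. The function names count_files and stale_files are illustrative and not part of any PACE tooling, and the purge may be based on a different timestamp than the modification time checked here.

```shell
# count_files DIR: print the number of regular files under DIR.
# (Assumes file names do not contain newlines.)
count_files() {
    find "$1" -type f | wc -l | tr -d ' '
}

# stale_files DIR [DAYS]: list regular files under DIR whose modification
# time is more than DAYS days in the past (default 60).
stale_files() {
    find "$1" -type f -mtime +"${2:-60}"
}

# Example (run on the cluster):
#   count_files ~/scratch        # compare against the 1 million file limit
#   stale_files ~/scratch 60     # candidates for the next weekly purge
```

Anything listed by stale_files that you still need should be moved to ~/data or transferred off the cluster.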

File Transfer

  • This guide will focus on using Globus, a secure research data management tool that PACE provides free access to.

Step 1: Making an account

  • Georgia Tech already provides accounts, so all you have to do is sign in with your GT credentials
  • Go to https://www.globus.org/ and click Log In
  • Search for and select "Georgia Institute of Technology"; you will then be taken to the Georgia Tech login page
  • After logging in with your GT credentials, you will be redirected to the main file transfer screen

Step 2: Installing the Globus Personal Client

  • The Globus Connect Personal Client is how Globus will have access to your files
  • After installing the personal client, you will have to set up your computer as an endpoint to allow files to be transferred to and from it
  • Log in to Globus, then follow these well-documented guides on how to install the Globus Personal Client and set up an endpoint:
    • Windows
    • Mac
    • Linux. Linux is a bit tricky: after using tar to extract the files, if ./globusconnect & gives errors about tcllib, skip to the part of the guide titled "How to Install Globus Connect Personal for Linux Using the Command Line" and follow that.
    • On Linux, try ./globusconnectpersonal -start & if ./globusconnect & doesn't work

Warning

Linux: Globus also has a standalone command line tool, globus-cli, which needs to be installed on Linux if you are installing the personal client through the command line.

  • If you want to create an endpoint and do not have globus-cli installed or cannot log in, you can create the endpoint in the globus.org portal. Navigate to the endpoints link:

Screenshot

Choose Globus Connect Personal.

Screenshot

Select "generate setup key" for a chosen name (in our case rich116-f39-15-004).

Screenshot

Copy the setup key and pass it to globusconnectpersonal -setup <the key you copied>.

Screenshot

Step 3: Set Endpoints for transfer

  • The following sections describe the procedure for transferring files between the Hive Cluster and your personal computer.

Tip

To transfer files between clusters, set the cluster you wish to transfer to as the second endpoint instead of your personal computer. Once done, the rest of the procedure is the same.

Note

If you are asked for authentication when connecting to an endpoint, please enter your regular GT credentials.

Set Endpoints for Hive

  • Used to transfer files between your local file system and the Hive Cluster file system

Important

Make sure the Globus Personal Client is running on your computer before setting up endpoints

  • Log in, and you will be taken to the main file manager screen

Screenshot

  • To set up your computer as one endpoint:
    • Click on the Collection Bar at the top
    • Click on "Your Collections"
    • If you set up your endpoint correctly following the documentation linked above, the endpoint you created should show up
    • Select it and you should be taken back to the file manager page, with your computer's files available to transfer showing up on the left

Screenshot

  • To set up the cluster as the other endpoint:
    • Click "Transfer or Sync to" which should open up a blank endpoint screen right next to your computer's files

Screenshot

  • Select the top bar labeled "collection" (next to the name of your computer endpoint)
  • Search for "PACE Hive", and select it when it shows up.

Screenshot

  • You will be prompted for a username and password; enter your GT username and password

Screenshot

  • Your cluster account and all your files on the cluster should now show up on the right
  • You can now transfer files between your account on the cluster and your personal computer

Tip

After the first time setting the endpoints, they will be saved under recent, so you can easily select them. The Globus Personal Client must be running before setting any endpoints

Transfer files and folders

  • Select the folder / file you want to transfer
  • Select Transfer or Sync to... on the menu in the middle
  • Select the folder you want to transfer to
  • Select one of the Start buttons at the bottom of the screen, with the arrow direction corresponding to the direction of transfer (i.e., personal machine to cluster or cluster to personal machine)
  • The status of the transfer can be viewed in Activity, located in the menu on the left (click on the top left corner if it isn't shown)
  • Globus should send you an email when the transfer is complete
  • Example: Here I have selected the 3D Objects folder on my laptop and my test folder on the cluster
    • Transfer from my laptop to the cluster: To transfer the 3D Objects folder to the cluster, I would select the start arrow in the bottom left
    • Transfer from the cluster to my laptop: To transfer my test folder from the cluster to my laptop, I would select the start arrow in the bottom right

Screenshot
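
A transfer like this can also be submitted from the command line with the globus-cli tool. This is a sketch only: it assumes globus-cli is installed and that you have already run globus login. The helper name hive_transfer is made up for illustration, and the collection IDs are placeholders you would look up yourself (for example with globus endpoint search "PACE Hive").

```shell
# hive_transfer SRC_ID SRC_DIR DST_ID DST_DIR
# Submit a recursive directory transfer between two Globus collections.
# All four arguments are placeholders supplied by the caller.
hive_transfer() {
    if [ "$#" -ne 4 ]; then
        echo "usage: hive_transfer SRC_ID SRC_DIR DST_ID DST_DIR" >&2
        return 2
    fi
    globus transfer "$1:$2" "$3:$4" --recursive --label "hive-transfer"
}

# Example (the ID variables are hypothetical):
#   hive_transfer "$MY_LAPTOP_ID" "/~/3D Objects/" "$HIVE_ID" "/~/scratch/3D Objects/"
```

The submitted task then appears in Activity in the web interface, the same as a transfer started from the file manager.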

Using Globus Recap

Warning

Always make sure the Globus Personal Client is running before you go to transfer files

  1. Start the Globus Personal Client (using its program GUI if on Windows or Mac)
    • On Linux, navigate to where you installed the Globus Connect Personal files, cd globusconnectpersonal-x.y.z (x.y.z is the version number), and then run ./globusconnectpersonal -start &
  2. Log into Globus; you should then be directed to the main file manager page. Other pages, including the activity page and endpoint manager, are available in the drop-down menu on the left (click the top left if the menu is not visible)
  3. To set your personal machine as an endpoint, click on the "collection" bar on the top left (on the file manager page) and select your personal computer name (the personal endpoint name you made). It should be under "recents" or in "your collections". To set the cluster as the other endpoint, click on the collection bar on the top right, search for PACE, and select PACE Hive. Log in with your GT username and password.
  4. You can now transfer files and folders as you wish. Monitor transfer status with the Activity tab
  5. When you are done transferring files, log out of Globus online and log out of the personal client on your computer. On Linux, run globus logout to log out
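
On Linux, step 1 of the recap can be wrapped in a small helper. This is a hedged sketch: the function name start_gcp and its install directory argument are assumptions, and only the -start option shown in this guide is used.

```shell
# start_gcp INSTALL_DIR
# Start Globus Connect Personal in the background from INSTALL_DIR,
# e.g. the globusconnectpersonal-x.y.z directory extracted from the tarball.
start_gcp() {
    if [ ! -x "$1/globusconnectpersonal" ]; then
        echo "globusconnectpersonal not found in '$1'" >&2
        return 1
    fi
    "$1/globusconnectpersonal" -start &
}

# Example:
#   start_gcp ~/globusconnectpersonal-x.y.z
```

Checking for the executable first gives a clear error message if the directory is wrong, rather than a generic "command not found".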

Using globus-url-copy command line tool

First, complete the initial part of the globus-toolkit install as outlined in "Install Globus Connect Server", namely:

sudo curl -LOs https://downloads.globus.org/toolkit/globus-connect-server/globus-connect-server-repo-latest.noarch.rpm
sudo rpm --import https://downloads.globus.org/toolkit/gt6/stable/repo/rpm/RPM-GPG-KEY-Globus
sudo yum install globus-connect-server-repo-latest.noarch.rpm
#Install EPEL repo
sudo curl -LOs https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
sudo yum install epel-release-latest-7.noarch.rpm

and then the specific globus-url-copy and myproxy software

sudo yum install tcllib
sudo yum install globus-gass-copy-progs
sudo yum install myproxy
sudo yum install globus-proxy-utils

You may also want to set up globusconnectpersonal to manage a Globus Connect Personal endpoint that you can access via the globus.org portal or via globus-cli. It is not required for using globus-url-copy to upload files and directories to a remote endpoint.

wget https://downloads.globus.org/globus-connect-personal/linux/stable/globusconnectpersonal-latest.tgz
tar xzf globusconnectpersonal-latest.tgz
cd globusconnectpersonal-x.y.z
# set up Globus Connect Personal in ~/.globusonline/lta, or pass -dir ~/.somethingelse to set it up in ~/.somethingelse
./globusconnect -setup <whatever key you got from globus-cli or create endpoint on globus portal> # -dir ~/.somethingelse
# add -restrict-paths if you want to add additional shared paths, or edit ~/.globusonline/lta/config-paths
./globusconnect -start -restrict-paths RW~/,RW/writable/directory,R/read/only/directory,N/none/accessible/directory & # -dir ~/.somethingelse

Log in to a local Globus endpoint such as globus-research.pace.gatech.edu or iw-dm-4.pace.gatech.edu. This example uses the PACE Research internal high-speed endpoint, which is connected to the VAPOR network and is intended for servers attached to scientific instruments on that network. The argument -t 8760 sets the proxy lifetime to 8760 hours, or one year. The -b argument bootstraps the certificate directory from the endpoint (it also implies -T, which fetches the trust roots from that server), and the -s argument is the hostname of the server.

[amcneill3@rich116-f39-15 globusconnectpersonal-2.3.6]$ myproxy-logon -t 8760 -b -T -s globus-research
Bootstrapping MyProxy server root of trust.
New trusted MyProxy server: /C=US/O=Globus Consortium/OU=Globus Connect Service/CN=2bfacbe2-2af2-11e9-9fa4-0a06afd4a22e
New trusted CA (a059cd44.0): /C=US/O=Globus Consortium/CN=Globus Connect CA 3
Server authorization failed.  Server identity does not match expected identity.
If the server identity is acceptable, set
MYPROXY_SERVER_DN="/C=US/O=Globus Consortium/OU=Globus Connect Service/CN=2bfacbe2-2af2-11e9-9fa4-0a06afd4a22e"
and try again.
[amcneill3@rich116-f39-15 globusconnectpersonal-2.3.6]$ export MYPROXY_SERVER_DN="/C=US/O=Globus Consortium/OU=Globus Connect Service/CN=2bfacbe2-2af2-11e9-9fa4-0a06afd4a22e"
[amcneill3@rich116-f39-15 globusconnectpersonal-2.3.6]$ myproxy-logon -t 8760 -b -T -s globus-research
Enter MyProxy pass phrase:
A credential has been received for user amcneill3 in /tmp/x509up_u296017.
Trust roots have been installed in /nv/hp1/amcneill3/.globus/certificates/.

It is a good idea to save the environment variable MYPROXY_SERVER_DN for future use by placing it in your .bashrc or .bash_profile file as:

export MYPROXY_SERVER_DN="/C=US/O=Globus Consortium/OU=Globus Connect Service/CN=2bfacbe2-2af2-11e9-9fa4-0a06afd4a22e"
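
One way to add the line without duplicating it on repeated runs is sketched below. The helper name persist_export is hypothetical. It appends an export line to the file you name only if that exact line is not already present.

```shell
# persist_export FILE NAME VALUE
# Append `export NAME="VALUE"` to FILE unless that exact line already exists.
# The body runs in a subshell so the helper variable does not leak.
persist_export() (
    line="export $2=\"$3\""
    touch "$1" || exit 1
    grep -qxF -- "$line" "$1" || printf '%s\n' "$line" >> "$1"
)

# Example:
#   persist_export ~/.bashrc MYPROXY_SERVER_DN "$MYPROXY_SERVER_DN"
```

The grep options -x (whole line) and -F (fixed string) keep the slashes and equals signs in the DN from being interpreted as a pattern.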

To list the contents of a remote directory using globus-url-copy, set the source DN with -ss to the DN for globus-research; otherwise, the command will fail and complain that the hostname globus-research does not match the DN:

[amcneill3@rich116-f39-15 globus]$ globus-url-copy -ss "$MYPROXY_SERVER_DN" -list gsiftp://globus-research/~/scratch/5gbzd/

To upload a single file using globus-url-copy (here we want to set destination DN with -ds), do:

[amcneill3@rich116-f39-15 globus]$ globus-url-copy -ds "$MYPROXY_SERVER_DN" -fast -rst -rst-retries 0 -tcp-bs 1G -p 16 1G.dat gsiftp://globus-research/~/scratch/1Gb.dat
  • -fast = optimize transfer
  • -rst = restart the transfer if there is an interruption
  • -rst-retries 0 = retry an infinite number of times if there are restarts
  • -tcp-bs = set the TCP buffer size used by the transfer
  • -p 16 = set parallelism (can be 1 to 16)
  • 1G.dat = source file
  • gsiftp://globus-research/~/scratch/1Gb.dat = destination file on globus-server "PACE Research" endpoint

In order to upload a directory using globus-url-copy, use the -r (recursive) and -cd (create destination) arguments

[amcneill3@rich116-f39-15 globus]$ globus-url-copy -ds "$MYPROXY_SERVER_DN" -fast -rst -rst-retries 0 -tcp-bs 1G -p 16 -r -cd 5gbz/ gsiftp://globus-research/~/scratch/5gbzd/

To verify the checksum of the transfer, add the -verify-checksum argument

[amcneill3@rich116-f39-15 globus]$ globus-url-copy -ds "$MYPROXY_SERVER_DN" -verify-checksum -fast -rst -rst-retries 0 -tcp-bs 1G -p 16 -r -cd 5gbz/ gsiftp://globus-research/~/scratch/5gbzd/
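
The upload commands above differ only in a few flags, so they can be collected into one helper. The sketch below is an illustration, not a PACE-provided tool: the names guc_upload, GUC, and GUC_VERIFY are invented here, and it uses only the globus-url-copy options shown in this section. It assumes MYPROXY_SERVER_DN is already exported.

```shell
# Command to run; override GUC to try the helper without the real tool.
GUC="${GUC:-globus-url-copy}"

# guc_upload SRC DEST_URL [STREAMS]
# Upload SRC to DEST_URL. A SRC ending in "/" is treated as a directory and
# copied recursively. Set GUC_VERIFY=1 to add -verify-checksum.
# The body runs in a subshell so helper variables do not leak.
guc_upload() (
    src="$1"; dest="$2"; streams="${3:-16}"
    set -- -ds "$MYPROXY_SERVER_DN" -fast -rst -rst-retries 0 -tcp-bs 1G -p "$streams"
    case "$src" in
        */) set -- "$@" -r -cd ;;   # directory: recurse and create destination
    esac
    if [ "${GUC_VERIFY:-0}" = "1" ]; then
        set -- "$@" -verify-checksum
    fi
    "$GUC" "$@" "$src" "$dest"
)

# Examples:
#   guc_upload 1G.dat gsiftp://globus-research/~/scratch/1Gb.dat
#   GUC_VERIFY=1 guc_upload 5gbz/ gsiftp://globus-research/~/scratch/5gbzd/
```

Building the argument list with set -- keeps each option as a separate word, so a DN containing spaces is passed through intact.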

(to be completed)


This material is based upon work supported by the National Science Foundation under grant number 1828187. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.