Updated 2023-01-30
Launcher¶
Introduction¶
Launcher is a utility for performing simple, data-parallel, high-throughput computing (HTC) workflows on clusters, massively parallel processor (MPP) systems, workgroups of computers, and personal machines. It is a product of the Texas Advanced Computing Center (TACC) and is primarily written by Lucas A. Wilson.
Launcher is written in Python (the python/3.9.12 module is used on PACE). The files are available via GitHub, and the project is MIT-licensed.
License¶
Launcher is MIT-licensed. The MIT license gives permission for a person to deal in the software without restriction, as long as the license is included in all copies or substantial portions of the software, which is provided on an "as is" basis. Licensed works, modifications, and larger works may be distributed under different terms and without source code. However, this is not legal advice.
Why use Launcher¶
High-throughput computing (HTC) is a computing model in which many independent jobs need to be executed rapidly. HTC is sometimes also called bag-of-tasks (BOT) parallel, pleasantly parallel, or embarrassingly parallel, among other terms.
The key scenario in HTC is that there are many tasks, each of which needs to be executed on a separate input set and in turn generates separate output. HTC tends to find use in applications such as:
- Sequence alignment scoring of DNA/RNA sequences
- Protein/ligand docking
- Drug development
- Statistical analysis on unknown variables
- Parameter sweeps
- Text analysis
- Multi-start simulations
- Monte Carlo methods
Impediments to HTC on HPC systems¶
There are a number of potential problems faced by users running HTC workloads on shared HPC systems:
- Runtime limits
- Job-in-queue limits
- Dynamic job submission restrictions
Launcher allows each of these problems to be addressed, helping researchers run HTC workloads on HPC systems.
General information regarding the use of Launcher¶
Components of Launcher¶
- paramrun: top-level script which interfaces with the resource manager, ensures the environment is configured, and starts multi-node execution
- tskserver: dynamic scheduling service
- init_launcher: script responsible for process management for multi-core and many-core processors
- launcher: script responsible for executing the appropriate jobs from the job file, timing jobs, and providing stdout/stderr redirection
Control file¶
The control file (also referred to as a job file) is a listing of the programs that the user wants Launcher to execute, one command per line. Any of the environment variables listed in the Environment Variables section can be used. Here is an example of a control file:
echo 'Hello world' > hello.o$LAUNCHER_JID
./a.out $LAUNCHER_TSK_ID
./a.out $LAUNCHER_JID
echo $LAUNCHER_PPN
echo $LAUNCHER_NHOSTS
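Because control files are plain text with one command per line, they can be generated with a short script. Below is a minimal sketch for a parameter sweep; the executable ./a.out, the output naming, and the parameter range 1..100 are placeholders:

# Generate a control file with one line (one Launcher job) per parameter value.
# ./a.out and the range 1..100 are placeholders for your own program and sweep.
rm -f control_file
for p in $(seq 1 100); do
    echo "./a.out $p > output.$p" >> control_file
done

Each line of the file becomes one Launcher job, so the number of lines determines LAUNCHER_NJOBS.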
Environment Variables¶
Variable | Meaning/Usage
---|---
LAUNCHER_NHOSTS | Number of hosts on which the Launcher instance is executing
LAUNCHER_PPN | Number of processes per node/host
LAUNCHER_NPROCS | Total number of processes (NHOSTS * PPN)
LAUNCHER_TSK_ID | Launcher task number (0..NPROCS-1)
LAUNCHER_NJOBS | Number of jobs in this bundle (NJOBS = $(wc -l < control_file))
LAUNCHER_JID | The Launcher job currently being executed; corresponds to a particular line in the control file (1..NJOBS)
LAUNCHER_SCHED | Controls the type of scheduling Launcher uses (default is dynamic)
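Because these variables are exported to every command in the control file, they can be used to organize per-job output. A minimal sketch of a hypothetical wrapper script that each control-file line could call (run_one.sh, a.out, and the results/ layout are placeholder names, not part of Launcher):

#!/bin/bash
# run_one.sh -- hypothetical wrapper invoked from each control-file line,
# e.g. "./run_one.sh 42"
echo "Job $LAUNCHER_JID of $LAUNCHER_NJOBS (task $LAUNCHER_TSK_ID of $LAUNCHER_NPROCS)"
mkdir -p results/job_$LAUNCHER_JID                   # one output directory per control-file line
./a.out "$1" > results/job_$LAUNCHER_JID/out.log     # placeholder executable and output file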
Scheduling models¶
Launcher can schedule jobs using three methods: dynamic scheduling and two types of static scheduling, interleaved and block. On PACE, dynamic scheduling should not currently be used for multiple-node jobs (see the notes below).
Dynamic scheduling¶
With the default dynamic scheduling, Launcher uses a client-server model to actively assign jobs to tasks as previously assigned jobs are reported complete.
Dynamic scheduling is most efficient when the execution time of the component jobs is variable, so that processor idle time is reduced.
To explicitly request dynamic scheduling, add
export LAUNCHER_SCHED='dynamic'
within your SBATCH submission script or interactive session.
Static scheduling¶
Static scheduling is recommended for cases where the execution time of each job is known and is estimated to be very close to equal across jobs. Static scheduling computes the schedule at the beginning of the Launcher job, so it does not require a client-server relationship.
There are two algorithms for job distribution.
Interleaved scheduling¶
Interleaved scheduling is a simple method that assigns each job (line in the job file) to a particular task (i.e., a reserved process on a node) in round-robin order. For example, on a system where 1 node and 4 processors have been requested, the processor designated as the first one (task 1) gets jobs (lines) 1, 5, 9, ..., while task 2 gets jobs 2, 6, 10, ..., and so on.
To explicitly request interleaved static scheduling, add
export LAUNCHER_SCHED='interleaved'
within your SBATCH submission script or interactive session.
Block scheduling¶
In block scheduling, an equal (or as close to equal as possible) set of consecutive jobs is sent to each task. For example, task 1 may get jobs 1-10, task 2 gets jobs 11-20, and so on.
To explicitly request block static scheduling, add
export LAUNCHER_SCHED='block'
within your SBATCH submission script or interactive session.
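As a rough illustration of the difference, assume a run with 4 tasks (NPROCS=4) and a 12-line control file (NJOBS=12); the two static methods would assign jobs approximately as follows:

# Interleaved: jobs handed out round-robin by line number
#   task 1 -> jobs 1, 5, 9      task 2 -> jobs 2, 6, 10
#   task 3 -> jobs 3, 7, 11     task 4 -> jobs 4, 8, 12
# Block: jobs handed out in consecutive chunks
#   task 1 -> jobs 1-3          task 2 -> jobs 4-6
#   task 3 -> jobs 7-9          task 4 -> jobs 10-12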
Usage on PACE resources¶
Setting up Launcher on PACE¶
On PACE, simply load the module for Launcher: module load launcher
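After the module is loaded, the LAUNCHER_DIR environment variable should point at the installation (it is referenced by the example scripts below). A quick sanity check, as a minimal sketch:

module load launcher
echo $LAUNCHER_DIR                   # should print the Launcher installation path
ls $LAUNCHER_DIR/extras/examples     # bundled example control files, including pacemulti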
Examples on different types of clusters¶
Red Hat Enterprise Linux 6¶
SBATCH scripts¶
On Red Hat Enterprise Linux 6, the default version of Python is not sufficient for Launcher to function. Please include module load python/3.9.12 within the SBATCH submission script or interactive environment.
Due to a policy which may place multiple tasks' processes on a single node (e.g., a -N 2 --ntasks-per-node=8 request may effectively become a -N 1 --ntasks-per-node=16 allocation), more than half of the available processors on a node should be requested (e.g., -N 2 --ntasks-per-node=10 if there are 16 processors on each node). Otherwise, it is possible that Launcher will not be able to use all processors, since it assumes the same number of processes on each node.
Dynamic scheduling does not currently function as it should for multiple-node jobs because of communication limitations between nodes.
Basic script¶
#!/bin/bash
#
# Simple SBATCH script for submitting multiple serial
# jobs (e.g. parametric studies) using a script wrapper
# to launch the jobs.
#
# To use, build the launcher executable and your
# serial application(s) and place them in your WORKDIR
# directory. Then, edit the CONTROL_FILE to specify
# each executable per process.
#
#-------------------------------------------------------
# Setup Parameters
#SBATCH -J PACE_example
#SBATCH -N 1 --ntasks-per-node=4
#SBATCH -o %x.o%j
#SBATCH -t 5
#SBATCH -p inferno
#------------------------------------------------------
module load python/3.9.12
module load launcher
cd $SLURM_SUBMIT_DIR # optional: change to the directory from which the job script was submitted
export LAUNCHER_SCHED='interleaved'
export LAUNCHER_WORKDIR=$SLURM_SUBMIT_DIR
export LAUNCHER_JOB_FILE=$LAUNCHER_DIR/extras/examples/pacemulti
paramrun
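This script points LAUNCHER_JOB_FILE at a sample control file that ships with the module (pacemulti). To run your own jobs, point it at a control file in the submission directory instead; a minimal sketch, where my_control_file is a placeholder name:

export LAUNCHER_JOB_FILE=$SLURM_SUBMIT_DIR/my_control_file   # your own control file, one command per line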
Multiple-node script¶
#!/bin/bash
#
# Simple SBATCH script for submitting multiple serial
# jobs (e.g. parametric studies) using a script wrapper
# to launch the jobs.
#
# To use, build the launcher executable and your
# serial application(s) and place them in your WORKDIR
# directory. Then, edit the CONTROL_FILE to specify
# each executable per process.
#
#-------------------------------------------------------
#SBATCH -J PACE_example_mn
#SBATCH -N 2 --ntasks-per-node=10
#SBATCH -o %x.o%j
#SBATCH -t 5
#SBATCH -p inferno
#------------------------------------------------------
module load python/3.9.12
module load launcher
cd $SLURM_SUBMIT_DIR # optional: change to the directory from which the job script was submitted
export LAUNCHER_SCHED='block'
export LAUNCHER_WORKDIR=$SLURM_SUBMIT_DIR
# the user can bring their own job script, but here is an example
export LAUNCHER_JOB_FILE=$LAUNCHER_DIR/extras/examples/pacemulti
paramrun
Red Hat Enterprise Linux 7¶
SBATCH scripts¶
Due to a policy which may place multiple tasks' processes on a single node (e.g., a -N 2 --ntasks-per-node=8 request may effectively become a -N 1 --ntasks-per-node=16 allocation), more than half of the available processors on a node should be requested (e.g., -N 2 --ntasks-per-node=10 if there are 16 processors on each node). Otherwise, it is possible that Launcher will not be able to use all processors, since it assumes the same number of processes on each node.
Dynamic scheduling does not currently function as it should for multiple-node jobs because of communication limitations between nodes.
Single node dynamic scheduling¶
#!/bin/bash
#
# Simple SBATCH script for submitting multiple serial
# jobs (e.g. parametric studies) using a script wrapper
# to launch the jobs.
#
# To use, build the launcher executable and your
# serial application(s) and place them in your WORKDIR
# directory. Then, edit the CONTROL_FILE to specify
# each executable per process.
#
#-------------------------------------------------------
#SBATCH -J testflight_singlenode
#SBATCH -A [Account]
#SBATCH -N 1 --ntasks-per-node=8
#SBATCH -o %x.o%j
#SBATCH -t 5
#SBATCH -p testflight
#------------------------------------------------------
module load launcher
cd $SLURM_SUBMIT_DIR # optional: change to the directory from which the job script was submitted
export LAUNCHER_SCHED='dynamic' # this is the default
export LAUNCHER_WORKDIR=$SLURM_SUBMIT_DIR
# the user can bring their own job script, but here is an example
export LAUNCHER_JOB_FILE=$LAUNCHER_DIR/extras/examples/pacemulti
paramrun
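Assuming the script above is saved as launcher_single.sbatch (a placeholder name), it is submitted like any other Slurm batch job, and its progress can be checked with squeue:

sbatch launcher_single.sbatch    # submit the Launcher job to the scheduler
squeue -u $USER                  # check whether the job is pending or running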
Multiple-node static scheduling¶
#!/bin/bash
#
# Simple SBATCH script for submitting multiple serial
# jobs (e.g. parametric studies) using a script wrapper
# to launch the jobs.
#
# To use, build the launcher executable and your
# serial application(s) and place them in your WORKDIR
# directory. Then, edit the CONTROL_FILE to specify
# each executable per process.
#
#-------------------------------------------------------
#SBATCH -J testflight-mn-interleaved
#SBATCH -N 2 --ntasks-per-node=20
#SBATCH -o %x.o%j
#SBATCH -t 5
#SBATCH -p testflight
#------------------------------------------------------
module load launcher
cd $SLURM_SUBMIT_DIR # optional: change to the directory from which the job script was submitted
export LAUNCHER_SCHED='interleaved'
export LAUNCHER_WORKDIR=$SLURM_SUBMIT_DIR
# the user can bring their own job script, but here is an example
export LAUNCHER_JOB_FILE=$LAUNCHER_DIR/extras/examples/pacemulti
echo 'Unique nodes: '$(scontrol show hostnames $SLURM_JOB_NODELIST | sort -u)
paramrun