Updated 2022-02-18
Performance Profiling with ARM Map¶
Overview¶
ARM Map is a performance profiling tool for serial and parallel C, C++, and Fortran code, with particularly strong support for MPI and OpenMP profiling. It is part of the ARM Forge suite of parallel profilers and debuggers. The full documentation for Map can be found here.
Important
PACE systems have version 20.2 of Forge, which is not the most recent release. When you look up documentation or download the local client (see below), be sure to choose v20.2.
Phoenix Tutorial¶
1. One-Time Remote Client Setup¶
On your computer, download and install the ARM Forge remote client for your OS here. Be sure to download v20.2 of the client.
After it's installed, open the "ARM Forge Client" app. Click on "arm MAP" in the navigation on the left. Then, from the "Remote Launch" dropdown menu, select "Configure ..."
After clicking on "Add" in the "Configure Remote Connections" window, you will see the "Remote Launch Settings" window. Enter the information shown in the screenshot below:
* Host Name: Enter username@login-phoenix.pace.gatech.edu
, substituting your PACE usernae
* Remote Installation Directory: Enter /usr/local/pace-apps/manual/packages/forge/20.2
* Everything else: Uncheck or leave blank
You can click "Test Remote Launch" to make sure all your settings are correct. Then click "OK" to finish configuring the connection
2. Generating Profile for an MPI Program¶
On Phoenix, you can use MAP to profile any C, C++, or Fortran programs you have compiled yourself, as well as many pre-built programs available as PACE modules.
Optional: Compile Example Program¶
For this tutorial, we will be using a simple MPI program that runs an n-body simulation. The source file, nbodypipe.c, is adapted from Using MPI, 3rd Ed. and can be found here. After downloading it to Phoenix, compile it with:
$ mpicc -g nbodypipe.c -o nbodypipe
The -g flag inserts debugging info, which can be helpful when viewing the profiles. This program can be run with mpirun and takes one argument for the number of particles (for example, use mpirun -np 4 ./nbodypipe 10000 to simulate 10000 particles on 4 MPI ranks).
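Putting these steps together, a short test run on 4 ranks might look like the sketch below. The module names are placeholders; load whichever compiler and MPI modules you normally build against on Phoenix.
$ module load gcc mvapich2    # hypothetical module names; use your usual toolchain
$ mpicc -g nbodypipe.c -o nbodypipe
$ mpirun -np 4 ./nbodypipe 10000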
Running Map on Command Line and in Jobs¶
On Phoenix, after loading the forge module, you can run Map and the other ARM tools. An MPI program can be profiled with Map using commands such as:
$ module load forge
$ map --profile --np=4 PROGRAM [ARGUMENTS]...
This will produce a *.map file that can be loaded in the remote client for visualization. By default, the name of the map file will have the form <program_name>_<num_procs>_<num_threads>_<timestamp>.map.
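For example, profiling the n-body program from above on 4 ranks (ideally from inside a job, as described next) would look roughly like this; the timestamp portion of the filename will vary from run to run:
$ module load forge
$ map --profile --np=4 ./nbodypipe 10000
$ ls *.map
nbodypipe_4p_1n_<timestamp>.map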
The same command can (and should!) be run from a typical job script. Using the n-body program we compiled above, we can submit the following script, which will produce a *.map file named nbodypipe_4p_1n_<timestamp>.map.
#PBS -N nbodypipe
#PBS -A pace-admins
#PBS -l nodes=1:ppn=4
#PBS -l pmem=8gb
#PBS -l walltime=00:15:00
#PBS -q inferno
#PBS -j oe
# Load ARM Forge, then run Map in profiling mode from the job's submission directory
module load forge
cd $PBS_O_WORKDIR
map --profile --np=4 ./nbodypipe 50000
If you are profiling a program from a PACE module, load the other necessary modules along with forge.
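Assuming you save the script above as nbodypipe.pbs (the filename is arbitrary), submit it with qsub; once the job finishes, the .map file will appear in the submission directory:
$ qsub nbodypipe.pbs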
3. Visualizing the Profiles¶
On your computer, open the "ARM Forge Client". Select "arm Map" from the navigation on the left. From the "Remote Launch" dropdown menu, select the "login-phoenix" connection that you configured earlier. Finally, click on "LOAD PROFILE DATA FILE". This will let you load files directly from Phoenix without copying them to your own computer. To proceed, open the *.map file that you generated above.
This will open a window presenting a wealth of runtime measurements from your program. A good place to start is the "Main Thread Stacks" pane, which breaks down total execution time by function call. If you compiled your code with debugging info (using -g), clicking on an item in "Main Thread Stacks" highlights the corresponding line in the code viewer.
In the results from our n-body simulation, the line fy -= mj * ry / r takes 38.3% of our program's CPU time. Overall, computing the forces on the particles (lines 163-165) takes the vast majority of the CPU time.
To visualize MPI performance, you can start by selecting "Metrics --> Preset MPI" from the menu at the top of the window. This will change the top panels to show bandwidth usage, function calls, and other data over time (often called "sparklines"). The bottom two panels have the same purpose as before.
In our results, the sparklines show relatively few MPI operations. Indeed, from the bottom panels, we can see that MPI_Allreduce only takes 0.4% of our program's CPU time.
There are many other metrics, such as memory and I/O utilization, that can be selected from the "Metrics" menu at the top of the window.