Performance Profiling with ARM Map¶
ARM Map is a performance profiling tool for serial and parallel C, C++, and Fortran code. In particular, it has excellent support for MPI and OpenMP profiling. It is part of the ARM Forge suite of parallel profilers and debuggers. The full documentation for Map can be found here.
PACE systems have version 20.2 of Forge, which is not the most recent release. When you look up documentation or download the local client (see below), be sure you choose v20.2.
1. One-Time Remote Client Setup¶
On your computer, download and install the ARM Forge remote client for your OS here. Be sure to download v20.2 of the client.
After it's installed, open the "ARM Forge Client" app. Click on "arm MAP" in the navigation on the left. Then, from the "Remote Launch" dropdown menu, select "Configure ..."
After clicking on "Add" in the "Configure Remote Connections" window, you will see the "Remote Launch Settings" window. Enter the information shown in the screenshot below:
* Host Name: Enter email@example.com, substituting your PACE username
* Remote Installation Directory: Enter
* Everything else: Uncheck or leave blank
You can click "Test Remote Launch" to make sure all your settings are correct. Then click "OK" to finish configuring the connection.
2. Generating Profile for an MPI Program¶
On Phoenix, you can use MAP to profile any C, C++, or Fortran programs you have compiled yourself, as well as many pre-built programs available as PACE modules.
Optional: Compile Example Program¶
For this tutorial, we will use a simple MPI program that runs an n-body simulation. The source file, nbodypipe.c, is adapted from Using MPI, 3rd Ed. and can be found here. To download and compile it on Phoenix, run:
$ mpicc -g nbodypipe.c -o nbodypipe
The -g flag inserts debugging info, which can be helpful when viewing the profiles. This program can be run with mpirun and takes one argument for the number of particles (for example, use mpirun -np 4 ./nbodypipe 10000 to simulate 10000 particles on 4 MPI ranks).
Running Map on Command Line and in Jobs¶
On Phoenix, after loading the forge module, you can run Map and the other ARM tools. An MPI program can be profiled with Map using commands such as:
$ module load forge
$ map --profile --np=4 PROGRAM [ARGUMENTS]..
This will produce a *.map file that can be loaded in the remote client for visualization. By default, the name of the map file will have the form:
These same commands can (and should!) be run from a typical job script. Using the n-body program we compiled above, we can submit the following script, which will produce a *.map file:

#PBS -N nbodypipe
#PBS -A pace-admins
#PBS -l nodes=1:ppn=4
#PBS -l pmem=8gb
#PBS -l walltime=00:15:00
#PBS -q inferno
#PBS -j oe

module load forge
cd $PBS_O_WORKDIR
map --profile --np=4 ./nbodypipe 50000
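Once the job finishes, Map writes its profile into the submission directory. Because the generated file name varies from run to run, a quick way to grab the newest profile is sketched below. This is a generic shell sketch: the touch line merely simulates Map's output so the snippet is self-contained, and the file name used is a made-up placeholder, not Map's actual naming scheme.

```shell
# Simulate a profile written by Map; in a real run, the job above creates
# this file, and the placeholder name below would be Map's generated name.
touch nbodypipe_profile_example.map

# Pick the most recently modified .map file in the current directory.
latest=$(ls -t -- *.map | head -n 1)
echo "Latest profile: $latest"
```

You can then open that file in the remote client, as described in the next section.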
If you are profiling a program from a PACE module, load the other necessary modules along with forge.
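For example, the relevant lines of a job script might look like the following. The module and executable names here are placeholders, not real PACE modules; substitute whatever your program actually needs:

```shell
# Job-script fragment (placeholder module/binary names, for illustration only):
module load forge
module load my-application-module   # the module providing your program
map --profile --np=4 my-program [ARGUMENTS]..
```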
3. Visualizing the Profiles¶
On your computer, open the "ARM Forge Client". Select "arm Map" from the navigation
on the left. From the "Remote Launch" dropdown menu, select the "login-phoenix" connection that you configured earlier. Finally, click on "LOAD PROFILE DATA FILE".
This will let you load files directly from Phoenix without copying them
to your own computer. To proceed, open the
*.map file that you generated above.
This will open a window presenting a wealth of runtime measurements from your program. A good place to start is the "Main Thread Stacks" pane, which shows total execution time organized by function call. If you compiled your code with debugging info (using -g), then clicking on an item in "Main Thread Stacks" highlights the corresponding line in the code viewer.
In the results from our n-body simulation, the line
fy -= mj * ry / r takes 38.3% of our program's CPU time. Overall, in our simulation, computing the forces on
the particles (lines 163 - 165) takes the vast
majority of the CPU time.
To visualize MPI performance, you can start by selecting "Metrics --> Preset MPI" from the menu at the top of the window. This will change the top panels to show bandwidth usage, function calls, and other data over time (often called "sparklines"). The bottom two panels have the same purpose as before.
In our results, the sparklines show relatively few MPI operations. Indeed, from the bottom panels, we can see that MPI_Allreduce takes only 0.4% of our program's CPU time.
There are many other metrics, such as memory and I/O utilization, that can be selected from the "Metrics" menu at the top of the window.