Updated 2022-02-18

Performance Profiling with ARM Map

Overview

ARM Map is a performance profiling tool for serial and parallel C, C++, and Fortran code. In particular, it has excellent support for MPI and OpenMP profiling. It is part of the ARM Forge suite of parallel profilers and debuggers. The full documentation for Map can be found here.

Important

PACE systems have version 20.2 of Forge, which is not the most recent release. When you look up documentation or download the local client (see below), be sure you choose v20.2.

Phoenix Tutorial

1. One-Time Remote Client Setup

On your computer, download and install the ARM Forge remote client for your OS here. Be sure to download v20.2 of the client.

After it's installed, open the "ARM Forge Client" app. Click on "arm MAP" in the navigation on the left. Then, from the "Remote Launch" dropdown menu, select "Configure ..."

Screenshot

After clicking on "Add" in the "Configure Remote Connections" window, you will see the "Remote Launch Settings" window. Enter the information shown in the screenshot below:

* Host Name: Enter username@login-phoenix.pace.gatech.edu, substituting your PACE username
* Remote Installation Directory: Enter /usr/local/pace-apps/manual/packages/forge/20.2
* Everything else: Uncheck or leave blank

Screenshot

You can click "Test Remote Launch" to make sure all your settings are correct. Then click "OK" to finish configuring the connection.

2. Generating Profile for an MPI Program

On Phoenix, you can use Map to profile any C, C++, or Fortran programs you have compiled yourself, as well as many pre-built programs available as PACE modules.

Optional: Compile Example Program

For this tutorial, we will be using a simple MPI program that runs an n-body simulation. The source file, nbodypipe.c, is adapted from Using MPI, 3rd Ed. and can be found here. After downloading it to Phoenix, compile it with:

$ mpicc -g nbodypipe.c -o nbodypipe

The -g flag includes debugging information, which lets Map link profile data back to individual lines of source code. The program is run with mpirun and takes a single argument: the number of particles to simulate.
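
For example, to simulate 10000 particles on 4 MPI ranks:

$ mpirun -np 4 ./nbodypipe 10000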

Running Map on Command Line and in Jobs

On Phoenix, after loading the forge module, you can run Map and the other ARM tools. An MPI program can be profiled with Map using commands such as:

$ module load forge
$ map --profile --np=4 PROGRAM [ARGUMENTS]...

This will produce a *.map file that can be loaded in the remote client for visualization. By default, the name of the map file will have the form: <program_name>_<num_procs>_<num_threads>_<timestamp>.map.
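
Many MPI implementations also work with Map's "express launch" mode, where you prefix your usual launch command instead of using --np. Whether this form is accepted depends on the MPI you have loaded, so treat the line below as a sketch and check the v20.2 user guide if it is rejected:

$ map --profile mpirun -np 4 PROGRAM [ARGUMENTS]...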

These same commands can (and should!) be run from a typical job script. Using the n-body program we compiled above, we can submit the following script, which will produce a *.map file named nbodypipe_4p_1n_<timestamp>.map.

#PBS -N nbodypipe
#PBS -A pace-admins
#PBS -l nodes=1:ppn=4
#PBS -l pmem=8gb
#PBS -l walltime=00:15:00
#PBS -q inferno
#PBS -j oe

module load forge

cd $PBS_O_WORKDIR
map --profile --np=4 ./nbodypipe 50000

If you are profiling a program from a PACE module, load the other necessary modules along with forge.
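
As a rough sketch, a job script for profiling a module-provided MPI application might look like the following. The module name example-app and the executable example_exe are placeholders, not real PACE module or program names:

#PBS -N map-example
#PBS -A pace-admins
#PBS -l nodes=1:ppn=4
#PBS -l pmem=8gb
#PBS -l walltime=00:15:00
#PBS -q inferno
#PBS -j oe

# Load the application's module(s) alongside forge
module load forge
module load example-app    # placeholder: substitute the real module name

cd $PBS_O_WORKDIR
map --profile --np=4 example_exe    # add the program's own arguments here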

3. Visualizing the Profiles

On your computer, open the "ARM Forge Client". Select "arm Map" from the navigation on the left. From the "Remote Launch" dropdown menu, select the "login-phoenix" connection that you configured earlier. Finally, click on "LOAD PROFILE DATA FILE". This will let you load files directly from Phoenix without copying them to your own computer. To proceed, open the *.map file that you generated above.

Screenshot

This will open a window presenting a wealth of runtime measurements from your program. A good place to start is the "Main Thread Stacks" pane, which shows total execution time broken down by function call. If you compiled your code with debugging info (using -g), clicking on an item in "Main Thread Stacks" highlights the corresponding line in the code viewer.

Screenshot

In the results from our n-body simulation, the line fy -= mj * ry / r takes 38.3% of our program's CPU time. Overall, in our simulation, computing the forces on the particles (lines 163 - 165) takes the vast majority of the CPU time.
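
For context, the hot spot in a code like this is usually the pairwise force accumulation. The snippet below is a simplified, serial illustration of that pattern, not the actual nbodypipe.c; everything other than the fy -= mj * ry / r form of the update is an assumption for the sake of a runnable example:

#include <math.h>
#include <stdio.h>

#define N 1000

int main(void) {
    static double x[N], y[N], m[N], fx[N], fy[N];

    /* Arbitrary positions and masses, just to make the loop runnable */
    for (int i = 0; i < N; i++) {
        x[i] = 0.001 * i;
        y[i] = 0.002 * i;
        m[i] = 1.0;
    }

    /* Pairwise force accumulation: an O(N^2) loop nest like this is
       where Map typically attributes most of the CPU time in an
       n-body code. */
    for (int i = 0; i < N; i++) {
        fx[i] = 0.0;
        fy[i] = 0.0;
        for (int j = 0; j < N; j++) {
            if (j == i) continue;
            double rx = x[i] - x[j];
            double ry = y[i] - y[j];
            double mj = m[j];
            /* r holds |r|^3, so mj * ry / r is the y-component of the
               acceleration due to particle j */
            double r = pow(rx * rx + ry * ry, 1.5);
            fx[i] -= mj * rx / r;
            fy[i] -= mj * ry / r;   /* the kind of line that dominates the profile */
        }
    }

    printf("force on particle 0: (%g, %g)\n", fx[0], fy[0]);
    return 0;
}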

To visualize MPI performance, you can start by selecting "Metrics --> Preset MPI" from the menu at the top of the window. This will change the top panels to show bandwidth usage, function calls, and other data over time (often called "sparklines"). The bottom two panels have the same purpose as before.

Screenshot

In our results, the sparklines show relatively few MPI operations. Indeed, from the bottom panels, we can see that MPI_Allreduce only takes 0.4% of our program's CPU time.
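
For reference, MPI_Allreduce combines a value contributed by every rank and returns the result to all ranks; in an n-body step it is often used for something like agreeing on a global timestep. The snippet below is a generic illustration of such a call, not code taken from nbodypipe.c:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank proposes a local timestep; all ranks receive the smallest one. */
    double local_dt = 0.01 / (rank + 1);
    double global_dt;
    MPI_Allreduce(&local_dt, &global_dt, 1, MPI_DOUBLE, MPI_MIN, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global dt = %g\n", global_dt);

    MPI_Finalize();
    return 0;
}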

There are many other metrics, such as memory and I/O utilization, that can be selected from the "Metrics" menu at the top of the window.