Updated 2022-10-31
Intel VTune Profiler
Overview
Intel VTune Profiler is a powerful tool for measuring the performance of parallel codes on the cluster. It can profile CPU usage, memory usage, and inter-process communication. Unlike many profilers, VTune is highly compatible with MPI. It is also compatible with multithreaded codes, including hybrid MPI/multithreaded codes. It is best suited for codes written in C, C++, and/or Fortran.
In this guide, we will cover its usage with an MPI C program on the Slurm clusters.
Basic Steps
The basic steps are:
- Compile your program with debugging symbols (the -g flag plus any -O flags).
- Run your program on the cluster with the vtune command-line tool. This will generate a performance profile.
- In a graphical OnDemand session, use the VTune GUI to view the profile. (A minimal sketch of the full workflow follows this list.)
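Putting these steps together, a minimal sketch of the workflow looks like the following (this assumes an MPI C source file named my_program.c and the mpicc compiler wrapper provided by MVAPICH2; the module versions match those used later in this guide, and the srun line belongs inside a batch job):

module load intel/20.0.4
module load mvapich2/2.3.6
mpicc -g -O2 -o my_program my_program.c                                        # step 1: compile with debugging symbols
srun vtune -collect=hotspots -data-limit=0 -r results_hotspots ./my_program    # step 2: collect a profile (inside a batch job)
vtune-gui                                                                      # step 3: view the profile (from an OnDemand desktop session)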
Detailed Example
In this example, we will analyze a self-scheduling matrix-vector multiplication that uses MPI. It is based on an example from Gropp, Lusk, and Skjellum, Using MPI, 3rd Ed. It is a very inefficient matrix-vector implementation, but it illustrates a common parallel communication pattern and gives predictable performance profiles. The code is shown in Example Code.
Expected Profile
The figure below represents the memory and communication pattern on 4 ranks. The overall operation is Ax = y, where A is an n-by-n matrix and x and y are n-by-1 column vectors. The "boss" (rank 3) is initialized with the full operand A and will collect the full solution y. Each "worker" (ranks 0, 1, 2) is initialized with the full operand x and will have a receive buffer for one row of A. From the perspective of each worker, the following will happen:
- Receive one row Aₖ from the boss
- Compute the dot product Aₖ ⋅ x to get one element of the solution yₖ
- Send yₖ back to the boss
- Repeat while there are remaining rows of A
Note that, because the workers are self-scheduling, they may process the rows in a different order than shown.
From this model, we might expect the following:
- Assuming that the arrays use double-precision floats (8 bytes each):
  - Rank 3 will have (8n² + 8n) bytes of data
  - Ranks 0, 1, 2 will each have (16n) bytes of data
- The compute time on each worker (the dot product) will be greater than the compute time on the boss.
- The send/receive for each row Aₖ will require more time than the send/receive for yₖ.
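For example, with n = 16384 (the matrix size used in the runs below), this model predicts roughly 8 × 16384² + 8 × 16384 ≈ 2 GiB of data on rank 3 and 16 × 16384 bytes = 256 KiB on each worker; we will check these numbers against the memory-consumption profile later.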
Compiling Code
To use Intel VTune, you should compile your code with Intel and MVAPICH2. To load the modules:
module load intel/20.0.4
module load mvapich2/2.3.6
You should also compile your code with debugging symbols using the -g flag and any optimization flags you require. Be aware that -g turns off many optimizations by default, but you can explicitly enable both at the same time (for example, use both -g -O2).
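For example, with the modules above loaded, the compile line might look like the following (a sketch that assumes the source file is named matvec.c and uses the mpicc wrapper provided by MVAPICH2):

mpicc -g -O2 -o matvec matvec.c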
Collecting Profiles
To collect your profile in a batch job, pass your program to the vtune command. For example, if our program were called matvec and took one argument (in this case, 16384, the size n of the matrix), we could use these options to collect CPU "hotspots" data:
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --mem-per-cpu=4GB
module load intel/20.0.4
module load mvapich2/2.3.6
srun vtune -collect=hotspots -data-limit=0 -r results_hotspots ./matvec 16384
The options we used for vtune are:
- -collect=hotspots: Specifies the type of profiling data we will collect. hotspots shows a breakdown of CPU time by rank and function. There are many others; run vtune -help collect to see all available analysis types.
- -r results_hotspots: Specifies the output directory for the profile. You can use any valid directory path.
- -data-limit=0: Removes the default limit on the amount of data collected; recommended for MPI profiling.
In a separate run, we will also collect "memory-consumption" data (for profiling memory usage).
We recommend collecting only one category of data per run.
srun vtune -collect=memory-consumption -data-limit=0 -r results_memory ./matvec 16384
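Putting both collections together, a complete batch script might look like the following (a sketch: the resource requests mirror the fragment above, while the job name and time limit are placeholder assumptions; each srun launch is a separate run that writes to its own results directory):

#!/bin/bash
#SBATCH --job-name=matvec_vtune
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --mem-per-cpu=4GB
#SBATCH --time=01:00:00

module load intel/20.0.4
module load mvapich2/2.3.6

# Collect CPU hotspots data in one run ...
srun vtune -collect=hotspots -data-limit=0 -r results_hotspots ./matvec 16384

# ... and memory-consumption data in a separate run
srun vtune -collect=memory-consumption -data-limit=0 -r results_memory ./matvec 16384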
Viewing Profiles
To view your profile data, first start an Interactive Desktop session in Open OnDemand. You only need one node and one task, since you will only be viewing data from the previous run; however, you may need to request a larger amount of memory (such as 16 or 32 GB). In your interactive desktop session, open the terminal and run the following:
$ module load intel/20.0.4
$ module load mvapich2/2.3.6
$ vtune-gui
A welcome screen for the GUI will pop up (see screenshot below). From there, navigate to File -> Open -> Result; navigate to the directory containing your results (specified by the -r flag in your batch script); and open the file with the .vtune extension.
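As a shortcut, vtune-gui should also accept the path to a result directory on the command line, which skips the File menu (a sketch, assuming the results_hotspots directory from the batch script above):

$ vtune-gui results_hotspots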
Hotspots
First, we will look at the "hotspots" results (see screenshot below). The GUI can be overwhelming, so we will start with the "Top-down Tree" view and the "CPU Time: Total" column. We observe the following:
- The MPI_Recv in worker_task and the PMPI_Send in boss_task take a considerable amount of time. This was expected from our performance model.
- The boss_setup also takes a considerable amount of time. This was not expected from our initial performance model, but by re-examining our code, we can see that boss_setup loops through every element of the A matrix, which would indeed be very time-consuming.
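To put that in perspective: with n = 16384, the boss_setup loop initializes 16384² ≈ 2.7 × 10⁸ matrix elements serially on the boss rank, so a large share of CPU time there is unsurprising.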
Memory Consumption
Next, we will look at the "memory-consumption" results (see screenshot below). We will look at the "Bottom-up" View, group by "Module/Function/Stack", and sort by "Allocation Size". Recall that we ran with a 16384-by-16384 A matrix. From the "Allocation Size" column, we can observe the following:
- We know that boss_setup allocates the A matrix. Given that A was a 16384-by-16384 matrix of doubles, we can expect it to allocate 16384 × 16384 × 8 bytes = 2 GiB. Likewise, VTune reports that boss_setup allocated 2 GiB of data.
- We know that worker_setup allocates the x vector. Given that x was a 16384-by-1 column vector of doubles, we can expect the 3 workers to allocate 3 × 16384 × 8 bytes = 384 KiB in total. Likewise, VTune reports that worker_setup allocated 384 KiB.
- We know that worker_task allocates the receive buffer for one row of A. Given that each row is a 1-by-16384 row vector, we can expect the 3 workers to allocate 3 × 16384 × 8 bytes = 384 KiB in total. Likewise, VTune reports that worker_task allocated 384 KiB.
Example Code
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
// A better way is to query the MPI_TAG_UB attribute for the largest valid tag value
const int TERM_TAG = 99999999;
void boss_setup(double **A, double **y, int n) {
*A = malloc(n * n * sizeof(double));
*y = malloc(n * sizeof(double));
for (int i = 0; i < n * n; ++i) {
(*A)[i] = i;
}
}
void worker_setup(double **x, int n) {
*x = malloc(n * sizeof(double));
for (int i = 0; i < n; ++i) {
(*x)[i] = i;
}
}
void boss_task(const double* A, double* y, int n, int n_workers) {
MPI_Status stat;
// Boss sends the first nranks-1 rows. Tag is the row index.
for (int i = 0; i < n_workers; ++i) {
MPI_Send(&A[i * n], n, MPI_DOUBLE, i, i, MPI_COMM_WORLD);
}
int next_row = n_workers;
for (int i = 0; i < n; ++i) {
// Boss receives a dot product from any worker. Tag is the row that the worker
// computed.
double dot;
MPI_Recv(&dot, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &stat);
y[stat.MPI_TAG] = dot;
// If there are any rows left, boss sends a row back to the worker. Tag is row index.
if (next_row < n) {
MPI_Send(&A[next_row * n], n, MPI_DOUBLE, stat.MPI_SOURCE, next_row, MPI_COMM_WORLD);
++next_row;
}
// If no rows left, boss sends the termination tag (instead of row index)
else {
MPI_Send(A, n, MPI_DOUBLE, stat.MPI_SOURCE, TERM_TAG, MPI_COMM_WORLD);
}
}
}
void worker_task(const double* x, int n, int boss_rank) {
MPI_Status stat;
double* row = malloc(n * sizeof(double));
while (1) {
// Worker receives rows until it gets a term tag
MPI_Recv(row, n, MPI_DOUBLE, boss_rank, MPI_ANY_TAG, MPI_COMM_WORLD, &stat);
if (stat.MPI_TAG == TERM_TAG)
break;
// The worker computes the dot product and sends it back to the boss.
// The tag from the boss was the row idx, but the worker doesn't need it, so the
// worker sends it back to the boss
double dot = 0;
for (int i = 0; i < n; ++i) {
dot += row[i] * x[i];
}
MPI_Send(&dot, 1, MPI_DOUBLE, boss_rank, stat.MPI_TAG, MPI_COMM_WORLD);
}
free(row);
}
int main(int argc, char* argv[]) {
MPI_Init(&argc, &argv);
int n_ranks, rank;
MPI_Comm_size(MPI_COMM_WORLD, &n_ranks);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
int boss_rank = n_ranks - 1;
// n is the size of the matrix and vectors
// TODO: Check argc and assert n >= n_ranks - 1 (at least one row per worker)
int n = atoi(argv[1]);
// Allocate matrix and arrays
double *A, *x, *y;
if (rank == boss_rank) {
boss_setup(&A, &y, n);
boss_task(A, y, n, n_ranks - 1);
}
else {
worker_setup(&x, n);
worker_task(x, n, boss_rank);
}
if (rank == boss_rank) {
for (int i = 0; i < n; ++i) {
printf("%0.0f ", y[i]);
}
printf("\n");
}
if (rank == boss_rank) {
free(A);
free(y);
} else {
free(x);
}
MPI_Finalize();
}