Updated 2022-10-31

Intel VTune Profiler


Intel VTune Profiler is a powerful tool for measuring the performance of parallel codes on the cluster. It can profile CPU usage, memory usage, and inter-process communication. Unlike many profilers, VTune is highly compatible with MPI. It is also compatible with multithreaded codes, including hybrid MPI/multithreaded codes. It is best suited for codes written in C, C++, and/or Fortran.

In this guide, we will cover its usage with an MPI C program on the Slurm clusters.

Basic Steps

The basic steps are:

  1. Compile your program with debugging symbols (the -g flag plus any -O flags)
  2. Run your program on the cluster with the vtune command-line tool. This will generate a performance profile.
  3. In a graphical OnDemand session, use the VTune GUI to view the profile.

Detailed Example

In this example, we will be analyzing a self-scheduling matrix-vector multiplication that uses MPI. It is based on an example from Gropp, Lusk, and Skjellum, Using MPI, 3rd ed. It is a very inefficient matrix-vector implementation, but it illustrates a common parallel communication pattern and gives predictable performance profiles. The code is shown in Example Code.

Expected Profile

The figure below represents the memory and communication pattern on 4 ranks. The overall operation is Ax = y, where A is an n-by-n matrix, and x and y are n-by-1 column vectors. The "boss" (rank 3) is initialized with the full operand A and will collect the full solution y. Each "worker" (ranks 0, 1, 2) is initialized with the full operand x and will have a receive buffer for one row of A. From the perspective of each worker, the following will happen:

  1. Receive one row Aₖ from the boss
  2. Compute the dot product Aₖ ⋅ x to get one element of the solution yₖ
  3. Send yₖ back to the boss
  4. Repeat while there are remaining rows of A

Note that, because the workers are self-scheduling, the workers may process the rows in a different order than shown.


From this model, we might expect the following:

  • Assuming that the arrays are using double-precision floats (8 bytes each):

    • Rank 3 will have (8n² + 8n) bytes of data (the n-by-n matrix A plus the n-by-1 vector y)
    • Ranks 0, 1, 2 will each have (16n) bytes of data (the vector x plus a one-row receive buffer)
  • The compute time on each worker (the dot product) will be greater than the compute time on the boss.

  • The send/receive for each row Aₖ will require more time than the send/receive for yₖ

Compiling Code

To use Intel VTune, you should compile your code with the Intel compilers and MVAPICH2. To load the modules:

module load intel/20.0.4
module load mvapich2/2.3.6

You should also compile your code with debugging symbols using the -g flag, along with any optimization flags you require. Be aware that -g by itself does not enable optimization, so you should explicitly request both at the same time (for example, use both -g -O2).
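For example, assuming your source file is named matvec.c (mpicc is the MPI compiler wrapper provided by the MVAPICH2 module):

mpicc -g -O2 -o matvec matvec.c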

Collecting Profiles

To collect your profile in a batch job, pass your program to the vtune command. For example, if our program were called matvec and took one argument (in this case 16384, the size n of the matrix), we could use these options to profile CPU hotspots:

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --mem-per-cpu=4GB

module load intel/20.0.4
module load mvapich2/2.3.6

srun vtune -collect=hotspots -data-limit=0 -r results_hotspots ./matvec 16384

The options we used for vtune are:

  • -collect=hotspots: This specifies the type of analysis to collect.
    hotspots shows a breakdown of CPU time by rank and function. There are many other analysis types; run vtune -help collect to see them all.
  • -r results_hotspots: Specifies the output directory for the profiles. You can use any valid directory path.
  • -data-limit=0: Removes the limit on the amount of data collected; recommended for MPI profiling.

In a separate run, we will also collect "memory-consumption" data (for profiling memory usage).
We recommend collecting only one category of data per run.

srun vtune -collect=memory-consumption -data-limit=0 -r results_memory ./matvec 16384
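Assuming each set of lines above is saved as its own batch script (the names profile_hotspots.sh and profile_memory.sh are just placeholders), the two collection runs could be submitted separately:

sbatch profile_hotspots.sh
sbatch profile_memory.sh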

Viewing Profiles

To view your profile data, first start an Interactive Desktop session in Open OnDemand. You only need one node and one task, since you will only be viewing data from the previous run; however, you may need to request a larger amount of memory (such as 16 or 32 GB). In your interactive desktop session, open the terminal, and run the following:

$ module load intel/20.0.4
$ module load mvapich2/2.3.6
$ vtune-gui

A welcome screen for the GUI will pop up (see screenshot below). From there, navigate to File -> Open -> Result; navigate to the directory containing your results (specified by the -r flag in your batch script); and open the file with the .vtune extension.



First, we will look at the "hotspots" results (see screenshot below). The GUI can be overwhelming, so we will start with the "Top-down Tree" tab and the "CPU Time: Total" column. We observe the following:

  • The MPI_Recv in worker_task and the PMPI_Send in the boss_task take a considerable amount of time. This was expected from our performance model.
  • The boss_setup also takes a considerable amount of time. This was not expected from our initial performance model. But by re-examining our code, we can see that boss_setup loops through every element of the A matrix, which would indeed be very time-consuming.


Memory Consumption

Next, we will look at the "memory-consumption" results (see screenshot below). We will look at the "Bottom-up" View, group by "Module/Function/Stack", and sort by "Allocation Size". Recall that we ran with a 16384-by-16384 A matrix. From the "Allocation Size" column, we can observe the following:

  • We know that boss_setup allocates the A matrix. Given that A was a 16384-by-16384 matrix of doubles, we can expect it to allocate 16384 × 16384 × 8 bytes = 2 GiB. Likewise, VTune reports that boss_setup allocated 2 GiB of data.
  • We know that worker_setup allocates the x vector. Given that x was a 16384-by-1 column vector of doubles, we can expect the 3 workers to allocate 3 × 16384 × 8 bytes = 384 KiB. Likewise, VTune reports that worker_setup allocated 384 KiB.
  • We know that worker_task allocates the receive buffer for one row of A. Given that each row is a 1-by-16384 row vector, we can expect the 3 workers to allocate 3 × 16384 × 8 bytes = 384 KiB. Likewise, VTune reports that worker_task allocated 384 KiB.


Example Code

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>

// Better way is to query MPI_TAG_UB
const int TERM_TAG = 99999999;

void boss_setup(double **A, double **y, int n) {
    *A = malloc(n * n * sizeof(double));
    *y = malloc(n * sizeof(double));
    for (int i = 0; i < n * n; ++i) {
      (*A)[i] = i;
    }
}

void worker_setup(double **x, int n) {
    *x = malloc(n * sizeof(double));
    for (int i = 0; i < n; ++i) {
      (*x)[i] = i;
    }
}
void boss_task(const double* A, double* y, int n, int n_workers) {
  MPI_Status stat;
  // Boss sends the first nranks-1 rows.  Tag is the row index.
  for (int i = 0; i < n_workers; ++i) {
    MPI_Send(&A[i * n], n, MPI_DOUBLE, i, i, MPI_COMM_WORLD);
  }

  int next_row = n_workers;
  for (int i = 0; i < n; ++i) {
    // Boss receives a dot product from any worker. Tag is the row that the worker
    // computed.
    double dot;
    MPI_Recv(&dot, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &stat);
    y[stat.MPI_TAG] = dot;

    // If there are any rows left, boss sends a row back to the worker.  Tag is row index.
    if (next_row < n) {
      MPI_Send(&A[next_row * n], n, MPI_DOUBLE, stat.MPI_SOURCE, next_row, MPI_COMM_WORLD);
      ++next_row;
    }
    // If no rows left, boss sends the termination tag (instead of row index)
    else {
      MPI_Send(NULL, 0, MPI_DOUBLE, stat.MPI_SOURCE, TERM_TAG, MPI_COMM_WORLD);
    }
  }
}
void worker_task(const double* x, int n, int boss_rank) {
  MPI_Status stat;
  double* row = malloc(n * sizeof(double));

  while (1) {
    // Worker receives rows until it gets a term tag
    MPI_Recv(row, n, MPI_DOUBLE, boss_rank, MPI_ANY_TAG, MPI_COMM_WORLD, &stat);
    if (stat.MPI_TAG == TERM_TAG)
      break;

    // The worker computes the dot product and sends it back to the boss.
    // The tag from the boss was the row idx, but the worker doesn't need it, so the
    // worker sends it back to the boss
    double dot = 0;
    for (int i = 0; i < n; ++i) {
      dot += row[i] * x[i];
    }
    MPI_Send(&dot, 1, MPI_DOUBLE, boss_rank, stat.MPI_TAG, MPI_COMM_WORLD);
  }
  free(row);
}
int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);

  int n_ranks, rank;
  MPI_Comm_size(MPI_COMM_WORLD, &n_ranks);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  int boss_rank = n_ranks - 1;

  // n is size of matrix and arrays
  // TODO: Check argc and assert n >= n_ranks - 1
  int n = atoi(argv[1]);

  // Allocate matrix and arrays, then run the boss or worker task
  double *A, *x, *y;
  if (rank == boss_rank) {
    boss_setup(&A, &y, n);
    boss_task(A, y, n, n_ranks - 1);
  } else {
    worker_setup(&x, n);
    worker_task(x, n, boss_rank);
  }

  if (rank == boss_rank) {
    for (int i = 0; i < n; ++i) {
      printf("%0.0f ", y[i]);
    }
    printf("\n");
  }

  if (rank == boss_rank) {
    free(A);
    free(y);
  } else {
    free(x);
  }

  MPI_Finalize();
  return 0;
}