Guide – How to: gather performance data with the ESP Monitors API
Latest update: 2023-03-09
Each ESP tile contains a set of distributed performance monitors (i.e. counters) that can be used to measure the occurrences of particular phenomena, such as cache hits/misses, backpressure on the NoC, off-chip memory accesses, accelerator execution cycles, and more. This guide illustrates how to use ESP’s Monitors API to easily access these performance monitors from software running on the SoC itself.
1. Available Performance Monitors
There are currently 59 32-bit performance monitors implemented in each tile. All of these monitors are always instantiated in all tiles, but certain monitor types are only well-defined for certain types of tiles; reading monitors that are not well-defined in a particular type of tile is possible, but will always return a value of 0. Below is a list of all of the current performance monitors along with information on the tiles where they are applicable and their corresponding index in the tile.
DDR Accesses (0) - Number of off-chip memory requests. Valid in every memory tile.
Coherence Traffic - Broken down into each type of coherence message served by the last-level cache (LLC) listed below. Valid in memory tiles when the cache hierarchy is enabled.
- Coherence requests received (1)
- Coherence forwards sent (2)
- Coherence responses received (3)
- Coherence responses sent (4)
- DMA requests received (5)
- DMA responses sent (6)
- Coherent-DMA requests received (7)
- Coherent-DMA responses sent (8)
L2 Cache hits (9) and misses (10) - Valid in the CPU tile when the cache hierarchy is enabled.
LLC hits (11) and misses (12) - Valid in the memory tile when the cache hierarchy is enabled.
Accelerator Statistics - Valid in all accelerator tiles.
- Accelerator TLB loading cycles (13) - only valid for accelerators designed with ESP.
- Accelerator communication cycles (14 - LSBs and 15 - MSBs) - cycles spent issuing memory requests or awaiting responses. Split into 2 32-bit registers.
- Accelerator total execution cycles (16 - LSBs and 17 - MSBs) - split into 2 32-bit registers.
- Accelerator Invocations (18).
DVFS Operating points (19 - 22) - shows usage of each V/F operating point. See Mantovani, DAC’16 for more information. Valid in CPU and accelerator tiles when DVFS emulation is enabled.
NoC injections (23-28, increasing in order by plane number) - one register for each NoC plane (by default, ESP architectures feature 6 NoC planes). Valid in all tiles.
NoC backpressure (29-58, increasing first by direction (Local, East, West, South, North) and then by plane number) - monitors each cycle where a NoC queue in the tile is full. One register for each direction/plane combination (i.e. 30 registers in total per tile). Valid in all tiles.
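The index ranges in the list above can be turned into small helper functions, which is convenient when using the ESP_MON_READ_SINGLE mode. The following is a minimal sketch, not part of the ESP API: all names are hypothetical, the base indices and counts are copied from the list, and the ordering of the backpressure registers (direction varying fastest within each plane) is an assumption that should be checked against the ESP sources.

```c
#include <assert.h>

/* Values copied from the monitor list above; the names are hypothetical. */
enum {
    SKETCH_NOC_PLANES     = 6,  /* default number of NoC planes */
    SKETCH_NOC_DIRECTIONS = 5,  /* Local, East, West, South, North */
    SKETCH_NOC_INJECT_BASE    = 23, /* first NoC injection monitor */
    SKETCH_NOC_BACKPRESS_BASE = 29  /* first NoC backpressure monitor */
};

/* Directions in the order given in the list above. */
typedef enum { DIR_LOCAL, DIR_EAST, DIR_WEST, DIR_SOUTH, DIR_NORTH } noc_dir_t;

/* Index of the injection monitor for a given plane (23..28). */
static inline unsigned noc_inject_index(unsigned plane)
{
    assert(plane < SKETCH_NOC_PLANES);
    return SKETCH_NOC_INJECT_BASE + plane;
}

/* Index of the backpressure monitor for a plane/direction pair (29..58).
 * ASSUMPTION: the five directions of plane 0 come first, then plane 1, etc. */
static inline unsigned noc_backpressure_index(unsigned plane, noc_dir_t dir)
{
    assert(plane < SKETCH_NOC_PLANES);
    return SKETCH_NOC_BACKPRESS_BASE + plane * SKETCH_NOC_DIRECTIONS + (unsigned)dir;
}
```

Under that assumption, noc_backpressure_index(0, DIR_LOCAL) is the first backpressure register and noc_backpressure_index(5, DIR_NORTH) is the last one.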
2. ESP Monitors API
The ESP Monitors API provides a simple way to access the aforementioned performance monitors from software running on an ESP-based SoC. In particular, we envision using these monitors in an adaptive manner, such that applications can tune certain parameters (e.g. utilized tiles, coherence modes, data allocation policies, etc.) based on the dynamic utilization of the system. For one example of the usage of the ESP performance monitors, please see Zuckerman, MICRO’21.
esp_monitor
The main function call in the Monitors API is esp_monitor, which samples a subset of the monitors throughout the SoC as specified by an esp_monitor_args_t argument that is passed to the function call. The values are returned through an esp_monitor_vals_t structure that is passed by reference. The definitions of esp_monitor, esp_monitor_args_t, esp_monitor_vals_t, and all dependent types are shown below.
unsigned int esp_monitor(esp_monitor_args_t args, esp_monitor_vals_t *vals);

typedef struct esp_monitor_args {
    esp_monitor_mode_t read_mode;
    uint16_t read_mask;
    uint8_t tile_index;
    uint8_t acc_index;
    uint8_t mon_index;
    uint8_t noc_index;
} esp_monitor_args_t;

typedef struct esp_monitor_vals {
    unsigned int ddr_accesses[SOC_NMEM];
    esp_mem_reqs_t mem_reqs[SOC_NMEM];
    esp_cache_stats_t l2_stats[SOC_NTILES];
    esp_cache_stats_t llc_stats[SOC_NMEM];
    esp_acc_stats_t acc_stats[SOC_NACC];
    unsigned int dvfs_op[SOC_NTILES][DVFS_OP_POINTS];
    unsigned int noc_injects[SOC_NTILES][NOC_PLANES];
    unsigned int noc_queue_full[SOC_NTILES][NOC_PLANES][NOC_QUEUES];
} esp_monitor_vals_t;

typedef struct esp_mem_reqs {
    unsigned int coh_reqs;
    unsigned int coh_fwds;
    unsigned int coh_rsps_rcv;
    unsigned int coh_rsps_snd;
    unsigned int dma_reqs;
    unsigned int dma_rsps;
    unsigned int coh_dma_reqs;
    unsigned int coh_dma_rsps;
} esp_mem_reqs_t;

typedef struct esp_cache_stats {
    unsigned int hits;
    unsigned int misses;
} esp_cache_stats_t;

typedef struct acc_stats_t {
    unsigned int acc_tlb;
    unsigned int acc_mem_lo;
    unsigned int acc_mem_hi;
    unsigned int acc_tot_lo;
    unsigned int acc_tot_hi;
    unsigned int acc_invocations;
} esp_acc_stats_t;
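The two 64-bit accelerator counters are exposed as separate low and high 32-bit fields in esp_acc_stats_t, so software has to reassemble them before use. A minimal sketch of that step follows. The struct acc_stats_sketch_t is a local stand-in that mirrors the fields of esp_acc_stats_t so the snippet compiles on its own, and the helper names are hypothetical rather than part of the ESP API.

```c
#include <stdint.h>

/* Local stand-in mirroring the fields of esp_acc_stats_t shown above.
 * With the real ESP header included, use esp_acc_stats_t instead. */
typedef struct {
    unsigned int acc_tlb;
    unsigned int acc_mem_lo;
    unsigned int acc_mem_hi;
    unsigned int acc_tot_lo;
    unsigned int acc_tot_hi;
    unsigned int acc_invocations;
} acc_stats_sketch_t;

/* Join a low/high pair of 32-bit register values into one 64-bit count. */
static inline uint64_t combine_lo_hi(unsigned int lo, unsigned int hi)
{
    return ((uint64_t)(uint32_t)hi << 32) | (uint64_t)(uint32_t)lo;
}

/* Cycles spent issuing memory requests or awaiting responses. */
static inline uint64_t acc_comm_cycles(const acc_stats_sketch_t *s)
{
    return combine_lo_hi(s->acc_mem_lo, s->acc_mem_hi);
}

/* Total accelerator execution cycles. */
static inline uint64_t acc_total_cycles(const acc_stats_sketch_t *s)
{
    return combine_lo_hi(s->acc_tot_lo, s->acc_tot_hi);
}
```

Casting to uint64_t before shifting matters here: shifting a 32-bit value left by 32 bits is undefined behavior in C.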
Since esp_monitor samples the current values of the performance monitors, it should be called twice, once before and once after a period of interest, to determine how many times each phenomenon occurred between the two function calls. These two calls to esp_monitor should use the same configuration of the args variable.
The first value in the esp_monitor_args_t struct is the read_mode, which can take the values ESP_MON_READ_ALL, ESP_MON_READ_SINGLE, or ESP_MON_READ_MANY. The remaining arguments are used depending on the selected mode.
ESP_MON_READ_ALL
This mode reads all of the performance monitors throughout the SoC, so it does not require any additional arguments. Values are returned through the esp_monitor_vals_t structure passed by reference.
ESP_MON_READ_SINGLE
This mode reads a single specified performance monitor and returns it as the return value from the function. Two additional fields of the esp_monitor_args_t structure must be specified. First, the tile_index should be set to the index of the tile that contains the target performance monitor. Second, the mon_index should be set to the index of the desired monitor (a value between 0 and 58). The indices of each monitor in the tile are listed in Section 1.
ESP_MON_READ_MANY
This mode reads a subset of the performance monitors as specified by the read_mask in the esp_monitor_args_t structure. There are nine different sets of monitors that can be read by enabling their corresponding bit in the read_mask. A description of each set and its position in the bitmask are listed below.
- ESP_MON_READ_DDR_ACCESSES (0) - reads the DDR access monitor in all memory tiles.
- ESP_MON_READ_MEM_REQS (1) - reads the coherence traffic monitors in all memory tiles.
- ESP_MON_READ_L2_STATS (2) - reads the L2 hit and miss counts in all CPU tiles and all accelerator tiles with an L2 cache enabled.
- ESP_MON_READ_LLC_STATS (3) - reads the LLC hit and miss counts in all memory tiles.
- ESP_MON_READ_ACC_STATS (4) - reads the accelerator performance monitors in the accelerator specified by acc_index (accelerator index, not tile index).
- ESP_MON_READ_DVFS_OP (5) - reads the DVFS operating points monitors.
- ESP_MON_READ_NOC_INJECTS (6) - reads the number of NoC injections on all NoC planes from the tile specified by tile_index.
- ESP_MON_READ_NOC_QUEUE_FULL_TILE (7) - reads all NoC backpressure monitors (all directions, all planes) from the tile specified by tile_index.
- ESP_MON_READ_NOC_QUEUE_FULL_PLANE (8) - reads all NoC backpressure monitors (all directions, all tiles) from the plane specified by noc_index.
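The bit positions listed above can be combined into a read_mask with ordinary shift-and-OR operations. The sketch below restates the nine positions as an enum so that it compiles on its own; in real code the ESP header provides these constants and the enum should be omitted. The helpers mon_mask_set and mon_mask_has are hypothetical conveniences, not part of the ESP API.

```c
#include <stdint.h>

/* Bit positions copied from the list above. Omit this enum when the
 * ESP header that defines these constants is included. */
enum {
    ESP_MON_READ_DDR_ACCESSES         = 0,
    ESP_MON_READ_MEM_REQS             = 1,
    ESP_MON_READ_L2_STATS             = 2,
    ESP_MON_READ_LLC_STATS            = 3,
    ESP_MON_READ_ACC_STATS            = 4,
    ESP_MON_READ_DVFS_OP              = 5,
    ESP_MON_READ_NOC_INJECTS          = 6,
    ESP_MON_READ_NOC_QUEUE_FULL_TILE  = 7,
    ESP_MON_READ_NOC_QUEUE_FULL_PLANE = 8
};

/* Hypothetical helper: return mask with the given monitor set enabled. */
static inline uint16_t mon_mask_set(uint16_t mask, unsigned bit)
{
    return (uint16_t)(mask | (1u << bit));
}

/* Hypothetical helper: nonzero if the given monitor set is enabled. */
static inline int mon_mask_has(uint16_t mask, unsigned bit)
{
    return (int)((mask >> bit) & 1u);
}
```

Since the highest position is 8, all nine sets fit in the 16-bit read_mask field of esp_monitor_args_t.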
Note: Calls to esp_monitor are not intended to give exact values for performance monitors during a defined period of software execution. There is an inherent delay in accessing these monitors, which may cause some slight inaccuracy in the values obtained; for example, a monitor may continue to increment while a read message intended for it is traversing the NoC. To mitigate this problem when accessing multiple monitors, modes that access multiple monitors (e.g. ESP_MON_READ_ALL and ESP_MON_READ_MANY) sample the current values of all monitors in a single tile before attempting to read them. This effectively synchronizes the values obtained from reading multiple monitors, such that the values all correspond to the same point in time. However, this sampling is done at the granularity of a tile and is serialized when multiple tiles are the target of an API call. Hence, values obtained from accesses to monitors in different tiles may not be synchronized to the same exact point in time. These sources of imprecision should be relatively small and can be further mitigated by using longer periods of sampling.
esp_monitor_diff
esp_monitor_vals_t esp_monitor_diff(esp_monitor_vals_t vals_start, esp_monitor_vals_t vals_end);
esp_monitor_diff takes in two esp_monitor_vals_t structures and returns a new esp_monitor_vals_t that contains the difference between them, accounting for any potential overflow in each monitor. Hence, this should be used to determine the number of each phenomenon that has occurred between successive calls to esp_monitor. Note that the correctness of this function is only guaranteed if each monitor has incremented fewer than 2^32 (or 2^64 for the two 64-bit monitors) times between successive calls to esp_monitor. Hence, these two function calls should be placed sufficiently close together to prevent this from happening. When using the ESP_MON_READ_SINGLE mode, you should instead use the sub_monitor_vals function, which simply subtracts two 32-bit integers, accounting for overflow.
uint32_t sub_monitor_vals (uint32_t val_start, uint32_t val_end);
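To illustrate why a single counter wraparound does not corrupt the result, here is a minimal sketch of an overflow-aware subtraction. It is an illustration of the idea only, not the ESP implementation of sub_monitor_vals or esp_monitor_diff, and the names wrap_diff32 and wrap_diff64 are hypothetical. It relies on the fact that unsigned arithmetic in C is performed modulo 2^N, so end - start is correct as long as the counter wrapped at most once and advanced fewer than 2^N times.

```c
#include <stdint.h>

/* Difference of two samples of a free-running 32-bit counter.
 * Correct if the counter advanced fewer than 2^32 times in between. */
static inline uint32_t wrap_diff32(uint32_t start, uint32_t end)
{
    /* The cast keeps the result modulo 2^32 even if int is wider than 32 bits. */
    return (uint32_t)(end - start);
}

/* Same idea for a 64-bit counter assembled from two 32-bit registers. */
static inline uint64_t wrap_diff64(uint64_t start, uint64_t end)
{
    return end - start;
}
```

For example, a counter sampled at 0xFFFFFFF0 and later at 0x00000010 has advanced by 0x20, which is what wrap_diff32 returns.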
esp_monitor_print
#Baremetal
void esp_monitor_print(esp_monitor_args_t args, esp_monitor_vals_t vals);
#Linux
void esp_monitor_print(esp_monitor_args_t args, esp_monitor_vals_t vals, FILE *fp);
esp_monitor_print prints the formatted contents of an esp_monitor_vals_t structure to the console for baremetal or to a specified file within Linux. The esp_monitor_args_t passed to the corresponding esp_monitor calls should also be passed to this function in order to only print the relevant values.
esp_monitor_vals_alloc
esp_monitor_vals_t* esp_monitor_vals_alloc();
Only available when running in Linux. esp_monitor_vals_alloc returns a pointer to a new esp_monitor_vals_t structure in case the programmer would like to allocate these structures dynamically. While this function is optional, and malloc can be used instead, its use offers the advantage that the allocated structures are automatically cleaned up when esp_monitor_free is called.
esp_monitor_free
void esp_monitor_free();
Only available when running in Linux, and should be called once after all calls to esp_monitor. In Linux, esp_monitor requires that the memory addresses of the performance monitors are mapped with mmap; this is automatically done by the first call to esp_monitor. esp_monitor_free unmaps the region that was mapped with mmap. It also cleans up any esp_monitor_vals_t structures allocated by esp_monitor_vals_alloc.
3. Example Code
Here, we present three example uses of the ESP monitors API, where each example leverages a different mode of the API. This code can be found in the ESP release in soft/common/apps/examples/multifft_mon/multifft_mon.c. This example is the same as the multifft example application but also makes use of the Monitors API.
Example 1
//ESP MONITORS: EXAMPLE #1
//read a single monitor from the tile number and monitor offset
//statically declare monitor arg structure
esp_monitor_args_t mon_args;
//set up argument structure using READ_SINGLE mode
//the off-chip memory accesses are read from the memory tile at the DDR_WORD_TRANSFER monitor
const int MEM_TILE_IDX = 0;
mon_args.read_mode = ESP_MON_READ_SINGLE;
mon_args.tile_index = MEM_TILE_IDX;
mon_args.mon_index = MON_DDR_WORD_TRANSFER_INDEX;
//in the READ_SINGLE mode, the monitor value is returned directly
//read before and after
unsigned int ddr_accesses_start, ddr_accesses_end;
ddr_accesses_start = esp_monitor(mon_args, NULL);
esp_run(cfg_nc, 1);
ddr_accesses_end = esp_monitor(mon_args, NULL);
printf("\n ** DONE **\n");
//calculate difference, accounting for overflow
unsigned int ddr_accesses_diff;
ddr_accesses_diff = sub_monitor_vals(ddr_accesses_start, ddr_accesses_end);
printf("\tOff-chip memory accesses: %u\n", ddr_accesses_diff);
This example uses the ESP_MON_READ_SINGLE mode to read the DDR accesses monitor from the memory tile before and after an accelerator invocation. Hence, we specify the tile index of the memory tile and the index of the DDR access monitor. In this mode, the function returns the current value of the monitor, so the caller can pass NULL for the esp_monitor_vals_t structure, since it is not used. Finally, we use sub_monitor_vals to determine the difference between the two values and use a regular printf to display the result.
Example 2
//ESP MONITORS: EXAMPLE #2
//read all monitors on the SoC
//statically declare monitor vals structures
esp_monitor_vals_t vals_start, vals_end;
//set read_mode to ALL
mon_args.read_mode = ESP_MON_READ_ALL;
cfg_llc[0].hw_buf = buf[0];
printf("\n ** DONE **\n");
//values written into vals struct argument
esp_monitor(mon_args, &vals_start);
esp_run(cfg_llc, 1);
esp_monitor(mon_args, &vals_end);
//calculate difference of all values
esp_monitor_vals_t vals_diff;
vals_diff = esp_monitor_diff(vals_start, vals_end);
FILE *fp = fopen("multifft_esp_mon_all.txt", "w");
esp_monitor_print(mon_args, vals_diff, fp);
fclose(fp);
This example uses the ESP_MON_READ_ALL mode to read all of the monitors throughout the SoC. This mode does not require any additional arguments. In this case, the esp_monitor_diff function is used to compute the difference of all the values, and esp_monitor_print is used to write the differences to a file.
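Beyond printing, the differences can also be turned into derived metrics that an application might use to adapt its behavior. As one illustration, the sketch below computes a hit rate from a hits/misses pair such as an entry of llc_stats or l2_stats. The struct cache_stats_sketch_t is a local stand-in that mirrors esp_cache_stats_t so the snippet compiles on its own, and cache_hit_rate is a hypothetical helper, not part of the ESP API.

```c
/* Local stand-in mirroring esp_cache_stats_t. With the real ESP header
 * included, pass e.g. &vals_diff.llc_stats[0] to an equivalent helper. */
typedef struct {
    unsigned int hits;
    unsigned int misses;
} cache_stats_sketch_t;

/* Fraction of accesses that hit, in [0.0, 1.0].
 * Returns 0.0 when no accesses were recorded to avoid dividing by zero. */
static inline double cache_hit_rate(const cache_stats_sketch_t *s)
{
    /* Widen before adding so hits + misses cannot overflow 32 bits. */
    unsigned long long total = (unsigned long long)s->hits + s->misses;
    return total ? (double)s->hits / (double)total : 0.0;
}
```

Applying such a helper to the difference structure rather than to a raw sample gives the hit rate for the measured interval only.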
Example 3
//ESP MONITORS: EXAMPLE #3
//read a specified subset of the monitors on the SoC
//dynamically allocate monitor vals structures
esp_monitor_vals_t *vals_start_ptr = esp_monitor_vals_alloc();
esp_monitor_vals_t *vals_end_ptr = esp_monitor_vals_alloc();
//set read_mode to MANY
mon_args.read_mode = ESP_MON_READ_MANY;
mon_args.read_mask = 0;
//enable reading memory accesses
mon_args.read_mask |= 1 << ESP_MON_READ_DDR_ACCESSES;
//enable reading L2 statistics
mon_args.read_mask |= 1 << ESP_MON_READ_L2_STATS;
//enable reading LLC statistics
mon_args.read_mask |= 1 << ESP_MON_READ_LLC_STATS;
//enable reading acc statistics - requires the index of the accelerator
mon_args.acc_index = 0;
mon_args.read_mask |= 1 << ESP_MON_READ_ACC_STATS;
//enable reading noc injections - requires the index of the tile
const int ACC_TILE_INDEX = 0;
mon_args.tile_index = ACC_TILE_INDEX;
mon_args.read_mask |= 1 << ESP_MON_READ_NOC_INJECTS;
//enable reading noc backpressure on a plane - requires the index of the noc plane
const int NOC_PLANE = 0;
mon_args.noc_index = NOC_PLANE;
mon_args.read_mask |= 1 << ESP_MON_READ_NOC_QUEUE_FULL_PLANE;
cfg_fc[0].hw_buf = buf[0];
//values written into vals struct argument
esp_monitor(mon_args, vals_start_ptr);
esp_run(cfg_fc, 1);
esp_monitor(mon_args, vals_end_ptr);
printf("\n ** DONE **\n");
//calculate difference of all values
vals_diff = esp_monitor_diff(*vals_start_ptr, *vals_end_ptr);
//write results to file
fp = fopen("multifft_esp_mon_many.txt", "w");
esp_monitor_print(mon_args, vals_diff, fp);
fclose(fp);
//when done with monitors, free all allocated structures, and unmap the address space
esp_monitor_free();
This final example uses the ESP_MON_READ_MANY mode to read a specified subset of the performance monitors of the SoC. In this case, the read_mask is configured to read the DDR accesses, L2 hits and misses, LLC hits and misses, accelerator stats from a particular accelerator, NoC injections from a particular tile, and NoC backpressure on a particular plane. This example also makes use of esp_monitor_vals_alloc to dynamically allocate the esp_monitor_vals_t structures. Because this example is the last use of esp_monitor in this application, esp_monitor_free is called to unmap the address space associated with the performance monitors and free the allocated esp_monitor_vals_t structures.