Guide – How to: gather performance data with the ESP Monitors API

Latest update: 2023-03-09

Each ESP tile contains a set of distributed performance monitors (i.e. counters) that can be used to measure the occurrences of particular phenomena, such as cache hits/misses, backpressure on the NoC, off-chip memory accesses, accelerator execution cycles, and more. This guide illustrates how to use ESP’s Monitors API to easily access these performance monitors from software running on the SoC itself.

1. Available Performance Monitors
2. ESP Monitors API
3. Example Code

1. Available Performance Monitors

There are currently 59 32-bit performance monitors implemented in each tile. All of these monitors are always instantiated in all tiles, but certain monitor types are only well-defined for certain types of tiles; reading monitors that are not well-defined in a particular type of tile is possible, but will always return a value of 0. Below is a list of all of the current performance monitors along with information on the tiles where they are applicable and their corresponding index in the tile.

DDR Accesses (0) - Number of off-chip memory requests. Valid in every memory tile.

Coherence Traffic - Broken down into each type of coherence message served by the last-level cache (LLC) listed below. Valid in memory tiles when the cache hierarchy is enabled.

Coherence requests received (1)
Coherence forwards sent (2)
Coherence responses received (3)
Coherence responses sent (4)
DMA requests received (5)
DMA responses sent (6)
Coherent-DMA requests received (7)
Coherent-DMA responses sent (8)

L2 Cache hits (9) and misses (10) - Valid in the CPU tile when the cache hierarchy is enabled.

LLC hits (11) and misses (12) - Valid in the memory tile when the cache hierarchy is enabled.

Accelerator Statistics - Valid in all accelerator tiles.

Accelerator TLB loading cycles (13) - only valid for accelerators designed with ESP.
Accelerator communication cycles (14 - LSBs and 15 - MSBs) - cycles spent issuing memory requests or awaiting responses. Split into 2 32-bit registers.
Accelerator total execution cycles (16 - LSBs and 17 - MSBs) - split into 2 32-bit registers.
Accelerator Invocations (18).

DVFS Operating points (19 - 22) - shows usage of each V/F operating point. See Mantovani, DAC’16 for more information. Valid in CPU and accelerator tiles when DVFS emulation is enabled.

NoC injections (23-28, increasing in order by plane number) - one register for each NoC plane (by default, ESP architectures feature 6 NoC planes). Valid in all tiles.

NoC backpressure (29-58, increasing first by direction (Local, East, West, South, North) and then by plane number) - monitors each cycle where a NoC queue in the tile is full. One register for each direction/plane combination (i.e. 30 registers in total per tile). Valid in all tiles.
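The index ranges for the per-plane NoC monitors can be turned into small helper functions. The following is a minimal sketch, not code from the ESP release: the function names and the noc_dir_t enum are hypothetical, the plane and queue counts are the defaults stated above, and the backpressure layout assumes that direction varies fastest within each plane (one reading of the ordering described above, and consistent with the [NOC_PLANES][NOC_QUEUES] layout of noc_queue_full in Section 2). Verify the ordering against the ESP headers before relying on it.

```c
#include <stdint.h>

/* Base indices and counts taken from the list above (default configuration). */
#define NOC_INJECT_BASE     23
#define NOC_QUEUE_FULL_BASE 29
#define NOC_PLANES          6  /* default number of NoC planes */
#define NOC_QUEUES          5  /* Local, East, West, South, North */

/* Hypothetical enum (not from the ESP headers); order follows the list above. */
typedef enum { DIR_LOCAL = 0, DIR_EAST, DIR_WEST, DIR_SOUTH, DIR_NORTH } noc_dir_t;

/* Monitor index of the injection counter for a plane, or -1 if out of range. */
int noc_inject_index(unsigned plane)
{
    return plane < NOC_PLANES ? (int)(NOC_INJECT_BASE + plane) : -1;
}

/* Monitor index of a backpressure counter, or -1 if out of range.
 * ASSUMPTION: direction varies fastest within each plane. */
int noc_queue_full_index(unsigned plane, noc_dir_t dir)
{
    if (plane >= NOC_PLANES || (unsigned)dir >= NOC_QUEUES)
        return -1;
    return (int)(NOC_QUEUE_FULL_BASE + plane * NOC_QUEUES + (unsigned)dir);
}
```

Under this assumption, the first backpressure monitor (plane 0, Local) has index 29 and the last (plane 5, North) has index 58, matching the range given above.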

2. ESP Monitors API

The ESP Monitors API provides a simple way to access the aforementioned performance monitors from software running on an ESP-based SoC. In particular, we envision using these monitors in an adaptive manner, such that applications can tune certain parameters (e.g. utilized tiles, coherence modes, data allocation policies, etc.) based on the dynamic utilization of the system. For one example of the usage of the ESP performance monitors, please see Zuckerman, MICRO’21.

esp_monitor

The main function call in the Monitors API is esp_monitor, which samples a subset of the monitors throughout the SoC as specified by an esp_monitor_args_t argument that is passed to the function call. The values are returned through an esp_monitor_vals_t structure that is passed by reference. The definitions of esp_monitor, esp_monitor_args_t, esp_monitor_vals_t, and all dependent types are shown below.

unsigned int esp_monitor(esp_monitor_args_t args, esp_monitor_vals_t *vals);

typedef struct esp_monitor_args {
        esp_monitor_mode_t read_mode;
        uint16_t read_mask;
        uint8_t tile_index;
        uint8_t acc_index;
        uint8_t mon_index;
        uint8_t noc_index;
} esp_monitor_args_t;

typedef struct esp_monitor_vals {
        unsigned int ddr_accesses[SOC_NMEM];
        esp_mem_reqs_t mem_reqs[SOC_NMEM];
        esp_cache_stats_t l2_stats[SOC_NTILES];
        esp_cache_stats_t llc_stats[SOC_NMEM];
        esp_acc_stats_t acc_stats[SOC_NACC];
        unsigned int dvfs_op[SOC_NTILES][DVFS_OP_POINTS];
        unsigned int noc_injects[SOC_NTILES][NOC_PLANES];
        unsigned int noc_queue_full[SOC_NTILES][NOC_PLANES][NOC_QUEUES];
} esp_monitor_vals_t;

typedef struct esp_mem_reqs {
        unsigned int coh_reqs;
        unsigned int coh_fwds;
        unsigned int coh_rsps_rcv;
        unsigned int coh_rsps_snd;
        unsigned int dma_reqs;
        unsigned int dma_rsps;
        unsigned int coh_dma_reqs;
        unsigned int coh_dma_rsps;
} esp_mem_reqs_t;

typedef struct esp_cache_stats {
        unsigned int hits;
        unsigned int misses;
} esp_cache_stats_t;

typedef struct acc_stats_t {
        unsigned int acc_tlb;
        unsigned int acc_mem_lo;
        unsigned int acc_mem_hi;
        unsigned int acc_tot_lo;
        unsigned int acc_tot_hi;
        unsigned int acc_invocations;
} esp_acc_stats_t;
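The two accelerator cycle counters are each split across a pair of 32-bit fields (acc_mem_lo/acc_mem_hi and acc_tot_lo/acc_tot_hi), so they have to be joined before use. The following is a minimal sketch of one way to do that. The struct is a local stand-in that mirrors the fields of esp_acc_stats_t so the snippet compiles on its own, and the helper names are hypothetical, not part of the Monitors API.

```c
#include <stdint.h>

/* Local stand-in mirroring the fields of esp_acc_stats_t shown above. */
typedef struct {
    unsigned int acc_tlb;
    unsigned int acc_mem_lo;
    unsigned int acc_mem_hi;
    unsigned int acc_tot_lo;
    unsigned int acc_tot_hi;
    unsigned int acc_invocations;
} acc_stats_sketch_t;

/* Join the LSB and MSB halves of a split counter into one 64-bit value. */
uint64_t join_counter(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Total execution cycles (monitors 16 and 17). */
uint64_t acc_total_cycles(const acc_stats_sketch_t *s)
{
    return join_counter(s->acc_tot_lo, s->acc_tot_hi);
}

/* Communication cycles (monitors 14 and 15). */
uint64_t acc_comm_cycles(const acc_stats_sketch_t *s)
{
    return join_counter(s->acc_mem_lo, s->acc_mem_hi);
}
```

The cast to uint64_t before the shift matters: shifting a 32-bit value left by 32 bits is undefined behavior in C.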

Since esp_monitor samples the current values of the selected performance monitors, it should be called twice, once before and once after a period of interest, to determine how many times each event occurred between the two calls. Both calls to esp_monitor should use the same configuration of the args variable.

The first value in the esp_monitor_args_t struct is the read_mode, which can take the values ESP_MON_READ_ALL, ESP_MON_READ_SINGLE, or ESP_MON_READ_MANY. The remaining arguments are used depending on the selected mode.

ESP_MON_READ_ALL

This mode reads all of the performance monitors throughout the SoC, so it does not require any additional arguments. Values are returned through the esp_monitor_vals_t structure passed by reference.

ESP_MON_READ_SINGLE

This mode reads a single specified performance monitor and returns it as the return value from the function. Two additional fields of the esp_monitor_args_t structure must be specified. First, the tile_index should be set to the index of the tile that contains the target performance monitor. Second, the mon_index should be set to the index of the desired monitor (a value between 0 and 58, since the 59 monitors in each tile are indexed from 0). The indices of each monitor in the tile are listed in Section 1.

ESP_MON_READ_MANY

This mode reads a subset of the performance monitors as specified by the read_mask in the esp_monitor_args_t structure. There are nine different sets of monitors that can be read by enabling their corresponding bit in the read_mask. A description of each set and its position in the bitmask are listed below.

ESP_MON_READ_DDR_ACCESSES (0) - reads the DDR access monitor in all memory tiles.
ESP_MON_READ_MEM_REQS (1) - reads the coherence traffic monitors in all memory tiles.
ESP_MON_READ_L2_STATS (2) - reads the L2 hit and miss counts in all CPU tiles and all accelerator tiles with an L2 cache enabled.
ESP_MON_READ_LLC_STATS (3) - reads the LLC hit and miss counts in all memory tiles.
ESP_MON_READ_ACC_STATS (4) - reads the accelerator performance monitors in the accelerator specified by acc_index (accelerator index, not tile index).
ESP_MON_READ_DVFS_OP (5) - reads the DVFS operating points monitors.
ESP_MON_READ_NOC_INJECTS (6) - reads the number of NoC injections on all NoC planes from the tile specified by tile_index.
ESP_MON_READ_NOC_QUEUE_FULL_TILE (7) - reads all NoC backpressure monitors (all directions, all planes) from the tile specified by tile_index.
ESP_MON_READ_NOC_QUEUE_FULL_PLANE (8) - reads all NoC backpressure monitors (all directions, all tiles) from the plane specified by noc_index.
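Since several of these sets depend on an extra field of esp_monitor_args_t, it can help to check a mask before passing it to esp_monitor. The following is a minimal sketch under stated assumptions: the enum is a local stand-in that reuses the names and bit positions from the list above so the snippet compiles on its own (drop it when including the ESP headers), and MON_BIT and the helper functions are hypothetical, not part of the Monitors API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-in: bit positions copied from the list above. */
enum {
    ESP_MON_READ_DDR_ACCESSES = 0,
    ESP_MON_READ_MEM_REQS = 1,
    ESP_MON_READ_L2_STATS = 2,
    ESP_MON_READ_LLC_STATS = 3,
    ESP_MON_READ_ACC_STATS = 4,
    ESP_MON_READ_DVFS_OP = 5,
    ESP_MON_READ_NOC_INJECTS = 6,
    ESP_MON_READ_NOC_QUEUE_FULL_TILE = 7,
    ESP_MON_READ_NOC_QUEUE_FULL_PLANE = 8
};

/* Hypothetical helper: bit position -> read_mask bit (read_mask is a uint16_t). */
#define MON_BIT(pos) ((uint16_t)(1u << (pos)))

/* True if the mask selects a set that reads acc_index. */
bool mask_needs_acc_index(uint16_t mask)
{
    return (mask & MON_BIT(ESP_MON_READ_ACC_STATS)) != 0;
}

/* True if the mask selects a set that reads tile_index. */
bool mask_needs_tile_index(uint16_t mask)
{
    return (mask & (MON_BIT(ESP_MON_READ_NOC_INJECTS) |
                    MON_BIT(ESP_MON_READ_NOC_QUEUE_FULL_TILE))) != 0;
}

/* True if the mask selects a set that reads noc_index. */
bool mask_needs_noc_index(uint16_t mask)
{
    return (mask & MON_BIT(ESP_MON_READ_NOC_QUEUE_FULL_PLANE)) != 0;
}

/* Example mask: DDR accesses, L2 and LLC stats, one accelerator,
 * NoC injections for one tile, and backpressure on one plane. */
uint16_t example_mask(void)
{
    uint16_t mask = 0;
    mask |= MON_BIT(ESP_MON_READ_DDR_ACCESSES);
    mask |= MON_BIT(ESP_MON_READ_L2_STATS);
    mask |= MON_BIT(ESP_MON_READ_LLC_STATS);
    mask |= MON_BIT(ESP_MON_READ_ACC_STATS);
    mask |= MON_BIT(ESP_MON_READ_NOC_INJECTS);
    mask |= MON_BIT(ESP_MON_READ_NOC_QUEUE_FULL_PLANE);
    return mask;
}
```

For the example mask, all three helpers return true, so acc_index, tile_index, and noc_index would each need to be set before the call.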

Note: Calls to esp_monitor are not intended to give exact values for performance monitors during a defined period of software execution. There is an inherent delay in accessing these monitors, which may cause some slight inaccuracy in the values obtained; for example, a monitor may continue to increment while a read message intended for it is traversing the NoC. To mitigate this problem when accessing multiple monitors, modes that access multiple monitors (e.g. ESP_MON_READ_ALL and ESP_MON_READ_MANY) sample the current values of all monitors in a single tile before attempting to read them. This effectively synchronizes the values obtained from reading multiple monitors, such that the values all correspond to the same point in time. However, this sampling is done at the granularity of a tile and is serialized when multiple tiles are the target of an API call. Hence, values obtained from accesses to monitors in different tiles may not be synchronized to the same exact point in time. These sources of imprecision should be relatively small and can be further mitigated by using longer periods of sampling.

esp_monitor_diff

esp_monitor_vals_t esp_monitor_diff(esp_monitor_vals_t vals_start, esp_monitor_vals_t vals_end);

esp_monitor_diff takes two esp_monitor_vals_t structures and returns a new esp_monitor_vals_t that contains the difference between them, accounting for any potential overflow in each monitor. Use it to determine how many times each event occurred between successive calls to esp_monitor. Note that the result is only guaranteed to be correct if each monitor has incremented fewer than 2³² (or 2⁶⁴ for the two 64-bit monitors) times between the two calls, so the calls should be close enough together in time to prevent this from happening. When using the ESP_MON_READ_SINGLE mode, use the sub_monitor_vals function instead, which simply subtracts two 32-bit integers, accounting for overflow.

uint32_t sub_monitor_vals (uint32_t val_start, uint32_t val_end);
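To illustrate what "accounting for overflow" means for a free-running counter, the following is a minimal sketch of wraparound-aware subtraction. It is not the library's implementation of sub_monitor_vals or esp_monitor_diff; the function names are hypothetical. It relies on the fact that unsigned arithmetic in C is modular, so end minus start yields the correct count as long as the counter wrapped at most once and advanced fewer than 2³² (or 2⁶⁴) times.

```c
#include <stdint.h>

/* Events counted between two samples of a 32-bit counter (modulo 2^32). */
uint32_t wrap_sub32(uint32_t val_start, uint32_t val_end)
{
    return (uint32_t)(val_end - val_start);
}

/* Same idea for a counter split into LSB and MSB 32-bit registers:
 * join each sample into 64 bits first, then subtract modulo 2^64. */
uint64_t wrap_sub64(uint32_t start_lo, uint32_t start_hi,
                    uint32_t end_lo, uint32_t end_hi)
{
    uint64_t start = ((uint64_t)start_hi << 32) | start_lo;
    uint64_t end = ((uint64_t)end_hi << 32) | end_lo;
    return end - start;
}
```

For example, a counter sampled at 0xFFFFFFF0 and then at 0x00000010 has wrapped once, and the modular difference is 0x20 events. For the split counters, subtracting the two halves independently would lose the borrow from the low word, which is why the halves are joined first.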

esp_monitor_print

// Baremetal
void esp_monitor_print(esp_monitor_args_t args, esp_monitor_vals_t vals);
// Linux
void esp_monitor_print(esp_monitor_args_t args, esp_monitor_vals_t vals, FILE *fp);

esp_monitor_print prints the formatted contents of an esp_monitor_vals_t structure to the console for baremetal or to a specified file within Linux. The esp_monitor_args_t passed to the corresponding esp_monitor calls should also be passed to this function in order to only print the relevant values.

esp_monitor_vals_alloc

esp_monitor_vals_t* esp_monitor_vals_alloc();

Only available when running in Linux. esp_monitor_vals_alloc returns a pointer to a new esp_monitor_vals_t structure in case the programmer would like to allocate these structures dynamically. While this function is optional, and malloc can be used instead, its use offers the advantage that the allocated structures are automatically cleaned up when esp_monitor_free is called.

esp_monitor_free

void esp_monitor_free();

Only available when running in Linux, and should be called once after all calls to esp_monitor. In Linux, esp_monitor requires that the memory addresses of the performance monitors are mapped with mmap – this is automatically done by the first call to esp_monitor. esp_monitor_free unmaps the region that was mapped with mmap. It also cleans up any esp_monitor_vals_t structures allocated by esp_monitor_vals_alloc.

3. Example Code

Here, we present three example uses of the ESP monitors API, where each example leverages a different mode of the API. This code can be found in the ESP release in soft/common/apps/examples/multifft_mon/multifft_mon.c. This example is the same as the multifft example application but also makes use of the Monitors API.

Example 1

	//ESP MONITORS: EXAMPLE #1
	//read a single monitor from the tile number and monitor offset

	//statically declare monitor arg structure
	esp_monitor_args_t mon_args;

	//set up argument structure using READ_SINGLE mode
	//the off-chip memory accesses are read from the memory tile at the DDR_WORD_TRANSFER monitor
	const int MEM_TILE_IDX = 0;
	mon_args.read_mode = ESP_MON_READ_SINGLE;
	mon_args.tile_index = MEM_TILE_IDX;
	mon_args.mon_index = MON_DDR_WORD_TRANSFER_INDEX;

	//in the READ_SINGLE mode, the monitor value is returned directly
	//read before and after
	unsigned int ddr_accesses_start, ddr_accesses_end;
	ddr_accesses_start = esp_monitor(mon_args, NULL);
	esp_run(cfg_nc, 1);
	ddr_accesses_end = esp_monitor(mon_args, NULL);

	printf("\n	** DONE **\n");

	//calculate difference, accounting for overflow
	unsigned int ddr_accesses_diff;
	ddr_accesses_diff = sub_monitor_vals(ddr_accesses_start, ddr_accesses_end);
	printf("\tOff-chip memory accesses: %u\n", ddr_accesses_diff);

This example uses the ESP_MON_READ_SINGLE mode to read the DDR accesses monitor from the memory tile before and after an accelerator invocation. Hence, we specify the tile index of the memory tile and the index of the DDR access monitor. In this mode, the function returns the current value of the monitor, so the caller can pass NULL for the esp_monitor_vals_t structure, since it is not used. Finally, we use sub_monitor_vals to determine the difference between the two values and use a regular printf to display the result.

Example 2

	//ESP MONITORS: EXAMPLE #2
	//read all monitors on the SoC

	//statically declare monitor vals structures
	esp_monitor_vals_t vals_start, vals_end;

	//set read_mode to ALL
	mon_args.read_mode = ESP_MON_READ_ALL;

	cfg_llc[0].hw_buf = buf[0];

	printf("\n	** DONE **\n");

	//values written into vals struct argument
	esp_monitor(mon_args, &vals_start);
	esp_run(cfg_llc, 1);
	esp_monitor(mon_args, &vals_end);

	//calculate difference of all values
	esp_monitor_vals_t vals_diff;
	vals_diff = esp_monitor_diff(vals_start, vals_end);

	FILE *fp = fopen("multifft_esp_mon_all.txt", "w");
	esp_monitor_print(mon_args, vals_diff, fp);
	fclose(fp);

This example uses the ESP_MON_READ_ALL mode to read all of the monitors throughout the SoC. This mode does not require any additional arguments. In this case, the esp_monitor_diff function is used to compute the difference of all the values, and esp_monitor_print is used to write the differences to a file.
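Once the differences are available, the hit and miss counts can be post-processed directly in software instead of being read back from the output file. The following is a minimal sketch of computing a hit ratio from one cache's hits and misses. The struct is a local stand-in with the same fields as esp_cache_stats_t so that the snippet compiles on its own, and cache_hit_ratio is a hypothetical helper, not part of the Monitors API.

```c
#include <stdint.h>

/* Local stand-in with the same fields as esp_cache_stats_t. */
typedef struct {
    unsigned int hits;
    unsigned int misses;
} cache_stats_sketch_t;

/* Fraction of accesses that hit, or 0.0 if no accesses were recorded. */
double cache_hit_ratio(cache_stats_sketch_t s)
{
    /* Sum in 64 bits so hits + misses cannot overflow. */
    uint64_t total = (uint64_t)s.hits + s.misses;
    return total ? (double)s.hits / (double)total : 0.0;
}
```

With the real types, the same calculation would be applied to an entry of the difference structure, for example vals_diff.llc_stats[0] for the first memory tile.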

Example 3

	//ESP MONITORS: EXAMPLE #3
	//read a specified subset of the monitors on the SoC

	//dynamically allocate monitor vals structures
	esp_monitor_vals_t *vals_start_ptr = esp_monitor_vals_alloc();
	esp_monitor_vals_t *vals_end_ptr = esp_monitor_vals_alloc();

	//set read_mode to MANY
	mon_args.read_mode = ESP_MON_READ_MANY;
	mon_args.read_mask = 0;

	//enable reading memory accesses
	mon_args.read_mask |= 1 << ESP_MON_READ_DDR_ACCESSES;

	//enable reading L2 statistics
	mon_args.read_mask |= 1 << ESP_MON_READ_L2_STATS;

	//enable reading LLC statistics
	mon_args.read_mask |= 1 << ESP_MON_READ_LLC_STATS;

	//enable reading acc statistics - requires the index of the accelerator
	mon_args.acc_index = 0;
	mon_args.read_mask |= 1 << ESP_MON_READ_ACC_STATS;

	//enable reading noc injections - requires the index of the tile
	const int ACC_TILE_INDEX = 0;
	mon_args.tile_index = ACC_TILE_INDEX;
	mon_args.read_mask |= 1 << ESP_MON_READ_NOC_INJECTS;

	//enable reading noc backpressure on a plane - requires the index of the noc plane
	const int NOC_PLANE = 0;
	mon_args.noc_index = NOC_PLANE;
	mon_args.read_mask |= 1 << ESP_MON_READ_NOC_QUEUE_FULL_PLANE;

	cfg_fc[0].hw_buf = buf[0];

	//values written into vals struct argument
	esp_monitor(mon_args, vals_start_ptr);
	esp_run(cfg_fc, 1);
	esp_monitor(mon_args, vals_end_ptr);

	printf("\n	** DONE **\n");

	//calculate difference of all values
	vals_diff = esp_monitor_diff(*vals_start_ptr, *vals_end_ptr);

	//write results to file
	fp = fopen("multifft_esp_mon_many.txt", "w");
	esp_monitor_print(mon_args, vals_diff, fp);
	fclose(fp);

	//when done with monitors, free all allocated structures, and unmap the address space
	esp_monitor_free();

This final example uses the ESP_MON_READ_MANY mode to read a specified subset of the performance monitors of the SoC. In this case, the read_mask is configured to read the DDR accesses, L2 hits and misses, LLC hits and misses, accelerator stats from a particular accelerator, NoC injections from a particular tile, and NoC backpressure on a particular plane. This example also makes use of esp_monitor_vals_alloc to dynamically allocate the esp_monitor_vals_t structure. Because this example is the last use of esp_monitor in this application, esp_monitor_free is called to unmap the address space associated with the performance monitors and free the allocated esp_monitor_vals_t structures.