Guide – How to: design an accelerator in SystemC (Cadence Stratus HLS)

Latest update: 2020-12-21

This guide illustrates how to create and integrate an accelerator with the ESP high-level synthesis (HLS) flow, using SystemC as the specification language and Cadence’s Stratus HLS to generate a corresponding RTL implementation.

1. Accelerator design
2. Accelerator integration

Make sure to complete the prerequisite tutorials before getting started with this one. This tutorial assumes that accelerator designers are familiar with the ESP infrastructure and know how to run basic make targets to create a simple instance of ESP, integrating just a single core.

In this guide and in the corresponding prebuilt material we integrate an accelerator that performs multiply and accumulate (MAC) on integer vectors of configurable length. Specifically, we implement the following small kernel of computation.

// MAC
int *data_in = new int[mac_vec * mac_len];
int *acc = new int[mac_vec];
for (int iterations = 0; iterations < mac_n; iterations++) {
	load_input_data(data_in);
	for (int j = 0; j < mac_vec; j++) {
		acc[j] = 0;
		for (int i = 0; i < mac_len; i += 2)
			acc[j] += data_in[j * mac_len + i] * data_in[j * mac_len + i + 1];
	}
	store_output_data(acc);
}
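
As a worked example of the kernel above, the following sketch computes one iteration of the MAC in plain C++. The function name mac_reference() is hypothetical and is not part of the ESP code base; the sketch assumes that mac_len is even and that the mac_vec vectors are stored back to back in a flat array.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Software reference for one iteration of the MAC kernel.
// data_in holds mac_vec vectors of mac_len integers, stored back to back.
// Assumption: mac_len is even, because elements are consumed in pairs.
std::vector<int32_t> mac_reference(const std::vector<int32_t> &data_in,
                                   std::size_t mac_vec, std::size_t mac_len)
{
    std::vector<int32_t> acc(mac_vec, 0);
    for (std::size_t j = 0; j < mac_vec; j++)
        for (std::size_t i = 0; i + 1 < mac_len; i += 2)
            acc[j] += data_in[j * mac_len + i] * data_in[j * mac_len + i + 1];
    return acc;
}
```

For instance, with mac_vec = 2, mac_len = 4 and input {1, 2, 3, 4, 5, 6, 7, 8}, the result is {1*2 + 3*4, 5*6 + 7*8} = {14, 86}.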

This is a simple example that introduces users to the automation mechanisms offered by ESP. This tutorial does not explore the capabilities of the HLS tool for design-space exploration.

Note: Users have access to prebuilt material to run the tutorial on an FPGA without executing all the steps described in this guide. See the ‘FPGA prototyping with prebuilt material’ section at the end of this guide.

1. Accelerator design

Accelerator skeleton

ESP provides an interactive script that generates all of the hardware and software sockets to quickly integrate a new accelerator in a full SoC.

Note: The script and the generated skeleton can be very helpful to the accelerator designer. The generated skeletons have a simple structure; it is left to the designers to modify them to their needs. The script may generate an incorrect skeleton if given unsupported inputs, e.g. an input data size equal to zero. You can verify the correctness of the skeleton by testing it as described in the rest of the guide, before editing it.

# Move to the ESP root folder
cd <esp>
# Run the accelerator initialization script and respond as follows
./tools/accgen/accgen.sh
=== Initializing ESP accelerator template ===

  * Enter accelerator name [dummy]: mac
  * Select design flow (Stratus HLS, Vivado HLS) [S]:
  * Enter ESP path [/space/esp-master]:
  * Enter unique accelerator id as three hex digits [04A]: 053
  * Enter accelerator registers
    - register 0 name [size]: mac_len
    - register 0 default value [1]: 64
    - register 0 max value [64]:
    - register 1 name []: mac_vec
    - register 1 default value [1]: 100
    - register 1 max value [100]:
    - register 2 name []: mac_n
    - register 2 default value [1]:
    - register 2 max value [1]: 16
    - register 3 name []:
  * Configure PLM size and create skeleton for load and store:
    - Enter data bit-width (8, 16, 32, 64) [32]:
    - Enter input data size in terms of configuration registers (e.g. 2 * mac_len}) [mac_len]: mac_len * mac_vec
      data_in_size_max = 6400
    - Enter output data size in terms of configuration registers (e.g. 2 * mac_len) [mac_len]: mac_vec
      data_out_size_max = 100
    - Enter an integer chunking factor (use 1 if you want PLM size equal to data size) [1]:
      Input PLM has 6400 32-bits words
      Output PLM has 100 32-bits words
    - Enter number of input data to be processed in batch (can be function of configuration registers) [1]: mac_n
      batching_factor_max = 16

=== Generated accelerator skeleton for mac ===

The description of each parameter configured by the accelerator initialization script is reported below.

Accelerator name: choose a unique name for the accelerator. Enter mac.
Design flow: choose the target HLS tool. Leave blank to select the default: Stratus HLS.
ESP path: if you run the script from the ESP root folder, leave blank to confirm. Otherwise, enter the path to ESP.
Accelerator ID: enter a unique device ID between 0x040 and 0x3FF for the accelerator. This is a three-digit hexadecimal number. Note that if any two accelerators have the same ID, the SoC generation step will fail. Enter 053.
Accelerator registers: list up to 14 accelerator-specific configuration registers. Each entry translates into a 32-bit configuration register exposed to software through a data structure called descriptor. For each register a corresponding field is added to the descriptor data structure. For every register you can specify a default configuration value and the maximum value that it can be set to. After specifying the last register, leave the next name blank to move forward. Please note that the script requires at least one register to be specified, but you are free not to use it when specifying the accelerator behavior.
- register 0: enter mac_len for name; 64 as default value; leave the max value blank.
- register 1: enter mac_vec for name; 100 as default value; leave the max value blank.
- register 2: enter mac_n for name; 1 as default value; 16 as max value.
- register 3: leave blank to move forward
Data bitwidth: select the size of the data word for the accelerator. This can be one byte (8), half word (16), word (32), or double word (64). This information is used to generate appropriate data transaction requests for the ESP accelerator socket. Note that while the script only asks for the default bitwidth, it is possible to apply minor edits to the generated code later on so that the accelerator can issue requests for different types of data. Enter 32.
Input data size: specify the size of the input data token. This is the unit of data that your kernel of computation processes per invocation. Typically the input size is a function of the configuration registers. For instance, the kernel of computation for the MAC accelerator processes mac_vec vectors, each of mac_len integers, at each iteration. Enter mac_len * mac_vec.
Output data size: specify the size of the output data token. The MAC returns mac_vec accumulated values at each iteration. Enter mac_vec.
Chunking factor: determine how data transactions and computation should be split with respect to the input and output data tokens. A chunking factor of 1 implies no split computation, i.e. an entire input dataset can be held in the accelerator private memory (on chip) and computation is not split across multiple load and store transactions. Larger chunking factors, instead, require multiple input transactions to load the entire dataset and multiple output transactions to store the results. Users are expected to modify the kernel of computation accordingly to handle split computation and data transfers. Enter 1.

Note: the chunking factor directly affects the size of the local memory required by the accelerator’s data path. The script prints the size of input and output memory elements after entering the chunking factor.
Batching factor: specify the number of iterations of the computation kernel to be executed at each invocation of the accelerator. A batching factor larger than 1 implies that the accelerator has to process multiple input data sets and return as many output data sets. This parameter can be a function of the configuration registers. Enter mac_n.

Note: the batching factor has no impact on area, nor on the memory elements. Instead, it affects the accelerator latency. In general, larger batching factors yield accelerators with longer execution latency, but higher throughput. This is due to the fewer invocations required to process a large amount of data.
In place operation: when input and output data sets have the same size, specify whether the output should be written in place, thus overwriting the input data. Does not apply to MAC.
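
The sizes printed by the script follow from the maximum register values entered above. The sketch below reproduces that arithmetic in C++. The constant names and the helpers plm_words() and input_words_per_invocation() are hypothetical, and the rounding-up behavior for chunking factors larger than 1 is an assumption rather than something taken from the script.

```cpp
#include <cstdint>

// Maximum register values entered in the script for the MAC example.
constexpr uint32_t MAC_LEN_MAX = 64;
constexpr uint32_t MAC_VEC_MAX = 100;
constexpr uint32_t MAC_N_MAX   = 16;

// Token sizes, in words, evaluated at the maximum register values.
constexpr uint32_t data_in_size_max  = MAC_LEN_MAX * MAC_VEC_MAX; // 6400
constexpr uint32_t data_out_size_max = MAC_VEC_MAX;               // 100

// Assumption: the PLM holds one chunk, i.e. the token size divided by the
// chunking factor, rounded up.
constexpr uint32_t plm_words(uint32_t data_size_max, uint32_t chunking_factor)
{
    return (data_size_max + chunking_factor - 1) / chunking_factor;
}

// Words read by one invocation: one input token per batch iteration.
// The batching factor changes this count, not the PLM size.
constexpr uint64_t input_words_per_invocation(uint32_t token_words,
                                              uint32_t batching_factor)
{
    return static_cast<uint64_t>(token_words) * batching_factor;
}
```

With the values used in this guide, plm_words(data_in_size_max, 1) evaluates to 6400 and plm_words(data_out_size_max, 1) to 100, matching the PLM sizes printed by the script.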

Executing the initialization script with the above parameters generates the accelerator source files and testbench in SystemC, together with the HLS scripts and the information for the private memory generator. These files are located at the path accelerators/stratus_hls/mac_stratus/hw.

In addition, the accelerator’s device driver, bare metal application and user-space Linux application are generated at the path accelerators/stratus_hls/mac_stratus/sw.

# Complete list of files generated and modified
<esp>/accelerators/stratus_hls/mac_stratus
├── hw
│   ├── mac.xml                # Accelerator description and register list
│   ├── hls                    # HLS and RTL-SystemC cosimulation folder
│   │   ├── Makefile
│   │   └── project.tcl        # Accelerator-specific HLS script
│   ├── memlist.txt            # List and shape of private-local memory elements
│   ├── sim                    # SystemC simulation folder
│   │   └── Makefile
│   ├── src                    # Accelerator source files
│   │   ├── mac_conf_info.hpp  # Configuration class definition
│   │   ├── mac.cpp            # Main SystemC processes description (load, compute, store)
│   │   ├── mac_debug_info.hpp # Optional debug class definition
│   │   ├── mac_directives.hpp # HLS directives
│   │   ├── mac_functions.hpp  # Optional helper functions for computation
│   │   └── mac.hpp            # ESP accelerator definition and memory binding
│   └── tb                     # SystemC testbench
│       ├── sc_main.cpp
│       ├── system.cpp
│       └── system.hpp
└── sw
    ├── baremetal              # Bare metal test application
    │   ├── mac.c
    │   └── Makefile
    └── linux
        ├── app                # Linux test application
        │   ├── cfg.h
        │   ├── mac.c
        │   └── Makefile
        ├── driver             # Linux device driver
        │   ├── mac_stratus.c
        │   ├── Kbuild
        │   └── Makefile
        └── include
            └── mac_stratus.h
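
As described for the accelerator registers, each register entered in the script becomes a field of the descriptor exposed to software. The sketch below only illustrates that idea: the struct name mac_descriptor_sketch and both helper functions are hypothetical, and the real definition in the generated software headers may differ in naming and layout.

```cpp
#include <cstdint>

// Hypothetical stand-in for the accelerator-specific fields of the
// descriptor: one 32-bit field per configuration register.
struct mac_descriptor_sketch {
    uint32_t mac_len; // register 0
    uint32_t mac_vec; // register 1
    uint32_t mac_n;   // register 2
};

// Default values entered in the script for this guide.
inline mac_descriptor_sketch mac_default_descriptor()
{
    mac_descriptor_sketch d;
    d.mac_len = 64;
    d.mac_vec = 100;
    d.mac_n = 1;
    return d;
}

// Check a configuration against the maximum values entered in the script.
inline bool mac_descriptor_in_range(const mac_descriptor_sketch &d)
{
    return d.mac_len <= 64 && d.mac_vec <= 100 && d.mac_n <= 16;
}
```

A test application could start from mac_default_descriptor(), override individual fields, and call mac_descriptor_in_range() before invoking the accelerator.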

Accelerator behavior implementation

When selecting the SystemC HLS flow, this step consists in editing the compute portion of the ESP accelerator skeleton.

Source files for the MAC accelerator are generated at the path accelerators/stratus_hls/mac_stratus/hw/src/. From this folder, open mac.cpp and locate the definition of the SystemC process compute_kernel(). Scroll down to the code section marked by the comment // Compute and replace the remaining code with the following.

// Compute
bool ping = true;
bool out_ping = true;
{
    for (uint16_t b = 0; b < mac_n; b++)
    {
        uint32_t in_length = mac_len * mac_vec;
        uint32_t out_length = mac_vec;

        uint32_t vector_index = 0;
        uint32_t vector_number = 0;
        int32_t acc = 0;

        for (int in_rem = in_length; in_rem > 0; in_rem -= PLM_IN_WORD)
        {

            uint32_t in_len  = in_rem  > PLM_IN_WORD  ? PLM_IN_WORD  : in_rem;

            this->compute_load_handshake();

            // Computing phase implementation
            for (int i = 0; i < in_len; i += 2) {

                // Multiply and accumulate
                if (ping)
                    acc += plm_in_ping[i] * plm_in_ping[i + 1];
                else
                    acc += plm_in_pong[i] * plm_in_pong[i + 1];

                vector_index += 2;

                // Write accumulated result
                if (vector_index == mac_len) {
                    if (out_ping)
                        plm_out_ping[vector_number] = acc;
                    else
                        plm_out_pong[vector_number] = acc;

                    acc = 0;
                    vector_index = 0;
                    vector_number++;
                }
            }

            ping = !ping;
        }

        this->compute_store_handshake();
        out_ping = !out_ping;
    }
    // Conclude
    {
        this->process_done();
    }
}

NOTE: The prebuilt material contains the complete source code of the MAC accelerator.

Without editing the code, the generated accelerator implements an identity function that moves data from input to output. With respect to this skeleton, the snippet above implements the following changes.

The inner loop processes two elements of the input data per iteration (i += 2).
A new variable acc is used to accumulate the products for one vector of length mac_len.
Writes to the output memory occur only when the accumulation for one vector is completed.
Two additional counters, vector_index and vector_number, are used to keep track of the position in the current vector and of the number of the vector that is being processed.
Switching between the output ping-pong buffers occurs at a different rate with respect to input buffers. Hence a new variable out_ping is used to control which output buffer should be written to.
Unused variables, such as out_len, are removed.

Please take a moment to understand and get familiar with these changes in the code above.
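
To see why acc, vector_index and vector_number must persist across input chunks, the following plain C++ model mimics the bookkeeping of the compute loop without SystemC, handshakes or ping-pong buffers. The function mac_chunked() and its parameter plm_in_words (a stand-in for PLM_IN_WORD) are hypothetical; the model assumes that mac_len and plm_in_words are both even.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Software model of the accumulation bookkeeping in compute_kernel().
// The input token (mac_vec vectors of mac_len words) is consumed in chunks
// of at most plm_in_words words, as if each chunk were one PLM buffer.
std::vector<int32_t> mac_chunked(const std::vector<int32_t> &in,
                                 uint32_t mac_len, uint32_t mac_vec,
                                 uint32_t plm_in_words)
{
    std::vector<int32_t> out(mac_vec, 0);
    uint32_t vector_index = 0;  // position inside the current vector
    uint32_t vector_number = 0; // index of the vector being accumulated
    int32_t acc = 0;
    std::size_t base = 0;       // offset of the current chunk in the token

    for (int64_t in_rem = static_cast<int64_t>(mac_len) * mac_vec;
         in_rem > 0; in_rem -= plm_in_words) {
        const uint32_t in_len =
            in_rem > plm_in_words ? plm_in_words : static_cast<uint32_t>(in_rem);
        const int32_t *plm_in = &in[base]; // models the buffer filled by load

        for (uint32_t i = 0; i < in_len; i += 2) {
            acc += plm_in[i] * plm_in[i + 1];
            vector_index += 2;

            // A vector may end in the middle of a chunk, or span two chunks.
            if (vector_index == mac_len) {
                out[vector_number] = acc;
                acc = 0;
                vector_index = 0;
                vector_number++;
            }
        }
        base += in_len;
    }
    return out;
}
```

With mac_len = 4, mac_vec = 3 and inputs 1 to 12, the model returns {14, 86, 222} both when the whole token fits in one chunk (plm_in_words = 12) and when the second vector straddles two chunks (plm_in_words = 6).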

Testbench implementation

Most of the testbench code is generated at the path accelerators/stratus_hls/mac_stratus/hw/tb/. To complete and specialize it for the target accelerator, open system.cpp and locate the initialization of the input array in and of the golden output array gold. Replace the default initialization code with the following.

// Initialize input
in = new int32_t[in_size];
for (int i = 0; i < mac_n; i++)
    for (int j = 0; j < mac_len * mac_vec; j++)
        in[i * in_words_adj + j] = j % mac_vec;

// Compute golden output
gold = new int32_t[out_size];
for (int i = 0; i < mac_n; i++)
    for (int j = 0; j < mac_vec; j++) {
        gold[i * out_words_adj + j] = 0;
        for (int k = 0; k < mac_len; k += 2)
            gold[i * out_words_adj + j] +=
                in[i * in_words_adj + j * mac_len + k] * in[i * in_words_adj + j * mac_len + k + 1];
    }

For the purpose of this tutorial, the input array can be initialized with any dataset, including random numbers. However, make sure your MAC compute body doesn’t overflow the integer representation to avoid validation errors.
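
One way to check a candidate dataset for overflow before using it in the testbench is to repeat the accumulation in 64 bits and compare against the 32-bit range. The helper mac_fits_int32() below is a hypothetical sketch, not part of the generated testbench; it assumes the same flat layout used above (mac_vec vectors of mac_len words).

```cpp
#include <cstdint>
#include <limits>
#include <vector>

// Returns true if every product and every running sum of the MAC over `in`
// fits in int32_t. The accumulation is repeated in 64 bits to detect overflow.
bool mac_fits_int32(const std::vector<int32_t> &in,
                    uint32_t mac_len, uint32_t mac_vec)
{
    const int64_t lo = std::numeric_limits<int32_t>::min();
    const int64_t hi = std::numeric_limits<int32_t>::max();

    for (uint32_t j = 0; j < mac_vec; j++) {
        int64_t acc = 0;
        for (uint32_t k = 0; k + 1 < mac_len; k += 2) {
            const int64_t prod =
                static_cast<int64_t>(in[j * mac_len + k]) * in[j * mac_len + k + 1];
            acc += prod;
            if (prod < lo || prod > hi || acc < lo || acc > hi)
                return false;
        }
    }
    return true;
}
```

The pattern used in this guide (j % mac_vec with mac_vec = 100 and mac_len = 64) keeps every value below 100, so each product stays below 10,000 and each accumulated result below 320,000, far from the 32-bit limit.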

HLS configuration

The HLS script is fully generated at the path accelerators/stratus_hls/mac_stratus/hw/hls and it defines synthesis directives for all of the FPGAs supported by ESP. For every target FPGA, two default HLS configurations are defined: basic_dma32 and basic_dma64. These two configurations are necessary for integration with both 32-bit and 64-bit architectures. You are free to define more implementations, which may or may not exist for both 32-bit and 64-bit systems, but the suffix _dma32 or _dma64 must be used when naming the HLS configurations.

In this tutorial we only need to adjust the target clock period for the Virtex7 FPGA target to make sure we reach proper timing closure. Open the project.tcl file and replace the CLOCK_PERIOD setting for Virtex7 from 10.0 to 12.5. The time unit is determined by the vendor library, which is ns in this case.

Simulation and RTL implementation

Choose one of the supported boards to create your new SoC instance. Design paths in this tutorial refer to the Xilinx VC707 evaluation board, but all instructions are valid for any of the supported boards.

After creating the MAC accelerator, ESP discovers it in the library of components and generates a set of make targets to test your specification, generate RTL and run a cosimulation of the SystemC testbench for every RTL implementation generated with HLS.

# Move to the Xilinx VC707 working folder
cd <esp>/socs/xilinx-vc707-xc7vx485t

# Run behavioral simulation
make mac_stratus-exe

# Generate RTL with HLS
make mac_stratus-hls

# Simulate all RTL implementations
make mac_stratus-sim

Accelerator debug

At every simulation stage, you may encounter issues that require debugging.

If the simulation output is incorrect at the behavioral level, you can debug your implementation as you would debug any C++ program. If you need to change compile flags, in order to run a debugger, you can do so by modifying the Makefile located at accelerators/stratus_hls/common/systemc.mk.

In case of simulation errors during the RTL simulation that do not occur in behavioral simulation, you have two main options for debugging:

Use print statement generation during HLS: any call to printf() in the accelerator source code gets translated by Stratus HLS into a call to $display in Verilog.

Note: print statement generation affects the final scheduling and quality of result. It is therefore recommended to remove print statements from the accelerator’s specification when debugging is complete.

Leverage RTL simulator to visualize waveforms.

# Move to the HLS working folder
cd <esp>/accelerators/stratus_hls/mac_stratus/hw/hls-work-virtex7
# Set variables for simulation (for the VCU118 FPGA board TECH=virtexup)
export ESP_ROOT=../../../../.. TECH=virtex7 ACCELERATOR=mac_stratus
# Run RTL simulation with GUI support
make debug_BASIC_DMA64_V
# or
make debug_BASIC_DMA32_V
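
For the first option (print statements), it can help to keep the printf() calls behind a compile-time switch, so that they can be removed once debugging is complete. The sketch below shows one way to do this. The macro names MAC_DEBUG_PRINT and MAC_PRINTF and the function write_result() are hypothetical and not part of the generated skeleton, and it is an assumption that your tool flow accepts variadic macros in the accelerator sources.

```cpp
#include <stdint.h>
#include <stdio.h>

// Hypothetical switch: define MAC_DEBUG_PRINT to keep the print statements,
// leave it undefined to compile them out.
#ifdef MAC_DEBUG_PRINT
#define MAC_PRINTF(...) printf(__VA_ARGS__)
#else
#define MAC_PRINTF(...) do { } while (0)
#endif

// Stand-in for the point in the compute process where a result is written
// to the output memory.
int32_t write_result(int32_t *plm_out, uint32_t vector_number, int32_t acc)
{
    MAC_PRINTF("vector %u: acc = %d\n", (unsigned) vector_number, (int) acc);
    plm_out[vector_number] = acc;
    return plm_out[vector_number];
}
```

With the switch undefined, the macro expands to an empty statement, so the specification contains no print calls and the note above about scheduling and quality of result no longer applies.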

Typical issues in case of SystemC vs. RTL mismatches occur when there are non-initialized variables or when arrays that are mapped to static RAM memory are accessed differently from what’s specified in the list of memories to be generated (memlist.txt). Details about the content of the memory list will be given in the tutorial about design-space exploration with HLS (coming soon).

2. Accelerator integration

User application implementation

In this tutorial we select the RISC-V Ariane core and use the corresponding paths to the software source code. Please note, however, that all instructions are valid for a Leon3 system as well.

Both baremetal and Linux test applications for the MAC accelerator are generated at the path <esp>/accelerators/stratus_hls/mac_stratus/sw. To complete them, you need to apply the same edit to both baremetal and Linux applications. The changes consist in initializing inputs and golden outputs, similarly to what’s done for the SystemC testbench.

Move to the path <esp>/accelerators/stratus_hls/mac_stratus/sw/baremetal, open mac.c, locate the init_buf() function, and replace its body with the following code.

int i;
int j;
int k;

for (i = 0; i < mac_n; i++)
	for (j = 0; j < mac_len * mac_vec; j++)
		in[i * in_words_adj + j] = j % mac_vec;

// Compute golden output
for (i = 0; i < mac_n; i++)
	for (j = 0; j < mac_vec; j++) {
		gold[i * out_words_adj + j] = 0;
		for (k = 0; k < mac_len; k += 2)
			gold[i * out_words_adj + j] +=
				in[i * in_words_adj + j * mac_len + k] * in[i * in_words_adj + j * mac_len + k + 1];
	}

Now move to <esp>/accelerators/stratus_hls/mac_stratus/sw/linux/app, open mac.c and replace the body of init_buffer() with the same code shown above.

Note: this code is just a port to C of the C++ code used for the SystemC testbench.
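
The code above indexes each batch with in_words_adj and out_words_adj rather than with the raw token sizes. The sketch below illustrates one plausible reading of those variables, namely that each per-batch size is rounded up to a whole number of DMA beats, and shows how a result check can skip the padding. The functions round_up() and validate() written here are hypothetical illustrations; the generated applications already contain their own validation code, which may differ.

```cpp
#include <stdint.h>

// Round n up to the next multiple of m (m must be greater than zero).
inline uint32_t round_up(uint32_t n, uint32_t m)
{
    return ((n + m - 1) / m) * m;
}

// Count mismatches between the accelerator output and the golden output.
// Assumption: each batch occupies out_words_adj words, of which only the
// first mac_vec are valid results; the rest is padding and is not compared.
int validate(const int32_t *out, const int32_t *gold,
             uint32_t mac_n, uint32_t mac_vec, uint32_t out_words_adj)
{
    int errors = 0;
    for (uint32_t i = 0; i < mac_n; i++)
        for (uint32_t j = 0; j < mac_vec; j++)
            if (out[i * out_words_adj + j] != gold[i * out_words_adj + j])
                errors++;
    return errors;
}
```

For example, with 32-bit data on a 64-bit DMA interface there would be two words per beat, so an output token of 100 words needs no padding (round_up(100, 2) is 100), while a token of 5 words would be padded to 6.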

SoC configuration

The final steps of the tutorial coincide with those presented in the tutorial about designing a single core SoC. We recommend you review those steps if you are not familiar with ESP.

# Move to the Xilinx VC707 working folder
cd <esp>/socs/xilinx-vc707-xc7vx485t

Follow the “Debug link configuration” instructions from the “How to: design a single-core SoC” guide. Then configure the SoC using the ESP configuration GUI.

# Run the ESP configuration GUI
make esp-xconfig

Select Ariane in the “CPU Architecture” frame and disable the caches from the “Cache configuration” frame. Select a 2x2 layout and set 1 memory tile, 1 processor tile, 1 I/O tile and 1 MAC tile. The implementation for MAC will default to basic_dma64.

RTL simulation

Users can run a full-system RTL simulation of the MAC accelerator driven by the baremetal application running on the processor tile.

The bare-metal simulation is slow. To shorten it, you may want to reduce the default values of mac_len and mac_vec in the bare-metal C application.

# Compile baremetal application
make mac_stratus-baremetal

# Modelsim
TEST_PROGRAM=./soft-build/<cpu>/baremetal/mac_stratus.exe make sim[-gui]

# Incisive
TEST_PROGRAM=./soft-build/<cpu>/baremetal/mac_stratus.exe make ncsim[-gui]

<cpu> corresponds to ariane because we selected the Ariane core in the “SoC Configuration” step.

FPGA prototyping

Follow the “FPGA prototyping” instructions from the “How to: design a single-core SoC” guide.

The only difference is that, just like for the RTL simulation, you need to specify the TEST_PROGRAM variable when launching the bare-metal test on FPGA:

TEST_PROGRAM=./soft-build/<cpu>/baremetal/mac_stratus.exe make fpga-run

To execute the Linux application, log into Linux from the ESP Linux terminal, then run the MAC test application:

$ cd /applications/test/
$ ./mac_stratus.exe

====== mac_stratus.0 ======

  .mac_n = 1
  .mac_vec = 100
  .mac_len = 64

  ** START **
  > Test time: 13575640 ns
    - mac_stratus.0 time: 1134480 ns

  ** DONE **
+ Test PASSED

====== mac_stratus.0 ======

FPGA prototyping with prebuilt material

With the provided prebuilt material, you can run the tutorial on FPGA directly. Each packet is marked with the first digits of the Git revision it was created and tested with.

The packet contains the following:

The source code, testbench and HLS scripts for the MAC accelerator (accelerators/stratus_hls/mac)
The bare-metal test application and the Linux device driver and test application for the MAC accelerator (soft/[ariane|leon3]/drivers/mac)
Two working folders for Xilinx VCU118 and Xilinx VC707, each including:
- The Linux image (linux.bin)
- The Baremetal application (mac.bin)
- The boot loader image (prom.bin)
- The FPGA bitstream (top.bit)
- The hidden configuration files for the design (.grlib_config and .esp_config)
- A script to run the design on FPGA (runme.sh)

Decompress the content of the packet from the ESP root folder to make sure all files are extracted to the right location.

cd <esp>
tar xf ESP_SystemcAcc_GitRev.ddaca94.tar.gz

Enter one of the SoC instances extracted from the packet.

cd socs/systemc_acc_vc707

Follow the “UART interface” instructions from the “How to: design a single-core SoC” guide, then launch the runme.sh script.

# Execute baremetal test
./runme.sh mac
# Boot Linux
./runme.sh

Finally, from the ESP Linux terminal, run the MAC test application:

$ cd /applications/test/
$ ./mac.exe