Guide – How to: design an accelerator in SystemC (Cadence Stratus HLS)
Latest update: 2020-12-21
This guide illustrates how to create and integrate an accelerator with the ESP high-level synthesis (HLS) flow, using SystemC as the specification language and Cadence’s Stratus HLS to generate a corresponding RTL implementation.
Make sure to complete the prerequisite tutorials before getting started with this one. This tutorial assumes that accelerator designers are familiar with the ESP infrastructure and know how to run basic make targets to create a simple instance of ESP, integrating just a single core.
In this guide and in the corresponding prebuilt material we integrate an accelerator that performs multiply and accumulate (MAC) on integer vectors of configurable length. Specifically, we implement the following small kernel of computation.
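// MAC
// data_in holds mac_vec vectors of mac_len integers, stored back to back
int *data_in = new int[mac_vec * mac_len];
int *acc = new int[mac_vec];
for (int iterations = 0; iterations < mac_n; iterations++) {
    load_input_data(data_in);
    for (int j = 0; j < mac_vec; j++) {
        acc[j] = 0;
        // Multiply adjacent pairs of elements and accumulate the products
        for (int i = 0; i < mac_len; i += 2)
            acc[j] += data_in[j * mac_len + i] * data_in[j * mac_len + i + 1];
    }
    store_output_data(acc);
}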
This is a simple example that introduces users to the automation mechanisms offered by ESP. The tutorial does not explore the capabilities of the HLS tool for design-space exploration.
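As a point of reference, the kernel above can be sketched as a plain C++ function that processes a single iteration. This is only an illustrative software model: the name mac_reference, the flat row-major layout of data_in, and the requirement that mac_len is even (elements are consumed in pairs) are assumptions made for this sketch and are not part of the ESP-generated code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Software reference for one iteration of the MAC kernel (hypothetical helper).
// data_in holds mac_vec vectors of mac_len integers, stored back to back.
std::vector<int32_t> mac_reference(const std::vector<int32_t> &data_in,
                                   std::size_t mac_vec, std::size_t mac_len)
{
    assert(mac_len % 2 == 0);                     // elements are consumed in pairs
    assert(data_in.size() == mac_vec * mac_len);  // one full input token

    std::vector<int32_t> acc(mac_vec, 0);
    for (std::size_t j = 0; j < mac_vec; j++)
        for (std::size_t i = 0; i < mac_len; i += 2)
            acc[j] += data_in[j * mac_len + i] * data_in[j * mac_len + i + 1];
    return acc;
}
```

For instance, with two vectors of four elements, {1, 2, 3, 4} and {5, 6, 7, 8}, the function returns 1*2 + 3*4 = 14 and 5*6 + 7*8 = 86.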
Note: The users have access to prebuilt material to run the tutorial on an FPGA, without executing all the previous steps. See the ‘FPGA prototyping with prebuilt material’ section at the end of this guide.
1. Accelerator design
Accelerator skeleton
ESP provides an interactive script that generates all of the hardware and software sockets to quickly integrate a new accelerator in a full SoC.
Note: The script and the generated skeleton can be very helpful to the accelerator designer. The generated skeletons have a simple structure; it is left to the designers to modify them to their needs. The script may generate an incorrect skeleton if it is given unsupported inputs, e.g. an input data size equal to zero. You can verify the correctness of the skeleton by testing it as described in the rest of the guide, before editing it.
# Move to the ESP root folder
cd <esp>
# Run the accelerator initialization script and respond as follows
./tools/accgen/accgen.sh
=== Initializing ESP accelerator template ===
* Enter accelerator name [dummy]: mac
* Select design flow (Stratus HLS, Vivado HLS) [S]:
* Enter ESP path [/space/esp-master]:
* Enter unique accelerator id as three hex digits [04A]: 053
* Enter accelerator registers
- register 0 name [size]: mac_len
- register 0 default value [1]: 64
- register 0 max value [64]:
- register 1 name []: mac_vec
- register 1 default value [1]: 100
- register 1 max value [100]:
- register 2 name []: mac_n
- register 2 default value [1]:
- register 2 max value [1]: 16
- register 3 name []:
* Configure PLM size and create skeleton for load and store:
- Enter data bit-width (8, 16, 32, 64) [32]:
- Enter input data size in terms of configuration registers (e.g. 2 * mac_len}) [mac_len]: mac_len * mac_vec
data_in_size_max = 6400
- Enter output data size in terms of configuration registers (e.g. 2 * mac_len) [mac_len]: mac_vec
data_out_size_max = 100
- Enter an integer chunking factor (use 1 if you want PLM size equal to data size) [1]:
Input PLM has 6400 32-bits words
Output PLM has 100 32-bits words
- Enter number of input data to be processed in batch (can be function of configuration registers) [1]: mac_n
batching_factor_max = 16
=== Generated accelerator skeleton for mac ===
The description of each parameter configured by the accelerator initialization
script is reported below.
- Accelerator name: choose a unique name for the accelerator. Enter mac.
- Design flow: choose the target HLS tool. Leave blank to select the default: Stratus HLS.
- ESP path: if you run the script from the ESP root folder, leave blank to confirm. Otherwise, enter the path to ESP.
- Accelerator ID: enter a unique device ID between 0x040 and 0x3FF for the accelerator. This is a three digit hexadecimal number. Note that if any two accelerators have the same ID, the SoC generation step will fail. Enter 053.
- Accelerator registers: list up to 14 accelerator-specific
configuration registers. Each entry translates into a 32-bit
configuration register exposed to software through a data structure
called descriptor. For each register a corresponding field is
added to the descriptor data structure.
For every register you can specify a default configuration value
and the maximum value that it can be set to. After specifying the
last register, leave the next name blank to move forward. Please
note that the script requires at least one register to be
specified, but you are free not to use it when specifying the
accelerator behavior.
- register 0: enter mac_len for name; 64 as default value; leave blank the max value.
- register 1: enter mac_vec for name; 100 as default value; leave blank the max value.
- register 2: enter mac_n for name; 1 as default value; 16 as max value
- register 3: leave blank to move forward
- Data bitwidth: select the size of the data word for the accelerator. This can be one byte (8), half word (16), word (32), or double word (64). This information is used to generate appropriate data transaction requests for the ESP accelerator socket. Note that while the script only asks for the default bitwidth, it is possible to apply minor edits to the generated code later on so that the accelerator can issue requests for different types of data. Enter 32.
- Input data size: specify the size of the input data token. This is the unit of data that your kernel of computation processes per invocation. Typically the input size is a function of the configuration registers. For instance, at each iteration the kernel of computation for the MAC accelerator processes mac_vec vectors, each of mac_len integer numbers. Enter mac_len * mac_vec.
- Output data size: specify the size of the output data token. The MAC returns mac_vec accumulated values at each iteration. Enter mac_vec.
- Chunk factor: determine how data transactions and computation should be split with respect to the input and output data tokens. A chunking factor of 1 implies no split computation, i.e. an entire input dataset can be held in the accelerator private memory (on chip) and computation is not split across multiple load and store transactions. Larger chunking factors, instead, require multiple input transactions to load the entire dataset and multiple output transactions to store the results. Users are expected to modify the kernel of computation accordingly to handle split computation and data transfers. Enter 1.
Note: the chunking factor directly affects the size of the local memory required by the accelerator’s data path. The script prints the size of input and output memory elements after entering the chunking factor.
- Batching factor: specify the number of iterations of the
computation kernel to be executed at each invocation of the
accelerator. A batching factor larger than 1 implies that the
accelerator has to process multiple input data sets and return as
many output data sets. This parameter can be a
function of the configuration registers. Enter mac_n.
Note: the batching factor has no impact on area, nor on the memory elements. Instead, it affects the accelerator latency. In general, larger batching factors yield accelerators with longer execution latency, but higher throughput. This is due to the fewer invocations required to process a large amount of data.
- In place operation: when input and output data sets have the same size, specify whether the output should be written in place, thus overwriting the input data. Does not apply to MAC.
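The sizes printed by the script follow directly from the maximum register values entered above. The sketch below reproduces that arithmetic and the device ID range check. The names PlmSizes, plm_sizes and is_valid_accelerator_id are hypothetical, and the rule used for chunking factors larger than 1 (data size divided by the chunking factor, rounded up) is an assumption; the guide only states that a factor of 1 makes the PLM size equal to the data size.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical summary of the values reported by the initialization script.
struct PlmSizes {
    uint32_t data_in_size_max;   // mac_len_max * mac_vec_max
    uint32_t data_out_size_max;  // mac_vec_max
    uint32_t plm_in_words;       // words in the input PLM
    uint32_t plm_out_words;      // words in the output PLM
};

// Assumption: each PLM holds data_size / chunking_factor words, rounded up.
PlmSizes plm_sizes(uint32_t mac_len_max, uint32_t mac_vec_max,
                   uint32_t chunking_factor)
{
    assert(chunking_factor >= 1);
    PlmSizes s;
    s.data_in_size_max = mac_len_max * mac_vec_max;
    s.data_out_size_max = mac_vec_max;
    s.plm_in_words = (s.data_in_size_max + chunking_factor - 1) / chunking_factor;
    s.plm_out_words = (s.data_out_size_max + chunking_factor - 1) / chunking_factor;
    return s;
}

// Device IDs must lie between 0x040 and 0x3FF, as stated in the list above.
bool is_valid_accelerator_id(uint32_t id)
{
    return id >= 0x040 && id <= 0x3FF;
}
```

With mac_len at most 64, mac_vec at most 100 and a chunking factor of 1, plm_sizes(64, 100, 1) yields 6400 input words and 100 output words, matching the script output, and is_valid_accelerator_id(0x053) holds for the ID chosen in this tutorial.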
Executing the initialization script with the above parameters generates the accelerator source files and testbench in SystemC, together with the HLS scripts and the information for the private memory generator. These files are located at the path accelerators/stratus_hls/mac_stratus/hw.
In addition, the accelerator’s device driver, bare-metal application and user-space Linux application are generated at the path accelerators/stratus_hls/mac_stratus/sw.
# Complete list of files generated and modified
<esp>/accelerators/stratus_hls/mac_stratus
├── hw
│ ├── mac.xml # Accelerator description and register list
│ ├── hls # HLS and RTL-SystemC cosimulation folder
│ │ ├── Makefile
│ │ └── project.tcl # Accelerator-specific HLS script
│ ├── memlist.txt # List and shape of private-local memory elements
│ ├── sim # SystemC simulation folder
│ │ └── Makefile
│ ├── src # Accelerator source files
│ │ ├── mac_conf_info.hpp # Configuration class definition
│ │ ├── mac.cpp # Main SystemC processes description (load, compute, store)
│ │ ├── mac_debug_info.hpp # Optional debug class definition
│ │ ├── mac_directives.hpp # HLS directives
│ │ ├── mac_functions.hpp # Optional helper functions for computation
│ │ └── mac.hpp # ESP accelerator definition and memory binding
│ └── tb # SystemC testbench
│ ├── sc_main.cpp
│ ├── system.cpp
│ └── system.hpp
└── sw
├── baremetal # Bare metal test application
│ ├── mac.c
│ └── Makefile
└── linux
├── app # Linux test application
│ ├── cfg.h
│ ├── mac.c
│ └── Makefile
├── driver # Linux device driver
│ ├── mac_stratus.c
│ ├── Kbuild
│ └── Makefile
└── include
└── mac_stratus.h
Accelerator behavior implementation
When selecting the SystemC HLS flow, this step consists in editing the compute portion of the ESP accelerator skeleton.
Source files for the MAC accelerator are generated at the path accelerators/stratus_hls/mac_stratus/hw/src/.
From this folder, open mac.cpp and locate the definition of the SystemC process compute_kernel(). Scroll down to the code section marked by the comment // Compute and replace the remaining code with the following.
// Compute
bool ping = true;
bool out_ping = true;
{
    for (uint16_t b = 0; b < mac_n; b++)
    {
        uint32_t in_length = mac_len * mac_vec;
        uint32_t out_length = mac_vec;
        uint32_t vector_index = 0;
        uint32_t vector_number = 0;
        int32_t acc = 0;
        for (int in_rem = in_length; in_rem > 0; in_rem -= PLM_IN_WORD)
        {
            uint32_t in_len = in_rem > PLM_IN_WORD ? PLM_IN_WORD : in_rem;
            this->compute_load_handshake();
            // Computing phase implementation
            for (int i = 0; i < in_len; i += 2) {
                // Multiply and accumulate
                if (ping)
                    acc += plm_in_ping[i] * plm_in_ping[i + 1];
                else
                    acc += plm_in_pong[i] * plm_in_pong[i + 1];
                vector_index += 2;
                // Write accumulated result
                if (vector_index == mac_len) {
                    if (out_ping)
                        plm_out_ping[vector_number] = acc;
                    else
                        plm_out_pong[vector_number] = acc;
                    acc = 0;
                    vector_index = 0;
                    vector_number++;
                }
            }
            ping = !ping;
        }
        this->compute_store_handshake();
        out_ping = !out_ping;
    }
    // Conclude
    {
        this->process_done();
    }
}
NOTE: The prebuilt material contains the complete source code of the MAC accelerator.
Without editing the code, the generated accelerator implements an identity function that moves data from input to output. With respect to this skeleton, the snippet above implements the following changes.
- The inner loop processes two elements of the input data per iteration (i += 2).
- A new variable acc is used to accumulate one vector of length mac_len.
- Writes to the output memory occur only when the accumulation for one vector is completed.
- Two additional counters, vector_index and vector_number, are used to keep track of the position in the current vector and of the number of the vector that is being processed.
- Switching between the output ping-pong buffers occurs at a different rate with respect to input buffers. Hence a new variable out_ping is used to control which output buffer should be written to.
- Unused variables, such as out_len, are removed.
Please take a moment to understand and get familiar with these changes in the code above.
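To see how the two counters behave when the input arrives in chunks, the loop structure can be mimicked in plain C++ without SystemC. The sketch below is a software model for one batch iteration only: mac_chunked is a hypothetical name, plm_in_word stands in for PLM_IN_WORD, a single flat input vector replaces the ping-pong buffers and handshakes, and both mac_len and the chunk size are assumed to be even so that no pair of elements straddles two chunks.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Plain C++ model of the compute loop for one batch iteration (hypothetical).
// plm_in_word plays the role of PLM_IN_WORD: the number of words per chunk.
std::vector<int32_t> mac_chunked(const std::vector<int32_t> &in,
                                 uint32_t mac_len, uint32_t mac_vec,
                                 uint32_t plm_in_word)
{
    assert(mac_len % 2 == 0 && plm_in_word > 0 && plm_in_word % 2 == 0);
    assert(in.size() == static_cast<std::size_t>(mac_len) * mac_vec);

    std::vector<int32_t> out(mac_vec, 0);
    uint32_t vector_index = 0;   // position inside the current vector
    uint32_t vector_number = 0;  // index of the vector being accumulated
    int32_t acc = 0;             // survives across chunk boundaries
    std::size_t base = 0;        // offset of the current chunk in the input

    for (int64_t in_rem = static_cast<int64_t>(in.size()); in_rem > 0;
         in_rem -= plm_in_word) {
        uint32_t in_len =
            in_rem > plm_in_word ? plm_in_word : static_cast<uint32_t>(in_rem);

        for (uint32_t i = 0; i < in_len; i += 2) {
            // Multiply and accumulate one pair of elements
            acc += in[base + i] * in[base + i + 1];
            vector_index += 2;

            // Write the result only when a whole vector has been consumed
            if (vector_index == mac_len) {
                out[vector_number] = acc;
                acc = 0;
                vector_index = 0;
                vector_number++;
            }
        }
        base += in_len;
    }
    return out;
}
```

Because acc and vector_index are kept across chunks, the result does not depend on the chunk size: for the input {1, 2, 3, 4, 5, 6, 7, 8} with mac_len = 4 and mac_vec = 2, chunk sizes of 2, 6 and 8 words all produce 14 and 86.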
Testbench implementation
Most of the testbench code is generated at the path accelerators/stratus_hls/mac_stratus/hw/tb/. To complete and specialize it for the target accelerator, open system.cpp and locate the initialization of the input array in and of the golden output array gold. Replace the default initialization code with the following.
// Initialize input
in = new int32_t[in_size];
for (int i = 0; i < mac_n; i++)
    for (int j = 0; j < mac_len * mac_vec; j++)
        in[i * in_words_adj + j] = j % mac_vec;
// Compute golden output
gold = new int32_t[out_size];
for (int i = 0; i < mac_n; i++)
    for (int j = 0; j < mac_vec; j++) {
        gold[i * out_words_adj + j] = 0;
        for (int k = 0; k < mac_len; k += 2)
            gold[i * out_words_adj + j] +=
                in[i * in_words_adj + j * mac_len + k] * in[i * in_words_adj + j * mac_len + k + 1];
    }
For the purpose of this tutorial, the input array can be initialized with any dataset, including random numbers. However, make sure your MAC compute body doesn’t overflow the integer representation to avoid validation errors.
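One way to check the overflow condition for a given dataset is to recompute the golden values with a wider accumulator and compare the largest one against the 32-bit limit. The sketch below does this for the initialization pattern shown above (in[j] = j % mac_vec) and a single batch. The function name max_golden_value is hypothetical, and mac_len is assumed to be even.

```cpp
#include <cassert>
#include <cstdint>

// Largest golden value produced by the pattern in[j] = j % mac_vec,
// computed with a 64-bit accumulator so the check itself cannot overflow
// for the register ranges used in this guide (hypothetical helper).
int64_t max_golden_value(uint32_t mac_len, uint32_t mac_vec)
{
    assert(mac_len % 2 == 0 && mac_vec > 0);

    int64_t max_value = 0;
    for (uint64_t j = 0; j < mac_vec; j++) {
        int64_t sum = 0;
        for (uint64_t k = 0; k < mac_len; k += 2) {
            // Same indexing as the testbench, for one batch
            int64_t a = static_cast<int64_t>((j * mac_len + k) % mac_vec);
            int64_t b = static_cast<int64_t>((j * mac_len + k + 1) % mac_vec);
            sum += a * b;
        }
        if (sum > max_value)
            max_value = sum;
    }
    return max_value;
}
```

With the default values mac_len = 64 and mac_vec = 100, every element is below 100 and each vector accumulates 32 products, so the result stays far below INT32_MAX. A dataset is safe for the 32-bit accumulator whenever max_golden_value(mac_len, mac_vec) <= INT32_MAX.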
HLS configuration
The HLS script is fully generated at the path accelerators/stratus_hls/mac_stratus/hw/hls and it defines synthesis directives for all of the FPGAs supported by ESP. For every target FPGA, two default HLS configurations are defined: basic_dma32 and basic_dma64. These two configurations are necessary for integration with both 32-bit and 64-bit architectures. You are free to define more implementations, which may or may not exist for both 32-bit and 64-bit systems, but the suffix _dma32 or _dma64 must be used when naming the HLS configurations.
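The naming rule can be expressed as a small check on the configuration name. The helper below is purely illustrative and is not part of ESP or Stratus HLS; dma_width_from_config is a hypothetical name, and a return value of 0 is used here to flag a name that lacks the required suffix.

```cpp
#include <cassert>
#include <string>

// Returns the DMA width encoded in an HLS configuration name, or 0 when the
// name lacks the required _dma32 / _dma64 suffix (hypothetical helper).
int dma_width_from_config(const std::string &name)
{
    assert(!name.empty());  // a configuration must have a name

    auto ends_with = [&name](const std::string &suffix) {
        return name.size() >= suffix.size() &&
               name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0;
    };

    if (ends_with("_dma32"))
        return 32;
    if (ends_with("_dma64"))
        return 64;
    return 0;
}
```

For example, basic_dma32 and basic_dma64 map to 32 and 64 respectively, while a name such as basic would be rejected.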
In this tutorial we only need to adjust the target clock period for the Virtex7 FPGA target to make sure we reach proper timing closure. Open the project.tcl file and change the CLOCK_PERIOD setting for Virtex7 from 10.0 to 12.5. The time unit is determined by the vendor library, which is ns in this case.
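In terms of frequency, relaxing the period from 10.0 ns to 12.5 ns lowers the HLS target clock from 100 MHz to 80 MHz. The conversion is simple arithmetic; the helper name period_ns_to_mhz below is hypothetical and only illustrates the calculation.

```cpp
#include <cassert>

// Converts a clock period in nanoseconds to a frequency in MHz:
// f [MHz] = 1000 / T [ns].
double period_ns_to_mhz(double period_ns)
{
    assert(period_ns > 0.0);
    return 1000.0 / period_ns;
}
```

Here period_ns_to_mhz(10.0) gives 100 and period_ns_to_mhz(12.5) gives 80.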
Simulation and RTL implementation
Choose one of the supported boards to create your new SoC instance. Design paths in this tutorial refer to the Xilinx VC707 evaluation board, but all instructions are valid for any of the supported boards.
After creating the MAC accelerator, ESP discovers it in the library of components and generates a set of make targets to test your specification, generate RTL and run a cosimulation of the SystemC testbench for every RTL implementation generated with HLS.
# Move to the Xilinx VC707 working folder
cd <esp>/socs/xilinx-vc707-xc7vx485t
# Run behavioral simulation
make mac_stratus-exe
# Generate RTL with HLS
make mac_stratus-hls
# Simulate all RTL implementations
make mac_stratus-sim
Accelerator debug
At every simulation stage, you may encounter issues that require debugging. If the simulation output is incorrect at the behavioral level, you can debug your implementation as you would debug any C++ program. If you need to change compile flags in order to run a debugger, you can do so by modifying the Makefile located at accelerators/stratus_hls/common/systemc.mk.
In case of simulation errors during the RTL simulation that do not occur in behavioral simulation, you have two main options for debugging:
- Use print statement generation during HLS: any call to printf() in the accelerator source code gets translated by Stratus HLS into a call to $display in Verilog.
Note: print statement generation affects the final scheduling and quality of result. It is therefore recommended to remove print statements from the accelerator’s specification when debugging is complete.
- Use the RTL simulator to visualize waveforms.
# Move to the HLS working folder
cd <esp>/accelerators/stratus_hls/mac_stratus/hw/hls-work-virtex7
# Set variables for simulation (for the VCU118 FPGA board TECH=virtexup)
export ESP_ROOT=../../../../.. TECH=virtex7 ACCELERATOR=mac_stratus
# Run RTL simulation with GUI support
make debug_BASIC_DMA64_V
# or: make debug_BASIC_DMA32_V
Typical issues in case of SystemC vs. RTL mismatches occur when there are non-initialized variables or when arrays that are mapped to static RAM memory are accessed differently from what’s specified in the list of memories to be generated (memlist.txt). Details about the content of the memory list will be given in the tutorial about design-space exploration with HLS (coming soon).
2. Accelerator integration
User application implementation
In this tutorial we select the RISC-V Ariane core and use the corresponding paths to the software source code. Please note, however, that all instructions are valid for a Leon3 system as well.
Both bare-metal and Linux test applications for the MAC accelerator are generated at the path <esp>/accelerators/stratus_hls/mac_stratus/sw.
To complete them, you need to apply the same edit to both bare-metal and Linux applications. The changes consist in initializing inputs and golden outputs, similarly to what’s done for the SystemC testbench.
Move to the path <esp>/accelerators/stratus_hls/mac_stratus/sw/baremetal, open mac.c, locate the init_buf() function and replace its body with the following code.
int i;
int j;
int k;
// Initialize input
for (i = 0; i < mac_n; i++)
    for (j = 0; j < mac_len * mac_vec; j++)
        in[i * in_words_adj + j] = j % mac_vec;
// Compute golden output
for (i = 0; i < mac_n; i++)
    for (j = 0; j < mac_vec; j++) {
        gold[i * out_words_adj + j] = 0;
        for (k = 0; k < mac_len; k += 2)
            gold[i * out_words_adj + j] +=
                in[i * in_words_adj + j * mac_len + k] * in[i * in_words_adj + j * mac_len + k + 1];
    }
Now move to <esp>/accelerators/stratus_hls/mac_stratus/sw/linux/app, open mac.c and replace the body of init_buffer() with the same code shown above.
Note: this code is just a port to C of the C++ code used for the SystemC testbench.
SoC configuration
The final steps of the tutorial coincide with those presented in the tutorial about designing a single core SoC. We recommend you review those steps if you are not familiar with ESP.
# Move to the Xilinx VC707 working folder
cd <esp>/socs/xilinx-vc707-xc7vx485t
Follow the “Debug link configuration” instructions from the “How to:
design a single-core SoC” guide.
Then configure the SoC using the ESP configuration GUI.
# Run the ESP configuration GUI
make esp-xconfig
Select Ariane in the “CPU Architecture” frame and disable the caches from the
“Cache configuration” frame.
Select a 2x2 layout and set 1 memory tile, 1 processor tile, 1 I/O tile and 1
MAC tile. The implementation for MAC will default to basic_dma64.
RTL simulation
Users can run a full-system RTL simulation of the MAC accelerator driven by the baremetal application running on the processor tile.
The bare-metal simulation is slow. To shorten it, you may want to reduce the default values of mac_len and mac_vec in the bare-metal C application.
# Compile baremetal application
make mac_stratus-baremetal
# Modelsim
TEST_PROGRAM=./soft-build/<cpu>/baremetal/mac_stratus.exe make sim[-gui]
# Incisive
TEST_PROGRAM=./soft-build/<cpu>/baremetal/mac_stratus.exe make ncsim[-gui]
<cpu> corresponds to ariane because we selected the Ariane core in the “SoC Configuration” step.
FPGA prototyping
Follow the “FPGA prototyping” instructions from the “How to: design a single-core SoC” guide.
The only difference is that, just like for the RTL simulation, you need to specify the TEST_PROGRAM variable when launching the bare-metal test on FPGA:
TEST_PROGRAM=./soft-build/<cpu>/baremetal/mac_stratus.exe make fpga-run
To run the Linux application, log into Linux from the ESP Linux terminal and launch the MAC test application:
$ cd /applications/test/
$ ./mac_stratus.exe
====== mac_stratus.0 ======
.mac_n = 1
.mac_vec = 100
.mac_len = 64
** START **
> Test time: 13575640 ns
- mac_stratus.0 time: 1134480 ns
** DONE **
+ Test PASSED
====== mac_stratus.0 ======
FPGA prototyping with prebuilt material
With the provided prebuilt material, you can run the tutorial on FPGA directly. Each packet is marked with the first digits of the Git revision it was created and tested with.
The packet contains the following:
- The source code, testbench and HLS scripts for the MAC accelerator (accelerators/stratus_hls/mac)
- The bare-metal test application and the Linux device driver and test application for the MAC accelerator (soft/[ariane|leon3]/drivers/mac)
- Two working folders for Xilinx VCU118 and Xilinx VC707, each including:
  - The Linux image (linux.bin)
  - The bare-metal application (mac.bin)
  - The boot loader image (prom.bin)
  - The FPGA bitstream (top.bit)
  - The hidden configuration files for the design (.grlib_config and .esp_config)
  - A script to run the design on FPGA (runme.sh)
Decompress the content of the packet from the ESP root folder to make sure all files are extracted to the right location.
cd <esp>
tar xf ESP_SystemcAcc_GitRev.ddaca94.tar.gz
Enter one of the SoC instances extracted from the packet.
cd socs/systemc_acc_vc707
Follow the “UART interface” instructions from the “How to: design a single-core SoC” guide, then launch the runme.sh script.
# Execute baremetal test
./runme.sh mac
# Boot Linux
./runme.sh
Finally, from the ESP Linux terminal, run the MAC test application.
$ cd /applications/test/
$ ./mac.exe